I wanted to demonstrate an easy-to-use API for Cache-Augmented Generation (CAG). For any open-source LLM supported by llama.cpp, we can store the KV-cache and model state after it has processed a large corpus of documents, and then load that state back in every time we query those documents.
Since the corpus only has to be processed once, this leads to a drastic reduction in per-query latency, as well as in the compute/energy used by the model.
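To make the mechanics concrete, here's a minimal sketch of the idea using the llama-cpp-python bindings. This is not DataBridge's actual API; the model path, corpus file, and prompt are placeholders. It evals the corpus once, pickles the result of save_state(), and on later queries calls load_state() so only the question tokens need a forward pass:

    import pickle
    from llama_cpp import Llama

    # Hypothetical paths -- substitute your own model and corpus.
    MODEL_PATH = "models/llama-3-8b-instruct.Q4_K_M.gguf"
    STATE_PATH = "corpus_state.pkl"

    llm = Llama(model_path=MODEL_PATH, n_ctx=8192, verbose=False)

    # --- One-time ingestion (the corpus must fit within n_ctx) ---
    corpus_tokens = llm.tokenize(open("corpus.txt", "rb").read())
    llm.eval(corpus_tokens)  # forward pass populates the KV-cache

    # Snapshot the KV-cache + token history to disk.
    with open(STATE_PATH, "wb") as f:
        pickle.dump(llm.save_state(), f)

    # --- Per query: restore the snapshot instead of re-reading the corpus ---
    with open(STATE_PATH, "rb") as f:
        llm.load_state(pickle.load(f))

    question = llm.tokenize(b"\n\nQ: What does the corpus say about latency?\nA:",
                            add_bos=False)
    out = []
    # reset=False continues generation from the restored cached state.
    for tok in llm.generate(question, temp=0.2, reset=False):
        if tok == llm.token_eos() or len(out) >= 256:
            break
        out.append(tok)
    print(llm.detokenize(out).decode("utf-8", errors="ignore"))

If you'd rather stay at the C level, llama.cpp exposes the same idea natively via llama_state_save_file / llama_state_load_file (and the --prompt-cache flag on its CLI), which avoids the pickle round-trip.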
This demo is part of a larger system I'm building called DataBridge [0], with a focus on implementing new and useful techniques for knowledge retrieval - letting developers use the latest research in production.
I'd love to hear your feedback on DataBridge and the CAG feature. If there are papers or particular techniques you'd like to see implemented, please let me know :)