The challenge isn't the LLM itself but dbt pipelines: large projects (e.g., >1K models) are too big to fit in the context window.
Traditional vector RAG works well for natural-language text but poorly for SQL.
To solve this, we built a novel RAG that retrieves context along the dbt lineage graph.
I tested it on dbt projects with 1000+ models, and it works very well.
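To give a flavor of how lineage retrieval can work, here's a minimal sketch (not Cocoon's actual implementation). It assumes networkx and the standard manifest.json layout, where child_map maps each node's unique_id to its direct children:

    import json
    import networkx as nx

    # dbt's manifest.json ships a precomputed child_map:
    # unique_id -> list of direct downstream unique_ids.
    with open("target/manifest.json") as f:
        manifest = json.load(f)

    graph = nx.DiGraph()
    for parent, children in manifest["child_map"].items():
        graph.add_node(parent)
        for child in children:
            graph.add_edge(parent, child)

    def lineage_context(model_id, hops=2):
        """Retrieve the models within `hops` lineage edges of model_id
        (in either direction) to hand to the LLM as context."""
        return set(nx.ego_graph(graph, model_id, radius=hops, undirected=True).nodes)

Instead of ranking chunks by embedding similarity, retrieval walks the DAG, so the context the LLM sees is exactly the models that feed into or depend on the one in question.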
Some use cases such a chatbot handles well:
- Model discovery (I have a high-level question; which tables should I use?)
- Safe model edits (I want to edit a model; which downstream models are affected? See the traversal sketch after this list)
- Model debugging (This column looks wrong; how is it computed upstream?)
- New pipeline prototyping (I want to add a new metric; how are similar metrics computed?)
- Natural language querying (I want to understand customers better; recommend some queries)
- Pipeline optimization (This model is slow; is there any inefficiency in the pipeline?)
- etc.
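For the safe-edit and debugging cases above, the lineage graph reduces the question to a plain traversal. Reusing the graph from the sketch above (the model id here is made up):

    # Safe edits: which downstream models break if I change this one?
    affected = nx.descendants(graph, "model.shopify.shopify__orders")

    # Debugging: which upstream models feed this one's columns?
    upstream = nx.ancestors(graph, "model.shopify.shopify__orders")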
Live demo of the RAG chatbot on the Shopify dbt project (built by Fivetran): https://cocoon-data-transformation.github.io/page/pipeline
Enter your question, and it will generate a response live (refresh the page for the latest messages).
Video Demo: https://www.youtube.com/watch?v=kv5mwTkpfY0
Notebook to RAG your own dbt project: https://colab.research.google.com/github/Cocoon-Data-Transfo...
You'll need to provide an LLM API key (Claude 3.5 strongly recommended) and a dbt project (only target/manifest.json is needed).
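For reference, wiring the retrieved context into Claude looks roughly like this (a sketch, not the notebook's exact code; the model id and question are illustrative, and raw_code is where recent manifest versions store a model's SQL):

    import anthropic

    client = anthropic.Anthropic()  # expects ANTHROPIC_API_KEY in the environment

    # Concatenate the SQL of the retrieved models as context.
    context = "\n\n".join(
        manifest["nodes"][mid].get("raw_code", "")
        for mid in lineage_context("model.shopify.shopify__orders")
        if mid in manifest["nodes"]
    )

    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": f"Given these dbt models:\n\n{context}\n\n"
                       "Which tables should I use to analyze customers?",
        }],
    )
    print(response.content[0].text)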
The project is open source: https://github.com/Cocoon-Data-Transformation/cocoon