Tues Sep 10, 2024 2:07pm PST
Show HN: RAG Large Data Pipeline with 1000 Models
Hi, I'm a PhD in data + LLMs. I'm building an LLM chatbot for dbt.

The challenge isn't LLMs but dbt pipelines, which are too large (e.g., >1K models) to fit in the context window.

Traditional vector RAG works well for prose but poorly for SQL.

To solve this, we built a novel RAG approach based on data lineage.

I tested it on dbt projects with 1000+ models, and it works very well.
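To make the lineage idea concrete, here is a minimal sketch of lineage-aware retrieval, assuming a toy DAG and plain-Python BFS; the model names and the `lineage_neighborhood` helper are illustrative, not the project's actual API. The point is that only a small subgraph around a seed model needs to go into the LLM context, not all 1000+ models.

```python
# Hypothetical sketch of lineage-aware retrieval over a dbt DAG.
# The toy graph and function names are illustrative only.
from collections import deque

# parent -> children edges (downstream direction)
DOWNSTREAM = {
    "stg_orders": ["int_orders", "fct_orders"],
    "stg_customers": ["int_orders"],
    "int_orders": ["fct_orders"],
    "fct_orders": [],
}

# invert to get upstream edges
UPSTREAM = {m: [] for m in DOWNSTREAM}
for parent, children in DOWNSTREAM.items():
    for child in children:
        UPSTREAM[child].append(parent)

def lineage_neighborhood(seed, hops=2):
    """Collect all models within `hops` of the seed, walking both
    upstream and downstream. Only this subgraph's SQL and docs are
    put into the LLM context window."""
    seen = {seed}
    frontier = deque([(seed, 0)])
    while frontier:
        model, dist = frontier.popleft()
        if dist == hops:
            continue
        for nxt in DOWNSTREAM[model] + UPSTREAM[model]:
            if nxt not in seen:
                seen.add(nxt)
                frontier.append((nxt, dist + 1))
    return seen

print(sorted(lineage_neighborhood("int_orders", hops=1)))
# -> ['fct_orders', 'int_orders', 'stg_customers', 'stg_orders']
```

This directly supports the use cases below: downstream hops answer "what breaks if I edit this?", upstream hops answer "how is this column computed?".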

Some cool use cases this chatbot handles well:

- Model discovery (I have a high-level question; which tables should I use?)

- Safe model edits (I want to edit a model; which downstream models are affected?)

- Model debugging (This column looks wrong; how is it computed upstream?)

- Prototyping new pipelines (I want to add a new metric; how are similar metrics computed?)

- Natural language querying (I want to understand customers better; recommend some queries)

- Pipeline optimization (This model is slow; is there any inefficiency in the pipeline?)

- etc.

Live demo of RAG on the Shopify dbt project (built by Fivetran): https://cocoon-data-transformation.github.io/page/pipeline

Enter your question, and it will generate a response live (refresh the page for the latest messages).

Video Demo: https://www.youtube.com/watch?v=kv5mwTkpfY0

Notebook to RAG your dbt: https://colab.research.google.com/github/Cocoon-Data-Transfo...

You'll need to provide an LLM API key (Claude 3.5 strongly recommended) and a dbt project (only target/manifest.json is needed).
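Since only target/manifest.json is needed, here is a hedged sketch of how the lineage edges can be pulled out of that file; the `model_edges` function is a hypothetical helper, not the project's API, but the `nodes` / `depends_on` layout it reads matches dbt's manifest schema.

```python
# Sketch: extract model-to-model lineage edges from dbt's manifest.
# `model_edges` is an illustrative helper, not the project's actual API.
import json

def model_edges(manifest_path="target/manifest.json"):
    with open(manifest_path) as f:
        manifest = json.load(f)
    edges = []
    for unique_id, node in manifest["nodes"].items():
        if node.get("resource_type") != "model":
            continue
        # each model lists its parents under depends_on.nodes
        for parent in node["depends_on"]["nodes"]:
            edges.append((parent, unique_id))
    return edges
```

The resulting edge list is exactly the DAG that lineage-based retrieval walks, so no database connection is needed to build the index.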

The project is open source: https://github.com/Cocoon-Data-Transformation/cocoon
