Tues Apr 1, 2025 4:28pm PST

RAG Without Vectors – PageIndex: Reasoning-Based Document Indexing

We were frustrated by vector-based RAG systems, which rely on semantic similarity and often fail on long, domain-specific documents. In those documents, much of the terminology is semantically similar from one section to the next, so similarity search struggles to pick out the exact passage a user needs. It's also difficult to incorporate expert knowledge or user preferences into an embedding-based retriever. So we started exploring a more reasoning-driven approach to RAG. Inspired by the tree search algorithm in AlphaGo, we built a reasoning-based RAG system that uses tree search to guide retrieval.

We open-sourced one of the key components: PageIndex, a hierarchical indexing system that transforms large documents (like financial reports, regulatory documents, or textbooks) into semantic trees optimized for reasoning-based RAG.

Some highlights:

- Hierarchical Structure: Organizes lengthy PDFs into LLM-friendly trees — like a smart table of contents.

- Precise Referencing: Each node includes a summary and exact physical page numbers.

- Natural Segmentation: Nodes align with document sections, preserving context — no arbitrary chunking.
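
To make the idea concrete, here is a minimal sketch of what such a tree and a retrieval walk over it could look like. It is illustrative only and is not PageIndex's actual schema or API: `Node`, `toc`, `search`, and `keyword_score` are hypothetical names, the sample report is made up, and the keyword scorer is a stand-in for the relevance judgment an LLM would make at each step. The walk is a simple greedy top-k descent, not the Monte Carlo tree search used in AlphaGo.

```python
import re
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Node:
    """One document section (hypothetical schema, not PageIndex's real format)."""
    title: str
    summary: str
    start_page: int  # first physical page of the section
    end_page: int    # last physical page, inclusive
    children: List["Node"] = field(default_factory=list)


def toc(node: Node, depth: int = 0) -> List[str]:
    """Render the tree as an indented table of contents, one line per node."""
    lines = [f"{'  ' * depth}{node.title} (pp. {node.start_page}-{node.end_page}): {node.summary}"]
    for child in node.children:
        lines.extend(toc(child, depth + 1))
    return lines


def keyword_score(query: str, node: Node) -> float:
    """Stand-in for an LLM relevance call: count query words found in title + summary."""
    words = set(re.findall(r"[a-z]+", f"{node.title} {node.summary}".lower()))
    return float(len(words & set(re.findall(r"[a-z]+", query.lower()))))


def search(root: Node, query: str,
           score: Callable[[str, Node], float], top_k: int = 1) -> List[Node]:
    """Walk down from the root, following the top_k best-scoring children of each node.

    Returns the childless nodes reached, each carrying its exact page range.
    """
    frontier, results = [root], []
    while frontier:
        next_frontier = []
        for node in frontier:
            if not node.children:
                results.append(node)
                continue
            ranked = sorted(node.children, key=lambda c: score(query, c), reverse=True)
            next_frontier.extend(ranked[:top_k])
        frontier = next_frontier
    return results


# Made-up example document tree.
report = Node("Annual Report", "Full-year results", 1, 120, [
    Node("Risk Factors", "Market, credit and regulatory risk", 10, 39),
    Node("Financial Statements", "Balance sheet, income statement, cash flow", 40, 90, [
        Node("Income Statement", "Revenue, expenses and net income", 40, 55),
        Node("Balance Sheet", "Assets, liabilities and equity", 56, 70),
    ]),
])

hits = search(report, "revenue net income", keyword_score)
for hit in hits:
    print(hit.title, hit.start_page, hit.end_page)
```

In a real system, the `score` callable would be replaced by a prompt that shows the model the query plus the candidate children's titles and summaries, which is also where expert knowledge or user preferences could be injected.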

We've used PageIndex for financial document analysis with reasoning-based RAG and have seen significant improvements in retrieval accuracy compared to vector-based systems.

Would love any feedback — especially thoughts on reasoning-based RAG, or ideas for where PageIndex could be applied!
