We open-sourced one of the key components: PageIndex, a hierarchical indexing system that transforms large documents (like financial reports, regulatory documents, or textbooks) into semantic trees optimized for reasoning-based RAG.
Some highlights:
- Hierarchical Structure: Organizes lengthy PDFs into LLM-friendly trees, like a smart table of contents.
- Precise Referencing: Each node includes a summary and exact physical page numbers.
- Natural Segmentation: Nodes align with document sections, preserving context, with no arbitrary chunking.
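
To make the highlights above concrete, here is a minimal sketch of what such a tree might look like in Python. This is not PageIndex's actual schema or API: the `Node` class, its field names, and the `outline` and `find_section` helpers are all hypothetical, and the sample report is made up for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Node:
    """Hypothetical tree node: one document section (not PageIndex's real schema)."""
    title: str
    summary: str
    start_page: int  # first physical page of the section, inclusive
    end_page: int    # last physical page of the section, inclusive
    children: list[Node] = field(default_factory=list)


def outline(node: Node, depth: int = 0) -> Iterator[str]:
    """Yield an indented, table-of-contents-style line per node (pre-order)."""
    indent = "  " * depth
    yield f"{indent}{node.title} (pp. {node.start_page}-{node.end_page}): {node.summary}"
    for child in node.children:
        yield from outline(child, depth + 1)


def find_section(node: Node, page: int) -> Node | None:
    """Return the deepest node whose page range contains `page`, or None."""
    if not (node.start_page <= page <= node.end_page):
        return None
    for child in node.children:
        hit = find_section(child, page)
        if hit is not None:
            return hit
    return node


# Made-up example document, purely for illustration.
report = Node("Annual Report", "Made-up example document.", 1, 120, [
    Node("Risk Factors", "Overview of risks.", 10, 34, [
        Node("Market Risk", "Exposure to price changes.", 10, 18),
        Node("Credit Risk", "Exposure to counterparty default.", 19, 34),
    ]),
    Node("Financial Statements", "Balance sheet and notes.", 35, 120),
])

if __name__ == "__main__":
    print("\n".join(outline(report)))
    print(find_section(report, 20).title)
```

Because each node carries its own summary and page range, an LLM can be shown the outline, pick a section by reasoning over the summaries, and then be handed exactly the pages that section spans.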
We've used PageIndex for financial document analysis with reasoning-based RAG and have seen significant improvements in retrieval accuracy compared to vector-based systems.
Would love any feedback, especially thoughts on reasoning-based RAG or ideas for where PageIndex could be applied!