When testing complex PDFs (with diagrams, formulas, etc.), I found Gemini would straight-up hallucinate, while GPT-4o fell back on a parsing script: "It looks like you uploaded a PDF, but I can't directly analyze it unless I extract its contents. Do you want me to process the document and find the relevant data related to IRR (Internal Rate of Return)?" But that script only handled plain text, missing the tables and images where the actual answer was.
We're building DataBridge, an open-source parser and retriever that uses specialized multi-modal embeddings (ColPali-inspired) to solve this efficiently and cheaply. It handles PDFs, videos, DOC and TXT files, and can also take rules to extract or transform content at ingestion time. In my experience, higher-quality ingestion significantly improves retrieval accuracy for LLMs.
db.ingest_file(file="/path/file.pdf", filename="report2025", use_colpali=True, rules=[MetadataExtraction(schema=json_object)])
This single command embeds text, visual elements, and metadata into one vector space. This article explains ColPali vs. traditional parsing pipelines better than I can: https://medium.com/@shashankvats/colpali-explained-bridging-.... Or, for those more technically inclined, the paper: https://arxiv.org/abs/2407.01449
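If you just want the intuition: instead of pooling a whole page into one vector, ColPali-style models keep one embedding per image patch and score queries with "MaxSim" late interaction, as described in the paper above. A minimal numpy sketch of that scoring (random vectors and made-up shapes for illustration, not DataBridge's internals):

    import numpy as np

    def maxsim_score(query_vecs, page_vecs):
        # query_vecs: (num_query_tokens, dim); page_vecs: (num_patches, dim).
        # Each query token picks its best-matching page patch, then we sum.
        # This lets a term match a table cell or chart region directly,
        # instead of relying on a single pooled vector for the page.
        sims = query_vecs @ page_vecs.T        # (tokens, patches) similarities
        return float(sims.max(axis=1).sum())   # best patch per token, summed

    rng = np.random.default_rng(0)
    query = rng.standard_normal((8, 128))                         # 8 query tokens
    pages = [rng.standard_normal((1024, 128)) for _ in range(3)]  # 32x32 patches per page
    ranked = sorted(range(len(pages)),
                    key=lambda i: maxsim_score(query, pages[i]), reverse=True)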
We're also working on an MCP server integration so tools like Claude Desktop can directly benefit from this enhanced context.
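To make that concrete, here's a rough sketch of what the server side could look like, using FastMCP from the official MCP Python SDK. The DataBridge client and the retrieve_chunks call are placeholders, not a finalized API:

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("databridge")
    db = ...  # a connected DataBridge client, as in the ingest snippet above

    @mcp.tool()
    def search_documents(query: str, k: int = 5) -> list[str]:
        """Return the k most relevant chunks (text or visual) for a query."""
        chunks = db.retrieve_chunks(query=query, k=k, use_colpali=True)  # placeholder name
        return [c.content for c in chunks]

    if __name__ == "__main__":
        mcp.run()  # Claude Desktop connects to this over stdio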
Would love your thoughts and any feedback.