Sat Mar 29, 2025 12:15pm PST
Show HN: I built an open-source vision RAG that embeds images and text together
My brother and I have been working on Morphik: an open-source multimodal database. (Github: https://github.com/morphik-org/morphik-core)

After experimenting with various AI models, we realized that they were particularly bad at answering questions that required retrieval over images and other multimodal data.

That is, if I uploaded a 10-20 page PDF to ChatGPT and asked it to pull a result from a particular diagram in the PDF, it would fail and hallucinate instead. I faced the same issue with Claude (though not with Gemini).

Turns out, the issue was with how these systems ingest documents. It seems that both Claude and GPT handle larger PDFs by parsing them into text and then adding the entire thing to the chat context. That works for text-heavy documents, but it fails for queries and documents that hinge on diagrams, graphs, or infographics.

Something that can help solve this is embedding the document directly as a list of page images and retrieving over those: find the images closest to the query and feed the LLM exactly those images. This cuts down the number of tokens the LLM consumes while also letting the model apply its visual reasoning to the right pages.
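To make that concrete, here is a rough sketch of the pattern built from off-the-shelf pieces (pdf2image, a CLIP model from sentence-transformers, and a vision LLM). This is not what Morphik does internally (Morphik uses ColPali-style embeddings rather than one CLIP vector per page, and the model names below are just illustrative choices), but it shows the shape of the idea:

```
# Rough sketch only, not Morphik's implementation: rasterize pages, embed
# them as images, retrieve the pages closest to the query, and hand only
# those images to a vision LLM.
# Assumes pdf2image (with poppler), sentence-transformers, and the openai SDK.
import base64
import io

from openai import OpenAI
from pdf2image import convert_from_path
from sentence_transformers import SentenceTransformer, util

# 1. Render every PDF page as a PIL image.
pages = convert_from_path("report_with_images_and_graphs.pdf")

# 2. Embed pages and query into a shared image/text space.
#    (CLIP is a single-vector stand-in here; ColPali-style multi-vector
#    embeddings do better on fine-grained document retrieval.)
model = SentenceTransformer("clip-ViT-B-32")
page_embeddings = model.encode(pages, convert_to_tensor=True)

query = "At what time-step did we see the highest GDP growth rate?"
query_embedding = model.encode(query, convert_to_tensor=True)

# 3. Keep only the top-k pages closest to the query.
scores = util.cos_sim(query_embedding, page_embeddings)[0]
top_pages = [pages[i] for i in scores.argsort(descending=True)[:3].tolist()]

# 4. Feed exactly those page images to a vision-capable LLM.
def to_data_url(img):
    buf = io.BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": query},
            *[{"type": "image_url", "image_url": {"url": to_data_url(p)}}
              for p in top_pages],
        ],
    }],
)
print(response.choices[0].message.content)
```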

We've implemented a one-line solution that does exactly this with Morphik. You can check out the specifics in the attached blog, or get started with it through our quick start guide: https://docs.morphik.ai/getting-started

Here is an example ingestion pathway:

```
from databridge import DataBridge

db = DataBridge()

# Embed the PDF's pages as images (ColPali) instead of parsing it to text.
db.ingest_file("report_with_images_and_graphs.pdf", use_colpali=True)
```

And here is an example query pathway:

```
# Retrieve the page images closest to the query and answer from them.
db.query("At what time-step did we see the highest GDP growth rate?", use_colpali=True)
```

Would love to hear your thoughts!
