This started out as a weekend hack with gpt-4o-mini, using the very basic strategy of "just ask the AI to OCR the document". It turned out to outperform our existing Unstructured/Textract implementation, at roughly the same cost.
In particular, we've seen the vision models do a great job on charts, infographics, and handwritten text. Documents are a visual format after all, so a vision model makes sense!
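The approach really is that simple. Here's a minimal sketch of the idea, using the official OpenAI chat completions API; the prompt wording and model name are illustrative, not the exact ones the package uses:

```python
import base64


def build_ocr_request(image_path: str, model: str = "gpt-4o-mini") -> dict:
    """Build a chat-completions payload asking a vision model to
    transcribe a page image into markdown."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                # The whole "OCR strategy" is just this instruction...
                {"type": "text",
                 "text": "Convert this page to markdown. Return only the markdown."},
                # ...plus the page image, inlined as a base64 data URL.
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    }


# Send it with the official OpenAI client (requires OPENAI_API_KEY):
# from openai import OpenAI
# resp = OpenAI().chat.completions.create(**build_ocr_request("page.png"))
# markdown = resp.choices[0].message.content
```

For multi-page PDFs, the same request is repeated per rendered page image and the results are concatenated.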
I posted the first experiments on HN, and since then, we've had some great contributors who have helped turn this into a full package. We have two versions now:
- pip package: https://pypi.org/project/py-zerox/
- npm package: https://www.npmjs.com/package/zerox
Next up, we're working on building an open source dataset for fine-tuning. We've seen some early success with a charts => markdown fine-tuning dataset, and we're excited to keep building.
Github: https://github.com/getomni-ai/zerox
You can try out a hosted version here: https://getomni.ai/ocr-demo