Kreuzberg is an MIT-licensed Python library that extracts text from a wide range of documents (PDFs, images, office files, etc.) without depending on external APIs.
It differs from other libraries and commercial offerings in this space by being designed to be (1) lightweight, (2) CPU-oriented, (3) simple to use, and (4) async-first, with async support as a first-class citizen.
The v3.0 release completely reworks the architecture for extensibility. Kreuzberg now supports:
- Multiple OCR backends (Tesseract, PaddleOCR, EasyOCR), with OCR itself being completely optional.
- Custom extractors and overriding of built-in extractors.
- Post-processing and validation hooks.
- Extensive PDF metadata extraction.
- Optional semantic chunking.
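
To give a feel for what "custom extractors" and "post-processing and validation hooks" mean in practice, here is a minimal, self-contained sketch of the general pattern. This is not Kreuzberg's actual API: `ExtractorRegistry`, `register`, `add_hook`, and `require_non_empty` are hypothetical names made up for illustration, so check the documentation for the real interfaces.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical types for illustration only; not Kreuzberg's real API.
Extractor = Callable[[bytes], str]  # raw bytes in, extracted text out
Hook = Callable[[str], str]         # text in, processed/validated text out


@dataclass
class ExtractorRegistry:
    extractors: dict[str, Extractor] = field(default_factory=dict)
    hooks: list[Hook] = field(default_factory=list)

    def register(self, mime_type: str, extractor: Extractor) -> None:
        # Registering an already-known MIME type overrides the earlier extractor.
        self.extractors[mime_type] = extractor

    def add_hook(self, hook: Hook) -> None:
        self.hooks.append(hook)

    def extract(self, mime_type: str, data: bytes) -> str:
        if mime_type not in self.extractors:
            raise ValueError(f"no extractor registered for {mime_type!r}")
        text = self.extractors[mime_type](data)
        # Hooks run in registration order; each may transform or reject the text.
        for hook in self.hooks:
            text = hook(text)
        return text


def require_non_empty(text: str) -> str:
    # Validation hook: fail loudly instead of returning an empty result.
    if not text.strip():
        raise ValueError("extraction produced no text")
    return text


registry = ExtractorRegistry()
registry.register("text/plain", lambda data: data.decode("utf-8"))
# Override the "built-in" plain-text extractor with a custom one.
registry.register("text/plain", lambda data: data.decode("utf-8").upper())
registry.add_hook(str.strip)          # post-processing hook
registry.add_hook(require_non_empty)  # validation hook

result = registry.extract("text/plain", b"  hello  ")
print(result)
```

The point of the design is that overriding an extractor and adding a hook are both plain registrations, so the extraction pipeline itself never needs to change.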
There is also a brand new documentation site at https://goldziher.github.io/kreuzberg.
I also published a roadmap for the project, which you can see here: https://github.com/Goldziher/kreuzberg/discussions/24
The repo is at https://github.com/Goldziher/kreuzberg. Please star it if you find it valuable, since this motivates me!