A fast, helpful, and open-source document parser
Updated Jun 2, 2026 - Rust
A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use it via the CLI, REST API, or MCP server.
Python package and command-line tool to gather text and metadata on the Web: crawling, scraping, extraction, and output as CSV, JSON, HTML, MD, TXT, or XML
Module for automatic summarization of text documents and HTML pages.
Go PDF library for creating and processing PDF files (pure Go)
Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.
Fast Rust library for PDF inspection, classification, and text extraction. Intelligently detects scanned vs text-based PDFs to enable smart routing decisions.
A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition
Zero-copy PDF text extraction library written in Zig. High-performance, memory-mapped parsing with SIMD acceleration.
PDF to markdown using vision LLMs — tables, layouts, and structure preserved
Heuristic-based boilerplate removal tool
The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.
High performance and CommonMark compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.
A self‑hosted search engine for documents
This repository has moved! https://github.com/unidoc/unipdf
Text Extraction, Rendering and Converting of PDF Documents
A simple library and set of tools for parsing, modifying, and composing SRT files.
Parse PDFs into markdown using Vision LLMs
A very simple news crawler with a funny name
[UNMAINTAINED] Extract values from strings and fill your structs with NLP.