# text-extraction

Here are 590 public repositories matching this topic...

run-llama / liteparse

A fast, helpful, and open-source document parser

pdf ocr text-extraction ocr-recognition pdf-parser document-processing document-ocr

Updated Jun 2, 2026
Rust

kreuzberg-dev / kreuzberg

A polyglot document intelligence framework with a Rust core. Extract text, metadata, images, and structured information from PDFs, Office documents, images, and 97+ formats. Available for Rust, Python, Ruby, Java, Go, PHP, Elixir, C#, R, C, and TypeScript (Node/Bun/Wasm/Deno), or use via CLI, REST API, or MCP server.

Updated Jun 2, 2026
Rust

adbar / trafilatura

Python & Command-line tool to gather text and metadata on the Web: Crawling, scraping, extraction, output as CSV, JSON, HTML, MD, TXT, XML

Updated Jun 2, 2026
Python

miso-belica / sumy

Module for automatic summarization of text documents and HTML pages.

python nlp pagerank-algorithm text-extraction reduction summarization html-page summary lsa sumy textteaser summarizer html-extraction html-extractor

Updated Mar 31, 2026
Python

unidoc / unipdf

Golang PDF library for creating and processing PDF files (pure go)

golang pdf signing text-extraction pdf-generator pdf-generation pdf-reader pdf-manipulation pdf-library pdf-document-processor pdf-compression pdf-sign pdf-reports

Updated May 27, 2026
Go

chrismattmann / tika-python

Tika-Python is a Python binding to the Apache Tika™ REST services allowing Tika to be called natively in the Python community.

Updated May 27, 2026
Python

firecrawl / pdf-inspector

Fast Rust library for PDF inspection, classification, and text extraction. Intelligently detects scanned vs text-based PDFs to enable smart routing decisions.

nodejs python markdown rust pdf text-extraction pdf-parser pdf-extraction ocr-routing pdf-classification

Updated Jun 1, 2026
Rust

whitelok / image-text-localization-recognition

A general list of resources for image text localization and recognition: a collection of papers and implementations for scene text localization and recognition

machine-learning awesome ocr deep-learning text-extraction text-recognition deep-learning-algorithms convolutional-neural-networks text-detection scene-texts

Updated Sep 17, 2023

Lulzx / zpdf

Zero-copy PDF text extraction library written in Zig. High-performance, memory-mapped parsing with SIMD acceleration.

pdf parser high-performance zig zero-copy text-extraction simd zero-dependency

Updated Mar 1, 2026
Zig

yigitkonur / api-llm-ocr

PDF to markdown using vision LLMs — tables, layouts, and structure preserved

python ocr text-extraction table-extraction fastapi document-ai pdf-to-markdown vision-llm

Updated Feb 21, 2026
Python

miso-belica / jusText

Heuristic-based boilerplate removal tool

python text-extraction html-parser html-parsing

Updated Feb 25, 2025
Python

yfedoseev / pdf_oxide

The fastest PDF library for Python and Rust. Text extraction, image extraction, markdown conversion, PDF creation & editing. 0.8ms mean, 5× faster than industry leaders, 100% pass rate on 3,830 PDFs. MIT/Apache-2.0.

python markdown rust fast pdf text-extraction data-extraction pdf-generation pdf-to-text pdf-library pdf-parser document-processing rag pyo3 pdf-editor image-extraction llm pdf-to-markdown

Updated Jun 2, 2026
Rust

kreuzberg-dev / html-to-markdown

High performance and CommonMark compliant HTML to Markdown converter. Maintained by the Kreuzberg team. Kreuzberg is a fast, polyglot document intelligence engine with a Rust core. It extracts structured data from 56+ document formats using streaming parsers and built-in OCR.

html markdown text-extraction html-converter text-processing hocr markdown-converter rag

Updated Jun 2, 2026
HTML

ICIJ / datashare

A self‑hosted search engine for documents

docker elasticsearch extract text-extraction named-entity-recognition web-gui datashare investigative-journalism

Updated Jun 2, 2026
Java

unidoc / unidoc

This repository has moved! https://github.com/unidoc/unipdf

golang pdf text-extraction pdf-files pdf-invoice unidoc pdf-library

Updated May 23, 2019
Go

ropensci / pdftools

Text Extraction, Rendering and Converting of PDF Documents

r text-extraction rstats pdf-files r-package poppler pdf-format poppler-library pdftools

Updated May 13, 2026
C++

cdown / srt

A simple library and set of tools for parsing, modifying, and composing SRT files.

python library tools command-line text-extraction subtitles subtitle srt subtitles-parsing mit-license command-line-tool subtitle-parser subtitle-fixer

Updated Mar 19, 2024
Python

iamarunbrahma / vision-parse

Parse PDFs into markdown using Vision LLMs

text-extraction pdf-parser document-parser pdf-to-markdown

Updated Oct 4, 2025
Python

flairNLP / fundus

A very simple news crawler with a funny name

python nlp rss sitemap crawler scraper corpus text-extraction web-scraping image-classification datasets news-crawler corpus-tools commoncrawl web-corpus news-scraping cc-news image-extraction

Updated May 28, 2026
Python

shixzie / nlp

[UNMAINTAINED] Extract values from strings and fill your structs with nlp.

nlp go golang natural-language-processing parse text text-extraction

Updated Sep 18, 2017
Go
