feat(parsers): pluggable document parsers — MinerU / Mistral / VLM (closes #77)#81
Open
KylinMountain wants to merge 26 commits into
"""Tests for LocalParser — preserves legacy md/pdf/markitdown behavior."""
from __future__ import annotations

from pathlib import Path

def test_result_from_zip_does_not_rewrite_links(tmp_path):
    import io, zipfile

def test_cloud_empty_extract_result_then_done(monkeypatch, tmp_path):
    import io, sys, types, zipfile

def test_full_md_basename_preferred_over_endswith(tmp_path):
    import io, zipfile

def test_image_basename_collision_warns(tmp_path, caplog):
    import io, zipfile, logging as _logging
    zf.writestr("images/fig.png", b"A")
    zf.writestr("sub/fig.png", b"B")
    with caplog.at_level(_logging.WARNING):
        result = _result_from_zip(buf.getvalue())
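
The collision case that `test_image_basename_collision_warns` sets up (two zip entries, `images/fig.png` and `sub/fig.png`, sharing one basename) can be reproduced in isolation. The sketch below is not the PR's `_result_from_zip`. `collect_images_by_basename` and `build_zip` are hypothetical helpers, and the keep-first-and-warn policy is an assumption made for illustration.

```python
import io
import logging
import posixpath
import zipfile

logger = logging.getLogger("zip_images_sketch")

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg")


def collect_images_by_basename(zip_bytes: bytes) -> dict[str, bytes]:
    """Map image basenames to bytes, warning when two entries collide.

    Hypothetical helper for illustration; not the PR's _result_from_zip.
    """
    images: dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for name in zf.namelist():
            if not name.lower().endswith(IMAGE_SUFFIXES):
                continue
            base = posixpath.basename(name)
            if base in images:
                # Assumed policy: keep the first entry, warn about later ones.
                logger.warning("image basename collision: %s (from %s)", base, name)
                continue
            images[base] = zf.read(name)
    return images


def build_zip(entries: dict[str, bytes]) -> bytes:
    """Build an in-memory zip from {archive path: data} (test convenience)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        for name, data in entries.items():
            zf.writestr(name, data)
    return buf.getvalue()
```

With the two entries from the test fragment, only the first `fig.png` is kept and a warning is logged for the second.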
Summary
Adds a pluggable parser abstraction so the file → Markdown step can be routed
through higher-accuracy online/self-hosted parsers, addressing #77 (markitdown gives
breadth but not precision on complex docs like papers). The local parser remains the
zero-config default with no new required dependencies.
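
As a rough illustration of the abstraction this summary describes, a parser interface with a name-based registry might look like the sketch below. The names `Parser`, `ParseResult`, and `get_parser` come from this PR, but the fields, method signatures, and the stand-in `LocalParser` body are assumptions, not the actual implementation.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path
from typing import Callable


@dataclass
class ParseResult:
    """Markdown plus extracted images (field names are assumptions)."""
    markdown: str
    images: dict[str, bytes] = field(default_factory=dict)


class Parser(ABC):
    """Common interface each adapter would implement."""
    name: str

    @abstractmethod
    def parse(self, path: Path) -> ParseResult:
        ...


class LocalParser(Parser):
    name = "local"

    def parse(self, path: Path) -> ParseResult:
        # Stand-in for the real pymupdf/markitdown path: read the text as-is.
        return ParseResult(markdown=path.read_text(encoding="utf-8"))


# Registry of zero-argument factories, so other adapters could defer
# importing their optional dependencies until they are requested.
_REGISTRY: dict[str, Callable[[], Parser]] = {"local": LocalParser}


def get_parser(name: str = "local") -> Parser:
    """Return a parser instance by name, with an actionable error if unknown."""
    try:
        factory = _REGISTRY[name]
    except KeyError:
        known = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"unknown parser {name!r}; expected one of: {known}") from None
    return factory()
```

Keeping `local` as the default argument mirrors the zero-config behavior described above: callers that never mention a parser get the existing path.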
- `openkb/parsers/` package: `Parser` ABC + `ParseResult`, a `get_parser` registry, and adapters:
  - `local` (default): existing pymupdf/markitdown behavior, refactored behind the abstraction (behavior-preserving).
  - `mineru`: MinerU over HTTP, either hosted cloud (submit→poll→zip) or a self-hosted server.
  - `mistral`: Mistral OCR via the official `mistralai` SDK (sync).
  - `vlm`: any vision LLM via the existing `litellm` dep (Gemini, GPT-4o, Claude, …).
- `.openkb/config.yaml` `parser:` + a `parsers:` per-provider block; per-run CLI override `openkb add --parser <name>` (validated by `click.Choice`).
- `openkb[mistral]`, `openkb[mineru]`, `openkb[parsers]`; `vlm` needs none. Adapters lazy-import and raise actionable errors when an extra/key is missing.
- `localize_images` (writes basenames, rewrites links; reuses `extract_base64_images`).

Scope / non-goals
- Long PDFs (`pageindex_threshold`, default 20) still go to PageIndex; the `parser` setting governs shorter docs and non-PDF formats.
- `parsers/vlm_client.py` is factored so #74 (Feature: Optional image understanding / vision for inline and referenced images) can plug in later.

Test plan
- `localize_images` + the converter refactor (HTTP/SDK fully mocked, no network). MinerU cloud poll+download flow covered.
- `trafilatura` url_ingest tests, unrelated).
- `examples/docs/attention-is-all-you-need.pdf` → Markdown + image extraction).

Review fixes included
Hardened from review: MinerU poll-loop guards (non-positive interval, empty result), anchored image-link rewrite, path-traversal-safe image filenames, VLM global-model fallback warning, narrowed `LLM_API_KEY` propagation (no longer defeats the Mistral key guard), and single-source-of-truth parser dispatch.
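
The two poll-loop guards named above (non-positive interval, empty result) can be sketched as follows. This is a minimal illustration, not the MinerU adapter's code: `poll_until_done`, the `fetch_status` callback, and the `"done"` / `"failed"` state values are all assumptions.

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Poll fetch_status() until it reports completion (hypothetical helper)."""
    # Guard 1: a zero or negative interval would turn the loop into a busy spin.
    if interval <= 0:
        raise ValueError(f"poll interval must be positive, got {interval}")

    deadline = clock() + timeout
    while True:
        status = fetch_status()
        # Guard 2: an empty payload is treated as "not ready yet", not as
        # completion and not as an error.
        if status:
            state = status.get("state")  # assumed field name
            if state == "done":
                return status
            if state == "failed":
                detail = status.get("error", "unknown error")
                raise RuntimeError(f"parse task failed: {detail}")
        if clock() >= deadline:
            raise TimeoutError(f"task not finished after {timeout} seconds")
        sleep(interval)
```

Injecting `sleep` and `clock` keeps such a loop testable without real waiting, in the same spirit as the fully mocked tests in the test plan.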