
feat(parsers): pluggable document parsers — MinerU / Mistral / VLM (closes #77)#81

Open
KylinMountain wants to merge 26 commits into main from feat/online-parsers

@KylinMountain
Collaborator

Summary

Adds a pluggable parser abstraction so the file → Markdown step can be routed
through higher-accuracy online/self-hosted parsers, addressing #77 (markitdown gives
breadth but not precision on complex docs like papers). The local parser remains the
zero-config default with no new required dependencies.

  • New openkb/parsers/ package: Parser ABC + ParseResult, a get_parser registry, and adapters:
    • local (default) — existing pymupdf/markitdown behavior, refactored behind the abstraction (behavior-preserving).
    • mineru — MinerU over HTTP: hosted cloud (submit→poll→zip) or self-hosted server.
    • mistral — Mistral OCR via the official mistralai SDK (sync).
    • vlm — any vision LLM via the existing litellm dep (Gemini, GPT-4o, Claude, …).
  • Selection via the parser: key in .openkb/config.yaml plus a per-provider parsers: block; a per-run CLI override, openkb add --parser <name>, is validated by click.Choice.
  • Optional deps via pip extras: openkb[mistral], openkb[mineru], openkb[parsers]; vlm needs none. Adapters lazy-import and raise actionable errors when an extra/key is missing.
  • Image handling unified via localize_images (writes basenames, rewrites links; reuses extract_base64_images).
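
To make the shape of the abstraction concrete, here is a minimal sketch of how a Parser ABC, a ParseResult, and a get_parser registry could fit together. This is not the code in openkb/parsers/: the field names on ParseResult, the `_REGISTRY` dict, the `parse` method name, and the stand-in adapter bodies are all assumptions made for illustration. Only the names Parser, ParseResult, get_parser, and the openkb[mistral] extra come from the description above.

```python
from __future__ import annotations

import importlib.util
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class ParseResult:
    """Markdown plus extracted images (field names are assumed)."""
    markdown: str
    images: dict[str, bytes] = field(default_factory=dict)


class Parser(ABC):
    """Interface each adapter implements (method name is assumed)."""
    name: str

    @abstractmethod
    def parse(self, path: Path) -> ParseResult:
        ...


class LocalParser(Parser):
    name = "local"

    def parse(self, path: Path) -> ParseResult:
        # Stand-in for the real pymupdf/markitdown path: read the file as text.
        return ParseResult(markdown=path.read_text(encoding="utf-8"))


class MistralParser(Parser):
    name = "mistral"

    def __init__(self) -> None:
        # Check for the optional dependency only when this adapter is selected,
        # and fail with a message that says how to fix it.
        if importlib.util.find_spec("mistralai") is None:
            raise RuntimeError(
                "The 'mistral' parser needs an optional extra: "
                "pip install 'openkb[mistral]'"
            )

    def parse(self, path: Path) -> ParseResult:
        raise NotImplementedError("OCR call omitted in this sketch")


# Hypothetical registry: one mapping is the single source of truth for dispatch.
_REGISTRY: dict[str, type[Parser]] = {
    "local": LocalParser,
    "mistral": MistralParser,
}


def get_parser(name: str = "local") -> Parser:
    """Instantiate the adapter registered under `name`."""
    try:
        cls = _REGISTRY[name]
    except KeyError:
        choices = ", ".join(sorted(_REGISTRY))
        raise ValueError(f"Unknown parser {name!r}; choose from: {choices}") from None
    return cls()
```

Keeping the dependency check in the adapter constructor means the default local path never imports an optional package, which is one way to keep the zero-config default free of new required dependencies.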

Scope / non-goals

Test plan

  • Unit tests for every adapter + registry + localize_images + the converter refactor (HTTP/SDK fully mocked — no network). MinerU cloud poll+download flow covered.
  • Full suite green (616 passed; the only repo-wide failures are pre-existing trafilatura url_ingest tests, unrelated).
  • Real-PDF smoke test of the local path (examples/docs/attention-is-all-you-need.pdf → Markdown + image extraction).
  • Real online-parser e2e (MinerU/Mistral/Gemini) — requires API keys; not run in CI.

Review fixes included

Hardened from review: MinerU poll-loop guards (non-positive interval, empty result), anchored image-link rewrite, path-traversal-safe image filenames, VLM global-model fallback warning, narrowed LLM_API_KEY propagation (no longer defeats the Mistral key guard), and single-source-of-truth parser dispatch.
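
For reviewers who want a picture of what the poll-loop guards amount to, here is a rough sketch under stated assumptions. The function name `poll_until_done`, the injected `fetch_status` callable, the `timeout` parameter, and the response keys `extract_result` and `state` are hypothetical and chosen for illustration; the adapter in this PR may structure the loop differently.

```python
from __future__ import annotations

import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Poll a job until it reports completion (response shape is assumed)."""
    # Guard 1: a non-positive interval would turn the loop into a busy spin.
    if interval <= 0:
        raise ValueError(f"poll interval must be positive, got {interval!r}")

    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        # Guard 2: a missing or empty result list means "not ready yet",
        # so avoid indexing into it and keep polling instead.
        results = status.get("extract_result") or []
        if results:
            first = results[0]
            state = first.get("state")
            if state == "done":
                return first
            if state == "failed":
                raise RuntimeError(f"parse job failed: {first!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"parse job not finished after {timeout} s")
        sleep(interval)
```

Passing `sleep` in as a parameter lets a unit test drive the loop with canned responses and no real waiting, which fits the fully mocked test plan above.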

"""Tests for LocalParser — preserves legacy md/pdf/markitdown behavior."""
from __future__ import annotations

from pathlib import Path
Comment thread tests/test_parsers_mineru.py Fixed


def test_result_from_zip_does_not_rewrite_links(tmp_path):
    import io, zipfile


def test_cloud_empty_extract_result_then_done(monkeypatch, tmp_path):
    import io, sys, types, zipfile


def test_full_md_basename_preferred_over_endswith(tmp_path):
    import io, zipfile


def test_image_basename_collision_warns(tmp_path, caplog):
    import io, zipfile, logging as _logging
    zf.writestr("images/fig.png", b"A")
    zf.writestr("sub/fig.png", b"B")
    with caplog.at_level(_logging.WARNING):
        result = _result_from_zip(buf.getvalue())
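
The collision test above relies on image paths from the archive being flattened to basenames. Here is a minimal sketch of that idea, assuming a hypothetical helper `collect_images` (it is not `_result_from_zip`), an assumed set of image suffixes, and an assumed keep-the-first policy on collision. The real function may resolve collisions differently.

```python
from __future__ import annotations

import io
import logging
import zipfile
from pathlib import PurePosixPath

logger = logging.getLogger(__name__)

# Assumed set of image extensions, for illustration only.
IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".gif", ".webp"}


def collect_images(zip_bytes: bytes) -> dict[str, bytes]:
    """Map flattened image basenames to their bytes (hypothetical helper)."""
    images: dict[str, bytes] = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue
            member = PurePosixPath(info.filename)
            if member.suffix.lower() not in IMAGE_SUFFIXES:
                continue
            # Keeping only the final path component drops directories and any
            # "../" segments, so the name cannot escape the output folder.
            name = member.name
            if name in images:
                # Two archive members flatten to the same basename:
                # warn and keep the first (an assumed policy).
                logger.warning(
                    "image basename collision for %r; keeping the first",
                    info.filename,
                )
                continue
            images[name] = zf.read(info)
    return images
```

With an archive containing images/fig.png and sub/fig.png, this sketch keeps one fig.png entry and logs a warning for the other, which is the kind of behavior a caplog-based test can assert on.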