PDF ingestion

Feeding technical documents to local language models

2026-05-23 — 2026-06-13

Wherein the Routes from PDF to Language Model Are Mapped — Letting the Model Read the Page, or Converting It to Text — Converters Are Compared Over Equation-Dense Papers, Silent Mathematical Symbol Corruption Is Traced to Whole-Page Transcription, and a Routing Policy for Agent Deployment Is Codified.


I want my local language models to read my PDFs, analyze them, answer questions about them, and so on.

There are two ways to get there.

  1. We can let a vision-capable model look at the rendered page as an image: direct, no conversion.
  2. We can turn the PDF into text first and feed it that.

For cloud pipelines this doesn’t matter; it is all magically taken care of. For local models I tend to be more stingy with my VRAM and I want to use a pure LLM rather than waste bytes on a multimodal variant.

This is fiddlier than I’d like. In particular, since I care about getting the mathematics out of the papers I read, I want a high-fidelity converter, and the ones I’ve tried have been a mixed bag.

This is reverse LaTeX at scale: not rendering an equation but recovering it from the render.

1 The options

flowchart TD
  A[A PDF I want a local model to read] --> B{Does the client/model<br/>already handle it<br/>well enough?}
  B -->|yes| STOP[Stop — let the client do it]
  B -->|no| C{Read it once,<br/>or keep the text?}
  C -->|read once,<br/>interactively| V[Vision model reads the page<br/>— fluent, but not for<br/>maths to be quoted verbatim]
  C -->|reuse: batch,<br/>long doc, RAG index| D{What is in it?}
  D -->|born-digital<br/>text layer| MIT[markitdown]
  D -->|maths to be used| MRK[marker or MinerU]
  D -->|tables, structure,<br/>JSON for machines| DOC[Docling]
  D -->|scans, photos,<br/>handwriting| OCR[DeepSeek-OCR<br/>or Chandra]
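
The same flowchart can be read as a small decision function. The sketch below is one hypothetical encoding: the `PdfJob` fields and the `route` function are my own labels, and the precedence among the four content branches (scans first, then maths, then structure) is an assumption, since the chart draws them as parallel.

```python
from dataclasses import dataclass


@dataclass
class PdfJob:
    """Hypothetical description of one PDF and what we want from it."""
    client_handles_it: bool = False  # the frontend already copes well enough
    keep_text: bool = False          # batch, long doc, or RAG index
    is_scan: bool = False            # scans, photos, handwriting
    needs_maths: bool = False        # equations will be used, not skimmed
    needs_structure: bool = False    # tables, structure, JSON for machines


def route(job: PdfJob) -> str:
    """Return the flowchart leaf for this job (branch order is assumed)."""
    if job.client_handles_it:
        return "client"
    if not job.keep_text:
        return "vision model"
    if job.is_scan:
        return "DeepSeek-OCR or Chandra"
    if job.needs_maths:
        return "marker or MinerU"
    if job.needs_structure:
        return "Docling"
    return "markitdown"  # born-digital text layer, nothing special in it


print(route(PdfJob(keep_text=True, needs_maths=True)))
```

Putting the maths branch ahead of the structure branch reflects the bake-off result further down: an equation-dense paper with tables still goes to marker or MinerU.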

2 When the client already handles it

The easiest method is to do nothing. Where the client (or the model itself) handles the PDF well enough, this seems fine. Several frontends accept drag-and-dropped PDFs and convert them into something the model can read; where that does well enough, we stop here. The table below gives (my best understanding of) the toolchain in each case.

| Frontend | Native input | How |
|---|---|---|
| LM Studio | | Drag-drop .pdf / .docx / .txt; short docs go fully into context, long ones trigger auto-RAG |
| Jan | ✅ + tool-calling | File attachments since v0.7.4, plus MCP tool-calling can hand the PDF off to whichever extractor we wire up |
| Osaurus | ✅ text-layer | Attachments route through a document adapter registry (PDF, XLSX, PPTX, CSV) that preserves structure: the PDF adapter does text-layer extraction with page anchors and heuristic table detection, no OCR. Scans still need the osaurus-vision plugin (image OCR, no layout) or an external converter via a tool/skill. |
| MLX Studio | | 26+ agentic tools available; invoke a converter via shell. |
| DwarfStar / pi | ⚠️ partial | pi’s web UI extracts via pdfjs-dist; the CLI tries to read PDFs as UTF-8 and corrupts them (issue #204). The maintainer is leaning toward dropping built-in extraction in favour of a tool. A community pdf-reader skill is the current workaround. |

If that covers the job, we are done. If not — or if we would rather the model saw the page than its extracted text — the two manual routes follow.

3 Reading the page directly

If the model is vision-capable, the simplest pipeline has no converter in it at all. We render the page — or just screenshot it — and hand the image to the model. It sees the two-column layout, the figure, the table, the equation as we do, and for many tasks that is enough: what does this paper say, what is in this plot, summarize section 3. Qwen3.6 reads screenshots happily, and so does any vision-capable model on the Mac.

For reference, what Claude does as a multimodal model is a hybrid:

  1. Extract the text with a run-of-the-mill PDF-to-text extractor (something like markitdown).
  2. Additionally, render each page of the PDF (!).
  3. Slap the extracted text and the image for each page into the context.

The path has two advantages over conversion. It drops nothing: a chart or a hand-drawn diagram that text extraction would discard is right there in the image. And it needs no setup (assuming our system can render PDFs): there is no tool to install, no intermediate file, no second model in RAM beside the one already answering.

I am not 100% sure I trust it though; if we never see fiddly stuff like maths and tables as characters, how do we know that we got it right? It is a strong option for reading and comprehension but a weak one as a source of record.

Two practical limits favour conversion of longer documents. Page images are token-heavy, so a long PDF burns through context fast. Moreover, reading the page leaves no artifact: the transcription lives in the chat and dies with it, so anything we want to grep, diff, re-embed, or index has to be converted and kept.

4 Converting to text

So, suppose we want a greppable, re-usable text artefact made from a PDF. (If we only want to read a page or two and not keep them, there is no converter to pick — we read the page directly.) The right converter depends upon what is in that PDF, and what we want it for:

| Tool | Engine | Footprint | Best for |
|---|---|---|---|
| markitdown | text-layer only | none | born-digital prose |
| Docling | MLX (Granite-258M) | sub-GB | structure, tables, JSON |
| marker | PyTorch / MPS | 3–5 GB | maths (one-offs) |
| MinerU | MLX VLM | 16–32 GB | maths (batches) |
| DeepSeek-OCR | MLX (mlx-vlm) | small | scans (fast) |
| Chandra 2 | PyTorch (4B VLM) | heaviest | scans (accurate, 40+ langs) |

Commands and gotchas for each are under “The tools” below; the bake-off is where I ran them over equation-dense papers.

5 The bake-off

I ran the contenders over two equation-dense papers (a free-energy belief-change paper and the V-information ICLR paper; June 2026), reading the PDFs against the conversions. markitdown mangles display maths into pseudo-tables studded with (cid:…) glyph junk — expected, it is a text-layer extractor. Docling/Granite emits plausible LaTeX with token-level substitutions inside it: a dropped leading minus sign in a free-energy decomposition on one paper, and on the other 6 of 25 occurrences of \(\log\frac{1}{\delta}\) silently became \(\log\frac{1}{8}\) or \(\log\frac{1}{2}\) — the same theorem corrupted differently in two places. That is the dangerous failure mode: the output renders cleanly and reads plausibly, so nothing flags the error. marker and MinerU both came through corruption-free — every \(\delta\) intact (26/26), \underbrace annotations, side-annotated inequality chains and fraktur Rademacher complexities all faithful; MinerU additionally preserves equation numbers as \tag{…}. So the refined rule: Docling for structure and prose, marker or MinerU for any equation that will be used rather than skimmed. Scheduling differs, though: marker’s runtime is corpus-dependent (62 s on the paper whose text layer it could reuse, ~8 min on the one that triggered full OCR), while MinerU pays a fixed ~4-minute predictor load and then converted the 24-page test paper in under a minute — so MinerU amortizes over batches, marker wins one-offs.
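
Because this failure renders cleanly, it only shows up if we look for it. The sketch below is one narrow, hypothetical check for the specific substitution described above: it scans converted LaTeX for `\log\frac{1}{…}` terms whose denominator is a bare number. The regex and the function name are my own, and a numeric denominator can be perfectly legitimate, so the result is a list of candidates to compare against the PDF, not a verdict.

```python
import re

# Matches \log\frac{1}{X} and captures X, tolerating optional whitespace.
LOG_INV = re.compile(r"\\log\s*\\frac\s*\{\s*1\s*\}\s*\{\s*([^{}]+?)\s*\}")


def suspicious_log_terms(latex: str) -> list[str]:
    """Return the denominators of log(1/X) terms where X is a bare number."""
    return [d for d in LOG_INV.findall(latex) if d.isdigit()]


# Made-up sample text: one intact term, one with the delta turned into an 8.
sample = r"""
With probability $1-\delta$: $R \le \hat R + \sqrt{\log\frac{1}{\delta}/n}$.
Restated later: $R \le \hat R + \sqrt{\log\frac{1}{8}/n}$.
"""
print(suspicious_log_terms(sample))
```

A check like this is only as good as the pattern it knows about; it would not have caught the dropped leading minus sign.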

I also tested Docling’s other architecture, in case Granite’s end-to-end VLM was the wrong tool: the standard pipeline with --enrich-formula, which detects equation regions and hands each to IBM’s specialist CodeFormula model. It does fix the token swaps — 14/14 \(\delta\) intact on the appendix slice — confirming that the silent substitutions are a property of whole-page VLM transcription, not of Docling. But it is not a usable fix: 46 minutes for the 24-page paper (CPU-bound — and ~4.7 min/page on the equation-dense appendix, since per-page cost tracks formula count), and it introduces its own structural errors (the Rademacher subscript \(\mathfrak{R}_{|\mathcal{D}|}\) came out with the absolute-value bars displaced). Both Docling modes also render the predictive family \(\mathcal{V}\) as \(\nu\). So the conclusion stands — marker or MinerU, which are faster and cleaner than either Docling path.

A general vision model reading the pages directly belongs in this comparison but I ran out of enthusiasm. I would not be surprised to see it land beside Docling on the corrupting side, since it is the same whole-page operation, but YMMV.

6 The tools

6.1 markitdown

Text-layer extraction only: fast, lightweight, no OCR, no layout, nothing else to install. Most PDFs from modern journals or word processors have a text layer (i.e. we can copy-paste text out of them), and markitdown extracts that text into a plain file:

uvx --from 'markitdown[pdf]' markitdown some.pdf

Given a scan without a text layer it produces nothing — the cue to drop to an OCR engine below.
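
That cue can be checked mechanically before choosing a heavier engine. The following is a minimal sketch that classifies the extractor's output; the function name and both thresholds are illustrative guesses of mine, not tuned values, and the `(cid:N)` pattern is the glyph junk mentioned in the bake-off.

```python
import re

# Unmapped-glyph markers such as "(cid:123)" left behind by text-layer extraction.
CID_JUNK = re.compile(r"\(cid:\d+\)")


def text_layer_verdict(extracted: str, min_chars: int = 200,
                       max_cid_ratio: float = 0.01) -> str:
    """Classify extractor output as 'empty', 'garbled', or 'usable'.

    Both thresholds are hypothetical defaults chosen for illustration.
    """
    body = extracted.strip()
    if len(body) < min_chars:
        return "empty"      # probably a scan: drop to an OCR engine
    cid_hits = len(CID_JUNK.findall(body))
    words = max(len(body.split()), 1)
    if cid_hits / words > max_cid_ratio:
        return "garbled"    # text layer exists but the glyphs did not map
    return "usable"


print(text_layer_verdict(""))
print(text_layer_verdict("plain born-digital prose " * 40))
```

To use it, capture the text that `markitdown some.pdf` writes to standard output and pass it in as `extracted`.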

6.2 Docling

End-to-end document VLM (Granite-Docling-258M), MLX-native and auto-selected on a Mac; at 258M params it runs alongside a resident 22 GB chat model, and Docling’s own benchmark clocks 6.2 s/page on MLX against 102 s on Transformers (M3 Max). Docling emits DocTags, a lossless structural markup that converts to Markdown, HTML, or JSON. The whole setup is uv tool install docling, then docling --pipeline vlm --vlm-model granite_docling <file> — the first run downloads the checkpoint from Hugging Face automatically, and on Apple Silicon it picks the MLX build without being asked.

Mind the UX: it converts in complete silence and never prints the document — the output is <stem>.md written to the current directory (--output <dir> to redirect), and a conference paper took 45 seconds end-to-end on Apple Silicon, measured here. An interrupted run leaves nothing behind, so “it ran but produced no output” usually means we lost patience mid-grind; -v shows per-stage progress, and the HF_TOKEN / transformers use_fast warnings it emits along the way are noise. The output is not only LLM-feed, either: --to accepts md, json, yaml, html, html_split_page, text, and doctags, stackable in one run — the JSON/YAML is the full DoclingDocument (layout tree, bounding boxes, reading order, structured table cells; tables export to dataframes via the Python API), and --to html_split_page --show-layout renders a visual page-by-page QA view.

For LLM usage, beware that the Markdown output embeds every figure as a base64 data URL, which torches the token budget, so use --image-export-mode placeholder or referenced.
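
For Markdown that was already converted with embedded images, the same effect can be had after the fact. This is a rough sketch, assuming the images appear as ordinary Markdown image links with a `data:image/...;base64,` target; the placeholder text and the function name are my own choices, not Docling's.

```python
import re

# A Markdown image whose target is an inline base64 data URL.
DATA_URL_IMAGE = re.compile(
    r"!\[([^\]]*)\]\(data:image/[^;)]+;base64,[A-Za-z0-9+/=\s]+\)"
)


def placeholder_images(markdown: str) -> tuple[str, int]:
    """Swap embedded base64 images for a short comment.

    Returns the cleaned text and the number of characters saved.
    """
    cleaned = DATA_URL_IMAGE.sub(
        lambda m: f"<!-- image: {m.group(1) or 'figure'} -->", markdown
    )
    return cleaned, len(markdown) - len(cleaned)


# Stand-in document with one fake 4000-character payload.
doc = "Intro.\n\n![Figure 1](data:image/png;base64," + "A" * 4000 + ")\n\nOutro."
cleaned, saved = placeholder_images(doc)
print(cleaned)
print(f"characters saved: {saved}")
```

Setting the flag at conversion time is still the better route, since it avoids writing the payload in the first place.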

Its maths is gist-grade only — for an equation that will be used rather than skimmed, see the bake-off and route to marker or MinerU instead.

On the Docling naming, because it confused me: VlmPipeline is Docling’s end-to-end-VLM mode (the default mode is an ensemble of layout, table, and OCR models); Granite-Docling-258M is the model IBM recommends inside it — successor to the SmolDocling preview.

6.3 marker

A pipeline of specialist surya models — layout, OCR, maths-to-LaTeX, tables. Install with uv tool install marker-pdf (or one-off via uvx --from marker-pdf marker_single ./doc.pdf); the first run pulls the surya model weights into the Hugging Face cache automatically — no separate weights step. Note marker never touches MLX — it is PyTorch end-to-end, and its Apple accelerator is MPS, which the README says is auto-detected. Budget ~3–5 GB of memory per worker. For LLM-assisted refinement — cross-page table merging, inline maths, form extraction — point --use_llm at the local endpoint we already run, instead of its default Gemini:

marker_single ./doc.pdf --use_llm \
  --llm_service marker.services.openai.OpenAIService \
  --openai_base_url http://127.0.0.1:1337/v1 \
  --openai_model qwen3.6-35b-a3b-mxfp4 \
  --openai_api_key unused

The above reuses the Osaurus-resident daily driver for the refinement calls — one backend, one model in RAM — and Qwen3.6 being vision-capable matters, because some refinement passes send images.

The alternative is a second backend on Ollama with a deliberately small vision-capable model that OLLAMA_KEEP_ALIVE evicts after the batch — --llm_service marker.services.ollama.OllamaService --ollama_base_url http://localhost:11434 --ollama_model qwen3.5:9b. Qwen3.5, not 3.6, on purpose: the 3.6 registry entry ships no small tags (27b/35b only), while qwen3.5:9b is 6.6 GB with image input; note also that Ollama’s qwen3.6:*-mlx builds are text-only. Either way the whole pipeline stays offline.
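
When the conversion is driven from a script, the two wirings above differ only in a handful of arguments. A small sketch, using only the flags and values quoted above; the `marker_command` helper and its `backend` parameter are hypothetical names of mine.

```python
import shlex


def marker_command(pdf: str, backend: str = "openai") -> list[str]:
    """Build the marker_single argument list for one of the two backends above."""
    base = ["marker_single", pdf, "--use_llm"]
    if backend == "openai":
        # The OpenAI-compatible endpoint already hosting the daily driver.
        return base + [
            "--llm_service", "marker.services.openai.OpenAIService",
            "--openai_base_url", "http://127.0.0.1:1337/v1",
            "--openai_model", "qwen3.6-35b-a3b-mxfp4",
            "--openai_api_key", "unused",
        ]
    if backend == "ollama":
        # A second, deliberately small vision-capable model on Ollama.
        return base + [
            "--llm_service", "marker.services.ollama.OllamaService",
            "--ollama_base_url", "http://localhost:11434",
            "--ollama_model", "qwen3.5:9b",
        ]
    raise ValueError(f"unknown backend: {backend!r}")


cmd = marker_command("./doc.pdf", backend="ollama")
print(shlex.join(cmd))
# To actually run it: subprocess.run(cmd, check=True)
```

Keeping the command as a list rather than a string sidesteps shell quoting when paths contain spaces.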

6.4 MinerU

Install with uv tool install 'mineru[all]'. On Apple Silicon the 3.3 hybrid-engine backend auto-selects the MLX VLM engine. Budget the one-off ~4-minute predictor load per session and 16–32 GB of RAM; after that it is fast, converting the 24-page test paper in under a minute, so it is better to convert in batches than to tie up all that memory for a single document. Formulas come out as high-fidelity LaTeX, including niceties like \tag{…} equation numbers; tables come out as HTML; and OCR covers a notional 109 languages. It ships its own MCP server, which is helpful for the agent wiring below.

6.5 DeepSeek-OCR and Chandra

For scans, photos, and handwriting there is no text layer to extract, so the job is pure OCR. DeepSeek-OCR is the fast pick and has an 8-bit community port that runs under mlx-vlm (uv pip install mlx-vlm). Note that it is fragile, going crazy if we get the prompt even slightly wrong. Chandra 2 is marker’s big sibling: a 4B OCR VLM, the benchmark leader on scans, handwriting, forms, and 40+ languages, and the heaviest option here. Neither went through the bake-off, so this is the makers’ positioning, not my own measurement.

7 For agents — the pdf-ingest skill

When the reader is an agent rather than me, the routing above has to be packaged where the agent can load it. The mechanics that decide the design: in the agentskills model, every skill’s one-line description sits permanently in the system prompt and the body loads on trigger — so per-engine skills route at trigger time (the model picks from one-liners, the least reliable dispatch there is), while a single skill routes at instruction time (one trigger, then the body walks the model through the decision). The routing policy is the hard-won part — probe the text layer first, placeholder the images, convert once and read selectively, point refinement calls at the host’s own idle endpoint — so it gets the monolithic treatment, while each engine’s flags and gotchas live in per-engine files the skill loads on demand:

.claude/skills/pdf-ingest/
  SKILL.md          # trigger + routing policy + budget disciplines
  engines/
    markitdown.md   # instant text-layer extraction
    docling.md      # Granite VLM, MLX, placeholder images
    marker.md       # maths→LaTeX, MPS, --use_llm wiring
    mineru.md       # the accuracy-maximalist escalation

That directory is published at danmackinlay/pdf-ingest-skill — projected by git subtree out of this site’s (private) source repo, where it is actually edited and tested, so the public copy cannot fork, only lag. Install: Claude Code, git clone https://github.com/danmackinlay/pdf-ingest-skill ~/.claude/skills/pdf-ingest (or per-project under .claude/skills/); Hermes, the same clone into ~/.hermes/skills/; pi, pi install git:github.com/danmackinlay/pdf-ingest-skill. One key fact makes the wiring safe: during tool execution the harness’s model is idle — the agent loop is generate → dispatch tool → wait — so a converter that wants LLM refinement can call back into the same local endpoint that hosts the agent, with no contention and no second model in RAM.
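
Of the routing disciplines listed earlier, "convert once and read selectively" is the easiest to show in code. The sketch below is not how the published skill does it (I am only illustrating the idea): it keys the converted Markdown on a hash of the PDF bytes so that a second request for the same file never reruns the converter. `convert_once`, the cache layout, and the stand-in converter are all hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import Callable


def convert_once(pdf: Path, cache_dir: Path,
                 convert: Callable[[Path], str]) -> Path:
    """Return the cached Markdown for `pdf`, converting only on a cache miss.

    `convert` is any function that turns a PDF path into Markdown text.
    """
    digest = hashlib.sha256(pdf.read_bytes()).hexdigest()[:16]
    out = cache_dir / f"{pdf.stem}-{digest}.md"
    if not out.exists():
        cache_dir.mkdir(parents=True, exist_ok=True)
        out.write_text(convert(pdf), encoding="utf-8")
    return out


# Demo with a stand-in converter that records how often it is called.
calls = []


def fake_convert(path: Path) -> str:
    calls.append(path)
    return "# converted\n"


with tempfile.TemporaryDirectory() as tmp:
    pdf = Path(tmp) / "paper.pdf"
    pdf.write_bytes(b"%PDF-1.7 stand-in bytes")
    first = convert_once(pdf, Path(tmp) / "cache", fake_convert)
    second = convert_once(pdf, Path(tmp) / "cache", fake_convert)
    print(first == second, len(calls))
```

Hashing the content rather than the filename means a renamed copy still hits the cache, while a revised PDF with the same name does not.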

Osaurus is the awkward middle case, because it plays two roles at once: an agent harness that reads agentskills.io skills natively, and an LLM server listening on localhost:1337 — and this skill fits one role much better than the other. As a harness it can import the skill (Management window → Skills → Import), though its GitHub importer looks for a .claude-plugin/marketplace.json manifest this repo doesn’t ship, so the import path is a local file or zip. The deeper mismatch is execution: Osaurus agents only get a shell inside an isolated Alpine Linux VM (macOS 26+, Apple’s Containerization framework), where Docling loses MLX and marker loses MPS — the recipes still run, just CPU-only, forfeiting the Apple Silicon tuning that justifies half the routing. Better wiring for now uses the server role: run the skill from a harness with a native macOS shell — Claude Code, Hermes, pi — pointed at Osaurus on localhost:1337.

Jan speaks MCP Connectors rather than skills; for it, the official docling-mcp server (LF AI & Data, MIT) covers the Docling path — with one trap: v2.0 defaults to remote conversion against a Docling Serve API, so offline use needs the [local] extra and DOCLING_CONVERSION_MODE=local, or it quietly phones out. MinerU ships its own MCP server too. Nothing pre-existing encodes the routing policy — Osaurus’s plugin registry has no PDF-converter entry at all (checked June 2026) — which is exactly why the skill exists.