PDF ingestion
Feeding technical documents to local language models
2026-05-23 — 2026-06-15
Wherein the Conversion of PDF Documents to Text Is Examined, and Whole-Page Vision Models Are Found to Corrupt Mathematical Notation Silently, With Specialist Pipeline Converters Offered in Contrast.
I want my local language models to read my PDFs, analyse them, answer questions about them, and so on.
There are two ways to get there.
- We can let a vision-capable model look at the rendered page as an image — direct, no conversion.
- We can turn the PDF into text first and feed it that.
For cloud pipelines this doesn’t matter; it is all magically taken care of. For local models I tend to be more stingy with my VRAM and I may want to use a pure LLM for my reasoning rather than waste bytes on a multimodal variant.
In that case it may be fiddlier than I’d like. In particular, since I care about getting the mathematics out of the papers I read, I want a high-fidelity converter, and not all converters are that.
This is the Reverse LaTeX problem at scale: not rendering an equation but recovering it from the render.
1 Decision tree
Does the agent (or the model) already handle the PDF well enough? If so, let the agent do it. If not, do we need the text just once (“explain this!”), or do we want to keep it (for copy-pasting, building a knowledge base, and so on)?
To read the PDF once, interactively, let a vision-capable model read the page: fluent, but not to be trusted for maths we mean to quote verbatim.
To keep the text — for a batch, a long document, a RAG index — we convert it.
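The decision tree above is small enough to write down as a function. This is only a sketch of the routing as I read it; the names Route and choose_route are made up for illustration and belong to no tool.

```python
from enum import Enum


class Route(Enum):
    AGENT_NATIVE = "let the agent handle the PDF itself"
    VISION_READ = "let a vision-capable model read the rendered page"
    CONVERT = "convert the PDF to text and keep the result"


def choose_route(agent_handles_pdf: bool, keep_text: bool) -> Route:
    """Return the route suggested by the decision tree (hypothetical helper)."""
    if agent_handles_pdf:
        # The easiest thing to do is nothing.
        return Route.AGENT_NATIVE
    if keep_text:
        # Batches, long documents, RAG indexes: we want a text artefact.
        return Route.CONVERT
    # One-off interactive reading.
    return Route.VISION_READ
```

For example, choose_route(agent_handles_pdf=False, keep_text=True) picks the conversion route.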
2 When the agent already handles it
The easiest thing to do is nothing. Where the agent harness (or the model itself) handles the PDF well enough, this seems fine. Several frontends accept drag-and-dropped PDFs and convert them into something the model can read (markdown or more structured formats), and that does well enough that we stop here. See below for (my best understanding of) the toolchain in each case.
| Frontend | Native input | How |
|---|---|---|
| LM Studio | ✅ | Drag-drop .pdf / .docx / .txt; short docs go fully into context, long ones trigger auto-RAG |
| Jan | ✅ + tool-calling | File attachments plus MCP tool-calling can hand the PDF off to whichever extractor we wire up |
| Osaurus | ✅ text-layer | Attachments route through a document adapter registry (PDF, XLSX, PPTX, CSV). Scans still need the osaurus-vision plugin or an external converter via a tool/skill. |
| Qwen-Agent | ✅ + RAG | Python harness; Assistant(files=[…]) runs a built-in RAG pipeline (DocParser + BM25, no vector DB) over PDF / Office / HTML |
| MLX Studio | ❌ | 26+ agentic tools available; invoke a converter via shell. |
| DwarfStar / pi | ⚠️ partial | pi’s web UI extracts badly. A community pdf-reader skill is the current workaround. |
If that covers the job, we are done. If not — or if we would rather the model saw the page than its extracted text — read on.
3 Reading the page directly
If the agent’s model is vision-capable, the simplest pipeline needs no converter at all. We render the page — or just screenshot it — and pass the image to the model. It sees the two-column layout, the figure, the table, the equations as we do, and for many tasks that is enough: what does this paper say, what is in this plot, summarize section 3. Qwen3.6 reads screenshots happily, for example.
For reference, what Claude does as a multimodal model is a hybrid:
- extract the text with a run-of-the-mill PDF-to-text extractor (something like markitdown),
- also render each page of the PDF as an image, and
- slap both into the context, page by page.
This approach has two advantages over conversion. It drops nothing — a chart or a hand-drawn diagram that text extraction would discard is right there in the image — and it needs no setup (assuming our system can render PDFs).
I’m not 100% sure I trust it though; if we never see fiddly stuff like maths and tables as characters, how do we know we got it right? It is a strong option for reading and comprehension but a weak one as a source of record.
Moreover, even when this works we might still want conversion of longer documents, since the page images are token-heavy and burn through context fast. Also, since the transcription lives in the chat and dies with it, anything we want to grep, diff, re-embed, or index has to be converted and kept.
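The hybrid described above (extracted text plus a rendered image, page by page) can be sketched as a plain data-assembly step. The Page type and the block dictionaries here are illustrative assumptions, not the message format of any particular API; extraction and rendering are assumed to have happened already.

```python
from dataclasses import dataclass


@dataclass
class Page:
    number: int
    text: str        # output of any text extractor for this page
    image_path: str  # path to a rendered image of the same page


def build_context(pages: list[Page]) -> list[dict]:
    """Interleave text and image blocks, one pair per page, in page order."""
    blocks = []
    for page in sorted(pages, key=lambda p: p.number):
        blocks.append({"type": "text", "text": f"[page {page.number}]\n{page.text}"})
        blocks.append({"type": "image", "path": page.image_path})
    return blocks
```

Every page contributes an image block however little text it carries, which is where the context cost comes from.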
4 The converters
So, suppose we want a greppable, reusable text artefact made from a PDF and friendly to an LLM. The right one depends on what is in the PDF: markitdown for born-digital prose, marker or MinerU for maths, Docling for tables and structure, an OCR engine for scans. Here are the trade-offs:
| Converter | Ingest speed | Download | Maths fidelity | RAM |
|---|---|---|---|---|
| markitdown | instant | none | text only | negligible |
| Docling — standard | slow¹ | small | structure ✓, maths weak¹ | sub-GB |
| Docling — --pipeline vlm | fast (~6 s/pg) | 258 M | silently corrupts | sub-GB |
| marker | 1–8 min² | moderate | clean LaTeX ✓ | 3–5 GB |
| MinerU | batch³ | large | clean LaTeX ✓ | 16–32 GB |
¹ Docling’s accurate maths needs --enrich-formula — ~46 min for a 24-page paper, and it adds its own operator-grouping errors. ² ~1 min when marker reuses a text layer; ~8 min when it must OCR. ³ ~4-min predictor load per session, then under a minute a doc — only worth it across a batch.
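The routing implied by the table can be written as a small function. This is a sketch under my own assumptions: the function name and flags are hypothetical, and the priority order (maths first, then scans, then tables) is one reasonable reading of the trade-offs rather than a rule any of these tools state.

```python
def choose_converter(has_text_layer: bool, maths_heavy: bool = False,
                     table_heavy: bool = False, batch: bool = False) -> str:
    """Pick a converter name from a rough description of the PDF."""
    if maths_heavy:
        # MinerU's load time and RAM only pay off across a batch.
        return "MinerU" if batch else "marker"
    if not has_text_layer:
        # Scans, photos, handwriting: nothing to extract, so OCR.
        return "OCR engine"
    if table_heavy:
        return "Docling"
    # Born-digital prose: plain text-layer extraction is enough.
    return "markitdown"
```

A maths-heavy scan lands on marker here because marker can OCR when it must, at the slower end of its timing range.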
I ran the contenders over two equation-dense recent ICLR papers, checking each conversion against the PDF. Notably, converters that use a single vision-language model to transcribe the whole page silently corrupt the maths. Docling’s Granite VLM emits plausible LaTeX, but it turned 6 of 25 \(\log\frac{1}{\delta}\) terms into \(\log\frac{1}{8}\) or \(\log\frac{1}{2}\): valid LaTeX, wrong maths, the most dangerous kind of error. marker and MinerU route the page through specialist recognisers instead of one VLM, and came through clean, with every one of 26 \(\delta\) intact, \underbrace annotations and all. I did not put a general vision model reading the page directly through the same test, but it is the same whole-page operation, so I wouldn’t be amazed to see it corrupt the maths too (citation needed).
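Because this kind of corruption is still valid LaTeX, nothing fails loudly; we have to go looking for it. Here is a minimal sketch of a check for the one failure described above, a \(\delta\) denominator turned into a digit. It is a heuristic of my own, not the procedure used for the test, and a numeric denominator can of course be legitimate, so hits are prompts for a manual look at the PDF.

```python
import re

# Matches \log\frac{1}{X}, tolerating whitespace, and captures X.
LOG_FRAC = re.compile(r"\\log\s*\\frac\s*\{\s*1\s*\}\s*\{\s*([^{}]+?)\s*\}")


def audit_log_terms(latex: str) -> dict:
    """Count \\log\\frac{1}{...} terms and flag purely numeric denominators."""
    denominators = LOG_FRAC.findall(latex)
    return {
        "total": len(denominators),
        "delta": sum(d == r"\delta" for d in denominators),
        "suspicious": [d for d in denominators if d.isdigit()],
    }
```

Comparing the counts between two converters’ output for the same paper is a cheap first pass before reading the equations side by side.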
4.1 markitdown
Text-layer extraction only: fast, lightweight, no OCR, no layout, nothing else to install. Most PDFs from modern journals or word processors have a text layer (i.e. we can copy-paste text out of them), and markitdown extracts that text into a plain file.
Given a scan without a text layer it produces nothing. Even with a text layer, it does badly at maths and tables, and it discards figures and diagrams.
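A minimal sketch of driving markitdown from Python, assuming the package is installed with its PDF support. The looks_like_scan helper and its 50-character threshold are my own illustration of the “scan produces nothing” failure, not part of markitdown.

```python
def pdf_to_markdown(pdf_path: str) -> str:
    """Extract the text layer of a PDF with markitdown."""
    # Imported here so the helper below works without markitdown installed.
    from markitdown import MarkItDown

    result = MarkItDown().convert(pdf_path)
    return result.text_content


def looks_like_scan(extracted: str, min_chars: int = 50) -> bool:
    """Heuristic: almost no extracted text suggests there is no text layer.

    The threshold is arbitrary and only for illustration.
    """
    return len(extracted.strip()) < min_chars
```

If looks_like_scan(pdf_to_markdown("doc.pdf")) is true, the document probably needs one of the OCR-capable converters instead.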
4.2 Docling
Docling has two pipelines which are totally different in terms of cost and quality.
The VLM pipeline (docling --pipeline vlm --vlm-model granite_docling <file>) runs one end-to-end document model, Granite-Docling-258M. It is fast and tiny: Docling’s benchmark clocks 6.2 s/page on MLX (against 102 s on Transformers, M3 Max), and at 258M params it runs alongside anything we want. However, if “90% accurate” is not good enough for the equations, beware: it gets some of those wrong.
The standard pipeline (the default) is an ensemble of layout, table, and OCR models. Add --enrich-formula and it hands off equation regions to IBM’s specialist CodeFormula model, which fixes most of the corruption — but it is uselessly slow (~46 min for a 24-page paper, CPU-bound) and adds its own structural slips, like mis-grouping operators. So neither Docling path is the one for hi-fi maths; that is marker or MinerU.
Setup is uv tool install docling; the first run downloads the checkpoint from Hugging Face, and on Apple Silicon it picks the MLX build automatically.
Docling emits DocTags, a lossless structural markup; --to exports md, json, yaml, html, html_split_page, text, and doctags — the JSON/YAML is the full DoclingDocument (layout tree, bounding boxes, reading order, structured table cells, exportable to dataframes), and --to html_split_page --show-layout renders a page-by-page QA view. It converts in silence and writes <stem>.md to the current directory (--output <dir> to redirect); -v shows per-stage progress, and the HF_TOKEN / use_fast warnings are noise. For LLM use pass --image-export-mode placeholder (or referenced), otherwise it inlines every figure as a base64 data URL that torches the token budget.
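The flags above are easy to forget, so here is a sketch of a wrapper that assembles the command line for LLM-friendly output. It uses only the flags mentioned in this section; the function names are hypothetical, and I am assuming the docling executable is on the PATH.

```python
import subprocess


def docling_command(pdf_path: str, output_dir: str = ".", use_vlm: bool = False) -> list[str]:
    """Build a docling CLI invocation that writes Markdown with figure placeholders."""
    cmd = ["docling"]
    if use_vlm:
        # Fast Granite-Docling pipeline; not to be trusted for maths.
        cmd += ["--pipeline", "vlm", "--vlm-model", "granite_docling"]
    cmd += [
        "--to", "md",
        "--image-export-mode", "placeholder",  # avoid inlined base64 figures
        "--output", output_dir,
        pdf_path,
    ]
    return cmd


def run_docling(pdf_path: str, output_dir: str = ".", use_vlm: bool = False) -> None:
    """Run the conversion, raising if docling exits with an error."""
    subprocess.run(docling_command(pdf_path, output_dir, use_vlm), check=True)
```

Keeping the command construction separate from the subprocess call makes it easy to log or inspect exactly what will be run.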
4.3 marker
marker is a self-contained bundle of small, specialist surya models: layout, OCR, maths-to-LaTeX, tables. It is not a single vision-language model transcribing the whole page, the way Docling’s VLM pipeline is; its specialist models ship with it and run automatically.
Install with uv tool install marker-pdf (or one-off via uvx --from marker-pdf marker_single ./doc.pdf). The first run pulls the surya weights into the Hugging Face cache automatically, so there is no separate weights step.
marker_single ./doc.pdf came through the maths test corruption-free, and it loads fast enough to be a competitive choice for a one-off conversion.
It uses the MPS accelerator and wants ~3–5 GB per worker, which I would call light on RAM for a local converter. Conversion speed is variable: quick on a born-digital text layer and slow when it has to OCR the page.
marker does have a --use_llm refinement pass for cross-page table merging, inline maths, and form fields. This might be useful? However for me it was working fine with the vanilla surya pipeline already, so it isn’t necessary. I am not sure it is worth the pain for local setups: --use_llm calls Gemini by default, so keeping it local means standing up a second LLM backend or using the agent’s own endpoint and threading its port and model name through marker’s flags. Once a second model has to be resident anyway, the monolithic MinerU does the high-fidelity job in one process with none of the plumbing — so locally I skip --use_llm, and switch to MinerU on the rare occasion I want hardcore cleanup.
4.4 MinerU
A hefty hi-fi tool. Install with uv tool install 'mineru[all]'. On Apple Silicon the 3.3 hybrid-engine backend auto-selects the MLX VLM engine. Budget the one-off ~4-minute predictor load per session and 16–32 GB of RAM; after that it is fast, converting the 24-page test paper in under a minute. That memory-and-load cost makes sense if we batch convert; for a single document it is wasted. Formulas come out as well-formed LaTeX, including niceties like \tag{…} equation numbers; tables come out as HTML; and OCR covers a notional 109 languages. It ships its own MCP server, which is helpful for the agent wiring below if we can afford to keep it running in the background.
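The batch argument is just arithmetic on the timings quoted above: a ~4-minute load, then under a minute per document, against marker’s 1 to 8 minutes per document. Here is a sketch of the break-even batch size, assuming a full minute per MinerU document as an upper bound and counting wall-clock time only; the 16–32 GB of RAM is the other half of the cost and is not modelled.

```python
import math
from typing import Optional


def mineru_break_even(marker_min_per_doc: float,
                      mineru_load_min: float = 4.0,
                      mineru_min_per_doc: float = 1.0) -> Optional[int]:
    """Smallest batch size at which MinerU's total time beats marker's.

    Returns None if MinerU never catches up on time alone.
    """
    saving = marker_min_per_doc - mineru_min_per_doc
    if saving <= 0:
        return None
    # Need n * saving > load, i.e. the smallest integer n above load / saving.
    return math.floor(mineru_load_min / saving) + 1
```

Against marker reusing a text layer (~1 min) MinerU never wins on time; against marker at 2 minutes per document it needs a batch of 5; against marker OCRing at ~8 minutes it wins on time from the first document.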
4.5 Pure OCR
For scans, photos, and handwriting there is no text layer to extract, so the job is pure OCR. DeepSeek-OCR is the fast pick, with an 8-bit community port that runs under mlx-vlm (uv pip install mlx-vlm). Note that it is highly strung: it goes haywire if the prompt is even slightly wrong. Chandra 2 is marker’s big sibling: a 4B OCR VLM and benchmark leader on scans, handwriting, forms, and 40+ languages, and the heaviest option here. Neither went through the maths test, so this is the makers’ positioning, not my own measurement.
5 pdf-ingest skill for agents
When the reader is an agent rather than me, the routing above has to be packaged so that the agent can load it. In the agentskills model, a skill loads its detailed body only on demand. Thus we can put the routing policy in the root skill and keep each engine’s flags and gotchas in per-engine files that the skill loads when needed.
A converter run from the skill may want an LLM of its own — marker’s --use_llm, say. The agent is already an LLM on a local endpoint, so it’s cheapest to point the converter at that same endpoint rather than load a second model. That looks unsafe — won’t it deadlock, or need the model loaded twice? It works anyway: during tool execution the harness’s model is idle, because the agent loop is generate → dispatch tool → wait. While the agent blocks on the converter, the converter has the endpoint to itself.
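The ordering argument can be made concrete with a toy, single-threaded simulation. Everything here is a stand-in: Endpoint, converter_tool, and agent_turn are hypothetical names, and a real harness and server are more complicated. The point is only that generate, dispatch, wait means the agent’s and the converter’s requests take turns rather than overlap.

```python
class Endpoint:
    """Stand-in for a local LLM endpoint that serves one request at a time."""

    def __init__(self):
        self.busy = False
        self.log = []

    def generate(self, caller: str, prompt: str) -> str:
        # An overlapping request is treated as a bug so the sketch fails loudly.
        assert not self.busy, "overlapping request"
        self.busy = True
        try:
            self.log.append(caller)
            return f"{caller} reply to: {prompt}"
        finally:
            self.busy = False


def converter_tool(endpoint: Endpoint, pdf_path: str) -> str:
    """A converter that borrows the agent's endpoint for its own LLM pass."""
    return endpoint.generate("converter", f"clean up {pdf_path}")


def agent_turn(endpoint: Endpoint, pdf_path: str) -> str:
    endpoint.generate("agent", "decide which tool to call")          # generate
    tool_output = converter_tool(endpoint, pdf_path)                 # dispatch tool, wait
    return endpoint.generate("agent", f"summarise: {tool_output}")   # generate again
```

After one turn the request log reads agent, converter, agent: the converter only ever calls in while the agent is blocked waiting for it.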
The skill is published at danmackinlay/pdf-ingest-skill; its README has the per-harness install steps for Claude Code, Hermes, pi, Osaurus, and Jan.
