Generative AI workflows and hacks 2026
2026-01-13 — 2026-05-15
In Which the Tokenization Behaviours of Ollama and Hugging Face Transformers Are Found to Diverge When Processing Markdown, and Practical Remedies for Local Embedding Pipelines Are Considered.
I’ll try to synthesise LLM research elsewhere. This is where I keep ephemeral notes and links, continuing my habit from 2025.
1 PDFs, Word docs
- datalab-to: quick and savage.
- microsoft/markitdown: higher quality, but eats 2 GB of space (sketch below).
- opendatalab/MinerU: still to test.
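markitdown at least is small enough to show inline. A minimal sketch, assuming the documented MarkItDown API and a local sample.pdf:

```python
from markitdown import MarkItDown  # pip install markitdown

md = MarkItDown()
result = md.convert("sample.pdf")  # also handles .docx, .pptx, .xlsx, .html, ...
print(result.text_content)         # Markdown out
```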
2 Claude Code is locked to Google Chrome
Claude Code has browser integration, but only for Google Chrome, and they do not intend to fix that. I am not a fan of Google Chrome because it is creepy and gross. Firefox is not at all well supported, but many Chromium-based browsers can be forced into working.
Whether it is wise to let your browser traffic flow through the Anthropic servers I’ll leave to you; they already have your intimate thoughts though from all that Claude usage, so why not get some extra value, I guess?
Anyway, there are community-supported hacks to get non-Google-Chrome browsers working, e.g. stolot0mt0m/claude-chromium-native-messaging. I like Vivaldi.
macOS setup:
Take care before running this; for all I know it sends your API keys to Macedonian script kiddies and plays the sad trombone sound.
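For what it's worth, the generic shape of these hacks: every Chromium fork looks for native-messaging host manifests in its own NativeMessagingHosts directory, so mirroring whatever manifest Claude Code installs for Chrome into Vivaldi's directory is usually the whole trick. A sketch under that assumption; I have not verified this is what the repo above actually does:

```python
from pathlib import Path

# Standard Chromium native-messaging manifest locations on macOS.
support = Path.home() / "Library/Application Support"
src_dir = support / "Google/Chrome/NativeMessagingHosts"
dst_dir = support / "Vivaldi/NativeMessagingHosts"

dst_dir.mkdir(parents=True, exist_ok=True)
for manifest in src_dir.glob("*.json"):
    link = dst_dir / manifest.name
    link.unlink(missing_ok=True)  # replace any stale copy
    link.symlink_to(manifest)     # Vivaldi now sees Chrome's native hosts
```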
3 Ollama tokenization versus Hugging Face
There are several convenient ways to invoke LLMs locally. I am learning to understand their affordances by breaking them.
First exercise: mxbai-embed-large-v1 from Hugging Face and mxbai-embed-large from Ollama are nominally the same model. The trained weights derive from the same upstream artefact. The stacks around those weights are not the same, and on real inputs the two stacks produce different outputs from the same text.
The HF path: pip install transformers, then AutoModel.from_pretrained("mixedbread-ai/mxbai-embed-large-v1") in our own Python process. The first call downloads the weights, the tokenizer config (tokenizer.json), and the model config from huggingface.co into ~/.cache/huggingface/hub/. Inference runs in our process, via PyTorch (or sentence-transformers, or whatever library is wrapping it). Tokenization runs in the same process, performed by HF's Rust tokenizers library reading that same tokenizer.json. One process, one library, one set of files on disk.
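Concretely, the whole HF path fits in a few lines. A minimal sketch; the CLS pooling is my assumption from the model card, so check it before trusting the vectors:

```python
import torch
from transformers import AutoModel, AutoTokenizer

name = "mixedbread-ai/mxbai-embed-large-v1"
tok = AutoTokenizer.from_pretrained(name)       # reads the cached tokenizer.json
model = AutoModel.from_pretrained(name).eval()

batch = tok(["How do I embed this?"], padding=True, truncation=True, return_tensors="pt")
with torch.no_grad():
    out = model(**batch)
emb = out.last_hidden_state[:, 0]               # CLS pooling (assumed)
```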
The Ollama path: install the ollama daemon, then ollama pull mxbai-embed-large. The daemon downloads a .gguf file from Ollama's registry, not from huggingface.co. That .gguf is a single binary bundling the weights (usually quantised to ~4 bits), a re-implementation of the tokenizer, and the model config¹. Inference runs inside the long-lived ollama serve process. We talk to it over HTTP on localhost:11434. Tokenization runs server-side, in C++, against the tokenizer rules baked into the .gguf.
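The client side of that path is just HTTP; a minimal sketch against Ollama's embeddings endpoint:

```python
import requests

resp = requests.post(
    "http://localhost:11434/api/embeddings",
    json={"model": "mxbai-embed-large", "prompt": "How do I embed this?"},
    timeout=60,
)
resp.raise_for_status()
embedding = resp.json()["embedding"]  # tokenization happened server-side, in C++
```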
So: same upstream weights, two stacks, two tokenizer implementations. The two ought to agree, but in practice the agreement is approximate at best. HF's Rust tokenizers and llama.cpp's C++ tokenizer are independent implementations. I assume the conversion that produces the GGUF tokenizer rules from the original tokenizer.json is not bit-perfect: pre-tokeniser steps, Unicode normalisation, whitespace handling, and the post-processor rules ([CLS]/[SEP] insertion) don't all round-trip exactly. On clean prose they agree. On markdown (code fences, tables, weird whitespace) the token counts diverge by around 5% on the same input.
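We can measure that gap directly by tokenizing the same file both ways. A sketch assuming llama-cpp-python and a local copy of the GGUF; the filename here is hypothetical, since Ollama actually stores its blobs under content-addressed names:

```python
from llama_cpp import Llama              # pip install llama-cpp-python
from transformers import AutoTokenizer

hf = AutoTokenizer.from_pretrained("mixedbread-ai/mxbai-embed-large-v1")
gg = Llama(model_path="mxbai-embed-large.gguf", vocab_only=True)  # tokenizer only, no weights

text = open("some_post.md").read()
n_hf = len(hf(text)["input_ids"])
n_gg = len(gg.tokenize(text.encode("utf-8")))
print(f"HF: {n_hf}  GGUF: {n_gg}  divergence: {abs(n_hf - n_gg) / n_hf:.1%}")
```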
That divergence only matters when we tokenize with one stack and embed with the other, which is exactly what happens if we use HF's tokenizer to pre-truncate before sending to Ollama. The HF tokenizer says "510 tokens, fits comfortably under the model's 512 limit." Ollama, looking at the same text with its converted tokenizer, says "534 tokens, doesn't fit," and 400s with "the input length exceeds the context length". There is no /api/tokenize endpoint on Ollama, so we cannot ask the server how it tokenizes anything, and the server-side truncate=True flag is unreliable in 0.20.x for this model. The two paths cannot be made consistent from the outside.
So: do not mix and match. I switched to an all-transformers stack for this blog's embedding pipeline. sentence-transformers is a thin wrapper around HF transformers that loads the same weights into our Python process and tokenizes with the same tokenizer it embeds with. One model, one tokenizer, one process, no IPC, no version skew.
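In code the pipeline collapses to one object, and the pre-flight length check and the embedding call share a tokenizer, so they cannot disagree:

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1")

chunk = "...some markdown..."
n_tokens = len(model.tokenizer(chunk)["input_ids"])  # same tokenizer encode() uses
vec = model.encode(chunk, normalize_embeddings=True)
```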
Dropping the Ollama hop also lets us set dtype=torch.float16 on MPS — a 15× speedup over fp32 with indistinguishable quality. The fp16 path only matters on GPU or Apple Silicon; CPU stays in fp32 because half-precision on CPU is slow and pointless.
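A sketch of that device/dtype logic, using plain PyTorch half-precision rather than any library-specific flag:

```python
import torch
from sentence_transformers import SentenceTransformer

device = "mps" if torch.backends.mps.is_available() else "cpu"
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", device=device)
if device == "mps":
    model.half()  # fp16: big speedup on Apple Silicon; CPU stays fp32
```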
4 Reading the internet
MCPs are handy for reading the internet.
Currently my favourite is Jina, which is affordable.
Their competitor Firecrawl is too expensive for my use case, given the marginal value it adds.
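For what it's worth, Jina's reader endpoint is simple enough to use without an MCP at all: prefix any URL with r.jina.ai and you get Markdown back.

```python
import requests

url = "https://example.com/some-article"
markdown = requests.get("https://r.jina.ai/" + url, timeout=30).text
```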
5 Google is also a bit shit at copy-pasteable content
Google broke its Markdown output in an interesting way: links don’t work in the output docs. This is particularly infuriating in my favourite Google product, Deep Research. It’s supposed to be about citations and links, and yet it can’t turn URLs into linked text (try copy-pasting the output to get inline citations if you don’t believe it).
For that reason (and because Gemini subscriptions are annoying and have shitty bundle pricing), as soon as Google released an API version of Deep Research I wrote a client that does all kinds of clever stuff to bypass the horrible output formatting and get the links, math, tables, and code blocks out in a format I can actually use. Unlike Google's subscription options, it is pay-per-use at standard token pricing; my last research report cost about USD 1.70.
See danmackinlay/gemini_deep_research_client. It does quite a lot to make sure the output works, including links, mathematics, and the other things researchers actually want included in their work. Pull requests welcome.
6 OpenAI UX increasingly sucks and they do not really care about that
The ChatGPT client used to generate lovely Markdown that I could use where I wanted structure (math, code, tables), and HTML for text-y unstructured things.
That was a good time. A few months ago they totally fucked it up, and they’ve shown no interest in fixing it. We can no longer copy-paste tidy, structured Markdown from ChatGPT — just the slobbery mess of HTML.
Read on if you want to fix that.
But actually, I have just decided to quit OpenAI and use a different provider, so I don’t care about this any more. They used to have the best model for advanced mathematics, but Anthropic’s Claude is now better at that for my use cases, and it has beautiful Markdown output. Anyway, OpenAI is emitting worrying signs of being unethical.
I’m leaving the following bit here for posterity. I suspect OpenAI’s interest isn’t in chat clients — that’s a legacy product they keep around while they plan on brain-computer interfaces or utility fog or something. The fix won’t come from them.
Solutions:
First, we could move to a good client like Jan that preserves Markdown on copy-paste and, as a bonus, lets us use diverse backends.
Alternatively, if we’re being lazy, open the chat we want to copy in a web browser and then use a browser extension to convert the HTML. Here are two that were recommended to me:
- Chrome: Markdown Capturer - BibCit
- Firefox: ChatGPT LaTeX Copy Fix
Partial fix: use a macOS script to convert clipboard HTML to Markdown. Here is a fish script:
```fish
function chat2md
    if not type -q pandoc
        echo "Install pandoc: brew install pandoc" >&2
        return 1
    end
    # Grab HTML off the clipboard, strip Apple's span cruft, convert with pandoc.
    # NB: pbpaste's -Prefer officially accepts only txt/rtf/ps, so this step may
    # fall back to plain text on some macOS versions.
    pbpaste -Prefer public.html 2>/dev/null |
        sed 's/class="Apple-converted-space"//g; s/<span[^>]*>//g; s/<\/span>//g' |
        pandoc -f html+raw_html -t markdown+raw_tex+tex_math_dollars+fenced_code_blocks+hard_line_breaks --wrap=preserve -s |
        pbcopy
    echo "HTML → Markdown complete (LaTeX, code, newlines preserved)"
end
```

TBH, this still messes with Markdown math, but it preserves code blocks and newlines, which is a win.
7 But actually just ditch OpenAI
Claude has beautiful Markdown output. Also, OpenAI seems to be on some kind of slide into being generally evil, and I’ve noticed that my friends who work for OpenAI soon quit or stop getting invited to fun parties. Various actors, e.g. QuitGPT, have been arguing that OpenAI is a bad actor. I am not qualified to assess all their claims.
8 Incoming
Footnotes
1. GGUF is the format that llama.cpp uses; Ollama wraps llama.cpp. When a model is added to the Ollama library, someone (usually an Ollama maintainer) runs a conversion script over the original HF release to produce the .gguf. That happens once, well before we pull the model.↩︎
