PDF, Portable Document Format
How we may use a thousand dollar computer to simulate a one cent piece of paper with zero day exploits
June 3, 2018 — May 28, 2023
Portable Document Format, the abstruse and inconvenient format beloved of academics, bureaucracies, and Adobe. It has the notable feature of being a better format than Microsoft Word, in much the same way that sticking your hand in a blender is better than sticking your hand in a woodchipper.
Look, they can include video games.
1 PDF readers
See PDF readers.
2 Extracting data
Tabula is a tool for liberating data tables locked inside PDF files.
pdfplumber also exists but I have not used it.
Camelot is an OpenCV-backed table extractor. It has a browser-based GUI, Excalibur.
There are both open (Tabula, pdfplumber) and closed-source tools that are widely used to extract data tables from PDFs. They either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy.
tl;dr Tabula if you want the easiest possible experience at the cost of some power, otherwise Camelot/Excalibur.
Commercial online PDF tool smallpdf claims to do this. (USD12/month with free trial)
3 Tools
3.1 Ghostscript
A classic, very useful but kind of ugly and bizarre to use. See Ghostscript for more info; I won’t document it here because I prefer easier tools. I keep Ghostscript around because some other software I use depends upon it. 🏗️
Exception: it has a really handy thing:
3.1.1 Downgrading PDFs
Ghostscript can downgrade high-version PDFs to v1.4, which is safely old. This is very helpful for version incompatibility.
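A minimal sketch of such a downgrade, assuming Ghostscript is installed as `gs`. The helper names `downgrade_command` and `downgrade` and the file names are hypothetical; `-dCompatibilityLevel` is the pdfwrite option that sets the target PDF version.

```python
import subprocess


def downgrade_command(src, dst, level="1.4"):
    """Build a Ghostscript command that rewrites src as an older-version PDF."""
    return [
        "gs",
        "-sDEVICE=pdfwrite",              # re-emit the document as PDF
        f"-dCompatibilityLevel={level}",  # target PDF version
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={dst}",
        src,
    ]


def downgrade(src, dst, level="1.4"):
    """Run the command; raises CalledProcessError if gs exits non-zero."""
    subprocess.run(downgrade_command(src, dst, level), check=True)
```

Calling `downgrade("new.pdf", "old.pdf")` would then write the v1.4 copy next to the original.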
3.2 QPDF
Less powerful, but simpler and less fragile than Ghostscript.
- QPDF: A Content-Preserving PDF Transformation System
- qpdf/qpdf: Primary QPDF source code and documentation
- QPDF Manual
QPDF is a command-line program that does structural, content-preserving transformations on PDF files. It could have been called something like pdf-to-pdf. It also provides many useful capabilities to developers of PDF-producing software or for people who just want to look at the innards of a PDF file to learn more about how they work. … QPDF includes support for merging and splitting PDFs through the ability to copy objects from one PDF file into another and to manipulate the list of pages in a PDF file. Th…
QPDF is not a PDF content creation library, a PDF viewer, or a program capable of converting PDF into other formats. In particular, QPDF knows nothing about the semantics of PDF content streams. If you are looking for something that can do that, you should look elsewhere. However, once you have a valid PDF file, QPDF can be used to transform that file in ways that perhaps your original PDF creation tool can’t handle. For example, many programs generate simple PDF files but can’t password-protect them, web-optimize them, or perform other transformations of that type.
4 Concatenate/split
I need to concatenate or split PDFs so often that I may yet get around to making a keyboard shortcut for it.
tl;dr To concatenate PDFs on POSIX systems I use QPDF. To split PDFs on macOS I use Preview, and I have not split many PDFs on other systems so cannot really recommend anything.
There are many ways to do this. Concatenating PDFs is where QPDF excels over Ghostscript, although Ghostscript is older and so has more HOWTOs.
A classic that I see around is this Ghostscript command:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress -sOutputFile=output.pdf input*.pdf
This sometimes works, and sometimes behaves badly in ways that I have not investigated — the output file can be massive, much larger than the sum of its parts. Sometimes it is lossy and the fonts are mangled.
The QPDF version is not much more intuitive but seems to mangle the PDF less often.
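A sketch of a QPDF concatenation, wrapped in Python so that it could sit behind a keyboard shortcut. It assumes `qpdf` is on the PATH; the function names and file names are hypothetical. `--empty` starts from a blank document, and `--pages ... --` lists the files whose pages are copied in, in order.

```python
import subprocess


def qpdf_concat_command(inputs, output):
    """Build a QPDF command that concatenates inputs, in order, into output."""
    if not inputs:
        raise ValueError("need at least one input PDF")
    # Everything between --pages and the closing -- names a source file.
    return ["qpdf", "--empty", "--pages", *inputs, "--", output]


def qpdf_concat(inputs, output):
    """Run the concatenation; raises CalledProcessError if qpdf fails."""
    subprocess.run(qpdf_concat_command(inputs, output), check=True)
```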
On macOS there was once a system PDF concatenation tool. There is still a right-click PDF merge in the Finder, though.
pdfunite is the poppler concatenate command. Poppler is fairly ubiquitous and odds are good it is already installed.
You can split PDFs also with Ghostscript, but usually you want a GUI to see what you are splitting, no? On macOS I use Preview.app. I do not have a favourite yet on other systems.
Antoine Chambert-Loir points out that the underdocumented command texexec has many PDF-editing sub-commands. One of them can, for instance, extract pages 1 to 5, page 7 and pages 8 to 12 from file.pdf and put them in outputfile.pdf.
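The same page selection can be sketched with QPDF instead of texexec. This is a swapped-in tool, not the texexec invocation. It assumes `qpdf` is on the PATH; `page_spec` and `qpdf_extract_command` are hypothetical helper names.

```python
import subprocess


def page_spec(ranges):
    """Turn [(1, 5), 7, (8, 12)] into the QPDF range string '1-5,7,8-12'."""
    parts = []
    for r in ranges:
        if isinstance(r, int):
            parts.append(str(r))          # a single page
        else:
            first, last = r
            parts.append(f"{first}-{last}")  # an inclusive range
    return ",".join(parts)


def qpdf_extract_command(src, ranges, dst):
    """Build a QPDF command that copies the selected pages of src into dst."""
    return ["qpdf", "--empty", "--pages", src, page_spec(ranges), "--", dst]


def qpdf_extract(src, ranges, dst):
    subprocess.run(qpdf_extract_command(src, ranges, dst), check=True)
```

For the selection described above, that would be `qpdf_extract("file.pdf", [(1, 5), 7, (8, 12)], "outputfile.pdf")`.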
There is also PDFtk which has a straightforward command line. Java port pdftk-java has a GUI called PDF Chain. Might be good? Special power: extracting PDF attachments.
PDFMix and PDF shuffler have both been recommended to me but I have not tried them.
5 Tile cutting
- oxplot/pdftilecut: pdftilecut lets you sub-divide a PDF page(s) into smaller pages so you can print them on small form printers. (clever QPDF front-end)
- Printing god damin A0 poster as set of A4’s — manual CLI option
- Scaffolded Math and Science: How to Enlarge a PDF into a Multi-Page Poster for FREE! 3 Simple Steps
- PosteRazor - Make your own poster! (no longer seems to work well)
6 Misc command lines
6.1 Searching
Shawn Graham suggests a pdftotext hack. First, install poppler using your choice of package manager. Now, extract the text of each PDF with pdftotext and search that instead.
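A sketch of the same idea in Python rather than the original shell one-liner. It assumes poppler's `pdftotext` is on the PATH; all function names are hypothetical. Passing `-` as the output file makes pdftotext write to stdout.

```python
import re
import subprocess
from pathlib import Path


def pdftotext_command(pdf):
    """Build a pdftotext command that prints the extracted text to stdout."""
    return ["pdftotext", str(pdf), "-"]


def matching_lines(text, pattern):
    """Return (line_number, line) pairs of text matching the regex, ignoring case."""
    rx = re.compile(pattern, re.IGNORECASE)
    return [(n, line) for n, line in enumerate(text.splitlines(), 1)
            if rx.search(line)]


def search_pdfs(folder, pattern):
    """Yield (path, line_number, line) for every hit in PDFs under folder."""
    for pdf in sorted(Path(folder).rglob("*.pdf")):
        done = subprocess.run(pdftotext_command(pdf),
                              capture_output=True, text=True)
        if done.returncode != 0:
            continue  # skip files pdftotext cannot read
        for n, line in matching_lines(done.stdout, pattern):
            yield pdf, n, line
```

Line numbers refer to the extracted text, not to anything in the PDF itself, so they are only a rough guide to where the hit is.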
6.2 Reduce bloat
PDFs have a lot of ways of storing data and many of them leave you keeping lots of crap there that you do not need for the current purpose, in the form of, presumably, inefficiently compressed images, excessively high-resolution images, or miscellaneous other crap. Slimming them down to the essentials is in general complicated and context-dependent, and I do not know of a general solution. Here are some that work in some contexts.
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 \
-dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=output.pdf input.pdf
This is wrapped up into a nice little script, ShrinkPDF, which takes the target image resolution in dpi as a parameter (e.g. 90).
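A sketch in the spirit of that script, not a copy of it, showing how a dpi target can be passed to Ghostscript. It assumes `gs` is on the PATH; `shrink_command` is a hypothetical name. The `Downsample…Images` and `…ImageResolution` switches are standard pdfwrite parameters.

```python
import subprocess


def shrink_command(src, dst, dpi=90):
    """Build a Ghostscript command that downsamples images to about dpi."""
    cmd = ["gs", "-q", "-dNOPAUSE", "-dBATCH",
           "-sDEVICE=pdfwrite", "-dCompatibilityLevel=1.5"]
    # Colour, greyscale and monochrome images are configured separately.
    for kind in ("Color", "Gray", "Mono"):
        cmd += [f"-dDownsample{kind}Images=true",
                f"-d{kind}ImageResolution={dpi}"]
    cmd += [f"-sOutputFile={dst}", src]
    return cmd


def shrink(src, dst, dpi=90):
    subprocess.run(shrink_command(src, dst, dpi), check=True)
```

Whether the result is actually smaller depends on what was bloating the file in the first place, so it is worth comparing file sizes afterwards.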
There is also cpdf and the GUI version Densify.
Commercial online PDF tool smallpdf claims to shrink PDFs also. (USD12/month with free trial.)
A peer recommendation: PDF Resizer.
6.3 Wet ink signatures without printer
Many businesses demand “wet ink” signatures on signed PDFs, i.e. that we fill the digital form in, then print it, sign it and then scan it in again. This wastes time, money, paper and adds no discernible security to the truly shitty security mechanism that is signing things with ink. In the hypothetical case that it makes absolutely no legal difference, it could be automated.
I expect that the following methods are much easier, cheaper, no less secure and for all practical purposes indistinguishable from printing and scanning a wet ink signature. We can automate fake-printing so that it at least saves time and money. Here are many scripts to simulate a round trip via the printer:
I quite like
convert -density 100 input.pdf -rotate "$([ $((RANDOM % 2)) -eq 1 ] && echo -)0.$(($RANDOM % 4 + 5))" -attenuate 0.4 +noise Multiplicative -attenuate 0.03 +noise Multiplicative -sharpen 0x1.0 -colorspace Gray output.pdf
The scripts get much more elaborate than that though. Please do not use them to do anything illegal. Also, if you have a serious contractual obligation that hinges upon the “wet ink” quality of the signature rather than a more substantive mechanism for verifying your identity and consent (such as cryptographic signing, and witnesses) then perhaps you should reconsider the wisdom of the entire project.
7 Conversions
7.1 PDF to text
OCRMyPDF makes a scanned PDF possibly-searchable and also optimizes the size, optionally aggressively. This will not downsample, but it will get better monochrome compression than normal.
Dangerzone is more extreme still; it deliberately rasterizes the PDF and deletes all metadata to make it anonymous and safe before OCRing it to get the text out:
Dangerzone works like this: You give it a document that you don’t know if you can trust (for example, an email attachment). Inside of a sandbox, dangerzone converts the document to a PDF (if it isn’t already one), and then converts the PDF into raw pixel data: a huge list of RGB colour values for each page. Then, in a separate sandbox, dangerzone takes this pixel data and converts it back into a PDF.
Read the blog post for more information.
The resulting PDF is less capable and may contain errors, but it is also safe(r).
7.2 EPS to PDF
EPS to PDF conversion:
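One way to do this, sketched on the assumption that Ghostscript is the tool at hand and is installed as `gs`. The function names are hypothetical. `-dEPSCrop` asks Ghostscript to size the page to the EPS bounding box rather than a default paper size.

```python
import subprocess


def eps_to_pdf_command(src, dst):
    """Build a Ghostscript command that converts an EPS file to a cropped PDF."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-dEPSCrop",            # use the EPS bounding box as the page size
        "-sDEVICE=pdfwrite",
        f"-sOutputFile={dst}",
        src,
    ]


def eps_to_pdf(src, dst):
    subprocess.run(eps_to_pdf_command(src, dst), check=True)
```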
7.3 PDF to SVG
pdf2svg generates editable vector diagrams from the PDF.
7.4 HTML to PDF
wkhtmltopdf and wkhtmltoimage are open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely “headless” and do not require a display or display service.
Note also that Weasyprint, below, does this out of the box.
8 Diffing PDFs
The use case is, they claim, a (presumably scientific) review. “You reviewed version A of a paper, and receive version B, and wonder what the changes are.” The tool is pdfdiff.
9 Books, booklets and binding
The professional word for laying out the book correctly for the type of binding you intend is imposition apparently. There are many expensive professional tools to do this and some scrappy free alternatives.
pdfbooklet (Linux/Windows) is one:
PdfBooklet is a Python script whose first purpose was to create booklet(s) from existing pdf files. It has been extended to many other functions in pdf pages manipulation.
Featuring
- Multiple booklets
- Add blank pages at the beginning or at the end
- Adjust scale and margin.
jPDF is also recommended in some circles.
Honourable mention also to impositioner which seems to be full-featured despite being a short command-line script. bookletimpose/pdfimposer is a Python GTK GUI/command-line combo imposer. pdfimpose is a multiple PDF script whatsit.
Commercial options include Devalipi Imposition Studio and Montax Imposer.
10 Crop marks
There are at least two options:
None makes it clear which of TrimBox, BleedBox, CropBox or ArtBox is what I truly want. This might clarify it slightly but I lost focus around here.
I can add crop marks to a PDF document with different PDF tools, e.g. pdftk:
- Export the first page with crop marks to a PDF file (your_cropmark.pdf)
- Join it with your PDF document (your_document.pdf) in the command line:
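A sketch of the joining step, assuming `pdftk` (or pdftk-java) is on the PATH. The function names and the output file name are hypothetical. The `stamp` operation overlays a single-page PDF on every page of the input, which is what a one-page crop-mark template needs.

```python
import subprocess


def cropmark_command(document, cropmark, output):
    """Build a pdftk command that overlays cropmark on each page of document."""
    return ["pdftk", document, "stamp", cropmark, "output", output]


def add_cropmarks(document, cropmark, output):
    subprocess.run(cropmark_command(document, cropmark, output), check=True)
```

For the files named above: `add_cropmarks("your_document.pdf", "your_cropmark.pdf", "result.pdf")`.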
OR I can set PDF cropping values with Ghostscript for printing.
Create a plain text file containing a pdfmark with the right cropping values, e.g. for a 5mm crop of A4. Alternatively, pass the pdfmark on the command line.
Then convert my_document.pdf using that file (which I called pdfmark.txt).
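The arithmetic behind those cropping values can be sketched as follows. PDF boxes are measured in points (72 per inch), so a 5mm margin on A4 (210mm by 297mm) has to be converted. This assumes that a CropBox is the box being set and that the pdfmark is passed inline with `-c ... -f`; the function names are hypothetical.

```python
import subprocess

MM_TO_PT = 72 / 25.4  # 72 points per inch, 25.4 mm per inch


def cropbox_pdfmark(width_mm, height_mm, margin_mm):
    """Return a pdfmark that crops margin_mm off every side of the page."""
    left = bottom = margin_mm * MM_TO_PT
    right = (width_mm - margin_mm) * MM_TO_PT
    top = (height_mm - margin_mm) * MM_TO_PT
    box = " ".join(f"{v:.2f}" for v in (left, bottom, right, top))
    return f"[/CropBox [{box}] /PAGES pdfmark"


def crop_command(src, dst, pdfmark):
    """Build a Ghostscript command that applies the pdfmark before reading src."""
    return ["gs", "-q", "-dNOPAUSE", "-dBATCH", "-sDEVICE=pdfwrite",
            f"-sOutputFile={dst}", "-c", pdfmark, "-f", src]


def crop(src, dst, pdfmark):
    subprocess.run(crop_command(src, dst, pdfmark), check=True)
```

`cropbox_pdfmark(210, 297, 5)` gives `[/CropBox [14.17 14.17 581.10 827.72] /PAGES pdfmark`, which could equally be saved as pdfmark.txt and passed as a file ahead of the document.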
11 Colour conversion
Nightmares. Colour management is generally complicated. Ghostscript colour management specifically is complicated and has many moving parts, some of them moving rapidly; e.g. the -dUseCIEColor option was removed in Ghostscript 9, because it is apparently a broken noob feature. Its replacement is broken documentation.
I am aware this is a complicated and nuanced area with much special labour involved. But I do not care. If I am working on a project with a graphic designer then they can do this with their skill and training, but for me, I just want a document which prints adequately with some vague approximation of the colours of the screen and no errors. That means, changing to CMYK, or not. No other alterations considered.
11.1 To CMYK
CMYK colour conversion of RGB PDF with Ghostscript:
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
-sColorConversionStrategy=CMYK \
-sColorConversionStrategyForImages=CMYK \
-dProcessColorModel=/DeviceCMYK \
-dCompatibilityLevel=1.5 \
-sOutputFile=result_cmyk.pdf \
your_document.pdf
See also a PDF to TIFF example.
11.2 To Greyscale
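A sketch of a greyscale conversion with Ghostscript, mirroring the CMYK command with Gray substituted. It assumes `gs` is on the PATH; the function names and file names are hypothetical.

```python
import subprocess


def greyscale_command(src, dst):
    """Build a Ghostscript command that converts all colours in src to grey."""
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        "-sColorConversionStrategy=Gray",   # convert every colour space to grey
        "-dProcessColorModel=/DeviceGray",
        "-dCompatibilityLevel=1.5",
        f"-sOutputFile={dst}",
        src,
    ]


def to_greyscale(src, dst):
    subprocess.run(greyscale_command(src, dst), check=True)
```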
12 Programmatic editing and generation
So many of these. There are even more that I have not reviewed, e.g. pypdf2.
12.1 Weasyprint
Weasyprint seems the cleanest. It converts HTML+CSS into PDF, and is written in pure Python. It can be used from the command line or programmatically.
It is based on various libraries but not on a full rendering engine like Blink, Gecko or WebKit. The CSS layout engine is written in Python, designed for pagination, and meant to be easy to hack on.
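A minimal sketch of the programmatic route, assuming the `weasyprint` package is installed. `paged_html` and `write_pdf` are hypothetical helper names; `HTML(string=...).write_pdf(...)` is the library's documented entry point. The page size and margins are set with a CSS `@page` rule, which is where the pagination-oriented layout engine comes in.

```python
def paged_html(body, page_size="A4", margin="2cm"):
    """Wrap an HTML fragment in a document with a CSS @page rule."""
    return (
        "<!DOCTYPE html><html><head><style>"
        f"@page {{ size: {page_size}; margin: {margin}; }}"
        "</style></head>"
        f"<body>{body}</body></html>"
    )


def write_pdf(body, target):
    """Render the fragment to a PDF file at target."""
    # Imported here so paged_html can be used without WeasyPrint installed.
    from weasyprint import HTML

    HTML(string=paged_html(body)).write_pdf(target)
```

For example, `write_pdf("<h1>Hello</h1><p>Paged media.</p>", "hello.pdf")`.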
12.2 SVGlib
svglib provides a pure Python library that can convert SVG to PDF, and a command line utility for same, svg2pdf. Thus one can, e.g. add SVGs to PDFs in reportlab.
12.3 Reportlab
Apropos that, reportlab is the famed monstrous classic way of programmatically generating PDFs from code. It includes a modicum of typesetting. It doesn’t edit PDFs so much, but it generates them pretty well. Its integration with other things is often weak, as you will discover if you thought that inserting LaTeX equations, HTML snippets etc. would be simple. On the other hand it has fancy features such as its own chart generation library. On the third hand, there are better, more widely supported charting libraries that it doesn’t use. Litmus test: Use it if the following feels to you like a natural way to print two columns:
from reportlab.platypus import (
    BaseDocTemplate,
    Frame,
    Paragraph,
    PageTemplate,
)
from reportlab.lib.styles import getSampleStyleSheet
import random

words = (
    "lorem ipsum dolor sit amet consetetur "
    "sadipscing elitr sed diam nonumy eirmod "
    "tempor invidunt ut labore et").split()

styles = getSampleStyleSheet()
Elements = []

doc = BaseDocTemplate(
    'basedoc.pdf',
    showBoundary=1)

# Two columns, each half the page width less 6 points, leaving a 12-point gutter
frame1 = Frame(
    doc.leftMargin,
    doc.bottomMargin,
    doc.width / 2 - 6,
    doc.height,
    id='col1')
frame2 = Frame(
    doc.leftMargin + doc.width / 2 + 6,
    doc.bottomMargin,
    doc.width / 2 - 6,
    doc.height,
    id='col2')

# One long paragraph of random filler words, enough to overflow a column
Elements.append(
    Paragraph(
        " ".join([random.choice(words) for i in range(1000)]),
        styles['Normal']))

doc.addPageTemplates([
    PageTemplate(id='TwoCol', frames=[frame1, frame2]),
])

# Start the construction of the PDF
doc.build(Elements)
12.4 pdfrw
pdfrw is a Python library and utility that reads and writes PDF files:
- Operations include subsetting, merging, rotating, modifying metadata, etc. […]
- Can be used either standalone, or in conjunction with reportlab to reuse existing PDFs in new ones
Here is a gentle HOWTO. You can use it to put matplotlib plots in reportlab PDFs, getting the best of two bad worlds.
12.5 Scribus
Scribus is a reasonable open-source desktop publishing tool. If your content is not amenable to automatic layout, it is a good choice, e.g. for posters. It includes a Python API, albeit a reputedly quirky one, which is AFAICT Python 2. For all that, it’s a simple and interactive way of generating PDFs programmatically, so might be worth it.