Portable Document Format, the abstruse and inconvenient format beloved of academics, bureaucracies and Adobe. It has the notable feature of being a better format than Microsoft Word, in much the same way that sticking your hand in a blender is better than sticking your hand in a woodchipper.
PDF readers
See PDF readers.
Extracting data
Tabula is a tool for liberating data tables locked inside PDF files.
pdfplumber also exists but I have not used it.
Camelot is an OpenCV-backed table extractor. It has a browser-based GUI, Excalibur.
pip install excalibur-py
There are both open (Tabula, pdfplumber) and closed-source tools that are widely used to extract data tables from PDFs. They either give a nice output or fail miserably. There is no in between. This is not helpful since everything in the real world, including PDF table extraction, is fuzzy.
tl;dr Tabula if you want the easiest possible experience at the cost of some power, otherwise Camelot/Excalibur.
camelot -o table.csv -f csv lattice file.pdf
Commercial online PDF tool smallpdf claims to do this. (USD12/month with free trial.)
Command line tips
Search
Shawn Graham suggests a pdftotext hack:
## dependencies
conda create -n envname python=3.7
conda activate envname
conda config --add channels conda-forge
conda install poppler
## Do it
find /MLR -name '*.pdf' -exec sh -c \
'pdftotext "$1" - | grep --with-filename --label="$1" --color "SEARCHTERM"' \
sh {} \;
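The same search can be sketched in Python, which is handy if the matches feed into further processing. This is a minimal sketch, assuming the pdftotext binary from poppler is on the PATH; the function names matching_lines and search_pdfs are made up for illustration.

```python
import subprocess
from pathlib import Path


def matching_lines(text, term):
    """Return (line_number, line) pairs for lines of text containing term."""
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if term in line
    ]


def search_pdfs(root, term):
    """Yield (pdf_path, line_number, line) for each match under root.

    Assumption: the pdftotext command from poppler is installed.
    """
    for pdf in sorted(Path(root).rglob("*.pdf")):
        # "-" asks pdftotext to write the extracted text to stdout.
        result = subprocess.run(
            ["pdftotext", str(pdf), "-"],
            capture_output=True, text=True, errors="replace", check=False,
        )
        for number, line in matching_lines(result.stdout, term):
            yield pdf, number, line
```

Something like `for hit in search_pdfs("/MLR", "SEARCHTERM"): print(*hit)` would then list the hits. Line numbers refer to the extracted text, not to anything visible in the PDF.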
Reduce bloat
gs -sDEVICE=pdfwrite -dCompatibilityLevel=1.5 \
-dPDFSETTINGS=/ebook \
-dNOPAUSE -dQUIET -dBATCH \
-sOutputFile=output.pdf input.pdf
or, wrapped up into a nice little script, ShrinkPDF (90 is the DPI here):
./shrinkpdf.sh in.pdf out.pdf 90
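For batch jobs, the Ghostscript invocation can also be assembled from Python. The sketch below only reuses the switches shown in the gs command earlier in this section; shrink_command and shrink are hypothetical helper names, and gs is assumed to be on the PATH.

```python
import subprocess


def shrink_command(src, dst, preset="ebook"):
    """Build the Ghostscript argument list for recompressing a PDF.

    preset is one of Ghostscript's PDFSETTINGS names, e.g. "screen",
    "ebook", "printer" or "prepress".
    """
    return [
        "gs",
        "-sDEVICE=pdfwrite",
        "-dCompatibilityLevel=1.5",
        f"-dPDFSETTINGS=/{preset}",
        "-dNOPAUSE", "-dQUIET", "-dBATCH",
        f"-sOutputFile={dst}",
        str(src),
    ]


def shrink(src, dst, preset="ebook"):
    """Run Ghostscript; raises CalledProcessError if gs reports failure."""
    subprocess.run(shrink_command(src, dst, preset), check=True)
```

Building the argument list separately from running it makes it easy to print the command and check it before letting it loose on a directory of files.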
There is also cpdf and the GUI version Densify.
Commercial online PDF tool smallpdf claims to do this. (USD12/month with free trial.)
OCRMyPDF makes a scanned PDF possibly-searchable and also optimizes the size, optionally aggressively. This will not downsample, but it will get better monochrome compression than normal.
A peer recommendation: PDF Resizer - PDF Tools
Concatenate/split
This Ghostscript command concatenates PDFs:
gs -dBATCH -dNOPAUSE -q -sDEVICE=pdfwrite \
-dPDFSETTINGS=/prepress -sOutputFile=output.pdf input*.pdf
You can also split PDFs with Ghostscript, but usually you want a GUI to see what you are splitting, no?
PDFMix and PDF shuffler have both been recommended to me for this.
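If a command line will do after all, Ghostscript's -dFirstPage and -dLastPage switches select a page range. A minimal sketch of building such a command from Python follows; extract_pages_command is a hypothetical name and the file names are placeholders.

```python
def extract_pages_command(src, dst, first, last):
    """Build a Ghostscript command that copies pages first..last to dst."""
    if first < 1 or last < first:
        raise ValueError("need 1 <= first <= last")
    return [
        "gs", "-q", "-dBATCH", "-dNOPAUSE",
        "-sDEVICE=pdfwrite",
        f"-dFirstPage={first}",   # first page to keep (1-based)
        f"-dLastPage={last}",     # last page to keep (inclusive)
        f"-sOutputFile={dst}",
        str(src),
    ]
```

Passing the resulting list to subprocess.run(..., check=True) would perform the split, assuming gs is installed.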
Antoine Chambert-Loir points out that the underdocumented command texexec has many PDF-editing sub-commands.
The following command extracts pages 1 to 5, page 7 and pages 8 to 12 from file.pdf and puts them in outputfile.pdf:
texexec --pdfselect --select=1:5,7,8:12 --result=outputfile.pdf file.pdf
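To make the selection syntax concrete, here is a small Python sketch that expands a string like the one above into an explicit page list. It illustrates the notation as used in this example only; it is not texexec's own parser, and expand_selection is a made-up name.

```python
def expand_selection(spec):
    """Expand a selection such as "1:5,7,8:12" into a list of page numbers."""
    pages = []
    for part in spec.split(","):
        if ":" in part:
            # "a:b" means every page from a to b inclusive.
            start, end = (int(piece) for piece in part.split(":"))
            pages.extend(range(start, end + 1))
        else:
            pages.append(int(part))
    return pages
```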
PDF to SVG
pdf2svg generates editable vector diagrams from the PDF.
HTML to PDF
See weasyprint, below.
Diffing PDFs
The use case here is, they say, a (presumably scientific) review. “You reviewed version A of a paper, and receive version B, and wonder what the changes are.” The tool is pdfdiff.
Books, booklets and binding
The professional word for laying out the book correctly for the type of binding you intend is, apparently, imposition. There are many expensive professional tools to do this and some scrappy free alternatives.
pdfbooklet (Linux/Windows) is one:
PdfBooklet is a Python script whose first purpose was to create booklet(s) from existing pdf files. It has been extended to many other functions in pdf pages manipulation.
Featuring
- Multiple booklets
- Add blank pages at the beginning or the end
- Adjust scale and margin.
jPDF is also recommended in some circles.
Honourable mention also to impositioner, which seems to be full-featured despite being a short command-line script. bookletimpose/pdfimposer is a Python GTK GUI/command-line combo imposer. pdfimpose is a multiple PDF script whatsit.
Commercial options include Devalipi Imposition Studio and Montax Imposer.
Crop marks
There are at least two options. Neither makes it clear which of TrimBox, BleedBox, CropBox or ArtBox is what I truly want. This might clarify it slightly, but I lost focus around here.
Method A
I can add crop marks to a PDF document with different PDF tools, e.g. pdftk:
- Export the first page with crop marks to a PDF file (your_cropmark.pdf)
- Join it with your PDF document (your_document.pdf) in the command line:
pdftk your_document.pdf multistamp your_cropmark.pdf output result.pdf
Method B
I can set PDF cropping values with GhostScript for printing.
Create a plain text file with the right cropping values (e.g. this is a 5mm crop of A4, in PostScript points):
[/CropBox [14.17 14.17 581.1 827.72] /PAGES pdfmark
Alternatively, use the command line
gs -c "[/CropBox [14.17 14.17 581.1 827.72] /PAGES pdfmark" \
Now, convert your_document.pdf using the previous file (which I called pdfmark.txt):
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
$OPTIONS \
-c .setpdfwrite \
-sOutputFile=result.pdf \
-f your_document.pdf pdfmark.txt
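The CropBox numbers are in PostScript points (1/72 inch), which is why a 5mm margin on A4 turns into such unfriendly values. A short Python sketch of the arithmetic, assuming A4 is 210 by 297 mm; crop_box and crop_pdfmark are hypothetical helper names.

```python
MM_TO_PT = 72 / 25.4  # one millimetre expressed in PostScript points


def crop_box(width_mm, height_mm, margin_mm):
    """Return (left, bottom, right, top) in points for a uniform margin."""
    margin = margin_mm * MM_TO_PT
    box = (
        margin,
        margin,
        width_mm * MM_TO_PT - margin,
        height_mm * MM_TO_PT - margin,
    )
    return tuple(round(value, 2) for value in box)


def crop_pdfmark(box):
    """Format a box as the pdfmark line used in the text file above."""
    numbers = " ".join(str(value) for value in box)
    return f"[/CropBox [{numbers}] /PAGES pdfmark"
```

Calling crop_pdfmark(crop_box(210, 297, 5)) reproduces the line used for the 5mm A4 crop; other page sizes or margins only need different arguments.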
Color conversion
Nightmares. Colour management is generally complicated. Ghostscript colour management specifically is complicated and has many moving parts, some of them moving rapidly; e.g. the -dUseCIEColor option was removed in Ghostscript 9, because it is apparently a broken noob feature. Its replacement is broken documentation.
I am aware this is a complicated and nuanced area with much special labour involved. But I do not care. If I am working on a project with a graphic designer then they can do this with their skill and training, but for me, I just want a document which prints adequately with some vague approximation of the colours of the screen and no errors. That means, changing to CMYK, or not. No other alterations considered.
To CMYK
CMYK Color conversion of RGB PDF with GhostScript:
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
-sColorConversionStrategy=CMYK \
-sColorConversionStrategyForImages=CMYK \
-dProcessColorModel=/DeviceCMYK \
-dCompatibilityLevel=1.5 \
-sOutputFile=result_cmyk.pdf \
your_document.pdf
See also a PDF to TIFF example.
To Greyscale
gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite \
-sColorConversionStrategy=Gray \
-dProcessColorModel=/DeviceGray \
-dCompatibilityLevel=1.5 \
-sOutputFile=result_gray.pdf \
your_document.pdf
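Since the CMYK and greyscale commands differ only in the target colour space, a small Python wrapper can cover both. This is a sketch that uses only the switches common to the two commands above (it leaves out -sColorConversionStrategyForImages); colour_command is a made-up name and gs is assumed to be installed.

```python
# Maps the ColorConversionStrategy value to the matching ProcessColorModel.
COLOUR_MODELS = {"CMYK": "DeviceCMYK", "Gray": "DeviceGray"}


def colour_command(src, dst, target="CMYK"):
    """Build a Ghostscript command converting src to the target colour space."""
    if target not in COLOUR_MODELS:
        raise ValueError(f"target must be one of {sorted(COLOUR_MODELS)}")
    return [
        "gs", "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        f"-sColorConversionStrategy={target}",
        f"-dProcessColorModel=/{COLOUR_MODELS[target]}",
        "-dCompatibilityLevel=1.5",
        f"-sOutputFile={dst}",
        str(src),
    ]
```

The list can be handed to subprocess.run(..., check=True) to do the conversion.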
Programmatic editing and generation
So many of these.
Weasyprint
Weasyprint seems the cleanest. It converts HTML+CSS into PDF, and is written in pure Python. It can be used from the command line or programmatically.
It is based on various libraries but not on a full rendering engine like Blink, Gecko or WebKit. The CSS layout engine is written in Python, designed for pagination, and meant to be easy to hack on.
pip install weasyprint
weasyprint https://weasyprint.org/ weasyprint.pdf
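Programmatic use looks roughly like the following sketch, which builds a small HTML document as a string and hands it to WeasyPrint's HTML class. build_html and write_pdf are hypothetical helpers; the weasyprint import is deferred so the HTML-building part works even where WeasyPrint is not installed.

```python
from html import escape


def build_html(title, paragraphs):
    """Return a minimal HTML document with a heading and some paragraphs."""
    body = "\n".join(f"<p>{escape(text)}</p>" for text in paragraphs)
    return (
        "<!DOCTYPE html>\n"
        '<html><head><meta charset="utf-8">'
        f"<title>{escape(title)}</title></head>\n"
        f"<body><h1>{escape(title)}</h1>\n{body}\n</body></html>"
    )


def write_pdf(html, path):
    """Render an HTML string to a PDF file (requires weasyprint)."""
    from weasyprint import HTML  # third-party; pip install weasyprint
    HTML(string=html).write_pdf(path)
```

For example, write_pdf(build_html("Report", ["First paragraph."]), "report.pdf") would produce a one-page PDF. Styling is a matter of adding CSS to the HTML.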
SVGlib
svglib provides a pure Python library that can convert SVG to PDF, and a command line utility for same, svg2pdf. Thus one can, e.g., add SVGs to PDFs in reportlab.
Reportlab
Apropos that, reportlab is the famed monstrous classic way of programmatically generating PDFs from code. It includes a modicum of typesetting. It doesn’t edit PDFs so much, but it generates them pretty well. Its integration with other things is often weak; if you thought that inserting LaTeX equations or HTML snippets would be simple, think again. On the other hand, it has fancy features such as its own chart generation library. On the third hand, there are better, more widely supported charting libraries that it doesn’t use. Litmus test: Use it if the following feels to you like a natural way to print two columns:
import random

from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import (
    BaseDocTemplate,
    Frame,
    PageBreak,
    PageTemplate,
    Paragraph,
)

words = (
    "lorem ipsum dolor sit amet consetetur "
    "sadipscing elitr sed diam nonumy eirmod "
    "tempor invidunt ut labore et"
).split()

styles = getSampleStyleSheet()
Elements = []

doc = BaseDocTemplate("basedoc.pdf", showBoundary=1)

# Two columns, each half the page width minus a 6pt gutter
frame1 = Frame(
    doc.leftMargin,
    doc.bottomMargin,
    doc.width / 2 - 6,
    doc.height,
    id="col1",
)
frame2 = Frame(
    doc.leftMargin + doc.width / 2 + 6,
    doc.bottomMargin,
    doc.width / 2 - 6,
    doc.height,
    id="col2",
)

# One long paragraph of random filler text that flows across both columns
Elements.append(
    Paragraph(
        " ".join(random.choice(words) for _ in range(1000)),
        styles["Normal"],
    )
)
doc.addPageTemplates([
    PageTemplate(id="TwoCol", frames=[frame1, frame2]),
])

# Start the construction of the PDF
doc.build(Elements)
pdfrw
pdfrw is a Python library and utility that reads and writes PDF files:
- Operations include subsetting, merging, rotating, modifying metadata, etc. […]
- Can be used either standalone, or in conjunction with reportlab to reuse existing PDFs in new ones
Here is a gentle HOWTO. You can use it to put matplotlib plots in reportlab PDFs, getting the best of two bad worlds.
Scribus
Scribus is a reasonable open-source desktop publishing tool. If your content is not amenable to automatic layout, it is a good choice, e.g. for posters. It includes a Python API, albeit a reputedly quirky one, which is AFAICT Python 2. For all that, it’s a simple and interactive way of generating PDFs programmatically, so might be worth it.
pypdf2
pypdf2 is another alternative Python PDF library that looks messier.
Web page to pdf
wkhtmltopdf and wkhtmltoimage are open source (LGPLv3) command line tools to render HTML into PDF and various image formats using the Qt WebKit rendering engine. These run entirely “headless” and do not require a display or display service.