Manipulating PDF files in Python

Posted on Tue 23 December 2025 in Software

I recently had to interleave pages from two PDF files, reversing the order of the pages in the second file. This came from scanning a pile of pages without a duplexer.

Preview on Mac is a nice little tool for doing simple manipulations on images and PDF files, but it doesn't quite have this feature. Fortunately, the pypdf library (docs) was exactly what I needed.

Here's some code for reading a PDF file into a list of pages, and writing out a list of pages into a PDF file (thanks ChatGPT):

from pypdf import PdfReader, PdfWriter


def load_pages(path):
    reader = PdfReader(path)
    return list(reader.pages)


def write_pdf(pages, output_path):
    writer = PdfWriter()
    for page in pages:
        writer.add_page(page)
    with open(output_path, "wb") as f:
        writer.write(f)

Usage:

pages_obs = load_pages("Obverse.pdf")
pages_rev = load_pages("Reverse.pdf")

# Reverse the second file, and then interleave the pages
combined = []
for o, r in zip(pages_obs, reversed(pages_rev)):
    combined.append(o)
    combined.append(r)

write_pdf(combined, "Combined.pdf")
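
One caveat with zip: it stops at the shorter input, so a sheet missed in either scan would silently drop pages from the end. Here is a minimal sketch of a stricter interleave, using plain strings as stand-ins for pages. The interleave_reversed helper is my own name, not part of pypdf:

```python
def interleave_reversed(fronts, backs):
    """Interleave fronts with backs taken in reverse order.

    Raises ValueError if the two sequences differ in length, rather
    than silently truncating as a bare zip() would.
    """
    if len(fronts) != len(backs):
        raise ValueError(
            f"page count mismatch: {len(fronts)} fronts vs {len(backs)} backs"
        )
    combined = []
    for front, back in zip(fronts, reversed(backs)):
        combined.append(front)
        combined.append(back)
    return combined


# Stand-in data: the backs were scanned last-sheet-first.
fronts = ["p1", "p3", "p5"]
backs = ["p6", "p4", "p2"]
print(interleave_reversed(fronts, backs))
# ['p1', 'p2', 'p3', 'p4', 'p5', 'p6']
```

On Python 3.10 or later, zip(..., strict=True) gives a similar check without the explicit length comparison.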

Job done. As ever, though, there is more going on under the hood...

Playing with this

One-liner

That rather ugly for loop can be turned into a one-liner:

from itertools import chain

combined = chain(*zip(pages_obs, reversed(pages_rev)))
write_pdf(combined, "Combined.pdf")
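
A small caveat about that one-liner: the * unpacks the zip object into separate arguments, which runs the whole zip at the point where chain is called. With plain lists that makes no practical difference, but itertools.chain.from_iterable does the same job without consuming its input up front. A quick sketch with a noisy generator (noisy is just an illustrative helper of mine):

```python
from itertools import chain

calls = []


def noisy(items):
    """Yield items one by one, recording each access in `calls`."""
    for item in items:
        calls.append(item)
        yield item


eager = chain(*zip(noisy("ab"), noisy("xy")))
print(calls)  # ['a', 'x', 'b', 'y'] (everything already consumed)

calls.clear()
lazy = chain.from_iterable(zip(noisy("ab"), noisy("xy")))
print(calls)  # [] (nothing consumed yet)
print(list(lazy))  # ['a', 'x', 'b', 'y']
```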

Note that reversed can't reverse generic iterables (which might be infinite); its input must either define __reversed__ or behave like a Sequence – that is, it must support __getitem__ for integer indexing and __len__ so that it has a finite length.
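
To see this in action, here is a small sketch: reversed rejects a generator, but accepts any object with __len__ and __getitem__, even one that isn't a list. (Countdown is a made-up class for illustration.)

```python
class Countdown:
    """A minimal sequence-like object: just __len__ and __getitem__."""

    def __init__(self, n):
        self.n = n

    def __len__(self):
        return self.n

    def __getitem__(self, index):
        if not 0 <= index < self.n:
            raise IndexError(index)
        return self.n - index  # items are n, n-1, ..., 1


print(list(reversed(Countdown(3))))  # [1, 2, 3]

try:
    reversed(x for x in range(3))  # a generator has no length
except TypeError as exc:
    print("TypeError:", exc)
```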

Virtual lists

Happily, this is what PdfReader.pages returns. Specifically, it returns a _VirtualList (src) that emulates the interface of a list but accesses/loads items "lazily".

Thus the list(...) call in load_pages can be omitted:

from collections.abc import Sequence
from pypdf import PageObject, PdfReader

def load_pages(path) -> Sequence[PageObject]:
    reader = PdfReader(path)
    return reader.pages  # NOT list(...)

returning the _VirtualList object, without iterating through and loading all the items.
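
To get a feel for how a list-like object can defer its work, here is a rough sketch built on collections.abc.Sequence. This is not pypdf's implementation, just an illustration of the idea, and LazySquares is a made-up class. Defining __len__ and __getitem__ is enough: the Sequence base class supplies iteration and reversed on top of them.

```python
from collections.abc import Sequence


class LazySquares(Sequence):
    """Looks like the list [0, 1, 4, ...] but computes items on demand."""

    def __init__(self, length):
        self._length = length
        self.computed = []  # record of which indices were actually computed

    def __len__(self):
        return self._length

    def __getitem__(self, index):
        if not isinstance(index, int):
            raise TypeError("only integer indices are supported in this sketch")
        if index < 0:
            index += self._length
        if not 0 <= index < self._length:
            raise IndexError("index out of range")
        self.computed.append(index)
        return index * index


squares = LazySquares(1_000_000)
print(len(squares))      # 1000000, and nothing has been computed yet
print(squares[3])        # 9
print(squares[-1])       # 999998000001
print(squares.computed)  # [3, 999999]
```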

Let's play with _VirtualList a little more.

from pypdf._page import _VirtualList
help(_VirtualList)

We learn that the __init__ function has this signature:

class _VirtualList(Sequence[PageObject]):
    def __init__(
        self,
        length_function: Callable[[], int],
        get_function: Callable[[int], PageObject],
    ) -> None: ...

Let's provide a get_function like this:

def gf(x):
    print("calculating", x)
    return x

The print is there so that we get a message each time gf is called.

Create our _VirtualList:

from funcy import constantly

vl = _VirtualList(
    length_function=constantly(3),
    get_function=gf,
)

vl
# <pypdf._page._VirtualList object at 0x110e8fba0>


reversed(vl)
# <generator object Sequence.__reversed__ at 0x10ff7c4f0>

Observe that calling reversed(vl) only prepares the iterator; it doesn't actually invoke gf.

Neither does zip:

z = zip(vl, reversed(vl))
z
# <zip at 0x11104bd80>

It is only when you iterate through the zip object that any calculations are performed:

for t in z:
    print(t)

# calculating 0
# calculating 2
# (0, 2)
# calculating 1
# calculating 1
# (1, 1)
# calculating 2
# calculating 0
# (2, 0)

Sloth is a virtue

A virtual list has the advantage of being "lazy" – that is, of delaying any actual computation until you explicitly need it and ask for it.

So here's the revised code:

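from collections.abc import Sequence
from itertools import chain

from pypdf import PageObject, PdfReader, PdfWriter


def load_pages(path) -> Sequence[PageObject]:
    reader = PdfReader(path)
    return reader.pages  # NOT list(...)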


def write_pdf(pages, output_path):
    writer = PdfWriter()
    for page in pages:  # actual loop
        writer.add_page(page)
    with open(output_path, "wb") as f:
        writer.write(f)


pages_obs = load_pages("Obverse.pdf")  # lazy
pages_rev = load_pages("Reverse.pdf")  # lazy

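combined = chain.from_iterable(zip(pages_obs, reversed(pages_rev)))  # lazy
write_pdf(combined, "Combined.pdf")

In this version, any actual loading of the pages is delayed until the call to write_pdf – specifically, until the for page in pages causes the pages to be iterated through. Note the switch from the earlier chain(*zip(...)) to chain.from_iterable(zip(...)): the * would unpack the zip object on the spot, fetching every page on that line rather than inside write_pdf.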

In a simple script like this, there is no real difference, but in a more complex setting it can be advantageous to delay loading the pages, or at least to have control over when it happens. If there are many pages, or if page objects are large, then we might not want to carry them around in memory until we actually need them.

For example, if it were necessary to do further processing on each page before writing it to the output, then we could do something like this:

def write_pdf(pages, output_path):
    writer = PdfWriter()
    for page in pages:
        p = process(page)  # whatever this might be
        writer.add_page(p)
    with open(output_path, "wb") as f:
        writer.write(f)

and continue to load and work on one page at a time, rather than having to load them all.
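
To make the one-at-a-time behaviour visible, here is a stand-in sketch with no PDF involved: a generator plays the role of the lazy page source, and a log records the order of events. The names load, process and write_all are placeholders of mine, not pypdf functions.

```python
log = []


def load(n):
    """Lazily yield n fake 'pages', logging each load."""
    for i in range(n):
        log.append(f"load {i}")
        yield f"page{i}"


def process(page):
    """Placeholder for real per-page work."""
    log.append(f"process {page}")
    return page.upper()


def write_all(pages):
    """Consume pages one at a time, as the loop in write_pdf does."""
    out = []
    for page in pages:
        out.append(process(page))
        log.append(f"write {page}")
    return out


result = write_all(load(2))
print(result)  # ['PAGE0', 'PAGE1']
for entry in log:
    print(entry)
# load 0
# process page0
# write page0
# load 1
# process page1
# write page1
```

In this sketch each page is loaded, processed and handed on before the next one is touched.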