Introduction to Python PyPDF2 Library

PyPDF2 is a Python library for working with PDF files. It allows us to read, manipulate, and extract information from PDFs without the need for complex software. Using PyPDF2, we can split a single PDF into multiple files, merge multiple PDFs into one, extract text, rotate pages, and even add watermarks. In this article, we are going to cover the most commonly used features of the PyPDF2 library.

What is PyPDF2?

PyPDF2 is most useful when we have to deal with large documents. Suppose we have a large PDF document and only need to send a few pages to someone. Instead of extracting those pages manually, we can do it in a few lines of code. The library also lets us combine multiple PDF files into one, and it covers reading, extracting text, merging, splitting, rotating, and encrypting or decrypting PDF files.
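The "send only a few pages" use case could be sketched roughly as follows, assuming PyPDF2 is already installed (installation is covered in the next section). The helper names parse_page_ranges and extract_pages, and the "1-3,7" range syntax, are illustrative choices and not part of PyPDF2 itself.

```python
def parse_page_ranges(spec, total_pages):
    """Turn a 1-based spec such as "1-3,7" into sorted 0-based page indexes."""
    indexes = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        start, _, end = part.partition("-")
        first = int(start)
        last = int(end) if end else first
        if first < 1 or last > total_pages or first > last:
            raise ValueError(f"Page range {part!r} is outside 1-{total_pages}")
        indexes.update(range(first - 1, last))
    return sorted(indexes)


def extract_pages(source_path, output_path, spec):
    """Copy the pages named by spec from source_path into a new PDF."""
    # Imported inside the function so the helper above works without PyPDF2.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(source_path)
    writer = PdfWriter()
    for index in parse_page_ranges(spec, len(reader.pages)):
        writer.add_page(reader.pages[index])
    with open(output_path, "wb") as output_file:
        writer.write(output_file)
```

For example, extract_pages("report.pdf", "excerpt.pdf", "2-4") would write pages 2 to 4 of a hypothetical report.pdf into excerpt.pdf. Pages are always written in ascending order, whatever order the spec lists them in.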

Installing PyPDF2 via pip

We have to install PyPDF2 before using it. We can install it with pip by opening a command prompt or terminal and running the following command:

pip install PyPDF2
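To confirm that the installation worked, one option is to ask Python which version of the package it can see. The sketch below uses only the standard library (Python 3.8 or later); installed_version is a hypothetical helper name.

```python
from importlib import metadata


def installed_version(distribution_name):
    """Return the installed version string, or None if the package is missing."""
    try:
        return metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return None


version = installed_version("PyPDF2")
if version is None:
    print("PyPDF2 is not installed in this environment.")
else:
    print(f"PyPDF2 {version} is installed.")
```

If the script reports that PyPDF2 is missing right after a successful install, pip and the interpreter running the script probably belong to different Python environments.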

Basic Concepts

Before looking at PyPDF2's features, let's go over a few key concepts:

PDF Structure: A PDF file consists of objects like text, images, metadata, and page structure.
Pages: A PDF file contains one or more pages, and each page can be manipulated individually.
Metadata: PDFs can carry information such as the author, title, and creation date.
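These concepts map fairly directly onto attributes of a PyPDF2 reader object. The sketch below assumes a recent PyPDF2 release in which pages expose a mediabox with width and height (measured in points, 1/72 inch) and in which reader.metadata may be None. describe_pdf is a hypothetical helper name.

```python
def describe_pdf(reader):
    """Summarize a PDF: page count, per-page size in points, and title."""
    sizes = [
        (float(page.mediabox.width), float(page.mediabox.height))
        for page in reader.pages
    ]
    # metadata is None when the file has no document information dictionary.
    metadata = reader.metadata
    return {
        "page_count": len(sizes),
        "page_sizes": sizes,
        "title": metadata.title if metadata is not None else None,
    }


# Usage, assuming example.pdf exists next to the script:
#   from PyPDF2 import PdfReader
#   print(describe_pdf(PdfReader("example.pdf")))
```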

Key Features of PyPDF2

Some of the key features of PyPDF2 are given below:

It is used for reading PDF files.
It is used for extracting text and metadata.
It is used for merging, splitting, and rotating pages.
It is used for encrypting and decrypting PDF files.
It is also used for adding watermarks and modifying PDF content.

Working with PDF Files

1. Reading PDF Files

To read a PDF file, we first open it with PyPDF2. Let's say we have a PDF named example.pdf.

Screenshot: A simple PDF file

Here is how we can read a PDF using PyPDF2:

Python

import PyPDF2

# Open a PDF file
with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    # Get the total number of pages
    total_pages = len(reader.pages)
    print(f"Total pages: {total_pages}")

    # Read the content of the first page
    first_page = reader.pages[0]
    text = first_page.extract_text()
    print(text)

Output:

Screenshot: Reading PDF content using PyPDF2

2. Extracting Text from PDF Files

We can extract text from every page of a PDF file by calling the extract_text() method on each page. This is useful for parsing large documents.

Python

import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    for page in reader.pages:
        print(page.extract_text())

Output:

Screenshot: Extracting text from a PDF using PyPDF2
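When the text is going to be processed further, it can help to collect it into a single string instead of printing it page by page. PyPDF2 does not perform OCR, so pages that contain only scanned images typically yield empty text. The sketch below marks such pages explicitly. collect_text is a hypothetical helper name.

```python
def collect_text(reader):
    """Join the text of every page, marking pages with no extractable text."""
    chunks = []
    for number, page in enumerate(reader.pages, start=1):
        text = (page.extract_text() or "").strip()
        if not text:
            text = "[no extractable text]"
        chunks.append(f"--- Page {number} ---\n{text}")
    return "\n\n".join(chunks)


# Usage, assuming example.pdf exists next to the script:
#   from PyPDF2 import PdfReader
#   with open("example.txt", "w", encoding="utf-8") as text_file:
#       text_file.write(collect_text(PdfReader("example.pdf")))
```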

3. Extracting Metadata from PDF Files

PyPDF2 allows us to extract metadata such as the author, title, and creation date:

Python

import PyPDF2

with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    
    # metadata is None when the PDF has no document information
    metadata = reader.metadata
    if metadata is not None:
        print(f"Author: {metadata.author}")
        print(f"Title: {metadata.title}")
        print(f"Creation Date: {metadata.creation_date}")

Output:

Screenshot: Extracting metadata from a PDF using PyPDF2
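Metadata can also be written. A minimal sketch, assuming PdfWriter.add_metadata() accepts a dictionary keyed by PDF info names such as "/Title" and "/Author". build_info and copy_with_metadata are hypothetical helper names.

```python
def build_info(title=None, author=None, subject=None):
    """Build a PDF info dictionary, skipping fields that were not given."""
    fields = {"/Title": title, "/Author": author, "/Subject": subject}
    return {key: value for key, value in fields.items() if value is not None}


def copy_with_metadata(source_path, output_path, **fields):
    """Copy every page of source_path and attach the given metadata."""
    # Imported inside the function so build_info works without PyPDF2.
    from PyPDF2 import PdfReader, PdfWriter

    reader = PdfReader(source_path)
    writer = PdfWriter()
    for page in reader.pages:
        writer.add_page(page)
    writer.add_metadata(build_info(**fields))
    with open(output_path, "wb") as output_file:
        writer.write(output_file)
```

Calling copy_with_metadata("example.pdf", "example_tagged.pdf", title="Sample", author="Jane Doe") would write a copy whose title and author fields are set; the output file name and the values here are placeholders.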

Manipulating PDF Files using PyPDF2

Beyond reading, PyPDF2 lets us modify the PDFs we have. Let's look at a few ways to manipulate them.

1. Merging Multiple PDF Files

We can merge multiple PDF files into one using PyPDF2's PdfWriter class. Let's say we have another PDF file named example2.pdf.

example2.pdf

Merge example.pdf and example2.pdf:

Python

from PyPDF2 import PdfReader, PdfWriter

pdf_writer = PdfWriter()

# Add PDFs to merge
for pdf_file in ['example.pdf', 'example2.pdf']:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        pdf_writer.add_page(page)

# Save the merged PDF
with open('merged.pdf', 'wb') as output_file:
    pdf_writer.write(output_file)

This produces a merged.pdf file containing the pages of both inputs.

Screenshot: merged.pdf

2. Splitting PDF Files into Individual Pages

If we want to split a PDF into separate pages, PyPDF2 makes this easy. Let's split the merged.pdf file.

Python

import PyPDF2

with open('merged.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)

    for i, page in enumerate(reader.pages):
        writer = PyPDF2.PdfWriter()
        writer.add_page(page)

        # Save each page as a new file
        with open(f'page_{i+1}.pdf', 'wb') as output_pdf:
            writer.write(output_pdf)

Output:

Screenshot: Splitting a PDF into multiple PDFs using PyPDF2

page_1.pdf and page_2.pdf will contain the first and second pages of merged.pdf, respectively.
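Rotating pages, listed among the key features, follows the same reader and writer pattern. The sketch below assumes a recent PyPDF2 release in which page objects provide rotate(), which turns a page clockwise in multiples of 90 degrees (older releases used different method names). rotate_pages is a hypothetical helper name.

```python
def rotate_pages(reader, writer, angle=90):
    """Add every page of reader to writer, rotated clockwise by angle degrees."""
    if angle % 90 != 0:
        raise ValueError("PDF pages can only be rotated in multiples of 90 degrees")
    for page in reader.pages:
        page.rotate(angle)
        writer.add_page(page)
    return writer


# Usage, assuming merged.pdf exists next to the script:
#   from PyPDF2 import PdfReader, PdfWriter
#   writer = rotate_pages(PdfReader("merged.pdf"), PdfWriter(), angle=90)
#   with open("rotated.pdf", "wb") as output_pdf:
#       writer.write(output_pdf)
```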

3. Adding Watermarks to PDF Files

We can also add a watermark to a PDF file. We need another PDF file containing the watermark (such as a logo or text), which we overlay on each page of our main PDF file.

Screenshot: watermark.pdf

The following Python program adds the watermark to a PDF using PyPDF2:

Python

import PyPDF2

with open('example.pdf', 'rb') as main_pdf, open('watermark.pdf', 'rb') as watermark_pdf:
    reader = PyPDF2.PdfReader(main_pdf)
    watermark_reader = PyPDF2.PdfReader(watermark_pdf)

    writer = PyPDF2.PdfWriter()
    watermark_page = watermark_reader.pages[0]

    for page in reader.pages:
        page.merge_page(watermark_page)
        writer.add_page(page)

    # Save the watermarked PDF
    with open('watermarked.pdf', 'wb') as output_pdf:
        writer.write(output_pdf)

Screenshot: watermarked.pdf

4. Encrypting and Decrypting PDF Files

We can also password-protect our PDF files using encryption.

Below is an example:

Python

import PyPDF2

writer = PyPDF2.PdfWriter()

# Add pages to encrypt
with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)

# Encrypt the PDF with a password
writer.encrypt('password123')

# Save the encrypted PDF
with open('encrypted.pdf', 'wb') as output_pdf:
    writer.write(output_pdf)

Output:

When we try to open encrypted.pdf, we are asked to enter the password:

Screenshot: Enter the password to view the PDF
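Reading the encrypted file back from Python works the other way around. A minimal sketch, assuming the reader exposes is_encrypted and a decrypt() method that returns a falsy value (0) when the password is rejected. unlock is a hypothetical helper name.

```python
def unlock(reader, password):
    """Decrypt reader in place; return True if it is readable afterwards."""
    if not reader.is_encrypted:
        return True
    # decrypt() returns a falsy value (0) when the password is rejected.
    return bool(reader.decrypt(password))


# Usage, assuming encrypted.pdf was created with the password above:
#   from PyPDF2 import PdfReader
#   reader = PdfReader("encrypted.pdf")
#   if unlock(reader, "password123"):
#       print(reader.pages[0].extract_text())
#   else:
#       print("Wrong password")
```

Once the reader is unlocked, its pages can be added to a new PdfWriter and written out without calling encrypt(), which gives a decrypted copy of the file.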

Common Use Cases

We can automate generating PDF reports by extracting or merging data.
We can also combine PyPDF2 with other Python libraries like matplotlib or PIL for more advanced PDF generation.
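As one illustration of the first use case, a script could merge every PDF found in a folder into a single report. This is only a sketch: find_pdfs and merge_folder are hypothetical helper names, and merging in file-name order is an assumption about what the report should look like.

```python
from pathlib import Path


def find_pdfs(folder):
    """Return the PDF files directly inside folder, sorted by name."""
    return sorted(
        path for path in Path(folder).iterdir()
        if path.is_file() and path.suffix.lower() == ".pdf"
    )


def merge_folder(folder, output_path):
    """Merge every PDF in folder, in name order, into output_path."""
    # Imported inside the function so find_pdfs works without PyPDF2.
    from PyPDF2 import PdfReader, PdfWriter

    writer = PdfWriter()
    for pdf_path in find_pdfs(folder):
        for page in PdfReader(str(pdf_path)).pages:
            writer.add_page(page)
    with open(output_path, "wb") as output_file:
        writer.write(output_file)
```

Writing the output outside the input folder avoids picking up a previous report as an input on the next run.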

Conclusion

PyPDF2 is a simple yet powerful library for working with PDFs in Python. By following the steps above, we can read, merge, split, watermark, and encrypt PDF files, and then explore further to discover the other features PyPDF2 provides.
