PyPDF2 is a Python library for working with PDF files. It lets us read, manipulate, and extract information from PDFs without the need for complex software. Using PyPDF2, we can split a single PDF into multiple files, merge multiple PDFs into one, extract text, rotate pages, and even add watermarks. In this article, we are going to cover the most commonly used features of the PyPDF2 library.
What is PyPDF2?
PyPDF2 is especially handy when we have to deal with large documents. Suppose we have a large PDF document and only need to send a few pages to someone. Instead of extracting those pages manually, we can do it in just a few lines of code. We can also use PyPDF2 to combine multiple PDF files into one, and to read, extract text from, split, rotate, encrypt, and decrypt PDF files.
Installing PyPDF2 via pip
We have to install PyPDF2 before using it. We can install it with pip by opening a command prompt or terminal and running the following command:
pip install PyPDF2
Basic Concepts
Now, let's look at some key concepts before exploring the features of PyPDF2:
- PDF Structure: A PDF file consists of objects like text, images, metadata, and page structure.
- Pages: PDF files contain multiple pages, and each page can be manipulated individually.
- Metadata: PDFs contain information such as the author, title, and creation date.
Key Features of PyPDF2
Some of the key features of PyPDF2 are given below:
- It is used for reading PDF files.
- It is used for extracting text and metadata.
- It is used for merging, splitting, and rotating pages.
- It is used for encrypting and decrypting PDF files.
- It is also used for adding watermarks and modifying PDF content.
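Rotation appears in the list above. As a minimal sketch, assuming a PyPDF2 version in which each page object has a rotate() method that takes a multiple of 90 degrees, it could look like the following. The helper name rotate_pages and the file names in the usage comment are hypothetical.

```python
def rotate_pages(pages, writer, angle=90):
    """Rotate every page clockwise by `angle` degrees and add it to `writer`.

    `pages` is any iterable of page objects (for example PdfReader.pages) and
    `writer` is a PdfWriter. Returns the number of pages added.
    """
    if angle % 90 != 0:
        raise ValueError("angle must be a multiple of 90 degrees")
    count = 0
    for page in pages:
        page.rotate(angle)      # changes the page object in place
        writer.add_page(page)
        count += 1
    return count

# Usage with real files (assumes example.pdf exists):
#   from PyPDF2 import PdfReader, PdfWriter
#   reader = PdfReader("example.pdf")
#   writer = PdfWriter()
#   rotate_pages(reader.pages, writer, angle=90)
#   with open("rotated.pdf", "wb") as output_pdf:
#       writer.write(output_pdf)
```

The helper only relies on rotate() and add_page(), so the reading and writing steps stay the same as in the other examples in this article.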
Working with PDF Files
1. Reading PDF Files
If we want to read a PDF file, we first have to open it using PyPDF2. Let's say we have a PDF named example.pdf.
Here is how we can read a PDF using PyPDF2:
import PyPDF2
# Open a PDF file
with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    # Get the total number of pages
    total_pages = len(reader.pages)
    print(f"Total pages: {total_pages}")
    # Read the content of the first page
    first_page = reader.pages[0]
    text = first_page.extract_text()
    print(text)
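Indexing reader.pages with a number past the last page fails, so a small bounds check can help when the page number comes from user input. The sketch below uses a hypothetical helper, page_text, that counts pages from 1 rather than 0.

```python
def page_text(reader, page_number):
    """Return the text of a page, counting pages from 1 instead of 0."""
    total_pages = len(reader.pages)
    if not 1 <= page_number <= total_pages:
        raise ValueError(
            f"page_number must be between 1 and {total_pages}, got {page_number}"
        )
    # extract_text() may give nothing back for image-only pages
    return reader.pages[page_number - 1].extract_text() or ""

# Usage with a real file (assumes example.pdf exists):
#   from PyPDF2 import PdfReader
#   reader = PdfReader("example.pdf")
#   print(page_text(reader, 1))
```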
2. Extracting Text from PDF Files
We can easily extract text from PDF files by calling the extract_text() method on each page. This can be useful for parsing large documents.
import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        print(page.extract_text())
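Printing page by page is fine for a quick look. For parsing, it is often handier to collect the text into one string and to know which pages produced none (scanned pages usually contain only images). A minimal sketch, using a hypothetical helper named collect_text:

```python
def collect_text(pages, separator="\n\n"):
    """Join the text of all pages and report pages that yielded no text.

    Returns a tuple (text, empty_pages), where empty_pages holds the 1-based
    numbers of pages whose extract_text() result was empty or whitespace.
    """
    chunks = []
    empty_pages = []
    for number, page in enumerate(pages, start=1):
        text = (page.extract_text() or "").strip()
        if text:
            chunks.append(text)
        else:
            empty_pages.append(number)
    return separator.join(chunks), empty_pages

# Usage with a real file (assumes example.pdf exists):
#   from PyPDF2 import PdfReader
#   text, empty_pages = collect_text(PdfReader("example.pdf").pages)
```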
3. Extracting Metadata from PDF Files
PyPDF2 allows us to extract metadata such as the author, title, and creation date:
import PyPDF2
with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    metadata = reader.metadata
    print(f"Author: {metadata.author}")
    print(f"Title: {metadata.title}")
    print(f"Creation Date: {metadata.creation_date}")
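reader.metadata can be None when a PDF carries no document information, and individual fields can be missing too, in which case the code above would either fail or print None. A defensive sketch follows; the helper name describe_metadata and the placeholder text are illustrative choices, not part of PyPDF2.

```python
def describe_metadata(metadata, fields=("author", "title", "creation_date")):
    """Return {field: value}, with a placeholder for anything that is missing."""
    placeholder = "(not set)"
    summary = {}
    for field in fields:
        # getattr with a default also covers the case where metadata is None
        value = getattr(metadata, field, None)
        summary[field] = value if value is not None else placeholder
    return summary

# Usage with a real file (assumes example.pdf exists):
#   from PyPDF2 import PdfReader
#   for name, value in describe_metadata(PdfReader("example.pdf").metadata).items():
#       print(f"{name}: {value}")
```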
Manipulating PDF Files using PyPDF2
PyPDF2 can also modify PDFs. Let's look at a few ways to manipulate them.
1. Merging Multiple PDF Files
We can merge multiple PDF files into one using PyPDF2's PdfWriter. Let's say we have another PDF file named example2.pdf.
The following code merges example.pdf and example2.pdf:
from PyPDF2 import PdfReader, PdfWriter
pdf_writer = PdfWriter()
# Add PDFs to merge
for pdf_file in ['example.pdf', 'example2.pdf']:
    reader = PdfReader(pdf_file)
    for page in reader.pages:
        pdf_writer.add_page(page)
# Save the merged PDF
with open('merged.pdf', 'wb') as output_file:
    pdf_writer.write(output_file)
This produces a merged.pdf file containing the pages of example.pdf followed by the pages of example2.pdf.
2. Splitting PDF Files into Individual Pages
If we want to split a PDF into separate pages, PyPDF2 makes this easy. Let's split the merged.pdf file.
import PyPDF2
with open('merged.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    for i, page in enumerate(reader.pages):
        writer = PyPDF2.PdfWriter()
        writer.add_page(page)
        # Save each page as a new file
        with open(f'page_{i+1}.pdf', 'wb') as output_pdf:
            writer.write(output_pdf)
Each page_N.pdf file contains page N of merged.pdf. For example, page_1.pdf and page_2.pdf hold its first and second pages, respectively.
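Splitting does not have to produce one file per page. For the case mentioned earlier, sending only a few pages of a large document, a range of pages can go into a single writer. A minimal sketch; copy_page_range is a hypothetical helper and the file names in the usage comment are placeholders.

```python
def copy_page_range(pages, writer, first, last):
    """Add pages first..last (1-based, inclusive) to `writer`.

    Returns the number of pages copied.
    """
    total_pages = len(pages)
    if not 1 <= first <= last <= total_pages:
        raise ValueError(
            f"need 1 <= first <= last <= {total_pages}, got {first}..{last}"
        )
    for index in range(first - 1, last):
        writer.add_page(pages[index])
    return last - first + 1

# Usage with real files (assumes merged.pdf exists and has at least 3 pages):
#   from PyPDF2 import PdfReader, PdfWriter
#   reader = PdfReader("merged.pdf")
#   writer = PdfWriter()
#   copy_page_range(reader.pages, writer, first=2, last=3)
#   with open("pages_2_to_3.pdf", "wb") as output_pdf:
#       writer.write(output_pdf)
```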
3. Adding Watermarks to PDF Files
We can also add a watermark to a PDF file. For this we need another PDF file containing the watermark (such as a logo or text), which we overlay on each page of our main PDF file.
The following program adds a watermark to a PDF using PyPDF2:
import PyPDF2
with open('example.pdf', 'rb') as main_pdf, open('watermark.pdf', 'rb') as watermark_pdf:
    reader = PyPDF2.PdfReader(main_pdf)
    watermark_reader = PyPDF2.PdfReader(watermark_pdf)
    writer = PyPDF2.PdfWriter()
    watermark_page = watermark_reader.pages[0]
    for page in reader.pages:
        # Overlay the watermark on top of the page content
        page.merge_page(watermark_page)
        writer.add_page(page)
    # Save the watermarked PDF while the source files are still open
    with open('watermarked.pdf', 'wb') as output_pdf:
        writer.write(output_pdf)
4. Encrypting and Decrypting PDF Files
We can also password-protect our PDF files using encryption.
Below is an example:
import PyPDF2
writer = PyPDF2.PdfWriter()
# Add pages to encrypt
with open('example.pdf', 'rb') as pdf_file:
    reader = PyPDF2.PdfReader(pdf_file)
    for page in reader.pages:
        writer.add_page(page)
    # Encrypt the PDF with a password
    writer.encrypt('password123')
    # Save the encrypted PDF while the source file is still open
    with open('encrypted.pdf', 'wb') as output_pdf:
        writer.write(output_pdf)
When we try to open encrypted.pdf, we will be asked for the password.
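To read the protected file back in Python, the reader has to be unlocked first. The sketch below relies on PdfReader's is_encrypted attribute and decrypt() method, and assumes that decrypt() returns 0 when the password is wrong. unlock is a hypothetical helper name.

```python
def unlock(reader, password):
    """Decrypt `reader` in place if needed; return True when it is readable."""
    if not reader.is_encrypted:
        return True  # nothing to do for an unprotected file
    # Assumption: decrypt() returns 0 for a wrong password, non-zero otherwise
    return reader.decrypt(password) != 0

# Usage with the file written by the encryption example (assumes encrypted.pdf exists):
#   from PyPDF2 import PdfReader
#   reader = PdfReader("encrypted.pdf")
#   if unlock(reader, "password123"):
#       print(f"Total pages: {len(reader.pages)}")
#   else:
#       print("Wrong password")
```

Once unlocked, the reader can be used like any other, so its pages can be added to a new PdfWriter and saved without a password if a decrypted copy is needed.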

Common Use Cases
- We can automate generating PDF reports by extracting or merging data.
- We can also combine PyPDF2 with other Python libraries like matplotlib or PIL for more advanced PDF generation.
Conclusion
PyPDF2 is a useful, simple and powerful library for working with PDFs in Python. By following the steps given above, we can start extracting text from PDF files and explore further to discover all the features PyPDF2 provides.