PDF Text & Table Extractor (pdfplumber, batch URLs)

Download any PDF by URL and extract clean per-page text + detected tables (as 2D arrays) + document metadata (title/author/created/modified). Powered by pdfplumber. Batch up to 50 PDFs. $0.01 per PDF + $0.0005 per page.

Developer

Hojun Lee

Maintained by Community

Why this exists

Many important documents are distributed as PDFs: SEC filings, research papers, financial reports, government records. But the raw bytes aren't directly searchable, and they can't be fed to LLMs or ingested into databases until the text and tables are extracted.

This actor handles the conversion. You give it a URL list; it returns a structured per-page dataset including:

  • Clean extracted text (preserving reading order)
  • Detected tables as 2D arrays (ready for CSV / Sheets export)
  • Document-level metadata (title, author, creation and modification dates)

What you get

Summary row (one per PDF)

{
  "_type": "summary",
  "url": "https://www.sec.gov/Archives/.../aapl-10k.pdf",
  "ok": true,
  "page_count": 80,
  "title": "Apple Inc. — Annual Report 2024",
  "author": "Apple Inc.",
  "creator": "InDesign",
  "producer": "Adobe Distiller",
  "created": "D:20240928081300Z",
  "modified": "D:20240928081400Z"
}
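
The `created` and `modified` values in the example above look like raw PDF date strings (`D:YYYYMMDDHHmmSS` plus a timezone suffix). If you need real timestamps downstream, a small parser along these lines may help. This is a sketch, not part of the actor: `parse_pdf_date` is a hypothetical helper, and it assumes the value is in UTC (a `Z` suffix); other timezone offsets are ignored rather than applied.

```python
import re
from datetime import datetime, timezone

# "D:" followed by YYYY and optional MM, DD, HH, mm, SS fields.
_PDF_DATE = re.compile(r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?")
# Fallbacks for fields that are missing (the year is always present).
_DEFAULTS = (1, 1, 1, 0, 0, 0)


def parse_pdf_date(raw):
    """Parse a string like 'D:20240928081300Z' into a UTC datetime.

    Returns None when the string is missing or not in the expected shape.
    Assumption: any timezone suffix other than 'Z' is ignored, not applied.
    """
    match = _PDF_DATE.match(raw or "")
    if match is None:
        return None
    parts = [int(g) if g is not None else d
             for g, d in zip(match.groups(), _DEFAULTS)]
    try:
        return datetime(*parts, tzinfo=timezone.utc)
    except ValueError:  # e.g. month 13 in a malformed document
        return None


print(parse_pdf_date("D:20240928081300Z").isoformat())
# → 2024-09-28T08:13:00+00:00
```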

Per-page row

{
  "_type": "page",
  "url": "https://...",
  "page": 12,
  "char_count": 3210,
  "word_count": 524,
  "text": "Item 1A. Risk Factors\n\nOur business...",
  "tables": [
    [
      ["Revenue", "Q1 2024", "Q4 2023"],
      ["iPhone", "$45.96B", "$43.81B"],
      ["Mac", "$9.66B", "$7.61B"]
    ]
  ],
  "table_count": 1
}
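
Because each entry in `tables` is a plain list of rows, exporting one to CSV takes only the standard library. The sketch below uses the sample table from the row above; `table_to_csv` is a hypothetical helper name, and the handling of `None` cells is a precaution (pdfplumber can return `None` for empty cells), not documented behavior of this actor.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows, each a list of cells) to CSV text."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    for row in table:
        # Write empty cells as "" instead of the literal string "None".
        writer.writerow(["" if cell is None else cell for cell in row])
    return buf.getvalue()


page_row = {
    "_type": "page",
    "page": 12,
    "tables": [
        [
            ["Revenue", "Q1 2024", "Q4 2023"],
            ["iPhone", "$45.96B", "$43.81B"],
            ["Mac", "$9.66B", "$7.61B"],
        ]
    ],
}

for table in page_row["tables"]:
    print(table_to_csv(table))
```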

Quick start

Single PDF

{
  "url": "https://www.example.com/whitepaper.pdf"
}

Batch of 10-K filings from SEC

{
  "urls": [
    "https://www.sec.gov/Archives/edgar/data/320193/aapl-10k.pdf",
    "https://www.sec.gov/Archives/edgar/data/789019/msft-10k.pdf"
  ],
  "extractTables": true,
  "maxPages": 200
}

Text-only (skip tables for speed)

{
  "url": "https://...",
  "extractTables": false
}
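
If you have more than 50 URLs, one way to stay within the batch limit is to split the list into several run inputs. The sketch below reuses the field names from the examples above (`urls`, `extractTables`, `maxPages`); `build_run_inputs` is a hypothetical helper, and splitting client-side is an assumption about how to handle larger lists, not something the actor is documented to require.

```python
MAX_BATCH = 50  # "Batch up to 50 PDFs" per the actor description


def build_run_inputs(urls, extract_tables=True, max_pages=None):
    """Split a URL list into run inputs of at most MAX_BATCH URLs each."""
    inputs = []
    for start in range(0, len(urls), MAX_BATCH):
        run_input = {
            "urls": urls[start:start + MAX_BATCH],
            "extractTables": extract_tables,
        }
        # Only send maxPages when the caller actually set a limit.
        if max_pages is not None:
            run_input["maxPages"] = max_pages
        inputs.append(run_input)
    return inputs


# 120 placeholder URLs become three inputs of 50, 50 and 20 URLs.
urls = [f"https://www.example.com/doc-{i}.pdf" for i in range(120)]
for run_input in build_run_inputs(urls, max_pages=200):
    print(len(run_input["urls"]))
```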

Pricing

Pay-Per-Event:

  • $0.01 — flat per PDF (download + metadata)
  • $0.0005 — per page extracted

Run                              | PDFs × Pages | Cost
One 80-page 10-K                 | 1 × 80       | $0.05
Batch of 10 research papers      | 10 × 20      | $0.20
Quarterly: 50 earnings releases  | 50 × 15      | $0.88

Compared with Adobe Acrobat Pro DC ($23/mo) for manual extraction, or DocParser ($199/mo for API access), this is 5-10x cheaper at typical volumes.
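
To budget a run before starting it, the two event prices combine as `PDFs × $0.01 + total pages × $0.0005`. The sketch below applies that formula; `estimate_cost` is a hypothetical helper, and it assumes every PDF in the batch has the same page count and that all pages are extracted.

```python
from decimal import Decimal

PER_PDF = Decimal("0.01")     # flat per PDF (download + metadata)
PER_PAGE = Decimal("0.0005")  # per page extracted


def estimate_cost(pdf_count, pages_per_pdf):
    """Estimated charge in USD for pdf_count PDFs of pages_per_pdf pages each."""
    total_pages = pdf_count * pages_per_pdf
    return pdf_count * PER_PDF + total_pages * PER_PAGE


for pdfs, pages in [(1, 80), (10, 20), (50, 15)]:
    print(f"{pdfs} x {pages}: ${estimate_cost(pdfs, pages)}")
```

The third case works out to $0.875, which the pricing examples above show rounded to $0.88.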


Use cases

  1. SEC filings — Pull text + tables from 10-K, 10-Q, 8-K. Combine with our SEC EDGAR Tracker.
  2. Research aggregation — Build a searchable database of academic papers + abstracts
  3. Financial reports — Auto-extract earnings tables from quarterly releases
  4. LLM RAG — Convert PDFs to chunks for vector search / Q&A
  5. Compliance audit — Index every PDF in your corporate document store
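
For the LLM RAG use case, page rows usually need to be cut into smaller chunks before embedding. The sketch below is one possible approach, not something the actor does for you: it groups paragraphs from each page row's `text` into chunks and keeps `url` and `page` so answers can cite their source. `chunk_pages` and the 1000-character budget are illustrative assumptions.

```python
def chunk_pages(rows, max_chars=1000):
    """Group paragraphs from page rows into chunks of roughly max_chars.

    Chunks never span pages. A single paragraph longer than max_chars
    is kept whole rather than split mid-sentence.
    """
    chunks = []
    for row in rows:
        if row.get("_type") != "page":
            continue  # skip summary rows
        text = row.get("text") or ""
        paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
        current = ""
        for para in paragraphs:
            # +2 accounts for the blank line that joins paragraphs.
            if current and len(current) + len(para) + 2 > max_chars:
                chunks.append({"url": row["url"], "page": row["page"], "text": current})
                current = para
            else:
                current = f"{current}\n\n{para}" if current else para
        if current:
            chunks.append({"url": row["url"], "page": row["page"], "text": current})
    return chunks


rows = [
    {"_type": "summary", "url": "https://www.example.com/a.pdf", "ok": True},
    {"_type": "page", "url": "https://www.example.com/a.pdf", "page": 1,
     "text": "Item 1A. Risk Factors\n\nOur business..."},
]
for chunk in chunk_pages(rows):
    print(chunk["page"], repr(chunk["text"]))
```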

Limitations

  • Scanned PDFs (image-only): Returns empty text. Run these through OCR first.
  • Complex layouts: Multi-column research papers may have text from adjacent columns merged awkwardly. Custom extraction parameters for tuning this are planned for v0.2.
  • Encrypted PDFs: Will fail with a clear error message.
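
Since scanned pages come back with empty text and some PDFs fail outright, it can be worth triaging the dataset before using it. The sketch below relies on the `ok`, `char_count`, `url` and `page` fields shown in the example rows; `triage` is a hypothetical helper, and the exact shape of a failed summary row (beyond `ok` being false) is an assumption.

```python
def triage(rows, min_chars=1):
    """Return (failed_urls, ocr_candidates) from a list of dataset rows.

    failed_urls: URLs whose summary row has a falsy "ok".
    ocr_candidates: {url: [page numbers]} for pages with fewer than
    min_chars characters, which are likely image-only scans.
    """
    failed_urls = []
    ocr_candidates = {}
    for row in rows:
        kind = row.get("_type")
        if kind == "summary" and not row.get("ok"):
            failed_urls.append(row["url"])
        elif kind == "page" and row.get("char_count", 0) < min_chars:
            ocr_candidates.setdefault(row["url"], []).append(row["page"])
    return failed_urls, ocr_candidates


rows = [
    {"_type": "summary", "url": "https://www.example.com/scan.pdf", "ok": True},
    {"_type": "page", "url": "https://www.example.com/scan.pdf", "page": 1, "char_count": 0},
    {"_type": "page", "url": "https://www.example.com/scan.pdf", "page": 2, "char_count": 3210},
    {"_type": "summary", "url": "https://www.example.com/locked.pdf", "ok": False},
]
failed, needs_ocr = triage(rows)
print(failed)
print(needs_ocr)
```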

Data engine

  • pdfplumber v0.11+: Pure-Python (built on pdfminer.six) and widely used in data-engineering pipelines.
  • No OCR in this actor. For OCR, combine with a separate actor that runs Tesseract or a Vision API.
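
For readers who have not used pdfplumber, the row shapes shown earlier map closely onto its public API (`pdf.metadata`, `pdf.pages`, `page.extract_text()`, `page.extract_tables()`). The sketch below illustrates that mapping only; it is not the actor's source code. `extract_rows` and `extract_from_file` are hypothetical names, and downloading, error handling and billing are left out.

```python
def extract_rows(pdf, url, extract_tables=True, max_pages=None):
    """Build one summary row plus per-page rows from an open pdfplumber PDF."""
    meta = pdf.metadata or {}
    rows = [{
        "_type": "summary",
        "url": url,
        "ok": True,
        "page_count": len(pdf.pages),
        "title": meta.get("Title"),
        "author": meta.get("Author"),
        "created": meta.get("CreationDate"),
        "modified": meta.get("ModDate"),
    }]
    pages = pdf.pages[:max_pages] if max_pages else pdf.pages
    for page in pages:
        # Image-only pages yield no text; normalise that to "".
        text = page.extract_text() or ""
        tables = page.extract_tables() if extract_tables else []
        rows.append({
            "_type": "page",
            "url": url,
            "page": page.page_number,  # 1-based in pdfplumber
            "char_count": len(text),
            "word_count": len(text.split()),
            "text": text,
            "tables": tables,
            "table_count": len(tables),
        })
    return rows


def extract_from_file(path, url):
    """Open a local PDF with pdfplumber and extract its rows."""
    import pdfplumber  # third-party: pip install pdfplumber

    with pdfplumber.open(path) as pdf:
        return extract_rows(pdf, url)
```

Keeping `extract_rows` independent of the import means it can be exercised with stand-in objects, without a real PDF on disk.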


Feedback

A short review helps researchers and analysts find this actor: leave a review on the Apify Store.