PDF Text & Table Extractor (pdfplumber, batch URLs)
Pricing
Pay per usage
Download any PDF by URL and extract clean per-page text + detected tables (as 2D arrays) + document metadata (title/author/created/modified). Powered by pdfplumber. Batch up to 50 PDFs. $0.01 per PDF + $0.0005 per page.
Developer
Hojun Lee
Maintained by Community
Actor stats
- Bookmarked: 0
- Total users: 6
- Monthly active users: 5
- Last modified: a day ago
Why this exists
PDFs are how many important documents get distributed: SEC filings, research papers, financial reports, government records. But the raw bytes aren't searchable, and they can't be fed directly to LLMs or ingested into databases.
This actor handles the conversion. You give it a URL list; it returns a structured per-page dataset including:
- Clean extracted text (preserving reading order)
- Detected tables as 2D arrays (ready for CSV / Sheets export)
- Document-level metadata (title, author, creator, producer, creation and modification dates)
What you get
Summary row (one per PDF)
{
  "_type": "summary",
  "url": "https://www.sec.gov/Archives/.../aapl-10k.pdf",
  "ok": true,
  "page_count": 80,
  "title": "Apple Inc. — Annual Report 2024",
  "author": "Apple Inc.",
  "creator": "InDesign",
  "producer": "Adobe Distiller",
  "created": "D:20240928081300Z",
  "modified": "D:20240928081400Z"
}
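The `created` and `modified` values in the sample look like raw PDF date strings (`D:YYYYMMDDHHmmSS` plus a timezone marker). A minimal sketch for turning them into `datetime` objects, assuming the actor passes them through unchanged as shown; `parse_pdf_date` is a hypothetical helper, not part of the actor.

```python
import re
from datetime import datetime, timedelta, timezone

# Matches strings such as "D:20240928081300Z" or "D:20240928081300+09'00'".
_PDF_DATE = re.compile(
    r"^(?:D:)?(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value):
    """Return a datetime for a PDF date string, or None if it cannot be parsed."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    year, month, day, hour, minute, second, z, sign, off_h, off_m = match.groups()
    tz = None  # stays naive when the string carries no timezone marker
    if z:
        tz = timezone.utc
    elif sign:
        delta = timedelta(hours=int(off_h), minutes=int(off_m or 0))
        tz = timezone(delta if sign == "+" else -delta)
    try:
        return datetime(
            int(year), int(month or 1), int(day or 1),
            int(hour or 0), int(minute or 0), int(second or 0),
            tzinfo=tz,
        )
    except ValueError:
        # Out-of-range fields (e.g. month 13) are treated as unparseable.
        return None


print(parse_pdf_date("D:20240928081300Z").isoformat())  # 2024-09-28T08:13:00+00:00
```

Metadata written by different PDF producers varies, so returning `None` for anything unrecognised is safer than raising in a batch job.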
Per-page row
{
  "_type": "page",
  "url": "https://...",
  "page": 12,
  "char_count": 3210,
  "word_count": 524,
  "text": "Item 1A. Risk Factors\n\nOur business...",
  "tables": [
    [
      ["Revenue", "Q1 2024", "Q4 2023"],
      ["iPhone", "$45.96B", "$43.81B"],
      ["Mac", "$9.66B", "$7.61B"]
    ]
  ],
  "table_count": 1
}
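Because each entry in `tables` is a list of rows, CSV export needs only the standard library. The sketch below reuses the sample table from the per-page row; `table_to_csv` is a hypothetical helper name, and the handling of `None` cells is an assumption about how empty cells might arrive.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one table (a list of rows, each a list of cells) to CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        # Assumption: empty cells may arrive as None; write them as "".
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


page_row = {
    "_type": "page",
    "page": 12,
    "tables": [[
        ["Revenue", "Q1 2024", "Q4 2023"],
        ["iPhone", "$45.96B", "$43.81B"],
        ["Mac", "$9.66B", "$7.61B"],
    ]],
}

# One CSV document per detected table on the page.
csv_texts = [table_to_csv(table) for table in page_row["tables"]]
print(csv_texts[0])
```

Using the `csv` module rather than `",".join(...)` keeps cells that contain commas or quotes intact.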
Quick start
Single PDF
{"url": "https://www.example.com/whitepaper.pdf"}
Batch of 10-K filings from SEC
{
  "urls": [
    "https://www.sec.gov/Archives/edgar/data/320193/aapl-10k.pdf",
    "https://www.sec.gov/Archives/edgar/data/789019/msft-10k.pdf"
  ],
  "extractTables": true,
  "maxPages": 200
}
Text-only (skip tables for speed)
{"url": "https://...","extractTables": false}
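For URL lists longer than the 50-PDF batch limit, the inputs can be generated rather than written by hand. This is a sketch only: the field names (`urls`, `extractTables`, `maxPages`) come from the examples above, while `build_inputs` is a hypothetical helper and the actor's behaviour on an oversized batch is not documented here.

```python
import json


def build_inputs(urls, extract_tables=True, max_pages=None, batch_size=50):
    """Split a URL list into actor inputs of at most `batch_size` URLs each."""
    if not 1 <= batch_size <= 50:
        raise ValueError("batch_size must be between 1 and 50")
    # Drop duplicate URLs while keeping first-seen order.
    unique = list(dict.fromkeys(urls))
    inputs = []
    for start in range(0, len(unique), batch_size):
        payload = {
            "urls": unique[start:start + batch_size],
            "extractTables": extract_tables,
        }
        if max_pages is not None:
            payload["maxPages"] = max_pages
        inputs.append(payload)
    return inputs


# Placeholder URLs for illustration only.
urls = [f"https://www.example.com/report-{i}.pdf" for i in range(120)]
batches = build_inputs(urls, extract_tables=False, max_pages=200)
print(len(batches), [len(b["urls"]) for b in batches])
print(json.dumps(batches[-1])[:80])
```

Each element of `batches` can then be submitted as the input of a separate run.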
Pricing
Pay-Per-Event:
- $0.01: flat per PDF (download + metadata)
- $0.0005: per page extracted
| Run | PDFs × Pages | Cost |
|---|---|---|
| One 80-page 10-K | 1 × 80 | $0.05 |
| Batch of 10 research papers | 10 × 20 | $0.20 |
| Quarterly: 50 earnings releases | 50 × 15 | $0.88 |
Compared with Adobe Acrobat Pro DC ($23/mo) for manual extraction, or DocParser ($199/mo for API access), this is 5-10x cheaper at typical volumes.
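The figures in the pricing table follow from the two event prices. A small estimator, assuming every PDF in a run has the same page count and that only extracted pages are billed (how failed downloads are charged is not stated here); `estimate_cost` is a hypothetical helper.

```python
from decimal import Decimal

PER_PDF = Decimal("0.01")     # flat fee per PDF (download + metadata)
PER_PAGE = Decimal("0.0005")  # fee per extracted page


def estimate_cost(pdf_count, pages_per_pdf):
    """Estimated cost in USD for `pdf_count` PDFs of `pages_per_pdf` pages each."""
    return pdf_count * PER_PDF + pdf_count * pages_per_pdf * PER_PAGE


runs = [
    ("One 80-page 10-K", 1, 80),
    ("Batch of 10 research papers", 10, 20),
    ("Quarterly: 50 earnings releases", 50, 15),
]
for label, pdfs, pages in runs:
    print(f"{label}: ${estimate_cost(pdfs, pages)}")
```

The third run comes to $0.875, which the table rounds to $0.88. `Decimal` is used so the half-cent page price does not pick up floating-point noise.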
Use cases
- SEC filings: Pull text and tables from 10-K, 10-Q, and 8-K filings. Combine with our SEC EDGAR Filing Tracker.
- Research aggregation: Build a searchable database of academic papers and abstracts.
- Financial reports: Auto-extract earnings tables from quarterly releases.
- LLM RAG: Convert PDFs to chunks for vector search and Q&A.
- Compliance audit: Index every PDF in your corporate document store.
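For the LLM RAG use case, the per-page rows can be split into overlapping word windows before embedding. The sketch below assumes rows shaped like the per-page example (`_type`, `url`, `page`, `text`); the window sizes are arbitrary illustrative defaults and `chunk_pages` is a hypothetical helper.

```python
def chunk_pages(page_rows, max_words=200, overlap=20):
    """Yield text chunks of up to `max_words` words, overlapping by `overlap` words."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    step = max_words - overlap
    for row in page_rows:
        if row.get("_type") != "page":
            continue  # skip summary rows
        words = (row.get("text") or "").split()
        for start in range(0, len(words), step):
            yield {
                "url": row["url"],
                "page": row["page"],
                "chunk": start // step,
                "text": " ".join(words[start:start + max_words]),
            }
            if start + max_words >= len(words):
                break  # this window already reached the end of the page


rows = [
    {"_type": "summary", "url": "https://www.example.com/a.pdf", "ok": True},
    {"_type": "page", "url": "https://www.example.com/a.pdf", "page": 1,
     "text": " ".join(f"w{i}" for i in range(450))},
]
chunks = list(chunk_pages(rows, max_words=200, overlap=50))
print(len(chunks), [len(c["text"].split()) for c in chunks])
```

Keeping `url` and `page` on every chunk lets a Q&A system cite the page an answer came from. Chunks never cross page boundaries in this version.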
Limitations
- Scanned PDFs (image-only): Returns empty text. These need OCR, which this actor does not perform.
- Complex layouts: Multi-column research papers may merge column text awkwardly. Tuning via custom extraction parameters comes with v0.2.
- Encrypted PDFs: Will fail with a clear error message.
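These limitations can be caught downstream by triaging the dataset rows. A sketch under two assumptions that this page does not confirm: a failed PDF (for example an encrypted one) still yields a summary row with `"ok": false`, and a scanned page shows up with `char_count` equal to 0. A genuinely blank page looks the same, so the result is a list of candidates for OCR, not a verdict. `triage` is a hypothetical helper.

```python
from collections import defaultdict


def triage(rows):
    """Return (failed_urls, {url: [page numbers with no extracted text]})."""
    failed = []
    empty_pages = defaultdict(list)
    for row in rows:
        kind = row.get("_type")
        if kind == "summary" and not row.get("ok"):
            # Assumption: failures are reported as summary rows with ok == false.
            failed.append(row["url"])
        elif kind == "page" and row.get("char_count", 0) == 0:
            empty_pages[row["url"]].append(row["page"])
    return failed, dict(empty_pages)


rows = [
    {"_type": "summary", "url": "https://www.example.com/a.pdf", "ok": True},
    {"_type": "page", "url": "https://www.example.com/a.pdf", "page": 1, "char_count": 3210},
    {"_type": "page", "url": "https://www.example.com/a.pdf", "page": 2, "char_count": 0},
    {"_type": "summary", "url": "https://www.example.com/locked.pdf", "ok": False},
]
failed, needs_ocr = triage(rows)
print(failed)
print(needs_ocr)
```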
Data engine
- pdfplumber v0.11+: Pure-Python and widely used in data-engineering pipelines.
- No OCR in this actor. For OCR, combine with a separate actor that runs Tesseract or Vision API.
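For readers who want to see what the underlying library does, here is a rough local sketch using pdfplumber's public API (`pdfplumber.open`, `pdf.metadata`, `page.extract_text`, `page.extract_tables`). It is not the actor's source code: the row layout mirrors the examples on this page, the way `char_count` and `word_count` are computed is an assumption, and `make_page_row` and `extract_rows` are hypothetical names.

```python
def make_page_row(url, page_number, text, tables):
    """Build a per-page row shaped like the example output on this page."""
    text = text or ""  # treat a missing text result as empty
    return {
        "_type": "page",
        "url": url,
        "page": page_number,
        "char_count": len(text),           # assumption: plain character count
        "word_count": len(text.split()),   # assumption: whitespace-separated words
        "text": text,
        "tables": tables,
        "table_count": len(tables),
    }


def extract_rows(path, url, extract_tables=True):
    """Extract a summary row plus one row per page from a local PDF file."""
    import pdfplumber  # third-party: pip install pdfplumber

    rows = []
    with pdfplumber.open(path) as pdf:
        meta = pdf.metadata or {}
        rows.append({
            "_type": "summary",
            "url": url,
            "ok": True,
            "page_count": len(pdf.pages),
            "title": meta.get("Title"),
            "author": meta.get("Author"),
        })
        for page in pdf.pages:
            tables = page.extract_tables() if extract_tables else []
            rows.append(make_page_row(url, page.page_number, page.extract_text(), tables))
    return rows
```

Calling `extract_rows("report.pdf", "https://www.example.com/report.pdf")` on a downloaded file would return rows comparable to the dataset described above, with the download step and error handling left out.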
Related actors (same author)
- Web Page → Markdown Converter — HTML version of the same idea
- HTML Metadata Extractor
- SEC EDGAR Filing Tracker — Get the SEC filing URLs to feed in
- JSON Schema Generator
Feedback
A short review helps researchers and analysts find this actor: leave a review on Apify Store.