📄 PDF Text Extractor
Pricing
from $4.99 / 1,000 results
📄✨ PDF Text Extractor extracts clean text from PDF files with precision. ⚡ Perfect for data mining, document processing, and searchable archives. 🚀 Fast, reliable, and efficient for your workflow!
Developer
Scraper Engine
Maintained by Community
📄 PDF Text Extractor & Chunker
Extract clean, ordered text from any PDF on the web — page-by-page or split into LLM-ready chunks with controllable size and overlap. Point it at one URL or thousands; results stream into your dataset section by section, live.
Perfect for building RAG pipelines, question-answering systems, document search, and any workflow that needs PDF content as plain text. 🚀
🌟 Why Choose This Actor?
- ⚡ Live results — every page/chunk is saved the moment it's ready. A long run never leaves you staring at an empty output table.
- 🧩 LLM-friendly chunking — character-based chunking with configurable overlap, so text that straddles a chunk boundary appears in both neighboring chunks.
- 📦 Bulk input — drop in a whole list of PDF URLs at once.
- 🛡️ Smart anti-rate-limit ladder — starts with a direct connection and automatically falls back to datacenter, then residential proxies if a host blocks you.
- 🎉 Engaging real-time logs — watch exactly what's happening, page by page.
✨ Key Features
- Extract text from PDFs provided as URLs.
- Toggle between page mode (one record per page) and chunk mode.
- Configure `chunkSize` and `chunkOverlap` for perfect LLM context windows.
- Resilient downloads with proxy fallback and retries.
- Output ready for JSON / CSV / XLSX export.
📥 Input
| Field | Type | Description |
|---|---|---|
| `urls` | array | 🔗 Direct URLs of the PDF files (bulk supported). |
| `performChunking` | boolean | ✂️ `true` → split into chunks. `false` → one record per page. |
| `chunkSize` | integer | 📏 Max characters per chunk (chunk mode). Default 1000. |
| `chunkOverlap` | integer | 🔁 Characters shared between adjacent chunks. Default 0. |
| `proxyConfiguration` | object | 🛡️ Apify proxy used to power the automatic fallbacks. |
Example input
{"urls": ["https://arxiv.org/pdf/2307.12856"],"performChunking": true,"chunkSize": 1000,"chunkOverlap": 0,"proxyConfiguration": { "useApifyProxy": true }}
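To make the effect of `chunkSize` and `chunkOverlap` concrete, here is a minimal Python sketch of character-based chunking with overlap. It is an illustration under stated assumptions, not the Actor's actual implementation: the function name `chunk_text` is hypothetical, and it assumes each chunk starts `chunkSize - chunkOverlap` characters after the previous one and that the overlap must be smaller than the chunk size.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Assumed semantics: adjacent chunks share chunk_overlap characters.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")

    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


if __name__ == "__main__":
    # With size 4 and overlap 1, each chunk repeats the last character
    # of the previous chunk.
    print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
```

A larger overlap repeats more text across chunks, which raises the record count (and therefore the number of billed results) for the same PDF.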
📤 Output
Each record is one text section:
{"url": "https://arxiv.org/pdf/2307.12856","index": 0,"text": "A Real-World WebAgent with Planning, Long Context Understanding…"}
| Field | Description |
|---|---|
| `url` | 🔗 Source PDF URL. |
| `index` | 🔢 Position of the section (chunk number, or page number in page mode). |
| `text` | 📝 Extracted text for that section. |
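Because every record carries its source `url` and its `index`, a bulk run can be regrouped per document after export. The following Python sketch shows one way to do that; the helper name `group_sections` and the sample records are illustrative assumptions, and only the three fields documented above are relied on.

```python
from collections import defaultdict


def group_sections(records):
    """Group output records by source URL, with texts ordered by index."""
    by_url = defaultdict(list)
    for record in records:
        by_url[record["url"]].append(record)
    return {
        url: [r["text"] for r in sorted(items, key=lambda r: r["index"])]
        for url, items in by_url.items()
    }


if __name__ == "__main__":
    # Illustrative records only, shaped like the output described above.
    sample = [
        {"url": "https://example.com/a.pdf", "index": 1, "text": "second"},
        {"url": "https://example.com/b.pdf", "index": 0, "text": "only"},
        {"url": "https://example.com/a.pdf", "index": 0, "text": "first"},
    ]
    for url, texts in group_sections(sample).items():
        print(url, texts)
```

In page mode, joining the texts for one URL gives the full document text. In chunk mode with a non-zero `chunkOverlap`, a plain join would duplicate the overlapping characters.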
🛡️ How the connection ladder works
- 🌐 Direct — no proxy; the request goes straight to the PDF host.
- 🛰️ Datacenter proxy — engaged automatically if the host blocks or rate-limits the direct request.
- 🏠 Residential proxy — the final fallback, retried up to 3 times. Once residential is engaged, the run sticks with it for every remaining PDF.
Every switch is logged clearly so you always know which path delivered your data.
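The ladder described above can be sketched in Python as follows. This is a simplified model built only from the description in this section, not the Actor's source code: `ConnectionLadder`, `Blocked`, and the injected `fetch(url, tier)` callable are hypothetical names, and the sketch assumes that each new PDF starts again at the direct tier until residential has been engaged.

```python
from typing import Callable

TIERS = ["direct", "datacenter", "residential"]
RESIDENTIAL_ATTEMPTS = 3  # "retried up to 3 times" per the description above


class Blocked(Exception):
    """Raised by the hypothetical fetch callable when a host blocks or rate-limits."""


class ConnectionLadder:
    def __init__(self, fetch: Callable[[str, str], bytes]):
        self.fetch = fetch    # fetch(url, tier) returns PDF bytes or raises Blocked
        self.sticky = False   # becomes True once residential has been engaged

    def download(self, url: str) -> bytes:
        # After residential is engaged, skip the lower tiers for all later PDFs.
        tiers = ["residential"] if self.sticky else TIERS
        last_error = None
        for tier in tiers:
            attempts = RESIDENTIAL_ATTEMPTS if tier == "residential" else 1
            if tier == "residential":
                self.sticky = True
            for attempt in range(1, attempts + 1):
                try:
                    data = self.fetch(url, tier)
                    print(f"{url}: fetched via {tier} (attempt {attempt})")
                    return data
                except Blocked as exc:
                    last_error = exc
                    print(f"{url}: {tier} blocked (attempt {attempt})")
        raise RuntimeError(f"all tiers failed for {url}") from last_error
```

Passing the fetch function in as a parameter keeps the escalation logic separate from the HTTP and proxy details, so the ladder can be exercised with a fake fetcher.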
🚀 How to Use (Apify Console)
- Log in at Apify Console → Actors.
- Open PDF Text Extractor & Chunker.
- Paste your PDF URLs, set chunking options, pick a proxy.
- Click Start and watch the sections roll in live. 📡
- Open the Output tab and export to JSON / CSV / XLSX.
🤖 Use via API
curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=$APIFY_TOKEN" -H "Content-Type: application/json" -d '{"urls":["https://arxiv.org/pdf/2307.12856"],"performChunking":true,"chunkSize":1000,"chunkOverlap":0}'
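The same call can be made from Python using only the standard library. The sketch below mirrors the curl command above; the helper names `build_run_request` and `run_actor` and the default timeout are assumptions for illustration, and the Actor ID and token must be supplied by you.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict, timeout: float = 300.0) -> list:
    """Send the request and return the dataset items as a list of dicts."""
    request = build_run_request(actor_id, token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


if __name__ == "__main__":
    import os

    items = run_actor(
        os.environ["ACTOR_ID"],      # assumed environment variable holding the Actor ID
        os.environ["APIFY_TOKEN"],
        {
            "urls": ["https://arxiv.org/pdf/2307.12856"],
            "performChunking": True,
            "chunkSize": 1000,
            "chunkOverlap": 0,
        },
    )
    for item in items:
        print(item["index"], item["text"][:80])
```

The synchronous endpoint holds the connection open until the run finishes, so for large URL lists it may be more practical to start the run asynchronously and read the dataset afterwards.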
💡 Best Use Cases
- 📚 Build RAG / knowledge bases from PDF libraries.
- 🤖 Feed document text into LLMs (chunk mode).
- 🔍 Full-text search across PDF collections.
- 🧾 Convert reports, papers, and manuals to plain text.
❓ FAQ
Does it work on scanned/image-only PDFs? It extracts the text layer of a PDF. Image-only scans without an embedded text layer will return little or no text (OCR is not performed).
Can I pass many URLs? Yes — urls accepts a bulk list, processed one after another with results saved live.
What if a host rate-limits me? The Actor falls back automatically from a direct connection to datacenter proxies, then to residential proxies (retried up to 3 times). Once residential is engaged, it is used for every remaining PDF in the run.
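For the scanned-PDF case in the first answer above, one way to spot likely image-only documents is to total the extracted characters per URL in the exported dataset. This Python sketch is a heuristic, not a feature of the Actor: the helper name `find_low_text_urls` and the 50-character threshold are arbitrary assumptions.

```python
from collections import Counter


def find_low_text_urls(records, min_chars=50):
    """Return URLs whose records contain fewer than min_chars characters in total."""
    totals = Counter()
    for record in records:
        totals[record["url"]] += len(record["text"].strip())
    return sorted(url for url, total in totals.items() if total < min_chars)
```

URLs flagged this way are candidates for a separate OCR step. A PDF that produced no records at all will not appear in the result, so compare against your input list as well.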
🛟 Support & Feedback
Found a bug or have a feature request? Open an issue on the Actor's Issues tab in the Apify Console.
⚖️ Use responsibly. Only extract content from PDFs you are authorized to access. You are responsible for compliance with applicable laws and the source site's terms.