📄 PDF Text Extractor

Pricing

from $3.99 / 1,000 results

📄 PDF Text Extractor (pdf-text-extractor) extracts clean text from PDF files for faster search, data analysis, and content reuse. ⚡ Saves time & boosts productivity for research, automation, and document workflows.

Developer

Scrapio

Maintained by Community

3 days ago

Last modified

📄 PDF Text Extractor & Chunker

Extract clean, ordered text from PDFs reachable by URL, either page by page or split into LLM-ready chunks with controllable size and overlap. Point it at one URL or thousands; results stream into your dataset section by section while the run is still in progress.

Perfect for building RAG pipelines, question-answering systems, document search, and any workflow that needs PDF content as plain text. 🚀

🌟 Why Choose This Actor?

  • ⚡ Live results — every page/chunk is saved the moment it's ready. A long run never leaves you staring at an empty output table.
  • 🧩 LLM-friendly chunking: character-based chunks with optional overlap, so text near a chunk boundary also appears in the neighboring chunk and keeps its context.
  • 📦 Bulk input — drop in a whole list of PDF URLs at once.
  • 🛡️ Smart anti-rate-limit ladder — starts with a direct connection and automatically falls back to datacenter, then residential proxies if a host blocks you.
  • 🎉 Engaging real-time logs — watch exactly what's happening, page by page.

✨ Key Features

  • Extract text from PDFs provided as URLs.
  • Toggle between page mode (one record per page) and chunk mode.
  • Configure chunkSize and chunkOverlap to fit your LLM's context window.
  • Resilient downloads with proxy fallback and retries.
  • Output ready for JSON / CSV / XLSX export.

📥 Input

| Field | Type | Description |
|---|---|---|
| urls | array | 🔗 Direct URLs of the PDF files (bulk supported). |
| performChunking | boolean | ✂️ true → split into chunks. false → one record per page. |
| chunkSize | integer | 📏 Max characters per chunk (chunk mode). Default 1000. |
| chunkOverlap | integer | 🔁 Characters shared between adjacent chunks. Default 0. |
| proxyConfiguration | object | 🛡️ Apify proxy used to power the automatic fallbacks. |

Example input

{
  "urls": ["https://arxiv.org/pdf/2307.12856"],
  "performChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 0,
  "proxyConfiguration": { "useApifyProxy": true }
}
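
To make the interaction between chunkSize and chunkOverlap concrete, here is a minimal Python sketch of character-based chunking with overlap. It assumes each chunk starts chunkSize minus chunkOverlap characters after the previous one; the Actor's actual implementation is not documented here and may differ. The function name chunk_text is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Split text into character-based chunks that share chunk_overlap characters.

    Assumption: the window advances by chunk_size - chunk_overlap characters.
    This illustrates the idea only; it is not the Actor's own code.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")

    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text,
        # so the tail is not repeated as an extra, fully overlapped chunk.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With a 10-character string, chunk_size=4 and chunk_overlap=1, the sketch yields three chunks, each beginning with the last character of the previous one.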

📤 Output

Each record is one text section:

{
  "url": "https://arxiv.org/pdf/2307.12856",
  "index": 0,
  "text": "A Real-World WebAgent with Planning, Long Context Understanding…"
}

| Field | Description |
|---|---|
| url | 🔗 Source PDF URL. |
| index | 🔢 Position of the section (chunk number, or page number in page mode). |
| text | 📝 Extracted text for that section. |
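
Because a bulk run mixes sections from several PDFs in one dataset, you will often want to group records by url and order them by index. The Python sketch below shows one way to do that on exported records; group_sections is a hypothetical helper name, and the only fields it relies on are the three documented above.

```python
from collections import defaultdict


def group_sections(records: list[dict]) -> dict[str, list[str]]:
    """Group dataset records by source URL and return their texts in index order."""
    by_url: dict[str, list[dict]] = defaultdict(list)
    for record in records:
        by_url[record["url"]].append(record)

    return {
        url: [section["text"] for section in sorted(sections, key=lambda s: s["index"])]
        for url, sections in by_url.items()
    }
```

Joining the ordered texts rebuilds each document only in page mode or with chunkOverlap set to 0; with a non-zero overlap, adjacent chunks repeat the shared characters.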

🛡️ How the connection ladder works

  1. 🌐 Direct — no proxy; the request goes straight to the PDF host.
  2. 🛰️ Datacenter proxy — engaged automatically if the host blocks or rate-limits the direct request.
  3. 🏠 Residential proxy — the final fallback, retried up to 3 times. Once residential is engaged, the run sticks with it for every remaining PDF.

Every switch is logged clearly so you always know which path delivered your data.
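
The ladder can be pictured with the following Python sketch. It is an illustration under stated assumptions, not the Actor's code: the Blocked exception, the fetch callback, and download_all are hypothetical names; "retried up to 3 times" is read as at most three residential attempts; and each PDF is assumed to start at the direct tier until residential has been engaged.

```python
from typing import Callable, Iterable


class Blocked(Exception):
    """Hypothetical signal that the host blocked or rate-limited a request."""


LADDER = ["direct", "datacenter", "residential"]
RESIDENTIAL_ATTEMPTS = 3  # assumption: "retried up to 3 times" means 3 attempts


def download_all(urls: Iterable[str], fetch: Callable[[str, str], bytes]) -> dict[str, bytes]:
    """Download each URL, escalating through the ladder when a tier is blocked.

    fetch(url, tier) is a caller-supplied function that returns the PDF bytes
    or raises Blocked.
    """
    results: dict[str, bytes] = {}
    sticky_residential = False

    for url in urls:
        tiers = ["residential"] if sticky_residential else LADDER
        for tier in tiers:
            if tier == "residential":
                sticky_residential = True  # later PDFs skip the lower tiers
            attempts = RESIDENTIAL_ATTEMPTS if tier == "residential" else 1
            data = None
            for attempt in range(1, attempts + 1):
                try:
                    data = fetch(url, tier)
                    print(f"{url}: fetched via {tier} (attempt {attempt})")
                    break
                except Blocked:
                    print(f"{url}: blocked on {tier} (attempt {attempt})")
            if data is not None:
                results[url] = data
                break
    return results
```

A URL that stays blocked on every tier is simply absent from the returned dictionary in this sketch.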

🚀 How to Use (Apify Console)

  1. Log in at Apify Console and open the Actors section.
  2. Open PDF Text Extractor & Chunker.
  3. Paste your PDF URLs, set chunking options, pick a proxy.
  4. Click Start and watch the sections roll in live. 📡
  5. Open the Output tab and export to JSON / CSV / XLSX.

🤖 Use via API

curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
-H "Content-Type: application/json" \
-d '{"urls":["https://arxiv.org/pdf/2307.12856"],"performChunking":true,"chunkSize":1000,"chunkOverlap":0}'
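
The same call can be made from Python with only the standard library. This is a sketch that mirrors the curl command above: build_run_request and run_actor are hypothetical helper names, the actor ID remains a placeholder you must supply, the timeout default is an arbitrary choice, and the response is assumed to be a JSON array of dataset items.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"


def build_run_request(actor_id: str, token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request that runs the Actor and returns its dataset items."""
    query = urllib.parse.urlencode({"token": token})
    actor_path = urllib.parse.quote(actor_id, safe="~")
    url = f"{API_BASE}/{actor_path}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(actor_id: str, token: str, actor_input: dict, timeout: float = 300.0) -> list:
    """Send the request and parse the response body as JSON (assumed to be a list)."""
    request = build_run_request(actor_id, token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Example usage (requires a real actor ID and the APIFY_TOKEN environment variable):
#   import os
#   items = run_actor("<ACTOR_ID>", os.environ["APIFY_TOKEN"],
#                     {"urls": ["https://arxiv.org/pdf/2307.12856"], "performChunking": True})
```

Splitting request construction from sending keeps the URL and payload easy to inspect before any network call is made.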

💡 Best Use Cases

  • 📚 Build RAG / knowledge bases from PDF libraries.
  • 🤖 Feed document text into LLMs (chunk mode).
  • 🔍 Full-text search across PDF collections.
  • 🧾 Convert reports, papers, and manuals to plain text.

❓ FAQ

Does it work on scanned/image-only PDFs?
It extracts only the text layer of a PDF. Image-only scans without an embedded text layer return little or no text, because OCR is not performed.

Can I pass many URLs?
Yes. urls accepts a bulk list; the PDFs are processed one after another, and results are saved live.

What if a host rate-limits me?
The Actor falls back automatically from a direct connection to a datacenter proxy and then to a residential proxy, retrying as needed. Once it reaches residential, it stays there for the rest of the run.
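
For the scanned-PDF case above, a simple post-processing check can flag sources that probably need OCR elsewhere. The Python sketch below sums the extracted characters per url and reports those under a threshold; likely_needs_ocr and the min_chars default are hypothetical choices, not part of the Actor.

```python
def likely_needs_ocr(records: list[dict], min_chars: int = 20) -> list[str]:
    """Return URLs whose combined extracted text is shorter than min_chars.

    Assumption: very little text across all sections suggests an image-only PDF.
    PDFs that produced no records at all do not appear in the input and are
    therefore not reported here.
    """
    totals: dict[str, int] = {}
    for record in records:
        totals[record["url"]] = totals.get(record["url"], 0) + len(record["text"].strip())
    return [url for url, count in totals.items() if count < min_chars]
```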

🛟 Support & Feedback

Found a bug or have a feature request? Open an issue on the Actor's Issues tab in the Apify Console.


⚖️ Use responsibly. Only extract content from PDFs you are authorized to access. You are responsible for compliance with applicable laws and the source site's terms.