Pricing

from $3.99 / 1,000 results

📄 PDF Text Extractor

📄 PDF Text Extractor (pdf-text-extractor) extracts clean text from PDF files for faster search, data analysis, and content reuse. ⚡ Saves time & boosts productivity for research, automation, and document workflows.

Developer

Scrapio

Last modified

3 days ago

📄 PDF Text Extractor & Chunker

Extract clean, ordered text from text-based PDFs on the web, either page by page or split into LLM-ready chunks with configurable size and overlap. Point it at one URL or thousands; results stream into your dataset section by section while the run is still in progress.

Perfect for building RAG pipelines, question-answering systems, document search, and any workflow that needs PDF content as plain text. 🚀

🌟 Why Choose This Actor?

⚡ Live results: every page/chunk is saved the moment it's ready. A long run never leaves you staring at an empty output table.
🧩 LLM-friendly chunking: character-based chunking with configurable overlap, so text near a chunk boundary appears in both neighboring chunks and context carries across the split.
📦 Bulk input: drop in a whole list of PDF URLs at once.
🛡️ Smart anti-rate-limit ladder: starts with a direct connection and automatically falls back to datacenter, then residential proxies if a host blocks you.
🎉 Engaging real-time logs: watch exactly what's happening, page by page.

✨ Key Features

Extract text from PDFs provided as URLs.
Toggle between page mode (one record per page) and chunk mode.
Configure `chunkSize` and `chunkOverlap` to fit your LLM's context window.
Resilient downloads with proxy fallback and retries.
Output ready for JSON / CSV / XLSX export.

📥 Input

| Field | Type | Description |
|---|---|---|
| `urls` | array | 🔗 Direct URLs of the PDF files (bulk supported). |
| `performChunking` | boolean | ✂️ `true` → split into chunks. `false` → one record per page. |
| `chunkSize` | integer | 📏 Max characters per chunk (chunk mode). Default `1000`. |
| `chunkOverlap` | integer | 🔁 Characters shared between adjacent chunks. Default `0`. |
| `proxyConfiguration` | object | 🛡️ Apify proxy used to power the automatic fallbacks. |
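
To make the interaction between `chunkSize` and `chunkOverlap` concrete, here is a minimal Python sketch of character-based chunking with overlap. It assumes a simple sliding window in which each chunk starts `chunkSize - chunkOverlap` characters after the previous one. The Actor's exact algorithm is not documented here, and `chunk_text` is a hypothetical helper, not part of the Actor.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 0) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Adjacent chunks share chunk_overlap characters. This sliding-window
    scheme is an assumption for illustration, not the Actor's confirmed logic.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - chunk_overlap  # distance between chunk start positions
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk has reached the end of the text, so no
        # trailing chunk consists purely of already-seen overlap.
        if start + chunk_size >= len(text):
            break
    return chunks
```

Under this scheme, `chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1)` would produce `abcd`, `defg`, `ghij`: each chunk repeats the last character of the one before it.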

Example input

```json
{
  "urls": ["https://arxiv.org/pdf/2307.12856"],
  "performChunking": true,
  "chunkSize": 1000,
  "chunkOverlap": 0,
  "proxyConfiguration": { "useApifyProxy": true }
}
```

📤 Output

Each record is one text section:

```json
{
  "url": "https://arxiv.org/pdf/2307.12856",
  "index": 0,
  "text": "A Real-World WebAgent with Planning, Long Context Understanding…"
}
```

| Field | Description |
|---|---|
| `url` | 🔗 Source PDF URL. |
| `index` | 🔢 Position of the section (chunk number, or page number in page mode). |
| `text` | 📝 Extracted text for that section. |
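
Because records from several PDFs land in the same dataset, a common first step downstream is to group them by `url` and order them by `index`. The Python sketch below assumes the records have already been loaded as a list of dicts with the three fields above; `group_sections` is a hypothetical helper name, and the sample records are made up for illustration.

```python
from collections import defaultdict


def group_sections(records: list[dict]) -> dict[str, list[str]]:
    """Return {url: [text, ...]} with each PDF's sections ordered by index."""
    by_url = defaultdict(list)
    for rec in records:
        by_url[rec["url"]].append(rec)
    return {
        url: [r["text"] for r in sorted(recs, key=lambda r: r["index"])]
        for url, recs in by_url.items()
    }


# Illustrative records (not real Actor output), deliberately out of order.
sample = [
    {"url": "https://example.com/a.pdf", "index": 1, "text": "second"},
    {"url": "https://example.com/b.pdf", "index": 0, "text": "only"},
    {"url": "https://example.com/a.pdf", "index": 0, "text": "first"},
]
sections = group_sections(sample)
```

Note that when `chunkOverlap` is greater than `0`, neighboring chunks repeat some text, so joining the sections naively would duplicate it.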

🛡️ How the connection ladder works

🌐 Direct — no proxy; the request goes straight to the PDF host.
🛰️ Datacenter proxy — engaged automatically if the host blocks or rate-limits the direct request.
🏠 Residential proxy — the final fallback, retried up to 3 times. Once residential is engaged, the run sticks with it for every remaining PDF.

Every switch is logged clearly so you always know which path delivered your data.
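
The ladder described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Actor's source: the `fetch(url, tier)` callable, the `BlockedError` exception, and the `ConnectionLadder` class are all hypothetical, and "retried up to 3 times" is read here as at most three residential attempts per PDF.

```python
import logging
from typing import Callable

log = logging.getLogger("ladder")

TIERS = ("direct", "datacenter", "residential")
RESIDENTIAL_ATTEMPTS = 3  # assumption: "up to 3 times" means 3 attempts in total


class BlockedError(Exception):
    """Raised by the hypothetical fetcher when a host blocks or rate-limits."""


class ConnectionLadder:
    def __init__(self, fetch: Callable[[str, str], bytes]):
        self.fetch = fetch    # hypothetical: fetch(url, tier) -> PDF bytes
        self.sticky = False   # True once the residential tier has been engaged

    def download(self, url: str) -> tuple[bytes, str]:
        """Try each tier in order; return the data and the tier that delivered it."""
        tiers = ("residential",) if self.sticky else TIERS
        for tier in tiers:
            attempts = 1
            if tier == "residential":
                self.sticky = True  # every remaining PDF starts at residential
                attempts = RESIDENTIAL_ATTEMPTS
            for attempt in range(1, attempts + 1):
                try:
                    data = self.fetch(url, tier)
                except BlockedError:
                    log.warning("%s blocked via %s (attempt %d/%d)",
                                url, tier, attempt, attempts)
                    continue
                log.info("%s fetched via %s", url, tier)
                return data, tier
        raise BlockedError(f"all tiers failed for {url}")
```

Keeping the sticky flag on the ladder object rather than per URL is what makes later PDFs skip the direct and datacenter tiers once residential has been needed.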

🚀 How to Use (Apify Console)

Log in at Apify Console → Actors.
Open PDF Text Extractor & Chunker.
Paste your PDF URLs, set chunking options, pick a proxy.
Click Start and watch the sections roll in live. 📡
Open the Output tab and export to JSON / CSV / XLSX.

🤖 Use via API

```bash
curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
     -H "Content-Type: application/json" \
     -d '{"urls":["https://arxiv.org/pdf/2307.12856"],"performChunking":true,"chunkSize":1000,"chunkOverlap":0}'
```
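
The same request can be made from Python using only the standard library. The sketch below mirrors the curl call above; `build_request` and `run_actor` are hypothetical helper names, the 300-second timeout is an arbitrary choice, and `<ACTOR_ID>` remains a placeholder you must fill in yourself.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"


def build_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(actor_id: str, token: str, run_input: dict, timeout: float = 300):
    """Send the request and return the parsed JSON response (dataset items)."""
    req = build_request(actor_id, token, run_input)
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)


# Usage (performs a real network call, so it is left commented out):
# import os
# items = run_actor("<ACTOR_ID>", os.environ["APIFY_TOKEN"], {
#     "urls": ["https://arxiv.org/pdf/2307.12856"],
#     "performChunking": True,
#     "chunkSize": 1000,
#     "chunkOverlap": 0,
# })
# for item in items:
#     print(item["index"], item["text"][:80])
```

Splitting request construction from sending keeps the URL and body easy to inspect before anything goes over the network.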

💡 Best Use Cases

📚 Build RAG / knowledge bases from PDF libraries.
🤖 Feed document text into LLMs (chunk mode).
🔍 Full-text search across PDF collections.
🧾 Convert reports, papers, and manuals to plain text.

❓ FAQ

Does it work on scanned/image-only PDFs? No. The Actor extracts only the PDF's embedded text layer and does not perform OCR, so image-only scans without a text layer return little or no text.

Can I pass many URLs? Yes. `urls` accepts a bulk list; the PDFs are processed one after another, and results are saved live.

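What if a host rate-limits me? The Actor automatically falls back from a direct connection to a datacenter proxy, then to a residential proxy, which is retried up to 3 times. Once residential is engaged, the run sticks with it for every remaining PDF.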

🛟 Support & Feedback

Found a bug or have a feature request? Open an issue on the Actor's Issues tab in the Apify Console.

⚖️ Use responsibly. Only extract content from PDFs you are authorized to access. You are responsible for compliance with applicable laws and the source site's terms.
