Extract text from PDF

Pricing: from $0.00005 / actor start

Efficiently extract text content from PDF files, ideal for data processing, content analysis, and automation workflows. Supports various PDF structures and outputs clean, readable text.

Developer: Akash Kumar Naik (maintained by Community)

Actor stats

  • Bookmarked: 1
  • Total users: 107
  • Monthly active users: 8
  • Last modified: 11 days ago

PDF Text Extractor — Extract Text from Any PDF File

Extract text from PDF files with OCR support for scanned documents and image-based PDFs. Supports direct URLs and cloud storage links.

🎯 What It Does

  • Extract text from any PDF: digitally created or scanned
  • Cloud storage support: Google Drive, Dropbox, and OneDrive share links
  • OCR fallback: automatically runs Tesseract OCR on pages with fewer than 50 characters of embedded text
  • Multi-language OCR: 100+ languages supported
  • Mistral AI OCR fallback: if a mistralApiKey is provided and pdf.js-extract plus Tesseract together produce fewer than 200 characters, the PDF is sent to Mistral OCR, which handles tables, equations, and complex layouts
  • Page limiting: optionally cap extraction to a specific number of pages
  • Structured output: extracted text plus metadata (page count, source type, file size, timestamp)

📥 Input

{
  "pdfUrl": "https://drive.google.com/file/d/FILE_ID/view?usp=sharing",
  "maxPages": 0,
  "ocrFallback": true,
  "ocrLanguage": "eng",
  "mistralApiKey": "your-mistral-api-key"
}

| Field | Type | Default | Description |
|---|---|---|---|
| pdfUrl | string | | PDF URL or cloud storage share link (Google Drive, Dropbox, OneDrive) |
| maxPages | integer | 0 | Max pages to extract. 0 = all pages |
| ocrFallback | boolean | true | Run Tesseract OCR on pages with fewer than 50 characters of embedded text |
| ocrLanguage | string | eng | Tesseract language code (e.g. fra, deu, chi_sim) |
| mistralApiKey | string | | Optional Mistral AI API key. When provided, if both pdf.js-extract and Tesseract fail to produce meaningful text, the PDF is sent to Mistral OCR for premium document understanding |
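
As an illustration of the fields above, the following is a minimal Python sketch that assembles an input object with the documented defaults. The helper name `build_input` and its client-side checks are assumptions made for this example; the actor itself may validate input differently.

```python
from typing import Any, Dict, Optional


def build_input(
    pdf_url: str,
    max_pages: int = 0,
    ocr_fallback: bool = True,
    ocr_language: str = "eng",
    mistral_api_key: Optional[str] = None,
) -> Dict[str, Any]:
    """Build an input object using the documented field names and defaults.

    Hypothetical helper: the checks below are illustrative, not the actor's own.
    """
    if not pdf_url.startswith(("http://", "https://")):
        raise ValueError("pdfUrl must be an http(s) URL or share link")
    if max_pages < 0:
        raise ValueError("maxPages must be 0 (all pages) or a positive integer")

    run_input: Dict[str, Any] = {
        "pdfUrl": pdf_url,
        "maxPages": max_pages,
        "ocrFallback": ocr_fallback,
        "ocrLanguage": ocr_language,
    }
    # mistralApiKey is optional, so it is only included when a key is supplied.
    if mistral_api_key:
        run_input["mistralApiKey"] = mistral_api_key
    return run_input
```

Calling `build_input("https://example.com/report.pdf", max_pages=5)` would yield an object that caps extraction at five pages and leaves the OCR settings at their defaults.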

📤 Output

{
  "originalPdfUrl": "https://drive.google.com/file/d/FILE_ID/view",
  "processedPdfUrl": "https://drive.google.com/uc?export=download&id=FILE_ID",
  "extractedText": "Full text content extracted from the PDF...",
  "pageCount": 12,
  "extractedPages": 12,
  "fileSizeBytes": 1048576,
  "sourceType": "google-drive",
  "ocrApplied": true,
  "mistralFallbackApplied": false,
  "timestamp": "2026-06-05T07:00:00.000Z",
  "success": true
}
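
The pairing of originalPdfUrl and processedPdfUrl in the example suggests that share links are rewritten into direct-download URLs before fetching. A rough Python sketch of that rewrite for the Google Drive pattern shown above follows. The function name `drive_direct_url` is hypothetical, the actor's real logic is not shown in this document, and Dropbox and OneDrive links are not covered here.

```python
import re
from typing import Optional

# Matches share links of the form https://drive.google.com/file/d/<FILE_ID>/view...
_DRIVE_FILE_RE = re.compile(r"^https://drive\.google\.com/file/d/([^/?#]+)")


def drive_direct_url(share_url: str) -> Optional[str]:
    """Return a direct-download URL for a Google Drive file link, or None.

    Illustrative only: mirrors the originalPdfUrl -> processedPdfUrl pair
    from the example output, not the actor's actual implementation.
    """
    match = _DRIVE_FILE_RE.match(share_url)
    if match is None:
        return None  # not a recognised Drive file link
    file_id = match.group(1)
    return f"https://drive.google.com/uc?export=download&id={file_id}"
```

Returning None for unrecognised links lets a caller fall back to using the original URL unchanged, which is what one would want for plain direct PDF URLs.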

🔍 How It Works

  1. Downloads the PDF via HTTP with retry logic (3 attempts, exponential backoff)
  2. Stage 1: extracts embedded text using pdf.js-extract (fast for digital PDFs)
  3. Stage 2, Tesseract OCR fallback (when ocrFallback is enabled): pages with fewer than 50 characters of embedded text are rendered to PNG at 300 DPI via pdftoppm, then processed with Tesseract OCR
  4. Stage 3, Mistral OCR premium fallback (when mistralApiKey is set): if Stages 1 and 2 produce fewer than 200 characters in total, the PDF URL is sent to Mistral AI's OCR API
  5. Returns structured JSON with extracted text and metadata
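
The retry and fallback decisions in the steps above can be sketched in Python roughly as follows. This is not the actor's source code: the function names, the 1-second base delay, and the choice to ignore surrounding whitespace when counting characters are all assumptions. Only the 3 attempts and the 50 and 200 character thresholds come from the description.

```python
import time
from typing import Callable, List, Optional

PAGE_MIN_CHARS = 50    # per-page threshold for the Tesseract stage
TOTAL_MIN_CHARS = 200  # whole-document threshold for the Mistral stage


def with_retry(
    fetch: Callable[[], bytes],
    attempts: int = 3,
    base_delay: float = 1.0,  # assumed value; the source only says "exponential backoff"
    sleep: Callable[[float], None] = time.sleep,
) -> bytes:
    """Call fetch() up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * (2 ** attempt))
    raise RuntimeError("unreachable")


def pages_needing_ocr(page_texts: List[str], ocr_fallback: bool = True) -> List[int]:
    """Return 0-based indexes of pages whose embedded text is under 50 characters."""
    if not ocr_fallback:
        return []
    return [i for i, text in enumerate(page_texts) if len(text.strip()) < PAGE_MIN_CHARS]


def needs_mistral(page_texts: List[str], mistral_api_key: Optional[str]) -> bool:
    """True when a key is set and the earlier stages produced under 200 characters."""
    total = sum(len(text.strip()) for text in page_texts)
    return bool(mistral_api_key) and total < TOTAL_MIN_CHARS
```

The `sleep` parameter is injectable so the backoff schedule can be checked without actually waiting. In this sketch `needs_mistral` is meant to be called on the page texts after any Tesseract results have been merged in.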

💰 Pricing

Pay-per-event — charged only on successful extractions.

| Event | Price | Trigger |
|---|---|---|
| pdf-processed | $0.005 | Per successfully processed PDF |
| page-extracted | $0.0005 | Per page (only when extractedPages > 1) |
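
To get a feel for these prices, here is a small Python cost estimate. It rests on one interpretation that the table does not spell out: once more than one page is extracted, every extracted page is billed at the page-extracted rate. The function name `estimate_cost` is made up for this example, and the per-start fee mentioned in the listing header is left out.

```python
from decimal import Decimal

PDF_PROCESSED = Decimal("0.005")    # per successfully processed PDF
PAGE_EXTRACTED = Decimal("0.0005")  # per page, only when extractedPages > 1


def estimate_cost(extracted_pages: int, success: bool = True) -> Decimal:
    """Estimate the event charges in USD for one PDF (illustrative only)."""
    if not success:
        return Decimal("0")  # failed extractions are not charged
    cost = PDF_PROCESSED
    # Assumption: with more than one extracted page, all pages are billed.
    if extracted_pages > 1:
        cost += PAGE_EXTRACTED * extracted_pages
    return cost
```

Under that reading, a 12-page PDF would come to $0.005 + 12 × $0.0005 = $0.011, and a single-page PDF to $0.005. Decimal is used instead of float so that these small amounts add up exactly.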

🚀 Use Cases

  • Document processing — invoices, contracts, reports, scanned paper copies
  • Research — academic papers, white papers, archival PDFs
  • Data pipelines — feed PDF content into NLP or search systems
  • Content management — index PDF archives for full-text search
  • Automation — process PDFs at scale via Apify API, Zapier, or Make

⚡ Tips

  • Scanned PDFs: ocrFallback defaults to true, so scanned documents work without extra configuration
  • Large PDFs: set maxPages to limit processing time and cost
  • Non-English docs: set ocrLanguage to the matching Tesseract language code
  • Failed extractions: not charged; error details are returned in the errorMessage field
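
A consumer of the output might handle failures and page caps along these lines. This Python sketch uses only field names that appear in this document (success, errorMessage, extractedText, pageCount, extractedPages). The helper names and the exception class are hypothetical, and treating extractedPages < pageCount as a sign that maxPages cut the document short is an assumption.

```python
from typing import Any, Dict


class ExtractionError(Exception):
    """Raised when a result record reports an unsuccessful extraction."""


def text_or_raise(record: Dict[str, Any]) -> str:
    """Return extractedText, or raise with the record's errorMessage on failure."""
    if not record.get("success", False):
        raise ExtractionError(record.get("errorMessage", "unknown error"))
    return record.get("extractedText", "")


def is_truncated(record: Dict[str, Any]) -> bool:
    """Assumed check: fewer pages extracted than the PDF contains."""
    return record.get("extractedPages", 0) < record.get("pageCount", 0)
```

Checking `is_truncated` after a run with a non-zero maxPages gives a quick signal that the returned text covers only part of the document.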