Extract text from PDF
Efficiently extract text content from PDF files, ideal for data processing, content analysis, and automation workflows. Supports various PDF structures and outputs clean, readable text.
Pricing
from $0.00005 / actor start
Rating: 0.0 (0)
Developer: Akash Kumar Naik
Maintained by Community
Actor stats: 1 bookmarked, 107 total users, 8 monthly active users, last modified 11 days ago
PDF Text Extractor — Extract Text from Any PDF File
Extract text from PDF files with OCR support for scanned documents and image-based PDFs. Supports direct URLs and cloud storage links.
🎯 What It Does
- Extract text from any PDF — digitally created or scanned
- Cloud storage support — Google Drive, Dropbox, and OneDrive share links
- OCR fallback — automatically runs Tesseract OCR on pages with no embedded text
- Multi-language OCR — 100+ languages supported
- Mistral AI OCR fallback — when both pdf.js-extract and Tesseract fail, optionally use Mistral OCR for state-of-the-art document understanding (tables, equations, complex layouts)
- Page limiting — optionally cap extraction to a specific number of pages
- Structured output — extracted text plus metadata (page count, source type, file size, timestamp)
📥 Input
{
  "pdfUrl": "https://drive.google.com/file/d/FILE_ID/view?usp=sharing",
  "maxPages": 0,
  "ocrFallback": true,
  "ocrLanguage": "eng",
  "mistralApiKey": "your-mistral-api-key"
}
| Field | Type | Default | Description |
|---|---|---|---|
| pdfUrl | string | - | PDF URL or cloud storage share link (Google Drive, Dropbox, OneDrive) |
| maxPages | integer | 0 | Max pages to extract. 0 = all pages |
| ocrFallback | boolean | true | Run Tesseract OCR on pages with <50 chars of embedded text |
| ocrLanguage | string | eng | Tesseract language code (e.g. fra, deu, chi_sim) |
| mistralApiKey | string | - | Optional Mistral AI API key. When provided, the PDF is sent to Mistral OCR if both pdf.js-extract and Tesseract fail to produce meaningful text |
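As an illustration of the input schema, the sketch below assembles an input object with the field names and defaults from the table. The helper name `build_actor_input` and its two validation checks are assumptions added for this example; the actor's own validation rules are not documented here.

```python
import json
from typing import Any, Dict, Optional


def build_actor_input(
    pdf_url: str,
    max_pages: int = 0,
    ocr_fallback: bool = True,
    ocr_language: str = "eng",
    mistral_api_key: Optional[str] = None,
) -> Dict[str, Any]:
    """Assemble an input object using the field names and defaults from the table."""
    # Both checks are assumptions for this sketch, not documented actor behaviour.
    if not pdf_url.startswith(("http://", "https://")):
        raise ValueError("pdfUrl should be an http(s) URL or share link")
    if max_pages < 0:
        raise ValueError("maxPages must be 0 (all pages) or a positive integer")

    actor_input: Dict[str, Any] = {
        "pdfUrl": pdf_url,
        "maxPages": max_pages,
        "ocrFallback": ocr_fallback,
        "ocrLanguage": ocr_language,
    }
    # The Mistral key is optional, so it is only included when supplied.
    if mistral_api_key:
        actor_input["mistralApiKey"] = mistral_api_key
    return actor_input


example_input = build_actor_input(
    "https://drive.google.com/file/d/FILE_ID/view?usp=sharing",
    max_pages=5,
    ocr_language="fra",
)
print(json.dumps(example_input, indent=2))
```

The resulting dictionary can be serialized to JSON and passed as the run input through whichever client is used to start the actor.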
📤 Output
{
  "originalPdfUrl": "https://drive.google.com/file/d/FILE_ID/view",
  "processedPdfUrl": "https://drive.google.com/uc?export=download&id=FILE_ID",
  "extractedText": "Full text content extracted from the PDF...",
  "pageCount": 12,
  "extractedPages": 12,
  "fileSizeBytes": 1048576,
  "sourceType": "google-drive",
  "ocrApplied": true,
  "mistralFallbackApplied": false,
  "timestamp": "2026-06-05T07:00:00.000Z",
  "success": true
}
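The `originalPdfUrl` and `processedPdfUrl` values in the sample output suggest that a Google Drive share link is rewritten into a direct-download URL before fetching. A minimal sketch of that rewrite follows; the function names and the regular expression are assumptions for illustration, and Dropbox or OneDrive links are passed through untouched because their handling is not shown in the source.

```python
import re
from typing import Optional

# Matches share links of the form https://drive.google.com/file/d/<id>/...
_DRIVE_FILE_PATTERN = re.compile(r"https?://drive\.google\.com/file/d/([^/?#]+)")


def drive_file_id(url: str) -> Optional[str]:
    """Return the file ID from a Google Drive share link, or None if it is not one."""
    match = _DRIVE_FILE_PATTERN.match(url)
    return match.group(1) if match else None


def to_download_url(url: str) -> str:
    """Rewrite a Drive share link to the download form seen in the sample output."""
    file_id = drive_file_id(url)
    if file_id is None:
        # Not a recognised Drive file link: leave the URL unchanged.
        return url
    return f"https://drive.google.com/uc?export=download&id={file_id}"


print(to_download_url("https://drive.google.com/file/d/FILE_ID/view?usp=sharing"))
```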
🔍 How It Works
- Downloads the PDF via HTTP with retry logic (3 attempts, exponential backoff)
- Stage 1: extracts embedded text using pdf.js-extract (fast for digital PDFs)
- Stage 2: Tesseract OCR fallback (when enabled): pages with <50 chars are rendered to PNG at 300 DPI via pdftoppm, then processed with Tesseract OCR
- Stage 3: Mistral OCR premium fallback (when mistralApiKey is set): if Stages 1 and 2 produce fewer than 200 characters total, the PDF URL is sent to Mistral AI's OCR API for state-of-the-art document understanding
- Returns structured JSON with extracted text and metadata
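The retry and fallback logic described above can be sketched in a few lines. This is not the actor's implementation: the OCR engines are stood in for by hypothetical callables (`ocr_page`, `premium_ocr`), the one-second base delay is an assumption, and stripping whitespace before counting characters is a guess. Only the thresholds (50 characters per page, 200 in total) and the three attempts come from the description.

```python
import time
from typing import Any, Callable, Dict, List, Optional, TypeVar

T = TypeVar("T")

MIN_PAGE_CHARS = 50    # per-page threshold that triggers Tesseract (from the text)
MIN_TOTAL_CHARS = 200  # total threshold that triggers Mistral OCR (from the text)


def with_retries(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,  # assumed; the source does not state the delay
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run `action` up to `attempts` times, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return action()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")


def extract_text(
    embedded_pages: List[str],
    ocr_page: Optional[Callable[[int], str]] = None,
    premium_ocr: Optional[Callable[[], str]] = None,
) -> Dict[str, Any]:
    """Apply Stage 2 and Stage 3 fallbacks to the Stage 1 per-page text."""
    pages = list(embedded_pages)
    ocr_applied = False
    # Stage 2: re-read only the pages whose embedded text is too short.
    if ocr_page is not None:
        for index, text in enumerate(pages):
            if len(text.strip()) < MIN_PAGE_CHARS:
                pages[index] = ocr_page(index)
                ocr_applied = True
    combined = "\n".join(pages)
    mistral_applied = False
    # Stage 3: hand the whole document to the premium OCR if still nearly empty.
    if premium_ocr is not None and len(combined.strip()) < MIN_TOTAL_CHARS:
        combined = premium_ocr()
        mistral_applied = True
    return {
        "extractedText": combined,
        "ocrApplied": ocr_applied,
        "mistralFallbackApplied": mistral_applied,
    }


# Usage with stand-in OCR functions: page 2 has no embedded text.
demo = extract_text(
    ["A digital page with plenty of embedded text. " * 10, ""],
    ocr_page=lambda index: f"text recovered by OCR for page {index + 1} " * 3,
)
print(demo["ocrApplied"], demo["mistralFallbackApplied"])
```

Passing the OCR steps in as callables keeps the decision logic testable without Tesseract, pdftoppm, or a Mistral API key installed.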
💰 Pricing
Pay-per-event — charged only on successful extractions.
| Event | Price | Trigger |
|---|---|---|
| pdf-processed | $0.005 | Per successfully processed PDF |
| page-extracted | $0.0005 | Per page (only when extractedPages > 1) |
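To make the arithmetic concrete, here is a rough cost estimate based on the two events in the table. It rests on one reading of the trigger column, namely that every extracted page is billed once a document has more than one page; the alternative reading (only pages beyond the first) would give a slightly lower figure. The per-start fee quoted in the listing is not included, and the function name is hypothetical.

```python
from decimal import Decimal

PDF_PROCESSED = Decimal("0.005")    # per successfully processed PDF
PAGE_EXTRACTED = Decimal("0.0005")  # per page, only when extractedPages > 1


def estimate_cost(extracted_pages: int, success: bool = True) -> Decimal:
    """Estimate the event charges in USD for a single run."""
    if not success:
        # Failed extractions are not charged.
        return Decimal("0")
    cost = PDF_PROCESSED
    # Assumed reading: all extracted pages are billed once there is more than one.
    if extracted_pages > 1:
        cost += PAGE_EXTRACTED * extracted_pages
    return cost


print(estimate_cost(12))  # 12-page document, as in the sample output
print(estimate_cost(1))   # single page: only the pdf-processed event
```

Under this reading a 12-page PDF comes to $0.005 + 12 × $0.0005 = $0.011.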
🚀 Use Cases
- Document processing — invoices, contracts, reports, scanned paper copies
- Research — academic papers, white papers, archival PDFs
- Data pipelines — feed PDF content into NLP or search systems
- Content management — index PDF archives for full-text search
- Automation — process PDFs at scale via Apify API, Zapier, or Make
⚡ Tips
- Scanned PDFs: ocrFallback: true is enabled by default, so scanned files work out of the box
- Large PDFs: set maxPages to limit processing time and cost
- Non-English docs: set ocrLanguage to the matching Tesseract language code
- Failed extractions: not charged; error details are returned in the errorMessage field
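When consuming results, it can help to branch on the outcome before using the text. The sketch below assumes a failed run returns a record with `success` set to false alongside `errorMessage`; the source only shows a successful record, so the exact shape of a failure record is a guess, and `summarize_result` is a hypothetical helper.

```python
from typing import Any, Dict


def summarize_result(item: Dict[str, Any]) -> str:
    """Return a one-line summary of an output record."""
    if not item.get("success"):
        # Assumed failure shape: success is false and errorMessage explains why.
        reason = item.get("errorMessage", "no error message provided")
        return f"Extraction failed: {reason}"

    notes = []
    if item.get("ocrApplied"):
        notes.append("Tesseract OCR used")
    if item.get("mistralFallbackApplied"):
        notes.append("Mistral OCR used")
    suffix = f" ({', '.join(notes)})" if notes else ""
    text_length = len(item.get("extractedText", ""))
    return (
        f"Extracted {item.get('extractedPages')} of {item.get('pageCount')} pages, "
        f"{text_length} characters{suffix}"
    )


sample = {
    "extractedText": "Full text content extracted from the PDF...",
    "pageCount": 12,
    "extractedPages": 12,
    "ocrApplied": True,
    "mistralFallbackApplied": False,
    "success": True,
}
print(summarize_result(sample))
print(summarize_result({"success": False, "errorMessage": "download failed"}))
```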