# PDF OCR Tool — Extract Text from Scanned Documents (`junipr/pdf-ocr-tool`) Actor

Extract text from scanned PDFs and images using Tesseract OCR. 100+ languages, multi-page support. Configurable DPI, page segmentation, language selection. Output as plain text or structured JSON per page.

- **URL**: https://apify.com/junipr/pdf-ocr-tool.md
- **Developed by:** [junipr](https://apify.com/junipr) (community)
- **Categories:** Automation, Developer tools
- **Stats:** 2 total users, 0 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

From $5.20 / 1,000 page OCRs

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event
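
As a rough illustration of the per-event arithmetic, the sketch below estimates the charge for a given number of page OCR events. It assumes the listed starting price of $5.20 per 1,000 page OCRs applies to your plan and that each OCR'd page counts as one event; both are assumptions, and the function name `estimate_ocr_cost` is hypothetical.

```python
from decimal import Decimal

# Starting price quoted above: $5.20 per 1,000 page OCR events.
# Assumption: your plan is billed at this rate; actual tiers may differ.
PRICE_PER_1000_PAGES = Decimal("5.20")


def estimate_ocr_cost(ocr_pages: int) -> Decimal:
    """Return the estimated charge in USD for `ocr_pages` page OCR events."""
    if ocr_pages < 0:
        raise ValueError("ocr_pages must be non-negative")
    cost = PRICE_PER_1000_PAGES * ocr_pages / 1000
    # Round to a hundredth of a cent so small jobs do not collapse to zero.
    return cost.quantize(Decimal("0.0001"))


if __name__ == "__main__":
    # Estimated charge for a 12-page scanned document.
    print(estimate_ocr_cost(12))
```

Using `Decimal` rather than `float` keeps the estimate free of binary rounding noise, which matters once many small per-page charges are summed.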

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## PDF OCR Tool

Extract text from scanned PDFs and image-based documents using built-in Tesseract.js OCR — no API keys required, no external services, runs entirely on Apify.

### Overview

Most PDF text extractors fail silently on scanned documents: the PDF looks normal but contains images instead of selectable text. This Actor addresses that with a two-stage pipeline:

1. **Smart detection** — first attempts direct text extraction (fast, free). If the PDF is text-based, it returns results immediately.
2. **OCR fallback** — if the PDF is scanned or image-based (fewer than 20 chars/page), it renders each page using a headless Chrome browser and runs Tesseract.js OCR on the resulting images.

Every result includes per-page confidence scores so you know exactly how reliable each extraction was.
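
The detection step can be pictured with a small sketch. It assumes the check compares the average number of extracted characters per page against the 20 chars/page figure mentioned above; the function name `needs_ocr` and the exact counting rule are illustrative assumptions, not the Actor's actual code.

```python
MIN_CHARS_PER_PAGE = 20  # threshold quoted in this README


def needs_ocr(page_texts: list[str], min_chars: int = MIN_CHARS_PER_PAGE) -> bool:
    """Return True when directly extracted text is too sparse to trust.

    `page_texts` holds the text pulled from each page by direct extraction.
    Assumption: characters are counted after stripping surrounding whitespace.
    """
    if not page_texts:
        return True  # nothing extracted at all, so fall back to OCR
    total_chars = sum(len(text.strip()) for text in page_texts)
    return total_chars / len(page_texts) < min_chars


if __name__ == "__main__":
    # A scan with no text layer yields empty strings for every page.
    print(needs_ocr(["", " ", ""]))
    # A text-based PDF yields plenty of characters per page.
    print(needs_ocr(["Quarterly report for 2023. " * 5]))
```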

### Features

- **No API keys** — Tesseract.js runs entirely inside the actor. Zero external dependencies, zero cost per call to a third-party API.
- **11 languages** — English, French, German, Spanish, Italian, Portuguese, Simplified Chinese, Japanese, Korean, Arabic, Russian.
- **Smart detection** — text-based PDFs take the fast path (direct extraction). Only scanned PDFs incur the OCR overhead.
- **Confidence scores** — every page reports an OCR confidence value (0–100). Text-extracted pages always score 100.
- **Extraction method** — each result reports `text-extraction`, `ocr`, or `hybrid` so you know what happened.
- **Batch processing** — supply as many PDF URLs as needed. Configurable concurrency keeps memory usage in check.
- **Metadata extraction** — title, author, subject, creator, producer, creation date, modification date.
- **Multiple output formats** — plain text, markdown, or structured JSON.
- **Page-by-page output** — optional `pages` array with per-page text, char count, confidence, and method.

### Input

All fields have defaults — run with zero configuration using the built-in sample PDF.

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `pdfUrls` | array | W3C sample PDF | List of `{url, label}` objects to process |
| `language` | string | `eng` | Tesseract OCR language code |
| `outputFormat` | string | `text` | `text`, `markdown`, or `json` |
| `extractMetadata` | boolean | `true` | Include PDF metadata in output |
| `pageByPage` | boolean | `true` | Include per-page breakdown with confidence scores |
| `maxPages` | integer | `0` (all) | Limit pages per PDF (0 = no limit) |
| `dpi` | integer | `300` | Rendering resolution for OCR (higher = better accuracy, slower) |
| `maxConcurrency` | integer | `2` | Parallel PDFs (keep low — OCR is CPU-heavy) |
| `requestTimeout` | integer | `120000` | Download timeout in milliseconds |

### Output

Each dataset item corresponds to one PDF:

```json
{
  "url": "https://example.com/report.pdf",
  "label": "Q4 Report",
  "fileName": "report.pdf",
  "method": "ocr",
  "metadata": {
    "title": "Annual Report 2023",
    "author": "Acme Corp",
    "creationDate": "2024-01-15"
  },
  "text": "Full extracted text here...",
  "pageCount": 12,
  "averageConfidence": 94.3,
  "pages": [
    {
      "pageNumber": 1,
      "text": "Page text here...",
      "charCount": 847,
      "confidence": 96.1,
      "method": "ocr"
    }
  ],
  "extractedAt": "2024-03-11T12:00:00.000Z",
  "errors": []
}
```

#### Extraction Methods

| Method | When used |
|--------|-----------|
| `text-extraction` | PDF contains embedded text (≥20 chars/page average) |
| `ocr` | PDF is scanned or image-based — all pages processed with Tesseract |
| `hybrid` | Mixed document: some pages had text, others needed OCR |
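
The table can be read as a rule for collapsing per-page methods into the document-level `method`. The sketch below is one way to express that rule; it assumes each page's `method` is either `text-extraction` or `ocr`, and the function name `document_method` is hypothetical.

```python
def document_method(page_methods: list[str]) -> str:
    """Collapse per-page methods into the document-level `method` value."""
    kinds = set(page_methods)
    if not kinds:
        raise ValueError("at least one page is required")
    unknown = kinds - {"text-extraction", "ocr"}
    if unknown:
        raise ValueError(f"unexpected page method(s): {sorted(unknown)}")
    if kinds == {"text-extraction"}:
        return "text-extraction"
    if kinds == {"ocr"}:
        return "ocr"
    return "hybrid"  # a mix of text-extracted and OCR'd pages


if __name__ == "__main__":
    print(document_method(["text-extraction", "ocr", "ocr"]))
```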

### Supported Languages

`eng` (English), `fra` (French), `deu` (German), `spa` (Spanish), `ita` (Italian), `por` (Portuguese), `chi_sim` (Simplified Chinese), `jpn` (Japanese), `kor` (Korean), `ara` (Arabic), `rus` (Russian).

Language data is downloaded from Tesseract's CDN on first use. Subsequent runs on the same build reuse the cached data automatically.

### Performance & Cost

- **Text-based PDFs** process quickly (seconds per document).
- **Scanned PDFs** require rendering plus OCR; expect 5–30 seconds per page depending on resolution and document complexity.
- Set `dpi: 150` for faster processing when accuracy is less critical. Use `dpi: 300–600` for small or dense text.
- Set `maxConcurrency: 1` for large batches if you hit memory limits.
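
To get a feel for these numbers, the sketch below turns the 5–30 seconds per page range into a best-case and worst-case estimate and compares the worst case with a run timeout. It assumes pages are processed one after another and uses the 60-minute timeout mentioned in the FAQ as the default; both function names are hypothetical.

```python
SECONDS_PER_PAGE = (5, 30)  # range quoted above for scanned pages


def estimate_ocr_seconds(scanned_pages: int) -> tuple[int, int]:
    """Return (best_case, worst_case) seconds, assuming sequential pages."""
    if scanned_pages < 0:
        raise ValueError("scanned_pages must be non-negative")
    fast, slow = SECONDS_PER_PAGE
    return scanned_pages * fast, scanned_pages * slow


def may_exceed_timeout(scanned_pages: int, timeout_secs: int = 3600) -> bool:
    """True when the worst-case estimate is longer than the run timeout."""
    return estimate_ocr_seconds(scanned_pages)[1] > timeout_secs


if __name__ == "__main__":
    pages = 200
    print(estimate_ocr_seconds(pages), may_exceed_timeout(pages))
```

The estimate ignores download time and any speed-up from `maxConcurrency`, so treat it as a planning aid rather than a prediction.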

#### FAQ

##### Does this work on password-protected PDFs?

No. Password-protected PDFs cannot be downloaded or parsed without the password. The Actor reports a parse error and returns an empty result for those files.

##### What DPI should I use?

- **72–150 DPI**: Fast, lower accuracy. Fine for large clear text.
- **300 DPI** (default): Good balance of speed and accuracy for most scanned documents.
- **400–600 DPI**: Best accuracy for small fonts, handwriting, or dense tables. Significantly slower.
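
The slowdown at higher DPI follows from the pixel count, which grows with the square of the resolution. The sketch below works this out for a US Letter page (8.5 × 11 inches); the page size is an assumption for illustration, and the Actor's renderer may scale pages differently.

```python
def rendered_size(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Pixel dimensions of a page rendered at `dpi` dots per inch."""
    return round(width_in * dpi), round(height_in * dpi)


def pixel_count(width_in: float, height_in: float, dpi: int) -> int:
    """Total number of pixels the OCR engine has to examine."""
    width_px, height_px = rendered_size(width_in, height_in, dpi)
    return width_px * height_px


if __name__ == "__main__":
    # Assumption: US Letter pages, 8.5 x 11 inches.
    for dpi in (150, 300, 600):
        width_px, height_px = rendered_size(8.5, 11, dpi)
        megapixels = pixel_count(8.5, 11, dpi) / 1_000_000
        print(f"{dpi} DPI: {width_px} x {height_px} px ({megapixels:.1f} MP)")
```

Doubling the DPI quadruples the number of pixels per page, which is why 600 DPI is markedly slower than the 300 DPI default.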

##### Why does my PDF show `method: text-extraction` even though it looks scanned?

Some PDFs embed an invisible text layer over the scanned images (common in documents processed by Adobe Acrobat or similar tools). The Actor detects this embedded text and uses it directly, which is usually more accurate than re-running OCR on those documents.

##### Can I process multiple languages in one PDF?

The `language` input accepts a single language code per run. For multilingual documents, run the Actor once per language and compare the results, or use `eng`, which handles many Latin-script languages adequately.
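
One simple way to compare runs made with different `language` settings is to keep the result with the highest `averageConfidence`. The sketch below assumes you have already fetched one dataset item per language; the function name `best_language_result` and the sample values are hypothetical.

```python
def best_language_result(results_by_language: dict[str, dict]) -> tuple[str, dict]:
    """Pick the (language, item) pair with the highest `averageConfidence`."""
    if not results_by_language:
        raise ValueError("no results to compare")
    return max(
        results_by_language.items(),
        key=lambda pair: pair[1].get("averageConfidence", 0.0),
    )


if __name__ == "__main__":
    # Made-up items standing in for two runs over the same PDF.
    runs = {
        "eng": {"averageConfidence": 71.2, "text": "..."},
        "deu": {"averageConfidence": 93.5, "text": "..."},
    }
    language, item = best_language_result(runs)
    print(language, item["averageConfidence"])
```

For documents that switch language page by page, the same comparison could be applied to each entry of the `pages` array instead of the whole document.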

##### What happens if OCR confidence is low?

Low confidence (below ~60%) usually means the scan quality is poor, the wrong language is selected, or the document contains complex layouts. Try increasing `dpi`, selecting the correct language, or pre-processing the PDF to improve image quality.
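
A short post-processing step can flag the pages worth re-running. The sketch below reads the `pages` array from a dataset item (available when `pageByPage` is enabled) and lists OCR pages below a threshold; the 60 cut-off mirrors the rough figure above, and the function name `low_confidence_pages` is hypothetical.

```python
def low_confidence_pages(item: dict, threshold: float = 60.0) -> list[int]:
    """Return page numbers whose OCR confidence is below `threshold`."""
    return [
        page["pageNumber"]
        for page in item.get("pages", [])
        # Text-extracted pages always score 100, so only OCR pages matter.
        if page.get("method") == "ocr" and page.get("confidence", 0.0) < threshold
    ]


if __name__ == "__main__":
    # Made-up item shaped like the output example in this README.
    item = {
        "pages": [
            {"pageNumber": 1, "confidence": 96.1, "method": "ocr"},
            {"pageNumber": 2, "confidence": 41.0, "method": "ocr"},
        ]
    }
    print(low_confidence_pages(item))
```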

##### Is there a page limit?

No limit applies by default (`maxPages: 0`). Set `maxPages` to cap the pages processed per PDF; the input schema accepts values up to 1000. The Actor timeout is 60 minutes. For very large batches, increase the timeout in the run options.

### Competitive Advantage

Unlike alternatives that depend on Google Vision API, OpenAI, or AWS Textract (all paid, all requiring API keys to be configured), this Actor uses Tesseract.js, which is open source, runs locally inside the Actor, and has zero per-call third-party API cost. You pay only the Actor's own charges, described under Pricing above.

# Actor input Schema

## `pdfUrls` (type: `array`):

List of PDF URLs to process. Each entry is an object with a 'url' field and an optional 'label' for identification.

## `language` (type: `string`):

Language used by Tesseract OCR for text recognition. Choose the primary language in your PDFs for best accuracy.

## `outputFormat` (type: `string`):

Format for the extracted text. 'text' returns plain text, 'markdown' adds basic structure hints, 'json' returns structured data.

## `extractMetadata` (type: `boolean`):

Extract PDF metadata such as title, author, subject, creator, producer, creation date, and modification date.

## `pageByPage` (type: `boolean`):

When enabled, output includes a 'pages' array with text, character count, and OCR confidence score for each individual page.

## `maxPages` (type: `integer`):

Maximum number of pages to process per PDF. Set to 0 to process all pages. Useful for previewing large documents or managing costs.

## `dpi` (type: `integer`):

Resolution in DPI used when rendering PDF pages to images for OCR. Higher values improve accuracy on small text but increase processing time.

## `maxConcurrency` (type: `integer`):

Maximum number of PDFs to process simultaneously. OCR is CPU-intensive, so keep this low to avoid timeouts.

## `requestTimeout` (type: `integer`):

Timeout in milliseconds for downloading each PDF from a URL. Increase for very large files or slow servers.

## Actor input object example

```json
{
  "pdfUrls": [
    {
      "url": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
      "label": "sample"
    }
  ],
  "language": "eng",
  "outputFormat": "text",
  "extractMetadata": true,
  "pageByPage": true,
  "maxPages": 0,
  "dpi": 300,
  "maxConcurrency": 2,
  "requestTimeout": 120000
}
```

# Actor output Schema

## `results` (type: `string`):

OCR-extracted text, metadata, confidence scores, and per-page content for each processed PDF

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input; an empty object runs with the defaults (including the sample PDF)
const input = {};

// Run the Actor and wait for it to finish
const run = await client.actor("junipr/pdf-ocr-tool").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input; an empty dict runs with the defaults (including the sample PDF)
run_input = {}

# Run the Actor and wait for it to finish
run = client.actor("junipr/pdf-ocr-tool").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{}' |
apify call junipr/pdf-ocr-tool --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=junipr/pdf-ocr-tool",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "PDF OCR Tool — Extract Text from Scanned Documents",
        "description": "Extract text from scanned PDFs and images using Tesseract OCR. 100+ languages, multi-page support. Configurable DPI, page segmentation, language selection. Output as plain text or structured JSON per page.",
        "version": "1.0",
        "x-build-id": "iciTHoZhGMrXnRMvj"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/junipr~pdf-ocr-tool/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-junipr-pdf-ocr-tool",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/junipr~pdf-ocr-tool/runs": {
            "post": {
                "operationId": "runs-sync-junipr-pdf-ocr-tool",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/junipr~pdf-ocr-tool/run-sync": {
            "post": {
                "operationId": "run-sync-junipr-pdf-ocr-tool",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "pdfUrls": {
                        "title": "PDF URLs",
                        "type": "array",
                        "description": "List of PDF URLs to process. Each entry is an object with a 'url' field and an optional 'label' for identification.",
                        "default": [
                            {
                                "url": "https://www.w3.org/WAI/ER/tests/xhtml/testfiles/resources/pdf/dummy.pdf",
                                "label": "sample"
                            }
                        ]
                    },
                    "language": {
                        "title": "OCR Language",
                        "enum": [
                            "eng",
                            "fra",
                            "deu",
                            "spa",
                            "ita",
                            "por",
                            "chi_sim",
                            "jpn",
                            "kor",
                            "ara",
                            "rus"
                        ],
                        "type": "string",
                        "description": "Language used by Tesseract OCR for text recognition. Choose the primary language in your PDFs for best accuracy.",
                        "default": "eng"
                    },
                    "outputFormat": {
                        "title": "Output Format",
                        "enum": [
                            "text",
                            "markdown",
                            "json"
                        ],
                        "type": "string",
                        "description": "Format for the extracted text. 'text' returns plain text, 'markdown' adds basic structure hints, 'json' returns structured data.",
                        "default": "text"
                    },
                    "extractMetadata": {
                        "title": "Extract Metadata",
                        "type": "boolean",
                        "description": "Extract PDF metadata such as title, author, subject, creator, producer, creation date, and modification date.",
                        "default": true
                    },
                    "pageByPage": {
                        "title": "Page-by-Page Output",
                        "type": "boolean",
                        "description": "When enabled, output includes a 'pages' array with text, character count, and OCR confidence score for each individual page.",
                        "default": true
                    },
                    "maxPages": {
                        "title": "Max Pages",
                        "minimum": 0,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Maximum number of pages to process per PDF. Set to 0 to process all pages. Useful for previewing large documents or managing costs.",
                        "default": 0
                    },
                    "dpi": {
                        "title": "OCR Resolution (DPI)",
                        "minimum": 72,
                        "maximum": 600,
                        "type": "integer",
                        "description": "Resolution in DPI used when rendering PDF pages to images for OCR. Higher values improve accuracy on small text but increase processing time.",
                        "default": 300
                    },
                    "maxConcurrency": {
                        "title": "Max Concurrency",
                        "minimum": 1,
                        "maximum": 5,
                        "type": "integer",
                        "description": "Maximum number of PDFs to process simultaneously. OCR is CPU-intensive, so keep this low to avoid timeouts.",
                        "default": 2
                    },
                    "requestTimeout": {
                        "title": "Request Timeout (ms)",
                        "minimum": 10000,
                        "maximum": 600000,
                        "type": "integer",
                        "description": "Timeout in milliseconds for downloading each PDF from a URL. Increase for very large files or slow servers.",
                        "default": 120000
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
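
For projects that call the REST API directly, the sketch below prepares the `run-sync-get-dataset-items` request from the specification above using only the Python standard library. It builds the request without sending it; the helper name `build_run_request` is hypothetical, and the response format is not covered here because the spec only documents a 200 `OK` for this endpoint.

```python
import json
import urllib.request
from urllib.parse import urlencode

# Server URL and path taken from the OpenAPI specification above.
BASE_URL = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/junipr~pdf-ocr-tool/run-sync-get-dataset-items"


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Prepare (but do not send) the POST request described by the spec."""
    # The spec declares `token` as a required query parameter.
    url = f"{BASE_URL}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(run_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_run_request("<YOUR_API_TOKEN>", {"language": "eng", "dpi": 300})
    print(request.get_method(), request.full_url)
    # To execute it, pass `request` to urllib.request.urlopen() with a real token.
```

Keeping request construction separate from sending makes it easy to log or unit-test the call before spending any page OCR events.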
