PDF Text & Table Extractor (pdfplumber, batch URLs)

Download any PDF by URL and extract clean per-page text + detected tables (as 2D arrays) + document metadata (title/author/created/modified). Powered by pdfplumber. Batch up to 50 PDFs. $0.01 per PDF + $0.0005 per page.

Developer

Hojun Lee

Why this exists

PDFs are how many important documents are distributed: SEC filings, research papers, financial reports, government records. But the raw bytes aren't searchable and can't be fed directly to LLMs or loaded into databases.

This actor handles the conversion. You give it a URL list; it returns a structured per-page dataset including:

Clean extracted text (preserving reading order)
Detected tables as 2D arrays (ready for CSV / Sheets export)
Document-level metadata (title, author, creator, producer, creation and modification dates)
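For readers curious how rows like these can be produced, here is a minimal sketch built on pdfplumber's documented `pdf.metadata`, `pdf.pages`, `page.extract_text()`, and `page.extract_tables()`. The function names `build_rows` and `extract_local_pdf` are hypothetical, and this is an illustration under those assumptions, not the actor's actual source.

```python
from typing import Any, List


def build_rows(url: str, pdf: Any, extract_tables: bool = True) -> List[dict]:
    """Turn an opened pdfplumber-style PDF object into one summary row plus page rows."""
    meta = pdf.metadata or {}
    rows = [{
        "_type": "summary",
        "url": url,
        "ok": True,
        "page_count": len(pdf.pages),
        "title": meta.get("Title"),
        "author": meta.get("Author"),
    }]
    for number, page in enumerate(pdf.pages, start=1):
        # extract_text() can yield nothing for image-only pages, so normalise to "".
        text = page.extract_text() or ""
        tables = page.extract_tables() if extract_tables else []
        rows.append({
            "_type": "page",
            "url": url,
            "page": number,
            "char_count": len(text),
            "word_count": len(text.split()),
            "text": text,
            "tables": tables,
            "table_count": len(tables),
        })
    return rows


def extract_local_pdf(path: str) -> List[dict]:
    """Open a local PDF with pdfplumber and build rows for it."""
    import pdfplumber  # third-party; imported lazily so build_rows works without it

    with pdfplumber.open(path) as pdf:
        return build_rows(path, pdf)
```

Keeping the row-building separate from the file handling makes the logic easy to exercise with stand-in objects before pointing it at real PDFs.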

What you get

Summary row (one per PDF)

{
  "_type": "summary",
  "url": "https://www.sec.gov/Archives/.../aapl-10k.pdf",
  "ok": true,
  "page_count": 80,
  "title": "Apple Inc. — Annual Report 2024",
  "author": "Apple Inc.",
  "creator": "InDesign",
  "producer": "Adobe Distiller",
  "created": "D:20240928081300Z",
  "modified": "D:20240928081400Z"
}
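The `created` and `modified` values above use the PDF date format (`D:YYYYMMDDHHmmSS` followed by `Z` or a UTC offset). A small sketch of converting them to Python datetimes follows; `parse_pdf_date` is a hypothetical helper, and it assumes the fields arrive as raw strings exactly as shown.

```python
import re
from datetime import datetime, timedelta, timezone
from typing import Optional

# D:YYYY[MM[DD[HH[mm[SS]]]]] followed by an optional Z or +HH'mm' / -HH'mm' offset.
_PDF_DATE = re.compile(
    r"^D:(\d{4})(\d{2})?(\d{2})?(\d{2})?(\d{2})?(\d{2})?"
    r"(?:(Z)|([+-])(\d{2})'?(\d{2})?'?)?$"
)


def parse_pdf_date(value: str) -> Optional[datetime]:
    """Parse a PDF date string; return None when it does not match the format."""
    match = _PDF_DATE.match(value.strip())
    if match is None:
        return None
    defaults = (1, 1, 1, 0, 0, 0)  # missing month/day default to 1, time parts to 0
    year, month, day, hour, minute, second = (
        int(part) if part else default
        for part, default in zip(match.groups()[:6], defaults)
    )
    zulu, sign, off_hours, off_minutes = match.groups()[6:]
    try:
        tz = None  # no suffix: leave the datetime naive
        if zulu:
            tz = timezone.utc
        elif sign:
            offset = timedelta(hours=int(off_hours), minutes=int(off_minutes or 0))
            tz = timezone(offset if sign == "+" else -offset)
        return datetime(year, month, day, hour, minute, second, tzinfo=tz)
    except ValueError:
        # Out-of-range fields such as month 13 or an offset of 24 hours or more.
        return None
```

Some PDF producers write non-standard dates, so treating a `None` result as "unknown" is safer than raising.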

Per-page row

{
  "_type": "page",
  "url": "https://...",
  "page": 12,
  "char_count": 3210,
  "word_count": 524,
  "text": "Item 1A. Risk Factors\n\nOur business...",
  "tables": [
    [
      ["Revenue", "Q1 2024", "Q4 2023"],
      ["iPhone", "$45.96B", "$43.81B"],
      ["Mac", "$9.66B", "$7.61B"]
    ]
  ],
  "table_count": 1
}
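Because each table is a plain 2D array, exporting to CSV needs only the standard library. The sketch below assumes a page row shaped like the example above; `table_to_csv` and `page_tables_to_csv` are hypothetical helper names.

```python
import csv
import io
from typing import List


def table_to_csv(table: List[list]) -> str:
    """Serialise one 2D table to CSV text, writing empty cells for None."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in table:
        writer.writerow(["" if cell is None else cell for cell in row])
    return buffer.getvalue()


def page_tables_to_csv(page_row: dict) -> List[str]:
    """Return one CSV string per table found on a page row."""
    return [table_to_csv(table) for table in page_row.get("tables", [])]
```

Each returned string can be written to its own file or pasted into a spreadsheet import.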

Quick start

Single PDF

{
  "url": "https://www.example.com/whitepaper.pdf"
}

Batch of 10-K filings from SEC

{
  "urls": [
    "https://www.sec.gov/Archives/edgar/data/320193/aapl-10k.pdf",
    "https://www.sec.gov/Archives/edgar/data/789019/msft-10k.pdf"
  ],
  "extractTables": true,
  "maxPages": 200
}

Text-only (skip tables for speed)

{
  "url": "https://...",
  "extractTables": false
}
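If you have more than 50 URLs, one approach is to split them into several runs. The sketch below builds input objects using only the field names shown above (`urls`, `extractTables`, `maxPages`); the helper name `build_run_inputs` and the de-duplication step are assumptions made for illustration.

```python
from typing import Iterable, Iterator, Optional

MAX_BATCH = 50  # the description states "Batch up to 50 PDFs"


def build_run_inputs(
    urls: Iterable[str],
    extract_tables: bool = True,
    max_pages: Optional[int] = None,
) -> Iterator[dict]:
    """Yield one actor input per batch of at most MAX_BATCH unique URLs."""
    unique = list(dict.fromkeys(urls))  # drop duplicates, keep first-seen order
    for start in range(0, len(unique), MAX_BATCH):
        run_input = {
            "urls": unique[start:start + MAX_BATCH],
            "extractTables": extract_tables,
        }
        if max_pages is not None:
            run_input["maxPages"] = max_pages
        yield run_input
```

Each yielded dictionary can then be submitted as the input of a separate run.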

Pricing

Pay-Per-Event:

$0.01 — flat per PDF (download + metadata)
$0.0005 — per page extracted

Run	PDFs × Pages	Cost
One 80-page 10-K	1 × 80	$0.05
Batch of 10 research papers	10 × 20	$0.20
Quarterly: 50 earnings releases	50 × 15	$0.88
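The figures in the table follow directly from the two event prices. A quick estimator, assuming every page of every PDF is extracted and billed (the function name `estimate_cost` is illustrative):

```python
from decimal import Decimal, ROUND_HALF_UP

PER_PDF = Decimal("0.01")     # flat charge per PDF
PER_PAGE = Decimal("0.0005")  # charge per extracted page


def estimate_cost(pdf_count: int, pages_per_pdf: int) -> Decimal:
    """Return the exact estimated cost in dollars for a uniform batch."""
    return PER_PDF * pdf_count + PER_PAGE * pdf_count * pages_per_pdf


def to_cents(amount: Decimal) -> Decimal:
    """Round a dollar amount to two decimal places."""
    return amount.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

For the last row, 50 PDFs of 15 pages come to exactly $0.875, which the table shows rounded to $0.88.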

Compared with Adobe Acrobat Pro DC ($23/mo) for manual extraction or DocParser ($199/mo for API access), this is 5-10x cheaper at typical volumes.

Use cases

SEC filings: Pull text and tables from 10-K, 10-Q, and 8-K filings. Combine with our SEC EDGAR Filing Tracker.
Research aggregation: Build a searchable database of academic papers and abstracts.
Financial reports: Auto-extract earnings tables from quarterly releases.
LLM RAG: Convert PDFs to chunks for vector search and Q&A.
Compliance audit: Index every PDF in your corporate document store.
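For the RAG use case, per-page rows can be split into overlapping word windows before embedding. This is one simple strategy among many; `chunk_page` and its default sizes are assumptions for illustration, not something the actor does for you.

```python
from typing import List


def chunk_page(page_row: dict, max_words: int = 200, overlap: int = 40) -> List[dict]:
    """Split a page row's text into overlapping word-window chunks."""
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be >= 0 and smaller than max_words")
    words = page_row.get("text", "").split()
    step = max_words - overlap
    chunks: List[dict] = []
    for start in range(0, len(words), step):
        window = words[start:start + max_words]
        chunks.append({
            "url": page_row.get("url"),
            "page": page_row.get("page"),
            "chunk": len(chunks),  # 0-based index within the page
            "text": " ".join(window),
        })
        if start + max_words >= len(words):
            break  # the last window already reached the end of the page
    return chunks
```

Carrying `url` and `page` on every chunk lets a Q&A system cite the exact page an answer came from.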

Limitations

Scanned PDFs (image-only): Return empty text because this actor does not run OCR. Run them through an OCR tool first.
Complex layouts: Multi-column research papers may have text from adjacent columns merged awkwardly. Custom extraction parameters for tuning this are planned for v0.2.
Encrypted PDFs: Fail with a clear error message.
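Since scanned pages come back with empty text, the output dataset itself can tell you which pages to route to OCR. A sketch, assuming page rows carry `char_count` as in the example output; `pages_needing_ocr` is a hypothetical helper, and genuinely blank pages will be flagged too.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def pages_needing_ocr(rows: Iterable[dict], min_chars: int = 1) -> Dict[str, List[int]]:
    """Map each PDF URL to the page numbers whose extracted text is (nearly) empty."""
    flagged: Dict[str, List[int]] = defaultdict(list)
    for row in rows:
        if row.get("_type") != "page":
            continue  # skip summary rows
        if row.get("char_count", 0) < min_chars:
            flagged[row["url"]].append(row["page"])
    return dict(flagged)
```

Raising `min_chars` also catches pages where only a stray header or page number was extracted.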

Data engine

pdfplumber v0.11+: pure-Python, robust, and widely used in data-engineering pipelines.
No OCR in this actor. For OCR, combine it with a separate actor that runs Tesseract or a Vision API.

Related actors

Web Page → Markdown Converter: HTML version of the same idea
HTML Metadata Extractor
SEC EDGAR Filing Tracker: Get the SEC filing URLs to feed in
JSON Schema Generator

Feedback

A short review helps researchers / analysts find it: Leave a review on Apify Store
