Pricing

From $51.00 per 1,000 URLs extracted (base charge)

Structured Data Extractor — URL to JSON

Extract structured data from a batch of URLs as schema-validated JSON. Send a list of URLs and a JSON Schema; the Actor scrapes each page (using a stealth browser and residential proxy as needed), runs an LLM to convert the page to JSON matching your schema, and validates each result per URL. Omit the schema for best-effort extraction. Public pages only.

Developer

Scott Helvick

Last modified: 2 days ago

Structured Data Extractor

Extract structured data from a batch of URLs as schema-validated JSON. AI agents that scrape pages get raw HTML or markdown back, then burn their own tokens (and risk hallucinating) turning it into the fields they actually wanted. Structured Data Extractor closes that gap in one batch. Send a list of URLs and one JSON Schema. The Actor scrapes all of them in a single pass (escalating to a stealth browser and residential proxy when a page is defended), runs an LLM per page to convert it to JSON matching your schema, validates conformance, and returns one clean structured record per URL containing the exact fields you defined.

What this does

Batch, one shared schema — pass up to 100 URLs and a single outputSchema; the same shape is extracted from every page. Because all URLs are fetched in one pass, a stealth-browser/proxy launch is amortized across the batch instead of paid per page.
Schema-directed extraction — each result is constrained to your schema, validated against it, and reported per URL via a schemaValid flag. If a page's first attempt doesn't conform, that page is retried once with the validation errors fed back.
Best-effort mode — omit the schema and each page returns sensible inferred JSON.
Handles defended pages — the fetch escalates automatically from plain HTTP to a real browser to a stealth + residential-proxy path. You don't pick a method.
Bounded cost and context — maxInputTokens caps how much of each page reaches the model, and each content component is capped independently so one oversized field can't blow the budget or overflow the context window.
Time-budgeted — maxRuntimeSecs (default 270s) keeps synchronous callers under the API's 5-minute cap; URLs not reached in time come back deferred and uncharged, so you can retry them.
One dataset record per URL — url, status, result, schemaValid, tokensUsed, inputTokens, error.

Use cases:

Extract {title, price, in_stock} from a list of bot-defended product pages as typed JSON, ready to insert into a database.
Normalize a set of listing or article pages into one fixed schema your pipeline expects.
Turn a crawl frontier (a page of result links) into structured records in one call.
Structured web scraping into a fixed schema: pull the same fields from many JavaScript-heavy or bot-defended pages as typed JSON.
Get schema-validated output per page that you can trust downstream instead of free-form model text.

Why batch + schema-directed extraction matters

The common failure mode: an agent fetches a page, gets tens of thousands of tokens of markdown, and parses it itself. That burns tokens, can overflow the context window, and invites hallucinated values when data is sparse.

A subtler one: pages behind bot detection serve degraded content to suspected automation — different prices, missing inventory, placeholder text. An agent fetching with an ordinary client extracts data that looks correct but isn't.

And a practical one specific to defended pages: spinning up a stealth browser is expensive, so doing it once per URL — one run per page — wastes that setup. Most real extraction work is "the same fields from a list of similar pages," so this Actor takes the whole list at once and fetches it in a single pass, amortizing the browser and proxy launch across the batch.

The design answers all three: pages are fetched through a stealth path so what's extracted is the real page; the fetched content is capped to a per-URL token budget so cost stays bounded and scales with page size rather than spiking unpredictably; and the model output is constrained to your schema and validated against it — with a per-page retry that feeds the specific errors back — so you get typed, checked data instead of hopeful prose. When a field genuinely isn't on a page, the model is told to return null rather than invent one.

How it compares to alternatives

| Approach | Stealth fetch | Structured to your schema | Conformance validated | Batch fetch amortization |
| --- | --- | --- | --- | --- |
| Raw stealth fetcher | Yes | No (raw HTML/markdown) | No | Depends |
| Model call on your own fetched HTML | No (you fetch) | Yes | Usually not | No |
| Browser automation + hand-written selectors | Yes | Yes (you script it) | Manual | You build it |
| Structured Data Extractor | Yes | Yes (JSON Schema in) | Yes, per-page retry | Yes (one fetch per batch) |

The raw-fetch and own-LLM approaches each solve half the problem; hand-written selectors solve both but cost ongoing maintenance. This Actor is the intersection — stealth fetch, schema-constrained extraction, conformance check — applied across a batch so the expensive fetch setup is paid once.

Input

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `urls` | array | yes | -- | The pages to extract from (1–100). One shared `outputSchema` applies to all, so pass pages of the same kind. Fetched in one pass; stealth escalation is automatic and amortized across the batch. Public, unauthenticated pages only. |
| `outputSchema` | object | -- | -- | JSON Schema for the output shape, applied to every URL. Each result is validated; `schemaValid` reports per-URL conformance. Omit for best-effort extraction. Mark expected-but-optional fields nullable so the model returns null rather than guessing. |
| `maxInputTokens` | integer | -- | `32000` | Upper bound on fetched content fed to the model per URL. Bounds per-URL cost and prevents context overflow. Range 2000–200000. |
| `maxRuntimeSecs` | integer | -- | `270` | Soft wall-clock budget for the whole run. The default keeps synchronous/x402 callers under the 300s cap; URLs not reached by the deadline return `deferred` (uncharged). Range 60–600. |
| `proxyGeo` | string | -- | -- | ISO 3166-1 alpha-2 country code (e.g. `US`, `DE`) for residential routing. Leave empty for default routing; stealth escalation still applies. |
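Because out-of-range input fails the whole run, it can be worth checking the input locally before submitting. The following is a minimal sketch that mirrors the limits in the table above; the helper name `validate_input` and its messages are illustrative and not part of the Actor.

```python
def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks acceptable."""
    problems = []
    urls = run_input.get("urls")
    if not isinstance(urls, list) or not 1 <= len(urls) <= 100:
        problems.append("urls must be a list of 1-100 items")
    # Ranges copied from the input table; omitted fields fall back to defaults.
    for field, low, high in (("maxInputTokens", 2000, 200000),
                             ("maxRuntimeSecs", 60, 600)):
        value = run_input.get(field)
        if value is not None and not (isinstance(value, int) and low <= value <= high):
            problems.append(f"{field} must be an integer in {low}-{high}")
    # Shape check only: this does not verify that the code is a real country.
    geo = run_input.get("proxyGeo")
    if geo and not (isinstance(geo, str) and len(geo) == 2 and geo.isalpha()):
        problems.append("proxyGeo must be a two-letter country code")
    return problems
```

An empty return value means the input passed these local checks; the Actor's own validation remains the authority.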

Output

One dataset record per input URL.

Completed:

{
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "status": "completed",
  "result": { "title": "A Light in the Attic", "price": "£51.77", "in_stock": true },
  "schemaValid": true,
  "tokensUsed": 4200,
  "inputTokens": 3400,
  "error": null
}

Failed (couldn't fetch or extract) and deferred (time budget exhausted) — neither is charged:

{ "url": "https://example.com/blocked", "status": "failed", "result": {}, "schemaValid": true, "tokensUsed": 0, "inputTokens": 0, "error": "empty-content: failed" }
{ "url": "https://example.com/late", "status": "deferred", "result": {}, "schemaValid": true, "tokensUsed": 0, "inputTokens": 0, "error": "runtime-budget-exhausted" }

| Field | Type | Description |
| --- | --- | --- |
| `url` | string | The input URL this record corresponds to. One record per input URL. |
| `status` | string | `completed` (extracted, charged), `failed` (couldn't fetch or extract; not charged), or `deferred` (time budget hit before this URL; not charged, retry it). |
| `result` | object | Extracted data. Conforms to `outputSchema` when provided (subject to `schemaValid`); best-effort otherwise. |
| `schemaValid` | boolean | Whether `result` validated against `outputSchema`. Always `true` when no schema was supplied. `false` means the model couldn't conform even after a retry. |
| `tokensUsed` | integer | LLM tokens consumed for this URL (input + output), across the extraction and any retry. |
| `inputTokens` | integer | Page-content tokens fed to the model for this URL (input only), across the extraction and any retry. This is the figure the per-1,000-token charge meters, so it explains the variable part of this URL's cost. |
| `error` | string | Reason when `status` is not `completed`; `null` on success. |
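Since `status` and `schemaValid` are independent signals, downstream code usually needs to look at both. A small sketch of one way to group records, using only the fields documented above (the function name `summarize` and the bucket names are illustrative):

```python
def summarize(records: list[dict]) -> dict:
    """Group dataset records by outcome so each group can be handled separately."""
    summary = {"valid": [], "invalid": [], "failed": [], "deferred": []}
    for rec in records:
        status = rec.get("status")
        if status == "completed":
            # A completed record can still fail schema validation.
            key = "valid" if rec.get("schemaValid") else "invalid"
            summary[key].append(rec)
        elif status in ("failed", "deferred"):
            summary[status].append(rec)
    return summary
```

Records in `valid` are the ones that are safe to insert directly; `invalid` holds best-effort results worth inspecting, and `deferred` holds URLs to re-submit.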

Example

{
  "urls": [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
  ],
  "outputSchema": {
    "type": "object",
    "properties": { "title": { "type": "string" }, "price": { "type": "string" } },
    "required": ["title", "price"]
  }
}

curl -X POST "https://api.apify.com/v2/acts/shelvick~structured-extractor/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls":["https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html","https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"],"outputSchema":{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"string"}},"required":["title","price"]}}'

Calling from an AI agent

Apify MCP server

The Actor is available as a callable tool on mcp.apify.com. The input schema is self-documenting — an LLM can construct a correct call from the tool description and field names alone. Pay per call via x402 USDC on Base or Skyfire managed tokens. Note the 300s synchronous cap: keep maxRuntimeSecs at its default for sync/x402 calls and let large batches defer the tail, or use the async path for big batches.

Apify SDK (Python)

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")
run = client.actor("shelvick/structured-extractor").call(
    run_input={
        "urls": [
            "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        ],
        "outputSchema": {
            "type": "object",
            "properties": {"title": {"type": "string"}, "price": {"type": "string"}},
            "required": ["title", "price"],
        },
    }
)
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["status"], item["schemaValid"], item["result"])

REST API

POST https://api.apify.com/v2/acts/shelvick~structured-extractor/run-sync-get-dataset-items?token=YOUR_TOKEN

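For large batches, start the run asynchronously, poll its status until it finishes, then fetch the dataset items:

POST https://api.apify.com/v2/acts/shelvick~structured-extractor/runs?token=YOUR_TOKEN
GET  https://api.apify.com/v2/actor-runs/{runId}?token=YOUR_TOKEN
GET  https://api.apify.com/v2/actor-runs/{runId}/dataset/items?token=YOUR_TOKEN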
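The asynchronous flow can be sketched with the standard library alone. This is an illustration under assumptions, not a reference client: it assumes the start call returns the run ID at `data.id`, that `GET /v2/actor-runs/{runId}` reports progress at `data.status`, and that the statuses listed in `TERMINAL` mark a finished run. The helper names are hypothetical.

```python
import json
import time
import urllib.request

API = "https://api.apify.com/v2"
ACTOR = "shelvick~structured-extractor"
# Assumed terminal run statuses; check the Apify API reference for the full set.
TERMINAL = {"SUCCEEDED", "FAILED", "TIMED-OUT", "ABORTED"}

def start_url(token: str) -> str:
    return f"{API}/acts/{ACTOR}/runs?token={token}"

def status_url(run_id: str, token: str) -> str:
    return f"{API}/actor-runs/{run_id}?token={token}"

def items_url(run_id: str, token: str) -> str:
    return f"{API}/actor-runs/{run_id}/dataset/items?token={token}"

def _get_json(url: str):
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def run_async(run_input: dict, token: str, poll_secs: float = 5.0):
    """Start a run, wait for a terminal status, then return (status, records)."""
    request = urllib.request.Request(
        start_url(token),
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as resp:
        run_id = json.load(resp)["data"]["id"]
    while True:
        status = _get_json(status_url(run_id, token))["data"]["status"]
        if status in TERMINAL:
            break
        time.sleep(poll_secs)
    return status, _get_json(items_url(run_id, token))
```

Calling `run_async(run_input, "YOUR_TOKEN")` performs real network requests; the official `apify_client` package wraps the same flow if you prefer not to handle HTTP yourself.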

Pricing

Pay-per-event, billed only on success, and metered to page size so you pay for what each page actually costs to process. A successfully extracted URL carries two charges, both fired after that URL's record is pushed:

a flat per-URL base — covers the page fetch, extraction setup, schema validation, and any per-page retry; and
a per-1,000-input-token charge that scales with how much page content the model actually read (reported as inputTokens on each record).

Small pages cost less and large pages cost proportionally more, instead of every page paying one worst-case flat rate. failed and deferred URLs trigger neither charge, so a batch only ever costs you the URLs it actually extracted. Cap the variable part with maxInputTokens, and cap a whole run with maxTotalChargeUsd.

See the Pricing tab on this Store page for the current per-event rates and any active subscriber discounts.
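The two-part charge can be reproduced from a run's dataset records. The sketch below assumes you supply the current rates yourself; `base_per_url` and `per_1k_input_tokens` are hypothetical parameter names, and the values used in the usage example are placeholders, not the Actor's actual prices.

```python
def estimate_cost(records: list[dict],
                  base_per_url: float,
                  per_1k_input_tokens: float) -> float:
    """Estimate a run's charge: flat base per completed URL plus metered input tokens."""
    # Only completed URLs are charged; failed and deferred records are skipped.
    completed = [r for r in records if r.get("status") == "completed"]
    base = len(completed) * base_per_url
    tokens = sum(r.get("inputTokens", 0) for r in completed)
    variable = tokens / 1000 * per_1k_input_tokens
    return base + variable
```

For example, with placeholder rates of 0.05 per URL and 0.01 per 1,000 input tokens, two completed URLs that read 3,400 and 1,600 tokens come to 2 × 0.05 + 5 × 0.01 = 0.15, and any failed or deferred records in the same batch add nothing.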

Behavior

Run-level failures (rare): invalid input fails the run before any work — empty urls, more than 100 URLs, or maxInputTokens/maxRuntimeSecs out of range. Nothing is charged.

Per-URL outcomes:

completed — extracted. Check schemaValid for conformance to your schema.
failed — the page couldn't be fetched (defended beyond stealth, blank, login wall) or the model returned unparseable output; error says which.
deferred — the run's maxRuntimeSecs budget was exhausted before this URL was processed. Retry it (uncharged).
schemaValid: false — extraction completed but didn't validate even after a retry; the best-effort result is still returned.

Performance expectations:

One fetch pass covers the whole batch; a stealth-browser/proxy launch is shared across URLs rather than repeated.
Cooperative pages: a few seconds each, fetched concurrently. Bot-defended pages add stealth latency.
Extraction runs per URL; total time scales with batch size. At default maxRuntimeSecs (270s) a stealth batch of roughly 20 URLs completes; larger batches return the tail as deferred.
Larger maxInputTokens increases per-URL latency and cost on big pages.
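If you call the synchronous endpoint with long lists of defended pages, one option is to split the list client-side so each run is likely to finish inside the default time budget. A minimal sketch, taking the "roughly 20 URLs" figure above as the default chunk size (an approximation, not a guarantee):

```python
def chunk(urls: list[str], size: int = 20) -> list[list[str]]:
    """Split a URL list into consecutive batches of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]
```

Smaller chunks give up some of the fetch amortization, so for cooperative pages a larger size, up to the 100-URL input limit, is usually the better trade.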

FAQ

Can the URLs have different output shapes? No — one outputSchema applies to the whole batch. Pass pages of the same kind (all products, all articles). For a different shape, use a separate run.

Am I charged for URLs that fail or get deferred? No. Both the base charge and the per-token charge fire only for completed URLs. failed and deferred URLs are free, so a batch costs only what it actually extracted.

What's deferred and how do I handle it? The run hit its maxRuntimeSecs budget before reaching that URL. It's uncharged and retry-friendly — re-submit the deferred URLs, or raise maxRuntimeSecs (and use the async API) for bigger batches in one run.
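Re-submitting deferred URLs can be automated with a small loop. In this sketch, `run_batch` is a hypothetical callable that you provide: it takes a list of URLs, runs the Actor once, and returns the dataset records as dicts (for example by wrapping the Apify SDK call shown earlier).

```python
from typing import Callable

def extract_all(urls: list[str],
                run_batch: Callable[[list[str]], list[dict]],
                max_rounds: int = 5) -> tuple[dict, list[str]]:
    """Run batches until nothing is deferred or max_rounds is reached."""
    done: dict[str, dict] = {}
    pending = list(urls)
    for _ in range(max_rounds):
        if not pending:
            break
        records = run_batch(pending)
        pending = []
        for rec in records:
            if rec["status"] == "deferred":
                # Not reached within the time budget: queue for the next round.
                pending.append(rec["url"])
            else:
                # completed and failed records are final for this URL.
                done[rec["url"]] = rec
    return done, pending
```

The second return value lists URLs that were still deferred when the round limit was hit, so the caller can decide whether to keep retrying.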

How do I keep cost down on very large pages? Lower maxInputTokens. It caps how much of each page reaches the model — the dominant cost — and each content component is capped independently so no single field can blow the budget.

What does schemaValid: false mean? The model couldn't produce output conforming to your schema for that URL even after a retry. The result is still returned for inspection; simplify the schema or mark uncertain fields nullable.
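One way to mark uncertain fields nullable is to widen the `type` of every property that is not listed in `required`. The helper below is an illustrative sketch (the name `make_optional_nullable` is not part of the Actor); it relies on standard JSON Schema, where `type` may be a list such as `["string", "null"]`, and only touches top-level properties.

```python
import copy

def make_optional_nullable(schema: dict) -> dict:
    """Return a copy in which every non-required top-level property also accepts null."""
    out = copy.deepcopy(schema)  # leave the caller's schema untouched
    required = set(out.get("required", []))
    for name, prop in out.get("properties", {}).items():
        if name in required:
            continue
        types = prop.get("type")
        if isinstance(types, str):
            prop["type"] = [types, "null"]
        elif isinstance(types, list) and "null" not in types:
            prop["type"] = types + ["null"]
    return out
```

Applied to a schema with `title` required and `price` optional, this leaves `title` as a plain string and turns `price` into `["string", "null"]`, which gives the model a valid way to say the field is absent.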

What this doesn't do

No authentication. Public, unauthenticated pages only. It won't log in or submit credentials.
No per-URL schemas. One shared outputSchema per run — pages should be the same kind.
No page interaction. It doesn't click, fill forms, or navigate multi-step flows before extracting.
No crawling. It extracts from the URLs you pass; it won't discover or follow links.
No CAPTCHA solving / no file parsing. Interactive-CAPTCHA pages return failed; it extracts from web pages, not uploaded PDFs/images.

For raw page content (HTML or markdown) without an extraction step, use a page-fetching tool instead — this Actor adds an LLM extraction cost you don't need if you only want content. For clicking, form-filling, or authenticated sessions before extraction, use a browser-automation tool. For discovering links to extract, run a crawler first and pass its URLs here as a batch.

Design notes: www.scotthelvick.com/tools/structured-extractor
