Structured Data Extractor — URL to JSON

Pricing

from $51.00 / 1,000 URLs extracted (base)

Extract structured data from a batch of URLs as schema-validated JSON. Send a list of URLs and a JSON Schema; the Actor scrapes each page (using a stealth browser and residential proxy as needed), runs an LLM to convert the page to JSON matching your schema, and validates the result per URL. Omit the schema for best-effort extraction. Public pages only.

Rating: 0.0 (0)

Developer

Scott Helvick

Maintained by Community

Actor stats

Bookmarked: 0
Total users: 2
Monthly active users: 1
Last modified: 2 days ago

Structured Data Extractor

Extract structured data from a batch of URLs as schema-validated JSON. AI agents that scrape pages get raw HTML or markdown back, then burn their own tokens (and risk hallucinating) turning it into the fields they actually wanted. Structured Data Extractor closes that gap in one batch. Send a list of URLs and one JSON Schema, and it scrapes them all in a single pass, escalating to a stealth browser and residential proxy when a page is defended. It then runs an LLM per page to convert the content to JSON matching your schema, validates conformance, and returns one clean structured record per URL.

What this does

  • Batch, one shared schema — pass up to 100 URLs and a single outputSchema; the same shape is extracted from every page. Because all URLs are fetched in one pass, a stealth-browser/proxy launch is amortized across the batch instead of paid per page.
  • Schema-directed extraction — each result is constrained to your schema, validated against it, and reported per URL via a schemaValid flag. If a page's first attempt doesn't conform, that page is retried once with the validation errors fed back.
  • Best-effort mode — omit the schema and each page returns sensible inferred JSON.
  • Handles defended pages — the fetch escalates automatically from plain HTTP to a real browser to a stealth + residential-proxy path. You don't pick a method.
  • Bounded cost and context — maxInputTokens caps how much of each page reaches the model, and each content component is capped independently so one oversized field can't blow the budget or overflow the context window.
  • Time-budgeted — maxRuntimeSecs (default 270s) keeps synchronous callers under the API's 5-minute cap; URLs not reached in time come back deferred and uncharged, so you can retry them.
  • One dataset record per URL — url, status, result, schemaValid, tokensUsed, inputTokens, error.
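
The automatic fetch escalation in the list above can be pictured as a ladder of strategies tried from cheapest to most expensive. The sketch below only illustrates that pattern and is not the Actor's implementation: the fetcher callables, the tier names, and the `looks_blocked` heuristic are all hypothetical.

```python
from typing import Callable, Optional

# Hypothetical fetcher signature: takes a URL, returns page text or None on failure.
Fetcher = Callable[[str], Optional[str]]


def looks_blocked(body: Optional[str]) -> bool:
    """Crude, assumed heuristic: treat missing or near-empty bodies as blocked."""
    return body is None or len(body.strip()) < 50


def fetch_with_escalation(
    url: str, tiers: list[tuple[str, Fetcher]]
) -> tuple[str, Optional[str]]:
    """Try each tier in order and return (tier_name, body) for the first usable body.

    Returns ("failed", None) when every tier looks blocked.
    """
    for name, fetch in tiers:
        body = fetch(url)
        if not looks_blocked(body):
            return name, body
    return "failed", None
```

A caller would pass tiers such as plain HTTP, a real browser, and a stealth path in that order, so the expensive tier runs only when the cheaper ones come back empty.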

Use cases:

  • Extract {title, price, in_stock} from a list of bot-defended product pages as typed JSON, ready to insert into a database.
  • Normalize a set of listing or article pages into one fixed schema your pipeline expects.
  • Turn a crawl frontier (a page of result links) into structured records in one call.
  • Structured web scraping into a fixed schema: pull the same fields from many JavaScript-heavy or bot-defended pages as typed JSON.
  • Get schema-validated output per page that you can trust downstream instead of free-form model text.

Why batch + schema-directed extraction matters

The common failure mode: an agent fetches a page, gets tens of thousands of tokens of markdown, and parses it itself. That burns tokens, can overflow the context window, and invites hallucinated values when data is sparse.

A subtler one: pages behind bot detection serve degraded content to suspected automation — different prices, missing inventory, placeholder text. An agent fetching with an ordinary client extracts data that looks correct but isn't.

And a practical one specific to defended pages: spinning up a stealth browser is expensive, so doing it once per URL — one run per page — wastes that setup. Most real extraction work is "the same fields from a list of similar pages," so this Actor takes the whole list at once and fetches it in a single pass, amortizing the browser and proxy launch across the batch.

The design answers all three: pages are fetched through a stealth path so what's extracted is the real page; the fetched content is capped to a per-URL token budget so cost stays bounded and scales with page size rather than spiking unpredictably; and the model output is constrained to your schema and validated against it — with a per-page retry that feeds the specific errors back — so you get typed, checked data instead of hopeful prose. When a field genuinely isn't on a page, the model is told to return null rather than invent one.
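
The validate-then-retry-once behavior described above can be sketched as follows. This is an illustration under stated assumptions, not the Actor's code: `extract` stands in for a hypothetical model call, and `check` covers only required keys and top-level types, where a real pipeline would presumably use a full JSON Schema validator.

```python
from typing import Any, Callable

# Subset of JSON Schema type names mapped to Python types (illustration only).
_JSON_TYPES = {
    "string": str,
    "number": (int, float),
    "integer": int,
    "boolean": bool,
    "object": dict,
    "array": list,
    "null": type(None),
}


def _matches(value: Any, type_name: str) -> bool:
    # bool is a subclass of int in Python, so exclude it from numeric types.
    if type_name in ("integer", "number") and isinstance(value, bool):
        return False
    return isinstance(value, _JSON_TYPES.get(type_name, object))


def check(result: dict[str, Any], schema: dict[str, Any]) -> list[str]:
    """Return error messages for missing required keys and wrong top-level types."""
    errors = [
        f"missing required field: {key}"
        for key in schema.get("required", [])
        if key not in result
    ]
    for key, spec in schema.get("properties", {}).items():
        if key not in result or "type" not in spec:
            continue
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if not any(_matches(result[key], t) for t in allowed):
            errors.append(f"{key}: expected {allowed}, got {type(result[key]).__name__}")
    return errors


def extract_with_retry(
    page: str,
    schema: dict[str, Any],
    extract: Callable[[str, dict[str, Any], list[str]], dict[str, Any]],
) -> tuple[dict[str, Any], bool]:
    """Call extract once; on validation errors, retry once with the errors fed back."""
    result = extract(page, schema, [])
    errors = check(result, schema)
    if errors:
        result = extract(page, schema, errors)
        errors = check(result, schema)
    return result, not errors
```

The second element of the returned tuple plays the role of the `schemaValid` flag: it is false only when the retry also fails to conform, and the best-effort result is still returned.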

How it compares to alternatives

Approach | Stealth fetch | Structured to your schema | Conformance validated | Batch fetch amortization
Raw stealth fetcher | Yes | No (raw HTML/markdown) | No | Depends
Model call on your own fetched HTML | No (you fetch) | Yes | Usually not | No
Browser automation + hand-written selectors | Yes | Yes (you script it) | Manual | You build it
Structured Data Extractor | Yes | Yes (JSON Schema in) | Yes, per-page retry | Yes (one fetch per batch)

The raw-fetch and own-LLM approaches each solve half the problem; hand-written selectors solve both but cost ongoing maintenance. This Actor is the intersection — stealth fetch, schema-constrained extraction, conformance check — applied across a batch so the expensive fetch setup is paid once.

Input

Field | Type | Required | Default | Description
urls | array | yes | -- | The pages to extract from (1–100). One shared outputSchema applies to all, so pass pages of the same kind. Fetched in one pass; stealth escalation is automatic and amortized across the batch. Public, unauthenticated pages only.
outputSchema | object | -- | -- | JSON Schema for the output shape, applied to every URL. Each result is validated; schemaValid reports per-URL conformance. Omit for best-effort extraction. Mark expected-but-optional fields nullable so the model returns null rather than guessing.
maxInputTokens | integer | -- | 32000 | Upper bound on fetched content fed to the model per URL. Bounds per-URL cost and prevents context overflow. Range 2000–200000.
maxRuntimeSecs | integer | -- | 270 | Soft wall-clock budget for the whole run. The default keeps synchronous/x402 callers under the 300s cap; URLs not reached by the deadline return deferred (uncharged). Range 60–600.
proxyGeo | string | -- | -- | ISO 3166-1 alpha-2 country code (e.g. US, DE) for residential routing. Leave empty for default routing; stealth escalation still applies.
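
As a minimal sketch of the nullable-field advice in the outputSchema row, the helper below builds a run input whose optional fields accept null. It assumes the Actor accepts the standard JSON Schema form `"type": ["string", "null"]`; the helper names `nullable` and `build_run_input` are hypothetical.

```python
from typing import Any, Iterable


def nullable(json_type: str) -> dict[str, Any]:
    """Property spec that accepts either the given JSON type or null."""
    return {"type": [json_type, "null"]}


def build_run_input(
    urls: Iterable[str],
    properties: dict[str, Any],
    required: Iterable[str],
    max_input_tokens: int = 32000,  # documented default
) -> dict[str, Any]:
    """Assemble an input object using the field names from the table above."""
    return {
        "urls": list(urls),
        "outputSchema": {
            "type": "object",
            "properties": properties,
            "required": list(required),
        },
        "maxInputTokens": max_input_tokens,
    }


run_input = build_run_input(
    ["https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"],
    {"title": {"type": "string"}, "price": nullable("string"), "in_stock": nullable("boolean")},
    required=["title", "price", "in_stock"],
)
```

Listing a nullable field under `required` means the key must be present in the result while its value may be null, which fits the guidance that a missing value should come back as null rather than as a guess.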

Output

One dataset record per input URL.

Completed:

{
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "status": "completed",
  "result": { "title": "A Light in the Attic", "price": "£51.77", "in_stock": true },
  "schemaValid": true,
  "tokensUsed": 4200,
  "inputTokens": 3400,
  "error": null
}

Failed (couldn't fetch or extract) and deferred (time budget exhausted) — neither is charged:

{ "url": "https://example.com/blocked", "status": "failed", "result": {}, "schemaValid": true, "tokensUsed": 0, "inputTokens": 0, "error": "empty-content: failed" }
{ "url": "https://example.com/late", "status": "deferred", "result": {}, "schemaValid": true, "tokensUsed": 0, "inputTokens": 0, "error": "runtime-budget-exhausted" }

Field | Type | Description
url | string | The input URL this record corresponds to. One record per input URL.
status | string | completed (extracted, charged), failed (couldn't fetch or extract; not charged), or deferred (time budget hit before this URL; not charged, retry it).
result | object | Extracted data. Conforms to outputSchema when provided (subject to schemaValid); best-effort otherwise.
schemaValid | boolean | Whether result validated against outputSchema. Always true when no schema was supplied. false means the model couldn't conform even after a retry.
tokensUsed | integer | LLM tokens consumed for this URL (input + output), across the extraction and any retry.
inputTokens | integer | Page-content tokens fed to the model for this URL (input only), across the extraction and any retry. This is the figure the per-1,000-token charge meters, so it explains the variable part of this URL's cost.
error | string | Reason when status is not completed; null on success.
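
Because failed and deferred records also carry `"schemaValid": true` with an empty result, downstream code should branch on `status` first and only then on `schemaValid`. A minimal sketch of that triage follows; the bucket names are hypothetical.

```python
from typing import Any


def triage(records: list[dict[str, Any]]) -> dict[str, list[dict[str, Any]]]:
    """Sort dataset records into buckets by status, then by schema conformance."""
    buckets: dict[str, list[dict[str, Any]]] = {
        "trusted": [],  # completed and schema-valid
        "review": [],   # completed, but schemaValid is false
        "retry": [],    # deferred: time budget ran out, safe to resubmit
        "failed": [],   # could not be fetched or extracted
    }
    for rec in records:
        status = rec.get("status")
        if status == "completed":
            key = "trusted" if rec.get("schemaValid") else "review"
        elif status == "deferred":
            key = "retry"
        else:
            key = "failed"
        buckets[key].append(rec)
    return buckets
```

Only the `trusted` bucket is safe to insert into a typed store without inspection; `review` records still carry a best-effort result worth checking by hand.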

Example

{
  "urls": [
    "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
    "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"
  ],
  "outputSchema": {
    "type": "object",
    "properties": { "title": { "type": "string" }, "price": { "type": "string" } },
    "required": ["title", "price"]
  }
}

curl -X POST "https://api.apify.com/v2/acts/shelvick~structured-extractor/run-sync-get-dataset-items?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls":["https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html","https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html"],"outputSchema":{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"string"}},"required":["title","price"]}}'

Calling from an AI agent

Apify MCP server

The Actor is available as a callable tool on mcp.apify.com. The input schema is self-documenting — an LLM can construct a correct call from the tool description and field names alone. Pay per call via x402 USDC on Base or Skyfire managed tokens. Note the 300s synchronous cap: keep maxRuntimeSecs at its default for sync/x402 calls and let large batches defer the tail, or use the async path for big batches.

Apify SDK (Python)

from apify_client import ApifyClient

client = ApifyClient("YOUR_TOKEN")

# Start the Actor and block until the run finishes.
run = client.actor("shelvick/structured-extractor").call(
    run_input={
        "urls": [
            "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
            "https://books.toscrape.com/catalogue/tipping-the-velvet_999/index.html",
        ],
        "outputSchema": {
            "type": "object",
            "properties": {"title": {"type": "string"}, "price": {"type": "string"}},
            "required": ["title", "price"],
        },
    }
)

# The default dataset holds one record per input URL.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["url"], item["status"], item["schemaValid"], item["result"])

REST API

POST https://api.apify.com/v2/acts/shelvick~structured-extractor/run-sync-get-dataset-items?token=YOUR_TOKEN

For large batches, start asynchronously and poll:

POST https://api.apify.com/v2/acts/shelvick~structured-extractor/runs?token=YOUR_TOKEN
GET https://api.apify.com/v2/actor-runs/{runId}/dataset/items?token=YOUR_TOKEN
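
The two endpoints above start a run and read its results; between them, a caller needs to wait for the run to finish. A rough sketch of that polling step using only the standard library is below. It assumes the run object is readable at `GET /v2/actor-runs/{runId}` with the status under `data.status`, and that the terminal statuses are the four listed; the function names, interval, and timeout are hypothetical choices.

```python
import json
import time
import urllib.request
from typing import Callable

# Run statuses assumed to be terminal (the run no longer changes state).
TERMINAL = {"SUCCEEDED", "FAILED", "TIMED-OUT", "ABORTED"}


def fetch_status(run_id: str, token: str) -> str:
    """Read the current status of a run (assumed endpoint and response shape)."""
    url = f"https://api.apify.com/v2/actor-runs/{run_id}?token={token}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)["data"]["status"]


def wait_for_run(
    get_status: Callable[[], str],
    interval: float = 5.0,
    timeout: float = 900.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll get_status until it returns a terminal status or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"run still {status} after {timeout}s")
        sleep(interval)
```

In use, this would be called as `wait_for_run(lambda: fetch_status(run_id, token))` before fetching the dataset items. Passing the status getter as a callable keeps the loop independent of the HTTP client.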

Pricing

Pay-per-event, billed only on success, and metered to page size so you pay for what each page actually costs to process. A successfully extracted URL carries two charges, both fired after that URL's record is pushed:

  • a flat per-URL base — covers the page fetch, extraction setup, schema validation, and any per-page retry; and
  • a per-1,000-input-token charge that scales with how much page content the model actually read (reported as inputTokens on each record).

Small pages cost less and large pages cost proportionally more, instead of every page paying one worst-case flat rate. failed and deferred URLs trigger neither charge, so a batch only ever costs you the URLs it actually extracted. Cap the variable part with maxInputTokens, and cap a whole run with maxTotalChargeUsd.
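
The two-part charge can be estimated from the dataset records with simple arithmetic: the number of completed URLs times the base rate, plus total input tokens divided by 1,000 times the token rate. In the sketch below both rates are placeholder parameters, not published prices, and it assumes the token charge is billed fractionally rather than rounded up per URL.

```python
from typing import Any


def estimate_cost(
    records: list[dict[str, Any]],
    base_per_url: float,
    per_1k_input_tokens: float,
) -> float:
    """Estimate a batch's charge in USD; only completed records are counted."""
    completed = [r for r in records if r.get("status") == "completed"]
    base = len(completed) * base_per_url
    tokens = sum(r.get("inputTokens", 0) for r in completed)
    variable = tokens / 1000 * per_1k_input_tokens
    return round(base + variable, 6)


# Placeholder rates for illustration; take the real ones from the Pricing tab.
example = estimate_cost(
    [
        {"status": "completed", "inputTokens": 3400},
        {"status": "completed", "inputTokens": 1600},
        {"status": "failed", "inputTokens": 0},
    ],
    base_per_url=0.051,
    per_1k_input_tokens=0.01,
)
```

With these placeholder rates the estimate is 2 × 0.051 + 5,000 / 1,000 × 0.01 = 0.152; the failed record contributes nothing.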

See the Pricing tab on this Store page for the current per-event rates and any active subscriber discounts.

Behavior

Run-level failures (rare): invalid input fails the run before any work — empty urls, more than 100 URLs, or maxInputTokens/maxRuntimeSecs out of range. Nothing is charged.
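
Since invalid input fails the whole run, a caller may want to check the documented limits locally before submitting. The following is a small sketch of such a pre-flight check; the function name and error messages are hypothetical, while the ranges and defaults come from the Input table.

```python
from typing import Any


def validate_run_input(run_input: dict[str, Any]) -> list[str]:
    """Return a list of problems that would fail the run; empty means OK."""
    errors = []
    urls = run_input.get("urls") or []
    if not 1 <= len(urls) <= 100:
        errors.append(f"urls must contain 1-100 entries, got {len(urls)}")
    tokens = run_input.get("maxInputTokens", 32000)
    if not 2000 <= tokens <= 200000:
        errors.append(f"maxInputTokens must be 2000-200000, got {tokens}")
    secs = run_input.get("maxRuntimeSecs", 270)
    if not 60 <= secs <= 600:
        errors.append(f"maxRuntimeSecs must be 60-600, got {secs}")
    return errors
```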

Per-URL outcomes:

  • completed — extracted. Check schemaValid for conformance to your schema.
  • failed — the page couldn't be fetched (defended beyond stealth, blank, login wall) or the model returned unparseable output; error says which.
  • deferred — the run's maxRuntimeSecs budget was exhausted before this URL was processed. Retry it (uncharged).
  • schemaValid: false — extraction completed but didn't validate even after a retry; the best-effort result is still returned.
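
Because deferred URLs are uncharged, a caller can resubmit just those until the batch settles. A minimal sketch of that loop is below, assuming a hypothetical `run_batch(urls)` wrapper that calls the Actor and returns its dataset records; the round limit is an arbitrary safety bound.

```python
from typing import Any, Callable

Record = dict[str, Any]


def run_until_settled(
    urls: list[str],
    run_batch: Callable[[list[str]], list[Record]],
    max_rounds: int = 3,
) -> list[Record]:
    """Resubmit deferred URLs until none remain or max_rounds is reached.

    Returns the latest record for each URL, in the original input order.
    """
    latest: dict[str, Record] = {}
    pending = list(urls)
    for _ in range(max_rounds):
        if not pending:
            break
        for rec in run_batch(pending):
            latest[rec["url"]] = rec
        # Keep only URLs whose most recent record is still deferred.
        pending = [u for u in pending if latest.get(u, {}).get("status") == "deferred"]
    return [latest[u] for u in urls if u in latest]
```

Failed URLs are deliberately not retried here, since a page that is blocked or blank is unlikely to succeed on an immediate resubmission.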

Performance expectations:

  • One fetch pass covers the whole batch; a stealth-browser/proxy launch is shared across URLs rather than repeated.
  • Cooperative pages: a few seconds each, fetched concurrently. Bot-defended pages add stealth latency.
  • Extraction runs per URL; total time scales with batch size. At default maxRuntimeSecs (270s) a stealth batch of roughly 20 URLs completes; larger batches return the tail as deferred.
  • Larger maxInputTokens increases per-URL latency and cost on big pages.
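
For synchronous callers, one way to act on the figures above is to split a long URL list into batches small enough to finish inside the default time budget. The sketch below uses a batch size of 20, taken from the rough stealth-batch estimate above; treat it as a starting point to tune, since cooperative pages may allow larger batches.

```python
def chunk(urls: list[str], size: int = 20) -> list[list[str]]:
    """Split urls into consecutive batches of at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [urls[i:i + size] for i in range(0, len(urls), size)]


# 45 URLs become batches of 20, 20, and 5.
batches = chunk([f"https://example.com/item/{n}" for n in range(45)])
```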

FAQ

Can the URLs have different output shapes? No — one outputSchema applies to the whole batch. Pass pages of the same kind (all products, all articles). For a different shape, use a separate run.

Am I charged for URLs that fail or get deferred? No. Both charges (the per-URL base and the per-1,000-input-token charge) fire only for completed URLs. failed and deferred URLs are free, so a batch costs only what it actually extracted.

What's deferred and how do I handle it? The run hit its maxRuntimeSecs budget before reaching that URL. It's uncharged and retry-friendly — re-submit the deferred URLs, or raise maxRuntimeSecs (and use the async API) for bigger batches in one run.

How do I keep cost down on very large pages? Lower maxInputTokens. It caps how much of each page reaches the model — the dominant cost — and each content component is capped independently so no single field can blow the budget.

What does schemaValid: false mean? The model couldn't produce output conforming to your schema for that URL even after a retry. The result is still returned for inspection; simplify the schema or mark uncertain fields nullable.

What this doesn't do

  • No authentication. Public, unauthenticated pages only. It won't log in or submit credentials.
  • No per-URL schemas. One shared outputSchema per run — pages should be the same kind.
  • No page interaction. It doesn't click, fill forms, or navigate multi-step flows before extracting.
  • No crawling. It extracts from the URLs you pass; it won't discover or follow links.
  • No CAPTCHA solving / no file parsing. Interactive-CAPTCHA pages return failed; it extracts from web pages, not uploaded PDFs/images.

For raw page content (HTML or markdown) without an extraction step, use a page-fetching tool instead — this Actor adds an LLM extraction cost you don't need if you only want content. For clicking, form-filling, or authenticated sessions before extraction, use a browser-automation tool. For discovering links to extract, run a crawler first and pass its URLs here as a batch.


Design notes: www.scotthelvick.com/tools/structured-extractor