# Structured Data Extractor (`automation-lab/structured-data-extractor`) Actor

This actor extracts structured data markup from web pages. It parses all three major formats: JSON-LD (`<script type="application/ld+json">`), Microdata (`itemscope`/`itemprop`), and RDFa (`typeof`/`property`). For each page, it returns the full structured data objects, detected Schema.org...

- **URL**: https://apify.com/automation-lab/structured-data-extractor.md
- **Developed by:** [Stas Persiianenko](https://apify.com/automation-lab) (community)
- **Categories:** SEO tools, Developer tools
- **Stats:** 15 total users, 2 monthly users, 100.0% runs succeeded, 1 bookmark
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price per event decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, built for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, an API, or an MCP server.
The word "Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Structured Data Extractor

Extract JSON-LD, Microdata, and RDFa structured data from web pages for SEO auditing and Schema.org validation.

### What does Structured Data Extractor do?

This Actor extracts **structured data markup** from web pages. It parses all three major formats: **JSON-LD** (`<script type="application/ld+json">`), **Microdata** (`itemscope`/`itemprop`), and **RDFa** (`typeof`/`property`). For each page, it returns the full structured data objects, the detected Schema.org types, and per-format counts. Use it to audit rich snippet eligibility, verify your Schema.org implementation, or monitor structured data across your entire site.

### Use cases

- **SEO specialists** -- verify Schema.org markup implementation across hundreds of pages in a single run
- **Rich snippet auditors** -- check that pages have the right structured data types for Google rich results (Product, Article, FAQ, etc.)
- **Competitive analysts** -- see what structured data competitors use and identify markup opportunities you are missing
- **Migration testers** -- ensure structured data survives CMS, domain, or URL migrations without data loss
- **Content monitoring teams** -- track structured data changes across pages over time to catch regressions
- **AI/ML engineers** -- extract structured Schema.org data to build knowledge graphs, enrich RAG pipelines, or create training datasets with clean entity relationships

### Why use Structured Data Extractor?

- **All three formats** -- extracts JSON-LD, Microdata, and RDFa in a single pass, so you never miss markup regardless of implementation
- **Full data objects** -- returns the complete structured data payload, not just type names, so you can inspect every property
- **Batch processing** -- analyze hundreds of URLs at once instead of checking pages one at a time in Google's testing tool
- **AI-ready structured output** -- each result includes format counts, detected Schema.org types, and boolean flags, ready for LLM training data or knowledge graph construction
- **API and integration ready** -- trigger runs programmatically or connect to dashboards via Google Sheets, Zapier, and more
- **Pay-per-event pricing** -- only pay for pages you actually analyze, starting at $0.001 per URL

### Input parameters

| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `urls` | string[] | Yes | -- | List of web page URLs to extract structured data from |

#### Example input

```json
{
    "urls": [
        "https://www.google.com",
        "https://en.wikipedia.org/wiki/Web_scraping",
        "https://www.imdb.com/title/tt0111161/"
    ]
}
````

### Output example

```json
{
    "url": "https://en.wikipedia.org/wiki/Web_scraping",
    "title": "Web scraping - Wikipedia",
    "structuredDataCount": 2,
    "jsonLdCount": 1,
    "microdataCount": 1,
    "rdfaCount": 0,
    "schemaTypes": ["Article", "BreadcrumbList"],
    "structuredData": [
        {
            "type": "Article",
            "format": "json-ld",
            "data": { "@type": "Article", "name": "Web scraping", "headline": "Web scraping" }
        }
    ],
    "hasJsonLd": true,
    "hasMicrodata": true,
    "hasRdfa": false,
    "error": null,
    "extractedAt": "2026-03-01T12:00:00.000Z"
}
```

### Output fields

| Field | Type | Description |
|-------|------|-------------|
| `url` | string | The analyzed page URL |
| `title` | string | The page title |
| `structuredDataCount` | number | Total number of structured data items found |
| `jsonLdCount` | number | Number of JSON-LD blocks found |
| `microdataCount` | number | Number of Microdata items found |
| `rdfaCount` | number | Number of RDFa items found |
| `schemaTypes` | string[] | List of detected Schema.org types |
| `structuredData` | array | Full structured data objects with type, format, and data |
| `hasJsonLd` | boolean | Whether the page contains any JSON-LD |
| `hasMicrodata` | boolean | Whether the page contains any Microdata |
| `hasRdfa` | boolean | Whether the page contains any RDFa |
| `error` | string or null | Error message if extraction failed, `null` otherwise |
| `extractedAt` | string | ISO timestamp of the extraction |
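
As a rough illustration of how these fields fit together, the sketch below aggregates a list of output items into a small audit summary. The function name `summarize_items` and the shape of the returned summary are hypothetical; only the field names (`url`, `error`, `schemaTypes`, `structuredData`, `format`) come from the table and output example above.

```python
from collections import Counter
from typing import Any


def summarize_items(items: list[dict[str, Any]]) -> dict[str, Any]:
    """Aggregate extractor output items into a small audit summary.

    Hypothetical helper: relies only on the documented output fields.
    """
    # Pages whose `error` field is set are reported separately.
    failed = [item["url"] for item in items if item.get("error")]
    succeeded = [item for item in items if not item.get("error")]

    format_counts: Counter = Counter()
    type_counts: Counter = Counter()
    for item in succeeded:
        # Each `structuredData` entry carries the format it was found in.
        for entry in item.get("structuredData", []):
            format_counts[entry.get("format", "unknown")] += 1
        type_counts.update(item.get("schemaTypes", []))

    return {
        "pages_ok": len(succeeded),
        "pages_failed": failed,
        "formats": dict(format_counts),
        "types": dict(type_counts),
    }
```

The `items` argument is expected to be the list returned by the dataset calls shown in the API examples in this document.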

### How to extract structured data from web pages

1. Go to [Structured Data Extractor](https://apify.com/automation-lab/structured-data-extractor) on Apify Store
2. Enter one or more URLs in the `urls` field
3. Click **Start** to run the extractor
4. Wait for results -- each page is analyzed in seconds
5. Review the output for JSON-LD, Microdata, and RDFa structured data found on each page
6. Download results as JSON, CSV, or Excel, or connect via API

### How much does it cost to extract structured data?

Structured Data Extractor uses Apify's pay-per-event pricing model. You only pay for what you use.

| Event | Price | Description |
|-------|-------|-------------|
| Start | $0.035 | One-time per run |
| URL extracted | $0.001 | Per page extracted |

**Example costs:**

- 10 pages: $0.035 + 10 x $0.001 = **$0.045**
- 100 pages: $0.035 + 100 x $0.001 = **$0.135**
- 1,000 pages: $0.035 + 1,000 x $0.001 = **$1.035**
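
The arithmetic above can be sketched as a small helper. This is a minimal estimate that assumes the list prices in the table, ignores any Apify Store subscription discounts, and assumes every submitted URL is billed as one "URL extracted" event; the function name `estimate_run_cost` is hypothetical.

```python
from decimal import Decimal

# List prices from the pricing table; discounts are not modeled here.
START_FEE_USD = Decimal("0.035")
PER_URL_FEE_USD = Decimal("0.001")


def estimate_run_cost(url_count: int) -> Decimal:
    """Estimate the cost in USD of a single run over `url_count` pages."""
    if url_count < 0:
        raise ValueError("url_count must not be negative")
    # One start event per run plus one event per extracted URL.
    return START_FEE_USD + PER_URL_FEE_USD * url_count
```

`Decimal` is used instead of `float` so that small per-event prices add up without rounding noise.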

### Using the Apify API

You can start Structured Data Extractor programmatically from your own applications using the Apify API. The following examples show how to run the Actor from Node.js, Python, and cURL.

#### Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_TOKEN' });
const run = await client.actor('automation-lab/structured-data-extractor').call({
    urls: ['https://en.wikipedia.org/wiki/Web_scraping'],
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```

#### Python

```python
from apify_client import ApifyClient

client = ApifyClient('YOUR_TOKEN')
run = client.actor('automation-lab/structured-data-extractor').call(run_input={
    'urls': ['https://en.wikipedia.org/wiki/Web_scraping'],
})
items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)
```

#### cURL

```bash
curl -X POST "https://api.apify.com/v2/acts/automation-lab~structured-data-extractor/runs?token=YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://en.wikipedia.org/wiki/Web_scraping"]
  }'
```

### Use with Claude AI (MCP)

This Actor is available as a tool in Claude AI through the Model Context Protocol (MCP). Add it to Claude Desktop, Cursor, Windsurf, or any MCP-compatible client.

#### Setup for Claude Code

```bash
claude mcp add --transport http apify "https://mcp.apify.com?tools=automation-lab/structured-data-extractor"
```

#### Setup for Claude Desktop, Cursor, or VS Code

Add this to your MCP config file:

```json
{
    "mcpServers": {
        "apify": {
            "url": "https://mcp.apify.com?tools=automation-lab/structured-data-extractor"
        }
    }
}
```

#### Example prompts

- "Extract structured data from this product page: https://www.example.com/product/123"
- "Get schema.org markup from these URLs and tell me which types they use"
- "Check if these pages have JSON-LD structured data for rich snippets"

Learn more in the [Apify MCP documentation](https://docs.apify.com/platform/integrations/mcp).

### Integrations

Structured Data Extractor works with all major automation platforms available on Apify. Export results to **Google Sheets** to build a structured data audit dashboard across your site. Use **Zapier** or **Make** to trigger extraction runs whenever new pages are published. Send alerts to **Slack** when pages are missing expected Schema.org types. Pipe results into **n8n** workflows for custom validation logic, or set up **webhooks** to trigger downstream actions as soon as a run finishes. Chain it with JSON-LD Validator to first extract and then validate your structured data.

### Tips and best practices

- **Focus on pages eligible for rich results** -- prioritize product pages, articles, FAQ pages, and recipe pages where structured data directly impacts search appearance
- **Filter by `schemaTypes`** to quickly find pages missing specific types like Product, Article, or BreadcrumbList
- **Use `structuredDataCount: 0` to find pages with no markup** -- these are your biggest opportunities for SEO improvement
- **Combine with JSON-LD Validator** to first extract structured data with this Actor, then validate the JSON-LD blocks for errors and warnings
- **Schedule regular runs** to catch structured data regressions after site deployments or CMS updates
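
The two filtering tips above could look roughly like this once the dataset items are loaded into a Python list. Both helper names are hypothetical, and pages with a non-null `error` are skipped on the assumption that a failed extraction says nothing about the page's markup.

```python
from typing import Any


def pages_missing_type(items: list[dict[str, Any]], required_type: str) -> list[str]:
    """Return URLs of successfully analyzed pages that lack `required_type`."""
    return [
        item["url"]
        for item in items
        if not item.get("error")
        and required_type not in item.get("schemaTypes", [])
    ]


def pages_without_markup(items: list[dict[str, Any]]) -> list[str]:
    """Return URLs of successfully analyzed pages with no structured data."""
    return [
        item["url"]
        for item in items
        if not item.get("error") and item.get("structuredDataCount", 0) == 0
    ]
```

For example, `pages_missing_type(items, "Product")` would list the pages that still need Product markup.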

### Legality

This tool analyzes publicly accessible web content. Automated analysis of public web resources is standard practice in SEO and web development. Always respect robots.txt directives and rate limits when analyzing third-party websites. For personal data processing, ensure compliance with applicable privacy regulations.

### FAQ

**What structured data formats does this Actor support?**
It extracts all three major formats: JSON-LD (script tags), Microdata (itemscope/itemprop attributes), and RDFa (typeof/property attributes).

**Does it validate the structured data?**
No. This Actor extracts and reports what structured data exists on a page. To validate JSON-LD syntax and required fields, use the JSON-LD Validator Actor.

**Can it extract structured data from JavaScript-rendered pages?**
No. The Actor uses plain HTTP requests and parses the initial HTML response. Structured data that is injected by client-side JavaScript after page load will not be captured.

**The Actor returns `structuredDataCount: 0` for a page I know has structured data. Why?**
The Actor uses plain HTTP requests and parses the initial HTML. If the structured data is injected by client-side JavaScript after page load (common with React, Angular, or Vue apps), it will not be captured. To see what the Actor receives, view the page source (Ctrl+U) rather than the browser's inspector, which shows the DOM after scripts have run.
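
The same check can be approximated in code. The sketch below counts JSON-LD script tags in a raw HTML string using only the Python standard library; it is not the Actor's own parser, and the names `JsonLdScriptCounter` and `count_json_ld_scripts` are hypothetical. A count of zero on the raw response suggests the markup is added later by JavaScript.

```python
from html.parser import HTMLParser


class JsonLdScriptCounter(HTMLParser):
    """Counts <script type="application/ld+json"> tags in static HTML."""

    def __init__(self) -> None:
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names before this call.
        if tag != "script":
            return
        script_type = dict(attrs).get("type") or ""
        if script_type.strip().lower() == "application/ld+json":
            self.count += 1


def count_json_ld_scripts(html: str) -> int:
    """Return the number of JSON-LD script tags present in `html`."""
    parser = JsonLdScriptCounter()
    parser.feed(html)
    parser.close()
    return parser.count
```

To try it on a live page, pass in the body of a plain HTTP response (for instance one fetched with `urllib.request.urlopen`), not HTML copied from the browser's inspector.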

**Why does the Actor find Microdata but not JSON-LD on a page?**
Some websites use Microdata (HTML attributes like `itemscope` and `itemprop`) instead of JSON-LD script tags. Both are valid formats for structured data. The Actor extracts both, and the `format` field in each `structuredData` entry tells you which format was used.

### Other SEO tools

- [JSON-LD Validator](https://apify.com/automation-lab/jsonld-validator) -- Validate JSON-LD structured data for errors and warnings
- [OG Meta Extractor](https://apify.com/automation-lab/og-meta-extractor) -- Extract Open Graph meta tags from web pages
- [SEO Title Checker](https://apify.com/automation-lab/seo-title-checker) -- Check page titles for SEO best practices
- [Subdomain Finder](https://apify.com/automation-lab/subdomain-finder) -- Discover subdomains via certificate transparency logs
- [Domain Availability Checker](https://apify.com/automation-lab/domain-availability-checker) -- Check if domain names are available for registration

# Actor input Schema

## `urls` (type: `array`):

List of web page URLs to extract structured data from.

## Actor input object example

```json
{
  "urls": [
    "https://www.google.com",
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://www.imdb.com/title/tt0111161/"
  ]
}
```

# Actor output Schema

## `overview` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://www.google.com",
        "https://en.wikipedia.org/wiki/Web_scraping",
        "https://www.imdb.com/title/tt0111161/"
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("automation-lab/structured-data-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "urls": [
        "https://www.google.com",
        "https://en.wikipedia.org/wiki/Web_scraping",
        "https://www.imdb.com/title/tt0111161/",
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("automation-lab/structured-data-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://www.google.com",
    "https://en.wikipedia.org/wiki/Web_scraping",
    "https://www.imdb.com/title/tt0111161/"
  ]
}' |
apify call automation-lab/structured-data-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=automation-lab/structured-data-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Structured Data Extractor",
        "description": "This actor extracts structured data markup from web pages. It parses all three major formats: JSON-LD (`<script type=\"application/ld+json\">`), Microdata (`itemscope`/`itemprop`), and RDFa (`typeof`/`property`). For each page, it returns the full structured data objects, detected Schema.org...",
        "version": "0.1",
        "x-build-id": "fMxOcz1pvQqZUGBrK"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/automation-lab~structured-data-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-automation-lab-structured-data-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/automation-lab~structured-data-extractor/runs": {
            "post": {
                "operationId": "runs-sync-automation-lab-structured-data-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/automation-lab~structured-data-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-automation-lab-structured-data-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls"
                ],
                "properties": {
                    "urls": {
                        "title": "URLs to extract",
                        "type": "array",
                        "description": "List of web page URLs to extract structured data from.",
                        "items": {
                            "type": "string"
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
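
If you prefer not to install a client library, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library alone. This is a minimal sketch: the helper names are hypothetical, the 300-second timeout is an assumed default, and the code assumes the endpoint responds with the dataset items as a JSON array, which the specification above does not spell out.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "automation-lab~structured-data-extractor"


def build_sync_request(token: str, urls: list[str]) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    # The spec above defines `token` as a required query parameter.
    query = urllib.parse.urlencode({"token": token})
    endpoint = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    body = json.dumps({"urls": urls}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_fetch_items(token: str, urls: list[str], timeout: float = 300.0):
    """Run the Actor synchronously and return the parsed JSON response.

    Assumption: the response body is a JSON array of dataset items.
    """
    request = build_sync_request(token, urls)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Splitting request construction from the network call keeps the URL and payload easy to inspect or unit-test without spending a run.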
