# Web Content Extractor API — URL to JSON (`george.the.developer/web-content-extractor-api`) Actor

Extract structured JSON from any webpage. Articles, products, recipes, jobs. Auto-detects content type. Returns metadata, headings, images, links. For AI agents and RAG.

- **URL**: https://apify.com/george.the.developer/web-content-extractor-api.md
- **Developed by:** [George Kioko](https://apify.com/george.the.developer) (community)
- **Categories:** Lead generation, Marketing
- **Stats:** 11 total users, 2 monthly users, 0.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $3.00 / 1,000 content extractions

This Actor is paid per event and usage. You are charged both the fixed price for specific events and for Apify platform usage.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, built for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 🔍 Web Content Extractor API — URL to Structured JSON

> **One API call. Any URL. Clean structured JSON.** Extract articles, products, recipes, job postings, and more — automatically detected and organized. Built for AI agents, RAG pipelines, and data workflows.

---

### Architecture Overview

```mermaid
flowchart TB
    subgraph Input
        URL[/"URL: any webpage"/]
    end

    subgraph Processing["Extraction Pipeline"]
        FETCH["1. Fetch & Parse HTML"]
        DETECT["2. Auto-Detect Content Type"]
        SCORE["3. Score Content Blocks"]
        EXTRACT["4. Extract Structured Data"]
        ENRICH["5. Enrich with Metadata"]
    end

    subgraph Detection["Content Type Detection"]
        ART["Article"]
        PROD["Product"]
        REC["Recipe"]
        JOB["Job Posting"]
        EVT["Event"]
        WEB["Generic Webpage"]
    end

    subgraph Output["Structured JSON"]
        META["Metadata: title, author, date, image"]
        CONTENT["Content: text, headings, word count"]
        MEDIA["Media: images, links"]
        SCHEMA["JSON-LD Structured Data"]
        TYPED["Type-Specific: price, ingredients, salary..."]
    end

    URL --> FETCH --> DETECT --> SCORE --> EXTRACT --> ENRICH
    DETECT --> ART & PROD & REC & JOB & EVT & WEB
    ENRICH --> META & CONTENT & MEDIA & SCHEMA & TYPED

    style Input fill:#1a1a2e,color:#fff
    style Processing fill:#16213e,color:#fff
    style Detection fill:#0f3460,color:#fff
    style Output fill:#533483,color:#fff
````

### What Makes This Different?

| Feature | This Actor | Typical Scrapers |
|---------|-----------|-----------------|
| Output format | **Structured JSON** | Raw HTML |
| Content detection | **Auto-detects 6 types** | Manual configuration |
| Setup time | **Zero** — just pass URL | Hours of selector writing |
| AI-ready | **Yes** — clean text for LLMs | Needs post-processing |
| Batch support | **Up to 25 URLs** per call | One at a time |
| Response time | **1-3 seconds** | 5-30 seconds |

***

### Request Flow

```mermaid
sequenceDiagram
    participant Client as Your App
    participant API as Content Extractor
    participant Web as Target Website
    participant Cache as 30-min Cache

    Client->>API: GET /extract?url=example.com
    API->>Cache: Check cache

    alt Cache Hit
        Cache-->>API: Return cached result
        API-->>Client: JSON response (instant)
    else Cache Miss
        API->>Web: Fetch HTML
        Web-->>API: HTML content
        API->>API: Detect type + Extract + Score
        API->>Cache: Store result
        API-->>Client: Structured JSON (1-3s)
    end

    Note over Client,API: PPE charge: $0.003 per extraction
```
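
The cache-hit and cache-miss branches in the diagram follow a standard time-to-live pattern. The sketch below illustrates that pattern on the client side; it is not the Actor's internal implementation. `TTLCache`, `get_or_fetch`, and the `fetch` callable are hypothetical names, and only the 30-minute window comes from the diagram.

```python
import time
from typing import Callable

CACHE_TTL_SECONDS = 30 * 60  # the "30-min Cache" from the diagram


class TTLCache:
    """Hypothetical in-memory cache keyed by URL with a fixed time-to-live."""

    def __init__(self, ttl: float = CACHE_TTL_SECONDS,
                 clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._entries: dict[str, tuple[float, dict]] = {}

    def get_or_fetch(self, url: str, fetch: Callable[[str], dict]) -> dict:
        now = self._clock()
        hit = self._entries.get(url)
        # Cache hit: the stored entry is younger than the TTL.
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        # Cache miss (or expired entry): fetch, store, and return.
        result = fetch(url)
        self._entries[url] = (now, result)
        return result
```

Passing the function that performs the extraction request as `fetch` keeps the cache independent of any particular HTTP client.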

***

### API Endpoints

#### `GET /extract` — Extract from URL

```
GET /extract?url=https://techcrunch.com/2026/03/24/ai-news&format=full
```

| Parameter | Type | Required | Default | Options |
|-----------|------|----------|---------|---------|
| `url` | string | Yes | — | Any valid URL |
| `format` | string | No | `full` | `full`, `article`, `metadata` |
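
The target URL has to be percent-encoded when it is passed as a query parameter. The sketch below builds a `GET /extract` request URL from the two parameters in the table. The base URL is a placeholder: this README does not state the Actor's Standby URL, and `build_extract_url` is a hypothetical helper. Authentication is not shown.

```python
from urllib.parse import urlencode


def build_extract_url(base_url: str, target_url: str, fmt: str = "full") -> str:
    """Return a GET /extract URL with both parameters percent-encoded.

    base_url is an assumption: substitute the Standby URL of your own run.
    """
    query = urlencode({"url": target_url, "format": fmt})
    return f"{base_url.rstrip('/')}/extract?{query}"


if __name__ == "__main__":
    # Placeholder host, used only to show the shape of the result.
    print(build_extract_url("https://standby.example.invalid",
                            "https://techcrunch.com/2026/03/24/ai-news"))
```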

#### `POST /extract` — Extract with JSON body

```http
POST /extract
Content-Type: application/json

{
  "url": "https://techcrunch.com/2026/03/24/ai-news",
  "format": "article"
}
```

#### `POST /batch` — Extract multiple URLs

```http
POST /batch
Content-Type: application/json

{
  "urls": [
    "https://news.ycombinator.com",
    "https://techcrunch.com",
    "https://bbc.com/news"
  ],
  "format": "full"
}
```
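
Since the comparison table lists a limit of 25 URLs per call, a longer URL list needs to be split before it is sent to `/batch`. The following is a minimal sketch of that splitting step. `batch_payloads` is a hypothetical helper, and the sketch assumes the limit applies to the length of the `urls` array.

```python
import json
from typing import Iterator

MAX_BATCH_SIZE = 25  # "Up to 25 URLs per call" from the comparison table


def batch_payloads(urls: list[str], fmt: str = "full") -> Iterator[str]:
    """Yield JSON request bodies for POST /batch, each with at most 25 URLs."""
    for start in range(0, len(urls), MAX_BATCH_SIZE):
        chunk = urls[start:start + MAX_BATCH_SIZE]
        yield json.dumps({"urls": chunk, "format": fmt})


if __name__ == "__main__":
    sample = [f"https://example.com/page/{i}" for i in range(60)]
    for body in batch_payloads(sample):
        print(len(json.loads(body)["urls"]))
```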

#### `GET /` — Health check

Returns API status, version, and endpoint documentation.

***

### Content Type Detection

```mermaid
flowchart LR
    HTML["HTML Page"] --> CHECK{"Detect Signals"}

    CHECK -->|"og:type=article<br/>or article tag"| ART["**article**<br/>title, author, date,<br/>full text, headings"]
    CHECK -->|"Schema: Product<br/>or .product-price"| PROD["**product**<br/>name, price, rating,<br/>images, SKU, brand"]
    CHECK -->|"Schema: Recipe<br/>or .recipe"| REC["**recipe**<br/>ingredients, instructions,<br/>prep time, servings"]
    CHECK -->|"Schema: JobPosting<br/>or .job-title"| JOB["**job_posting**<br/>title, company, salary,<br/>location, type"]
    CHECK -->|"Schema: Event<br/>or .event-date"| EVT["**event**<br/>name, date, location,<br/>description"]
    CHECK -->|"No specific<br/>signals found"| WEB["**webpage**<br/>metadata, content,<br/>links, images"]

    style ART fill:#10b981,color:#fff
    style PROD fill:#f59e0b,color:#fff
    style REC fill:#ef4444,color:#fff
    style JOB fill:#3b82f6,color:#fff
    style EVT fill:#8b5cf6,color:#fff
    style WEB fill:#6b7280,color:#fff
```
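
The signals in the diagram can be approximated with the standard-library HTML parser. This is an illustrative sketch, not the Actor's detection code: the class and function names are hypothetical, and the precedence (schema-specific types before `article`) is an assumption, since the diagram does not say which signal wins when several match.

```python
import json
from html.parser import HTMLParser

# Signal tables taken from the diagram; the matching logic is an assumption.
SCHEMA_TYPES = {"Product": "product", "Recipe": "recipe",
                "JobPosting": "job_posting", "Event": "event"}
CLASS_HINTS = {"product-price": "product", "recipe": "recipe",
               "job-title": "job_posting", "event-date": "event"}


class SignalCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.signals: set[str] = set()
        self._in_jsonld = False
        self._jsonld_buf: list[str] = []

    def handle_starttag(self, tag, attrs):
        attr = dict(attrs)
        if tag == "article":
            self.signals.add("article")
        if tag == "meta" and attr.get("property") == "og:type" \
                and attr.get("content") == "article":
            self.signals.add("article")
        for cls in (attr.get("class") or "").split():
            if cls in CLASS_HINTS:
                self.signals.add(CLASS_HINTS[cls])
        if tag == "script" and attr.get("type") == "application/ld+json":
            self._in_jsonld, self._jsonld_buf = True, []

    def handle_data(self, data):
        if self._in_jsonld:
            self._jsonld_buf.append(data)

    def handle_endtag(self, tag):
        if tag != "script" or not self._in_jsonld:
            return
        self._in_jsonld = False
        try:
            doc = json.loads("".join(self._jsonld_buf))
        except json.JSONDecodeError:
            return  # malformed JSON-LD is ignored
        for node in doc if isinstance(doc, list) else [doc]:
            kind = node.get("@type") if isinstance(node, dict) else None
            if isinstance(kind, str) and kind in SCHEMA_TYPES:
                self.signals.add(SCHEMA_TYPES[kind])


def detect_type(html: str) -> str:
    collector = SignalCollector()
    collector.feed(html)
    collector.close()
    # Assumed precedence: specific schema types first, then article.
    for label in ("product", "recipe", "job_posting", "event", "article"):
        if label in collector.signals:
            return label
    return "webpage"
```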

***

### Output Examples

#### Article Extraction

```json
{
  "url": "https://techcrunch.com/2026/03/24/ai-agents",
  "type": "article",
  "metadata": {
    "title": "AI Agents Are Reshaping Enterprise Software",
    "description": "How autonomous AI agents are changing B2B SaaS",
    "author": "Sarah Perez",
    "date": "2026-03-24T10:00:00Z",
    "image": "https://techcrunch.com/hero.jpg",
    "siteName": "TechCrunch",
    "locale": "en-US",
    "canonical": "https://techcrunch.com/2026/03/24/ai-agents",
    "keywords": ["AI", "agents", "enterprise", "SaaS"]
  },
  "content": {
    "text": "The rise of AI agents represents a fundamental shift in how enterprise software operates. Unlike traditional chatbots...",
    "headings": [
      { "level": 2, "text": "What Are AI Agents?" },
      { "level": 2, "text": "The Enterprise Impact" },
      { "level": 3, "text": "Case Study: Salesforce" }
    ],
    "wordCount": 2847
  },
  "media": {
    "images": [
      { "src": "https://techcrunch.com/diagram.png", "alt": "AI agent architecture" }
    ],
    "links": [
      { "href": "https://openai.com/agents", "text": "OpenAI's agent framework" }
    ]
  },
  "structuredData": [{ "@type": "NewsArticle", "headline": "..." }],
  "extractedAt": "2026-03-24T12:34:56.789Z"
}
```
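
Because each heading carries its `level`, the `content.headings` array can be turned into an indented outline, for example to preview a document before indexing it. The sketch below assumes the response shape shown above; `render_outline` is a hypothetical helper.

```python
def render_outline(result: dict) -> list[str]:
    """Indent each heading by its depth relative to the shallowest heading."""
    headings = result.get("content", {}).get("headings", [])
    if not headings:
        return []
    top = min(h["level"] for h in headings)
    return ["  " * (h["level"] - top) + h["text"] for h in headings]


if __name__ == "__main__":
    example = {
        "content": {
            "headings": [
                {"level": 2, "text": "What Are AI Agents?"},
                {"level": 2, "text": "The Enterprise Impact"},
                {"level": 3, "text": "Case Study: Salesforce"},
            ]
        }
    }
    print("\n".join(render_outline(example)))
```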

#### Product Extraction

```json
{
  "url": "https://store.example.com/product/widget-pro",
  "type": "product",
  "metadata": { "title": "Widget Pro - Best Seller", "siteName": "Example Store" },
  "content": { "text": "The Widget Pro is our most popular...", "wordCount": 342 },
  "product": {
    "name": "Widget Pro",
    "price": "$49.99",
    "currency": "USD",
    "availability": "InStock",
    "rating": "4.8",
    "reviewCount": "1,247",
    "brand": "WidgetCo",
    "sku": "WP-2026",
    "images": ["https://store.example.com/widget-pro-1.jpg"]
  }
}
```
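
In the example above, `price`, `rating`, and `reviewCount` are returned as strings, so numeric comparisons need a conversion step first. This is a minimal sketch of one way to do that. `normalize_product` is a hypothetical helper, and it assumes US-style number formatting (comma as thousands separator, dot as decimal point), which may not hold for every site.

```python
import re
from decimal import Decimal
from typing import Optional


def _to_decimal(text: Optional[str]) -> Optional[Decimal]:
    """Pull the first number out of a string such as "$49.99" or "1,247"."""
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text or "")
    return Decimal(match.group().replace(",", "")) if match else None


def normalize_product(product: dict) -> dict:
    """Return numeric versions of the string fields in a `product` object."""
    price = _to_decimal(product.get("price"))
    rating = _to_decimal(product.get("rating"))
    reviews = _to_decimal(product.get("reviewCount"))
    return {
        "name": product.get("name"),
        "price": price,
        "currency": product.get("currency"),
        "rating": float(rating) if rating is not None else None,
        "review_count": int(reviews) if reviews is not None else None,
    }
```

`Decimal` is used for the price so that currency values are not subject to binary floating-point rounding.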

***

### Use Case Workflows

#### RAG Pipeline Integration

```mermaid
flowchart LR
    URLs["URL List<br/>100+ sources"] --> EXTRACT["Web Content<br/>Extractor API"]
    EXTRACT --> TEXT["Clean Text<br/>+ Metadata"]
    TEXT --> CHUNK["Text Chunking<br/>(LangChain)"]
    CHUNK --> EMBED["Embeddings<br/>(OpenAI)"]
    EMBED --> VECTOR["Vector DB<br/>(Pinecone)"]
    VECTOR --> RAG["RAG Query<br/>Engine"]
    RAG --> ANSWER["AI-Powered<br/>Answers"]

    style EXTRACT fill:#10b981,color:#fff
    style RAG fill:#3b82f6,color:#fff
```
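
The chunking step in this pipeline can be sketched without any framework. The example below splits `content.text` into overlapping word windows and attaches the source URL and title to each chunk. The function name, the chunk sizes, and the chunk dictionary layout are all assumptions; a library splitter such as the LangChain one named in the diagram would replace this in practice.

```python
def chunk_result(result: dict, chunk_words: int = 200, overlap: int = 40) -> list[dict]:
    """Split an extraction result into overlapping word-based chunks."""
    if not 0 <= overlap < chunk_words:
        raise ValueError("overlap must be non-negative and smaller than chunk_words")
    words = result["content"]["text"].split()
    title = result.get("metadata", {}).get("title")
    step = chunk_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        window = words[start:start + chunk_words]
        chunks.append({
            "text": " ".join(window),
            "source": result["url"],   # kept so answers can cite the page
            "title": title,
            "start_word": start,
        })
        # Stop once this window reaches the end, to avoid a redundant tail chunk.
        if start + chunk_words >= len(words):
            break
    return chunks
```

Each chunk keeps its `source`, so the vector store can return the originating URL alongside the matched text.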

#### Competitive Intelligence Pipeline

```mermaid
flowchart LR
    COMP["Competitor<br/>URLs"] --> EXTRACT["Web Content<br/>Extractor API"]
    EXTRACT --> PROD["Product Data:<br/>prices, features"]
    EXTRACT --> NEWS["News & Blog:<br/>announcements"]
    PROD --> DASH["Analytics<br/>Dashboard"]
    NEWS --> ALERT["Email<br/>Alerts"]

    style EXTRACT fill:#10b981,color:#fff
```

***

### Pricing

| Event | Price per call | Cost per 1,000 |
|-------|---------------|-----------------|
| **Content extraction** | $0.003 | $3.00 |
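
A quick way to budget a job is to multiply the per-event price by the number of extractions. The sketch below covers only the event charge from the table. Apify platform usage is billed separately, as the pricing note at the top of this page states, and the sketch assumes every URL counts as one extraction.

```python
from decimal import Decimal

PRICE_PER_EXTRACTION = Decimal("0.003")  # USD, from the pricing table


def estimate_event_cost(extractions: int) -> Decimal:
    """Estimate the pay-per-event charge in USD (platform usage excluded)."""
    if extractions < 0:
        raise ValueError("extractions must be non-negative")
    return PRICE_PER_EXTRACTION * extractions


if __name__ == "__main__":
    for count in (25, 1000, 10000):
        print(count, estimate_event_cost(count))
```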

#### Cost Comparison

| Solution | Cost per 1,000 URLs | Setup Time |
|----------|-------------------|------------|
| **This Actor** | **$3.00** | **0 minutes** |
| Diffbot | $299/month flat | Hours |
| Custom scraper | $50+ developer hours | Days |
| Manual copy-paste | 40+ hours labor | Forever |

***

### Integrations

| Platform | How to Connect |
|----------|---------------|
| **LangChain** | Use as Document Loader via HTTP |
| **LlamaIndex** | Custom reader pointing to /extract |
| **Zapier** | Webhook trigger -> GET /extract |
| **Make (Integromat)** | HTTP module -> POST /extract |
| **n8n** | HTTP Request node |
| **Apify Orchestrator** | Direct Actor call or Standby URL |

***

### FAQ

**Q: How fast is extraction?**
A: 1-3 seconds for a single URL. A batch request processes up to 25 URLs in parallel.

**Q: Does it handle paywalled content?**
A: It extracts whatever is publicly visible in the HTML. Paywalled content behind JavaScript auth won't be extracted.

**Q: What about JavaScript-rendered pages (SPAs)?**
A: The current version parses only the HTML returned by the server, so content rendered client-side by JavaScript is not extracted. For JS-heavy pages, pair it with our [Screenshot & PDF API](https://apify.com/george.the.developer/screenshot-pdf-api).

**Q: Is there a rate limit?**
A: No hard rate limit. Apify Standby handles concurrent requests automatically.

**Q: What languages are supported?**
A: Any language. The extractor works with HTML structure, not language-specific parsing.

***

### Related Actors

- [WebSight API](https://apify.com/george.the.developer/websight-api) — Technical website analysis (SEO, tech stack, AI score)
- [Screenshot & PDF API](https://apify.com/george.the.developer/screenshot-pdf-api) — Pixel-perfect webpage captures
- [Website Contact Scraper](https://apify.com/george.the.developer/website-contact-scraper) — Extract emails, phones, social links

***

*Built by [George Kioko](https://apify.com/george.the.developer) | 6,196+ data extraction jobs completed | 35+ production APIs*

# Actor input Schema

## `url` (type: `string`):

URL to extract content from

## `format` (type: `string`):

Output format: 'full' (all fields), 'article' (text-focused), 'metadata' (meta only)

## Actor input object example

```json
{
  "url": "https://example.com",
  "format": "full"
}
```
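
Validating the input before starting a run catches mistakes without spending an extraction. The following sketch checks the two fields described above. `build_actor_input` is a hypothetical helper, and the http/https restriction is an assumption, because the schema only requires `url` to be a string.

```python
from urllib.parse import urlparse

ALLOWED_FORMATS = ("full", "article", "metadata")  # enum from the input schema


def build_actor_input(url: str, fmt: str = "full") -> dict:
    """Return an input object for the Actor, or raise ValueError."""
    parsed = urlparse(url)
    # Assumption: only absolute http(s) URLs are useful targets.
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an absolute http(s) URL: {url!r}")
    if fmt not in ALLOWED_FORMATS:
        raise ValueError(f"format must be one of {ALLOWED_FORMATS}, got {fmt!r}")
    return {"url": url, "format": fmt}
```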

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "url": "https://example.com"
};

// Run the Actor and wait for it to finish
const run = await client.actor("george.the.developer/web-content-extractor-api").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "url": "https://example.com" }

# Run the Actor and wait for it to finish
run = client.actor("george.the.developer/web-content-extractor-api").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "url": "https://example.com"
}' |
apify call george.the.developer/web-content-extractor-api --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=george.the.developer/web-content-extractor-api",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Web Content Extractor API — URL to JSON",
        "description": "Extract structured JSON from any webpage. Articles, products, recipes, jobs. Auto-detects content type. Returns metadata, headings, images, links. For AI agents and RAG.",
        "version": "1.0",
        "x-build-id": "FsaUo4M0tGbepRWqO"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/george.the.developer~web-content-extractor-api/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-george.the.developer-web-content-extractor-api",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/george.the.developer~web-content-extractor-api/runs": {
            "post": {
                "operationId": "runs-sync-george.the.developer-web-content-extractor-api",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/george.the.developer~web-content-extractor-api/run-sync": {
            "post": {
                "operationId": "run-sync-george.the.developer-web-content-extractor-api",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "url"
                ],
                "properties": {
                    "url": {
                        "title": "URL",
                        "type": "string",
                        "description": "URL to extract content from"
                    },
                    "format": {
                        "title": "Output Format",
                        "enum": [
                            "full",
                            "article",
                            "metadata"
                        ],
                        "type": "string",
                        "description": "Output format: 'full' (all fields), 'article' (text-focused), 'metadata' (meta only)",
                        "default": "full"
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
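
For projects without an Apify client, the `run-sync-get-dataset-items` path in the specification can be called with the standard library alone. The sketch below only constructs the request: the path, the `token` query parameter, and the JSON body follow the specification above, while `build_sync_request` is a hypothetical helper.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "george.the.developer~web-content-extractor-api"


def build_sync_request(token: str, actor_input: dict) -> Request:
    """Build a POST request for the run-sync-get-dataset-items endpoint."""
    query = urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    return Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` sends the request. The response body should then contain the dataset items as JSON, per the endpoint summary in the specification.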
