# Smart Article Extractor (`parseforge/article-extractor`) Actor

Extract clean article content from any news, blog, or publisher site! Pull full body text, author, publish date, word count, language, reading time, images, and metadata at scale. Ideal for content research, media monitoring, SEO audits, and AI training. Start extracting articles in minutes!

- **URL**: https://apify.com/parseforge/article-extractor.md
- **Developed by:** [ParseForge](https://apify.com/parseforge) (community)
- **Categories:** News, AI, Automation
- **Stats:** 7 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $40.00 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Since this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

![ParseForge Banner](https://github.com/ParseForge/apify-assets/blob/ad35ccc13ddd068b9d6cba33f323962e39aed5b2/banner.jpg?raw=true)

## 📰 Smart Article Extractor

> 🚀 **Parse any news article or blog post into clean structured text in seconds.** Get **23 metadata fields** per article including authors, tags, publish date, lead image, paywall flag, and reading time. No API key, no registration, no manual parser maintenance.

> 🕒 **Last updated:** 2026-04-21 · **📊 23 fields** per article · **🌐 Works on any site** · **⚡ 10 articles in ~10 seconds** · **💰 Paywall detection**


<table><tr>
<td style="border-left:4px solid #0F766E;padding:12px 16px;font-weight:600">Pull structured records from Smart Article Extractor — clean fields ready as CSV, JSON, JSONL, Excel, or XML for downstream pipelines.</td>
</tr></table>

##### Copy to your AI assistant

Copy this block into ChatGPT, Claude, Cursor, or any LLM to start using this Actor.

````

parseforge/article-extractor on Apify. Call: ApifyClient("TOKEN").actor("parseforge/article-extractor").call(run_input={...}), then client.dataset(run["defaultDatasetId"]).list_items().items for results. Key inputs: startUrls (array, default [{"url": "https://www.bbc.com/news/articles/c86w8elez74o"}]), maxItems (integer, default 10). Full Actor spec: fetch build via GET https://api.apify.com/v2/acts/parseforge~article-extractor (Bearer TOKEN). Get token: https://console.apify.com/account/integrations

````

The **Smart Article Extractor** takes any article URL and returns the main body as clean Markdown alongside 22 metadata fields. It scores DOM nodes by paragraph count, word count, and link density to identify the main content block, then strips navigation, sidebars, and ads. Author, tags, section, publishedAt, modifiedAt, and canonical URL are pulled from meta tags, JSON-LD, and itemprop attributes.
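
To make the scoring idea concrete, here is a minimal sketch of a readability-style score. Only the three signals (paragraph count, word count, link density) come from the description above. The `BlockStats` type, the `score_block` function, and the weights are illustrative assumptions, not the Actor's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class BlockStats:
    """Counts gathered for one candidate DOM node (hypothetical structure)."""
    paragraphs: int   # number of <p> descendants
    words: int        # total words in the node's text
    link_words: int   # words that sit inside <a> elements


def link_density(stats: BlockStats) -> float:
    """Share of the node's words that are link text (0.0 to 1.0)."""
    return stats.link_words / stats.words if stats.words else 1.0


def score_block(stats: BlockStats) -> float:
    """Reward paragraphs and words, then discount link-heavy blocks.

    The weights (25 per paragraph, 1 per word) are assumed values.
    """
    base = 25 * stats.paragraphs + stats.words
    return base * (1.0 - link_density(stats))


# Made-up counts for three candidate nodes on one page
candidates = {
    "nav": BlockStats(paragraphs=0, words=40, link_words=40),
    "article": BlockStats(paragraphs=12, words=700, link_words=35),
    "sidebar": BlockStats(paragraphs=2, words=80, link_words=60),
}
best = max(candidates, key=lambda name: score_block(candidates[name]))
print(best)
```

A navigation block made entirely of links scores zero here, while a long, paragraph-rich block with few links wins, which is the behavior the description implies.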

Extras include a paywall-detection heuristic, inline image collection, lead image (Open Graph), language detection, word count, and reading time. Concurrent fetching keeps 10 articles flying in parallel, so a list of 100 news URLs finishes in about 15 seconds. Works out of the box on most major news sites, blogs, and publishing platforms.
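
The parallel fetching mentioned above can be pictured as a bounded worker pool. The sketch below is an assumption about the general pattern, not the Actor's code: `fetch_article` is a hypothetical stand-in that sleeps instead of touching the network, and the limit of 10 mirrors the figure quoted in the text.

```python
import asyncio


async def fetch_article(url: str) -> dict:
    """Stand-in for a real HTTP fetch; sleeps briefly instead of using the network."""
    await asyncio.sleep(0.01)
    return {"url": url, "httpStatus": 200}


async def fetch_all(urls: list[str], concurrency: int = 10) -> list[dict]:
    """Fetch every URL while keeping at most `concurrency` requests in flight."""
    semaphore = asyncio.Semaphore(concurrency)

    async def bounded(url: str) -> dict:
        async with semaphore:
            return await fetch_article(url)

    # gather returns results in the same order as the input URLs
    return await asyncio.gather(*(bounded(u) for u in urls))


urls = [f"https://example.com/article-{i}" for i in range(25)]
results = asyncio.run(fetch_all(urls))
print(len(results))
```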

| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| News aggregators, media monitoring teams, AI app developers, content researchers, data journalists, archivists | News datasets, summarization pipelines, media monitoring, sentiment analysis, archive assembly |

---

### 📋 What the Smart Article Extractor does

Five extraction workflows in a single run:

- 📝 **Main body extraction.** DOM scoring isolates the article content and strips navigation, ads, and sidebars.
- 👥 **Author detection.** Pulls authors from meta tags, JSON-LD, and itemprop attributes.
- 📅 **Date stamps.** Captures both `article:published_time` and `article:modified_time`.
- 🏷️ **Tags and section.** Extracts `article:tag` and `article:section` metadata.
- 💰 **Paywall flag.** Heuristic detects common paywall markers so you can filter downstream.

Every record also includes the canonical URL, lead image, inline images, word count, reading time, language, site name, HTTP status, and timestamp.
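
As a rough illustration of the meta-tag part of this workflow, the sketch below reads the `article:*` properties named above using only Python's standard library. The `ArticleMetaParser` class and the sample HTML are hypothetical, and the Actor itself also consults JSON-LD and itemprop attributes, which this example leaves out.

```python
from html.parser import HTMLParser


class ArticleMetaParser(HTMLParser):
    """Collect article:* meta tags from an HTML document."""

    def __init__(self) -> None:
        super().__init__()
        self.published_at = None
        self.modified_at = None
        self.section = None
        self.tags = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        prop = attributes.get("property")
        content = attributes.get("content")
        if not prop or content is None:
            return
        if prop == "article:published_time":
            self.published_at = content
        elif prop == "article:modified_time":
            self.modified_at = content
        elif prop == "article:section":
            self.section = content
        elif prop == "article:tag":
            self.tags.append(content)  # this tag may repeat, one per keyword


sample_html = """
<html><head>
<meta property="article:published_time" content="2025-01-10T14:00:00Z">
<meta property="article:modified_time" content="2025-01-10T16:30:00Z">
<meta property="article:section" content="AI">
<meta property="article:tag" content="openai">
<meta property="article:tag" content="gpt-store">
</head><body></body></html>
"""

parser = ArticleMetaParser()
parser.feed(sample_html)
print(parser.published_at, parser.section, parser.tags)
```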

> 💡 **Why it matters:** news sites each have their own HTML structure. Writing per-site parsers is brittle and breaks every time a publisher redesigns their pages. This Actor uses readability-style scoring that works across any article-shaped page.

---

### 🎬 Full Demo

_🚧 Coming soon: a 3-minute walkthrough showing extraction across news sites, blogs, and platforms._

---

### ⚙️ Input

<table>
<thead>
<tr><th>Input</th><th>Type</th><th>Default</th><th>Behavior</th></tr>
</thead>
<tbody>
<tr><td>startUrls</td><td>array of URLs</td><td>required</td><td>One or more article URLs to extract.</td></tr>
<tr><td>maxItems</td><td>integer</td><td>10</td><td>Maximum number of articles returned. The free plan caps at 10, paid plans at 1,000,000.</td></tr>
</tbody>
</table>

**Example: extract a single article.**

```json
{
    "startUrls": [
        { "url": "https://techcrunch.com/2025/01/10/openai-launches-gpt-store/" }
    ],
    "maxItems": 1
}
````

**Example: batch extraction for media monitoring.**

```json
{
    "startUrls": [
        { "url": "https://www.theverge.com/2025/ai-coverage-1" },
        { "url": "https://www.wired.com/story/ai-agents-2026" },
        { "url": "https://arstechnica.com/ai/article" }
    ],
    "maxItems": 100
}
```

> ⚠️ **Good to Know:** works best on article-shaped pages (one headline, one author, one body). Homepages, category pages, and list views return thin extractions because there is no single article to score.
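
If your URLs start out as a plain list of strings, a small helper can shape them into the input format shown above. This is a sketch only: `build_input` is a hypothetical helper, and the filtering rules (skip blanks, non-HTTP(S) entries, and exact duplicates) are assumptions about what is useful, not requirements of the Actor.

```python
from urllib.parse import urlparse


def build_input(urls: list[str], max_items: int = 10) -> dict:
    """Turn plain URL strings into the Actor's input shape.

    Skips blanks, non-HTTP(S) entries, and exact duplicates (order preserved).
    """
    seen = set()
    start_urls = []
    for raw in urls:
        url = raw.strip()
        parsed = urlparse(url)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            continue
        if url in seen:
            continue
        seen.add(url)
        start_urls.append({"url": url})
    return {"startUrls": start_urls, "maxItems": max_items}


actor_input = build_input(
    [
        "https://www.bbc.com/news/articles/c86w8elez74o",
        "https://www.bbc.com/news/articles/c86w8elez74o",  # duplicate
        "not-a-url",
        "",
    ],
    max_items=5,
)
print(actor_input)
```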

***

### 📊 Output

Each record contains **23 fields**. Download the dataset as CSV, Excel, JSON, or XML.

#### 🧾 Schema

| Field | Type | Example |
|---|---|---|
| 🔗 url | string | `"https://techcrunch.com/.../gpt-store/"` |
| 🔁 canonicalUrl | string \| null | `"https://techcrunch.com/.../gpt-store/"` |
| 🏷️ title | string \| null | `"OpenAI launches GPT Store"` |
| 📑 subtitle | string \| null | `"Available to Plus, Team, Enterprise"` |
| 🧑 author | string \| null | `"Kyle Wiggers"` |
| 👥 authors | string\[] | `["Kyle Wiggers"]` |
| 📅 publishedAt | ISO 8601 \| null | `"2025-01-10T14:00:00Z"` |
| 🔁 modifiedAt | ISO 8601 \| null | `"2025-01-10T16:30:00Z"` |
| 🏢 siteName | string \| null | `"TechCrunch"` |
| 🗂️ section | string \| null | `"AI"` |
| 🏷️ tags | string\[] | `["openai", "gpt-store"]` |
| 🌍 language | string \| null | `"en-US"` |
| 📝 description | string \| null | `"OpenAI rolled out the long-teased GPT Store..."` |
| 🖼️ leadImage | string \| null | `"https://.../og.jpg"` |
| 🎨 images | string\[] | `["https://...", "https://..."]` |
| 📃 markdown | string | `"# OpenAI launches GPT Store..."` |
| 💬 text | string | plain text without Markdown markers |
| 🧾 html | string | cleaned article HTML |
| 🔢 wordCount | number | `742` |
| ⏱️ readingTimeMinutes | number | `4` |
| 💰 hasPaywall | boolean | `false` |
| 🟢 httpStatus | number | `200` |
| 🕒 scrapedAt | ISO 8601 | `"2026-04-21T12:00:00.000Z"` |
| ❗ error | string \| null | `"Timeout"` on failure |
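
The formula behind `readingTimeMinutes` is not documented here. The sample records in this README (742 words → 4 minutes, 120 → 1, 6 → 1) are consistent with rounding up at roughly 200 words per minute, so the sketch below uses that as a labeled assumption. Treat `WORDS_PER_MINUTE` and `reading_time_minutes` as hypothetical names for recomputing the value yourself, for example at a different reading speed.

```python
import math

WORDS_PER_MINUTE = 200  # assumed reading speed, not a documented Actor setting


def reading_time_minutes(word_count: int) -> int:
    """Round up to whole minutes, with a floor of one minute."""
    return max(1, math.ceil(word_count / WORDS_PER_MINUTE))


for words in (742, 120, 6):
    print(words, reading_time_minutes(words))
```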

#### 📦 Sample records

<details>
<summary><strong>📰 Typical news article with full metadata</strong></summary>

```json
{
    "url": "https://techcrunch.com/2025/01/10/openai-launches-gpt-store/",
    "canonicalUrl": "https://techcrunch.com/2025/01/10/openai-launches-gpt-store/",
    "title": "OpenAI launches GPT Store for custom chatbots",
    "subtitle": "Available to ChatGPT Plus, Team and Enterprise users",
    "author": "Kyle Wiggers",
    "authors": ["Kyle Wiggers"],
    "publishedAt": "2025-01-10T14:00:00Z",
    "modifiedAt": "2025-01-10T16:30:00Z",
    "siteName": "TechCrunch",
    "section": "AI",
    "tags": ["openai", "gpt-store", "chatbots"],
    "language": "en-US",
    "description": "OpenAI rolled out the long-teased GPT Store today...",
    "leadImage": "https://techcrunch.com/wp/gpt-store-og.jpg",
    "images": ["https://.../1.jpg", "https://.../2.jpg"],
    "markdown": "# OpenAI launches GPT Store\n\nOpenAI rolled out...",
    "wordCount": 742,
    "readingTimeMinutes": 4,
    "hasPaywall": false,
    "httpStatus": 200,
    "scrapedAt": "2026-04-21T12:00:00.000Z"
}
```

</details>

<details>
<summary><strong>💰 Paywalled article detected</strong></summary>

```json
{
    "url": "https://www.nytimes.com/2025/01/10/opinion/ai-regulation.html",
    "canonicalUrl": "https://www.nytimes.com/2025/01/10/opinion/ai-regulation.html",
    "title": "A New Era of AI Regulation",
    "subtitle": "The next two years will reshape the rules",
    "author": "Editorial Board",
    "authors": ["Editorial Board"],
    "publishedAt": "2025-01-10T10:00:00Z",
    "modifiedAt": null,
    "siteName": "The New York Times",
    "section": "Opinion",
    "tags": ["ai", "regulation"],
    "language": "en",
    "description": "A preview paragraph before the paywall...",
    "leadImage": "https://static01.nyt.com/...opinion-og.jpg",
    "images": [],
    "markdown": "# A New Era of AI Regulation\n\nA preview paragraph...",
    "wordCount": 120,
    "readingTimeMinutes": 1,
    "hasPaywall": true,
    "httpStatus": 200,
    "scrapedAt": "2026-04-21T12:00:00.000Z"
}
```

</details>

<details>
<summary><strong>🚧 Minimal blog post with sparse metadata</strong></summary>

```json
{
    "url": "https://example-blog.com/hello",
    "canonicalUrl": "https://example-blog.com/hello",
    "title": "Hello world",
    "subtitle": null,
    "author": null,
    "authors": [],
    "publishedAt": null,
    "modifiedAt": null,
    "siteName": null,
    "section": null,
    "tags": [],
    "language": "en",
    "description": null,
    "leadImage": null,
    "images": [],
    "markdown": "# Hello world\n\nThis is a short post.",
    "wordCount": 6,
    "readingTimeMinutes": 1,
    "hasPaywall": false,
    "httpStatus": 200,
    "scrapedAt": "2026-04-21T12:00:00.000Z"
}
```

</details>
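
Records like the three samples above usually need a filtering pass before they feed a summarization or analysis pipeline. The sketch below is one possible filter built on the documented `error`, `hasPaywall`, and `wordCount` fields. The `usable_records` helper and the `min_words` threshold are assumptions you should tune to your own data.

```python
def usable_records(items: list[dict], min_words: int = 50) -> list[dict]:
    """Keep records that extracted cleanly and are not flagged as paywalled."""
    kept = []
    for item in items:
        if item.get("error"):        # extraction failed, for example a timeout
            continue
        if item.get("hasPaywall"):   # body is likely only a preview
            continue
        if (item.get("wordCount") or 0) < min_words:
            continue                 # too thin to be useful downstream
        kept.append(item)
    return kept


# Trimmed versions of the sample records above
items = [
    {"url": "https://techcrunch.com/2025/01/10/openai-launches-gpt-store/", "wordCount": 742, "hasPaywall": False},
    {"url": "https://www.nytimes.com/2025/01/10/opinion/ai-regulation.html", "wordCount": 120, "hasPaywall": True},
    {"url": "https://example-blog.com/hello", "wordCount": 6, "hasPaywall": False},
]
print([item["url"] for item in usable_records(items)])
```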

***

### ✨ Why choose this Actor

| | Capability |
|---|---|
| 🧠 | **DOM scoring.** Readability-style extraction works across any article-shaped page without per-site rules. |
| 📊 | **23 fields.** Authors, tags, section, dates, images, paywall, reading time, and canonical URL. |
| 💰 | **Paywall detection.** Flags articles likely behind a paywall so you can filter them out. |
| ⚡ | **Fast.** 10 articles in under 10 seconds with parallel fetching. |
| 🖼️ | **Image capture.** Lead image plus every inline image URL in the article body. |
| 🚫 | **No credentials.** Runs on any public article URL. |
| 🔌 | **Integrations.** Plugs into RSS feeds, newsroom tools, and news datasets. |

> 📊 Clean article text is the foundation of news summarization, sentiment analysis, and media monitoring. This Actor delivers it consistently without per-site parsers.

***

### 📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| **⭐ Smart Article Extractor** *(this Actor)* | $5 free credit, then pay-per-use | Any public article URL | **Live per run** | 23 metadata fields | ⚡ 2 min |
| Open-source readability libs | Free | Whatever you host | Your code | Whatever you build | 🐢 Days |
| News API services | $99+/month | Curated feeds | Real-time | Per-plan limits | ⏳ Hours |
| Paid media monitoring | $$$+/month | Managed sources | Real-time | Rich UI | 🕒 Variable |

Pick this Actor when you want article text from arbitrary URLs without maintaining your own extraction library.

***

### 🚀 How to use

1. 📝 **Sign up.** [Create a free account with $5 credit](https://console.apify.com/sign-up?fpr=vmoqkp) (takes 2 minutes).
2. 🌐 **Open the Actor.** Go to the Smart Article Extractor page on the Apify Store.
3. 🎯 **Paste URLs.** Add article URLs to the startUrls field and set maxItems.
4. 🚀 **Run it.** Click **Start** and let the Actor extract the content.
5. 📥 **Download.** Grab your results in the **Dataset** tab as CSV, Excel, JSON, or XML.

> ⏱️ Total time from signup to downloaded dataset: **3-5 minutes.** No coding required.

***

### 💼 Business use cases

<table>
<tr>
<td width="50%" valign="top">

#### 📰 News Aggregation

- Build custom news feeds across sources
- Deduplicate stories across outlets
- Normalize article structure for downstream apps
- Feed summarization pipelines

</td>
<td width="50%" valign="top">

#### 🧠 AI & Summarization

- Extract clean text for LLM summaries
- Build news datasets for fine-tuning
- Ground chatbots with current media
- Power question-answering over news

</td>
</tr>
<tr>
<td width="50%" valign="top">

#### 📡 Media Monitoring

- Track brand mentions across outlets
- Monitor coverage of products or events
- Capture executive quotes and bylines
- Detect paywalled coverage to license

</td>
<td width="50%" valign="top">

#### 📚 Research & Archives

- Build academic text corpora
- Archive public journalism
- Extract metadata for bibliographies
- Preserve retracted or deleted articles

</td>
</tr>
</table>

***


### 🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

<table>
<tr>
<td width="50%">

#### 🎓 Research and academia

- Empirical datasets for papers, thesis work, and coursework
- Longitudinal studies tracking changes across snapshots
- Reproducible research with cited, versioned data pulls
- Classroom exercises on data analysis and ethical scraping

</td>
<td width="50%">

#### 🎨 Personal and creative

- Side projects, portfolio demos, and indie app launches
- Data visualizations, dashboards, and infographics
- Content research for bloggers, YouTubers, and podcasters
- Hobbyist collections and personal trackers

</td>
</tr>
<tr>
<td width="50%">

#### 🤝 Non-profit and civic

- Transparency reporting and accountability projects
- Advocacy campaigns backed by public-interest data
- Community-run databases for local issues
- Investigative journalism on public records

</td>
<td width="50%">

#### 🧪 Experimentation

- Prototype AI and machine-learning pipelines with real data
- Validate product-market hypotheses before engineering spend
- Train small domain-specific models on niche corpora
- Test dashboard concepts with live input

</td>
</tr>
</table>

### 🤖 Ask an AI assistant about this scraper

Open a ready-to-send prompt about this ParseForge Actor in the AI of your choice:

- 💬 [**ChatGPT**](https://chat.openai.com/?q=How%20do%20I%20use%20the%20Smart%20Article%20Extractor%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)
- 🧠 [**Claude**](https://claude.ai/new?q=How%20do%20I%20use%20the%20Smart%20Article%20Extractor%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)
- 🔍 [**Perplexity**](https://perplexity.ai/search?q=How%20do%20I%20use%20the%20Smart%20Article%20Extractor%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)
- 🅒 [**Copilot**](https://copilot.microsoft.com/?q=How%20do%20I%20use%20the%20Smart%20Article%20Extractor%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)

### ❓ Frequently Asked Questions

<details>
<summary><b>💳 Do I need a paid Apify plan to run this Actor?</b></summary>

No. You can start right now on the free Apify plan, which includes **$5 in free monthly credit**. That is enough to run this Actor several times and explore the output before committing to anything. Paid plans unlock higher limits, more concurrent runs, and larger datasets. [Create a free Apify account here](https://console.apify.com/sign-up?fpr=vmoqkp) to get started.

</details>

<details>
<summary><b>🚨 What happens if my run fails or returns no results?</b></summary>

Failed runs are not charged. If the source site changes, proxies get rate-limited, or a specific input matches nothing, re-run the Actor or open our [contact form](https://tally.so/r/BzdKgA) and we will investigate. You can also check the run log in the Apify Console to see why the run stopped.

</details>

<details>
<summary><b>📏 How many items can I scrape per run?</b></summary>

Free users are limited to **10 items per run** so you can preview the output and confirm the Actor works for your use case. Paid users can raise maxItems up to **1,000,000** per run. [Upgrade here](https://console.apify.com/sign-up?fpr=vmoqkp) if you need full scale.

</details>

<details>
<summary><b>🕒 How fresh is the data?</b></summary>

Every run fetches live data at the moment of execution. There is no cache or delay: the records you get reflect what the source returned at that moment. Schedule the Actor to maintain a rolling snapshot of the data you need.

</details>

<details>
<summary><b>🧑‍💻 Can I call this Actor from my own code?</b></summary>

Yes. Apify exposes every Actor as a REST endpoint and ships official API clients for [JavaScript](https://docs.apify.com/api/client/js) and [Python](https://docs.apify.com/api/client/python). You can start a run, read the dataset, and handle webhooks from your own app in a few lines. All you need is your Apify API token.

</details>

<details>
<summary><b>📤 How do I export the data?</b></summary>

Every Apify dataset can be downloaded in one click from the console as CSV, JSON, JSONL, Excel, HTML, XML, or RSS. You can also pull results programmatically via the [Apify API](https://docs.apify.com/api/v2) or stream them into BigQuery, S3, and other destinations through built-in integrations.

</details>

<details>
<summary><b>📅 Can I schedule the Actor to run automatically?</b></summary>

Yes. Use the Apify scheduler to run the Actor on any cadence, from hourly to monthly. Results are saved to your dataset and can be delivered to webhooks, email, Slack, cloud storage, or automation tools such as Zapier and Make.

</details>

***

### 🔌 Automating Smart Article Extractor

Control the scraper programmatically for scheduled runs and pipeline integrations:

- 🟢 **Node.js.** Install the `apify-client` npm package.
- 🐍 **Python.** Use the `apify-client` PyPI package.
- 📚 See the [Apify API documentation](https://docs.apify.com/api/v2) for full details.

The [Apify Schedules feature](https://docs.apify.com/platform/schedules) lets you trigger this Actor on any cron interval. Pair it with an RSS reader or Google News feed for continuous media monitoring.
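
For pipeline integrations, results can also be pulled straight from the dataset items endpoint of the Apify API, which takes a `format` query parameter. The sketch below only builds the download URL. `dataset_export_url` is a hypothetical helper, and depending on your storage access settings you may need to send your API token with the actual request.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
EXPORT_FORMATS = {"json", "jsonl", "csv", "xlsx", "xml", "html", "rss"}


def dataset_export_url(dataset_id: str, fmt: str = "csv") -> str:
    """Build the download URL for a dataset in the requested format."""
    if fmt not in EXPORT_FORMATS:
        raise ValueError(f"Unsupported format: {fmt}")
    query = urlencode({"format": fmt})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# Replace the placeholder with the run's defaultDatasetId
print(dataset_export_url("<DATASET_ID>", "jsonl"))
```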

### 🔌 Integrate with any app

Smart Article Extractor connects to any cloud service via [Apify integrations](https://apify.com/integrations):

- [**Make**](https://docs.apify.com/platform/integrations/make) - Automate multi-step workflows
- [**Zapier**](https://docs.apify.com/platform/integrations/zapier) - Connect with 5,000+ apps
- [**Slack**](https://docs.apify.com/platform/integrations/slack) - Post article summaries to channels
- [**Airbyte**](https://docs.apify.com/platform/integrations/airbyte) - Pipe articles into your warehouse
- [**GitHub**](https://docs.apify.com/platform/integrations/github) - Trigger runs from commits
- [**Google Drive**](https://docs.apify.com/platform/integrations/drive) - Export articles to Docs

You can also use webhooks to trigger summarization and alerting pipelines when new articles finish extracting.

***

### 🔗 Recommended Actors

- [**🤖 RAG Web Browser**](https://apify.com/parseforge/rag-web-browser) - Search or fetch URLs with LLM-ready output
- [**🕸️ Website Content Crawler**](https://apify.com/parseforge/website-content-crawler) - Deep-crawl a domain with depth control
- [**🔍 Google Search Scraper**](https://apify.com/parseforge/google-search-scraper) - SERP results with rank and description
- [**📈 Google Trends Scraper**](https://apify.com/parseforge/google-trends-scraper) - Interest over time and related queries
- [**📧 Contact Info Scraper**](https://apify.com/parseforge/contact-info-scraper) - Emails, phones, and socials from URLs

> 💡 **Pro Tip:** browse the complete [ParseForge collection](https://apify.com/parseforge) for more content-extraction tools.

***

**🆘 Need Help?** [**Open our contact form**](https://tally.so/r/BzdKgA) to request a new scraper, propose a custom data project, or report an issue.

***

> **⚠️ Disclaimer:** this Actor is an independent tool and is not affiliated with any publisher, news outlet, or readability library. Only publicly accessible article URLs are processed. Respect the copyright and terms of service of every publisher you extract from.

# Actor input Schema

## `startUrls` (type: `array`):

URLs to process

## `maxItems` (type: `integer`):

Free users: Limited to 10 items (preview). Paid users: Optional, max 1,000,000

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://www.bbc.com/news/articles/c86w8elez74o"
    }
  ],
  "maxItems": 10
}
```

# Actor output Schema

## `results` (type: `string`):

Complete dataset

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://www.bbc.com/news/articles/c86w8elez74o"
        }
    ],
    "maxItems": 10
};

// Run the Actor and wait for it to finish
const run = await client.actor("parseforge/article-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://www.bbc.com/news/articles/c86w8elez74o" }],
    "maxItems": 10,
}

# Run the Actor and wait for it to finish
run = client.actor("parseforge/article-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
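
If you would rather not install the client library, the same run can be started over plain HTTP through the `run-sync-get-dataset-items` endpoint listed in the OpenAPI specification. The sketch below only prepares the request using the standard library. `build_sync_request` is a hypothetical helper, and passing the token as a Bearer header (rather than the `token` query parameter) is a choice made for this example.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "parseforge~article-extractor"  # "/" in the Actor name becomes "~" in API paths


def build_sync_request(token: str, run_input: dict) -> urllib.request.Request:
    """Prepare a POST to the run-sync-get-dataset-items endpoint."""
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


request = build_sync_request(
    "<YOUR_API_TOKEN>",
    {"startUrls": [{"url": "https://www.bbc.com/news/articles/c86w8elez74o"}], "maxItems": 1},
)
print(request.get_method(), request.full_url)

# Sending is left commented out so the sketch runs without a token or network access:
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```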

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://www.bbc.com/news/articles/c86w8elez74o"
    }
  ],
  "maxItems": 10
}' |
apify call parseforge/article-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=parseforge/article-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Smart Article Extractor",
        "description": "Extract clean article content from any news, blog, or publisher site! Pull full body text, author, publish date, word count, language, reading time, images, and metadata at scale. Ideal for content research, media monitoring, SEO audits, and AI training. Start extracting articles in minutes!",
        "version": "1.0",
        "x-build-id": "q2HclEytgtANZjzTv"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/parseforge~article-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-parseforge-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/parseforge~article-extractor/runs": {
            "post": {
                "operationId": "runs-sync-parseforge-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/parseforge~article-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-parseforge-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Article URLs",
                        "type": "array",
                        "description": "URLs to process",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "maxItems": {
                        "title": "Max Items",
                        "minimum": 1,
                        "maximum": 1000000,
                        "type": "integer",
                        "description": "Free users: Limited to 10 items (preview). Paid users: Optional, max 1,000,000"
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
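
As a rough illustration of how a response shaped like `runsResponseSchema` might be consumed, the sketch below pulls out the fields a caller usually needs next: the run ID, its status, and the default dataset ID. The helper name `summarize_run` and the sample values are hypothetical; only the field names come from the schema above.

```python
def summarize_run(response):
    """Pick a few commonly needed fields out of a run response body.

    Missing fields come back as None rather than raising, because every
    property in runsResponseSchema is optional.
    """
    data = response.get("data", {})
    return {
        "run_id": data.get("id"),
        "status": data.get("status"),
        "dataset_id": data.get("defaultDatasetId"),
        "usage_total_usd": data.get("usageTotalUsd"),
    }


# Hypothetical sample body; the IDs are made up for illustration.
sample = {
    "data": {
        "id": "run-123",
        "status": "READY",
        "defaultDatasetId": "dataset-456",
        "usageTotalUsd": 0.00005,
    }
}

summary = summarize_run(sample)
print(summary["status"], summary["dataset_id"])
```

The `dataset_id` value is what you would use to fetch the extracted articles once the run has finished.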

| Input | Type | Default | Behavior |
|---|---|---|---|
| `startUrls` | array of URLs | required | One or more article URLs to extract. |
| `maxItems` | integer | 10 | Maximum number of articles returned. The free plan caps at 10; paid plans allow up to 1,000,000. |
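
The input rules above can be checked locally before starting a run. The following is a minimal sketch, not part of the Actor itself: the helper name `build_input` is hypothetical, and restricting URLs to absolute `http`/`https` addresses is an assumption (the schema only requires `format: uri`).

```python
from urllib.parse import urlparse

MAX_ITEMS_LIMIT = 1_000_000  # "maximum" declared for maxItems in the input schema


def build_input(urls, max_items=None):
    """Build a run input dict with the shape the input schema describes."""
    if not urls:
        raise ValueError("startUrls is required and needs at least one URL")

    start_urls = []
    for url in urls:
        parsed = urlparse(url)
        # Assumption: only absolute http(s) URLs make sense for article pages.
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {url!r}")
        start_urls.append({"url": url})

    run_input = {"startUrls": start_urls}

    if max_items is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(max_items, bool) or not isinstance(max_items, int):
            raise ValueError("maxItems must be an integer")
        if not 1 <= max_items <= MAX_ITEMS_LIMIT:
            raise ValueError(f"maxItems must be between 1 and {MAX_ITEMS_LIMIT}")
        run_input["maxItems"] = max_items

    return run_input


run_input = build_input(["https://example.com/some-article"], max_items=5)
print(run_input)
```

Leaving `max_items` unset omits the field entirely, so the Actor's own default applies.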

#### 📰 News Aggregation

- Build custom news feeds across sources
- Deduplicate stories across outlets
- Normalize article structure for downstream apps
- Feed summarization pipelines

#### 🧠 AI & Summarization

- Extract clean text for LLM summaries
- Build news datasets for fine-tuning
- Ground chatbots with current media
- Power question-answering over news

#### 📡 Media Monitoring

- Track brand mentions across outlets
- Monitor coverage of products or events
- Capture executive quotes and bylines
- Detect paywalled coverage to license

#### 📚 Research & Archives

- Build academic text corpora
- Archive public journalism
- Extract metadata for bibliographies
- Preserve retracted or deleted articles

#### 🎓 Research and academia

- Empirical datasets for papers, thesis work, and coursework
- Longitudinal studies tracking changes across snapshots
- Reproducible research with cited, versioned data pulls
- Classroom exercises on data analysis and ethical scraping

#### 🎨 Personal and creative

- Side projects, portfolio demos, and indie app launches
- Data visualizations, dashboards, and infographics
- Content research for bloggers, YouTubers, and podcasters
- Hobbyist collections and personal trackers

#### 🤝 Non-profit and civic

- Transparency reporting and accountability projects
- Advocacy campaigns backed by public-interest data
- Community-run databases for local issues
- Investigative journalism on public records

#### 🧪 Experimentation

- Prototype AI and machine-learning pipelines with real data
- Validate product-market hypotheses before engineering spend
- Train small domain-specific models on niche corpora
- Test dashboard concepts with live input