# Article Extraction API (`tugelbay/article-extractor`) Actor

Extract clean article text and metadata from URLs as Markdown, text, or HTML for RAG, AI agents, monitoring, and research. Guide: https://konabayev.com/tools/article-extractor/?utm_source=apify_info&utm_medium=referral&utm_campaign=article-extractor

- **URL**: https://apify.com/tugelbay/article-extractor.md
- **Developed by:** [Tugelbay Konabayev](https://apify.com/tugelbay) (community)
- **Categories:** AI, Developer tools, SEO tools
- **Stats:** 41 total users, 12 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $3.50 / 1,000 articles

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Article Extraction API — URL to Markdown for RAG & LLMs

> **Clean article content from known URLs** — extract title, author, date, Markdown/text/HTML, images, links, and metadata from news, blogs, docs, and knowledge-base pages.
> **Built for AI ingestion** — compact Markdown output for RAG pipelines, vector databases, LLM prompts, content monitoring, and research datasets.
> **Reliable HTTP extraction** — no browser overhead; retries transient 408/429/5xx responses and uses a conservative default concurrency for production runs.

<a href="https://apify.com/tugelbay/article-extractor">
  <img src="https://api.apify.com/v2/key-value-stores/bplRdpnd85eGQkW1N/records/article-extractor-hero.png" alt="Article Extractor overview: clean article content, Markdown output, metadata, and images" width="100%">
</a>

<p>
  <img src="https://api.apify.com/v2/key-value-stores/bplRdpnd85eGQkW1N/records/article-extractor-input-output.png" alt="Article Extractor input and output example" width="49%">
  <img src="https://api.apify.com/v2/key-value-stores/bplRdpnd85eGQkW1N/records/article-extractor-dataset-preview.png" alt="Article Extractor dataset preview" width="49%">
</p>

Convert article URLs into clean, readable content. Article Extractor removes ads, navigation, sidebars, and boilerplate, then returns the main article text with source metadata. Output as Markdown, plain text, or clean HTML for AI/LLM workflows, content analysis, and data pipelines.

Perfect for building **RAG pipelines**, **AI training datasets**, **knowledge bases**, and **content monitoring systems**.

For implementation recipes, production examples, and SEO/GEO notes, see the [Article Extractor guide](https://konabayev.com/tools/article-extractor/?utm_source=apify_readme&utm_medium=referral&utm_campaign=article-extractor) on Konabayev.com.

### From Sample to Production

Use the default one-URL run to check extraction quality for your source. When the output has clean Markdown, title, date, and enough body text, move to a production batch instead of repeating tiny tests:

1. Put 100-1,000 known article URLs from RSS, search, sitemaps, competitor monitoring, or another actor into `urls`.
2. Keep `outputFormat: "markdown"` for RAG and summarization workflows, or switch to `text` when you only need NLP-ready plain text.
3. Use Apify dataset export, integrations, webhooks, or the Apify API to send the results into your vector database, Google Sheets, Make/Zapier scenario, or internal pipeline.

Typical paid workflows are daily news monitoring, competitor content archives, knowledge-base ingestion, newsletter curation, and RAG dataset refreshes.

### Article Extraction API for RAG and AI Agents

Call Article Extractor from the Apify API, MCP tools, LangChain, Make, Zapier, or scheduled Apify runs. Give the Actor known article URLs and get structured records that are ready to index, summarize, classify, or store.

### URL to Markdown, Text, or Clean HTML

Paste a URL and get clean article text with title, author, publish date, body content, images, links, Open Graph data, canonical URL, and word count. Markdown is the default because it preserves headings and lists while keeping token usage lower than raw HTML.

### Web Content Scraper for RAG Pipelines

Feed clean article text directly into your vector database or LLM prompt. Output as Markdown or plain text — ready for embedding.

### Bulk Article Extraction (Up to 10K URLs)

Process hundreds or thousands of URLs in a single run. Perfect for news monitoring, content aggregation, and research datasets.

### What does Article Extractor do?

This actor takes a list of URLs and extracts the main article content from each page using Mozilla's Readability algorithm (the same technology behind Firefox Reader View). It returns structured data including:

- **Article text** in Markdown, plain text, or clean HTML
- **Metadata**: title, author, published date, description, language
- **Structured data**: JSON-LD and Open Graph metadata parsing
- **Media**: images, Open Graph image, links found in the article
- **Stats**: word count, HTTP status code, extraction timestamp

You provide article-style public URLs and the actor extracts clean content without custom selectors or per-site CSS parsing. Some protected, app-rendered, or non-article pages may return partial content.

### Why use this instead of a generic web scraper?

| Need                    | Generic scraper                                  | Website crawler                         | Article Extractor                                |
| ----------------------- | ------------------------------------------------ | --------------------------------------- | ------------------------------------------------ |
| Main content extraction | Usually custom CSS selectors                     | Whole-page or crawl-oriented extraction | Readability-based article detection              |
| Output cleanup          | You remove nav, ads, cookie banners, and footers | Good for broad site ingestion           | Focused on clean article body text               |
| Setup time              | Write and maintain selectors per site            | Configure crawl scope and depth         | Add URLs and run                                 |
| LLM-ready output        | Requires post-processing                         | Good for site-wide RAG                  | Markdown/text/HTML per known URL                 |
| Metadata                | Manual extraction                                | Varies by crawler                       | Author, date, description, language, OG, JSON-LD |
| Best fit                | Site-specific scraping logic                     | Crawling whole websites                 | News, blogs, docs, research, monitoring          |

#### vs. Website Content Crawler

Apify's **Website Content Crawler** crawls entire websites and extracts content across discovered pages. Article Extractor is different:

- **Focused extraction**: Only extracts the main article content, not the entire page
- **Cleaner output**: Strips navigation, ads, sidebars, related articles — just the article
- **Richer metadata**: Automatically extracts author, publish date, JSON-LD, Open Graph
- **Faster**: Uses HTTP requests (no browser), processes pages in parallel
- **Known-URL workflow**: Optimized when your pipeline already has article URLs from RSS, search, sitemaps, feeds, monitoring, or another actor

**When to use which:**

- Use **Article Extractor** when you need clean article text from known URLs (news, blogs, docs)
- Use **Website Content Crawler** when you need to crawl an entire website following links
- Use **RAG Web Browser** when pages require browser rendering or search-to-content workflows

### Best-fit workflows

- **URL to Markdown API** — convert a queue of article URLs into clean Markdown documents.
- **News and blog monitoring** — combine with RSS, search, or scheduled runs to archive new articles.
- **RAG ingestion** — push Markdown content and metadata into vector stores with source traceability.
- **Content research** — collect article titles, authors, dates, descriptions, word counts, and clean bodies.
- **SEO content analysis** — compare competitor articles without scraping menus, ads, and unrelated widgets.

### Features

- Smart article extraction using Mozilla Readability algorithm
- Markdown output optimized for LLM consumption and RAG pipelines
- Automatic metadata extraction (author, date, description, language)
- JSON-LD and Open Graph metadata parsing
- Image and link extraction from article body
- Concurrent processing: a reliability-first default of 5 pages at a time, configurable up to 50
- Retries transient timeout, rate-limit, and 5xx responses before returning an error item
- Proxy support for geo-restricted content
- Best for public article-style pages such as news sites, blogs, and documentation
- 5MB page size limit to prevent memory issues
- Pay-per-event pricing for extracted articles

### Input examples

#### Extract articles as Markdown (default)

```json
{
  "urls": [{ "url": "https://blog.apify.com/what-is-web-scraping/" }],
  "outputFormat": "markdown",
  "maxItems": 1
}
````

#### Extract as plain text for NLP analysis

```json
{
  "urls": [{ "url": "https://blog.apify.com/how-to-build-a-web-scraper/" }],
  "outputFormat": "text",
  "extractImages": false
}
```

#### Bulk extraction with proxy (100+ articles)

```json
{
  "urls": [
    { "url": "https://example.com/article-1" },
    { "url": "https://example.com/article-2" },
    { "url": "https://example.com/article-3" }
  ],
  "outputFormat": "markdown",
  "maxConcurrency": 5,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```
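
The bulk input above can also be assembled programmatically when the URLs come from a feed or a file. The following is a minimal sketch, not part of the Actor: `build_bulk_input` is a hypothetical helper name, and raising `maxItems` to the batch size assumes that the documented default of 10 would otherwise cap the run.

```python
from typing import Iterable


def build_bulk_input(urls: Iterable[str], use_proxy: bool = True) -> dict:
    """Wrap plain URL strings in the {"url": ...} objects used by the examples."""
    seen = set()
    unique = []
    for raw in urls:
        url = raw.strip()
        # Skip blank entries and exact duplicates so no URL is requested twice.
        if url and url not in seen:
            seen.add(url)
            unique.append({"url": url})

    run_input = {
        "urls": unique,
        "outputFormat": "markdown",
        "maxConcurrency": 5,
        # Assumption: the default maxItems of 10 limits the run, so match the
        # batch size while staying inside the documented 1-10,000 range.
        "maxItems": max(1, min(len(unique), 10_000)),
    }
    if use_proxy:
        run_input["proxyConfiguration"] = {"useApifyProxy": True}
    return run_input
```

The returned dictionary can be passed as `run_input` to the Python client call shown in the integration examples.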

#### Extract with all metadata (images + links)

```json
{
  "urls": [{ "url": "https://blog.apify.com/what-is-web-scraping/" }],
  "outputFormat": "markdown",
  "extractImages": true,
  "extractLinks": true
}
```

### Input parameters

| Parameter            | Type    | Default      | Required | Description                                                                                                          |
| -------------------- | ------- | ------------ | -------- | -------------------------------------------------------------------------------------------------------------------- |
| `urls`               | Array   | —            | Yes      | List of article/page URLs to extract content from                                                                    |
| `outputFormat`       | String  | `"markdown"` | No       | Output format: `"markdown"`, `"text"`, or `"html"`                                                                   |
| `maxItems`           | Integer | 10           | No       | Maximum number of articles to extract (1–10,000)                                                                     |
| `extractImages`      | Boolean | `true`       | No       | Include image URLs found in the article                                                                              |
| `extractLinks`       | Boolean | `false`      | No       | Include links found in the article                                                                                   |
| `timeout`            | Integer | 30           | No       | Maximum seconds to wait for each page to load (5–120)                                                                |
| `maxConcurrency`     | Integer | 5            | No       | Number of pages to process simultaneously (1–50). Keep the default for reliability; raise it for fast, stable sites. |
| `proxyConfiguration` | Object  | None         | No       | Proxy settings for accessing geo-restricted content                                                                  |

### Output format

Each item in the dataset contains:

| Field           | Type    | Description                                                |
| --------------- | ------- | ---------------------------------------------------------- |
| `url`           | String  | Final page URL (after redirects)                           |
| `canonicalUrl`  | String  | Canonical URL if specified by the page                     |
| `title`         | String  | Article title                                              |
| `author`        | String  | Article author (from meta tags, JSON-LD, or byline)        |
| `publishedDate` | String  | Publication date (ISO 8601)                                |
| `description`   | String  | Meta description or article summary                        |
| `content`       | String  | Extracted article in requested format (Markdown/text/HTML) |
| `wordCount`     | Integer | Number of words in the article                             |
| `language`      | String  | Detected content language code                             |
| `siteName`      | String  | Website name (from Open Graph)                             |
| `images`        | Array   | Image URLs from the article (if `extractImages: true`)     |
| `links`         | Array   | Links from the article (if `extractLinks: true`)           |
| `ogImage`       | String  | Open Graph image URL                                       |
| `statusCode`    | Integer | HTTP response status code                                  |
| `error`         | String  | Error message if extraction failed (null on success)       |
| `extractedAt`   | String  | Extraction timestamp (ISO 8601)                            |

#### Example output

```json
{
  "url": "https://blog.apify.com/what-is-web-scraping/",
  "canonicalUrl": "https://blog.apify.com/what-is-web-scraping/",
  "title": "What is web scraping? A beginner's guide",
  "author": "Apify Team",
  "publishedDate": "2024-03-15T10:00:00Z",
  "description": "Learn what web scraping is, how it works, and why it matters.",
  "content": "# What is web scraping?\n\nWeb scraping is the process of automatically extracting data from websites...\n\n## How does web scraping work?\n\n1. **Send HTTP request** to the target URL\n2. **Parse the HTML** response\n3. **Extract the data** you need\n4. **Store the results** in a structured format",
  "wordCount": 2450,
  "language": "en",
  "siteName": "Apify Blog",
  "images": ["https://blog.apify.com/content/images/web-scraping-hero.jpg"],
  "links": [],
  "ogImage": "https://blog.apify.com/content/images/og-web-scraping.jpg",
  "statusCode": 200,
  "error": null,
  "extractedAt": "2026-03-29T12:00:00+00:00"
}
```
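
For typed downstream code, an item like the one above can be mapped onto a small record. This is a sketch under assumptions: the `Article` class and the `to_article` and `parse_iso` helpers are illustrative names, and only the fields documented in the output table are used. Dates that cannot be parsed are returned as `None` rather than raising.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Article:
    url: str
    title: Optional[str]
    content: str
    word_count: int
    published: Optional[datetime]


def parse_iso(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO 8601 timestamp; returns None for missing or malformed values."""
    if not value:
        return None
    try:
        # Older Python versions do not accept a trailing "Z", so rewrite it.
        return datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None


def to_article(item: dict) -> Article:
    """Convert one dataset item into an Article, tolerating null fields."""
    return Article(
        url=item["url"],
        title=item.get("title"),
        content=item.get("content") or "",
        word_count=item.get("wordCount") or 0,
        published=parse_iso(item.get("publishedDate")),
    )
```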

### Integrations

#### Apify MCP Server (Claude, AI agents)

Use the Actor as a tool in Claude Desktop, Claude Code, or any MCP-compatible AI agent framework. Pricing is pay-per-event (PPE), so it fits AI agent workflows where each task triggers a separate extraction.

#### Python integration

```python
from apify_client import ApifyClient

client = ApifyClient("your-apify-api-token")

# Extract articles
run = client.actor("tugelbay/article-extractor").call(
    run_input={
        "urls": [
            {"url": "https://blog.apify.com/what-is-web-scraping/"},
            {"url": "https://en.wikipedia.org/wiki/Web_scraping"},
        ],
        "outputFormat": "markdown",
    }
)

# Read results
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(f"Title: {item['title']}")
    print(f"Author: {item.get('author', 'Unknown')}")
    print(f"Words: {item['wordCount']}")
    print(f"Content preview: {item['content'][:200]}...")
    print()
```

#### JavaScript/TypeScript integration

```javascript
import { ApifyClient } from "apify-client";

const client = new ApifyClient({ token: "your-apify-api-token" });

const run = await client.actor("tugelbay/article-extractor").call({
  urls: [
    { url: "https://blog.apify.com/what-is-web-scraping/" },
    { url: "https://en.wikipedia.org/wiki/Web_scraping" },
  ],
  outputFormat: "markdown",
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
for (const item of items) {
  console.log(`${item.title} (${item.wordCount} words)`);
  console.log(item.content?.substring(0, 200));
}
```

#### LangChain (RAG pipeline)

```python
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document

apify = ApifyWrapper(apify_api_token="your-apify-api-token")

docs = apify.call_actor(
    actor_id="tugelbay/article-extractor",
    run_input={
        "urls": [{"url": "https://example.com/article"}],
        "outputFormat": "markdown",
    },
    dataset_mapping_function=lambda item: Document(
        page_content=item.get("content", ""),
        metadata={
            "url": item.get("url"),
            "title": item.get("title"),
            "author": item.get("author"),
            "publishedDate": item.get("publishedDate"),
        },
    ),
)
```

#### Webhooks and integrations

The actor works with Apify's integration ecosystem:

- **Google Sheets** — export extracted articles directly to a spreadsheet
- **Zapier / Make** — trigger workflows on new results
- **Slack** — get notifications when extraction completes
- **Email** — receive dataset as email attachment
- **API** — call programmatically via Apify REST API

### Use cases

- **LLM training data** — extract clean text from web pages for fine-tuning datasets
- **RAG pipelines** — feed article content into vector databases for retrieval-augmented generation
- **Content analysis** — analyze articles at scale for sentiment, topics, and trends
- **News monitoring** — extract and archive news articles automatically on a schedule
- **Research** — collect and structure academic or industry content for literature reviews
- **SEO analysis** — extract competitor content for gap analysis and content strategy
- **Knowledge base** — build searchable archives from documentation sites and blogs
- **Content migration** — extract content from legacy sites during CMS migrations
- **AI agents** — give your AI agent structured article content from public article-style URLs
- **Newsletter curation** — automatically extract and summarize articles for newsletters
- **Compliance monitoring** — track content changes on regulatory or competitor pages

### Cost estimation (PPE pricing)

| Event               | Description                         |
| ------------------- | ----------------------------------- |
| `article-extracted` | Each article successfully extracted |

Apify currently displays pricing from **$3.50 / 1,000 articles**. The exact price depends on the user's Apify pricing tier; the examples below show the displayed "from" price and exclude the small actor-start event.

**Example costs at the displayed "from" price:**

| Scenario                                | Articles    | From cost    |
| --------------------------------------- | ----------- | ------------ |
| 10 blog posts                           | 10          | ~$0.04       |
| 100 news articles                       | 100         | ~$0.35       |
| 1,000 documentation pages               | 1,000       | ~$3.50       |
| Daily news monitoring (50 articles/day) | 1,500/month | ~$5.25/month |
| Large-scale extraction                  | 10,000      | ~$35         |

**Tip:** Set `extractImages: false` and `extractLinks: false` to speed up extraction and reduce output size when you only need the text content.
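
The cost table above is plain arithmetic on the displayed "from" price. A small sketch for estimating other volumes follows; `estimate_cost` is a hypothetical helper, and the result is only a lower-bound estimate because the actual per-event price depends on your plan and the actor-start event is excluded.

```python
from decimal import Decimal, ROUND_HALF_UP

# Displayed "from" price; the real per-event price depends on the pricing tier.
FROM_PRICE_PER_1000 = Decimal("3.50")


def estimate_cost(articles: int, price_per_1000: Decimal = FROM_PRICE_PER_1000) -> Decimal:
    """Estimate article-extracted charges in USD, rounded to whole cents."""
    if articles < 0:
        raise ValueError("articles must be non-negative")
    cost = Decimal(articles) * price_per_1000 / Decimal(1000)
    return cost.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)
```

For example, `estimate_cost(1500)` reproduces the monthly figure for 50 articles per day.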

### FAQ

#### What types of pages work best?

Article Extractor works best on **article-style pages**: news articles, blog posts, documentation pages, Wikipedia articles, and similar content. The Readability algorithm is designed to identify the "main content" of a page and strip everything else.

#### Does it work on JavaScript-rendered pages (SPAs)?

No. Article Extractor uses fast HTTP requests (no browser). Pages that require JavaScript to render content (React SPAs, Angular apps) will return empty or minimal content. For those pages, use [RAG Web Browser](https://apify.com/tugelbay/rag-web-browser) which has automatic browser fallback.

#### How fast is it?

Because it uses HTTP requests (no browser), it can process **100 articles in 2–3 minutes** at the default concurrency. Raising `maxConcurrency` (up to 50) speeds this up further on fast, stable sites.

#### Can I extract content behind login walls or paywalls?

No. Article Extractor only works with publicly accessible pages. It cannot bypass login walls, paywalls, or CAPTCHA-protected content.

#### What's the maximum page size?

5MB per page. Larger pages are truncated to prevent memory issues. This covers 99%+ of normal web articles.

#### Can I run this on a schedule?

Yes. Set up a [Schedule](https://docs.apify.com/platform/schedules) in Apify Console to run the actor at any interval — hourly, daily, or custom cron expressions. Perfect for news monitoring and content tracking.

#### Why Markdown output?

Markdown is the most LLM-friendly format:

- Preserves semantic structure (headers, emphasis, lists, code blocks)
- Compact — fits more content in LLM context windows
- Renders cleanly in chat interfaces and documentation tools
- Easy to parse for downstream processing
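
As an illustration of the last point, the Markdown in `content` can be split into heading-based sections before embedding. This is a minimal sketch, not how the Actor or any particular RAG library chunks text; `split_by_headings` is a hypothetical helper that recognizes only ATX headings (`#` to `######`) and ignores lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")
FENCE = "`" * 3  # marker that opens and closes a fenced code block


def split_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into sections, one per ATX heading outside code fences."""
    sections = []
    heading, lines = None, []
    in_fence = False

    def flush():
        # Keep a section if it has a heading or any non-blank text.
        if heading is not None or any(line.strip() for line in lines):
            sections.append({"heading": heading, "text": "\n".join(lines).strip()})

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            heading, lines = match.group(2).strip(), []
        else:
            lines.append(line)
    flush()
    return sections
```

Each section can then be embedded separately, with the item's `url` and `title` attached as metadata for source traceability.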

#### How does it handle errors?

If a page fails to load (timeout, 404, blocked), the actor returns the URL with an `error` field explaining what went wrong and a null `content` field. Other pages in the batch continue processing normally.
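
Because failed pages arrive in the same dataset as successful ones, downstream code usually separates the two first. The sketch below assumes only the documented `error` and `content` fields; `partition_results` is a hypothetical helper, and treating empty `content` as a failure is a design choice rather than Actor behavior.

```python
def partition_results(items: list[dict]) -> tuple[list[dict], list[dict]]:
    """Split dataset items into (succeeded, failed) using the `error` field."""
    succeeded, failed = [], []
    for item in items:
        # `error` is null on success; empty content is also treated as a failure
        # here so that blank documents never reach the index.
        if item.get("error") or not item.get("content"):
            failed.append(item)
        else:
            succeeded.append(item)
    return succeeded, failed
```

A typical call is `succeeded, failed = partition_results(items)`, after which the failed URLs can be logged or queued for another run.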

### Troubleshooting

#### Empty or very short content extraction

- **Cause**: The page is a SPA (Single Page Application) that renders content with JavaScript
- **Fix**: Use [RAG Web Browser](https://apify.com/tugelbay/rag-web-browser) instead, which has browser fallback
- **Other cause**: Very short pages (<100 words) may not have enough content for Readability to detect the main article

#### Missing author or publish date

- **Cause**: The page doesn't include author/date in meta tags, JSON-LD, or standard HTML patterns
- **Fix**: This is expected — not all pages provide this metadata. The fields will be null.

#### Timeout errors on some pages

- **Cause**: The target page is slow to respond
- **Fix**: Increase the `timeout` parameter (default: 30 seconds, max: 120 seconds)
- **Alternative**: Reduce `maxConcurrency` if you're scraping many pages from the same domain
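
Both fixes can be combined into a follow-up run that retries only the pages that failed. The following is a sketch under assumptions: `build_retry_input` is a hypothetical helper, the doubling and halving factors are arbitrary, and it assumes the `url` field of a failed item still holds the URL that was requested.

```python
def build_retry_input(failed_items: list[dict], previous_input: dict) -> dict:
    """Re-queue failed URLs with a longer timeout and lower concurrency."""
    timeout = previous_input.get("timeout", 30)             # documented default
    concurrency = previous_input.get("maxConcurrency", 5)   # documented default

    retry_input = dict(previous_input)  # keep outputFormat, proxy settings, etc.
    retry_input.update(
        urls=[{"url": item["url"]} for item in failed_items],
        timeout=min(timeout * 2, 120),             # stay within the 5-120 range
        maxConcurrency=max(concurrency // 2, 1),   # stay within the 1-50 range
    )
    return retry_input
```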

#### Proxy-related errors

- **Cause**: Some sites block datacenter IPs
- **Fix**: Enable Apify proxy with residential proxy groups in `proxyConfiguration`

### Limitations

- Only works with publicly accessible pages (no login-protected or paywalled content)
- JavaScript-rendered content (SPAs) will not extract fully — use a browser-based solution for those
- Very short pages (under 100 words) may not have enough content for Readability to detect
- Maximum page size: 5MB (larger pages are truncated)
- Maximum 10,000 articles per run (use multiple runs for larger datasets)
- Metadata extraction depends on the page having proper meta tags, JSON-LD, or Open Graph markup
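
For datasets above the per-run limit, the URL list has to be split across several runs. A minimal sketch of that batching follows; `chunk_urls` is a hypothetical helper, and each batch would be submitted as the `urls` of a separate run.

```python
from typing import Iterator

MAX_ARTICLES_PER_RUN = 10_000  # per-run limit stated in the list above


def chunk_urls(urls: list[str], size: int = MAX_ARTICLES_PER_RUN) -> Iterator[list[str]]:
    """Yield consecutive slices of `urls`, each with at most `size` entries."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]
```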

### Changelog

#### v1.0 (2026-03-29)

- Initial release
- Markdown, plain text, and clean HTML output formats
- Mozilla Readability-based article extraction
- Metadata extraction (author, date, description, JSON-LD, Open Graph)
- Image and link extraction
- Concurrent processing with configurable concurrency (1–50)
- Proxy support
- Pay-per-event pricing

### Related Actors

- [Website Content Crawler](https://apify.com/tugelbay/website-content-crawler) — Crawl websites and extract Markdown for RAG/LLMs
- [RAG Web Browser](https://apify.com/tugelbay/rag-web-browser) — Search Google + extract as Markdown for AI agents
- [YouTube Transcript Extractor](https://apify.com/tugelbay/youtube-transcript) — Bulk extract video transcripts as SRT/VTT/Markdown
- [Website Tech Stack Detector](https://apify.com/tugelbay/website-tech-stack-detector) — Identify 80+ technologies on any website
- [Google Maps Lead Extractor](https://apify.com/tugelbay/google-maps-leads) — Extract business leads with emails from Google Maps

See all actors: [apify.com/tugelbay](https://apify.com/tugelbay)

# Actor input Schema

## `urls` (type: `array`):

List of article/page URLs to extract content from. Supports news sites, blogs, documentation, and any content page.

## `outputFormat` (type: `string`):

Format for the extracted article text

## `maxItems` (type: `integer`):

Maximum number of articles to extract

## `extractImages` (type: `boolean`):

Include image URLs found in the article

## `extractLinks` (type: `boolean`):

Include links found in the article

## `timeout` (type: `integer`):

Maximum time to wait for each page to load

## `maxConcurrency` (type: `integer`):

Number of pages to process simultaneously. Keep the default for reliability; raise it only for fast, stable sites.

## `proxyConfiguration` (type: `object`):

Proxy settings for accessing geo-restricted or protected content

## Actor input object example

```json
{
  "urls": [
    {
      "url": "https://blog.apify.com/what-is-web-scraping/"
    }
  ],
  "outputFormat": "markdown",
  "maxItems": 10,
  "extractImages": true,
  "extractLinks": false,
  "timeout": 30,
  "maxConcurrency": 5
}
```
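
An input object like the one above can be checked locally before a run is started. This sketch uses the allowed values and ranges from the README's input parameters table; `validate_input` is a hypothetical helper, and the Actor's own validation may differ.

```python
ALLOWED_FORMATS = {"markdown", "text", "html"}
# (min, max) ranges copied from the input parameters table in the README.
RANGES = {"maxItems": (1, 10_000), "timeout": (5, 120), "maxConcurrency": (1, 50)}


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    if not run_input.get("urls"):
        problems.append("urls is required and must not be empty")

    fmt = run_input.get("outputFormat", "markdown")
    if fmt not in ALLOWED_FORMATS:
        problems.append(f"outputFormat must be one of {sorted(ALLOWED_FORMATS)}, got {fmt!r}")

    for name, (low, high) in RANGES.items():
        value = run_input.get(name)
        if value is None:
            continue  # omitted fields fall back to the Actor's defaults
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            problems.append(f"{name} must be an integer between {low} and {high}, got {value!r}")
    return problems
```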

# Actor output Schema

## `dataset` (type: `string`):

Dataset with extracted articles: URL, title, author, date, content (Markdown/text/HTML), word count, images, links, Open Graph data.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        {
            "url": "https://blog.apify.com/what-is-web-scraping/"
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("tugelbay/article-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "urls": [{ "url": "https://blog.apify.com/what-is-web-scraping/" }] }

# Run the Actor and wait for it to finish
run = client.actor("tugelbay/article-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
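
If installing the client library is not an option, the same run can be started through the REST API using only the standard library. This is a sketch, not an official example: `build_sync_request` and `run_sync` are hypothetical helper names, the endpoint path is the `run-sync-get-dataset-items` path from the OpenAPI specification below, and the token is sent as a Bearer header, which the Apify API accepts in place of the `token` query parameter.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "tugelbay~article-extractor"


def build_sync_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) a POST to the run-sync-get-dataset-items endpoint."""
    return urllib.request.Request(
        url=f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items",
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def run_sync(token: str, run_input: dict, timeout_s: float = 300.0) -> list:
    """Send the request, wait for the run, and decode the returned dataset items."""
    request = build_sync_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.load(response)
```

The synchronous endpoint keeps the connection open until the run finishes, so for long batches the asynchronous `/runs` endpoint in the specification may be a better fit.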

## CLI example

```bash
echo '{
  "urls": [
    {
      "url": "https://blog.apify.com/what-is-web-scraping/"
    }
  ]
}' |
apify call tugelbay/article-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=tugelbay/article-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Article Extraction API",
        "description": "Extract clean article text and metadata from URLs as Markdown, text, or HTML for RAG, AI agents, monitoring, and research. Guide: https://konabayev.com/tools/article-extractor/?utm_source=apify_info&utm_medium=referral&utm_campaign=article-extractor",
        "version": "1.0",
        "x-build-id": "IWlkbKC2yDC7hYJj7"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/tugelbay~article-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-tugelbay-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/tugelbay~article-extractor/runs": {
            "post": {
                "operationId": "runs-sync-tugelbay-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/tugelbay~article-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-tugelbay-article-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns the OUTPUT record from the key-value store in the response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls"
                ],
                "properties": {
                    "urls": {
                        "title": "URLs to extract",
                        "type": "array",
                        "description": "List of article/page URLs to extract content from. Supports news sites, blogs, documentation, and any content page.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "outputFormat": {
                        "title": "Output format",
                        "enum": [
                            "markdown",
                            "text",
                            "html"
                        ],
                        "type": "string",
                        "description": "Format for the extracted article text",
                        "default": "markdown"
                    },
                    "maxItems": {
                        "title": "Max articles",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "Maximum number of articles to extract",
                        "default": 10
                    },
                    "extractImages": {
                        "title": "Extract images",
                        "type": "boolean",
                        "description": "Include image URLs found in the article",
                        "default": true
                    },
                    "extractLinks": {
                        "title": "Extract links",
                        "type": "boolean",
                        "description": "Include links found in the article",
                        "default": false
                    },
                    "timeout": {
                        "title": "Timeout per page (seconds)",
                        "minimum": 5,
                        "maximum": 120,
                        "type": "integer",
                        "description": "Maximum time, in seconds, to wait for each page to load",
                        "default": 30
                    },
                    "maxConcurrency": {
                        "title": "Max concurrency",
                        "minimum": 1,
                        "maximum": 50,
                        "type": "integer",
                        "description": "Number of pages to process simultaneously. Keep the default for reliability; raise it only for fast, stable sites.",
                        "default": 5
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Proxy settings for accessing geo-restricted or protected content"
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
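
The definition above can be exercised without the client library. The following is a minimal sketch, not the Actor author's code: it builds a request body that respects the limits in `inputSchema`, prepares a `POST` for any of the three run endpoints, and reads `defaultDatasetId` from a body shaped like `runsResponseSchema`. The base URL `https://api.apify.com/v2` is an assumption (it is not stated in this part of the definition), and the helper names `build_input`, `build_request`, `dataset_id_from_run`, and `fetch_articles` are hypothetical.

```python
import json
import urllib.parse
import urllib.request

# Assumption: the public Apify API base URL; this part of the spec does not state it.
API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/tugelbay~article-extractor"

# (minimum, maximum) bounds copied from inputSchema for the integer fields.
INT_BOUNDS = {"maxItems": (1, 10000), "timeout": (5, 120), "maxConcurrency": (1, 50)}
OUTPUT_FORMATS = ("markdown", "text", "html")


def build_input(urls, output_format="markdown", **options):
    """Return a dict shaped like inputSchema, raising ValueError on bad values."""
    # The schema only marks urls as required; rejecting an empty list is an
    # extra check made by this sketch.
    if not urls:
        raise ValueError("urls is required")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError(f"outputFormat must be one of {OUTPUT_FORMATS}")
    for name, value in options.items():
        if name in INT_BOUNDS:
            low, high = INT_BOUNDS[name]
            if not low <= value <= high:
                raise ValueError(f"{name} must be between {low} and {high}")
    # Each entry of urls is an object with a required "url" key.
    payload = {"urls": [{"url": u} for u in urls], "outputFormat": output_format}
    payload.update(options)  # e.g. extractLinks, proxyConfiguration
    return payload


def build_request(endpoint, token, payload):
    """Prepare (but do not send) a POST for one of the run endpoints."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}{ACTOR_PATH}/{endpoint}?{query}"
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url, data=body, method="POST", headers={"Content-Type": "application/json"}
    )


def dataset_id_from_run(response):
    """Read defaultDatasetId from a body shaped like runsResponseSchema."""
    return response["data"]["defaultDatasetId"]


def fetch_articles(token, urls, **options):
    """Run the Actor synchronously and return the parsed JSON response body."""
    request = build_request(
        "run-sync-get-dataset-items", token, build_input(urls, **options)
    )
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)
```

Calling `fetch_articles(token, ["https://example.com/post"], maxItems=1)` would send a real request and may incur charges, so the sketch keeps validation and request building separate from the network call. For long batches, passing `"runs"` to `build_request` and polling with the ID from `dataset_id_from_run` is likely a better fit than holding a synchronous connection open.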
