# Smart Article Extractor (`lukaskrivka/article-extractor-smart`) Actor

📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.

- **URL**: https://apify.com/lukaskrivka/article-extractor-smart.md
- **Developed by:** [Lukáš Křivka](https://apify.com/lukaskrivka) (Apify)
- **Categories:** News, AI
- **Stats:** 7,559 total users, 427 monthly users, 99.8% runs succeeded, 189 bookmarks
- **User rating**: 4.15 out of 5 stars

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher your subscription plan.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### Smart Article Extractor 
Smart Article Extractor scrapes articles from any academic, scientific, or news website or blog with just a single click. It uses a smart algorithm to decide what pages are actually articles and automatically extracts information from them.

### What does Smart Article Extractor do?
If you want to download articles from websites, this tool will help you extract content using smart scraping features:

✅ Allows opening pages with a browser (Puppeteer) which can wait for dynamically loaded data

✅ Allows extraction of articles from any number of URLs

✅ Smart article recognition - the extractor can decide what pages on a website are in fact articles to be scraped (this function is customizable)

✅ Additional filters - date of articles, minimum words, and more

✅ Allows custom scraping function - you can add/overwrite your own fields from the parsed HTML

✅ Allows usage of Google Bot headers (bypassing paywalls)

### Why extract articles with Smart Article Extractor?
👉 Academic research: You can use Smart Article Extractor to download multiple articles and build a corpus from them for research and article citations.

👉 Journalism: If you want to know more about how extracting articles with this tool can help text analysis and data journalism, you might like to read [Terror or Clickbait?](https://apify.com/success-stories/terror-or-clickbait) or [Czech media and their word choices](https://blog.apify.com/czech-media-and-their-word-choices-before-and-after-the-russian-invasion-of-ukraine-in-february-2022/).

👉 Fight fake news: Monitor content by selected media to react promptly if they publish misinformation.

👉 Save time: Whatever your reason for collecting articles with Smart Article Extractor, you will definitely save a lot of time and energy.

#### Is it legal to extract articles?
Extracting articles is legal, as you are scraping publicly available content. Please be aware that most articles are protected by copyright laws. Before you publish extracted articles anywhere, check the terms of use of the scraped website.


### How many results can you scrape with Smart Article Extractor?

Smart Article Extractor can return thousands of results on average. However, you have to keep in mind that scraping news websites has many variables to it and may cause the results to fluctuate case by case. There’s no one-size-fits-all-use-cases number. The maximum number of results may vary depending on the complexity of the input, location, and other factors. Some of the most frequent cases are:

- website gives a different number of results depending on the type/value of the input
- website has an internal limit that no scraper can cross
- scraper has a limit that we are working on improving

Therefore, while we regularly run Actor tests to keep the benchmarks in check, the results may also fluctuate without our knowing. The best way to know for sure for your particular use case is to do a test run yourself.

### How much will scraping articles with Smart Article Extractor cost you?

When it comes to scraping, it can be challenging to estimate the resources needed to extract data as use cases may vary significantly. That's why the best course of action is to run a test scrape with a small sample of input data and limited output. You’ll get your price per scrape, which you’ll then multiply by the number of scrapes you intend to do. 

[Watch this video](https://www.youtube.com/watch?v=-wyz2iscZ30) for a few helpful tips. And don't forget that choosing a higher plan will save you money in the long run.

⚠️ This can be a high-consumption Actor if you don't set limits. Please make sure you set a compute unit limit in the `Limit CU` consumption field (the `stopAfterCUs` input). ⚠️
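
The test-run approach described above can be turned into a rough estimate. The sketch below is only an illustration: it assumes cost scales roughly linearly with the number of articles, which may not hold for every website. The function name and all numbers are hypothetical.

```python
def estimate_total_cost(test_cost_usd: float, test_articles: int, planned_articles: int) -> float:
    """Extrapolate the cost of a full scrape from a small test run (linear assumption)."""
    if test_articles <= 0:
        raise ValueError("The test run must have produced at least one article.")
    cost_per_article = test_cost_usd / test_articles
    return cost_per_article * planned_articles


# Hypothetical figures: a test run that cost $0.25 and returned 50 articles.
estimate = estimate_total_cost(test_cost_usd=0.25, test_articles=50, planned_articles=10_000)
print(f"Estimated cost: ${estimate:.2f}")
```

Treat the result as a starting point for choosing a compute unit limit, not as a quote.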

#### How do I extract articles with Smart Article Extractor?
Smart Article Extractor can be run as an [Apify Actor](https://apify.com/actors) on the Apify platform where it is seamlessly integrated with a nice input UI. You can also run it locally or on any other infrastructure.

On the Apify platform:
1. Click on *Try for free*.
2. Enter the URL of the website(s) you want to scrape (and other input fields to narrow down the search).
3. Click on *Save & Start*.
4. When Smart Article Extractor has finished, preview or download your results from the Output tab.

For more detailed instructions, read our [step-by-step guide](https://blog.apify.com/how-to-extract-and-download-news-articles-online) on how to extract articles.

### Output example
If you run Smart Article Extractor on the [Apify platform](https://apify.com), you can get the output in many formats, like JSON, CSV, XML, Excel, RSS, and more. Here is a JSON example:

```json
{
  "url": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "loadedUrl": "https://www.thetimes.co.uk/edition/news/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "title": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
  "softTitle": "Ex-MP Charlie Elphicke sang ‘I’m a naughty Tory’ after groping woman, court told",
  "date": "2020-07-07T12:13:00.000Z",
  "author": [
    "Fariha Karim"
  ],
  "publisher": null,
  "copyright": "Times Newspapers Limited 2020",
  "favicon": "/d/img/icons/favicon-ab3ea01fbe.ico",
  "description": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.The woman, who cannot be identified for legal reasons, told",
  "lang": "en",
  "canonicalLink": "https://www.thetimes.co.uk/article/ex-mp-charlie-elphicke-sang-i-m-a-naughty-tory-after-groping-woman-court-told-nnr6nlw89",
  "tags": [],
  "image": "https://www.thetimes.co.uk/imageserver/image/%2Fmethode%2Ftimes%2Fprod%2Fweb%2Fbin%2Fdfdec16c-bf85-11ea-bb37-3d3cce807650.jpg?crop=3023%2C1700%2C238%2C316&resize=685",
  "videos": [],
  "links": [],
  "text": "A woman broke down in tears as she told a court today how a former Tory MP sexually assaulted her at his home while his children were in bed.\n\nThe woman, who cannot be identified for legal reasons, told Southwark crown court that Charlie Elphicke had invited her for a drink in 2007 while his wife Natalie was away on a business trip.\n\nShe said that the children were in bed and she had a cup of tea while Mr Elphicke drank wine in the garden and they chatted.\n\nAfter about an hour, she said, “the weather changed so he suggested they go inside to the lounge” and they shared a £40 bottle of wine.\n\nShe said they carried on talking in the living room"
}
````
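
Some fields in the output can be relative, such as `favicon` in the example above. A minimal Python sketch, assuming you post-process the items yourself, resolves it against `loadedUrl` with the standard library. The helper name `absolute_favicon` and the shortened article URL are hypothetical.

```python
from typing import Optional
from urllib.parse import urljoin


def absolute_favicon(item: dict) -> Optional[str]:
    """Resolve a possibly relative favicon path against the page URL."""
    favicon = item.get("favicon")
    if not favicon:
        return None
    # Prefer the URL that was actually loaded; fall back to the requested one.
    base = item.get("loadedUrl") or item.get("url")
    return urljoin(base, favicon)


item = {
    "loadedUrl": "https://www.thetimes.co.uk/edition/news/example-article",
    "favicon": "/d/img/icons/favicon-ab3ea01fbe.ico",
}
print(absolute_favicon(item))
# https://www.thetimes.co.uk/d/img/icons/favicon-ab3ea01fbe.ico
```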

#### Extend output function

You can use this optional function to update the default output of this Actor. This function gets a jQuery handle `$` as an argument, so you can choose what data from the page you want to scrape. It also receives the `currentItem` parameter, which is the default output parsed by the scraper so you can explore any fields. The output from this function will get merged with the default output.

The return value of this function has to be an object!

You can return fields to achieve 3 different things:

- Add a new field - Return an object with a field that is not in the default output
- Change a field - Return an existing field with a new value
- Remove a field - Return an existing field with a value `undefined`

Let's say you want to accomplish this:

- Remove `links` and `videos` fields from the output
- Add a `pageTitle` field
- Change the date selector (in rare cases the scraper is not able to find it)
- Save the original date parsed so you can compare with your date

```javascript
($, currentItem) => {
    return {
        links: undefined,
        videos: undefined,
        pageTitle: $('title').text(),
        date: $('.my-date-selector').text(),
        originalDate: currentItem.date,
    }
}
```
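
The add/change/remove rules can be modeled in a few lines of Python. This is only an illustration of the merge semantics described above, not the Actor's implementation. Python has no `undefined`, so a hypothetical `REMOVE` sentinel stands in for it, and `merge_output` is a made-up helper name.

```python
# Sentinel standing in for JavaScript's `undefined` (hypothetical, for illustration).
REMOVE = object()


def merge_output(default_item: dict, extended: dict) -> dict:
    """Merge custom fields into the default item and drop REMOVE-marked keys."""
    merged = {**default_item, **extended}
    return {key: value for key, value in merged.items() if value is not REMOVE}


default_item = {
    "title": "Example",
    "date": "2020-07-07T12:13:00.000Z",
    "links": [],
    "videos": [],
}
extended = {
    "links": REMOVE,                       # remove a field
    "videos": REMOVE,                      # remove a field
    "pageTitle": "Example | Site",         # add a field
    "date": "7 July 2020",                 # change a field
    "originalDate": default_item["date"],  # keep the parsed value for comparison
}
print(merge_output(default_item, extended))
```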

### Integrations and Smart Article Extractor

Last but not least, Smart Article Extractor can be connected with almost any cloud service or web app thanks to <a href="https://apify.com/integrations" target="_blank">integrations on the Apify platform</a>. You can integrate with Make, Zapier, Slack, Airbyte, GitHub, Google Sheets, Google Drive, <a href="https://docs.apify.com/integrations" target="_blank">and more</a>. Or you can use <a href="https://docs.apify.com/integrations/webhooks" target="_blank">webhooks</a> to carry out an action whenever an event occurs, e.g. get a notification whenever Smart Article Extractor successfully finishes a run.

### Using Smart Article Extractor with the Apify API

The Apify API gives you programmatic access to the Apify platform. The API is organized around RESTful HTTP endpoints that enable you to manage, schedule, and run Apify Actors. The API also lets you access any datasets, monitor Actor performance, fetch results, create and update versions, and more.

To access the API using Node.js, use the `apify-client` npm package. To access the API using Python, use the `apify-client` PyPI package.

Check out the <a href="https://docs.apify.com/api/v2" target="_blank">Apify API reference</a> docs for full details or click on the <a href="https://apify.com/lukaskrivka/article-extractor-smart/api" target="_blank">API tab</a> for code examples.
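
If you would rather not install a client library, the REST endpoint can be called directly. The sketch below only builds the request with the Python standard library and does not send it. It uses the `run-sync-get-dataset-items` endpoint and the `token` query parameter listed in this Actor's OpenAPI specification. The helper name `build_run_request` and the small input are illustrative assumptions.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/lukaskrivka~article-extractor-smart/run-sync-get-dataset-items"


def build_run_request(token: str, actor_input: dict) -> Request:
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(url, data=body, headers={"Content-Type": "application/json"}, method="POST")


request = build_run_request(
    "<YOUR_API_TOKEN>",
    {"startUrls": [{"url": "https://www.theguardian.com"}], "maxArticlesPerCrawl": 5},
)
print(request.get_method(), request.full_url)
# To actually run the Actor, pass `request` to urllib.request.urlopen().
```

Because this endpoint waits for the run to finish, keeping limits such as `maxArticlesPerCrawl` low is a sensible default for synchronous calls.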

### Not your cup of tea? Build your own scraper

Smart Article Extractor doesn’t do exactly what you need? You can always build your own! We have various [scraper templates](https://apify.com/templates) in Python, JavaScript, and TypeScript to get you started. Alternatively, you can write it from scratch using our [open-source library Crawlee](https://crawlee.dev/). You can keep the scraper to yourself or make it public by adding it to Apify Store (and [find users](https://blog.apify.com/make-regular-passive-income-developing-web-automation-actors-b0392278d085/) for it).

Or let us know if you need a [custom scraping solution](https://apify.com/custom-solutions).

### Your feedback

We’re always working on improving the performance of our Actors. So if you’ve got any technical feedback for Smart Article Extractor or simply found a bug, please create an issue on the Actor’s [Issues tab](https://console.apify.com/actors/hy5TYiCBwQ9o8uRKG/issues) in Apify Console.

# Actor input Schema

## `startUrls` (type: `array`):

These could be the main page URL or any category/subpage URL, e.g. https://www.bbc.com/. Article pages are detected and crawled from these. If you prefer to use direct article URLs, use the `articleUrls` input instead.

## `articleUrls` (type: `array`):

These are direct URLs for the articles to be extracted, e.g. https://www.bbc.com/news/uk-62836057. No extra pages are crawled from article pages.

## `onlyNewArticles` (type: `boolean`):

This option is only viable for smaller runs. If you plan to use this on a large scale, use the 'Only new articles (saved per domain)' option below instead. If this function is selected, the extractor will only scrape new articles each time you run it. (Scraped URLs are saved in a dataset named `articles-state`, and are compared with new ones.)

## `onlyNewArticlesPerDomain` (type: `boolean`):

If this function is selected, the extractor will scrape only new articles each time you run it. (Scraped articles are saved in one dataset per domain, named 'ARTICLES-SCRAPED-domain', and compared with new ones.)
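
Both "only new articles" options come down to comparing candidate URLs with a stored set of already scraped ones. The following sketch shows that idea with an in-memory set. How the Actor actually stores and loads its state is not shown here, and `select_new_urls` is a hypothetical helper.

```python
def select_new_urls(candidate_urls, seen_urls):
    """Return URLs that were not scraped before and record them in seen_urls."""
    new_urls = []
    for url in candidate_urls:
        if url not in seen_urls:
            seen_urls.add(url)
            new_urls.append(url)
    return new_urls


# State as it might look after a previous run.
seen = {"https://www.bbc.com/news/uk-62836057"}
first = select_new_urls(
    ["https://www.bbc.com/news/uk-62836057", "https://www.bbc.com/news/example-1"],
    seen,
)
second = select_new_urls(["https://www.bbc.com/news/example-1"], seen)
print(first)   # ['https://www.bbc.com/news/example-1']
print(second)  # []
```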

## `onlyInsideArticles` (type: `boolean`):

If this function is selected, the extractor will only scrape articles that are on the domain from where they are linked. If the domain presents links to articles on different domains, those articles will not be scraped, e.g. https://www.bbc.com/ vs. https://www.bbc.co.uk/.
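
A same-domain check like this can be approximated by comparing hostnames. The sketch assumes an exact hostname match. The Actor may compare domains differently (for example, by ignoring a `www.` prefix), and `is_same_domain` is a hypothetical name.

```python
from urllib.parse import urlparse


def is_same_domain(source_url: str, article_url: str) -> bool:
    """Return True when both URLs share the same hostname."""
    source_host = urlparse(source_url).hostname
    article_host = urlparse(article_url).hostname
    return source_host is not None and source_host == article_host


print(is_same_domain("https://www.bbc.com/", "https://www.bbc.com/news/uk-62836057"))  # True
print(is_same_domain("https://www.bbc.com/", "https://www.bbc.co.uk/news"))            # False
```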

## `enqueueFromArticles` (type: `boolean`):

Normally, the scraper only extracts articles from category pages. This option allows the scraper to also extract articles linked within articles.

## `crawlWholeSubdomain` (type: `boolean`):

Automatically enqueue categories and articles from the whole subdomain with the same path. E.g. if the Start URL is https://apify.com/store, it will enqueue all pages starting with https://apify.com/store.

## `onlySubdomainArticles` (type: `boolean`):

Only loads articles whose URL begins with the same path as the Start URL. E.g. if the Start URL is https://apify.com/store, it will only load articles starting with https://apify.com/store.

## `scanSitemaps` (type: `boolean`):

We recommend using `Sitemap URLs` instead.
If this function is selected, the extractor will scan different sitemaps from the initial article URL. Keep in mind that this option can lead to the loading of a huge amount of (sometimes old) articles, in which case the time and cost of the scrape will increase.

## `sitemapUrls` (type: `array`):

You can provide selected sitemap URLs that include the articles you need to extract.

## `saveHtml` (type: `boolean`):

If this function is selected, the scraper will save the full HTML of the article page, but this will make the data less readable.

## `saveHtmlAsLink` (type: `boolean`):

If this function is selected, the scraper will save the full HTML of the article page as a URL to keep the dataset clean and small.

## `saveSnapshots` (type: `boolean`):

Stores a screenshot of each article page in the key-value store and provides its URL as `screenshotUrl`. Useful for debugging.

## `useGoogleBotHeaders` (type: `boolean`):

This option will allow you to bypass protection and paywalls on some websites. Use with caution as it might lead to getting blocked.

## `minWords` (type: `integer`):

The article needs to contain at least this number of words to be extracted.

## `dateFrom` (type: `string`):

Only articles from this day on will be scraped. If empty, all articles will be scraped. The format is either an absolute date as YYYY-MM-DD, e.g. 2019-12-31, or a relative period given as a number and a unit, e.g. 1 week or 20 days.
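
To make the two accepted shapes concrete, here is a sketch of how such a value could be turned into a cutoff date. It is an assumption about the semantics, not the Actor's parser. It handles only the day and week units mentioned above, and `resolve_date_from` is a hypothetical helper.

```python
import re
from datetime import date, timedelta

_RELATIVE = re.compile(r"^\s*(\d+)\s*(day|week)s?\s*$", re.IGNORECASE)
_DAYS_PER_UNIT = {"day": 1, "week": 7}


def resolve_date_from(value: str, today: date) -> date:
    """Turn 'YYYY-MM-DD' or a relative value like '1 week' into a cutoff date."""
    match = _RELATIVE.match(value)
    if match:
        amount, unit = int(match.group(1)), match.group(2).lower()
        return today - timedelta(days=amount * _DAYS_PER_UNIT[unit])
    # Raises ValueError if the value is not a valid ISO date.
    return date.fromisoformat(value.strip())


today = date(2024, 1, 31)
print(resolve_date_from("2019-12-31", today))  # 2019-12-31
print(resolve_date_from("1 week", today))      # 2024-01-24
print(resolve_date_from("20 days", today))     # 2024-01-11
```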

## `onlyArticlesForLastDays` (type: `integer`):

Only get posts that were published in the last X days from the time the scraping starts. Use either this or the absolute date (`dateFrom`).

## `mustHaveDate` (type: `boolean`):

If checked, the article must have a date of release to be extracted.
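
The word-count and date filters can be pictured as one check applied to each extracted article. This sketch assumes the output shape shown in the output example (`text` and an ISO 8601 `date`). It also assumes that undated articles pass unless a date is required, which the source does not specify. `passes_filters` is a hypothetical helper.

```python
from datetime import datetime, timezone
from typing import Optional


def passes_filters(article: dict, min_words: int = 0,
                   date_from: Optional[datetime] = None,
                   must_have_date: bool = False) -> bool:
    """Apply word-count and date filters to one extracted article."""
    text = article.get("text") or ""
    if len(text.split()) < min_words:
        return False
    raw_date = article.get("date")
    if raw_date is None:
        # Assumption: undated articles pass unless a date is required.
        return not must_have_date
    # Replace the trailing "Z" so older Python versions can parse the timestamp.
    published = datetime.fromisoformat(raw_date.replace("Z", "+00:00"))
    return date_from is None or published >= date_from


article = {"text": "word " * 200, "date": "2020-07-07T12:13:00.000Z"}
cutoff = datetime(2020, 1, 1, tzinfo=timezone.utc)
print(passes_filters(article, min_words=150, date_from=cutoff, must_have_date=True))  # True
```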

## `isUrlArticleDefinition` (type: `object`):

Here you can input JSON settings to define what URLs should be considered articles by the scraper. If any of them is `true`, then the link will be opened and the article extracted.
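
The "any rule matches" logic can be sketched as follows, using the `minDashes`, `hasDate`, and `linkIncludes` keys from the input example. How the Actor evaluates each key is an assumption here, in particular the date pattern used for `hasDate`. `looks_like_article` is a hypothetical helper.

```python
import re

# Assumed date shapes in URLs, e.g. /2019/12/31/ or 2019-12-31.
_URL_DATE = re.compile(r"\d{4}[/-]\d{1,2}[/-]\d{1,2}")


def looks_like_article(url: str, definition: dict) -> bool:
    """Return True if any rule in the definition matches the URL."""
    lowered = url.lower()
    checks = []
    if "minDashes" in definition:
        checks.append(lowered.count("-") >= definition["minDashes"])
    if definition.get("hasDate"):
        checks.append(_URL_DATE.search(lowered) is not None)
    if definition.get("linkIncludes"):
        checks.append(any(part.lower() in lowered for part in definition["linkIncludes"]))
    return any(checks)


definition = {
    "minDashes": 4,
    "hasDate": True,
    "linkIncludes": ["article", "storyid", "?p=", "id=", "/fpss/track", ".html", "/content/"],
}
print(looks_like_article("https://example.com/2024/01/15/some-story", definition))  # True
print(looks_like_article("https://example.com/about", definition))                  # False
```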

## `pseudoUrls` (type: `array`):

This function can be used to enqueue more pages, i.e. include more links like pagination or categories. This doesn't work for articles, as they are recognized by the recognition system.

## `linkSelector` (type: `string`):

You can limit the `<a>` tags whose links will be enqueued. This field is empty by default. Add `a.some-class` to activate it.

## `maxDepth` (type: `integer`):

Maximum depth of crawling, i.e. how many times the scraper picks up a link to other webpages. Level 0 refers to the start URLs, level 1 to the first-level links, and so on. This is only valid for pseudo URLs.

## `maxPagesPerCrawl` (type: `integer`):

Maximum number of total pages crawled. It includes the home page, pagination pages, invalid articles, and so on. The crawler will stop automatically after reaching this number.

## `maxArticlesPerCrawl` (type: `integer`):

Maximum number of valid articles scraped. The crawler will stop automatically after reaching this number.

## `maxArticlesPerStartUrl` (type: `integer`):

Maximum number of articles scraped per start URL.

## `maxConcurrency` (type: `integer`):

You can limit the speed of the scraper to avoid getting blocked.

## `proxyConfiguration` (type: `object`):

Proxy configuration

## `overrideProxyGroup` (type: `string`):

If you want to override the default proxy group, you can specify it here. This is useful if you want to use a different proxy group for each crawler.

## `useBrowser` (type: `boolean`):

This option is more expensive, but it allows you to evaluate JavaScript and wait for dynamically loaded data.

## `pageWaitMs` (type: `integer`):

How many milliseconds to wait on each page before extracting data.

## `navigationWaitUntil` (type: `string`):

Which event to wait for before the navigation is considered finished. `domcontentloaded` fires when the initial HTML is loaded and is the fastest. `load` fires when JS is executed and is the default. `networkidle0` and `networkidle2` wait for background network activity to settle but can cause infinite loading.

## `pageWaitSelectorCategory` (type: `string`):

Selector to wait for on each category page before extracting data.

## `pageWaitSelectorArticle` (type: `string`):

Selector to wait for on each article page before extracting data.

## `scrollToBottom` (type: `boolean`):

Scroll to the bottom of the page, loading dynamic articles.

## `scrollToBottomButtonSelector` (type: `string`):

CSS selector for a button to load more articles.

## `scrollToBottomMaxSecs` (type: `integer`):

Limit for how long the scrolling can run so it does not go on indefinitely.

## `extendOutputFunction` (type: `string`):

This function allows you to merge your custom extraction with the default one. You can only return an object from this function. This object will be merged/overwritten with the default output for each article.

## `stopAfterCUs` (type: `integer`):

The scraper will stop running after reaching this number of compute units.

## `notificationEmails` (type: `array`):

Notifications will be sent to these email addresses.

## `notifyAfterCUs` (type: `integer`):

The scraper will send notifications to the provided email when it reaches this number of CUs.

## `notifyAfterCUsPeriodically` (type: `integer`):

The scraper will send notifications to the provided email every time this number of CUs is reached since the last notification.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://www.theguardian.com"
    }
  ],
  "articleUrls": [
    {
      "url": "https://www.bbc.com/news/uk-62836057"
    }
  ],
  "onlyNewArticles": false,
  "onlyNewArticlesPerDomain": false,
  "onlyInsideArticles": true,
  "enqueueFromArticles": false,
  "crawlWholeSubdomain": false,
  "onlySubdomainArticles": false,
  "scanSitemaps": false,
  "sitemapUrls": [
    {
      "url": "https://www.theguardian.com/sitemaps/news.xml"
    }
  ],
  "saveSnapshots": false,
  "useGoogleBotHeaders": false,
  "minWords": 150,
  "dateFrom": "2024-01-01",
  "onlyArticlesForLastDays": 7,
  "mustHaveDate": true,
  "isUrlArticleDefinition": {
    "minDashes": 4,
    "hasDate": true,
    "linkIncludes": [
      "article",
      "storyid",
      "?p=",
      "id=",
      "/fpss/track",
      ".html",
      "/content/"
    ]
  },
  "pseudoUrls": [
    {
      "purl": "https://www.theguardian.com/technology/[.*]"
    }
  ],
  "linkSelector": "a.article-link",
  "maxDepth": 3,
  "maxPagesPerCrawl": 100,
  "maxArticlesPerCrawl": 50,
  "maxArticlesPerStartUrl": 20,
  "maxConcurrency": 5,
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "overrideProxyGroup": "SHADER",
  "useBrowser": false,
  "pageWaitMs": 1000,
  "navigationWaitUntil": "load",
  "pageWaitSelectorCategory": ".article-list",
  "pageWaitSelectorArticle": ".article-body",
  "scrollToBottomButtonSelector": ".load-more-button",
  "scrollToBottomMaxSecs": 30,
  "extendOutputFunction": "($) => {\n    const result = {};\n    // Uncomment to add a title to the output\n    // result.pageTitle = $('title').text().trim();\n\n    return result;\n}",
  "stopAfterCUs": 10,
  "notificationEmails": [
    "user@example.com"
  ],
  "notifyAfterCUs": 5,
  "notifyAfterCUsPeriodically": 5
}
```

# Actor output Schema

## `dataset` (type: `string`):

Dataset containing all scraped articles

## `files` (type: `string`):

Key-value store containing saved HTML pages and screenshots

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://www.theguardian.com"
        }
    ],
    "isUrlArticleDefinition": {
        "minDashes": 4,
        "hasDate": true,
        "linkIncludes": [
            "article",
            "storyid",
            "?p=",
            "id=",
            "/fpss/track",
            ".html",
            "/content/"
        ]
    },
    "proxyConfiguration": {
        "useApifyProxy": true
    },
    "extendOutputFunction": ($) => {
        const result = {};
        // Uncomment to add a title to the output
        // result.pageTitle = $('title').text().trim();
    
        return result;
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("lukaskrivka/article-extractor-smart").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://www.theguardian.com" }],
    "isUrlArticleDefinition": {
        "minDashes": 4,
        "hasDate": True,
        "linkIncludes": [
            "article",
            "storyid",
            "?p=",
            "id=",
            "/fpss/track",
            ".html",
            "/content/",
        ],
    },
    "proxyConfiguration": { "useApifyProxy": True },
    "extendOutputFunction": """($) => {
    const result = {};
    // Uncomment to add a title to the output
    // result.pageTitle = $('title').text().trim();

    return result;
}""",
}

# Run the Actor and wait for it to finish
run = client.actor("lukaskrivka/article-extractor-smart").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://www.theguardian.com"
    }
  ],
  "isUrlArticleDefinition": {
    "minDashes": 4,
    "hasDate": true,
    "linkIncludes": [
      "article",
      "storyid",
      "?p=",
      "id=",
      "/fpss/track",
      ".html",
      "/content/"
    ]
  },
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "extendOutputFunction": "($) => {\\n    const result = {};\\n    // Uncomment to add a title to the output\\n    // result.pageTitle = $('\''title'\'').text().trim();\\n\\n    return result;\\n}"
}' |
apify call lukaskrivka/article-extractor-smart --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=lukaskrivka/article-extractor-smart",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Smart Article Extractor",
        "description": "📰 Smart Article Extractor extracts articles from any scientific, academic, or news website with just one click. The extractor crawls the whole website and automatically distinguishes articles from other web pages. Download your data as HTML table, JSON, Excel, RSS feed, and more.",
        "version": "1.0",
        "x-build-id": "PBYpgrVrgU3tdckGJ"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/lukaskrivka~article-extractor-smart/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-lukaskrivka-article-extractor-smart",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/lukaskrivka~article-extractor-smart/runs": {
            "post": {
                "operationId": "runs-sync-lukaskrivka-article-extractor-smart",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/lukaskrivka~article-extractor-smart/run-sync": {
            "post": {
                "operationId": "run-sync-lukaskrivka-article-extractor-smart",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrls": {
                        "title": "Website/category URLs",
                        "type": "array",
                        "description": "These could be the main page URL or any category/subpage URL, e.g. https://www.bbc.com/. Article pages are detected and crawled from these. If you prefer to use direct article URLs, use `articleUrls` input instead",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "articleUrls": {
                        "title": "Article URLs",
                        "type": "array",
                        "description": "These are direct URLs for the articles to be extracted, e.g. https://www.bbc.com/news/uk-62836057. No extra pages are crawled from article pages.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "onlyNewArticles": {
                        "title": "Only new articles (only for small runs)",
                        "type": "boolean",
                        "description": "This option is only viable for smaller runs. If you plan to use this on a large scale, use the 'Only new articles (saved per domain)' option below instead. If this function is selected, the extractor will only scrape new articles each time you run it. (Scraped URLs are saved in a dataset named `articles-state`, and are compared with new ones.)",
                        "default": false
                    },
                    "onlyNewArticlesPerDomain": {
                        "title": "Only new articles (saved per domain, preferable)",
                        "type": "boolean",
                        "description": "If this function is selected, the extractor will only scrape only new articles each time you run it. (Scraped articles are saved in one dataset, named 'ARTICLES-SCRAPED-domain', per each domain, and compared with new ones.)",
                        "default": false
                    },
                    "onlyInsideArticles": {
                        "title": "Only inside domain articles",
                        "type": "boolean",
                        "description": "If this function is selected, the extractor will only scrape articles that are on the domain from where they are linked. If the domain presents links to articles on different domains, those articles will not be scraped, e.g. https://www.bbc.com/ vs. https://www.bbc.co.uk/.",
                        "default": true
                    },
                    "enqueueFromArticles": {
                        "title": "Enqueue articles from articles",
                        "type": "boolean",
                        "description": "Normally, the scraper only extracts articles from category pages. This option allows the scraper to also extract articles linked within articles.",
                        "default": false
                    },
                    "crawlWholeSubdomain": {
                        "title": "Crawl whole subdomain (same base as Start URL)",
                        "type": "boolean",
                        "description": "Automatically enqueue categories and articles from whole subdomain with the same path. E.g. if Start URL is https://apify.com/store, it will enqueue all pages starting with https://apify.com/store",
                        "default": false
                    },
                    "onlySubdomainArticles": {
                        "title": "Limit articles to only from subdomain",
                        "type": "boolean",
                        "description": "Only loads articles which URL begins with the same path as Start URL. E.g. if Start URL is https://apify.com/store, it will only load articles starting with https://apify.com/store",
                        "default": false
                    },
                    "scanSitemaps": {
                        "title": "Find articles in sitemaps (caution)",
                        "type": "boolean",
                        "description": "We recommend using `Sitemap URLs` instead. \n If this function is selected, the extractor will scan different sitemaps from the initial article URL. Keep in mind that this option can lead to the loading of a huge amount of (sometimes old) articles, in which case the time and cost of the scrape will increase.",
                        "default": false
                    },
                    "sitemapUrls": {
                        "title": "Sitemap URLs (safer)",
                        "type": "array",
                        "description": "You can provide selected sitemap URLs that include the articles you need to extract.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "saveHtml": {
                        "title": "Save full HTML",
                        "type": "boolean",
                        "description": "If this function is selected, the scraper will save the full HTML of the article page, but this will make the data less readable."
                    },
                    "saveHtmlAsLink": {
                        "title": "Save full HTML (only as link to it)",
                        "type": "boolean",
                        "description": "If this function is selected, the scraper will save the full HTML of the article page as a URL to keep the dataset clean and small."
                    },
                    "saveSnapshots": {
                        "title": "Save screenshots of article pages (browser only)",
                        "type": "boolean",
                        "description": "Stores a screenshot for each article page to Key-Value Store and provides that as screenshotUrl. Useful for debugging.",
                        "default": false
                    },
                    "useGoogleBotHeaders": {
                        "title": "Use Googlebot headers",
                        "type": "boolean",
                        "description": "This option will allow you to bypass protection and paywalls on some websites. Use with caution as it might lead to getting blocked.",
                        "default": false
                    },
                    "minWords": {
                        "title": "Minimum words",
                        "type": "integer",
                        "description": "The article needs to contain at least this number of words to be extracted",
                        "default": 150
                    },
                    "dateFrom": {
                        "title": "Extract articles from [date]",
                        "type": "string",
                        "description": "Only articles from this day on will be scraped. If empty, all articles will be scraped. Format is YYYY-MM-DD, e.g. 2019-12-31, or number type e.g. 1 week or 20 days"
                    },
                    "onlyArticlesForLastDays": {
                        "title": "Only articles for last X days",
                        "type": "integer",
                        "description": "Only get posts that were published in the last X days from time the scraping starts. Use either this or the absolute date."
                    },
                    "mustHaveDate": {
                        "title": "Must have date",
                        "type": "boolean",
                        "description": "If checked, the article must have a date of release to be extracted.",
                        "default": true
                    },
                    "isUrlArticleDefinition": {
                        "title": "Is the URL an article?",
                        "type": "object",
                        "description": "Here you can input JSON settings to define what URLs should be considered articles by the scraper. If any of them is `true`, then the link will be opened and the article extracted."
                    },
                    "pseudoUrls": {
                        "title": "Pseudo URLs",
                        "type": "array",
                        "description": "This function can be used to enqueue more pages, i.e. include more links like pagination or categories. This doesn't work for articles, as they are recognized by the recognition system.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "linkSelector": {
                        "title": "Link selector",
                        "type": "string",
                        "description": "You can limit the <a> tags whose links will be enqueued. This field is empty by default. Add `a.some-class` to activate it"
                    },
                    "maxDepth": {
                        "title": "Max depth",
                        "type": "integer",
                        "description": "Maximum depth of crawling, i.e. how many times the scraper picks up a link to other webpages. Level 0 refers to the start URLs, 1 are the first level links, and so on. This is only valid for pseudo URLs"
                    },
                    "maxPagesPerCrawl": {
                        "title": "Max pages per crawl",
                        "type": "integer",
                        "description": "Maximum number of total pages crawled. It includes the home page, pagination pages, invalid articles, and so on. The crawler will stop automatically after reaching this number."
                    },
                    "maxArticlesPerCrawl": {
                        "title": "Max articles per crawl",
                        "type": "integer",
                        "description": "Maximum number of valid articles scraped. The crawler will stop automatically after reaching this number."
                    },
                    "maxArticlesPerStartUrl": {
                        "title": "Max articles per start URL",
                        "type": "integer",
                        "description": "Maximum number of articles scraped per start URL."
                    },
                    "maxConcurrency": {
                        "title": "Max concurrency",
                        "type": "integer",
                        "description": "You can limit the speed of the scraper to avoid getting blocked."
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Proxy configuration"
                    },
                    "overrideProxyGroup": {
                        "title": "Override proxy group",
                        "type": "string",
                        "description": "If you want to override the default proxy group, you can specify it here. This is useful if you want to use a different proxy group for each crawler."
                    },
                    "useBrowser": {
                        "title": "Use browser (Puppeteer)",
                        "type": "boolean",
                        "description": "This option is more expensive, but it allows you to evaluate JavaScript and wait for dynamically loaded data.",
                        "default": false
                    },
                    "pageWaitMs": {
                        "title": "Wait on each page (ms)",
                        "type": "integer",
                        "description": "How many milliseconds to wait on each page before extracting data"
                    },
                    "navigationWaitUntil": {
                        "title": "Wait until navigation event is finished",
                        "enum": [
                            "load",
                            "domcontentloaded",
                            "networkidle0",
                            "networkidle2"
                        ],
                        "type": "string",
                        "description": "What to wait until the navigation is finished. `domcontentloaded` happens when initial HTML loads and is fastest. `load` happens when JS is executed and it is default. `networkidle0`, `networkidle2` wait for background network but cannot cause infinite loading.",
                        "default": "load"
                    },
                    "pageWaitSelectorCategory": {
                        "title": "Wait for selector on each category page",
                        "type": "string",
                        "description": "For what selector to wait on each page before extracting data"
                    },
                    "pageWaitSelectorArticle": {
                        "title": "Wait for selector on each article page",
                        "type": "string",
                        "description": "For what selector to wait on each page before extracting data"
                    },
                    "scrollToBottom": {
                        "title": "Scroll to bottom of the page (infinite scroll)",
                        "type": "boolean",
                        "description": "Scroll to the botton of the page, loading dynamic articles."
                    },
                    "scrollToBottomButtonSelector": {
                        "title": "Scroll to bottom button selector",
                        "type": "string",
                        "description": "CSS selector for a button to load more articles"
                    },
                    "scrollToBottomMaxSecs": {
                        "title": "Scroll to bottom max seconds",
                        "type": "integer",
                        "description": "Limit for how long the scrolling can run so it does no go infinite."
                    },
                    "extendOutputFunction": {
                        "title": "Extend output function",
                        "type": "string",
                        "description": "This function allows you to merge your custom extraction with the default one. You can only return an object from this function. This object will be merged/overwritten with the default output for each article."
                    },
                    "stopAfterCUs": {
                        "title": "Limit CU consumption",
                        "type": "integer",
                        "description": "The scraper will stop running after reaching this number of compute units."
                    },
                    "notificationEmails": {
                        "title": "Emails address for notifications",
                        "type": "array",
                        "description": "Notifications will be sent to these email addresses.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "notifyAfterCUs": {
                        "title": "Notify after [number] CUs",
                        "type": "integer",
                        "description": "The scraper will send notifications to the provided email when it reaches this number of CUs."
                    },
                    "notifyAfterCUsPeriodically": {
                        "title": "Notify every [number] CUs",
                        "type": "integer",
                        "description": "The scraper will send notifications to the provided email every time this number of CUs is reached since the last notification."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
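
The `run-sync` operation above takes the Actor input as a JSON body and the API token as a `token` query parameter. The following is a minimal sketch of how such a request could be prepared with the Python standard library. The base URL `https://api.apify.com/v2` is an assumption (it does not appear in this part of the definition), and the helper names `build_input` and `build_run_sync_request` are hypothetical. The request is only constructed here, not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: the standard Apify API base URL; it is not shown in this part of the definition.
API_BASE = "https://api.apify.com/v2"
# Path taken from the OpenAPI definition above.
RUN_SYNC_PATH = "/acts/lukaskrivka~article-extractor-smart/run-sync"


def build_input(start_urls, **options):
    """Build an input object shaped like `inputSchema` (hypothetical helper)."""
    # URL lists in the schema are arrays of objects with a required `url` key.
    payload = {"startUrls": [{"url": url} for url in start_urls]}
    payload.update(options)
    return payload


def build_run_sync_request(token, payload):
    """Prepare, but do not send, the POST request for the run-sync endpoint."""
    url = f"{API_BASE}{RUN_SYNC_PATH}?{urlencode({'token': token})}"
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


example_input = build_input(
    ["https://www.bbc.com/"],
    onlyInsideArticles=True,
    mustHaveDate=True,
    minWords=150,
    maxArticlesPerCrawl=10,  # keep the example run small
)
example_request = build_run_sync_request("<YOUR_APIFY_TOKEN>", example_input)
```

Passing `example_request` to `urllib.request.urlopen` would start a real run and consume platform usage, so setting limits such as `maxArticlesPerCrawl` first is a sensible precaution.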

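Some constraints in `inputSchema` are easy to check before a run is started: URL lists must contain objects with a `url` string, integer fields must be integers, and `navigationWaitUntil` must be one of four enum values. The sketch below covers only those three rules and is not a full JSON Schema validator; the function name `find_input_problems` is hypothetical.

```python
# Field names and enum values are copied from the input schema above.
URL_LIST_FIELDS = ("startUrls", "articleUrls", "sitemapUrls", "pseudoUrls")
INTEGER_FIELDS = (
    "minWords", "onlyArticlesForLastDays", "maxDepth", "maxPagesPerCrawl",
    "maxArticlesPerCrawl", "maxArticlesPerStartUrl", "maxConcurrency",
    "pageWaitMs", "scrollToBottomMaxSecs", "stopAfterCUs",
    "notifyAfterCUs", "notifyAfterCUsPeriodically",
)
NAVIGATION_EVENTS = ("load", "domcontentloaded", "networkidle0", "networkidle2")


def find_input_problems(payload):
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for field in URL_LIST_FIELDS:
        for index, item in enumerate(payload.get(field, [])):
            if not isinstance(item, dict) or not isinstance(item.get("url"), str):
                problems.append(f"{field}[{index}] must be an object with a string 'url'")
    for field in INTEGER_FIELDS:
        value = payload.get(field)
        # bool is a subclass of int in Python, so it is rejected explicitly.
        if value is not None and (isinstance(value, bool) or not isinstance(value, int)):
            problems.append(f"{field} must be an integer")
    event = payload.get("navigationWaitUntil")
    if event is not None and event not in NAVIGATION_EVENTS:
        problems.append(f"navigationWaitUntil must be one of {', '.join(NAVIGATION_EVENTS)}")
    return problems
```

For example, an input that passes `startUrls` as plain strings instead of `{"url": ...}` objects would be reported by this check rather than being sent to the API.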