# Cheerio Scraper (`apify/cheerio-scraper`) Actor

Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using Node.js code. Supports both recursive crawling and lists of URLs. This Actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.

- **URL**: https://apify.com/apify/cheerio-scraper.md
- **Developed by:** [Apify](https://apify.com/apify) (Apify)
- **Categories:** Developer tools, Open source
- **Stats:** 17,453 total users, 1,173 monthly users, 100.0% runs succeeded, 294 bookmarks
- **User rating**: 4.60 out of 5 stars

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper with higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

Cheerio Scraper is a ready-made solution for crawling websites using plain HTTP requests. It retrieves the HTML pages, parses them using the [Cheerio](https://cheerio.js.org) Node.js library and lets you extract any data from them. Fast.

Cheerio is a server-side version of the popular [jQuery](https://jquery.com) library. It does not require a
browser but instead constructs a DOM from an HTML string. It then provides the user an API to work with that DOM.

Cheerio Scraper is ideal for scraping web pages that do not rely on client-side JavaScript to serve their content and can be up to 20 times faster than using a full-browser solution such as Puppeteer.

If you're unfamiliar with web scraping or web development in general,
you might prefer to start with the [**Scraping with Web Scraper**](https://docs.apify.com/tutorials/apify-scrapers/web-scraper) tutorial from the Apify documentation and then continue with [**Scraping with Cheerio Scraper**](https://docs.apify.com/tutorials/apify-scrapers/cheerio-scraper), a tutorial which will walk you through all the steps and provide a number of examples.

### Cost of usage

You can find the average usage cost for this Actor on the [pricing page](https://apify.com/pricing) under the `Which plan do I need?` section. Cheerio Scraper is equivalent to `Simple HTML pages` while Web Scraper, Puppeteer Scraper and Playwright Scraper are equivalent to `Full web pages`. These cost estimates are based on averages and might be lower or higher depending on how heavy the pages you scrape are.

### Usage

To get started with Cheerio Scraper, you only need two things. First, tell the scraper which web pages
it should load. Second, tell it how to extract data from each page.

The scraper starts by loading the pages specified in the [**Start URLs**](#start-urls) field.
You can make the scraper follow page links on the fly by setting a [**Link selector**](#link-selector), **[Glob Patterns](#glob-patterns)** and/or **[Pseudo-URLs](#pseudo-urls)** to tell the scraper which links it should add to the crawling queue. This is useful for the recursive crawling of entire websites, e.g. to find all products in an online store.

To tell the scraper how to extract data from web pages, you need to provide a [**Page function**](#page-function). This is JavaScript code that is executed for every web page loaded. Since the scraper does not use a full web browser, writing the **Page function** is equivalent to writing server-side Node.js code - it uses the server-side library [Cheerio](https://cheerio.js.org).

In summary, Cheerio Scraper works as follows:

1. Adds each [Start URL](#start-urls) to the crawling queue.
2. Fetches the first URL from the queue and constructs a DOM from the fetched HTML string.
3. Executes the [**Page function**](#page-function) on the loaded page and saves its results.
4. Optionally, finds all links from the page using the [**Link selector**](#link-selector).
   If a link matches any of the **[Glob Patterns](#glob-patterns)** and/or **[Pseudo-URLs](#pseudo-urls)** and has not yet been visited, adds it to the queue.
5. If there are more items in the queue, repeats step 2, otherwise finishes.
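
The steps above can be sketched as a toy loop over an in-memory "site". This is only an illustration of the queueing logic, not the scraper's actual implementation: the `site` object, the `crawl()` function, and the `shouldFollow` callback are all hypothetical stand-ins for HTTP fetching, the **Page function**, and the link-matching settings.

```javascript
// Hypothetical in-memory pages standing in for real HTTP responses.
const site = {
    'https://example.com/': { title: 'Home', links: ['https://example.com/a', 'https://example.com/b'] },
    'https://example.com/a': { title: 'Page A', links: ['https://example.com/', 'https://other.example.org/'] },
    'https://example.com/b': { title: 'Page B', links: ['https://example.com/a'] },
};

function crawl(startUrls, shouldFollow) {
    const queue = [...startUrls]; // step 1: seed the queue with the Start URLs
    const seen = new Set(startUrls);
    const results = [];

    // step 5: keep going while the queue has items
    while (queue.length > 0) {
        const url = queue.shift(); // step 2: take the next URL and "fetch" it
        const page = site[url];
        if (!page) continue;

        // step 3: the "page function" result for this page
        results.push({ url, title: page.title });

        // step 4: enqueue links that match and have not been seen yet
        for (const link of page.links) {
            if (shouldFollow(link) && !seen.has(link)) {
                seen.add(link);
                queue.push(link);
            }
        }
    }
    return results;
}

// Visits "/", then "/a", then "/b"; the other.example.org link is never followed.
const crawled = crawl(['https://example.com/'], (link) => link.startsWith('https://example.com/'));
console.log(crawled.map((r) => r.url));
```

The `seen` set is what prevents the loop from revisiting pages that link back to each other.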

Cheerio Scraper has a number of advanced configuration settings to improve performance, set cookies for login to websites, limit the number of records, etc. See their tooltips for more information.

Under the hood, Cheerio Scraper is built using the [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) class
from Crawlee. If you'd like to learn more about the inner workings of the scraper, see the respective documentation.

### Content types

By default, Cheerio Scraper only processes web pages with the `text/html`, `application/json`, `application/xml`, `application/xhtml+xml` MIME content types (as reported by the `Content-Type` HTTP header),
and skips pages with other content types.
If you want the crawler to process other content types,
use the **Additional MIME types** (`additionalMimeTypes`) input option.

Note that while the default `Accept` HTTP header will allow any content type to be received,
HTML and XML are preferred over JSON and other types. Thus, if you're allowing additional MIME
types, and you're still receiving invalid responses, be sure to override the `Accept`
HTTP header setting in the requests from the scraper,
either in [**Start URLs**](#start-urls), [**Pseudo-URLs**](#pseudo-urls) or in the **Prepare request function**.

The web pages with various content types are parsed differently and
thus the `context` parameter of the [**Page function**](#page-function) will have different values:

| **Content types**                                       | [`context.body`](#body-stringbuffer) | [`context.$`](#-function) | [`context.json`](#json-object) |
| ------------------------------------------------------- | ------------------------------------ | ------------------------- | ------------------------------ |
| `text/html`, `application/xhtml+xml`, `application/xml` | `String`                             | `Function`                | `null`                         |
| `application/json`                                      | `String`                             | `null`                    | `Object`                       |
| Other                                                   | `Buffer`                             | `null`                    | `null`                         |

The `Content-Type` HTTP header of the web page is parsed using the
<a href="https://www.npmjs.com/package/content-type" target="_blank">content-type</a> NPM package
and the result is stored in the [`context.contentType`](#contenttype-object) object.
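
As a rough sketch of how a **Page function** might cope with the different content types, the example below branches on which of `context.json`, `context.$` and `context.body` is populated, following the table above. The shape of the returned objects (`kind`, `keys`, `title`, `bytes`) is made up for illustration.

```javascript
async function pageFunction(context) {
    const { $, json, body, contentType, request } = context;

    if (json) {
        // application/json: the parsed object is available directly.
        return { url: request.url, kind: 'json', keys: Object.keys(json) };
    }

    if ($) {
        // HTML or XML: query the parsed document with Cheerio.
        return { url: request.url, kind: 'markup', title: $('title').first().text() };
    }

    // Any other allowed MIME type: "body" is a Buffer, so report its size only.
    return { url: request.url, kind: contentType.type, bytes: body.length };
}
```

Checking `json` first keeps the JSON branch from ever touching `$`, which is `null` for that content type.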

### Limitations

The Actor does not employ a full-featured web browser such as Chromium or Firefox, so it will not be sufficient for web pages that render their content dynamically using client-side JavaScript. To scrape such sites, you might prefer to use [**Web Scraper**](https://apify.com/apify/web-scraper) (`apify/web-scraper`), which loads pages in a full browser and renders dynamic content.

Since Cheerio Scraper's **Page function** is executed in the context of the server, it only supports server-side code running in Node.js. If you need to combine client- and server-side libraries in Chromium using the [Puppeteer](https://github.com/puppeteer/puppeteer) library, you might prefer to use
[**Puppeteer Scraper**](https://apify.com/apify/puppeteer-scraper) (`apify/puppeteer-scraper`). If you prefer Firefox and/or [Playwright](https://github.com/microsoft/playwright), check out [**Playwright Scraper**](https://apify.com/apify/playwright-scraper) (`apify/playwright-scraper`). For even more flexibility and control, you might develop a new Actor from scratch in Node.js using [Apify SDK](https://sdk.apify.com/) and [Crawlee](https://crawlee.dev).

In the [**Page function**](#page-function) and **Prepare request function**,
you can only use NPM modules that are already installed in this Actor.
If you require other modules for your scraping, you'll need to develop a completely new Actor.
You can use the [`CheerioCrawler`](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) class
from Crawlee to get most of the functionality of Cheerio Scraper out of the box.

### Input configuration

As input, the Cheerio Scraper Actor accepts a number of configuration options. These can be entered either manually in the user interface in [Apify Console](https://console.apify.com), or programmatically in a JSON object using the [Apify API](https://apify.com/docs/api/v2#/reference/actors/run-collection/run-actor). For a complete list of input fields and their types, please visit the [Input](https://apify.com/apify/cheerio-scraper/input-schema) tab.

#### Start URLs

The **Start URLs** (`startUrls`) field represents the initial list of pages that the scraper will visit.
You can either enter the URLs manually one by one, upload them in a CSV file, or [link URLs from a Google Sheet](https://help.apify.com/en/articles/2906022-scraping-a-list-of-urls-from-a-google-sheets-document) document.
Each URL must start with either a `http://` or `https://` protocol prefix.

The scraper supports adding new URLs to scrape on the fly, either using the [**Link selector**](#link-selector) and **[Glob Patterns](#glob-patterns)**/**[Pseudo-URLs](#pseudo-urls)** options or by calling `context.enqueueRequest()` inside the [**Page function**](#page-function).

Optionally, each URL can be associated with custom user data - a JSON object that can be referenced from
your JavaScript code in the [**Page function**](#page-function) under `context.request.userData`.
This is useful for determining which start URL is currently loaded, in order to perform some page-specific actions. For example, when crawling an online store, you might want to perform different actions on a page listing the products vs. a product detail page. For details, see the [**Web scraping tutorial**](https://docs.apify.com/tutorials/apify-scrapers/getting-started#the-start-url) in the Apify documentation.
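
For instance, a **Page function** could branch on the user data as sketched below. It assumes each start URL was given user data such as `{ "label": "LIST" }` or `{ "label": "DETAIL" }`; the `label` key, its values, and the `pageType` output field are hypothetical names chosen for this example.

```javascript
async function pageFunction(context) {
    const { request, log } = context;

    // "label" is a hypothetical key set in the user data of each start URL.
    const { label } = request.userData;

    if (label === 'LIST') {
        log.info('Product listing page, nothing to store', { url: request.url });
        return null; // returning null skips this page in the clean results
    }

    // Treat everything else as a detail page and record which kind it was.
    return { url: request.url, pageType: label ?? 'UNKNOWN' };
}
```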

#### Link selector

The **Link selector** (`linkSelector`) field contains a CSS selector that is used to find links to other web pages, i.e. `<a>` elements with the `href` attribute. On every page loaded, the scraper looks for all links matching the **Link selector**. It checks that the target URL matches one of the [**Glob Patterns**](#glob-patterns)/[**Pseudo-URLs**](#pseudo-urls), and if so then adds the URL to the request queue, to be loaded by the scraper later.

By default, new scrapers are created with the following selector that matches all links:

```
a[href]
```

If the **Link selector** is empty, page links are ignored, and the scraper only loads pages that were specified in the [**Start URLs**](#start-urls) input or that were manually added to the request queue by calling `context.enqueueRequest()` in the [**Page function**](#page-function).

#### Glob Patterns

The **Glob Patterns** (`globs`) field specifies which types of URLs found by **[Link selector](#link-selector)** should be added to the request queue.

A glob pattern is simply a string with wildcard characters.

For example, a glob pattern `http://www.example.com/pages/**/*` will match all the
following URLs:

- `http://www.example.com/pages/deeper-level/page`
- `http://www.example.com/pages/my-awesome-page`
- `http://www.example.com/pages/something`

Note that you don't need to use the **Glob Patterns** setting at all, because you can completely control which pages the scraper will access by calling `await context.enqueueRequest()` from the **[Page function](#page-function)**.
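
To build some intuition for how the wildcards behave, here is a toy glob-to-RegExp converter. It is an assumption-laden sketch that only understands `**/`, `**` and `*`; the scraper's real glob matching supports more syntax and may differ in edge cases. The function name `globToRegExp` is hypothetical.

```javascript
function globToRegExp(glob) {
    const pattern = glob
        .replace(/[.+?^${}()|[\]\\]/g, '\\$&') // escape regex metacharacters except "*"
        .replace(/\*\*\//g, '\u0000') // temporary placeholder for "**/"
        .replace(/\*\*/g, '\u0001') // temporary placeholder for "**"
        .replace(/\*/g, '[^/]*') // "*" stays within a single path segment
        .replace(/\u0000/g, '(?:.*/)?') // "**/" spans zero or more path segments
        .replace(/\u0001/g, '.*'); // "**" matches anything
    return new RegExp(`^${pattern}$`);
}

const pagesGlob = globToRegExp('http://www.example.com/pages/**/*');

console.log(pagesGlob.test('http://www.example.com/pages/deeper-level/page')); // true
console.log(pagesGlob.test('http://www.example.com/pages/my-awesome-page')); // true
console.log(pagesGlob.test('http://www.example.com/other/page')); // false
```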

#### Pseudo-URLs

The **Pseudo-URLs** (`pseudoUrls`) field specifies which types of URLs found by **[Link selector](#link-selector)** should be added to the request queue.

A pseudo-URL is simply a URL with special directives enclosed in `[]` brackets.
Currently, the only supported directive is `[regexp]`, which defines
a JavaScript-style regular expression to match against the URL.

For example, a pseudo-URL `http://www.example.com/pages/[(\w|-)*]` will match all the
following URLs:

- `http://www.example.com/pages/`
- `http://www.example.com/pages/my-awesome-page`
- `http://www.example.com/pages/something`

If either "`[`" or "`]`" is part of the normal query string, the symbol must be encoded as `[\x5B]` or `[\x5D]`, respectively. For example, the following pseudo-URL:

```
http://www.example.com/search?do[\x5B]load[\x5D]=1
```

will match the URL:

```
http://www.example.com/search?do[load]=1
```

Optionally, each pseudo-URL can be associated with user data that can be referenced from your **[Page function](#page-function)**
using `context.request.label` to determine which kind of page is currently loaded.

Note that you don't need to use the **Pseudo-URLs** setting at all,
because you can completely control which pages the scraper will access by calling `await context.enqueueRequest()`
from the **[Page function](#page-function)**.
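
The matching rules for pseudo-URLs can be approximated with a small converter: text outside the brackets is matched literally, and text inside `[]` is used as a regular expression. This is a simplified sketch, not the scraper's actual parser; the function name `pseudoUrlToRegExp` is hypothetical, and nested brackets inside a directive are not handled.

```javascript
function pseudoUrlToRegExp(purl) {
    const escapeLiteral = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    let pattern = '';
    let rest = purl;

    while (rest.length > 0) {
        const open = rest.indexOf('[');
        if (open === -1) {
            pattern += escapeLiteral(rest); // no more directives, the rest is literal
            break;
        }
        const close = rest.indexOf(']', open);
        if (close === -1) throw new Error(`Unclosed "[" in pseudo-URL: ${purl}`);

        pattern += escapeLiteral(rest.slice(0, open)); // literal part before "["
        pattern += `(?:${rest.slice(open + 1, close)})`; // regular expression inside "[]"
        rest = rest.slice(close + 1);
    }
    return new RegExp(`^${pattern}$`);
}

const pages = pseudoUrlToRegExp(String.raw`http://www.example.com/pages/[(\w|-)*]`);
console.log(pages.test('http://www.example.com/pages/my-awesome-page')); // true

const search = pseudoUrlToRegExp(String.raw`http://www.example.com/search?do[\x5B]load[\x5D]=1`);
console.log(search.test('http://www.example.com/search?do[load]=1')); // true
```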

#### Page function

The **Page function** (`pageFunction`) field contains a single JavaScript function that enables the user to extract data from the web page, access its DOM, add new URLs to the request queue, and otherwise control Cheerio Scraper's operation.

Example:

```javascript
async function pageFunction(context) {
    const { $, request, log } = context;

    // The "$" property contains the Cheerio object which is useful
    // for querying DOM elements and extracting data from them.
    const pageTitle = $('title').first().text();

    // The "request" property contains various information about the web page loaded.
    const url = request.url;

    // Use "log" object to print information to Actor log.
    log.info('Page scraped', { url, pageTitle });

    // Return an object with the data extracted from the page.
    // It will be stored to the resulting dataset.
    return {
        url,
        pageTitle,
    };
}
```

The code runs in [Node.js 16](https://nodejs.org/) and the function accepts a single argument, the `context` object, whose properties are listed below.

The return value of the page function is an object (or an array of objects) representing the data extracted from the web page. The return value must be stringify-able to JSON, i.e. it can only contain basic types and no circular references. If you prefer not to extract any data from the page and skip it in the clean results, simply return `null` or `undefined`.

The **Page function** supports the JavaScript ES6 syntax and is asynchronous, which means you can use the `await` keyword to wait for background operations to finish. To learn more about `async` functions,
visit the [Mozilla documentation](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Statements/async_function).

**Properties of the `context` object:**

- ##### **`$: Function`**

  A reference to the [Cheerio](https://cheerio.js.org/)'s function representing the root scope of the DOM
  of the current HTML page.

  This function is the starting point for traversing the DOM document and extracting data from it.
  Like with [jQuery](https://jquery.com/), it is the primary method for selecting elements in the document,
  but unlike jQuery it is built on top of the [`css-select`](https://www.npmjs.com/package/css-select) library,
  which implements most of the [`Sizzle`](https://github.com/jquery/sizzle/wiki) selectors.

  For more information, see the [Cheerio](https://cheerio.js.org/) documentation.

  Example:

  ```html
  <ul id="movies">
      <li class="fun-movie">Fun Movie</li>
      <li class="sad-movie">Sad Movie</li>
      <li class="horror-movie">Horror Movie</li>
  </ul>
  ```

  ```javascript
  $('.fun-movie', '#movies').text();
  //=> Fun Movie
  $('ul .sad-movie').attr('class');
  //=> sad-movie
  $('li[class=horror-movie]').html();
  //=> Horror Movie
  ```

- ##### **`Actor: Object`**

  A reference to the [Actor](https://sdk.apify.com/api/apify/class/Actor) object from [Apify SDK](https://sdk.apify.com/).
  This is equivalent to:

  ```javascript
  import { Actor } from 'apify';
  ```

- ##### **`Apify: Object`**

  A reference to the [Actor](https://sdk.apify.com/api/apify/class/Actor) object from [Apify SDK](https://sdk.apify.com/). Included for backward compatibility.

- ##### **`crawler: Object`**

  A reference to the `CheerioCrawler` object, see [Crawlee docs](https://crawlee.dev/api/cheerio-crawler/class/CheerioCrawler) for more information.

- ##### **`body: String|Buffer`**

  The body from the target web page. If the web page is in HTML or XML format, the `body` will be a string that contains the HTML or XML content.
  In other cases, the `body` will be a [Buffer](https://nodejs.org/api/buffer.html).
  If you need to process the `body` as a string, you can use the information from `contentType` property to convert
  the binary data into a string.

  Example:

  ```javascript
  const stringBody = context.body.toString(context.contentType.encoding);
  ```

- ##### **`cheerio: Object`**

  Reference to the [`Cheerio`](https://cheerio.js.org) module. Being the server-side version of the [jQuery](https://jquery.com) library, Cheerio features a very similar API with nearly identical selector implementation. This means DOM traversing, manipulation, querying, and data extraction are just as easy as with jQuery.

  This is equivalent to:

  ```javascript
  import * as cheerio from 'cheerio';
  ```

- ##### **`contentType: Object`**

  The `Content-Type` HTTP header parsed into an object with 2 properties, `type` and `encoding`.

  Example:

  ```javascript
  // Content-Type: application/json; charset=utf-8
  const mimeType = contentType.type; // "application/json"
  const encoding = contentType.encoding; // "utf-8"
  ```

- ##### **`customData: Object`**

  Contains the object provided in the **Custom data** (`customData`) input field.
  This is useful for passing dynamic parameters to your Cheerio Scraper using the API.

- ##### **`enqueueRequest(request, [options]): AsyncFunction`**

  Adds a new URL to the request queue, if it wasn't already there.

  The `request` parameter is an object containing details of the request, with properties such as `url`, `userData`, `headers` etc. For the full list of the supported properties, see the [`Request`](https://crawlee.dev/api/core/class/Request) object's constructor in Crawlee's documentation.

  The optional `options` parameter is an object with additional options. Currently, it only supports the `forefront` boolean flag. If `true`, the request is added to the beginning of the queue. By default, requests are added to the end.

  Example:

  ```javascript
  await context.enqueueRequest({ url: 'https://www.example.com' });
  await context.enqueueRequest(
      { url: 'https://www.example.com/first' },
      { forefront: true },
  );
  ```

- ##### **`env: Object`**

  A map of all relevant values set by the Apify platform to the Actor run via the `APIFY_`-prefixed environment variables. For example, here you can find information such as Actor run ID, timeouts, Actor run memory, etc.
  For the full list of available values, see the [`Actor.getEnv()`](https://sdk.apify.com/api/apify/class/Actor#getEnv) function in the Apify SDK documentation.

  Example:

  ```javascript
  console.log(`Actor run ID: ${context.env.actorRunId}`);
  ```

- ##### **`getValue(key): AsyncFunction`**

  Gets a value from the default key-value store associated with the Actor run. The key-value store is useful for persisting named data records, such as state objects, files, etc. The function is very similar to the [`Actor.getValue()`](https://sdk.apify.com/api/apify/class/Actor#getValue) function in Apify SDK.

  To set the value, use the dual function `context.setValue(key, value)`.

  Example:

  ```javascript
  const value = await context.getValue('my-key');
  console.dir(value);
  ```

- ##### **`globalStore: Object`**

  Represents an in-memory store that can be used to share data across page function invocations, e.g. state variables, API responses, or other data. The `globalStore` object has an interface similar to JavaScript's [`Map`](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Map) object, with a few important differences:

  - All `globalStore` functions are `async`; use `await` when calling them.
  - Keys must be strings and values must be JSON stringify-able.
  - The `forEach()` function is not supported.

  Note that stored data is not persisted. If the Actor run is restarted or migrated to another worker server,
  the content of `globalStore` is reset. Therefore, never depend on a specific value to be present
  in the store.

  Example:

  ```javascript
  let movies = await context.globalStore.get('cached-movies');
  if (!movies) {
      const response = await fetch('http://example.com/movies.json');
      movies = await response.json(); // store the parsed JSON, not the Response object
      await context.globalStore.set('cached-movies', movies);
  }
  console.dir(movies);
  ```

- ##### **`input: Object`**

  An object containing the Actor run input, i.e. Cheerio Scraper's configuration. Each page function invocation gets a fresh copy of the `input` object, so changing its properties has no effect.

- ##### **`json: Object`**

  The parsed object from a JSON string if the response contains the content type `application/json`.

- ##### **`log: Object`**

  An object containing logging functions, with the same interface as provided by the
  [`crawlee.utils.log`](https://crawlee.dev/api/core/class/Log) object in Crawlee. The log messages are written directly to the Actor run log, which is useful for monitoring and debugging.
  Note that `log.debug()` only logs messages if the **Debug log** input setting is set.

  Example:

  ```javascript
  const log = context.log;
  log.debug('Debug message', { hello: 'world!' });
  log.info('Information message', { all: 'good' });
  log.warning('Warning message');
  log.error('Error message', { details: 'This is bad!' });
  try {
      throw new Error('Not good!');
  } catch (e) {
      log.exception(e, 'Exception occurred', {
          details: 'This is really bad!',
      });
  }
  ```

- ##### **`saveSnapshot(): AsyncFunction`**

  Saves the full HTML of the current page to the key-value store
  associated with the Actor run, under the `SNAPSHOT-BODY` key.
  This feature is useful when debugging your scraper.

  Note that each snapshot overwrites the previous one and the `saveSnapshot()` calls are throttled to at most one call in two seconds, in order to avoid excess consumption of resources and slowdown of the Actor.

- ##### **`setValue(key, data, options): AsyncFunction`**

  Sets a value to the default key-value store associated with the Actor run. The key-value store is useful for persisting named data records, such as state objects, files, etc. The function is very similar to the [`Actor.setValue()`](https://sdk.apify.com/api/apify/class/Actor#setValue) function in Apify SDK.

  To get the value, use the dual function `context.getValue(key)`.

  Example:

  ```javascript
  await context.setValue('my-key', { hello: 'world' });
  ```

- ##### **`skipLinks(): AsyncFunction`**

  Calling this function ensures that page links from the current page will not be added to the request queue, even if they match the [**Link selector**](#link-selector) and/or **[Glob Patterns](#glob-patterns)**/**[Pseudo-URLs](#pseudo-urls)** settings. This is useful to programmatically stop recursive crawling, e.g. if you know there are no more interesting links on the current page to follow.

- ##### **`request: Object`**

  An object containing information about the currently loaded web page, such as the URL, number of retries, a unique key, etc. Its properties are equivalent to the [`Request`](https://crawlee.dev/api/core/class/Request) object in Crawlee.

- ##### **`response: Object`**

  An object containing information about the HTTP response from the web server. Currently, it only contains the `status` and `headers` properties. For example:

  ```javascript
  {
    // HTTP status code
    status: 200,

    // HTTP headers
    headers: {
      'content-type': 'text/html; charset=utf-8',
      'date': 'Wed, 06 Nov 2019 16:01:53 GMT',
      'cache-control': 'no-cache',
      'content-encoding': 'gzip',
    }
  }
  ```
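
Several of the `context` helpers listed above can be combined in a single **Page function**. The sketch below is one possible arrangement, not a recommended pattern: the `followLinks` flag in **Custom data**, the `last-scraped-url` key, and the `h1` check are all hypothetical choices made for the example.

```javascript
async function pageFunction(context) {
    const { $, request, customData, log } = context;

    const heading = $('h1').first().text().trim();
    if (!heading) {
        // Keep a copy of the HTML for debugging when the expected element is missing.
        await context.saveSnapshot();
        log.warning('Expected heading not found', { url: request.url });
        return null;
    }

    // "followLinks" is a hypothetical flag passed through the Custom data input.
    if (customData?.followLinks === false) {
        await context.skipLinks();
    }

    // Remember the last scraped URL in the default key-value store.
    await context.setValue('last-scraped-url', { url: request.url });

    return { url: request.url, heading };
}
```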

### Proxy configuration

The **Proxy configuration** (`proxyConfiguration`) option enables you to set
proxies that will be used by the scraper in order to prevent its detection by target web pages.
You can use both the [Apify Proxy](https://apify.com/proxy) and custom HTTP or SOCKS5 proxy servers.

Proxy is required to run the scraper. The following table lists the available options of the proxy configuration setting:

<table class="table table-bordered table-condensed">
    <tbody>
    <tr>
        <th><b>Apify&nbsp;Proxy&nbsp;(automatic)</b></th>
        <td>
            The scraper will load all web pages using the <a href="https://apify.com/proxy">Apify Proxy</a>
            in automatic mode. In this mode, the proxy uses all proxy groups that are available to the user. For each new web page it automatically selects the proxy that hasn't been used in the longest time for the specific hostname in order to reduce the chance of detection by the web page.
            You can view the list of available proxy groups on the <a href="https://console.apify.com/proxy" target="_blank" rel="noopener">Proxy</a> page in Apify Console.
        </td>
    </tr>
    <tr>
        <th><b>Apify&nbsp;Proxy&nbsp;(selected&nbsp;groups)</b></th>
        <td>
            The scraper will load all web pages using the <a href="https://apify.com/proxy">Apify Proxy</a>
            with specific groups of target proxy servers.
        </td>
    </tr>
    <tr>
        <th><b>Custom&nbsp;proxies</b></th>
        <td>
            <p>
            The scraper will use a custom list of proxy servers.
            The proxies must be specified in the <code>scheme://user:password@host:port</code> format.
            Multiple proxies should be separated by a space or new line. The URL scheme can be either <code>http</code> or <code>socks5</code>. User and password might be omitted, but the port must always be present.
            </p>
            <p>
                Example:
            </p>
            <pre><code class="language-none">http://bob:password@proxy1.example.com:8000<br>http://bob:password@proxy2.example.com:8000</code></pre>
        </td>
    </tr>
    </tbody>
</table>

The proxy configuration can be set programmatically when calling the Actor using the API
by setting the `proxyConfiguration` field.
It accepts a JSON object with the following structure:

```javascript
{
    // Indicates whether to use the Apify Proxy or not.
    "useApifyProxy": Boolean,

    // Array of Apify Proxy groups, only used if "useApifyProxy" is true.
    // If missing or null, the Apify Proxy will use automatic mode.
    "apifyProxyGroups": String[],

    // Array of custom proxy URLs, in "scheme://user:password@host:port" format.
    // If missing or null, custom proxies are not used.
    "proxyUrls": String[],
}
```
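
As a concrete illustration of that structure, the three options from the table could be written as follows. The group name `MY_GROUP` is a placeholder, not a real proxy group, and the proxy URLs are the example values used above.

```javascript
// Apify Proxy in automatic mode: no groups specified.
const automaticProxy = { useApifyProxy: true };

// Apify Proxy limited to selected groups ("MY_GROUP" is a placeholder name).
const selectedGroups = { useApifyProxy: true, apifyProxyGroups: ['MY_GROUP'] };

// Custom proxy servers in "scheme://user:password@host:port" format.
const customProxies = {
    useApifyProxy: false,
    proxyUrls: [
        'http://bob:password@proxy1.example.com:8000',
        'http://bob:password@proxy2.example.com:8000',
    ],
};

// Any one of these objects would be the value of the "proxyConfiguration" input field.
console.log(JSON.stringify({ proxyConfiguration: selectedGroups }, null, 2));
```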

### Advanced Configuration

#### Pre-navigation hooks

This is an array of functions that will be executed **BEFORE** the main `pageFunction` is run. A similar `context` object is passed into each of these functions as is passed into the `pageFunction`; however, a second `gotOptions` object is also passed in.

The available options can be seen here:

```JavaScript
preNavigationHooks: [
    async ({ id, request, session, proxyInfo, customData, Actor }, { url, method, headers, proxyUrl }) => {}
]
```

Check out the docs for [Pre-navigation hooks](https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#preNavigationHooks) and the [CheerioHook type](https://crawlee.dev/api/cheerio-crawler#CheerioHook) for more info regarding the objects passed into these functions. The available properties are extended with `Actor` (alternatively `Apify`) and `customData` in this scraper.
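
As a minimal sketch, a pre-navigation hook could adjust the request headers before each page is fetched, for example to prefer JSON responses. This assumes that changes made to `gotOptions.headers` inside the hook are applied to the outgoing request; the header value is only an example.

```javascript
const preNavigationHooks = [
    async ({ request }, gotOptions) => {
        // Keep any existing headers and override only "accept".
        gotOptions.headers = {
            ...gotOptions.headers,
            accept: 'application/json, text/html;q=0.9',
        };
    },
];
```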

#### Post-navigation hooks

An array of functions that will be executed **AFTER** the main `pageFunction` is run. The only available parameter is the [CrawlingContext](https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlingContext) object. The available properties are extended with `Actor` (alternatively `Apify`) and `customData` in this scraper.

```JavaScript
postNavigationHooks: [
    async ({ id, request, session, proxyInfo, response, customData, Actor }) => {}
]
```

Check out the docs for [Post-navigation hooks](https://crawlee.dev/api/cheerio-crawler/interface/CheerioCrawlerOptions#postNavigationHooks) for more info regarding the objects passed into these functions.

### Results

The scraping results returned by [**Page function**](#page-function) are stored in the default dataset associated with the Actor run, from where you can export them to formats such as JSON, XML, CSV or Excel.
For each object returned by the [**Page function**](#page-function), Cheerio Scraper pushes one record into the dataset and extends it with metadata such as the URL of the web page where the results come from.

For example, if your page function returned the following object:

```js
{
    message: 'Hello world!',
}
```

The full object stored in the dataset will look as follows
(in JSON format, including the metadata fields `#error` and `#debug`):

```json
{
    "message": "Hello world!",
    "#error": false,
    "#debug": {
        "requestId": "fvwscO2UJLdr10B",
        "url": "https://www.example.com/",
        "loadedUrl": "https://www.example.com/",
        "method": "GET",
        "retryCount": 0,
        "errorMessages": null,
        "statusCode": 200
    }
}
```

To download the results, call the
[Get dataset items](https://docs.apify.com/api/v2#/reference/datasets/item-collection)
API endpoint:

```
https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json
```

where `[DATASET_ID]` is the ID of the Actor run's dataset, which you can find in the Run object returned when starting the Actor. Alternatively, you'll find the download links for the results in Apify Console.

To omit the `#error` and `#debug` metadata fields from the results and exclude empty result records,
add the `clean=true` query parameter to the API URL, or select the **Clean items** option when downloading the dataset in Apify Console.
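The effect of `clean=true` can be approximated locally. The sketch below assumes the metadata fields are exactly those whose names start with `#`; `cleanItems` is a hypothetical helper, not part of the Apify API or client.

```javascript
// Local approximation of `clean=true`: drop fields whose names start with "#",
// then drop records that are left with no fields at all.
function cleanItems(items) {
    return items
        .map((item) => Object.fromEntries(
            Object.entries(item).filter(([key]) => !key.startsWith('#')),
        ))
        .filter((item) => Object.keys(item).length > 0);
}

// Sample records shaped like the dataset example above.
const rawItems = [
    { message: 'Hello world!', '#error': false, '#debug': { statusCode: 200 } },
    { '#error': true, '#debug': { statusCode: 500 } },
];

const cleanedItems = cleanItems(rawItems);
console.log(cleanedItems);
```

The second record contains only metadata, so it disappears entirely after cleaning.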

To get the results in other formats, set the `format` query parameter to `xml`, `xlsx`, `csv`, `html`, etc.
For more information, see [Datasets](https://docs.apify.com/storage#dataset) in the documentation
or the [Get dataset items](https://docs.apify.com/api/v2#/reference/datasets/item-collection)
endpoint in the Apify API reference.
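A download URL with the `format` and `clean` parameters could be assembled as in the following sketch. `datasetItemsUrl` is a hypothetical helper and `DATASET_ID` is a placeholder; only the endpoint path and the two query parameters come from the text above.

```javascript
// Hypothetical helper that builds a "Get dataset items" URL.
function datasetItemsUrl(datasetId, { format = 'json', clean = false } = {}) {
    const url = new URL(`https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`);
    url.searchParams.set('format', format);
    // Only add `clean` when requested, so the default URL matches the one shown above.
    if (clean) url.searchParams.set('clean', 'true');
    return url.toString();
}

const csvUrl = datasetItemsUrl('DATASET_ID', { format: 'csv', clean: true });
console.log(csvUrl);
```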

### Additional resources

Congratulations! You've learned how Cheerio Scraper works.
You might also want to see these other resources:

- [Web scraping tutorial](https://docs.apify.com/tutorials/apify-scrapers) -
  An introduction to web scraping with Apify.
- [Scraping with Cheerio Scraper](https://docs.apify.com/tutorials/apify-scrapers/cheerio-scraper) -
  A step-by-step tutorial on how to use Cheerio Scraper, with a detailed explanation and examples.
- **Web Scraper** ([apify/web-scraper](https://apify.com/apify/web-scraper)) -
  Apify's basic tool for web crawling and scraping. It uses a full Chrome browser to render dynamic content.
  It is similar to Puppeteer Scraper, but simpler to use, and its code runs only in the context of the browser.
  Uses the [Puppeteer](https://github.com/GoogleChrome/puppeteer) library.
- **Puppeteer Scraper** ([apify/puppeteer-scraper](https://apify.com/apify/puppeteer-scraper)) -
  An Actor similar to Web Scraper, which provides lower-level control of the underlying
  [Puppeteer](https://github.com/GoogleChrome/puppeteer) library and the ability to use server-side libraries.
- **Playwright Scraper** ([apify/playwright-scraper](https://apify.com/apify/playwright-scraper)) -
  A similar web scraping Actor to Puppeteer Scraper, but using the [Playwright](https://github.com/microsoft/playwright) library instead.
- [Actors documentation](https://docs.apify.com/actors) -
  Documentation for the Apify Actors cloud computing platform.
- [Apify SDK documentation](https://sdk.apify.com) - Learn more about the tools required to run your own Apify Actors.
- [Crawlee documentation](https://crawlee.dev) - Learn how to build a new web scraping project from scratch using the world's most popular web crawling and scraping library for Node.js.

# Actor input Schema

## `startUrls` (type: `array`):

A static list of URLs to scrape. <br><br>For details, see the <a href='https://apify.com/apify/cheerio-scraper#start-urls' target='_blank' rel='noopener'>Start URLs</a> section in the README.

## `keepUrlFragments` (type: `boolean`):

Indicates that URL fragments (e.g. <code>http://example.com<b>#fragment</b></code>) should be included when checking whether a URL has already been visited or not. Typically, URL fragments are used for page navigation only and therefore they should be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such cases, this option should be enabled.
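The following sketch illustrates the idea with the standard `URL` class. `visitedKey` is a hypothetical helper; the scraper's actual deduplication key may involve further normalization.

```javascript
// Hypothetical helper: the key used to decide whether a URL was already visited.
function visitedKey(rawUrl, keepUrlFragments = false) {
    const url = new URL(rawUrl);
    // Without the option, the fragment is dropped before comparing URLs.
    if (!keepUrlFragments) url.hash = '';
    return url.toString();
}

const productsUrl = 'http://example.com/#/products';
const contactUrl = 'http://example.com/#/contact';

// Fragments ignored: both URLs count as the same page.
console.log(visitedKey(productsUrl) === visitedKey(contactUrl));
// Fragments kept: the URLs count as two separate pages.
console.log(visitedKey(productsUrl, true) === visitedKey(contactUrl, true));
```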

## `respectRobotsTxtFile` (type: `boolean`):

If enabled, the crawler will consult the robots.txt file for the target website before crawling each page. At the moment, the crawler does not use any specific user agent identifier. The crawl-delay directive is also not supported yet.

## `globs` (type: `array`):

Glob patterns to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Glob patterns will cause the scraper to enqueue all links matched by the Link selector.

## `pseudoUrls` (type: `array`):

Specifies what kind of URLs found by the <b>Link selector</b> should be added to the request queue. A pseudo-URL is a URL with <b>regular expressions</b> enclosed in <code>\[]</code> brackets, e.g. <code>http://www.example.com/\[.\*]</code>. <br><br>If <b>Pseudo-URLs</b> are omitted, the Actor enqueues all links matched by the <b>Link selector</b>.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#pseudo-urls' target='_blank' rel='noopener'>Pseudo-URLs</a> in README.
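To show how a pseudo-URL relates to a regular expression, here is a simplified converter. It is a sketch under the assumption that text outside the brackets is matched literally and text inside the brackets is used as a regular expression verbatim; `pseudoUrlToRegExp` is a hypothetical name and not the scraper's own implementation.

```javascript
// Hypothetical converter: literal text is escaped, bracketed parts are kept as raw regex.
function pseudoUrlToRegExp(purl) {
    const escapeLiteral = (text) => text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
    let pattern = '';
    let literal = '';
    let regexPart = '';
    let depth = 0; // nesting level of square brackets

    for (const ch of purl) {
        if (ch === '[') {
            if (depth === 0) {
                // Entering a regex section: flush the literal text collected so far.
                pattern += escapeLiteral(literal);
                literal = '';
            } else {
                regexPart += ch;
            }
            depth += 1;
        } else if (ch === ']' && depth > 0) {
            depth -= 1;
            if (depth === 0) {
                // Leaving the outermost brackets: emit the regex section as a group.
                pattern += `(${regexPart})`;
                regexPart = '';
            } else {
                regexPart += ch;
            }
        } else if (depth > 0) {
            regexPart += ch;
        } else {
            literal += ch;
        }
    }
    pattern += escapeLiteral(literal);
    return new RegExp(`^${pattern}$`);
}

const matcher = pseudoUrlToRegExp('http://www.example.com/[.*]');
console.log(matcher.test('http://www.example.com/pages/1'));
console.log(matcher.test('http://other.example.com/pages/1'));
```

Because the literal parts are escaped, the dots in the host name match only real dots, while `.*` inside the brackets keeps its regular-expression meaning.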

## `excludes` (type: `array`):

Glob patterns to match links in the page that you want to exclude from being enqueued.

## `linkSelector` (type: `string`):

A CSS selector stating which links on the page (<code>\<a></code> elements with <code>href</code> attribute) shall be followed and added to the request queue. To filter the links added to the queue, use the <b>Pseudo-URLs</b> and/or <b>Glob patterns</b> field.<br><br>If the <b>Link selector</b> is empty, the page links are ignored.<br><br>For details, see the <a href='https://apify.com/apify/cheerio-scraper#link-selector' target='_blank' rel='noopener'>Link selector</a> in README.

## `pageFunction` (type: `string`):

A JavaScript function that is executed for every page loaded server-side in Node.js 12. Use it to scrape data from the page, perform actions or add new URLs to the request queue.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#page-function' target='_blank' rel='noopener'>Page function</a> in README.

## `proxyConfiguration` (type: `object`):

Specifies proxy servers that will be used by the scraper in order to hide its origin.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#proxy-configuration' target='_blank' rel='noopener'>Proxy configuration</a> in README.

## `proxyRotation` (type: `string`):

This property indicates the strategy of proxy rotation and can only be used in conjunction with Apify Proxy. The recommended setting automatically picks the best proxies from your available pool and rotates them evenly, discarding proxies that become blocked or unresponsive. If this strategy does not work for you for any reason, you may configure the scraper to either use a new proxy for each request, or to use one proxy as long as possible, until the proxy fails. IMPORTANT: This setting will only use your available Apify Proxy pool, so if you don't have enough proxies for a given task, no rotation setting will produce satisfactory results.

## `sessionPoolName` (type: `string`):

<b>Use only English alphanumeric characters, dashes, and underscores.</b> A session is a representation of a user. It has its own IP and cookies, which are then used together to emulate a real user. Usage of the sessions is controlled by the Proxy rotation option. By providing a session pool name, you enable sharing of those sessions across multiple Actor runs. This is very useful when you need specific cookies for accessing the websites or when a lot of your proxies are already blocked. Instead of trying randomly, a list of working sessions will be saved and a new Actor run can reuse those sessions. Note that the IP lock on sessions expires after 24 hours, unless the session is used again in that window.

## `initialCookies` (type: `array`):

A JSON array with cookies that will be sent with every HTTP request made by the Cheerio Scraper, in the format accepted by the <a href='https://www.npmjs.com/package/tough-cookie' target='_blank' rel='noopener noreferrer'>tough-cookie</a> NPM package. This option is useful for transferring a logged-in session from an external web browser. For details on how to do this, read this <a href='https://help.apify.com/en/articles/1444249-log-in-to-website-by-transferring-cookies-from-web-browser-legacy' target='_blank' rel='noopener'>help article</a>.

## `additionalMimeTypes` (type: `array`):

A JSON array specifying additional MIME content types of web pages to support. By default, Cheerio Scraper supports the <code>text/html</code> and <code>application/xhtml+xml</code> content types, and skips all other resources. For details, see <a href='https://apify.com/apify/cheerio-scraper#content-types' target='_blank' rel='noopener'>Content types</a> in README.

## `suggestResponseEncoding` (type: `string`):

The scraper automatically determines response encoding from the response headers. If the headers are invalid or information is missing, malformed responses may be produced. Use the Suggest response encoding option to provide a fall-back encoding to the Scraper for cases where it could not be determined.

## `forceResponseEncoding` (type: `boolean`):

If enabled, the suggested response encoding will be used even if a valid response encoding is provided by the target website. Use this only when you've inspected the responses thoroughly and are sure that they are the ones doing it wrong.

## `ignoreSslErrors` (type: `boolean`):

If enabled, the scraper will ignore SSL/TLS certificate errors. Use at your own risk.

## `preNavigationHooks` (type: `string`):

Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `requestAsBrowserOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate.

## `postNavigationHooks` (type: `string`):

Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter.

## `maxRequestRetries` (type: `integer`):

The maximum number of times the scraper will retry loading each web page after a page load error or an exception thrown by the <b>Page function</b>.<br><br>If set to <code>0</code>, the page will be considered failed right after the first error.

## `maxPagesPerCrawl` (type: `integer`):

The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It is always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.<br><br>If set to <code>0</code>, there is no limit.

## `maxResultsPerCrawl` (type: `integer`):

The maximum number of records that will be saved to the resulting dataset. The scraper will stop when this limit is reached. <br><br>If set to <code>0</code>, there is no limit.

## `maxCrawlingDepth` (type: `integer`):

Specifies how many links away from the <b>Start URLs</b> the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers. Note that pages added using <code>context.enqueuePage()</code> in <b>Page function</b> are not subject to the maximum depth constraint. <br><br>If set to <code>0</code>, there is no limit.
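The depth limit can be pictured as a breadth-first crawl that stops enqueuing links once a page sits at the maximum depth. The sketch below runs on a made-up in-memory link graph; the `links` data, the `crawl` function, and the convention that Start URLs have depth 0 are assumptions for illustration.

```javascript
// In-memory link graph standing in for a website (hypothetical data).
const links = {
    '/': ['/a', '/b'],
    '/a': ['/a/1'],
    '/a/1': ['/a/1/x'],
    '/a/1/x': [],
    '/b': [],
};

function crawl(startUrl, maxCrawlingDepth) {
    const visited = new Set([startUrl]);
    const queue = [{ url: startUrl, depth: 0 }];

    while (queue.length > 0) {
        const { url, depth } = queue.shift();
        // 0 means "no limit"; otherwise links on pages at the limit are not followed.
        if (maxCrawlingDepth !== 0 && depth >= maxCrawlingDepth) continue;
        for (const next of links[url] ?? []) {
            if (!visited.has(next)) {
                visited.add(next);
                queue.push({ url: next, depth: depth + 1 });
            }
        }
    }
    return [...visited];
}

console.log(crawl('/', 1)); // pages at most one link away from the start
console.log(crawl('/', 0)); // no limit: every reachable page
```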

## `maxConcurrency` (type: `integer`):

Specifies the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target web server.

## `pageLoadTimeoutSecs` (type: `integer`):

The maximum amount of time the scraper will wait for a web page to load, in seconds. If the web page does not load in this timeframe, it is considered to have failed and will be retried (subject to <b>Max page retries</b>), as with other page load errors.

## `pageFunctionTimeoutSecs` (type: `integer`):

The maximum amount of time the scraper will wait for the <b>Page function</b> to execute, in seconds. It is always a good idea to set this limit, to ensure that unexpected behavior in page function will not get the scraper stuck.

## `debugLog` (type: `boolean`):

If enabled, the Actor log will include debug messages. Beware that this can be quite verbose. Use <code>context.log.debug('message')</code> to log your own debug messages from the <b>Page function</b>.

## `customData` (type: `object`):

A custom JSON object that is passed to the <b>Page function</b> as <code>context.customData</code>. This setting is useful when invoking the scraper via API, in order to pass some arbitrary parameters to your code.
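For instance, a Page function might read a setting from `context.customData`, as in this sketch. The `minPrice` key is a hypothetical parameter, and the mock context is there only so the snippet can run outside the scraper.

```javascript
// Page function that reads a hypothetical `minPrice` setting from customData.
async function pageFunction(context) {
    const { request, customData } = context;
    // Fall back to 0 when the caller did not provide the setting.
    const minPrice = customData.minPrice ?? 0;
    return { url: request.url, minPrice };
}

// Mock context for running the sketch locally; the scraper supplies the real one.
const resultPromise = pageFunction({
    request: { url: 'https://www.example.com/' },
    customData: { minPrice: 25 },
});
resultPromise.then((result) => console.log(result));
```

When starting the Actor via API, the same value would be supplied in the input as `"customData": { "minPrice": 25 }`.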

## `datasetName` (type: `string`):

Name or ID of the dataset that will be used for storing results. If left empty, the default dataset of the run will be used.

## `keyValueStoreName` (type: `string`):

Name or ID of the key-value store that will be used for storing records. If left empty, the default key-value store of the run will be used.

## `requestQueueName` (type: `string`):

Name of the request queue that will be used for storing requests. If left empty, the default request queue of the run will be used.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://crawlee.dev/js"
    }
  ],
  "keepUrlFragments": false,
  "respectRobotsTxtFile": true,
  "globs": [
    {
      "glob": "https://crawlee.dev/js/*/*"
    }
  ],
  "pseudoUrls": [],
  "excludes": [
    {
      "glob": "/**/*.{png,jpg,jpeg,pdf}"
    }
  ],
  "linkSelector": "a[href]",
  "pageFunction": "async function pageFunction(context) {\n    const { $, request, log } = context;\n\n    // The \"$\" property contains the Cheerio object which is useful\n    // for querying DOM elements and extracting data from them.\n    const pageTitle = $('title').first().text();\n\n    // The \"request\" property contains various information about the web page loaded. \n    const url = request.url;\n    \n    // Use \"log\" object to print information to Actor log.\n    log.info('Page scraped', { url, pageTitle });\n\n    // Return an object with the data extracted from the page.\n    // It will be stored to the resulting dataset.\n    return {\n        url,\n        pageTitle\n    };\n}",
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "proxyRotation": "RECOMMENDED",
  "initialCookies": [],
  "additionalMimeTypes": [],
  "forceResponseEncoding": false,
  "ignoreSslErrors": false,
  "preNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept two arguments: the \"crawlingContext\" object\n// and \"requestAsBrowserOptions\" which are passed to the `requestAsBrowser()`\n// function the crawler calls to navigate.\n[\n    async (crawlingContext, requestAsBrowserOptions) => {\n        // ...\n    }\n]",
  "postNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept a single argument: the \"crawlingContext\" object.\n[\n    async (crawlingContext) => {\n        // ...\n    },\n]",
  "maxRequestRetries": 3,
  "maxPagesPerCrawl": 0,
  "maxResultsPerCrawl": 0,
  "maxCrawlingDepth": 0,
  "maxConcurrency": 50,
  "pageLoadTimeoutSecs": 60,
  "pageFunctionTimeoutSecs": 60,
  "debugLog": false,
  "customData": {}
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://crawlee.dev/js"
        }
    ],
    "respectRobotsTxtFile": true,
    "globs": [
        {
            "glob": "https://crawlee.dev/js/*/*"
        }
    ],
    "pseudoUrls": [],
    "excludes": [
        {
            "glob": "/**/*.{png,jpg,jpeg,pdf}"
        }
    ],
    "linkSelector": "a[href]",
    "pageFunction": async function pageFunction(context) {
        const { $, request, log } = context;
    
        // The "$" property contains the Cheerio object which is useful
        // for querying DOM elements and extracting data from them.
        const pageTitle = $('title').first().text();
    
        // The "request" property contains various information about the web page loaded. 
        const url = request.url;
        
        // Use "log" object to print information to Actor log.
        log.info('Page scraped', { url, pageTitle });
    
        // Return an object with the data extracted from the page.
        // It will be stored to the resulting dataset.
        return {
            url,
            pageTitle
        };
    },
    "proxyConfiguration": {
        "useApifyProxy": true
    },
    "initialCookies": [],
    "additionalMimeTypes": [],
    "preNavigationHooks": `// We need to return array of (possibly async) functions here.
// The functions accept two arguments: the "crawlingContext" object
// and "requestAsBrowserOptions" which are passed to the \`requestAsBrowser()\`
// function the crawler calls to navigate.
[
    async (crawlingContext, requestAsBrowserOptions) => {
        // ...
    }
]`,
    "postNavigationHooks": `// We need to return array of (possibly async) functions here.
// The functions accept a single argument: the "crawlingContext" object.
[
    async (crawlingContext) => {
        // ...
    },
]`,
    "customData": {}
};

// Run the Actor and wait for it to finish
const run = await client.actor("apify/cheerio-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://crawlee.dev/js" }],
    "respectRobotsTxtFile": True,
    "globs": [{ "glob": "https://crawlee.dev/js/*/*" }],
    "pseudoUrls": [],
    "excludes": [{ "glob": "/**/*.{png,jpg,jpeg,pdf}" }],
    "linkSelector": "a[href]",
    "pageFunction": """async function pageFunction(context) {
    const { $, request, log } = context;

    // The \"$\" property contains the Cheerio object which is useful
    // for querying DOM elements and extracting data from them.
    const pageTitle = $('title').first().text();

    // The \"request\" property contains various information about the web page loaded. 
    const url = request.url;
    
    // Use \"log\" object to print information to Actor log.
    log.info('Page scraped', { url, pageTitle });

    // Return an object with the data extracted from the page.
    // It will be stored to the resulting dataset.
    return {
        url,
        pageTitle
    };
}""",
    "proxyConfiguration": { "useApifyProxy": True },
    "initialCookies": [],
    "additionalMimeTypes": [],
    "preNavigationHooks": """// We need to return array of (possibly async) functions here.
// The functions accept two arguments: the \"crawlingContext\" object
// and \"requestAsBrowserOptions\" which are passed to the `requestAsBrowser()`
// function the crawler calls to navigate.
[
    async (crawlingContext, requestAsBrowserOptions) => {
        // ...
    }
]""",
    "postNavigationHooks": """// We need to return array of (possibly async) functions here.
// The functions accept a single argument: the \"crawlingContext\" object.
[
    async (crawlingContext) => {
        // ...
    },
]""",
    "customData": {},
}

# Run the Actor and wait for it to finish
run = client.actor("apify/cheerio-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://crawlee.dev/js"
    }
  ],
  "respectRobotsTxtFile": true,
  "globs": [
    {
      "glob": "https://crawlee.dev/js/*/*"
    }
  ],
  "pseudoUrls": [],
  "excludes": [
    {
      "glob": "/**/*.{png,jpg,jpeg,pdf}"
    }
  ],
  "linkSelector": "a[href]",
  "pageFunction": "async function pageFunction(context) {\\n    const { $, request, log } = context;\\n\\n    // The \\"$\\" property contains the Cheerio object which is useful\\n    // for querying DOM elements and extracting data from them.\\n    const pageTitle = $('\''title'\'').first().text();\\n\\n    // The \\"request\\" property contains various information about the web page loaded. \\n    const url = request.url;\\n    \\n    // Use \\"log\\" object to print information to Actor log.\\n    log.info('\''Page scraped'\'', { url, pageTitle });\\n\\n    // Return an object with the data extracted from the page.\\n    // It will be stored to the resulting dataset.\\n    return {\\n        url,\\n        pageTitle\\n    };\\n}",
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "initialCookies": [],
  "additionalMimeTypes": [],
  "preNavigationHooks": "// We need to return array of (possibly async) functions here.\\n// The functions accept two arguments: the \\"crawlingContext\\" object\\n// and \\"requestAsBrowserOptions\\" which are passed to the `requestAsBrowser()`\\n// function the crawler calls to navigate.\\n[\\n    async (crawlingContext, requestAsBrowserOptions) => {\\n        // ...\\n    }\\n]",
  "postNavigationHooks": "// We need to return array of (possibly async) functions here.\\n// The functions accept a single argument: the \\"crawlingContext\\" object.\\n[\\n    async (crawlingContext) => {\\n        // ...\\n    },\\n]",
  "customData": {}
}' |
apify call apify/cheerio-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=apify/cheerio-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Cheerio Scraper",
        "description": "Crawls websites using raw HTTP requests, parses the HTML with the Cheerio library, and extracts data from the pages using a Node.js code. Supports both recursive crawling and lists of URLs. This actor is a high-performance alternative to apify/web-scraper for websites that do not require JavaScript.",
        "version": "3.0",
        "x-build-id": "08x8ZbPItPuPbLOXw"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/apify~cheerio-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-apify-cheerio-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/apify~cheerio-scraper/runs": {
            "post": {
                "operationId": "runs-sync-apify-cheerio-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/apify~cheerio-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-apify-cheerio-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls",
                    "pageFunction",
                    "proxyConfiguration"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "A static list of URLs to scrape. <br><br>For details, see the <a href='https://apify.com/apify/cheerio-scraper#start-urls' target='_blank' rel='noopener'>Start URLs</a> section in the README.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "keepUrlFragments": {
                        "title": "URL #fragments identify unique pages",
                        "type": "boolean",
                        "description": "Indicates that URL fragments (e.g. <code>http://example.com<b>#fragment</b></code>) should be included when checking whether a URL has already been visited or not. Typically, URL fragments are used for page navigation only and therefore they should be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such cases, this option should be enabled.",
                        "default": false
                    },
                    "respectRobotsTxtFile": {
                        "title": "Respect the robots.txt file",
                        "type": "boolean",
                        "description": "If enabled, the crawler will consult the robots.txt file for the target website before crawling each page. At the moment, the crawler does not use any specific user agent identifier. The crawl-delay directive is also not supported yet.",
                        "default": false
                    },
                    "globs": {
                        "title": "Glob Patterns",
                        "type": "array",
                        "description": "Glob patterns to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Glob patterns will cause the scraper to enqueue all links matched by the Link selector.",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "glob"
                            ],
                            "properties": {
                                "glob": {
                                    "type": "string",
                                    "title": "Glob of a web page"
                                }
                            }
                        }
                    },
                    "pseudoUrls": {
                        "title": "Pseudo-URLs",
                        "type": "array",
                        "description": "Specifies what kind of URLs found by the <b>Link selector</b> should be added to the request queue. A pseudo-URL is a URL with <b>regular expressions</b> enclosed in <code>[]</code> brackets, e.g. <code>http://www.example.com/[.*]</code>. <br><br>If <b>Pseudo-URLs</b> are omitted, the Actor enqueues all links matched by the <b>Link selector</b>.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#pseudo-urls' target='_blank' rel='noopener'>Pseudo-URLs</a> in README.",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "purl"
                            ],
                            "properties": {
                                "purl": {
                                    "type": "string",
                                    "title": "Pseudo-URL of a web page"
                                }
                            }
                        }
                    },
                    "excludes": {
                        "title": "Exclude Glob Patterns",
                        "type": "array",
                        "description": "Glob patterns to match links in the page that you want to exclude from being enqueued.",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "glob"
                            ],
                            "properties": {
                                "glob": {
                                    "type": "string",
                                    "title": "Glob of a web page"
                                }
                            }
                        }
                    },
                    "linkSelector": {
                        "title": "Link selector",
                        "type": "string",
                        "description": "A CSS selector stating which links on the page (<code>&lt;a&gt;</code> elements with <code>href</code> attribute) shall be followed and added to the request queue. To filter the links added to the queue, use the <b>Pseudo-URLs</b> and/or <b>Glob patterns</b> field.<br><br>If the <b>Link selector</b> is empty, the page links are ignored.<br><br>For details, see the <a href='https://apify.com/apify/cheerio-scraper#link-selector' target='_blank' rel='noopener'>Link selector</a> in README."
                    },
                    "pageFunction": {
                        "title": "Page function",
                        "type": "string",
                        "description": "A JavaScript function that is executed server-side in Node.js 12 for every page loaded. Use it to scrape data from the page, perform actions or add new URLs to the request queue.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#page-function' target='_blank' rel='noopener'>Page function</a> in README."
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Specifies proxy servers that will be used by the scraper in order to hide its origin.<br><br>For details, see <a href='https://apify.com/apify/cheerio-scraper#proxy-configuration' target='_blank' rel='noopener'>Proxy configuration</a> in README.",
                        "default": {
                            "useApifyProxy": true
                        }
                    },
                    "proxyRotation": {
                        "title": "Proxy rotation",
                        "enum": [
                            "RECOMMENDED",
                            "PER_REQUEST",
                            "UNTIL_FAILURE"
                        ],
                        "type": "string",
                        "description": "This property indicates the strategy of proxy rotation and can only be used in conjunction with Apify Proxy. The recommended setting automatically picks the best proxies from your available pool and rotates them evenly, discarding proxies that become blocked or unresponsive. If this strategy does not work for you for any reason, you may configure the scraper to either use a new proxy for each request, or to use one proxy as long as possible, until the proxy fails. IMPORTANT: This setting will only use your available Apify Proxy pool, so if you don't have enough proxies for a given task, no rotation setting will produce satisfactory results.",
                        "default": "RECOMMENDED"
                    },
                    "sessionPoolName": {
                        "title": "Session pool name",
                        "pattern": "[0-9A-z-]",
                        "minLength": 3,
                        "maxLength": 200,
                        "type": "string",
                        "description": "<b>Use only English alphanumeric characters, dashes, and underscores.</b> A session is a representation of a user. It has its own IP and cookies, which are used together to emulate a real user. Usage of the sessions is controlled by the Proxy rotation option. By providing a session pool name, you enable sharing of those sessions across multiple Actor runs. This is very useful when you need specific cookies for accessing the websites or when a lot of your proxies are already blocked. Instead of trying randomly, a list of working sessions will be saved and a new Actor run can reuse those sessions. Note that the IP lock on sessions expires after 24 hours, unless the session is used again in that window."
                    },
                    "initialCookies": {
                        "title": "Initial cookies",
                        "type": "array",
                        "description": "A JSON array with cookies that will be sent with every HTTP request made by the Cheerio Scraper, in the format accepted by the <a href='https://www.npmjs.com/package/tough-cookie' target='_blank' rel='noopener noreferrer'>tough-cookie</a> NPM package. This option is useful for transferring a logged-in session from an external web browser. For details on how to do this, read this <a href='https://help.apify.com/en/articles/1444249-log-in-to-website-by-transferring-cookies-from-web-browser-legacy' target='_blank' rel='noopener'>help article</a>.",
                        "default": []
                    },
                    "additionalMimeTypes": {
                        "title": "Additional MIME types",
                        "type": "array",
                        "description": "A JSON array specifying additional MIME content types of web pages to support. By default, Cheerio Scraper supports the <code>text/html</code> and <code>application/xhtml+xml</code> content types, and skips all other resources. For details, see <a href='https://apify.com/apify/cheerio-scraper#content-types' target='_blank' rel='noopener'>Content types</a> in README.",
                        "default": []
                    },
                    "suggestResponseEncoding": {
                        "title": "Suggest response encoding",
                        "type": "string",
                        "description": "The scraper automatically determines response encoding from the response headers. If the headers are invalid or information is missing, malformed responses may be produced. Use the Suggest response encoding option to provide a fall-back encoding to the Scraper for cases where it could not be determined."
                    },
                    "forceResponseEncoding": {
                        "title": "Force response encoding",
                        "type": "boolean",
                        "description": "If enabled, the suggested response encoding will be used even if a valid response encoding is provided by the target website. Use this only when you've inspected the responses thoroughly and are sure that they are the ones doing it wrong.",
                        "default": false
                    },
                    "ignoreSslErrors": {
                        "title": "Ignore SSL errors",
                        "type": "boolean",
                        "description": "If enabled, the scraper will ignore SSL/TLS certificate errors. Use at your own risk.",
                        "default": false
                    },
                    "preNavigationHooks": {
                        "title": "Pre-navigation hooks",
                        "type": "string",
                        "description": "Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `requestAsBrowserOptions`, which are passed to the `requestAsBrowser()` function the crawler calls to navigate."
                    },
                    "postNavigationHooks": {
                        "title": "Post-navigation hooks",
                        "type": "string",
                        "description": "Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter."
                    },
                    "maxRequestRetries": {
                        "title": "Max request retries",
                        "minimum": 0,
                        "type": "integer",
                        "description": "The maximum number of times the scraper will retry loading each web page after a page load error or an exception thrown by the <b>Page function</b>.<br><br>If set to <code>0</code>, the page will be considered failed right after the first error.",
                        "default": 3
                    },
                    "maxPagesPerCrawl": {
                        "title": "Max pages per run",
                        "minimum": 0,
                        "type": "integer",
                        "description": "The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It is always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.<br><br>If set to <code>0</code>, there is no limit.",
                        "default": 0
                    },
                    "maxResultsPerCrawl": {
                        "title": "Max result records",
                        "minimum": 0,
                        "type": "integer",
                        "description": "The maximum number of records that will be saved to the resulting dataset. The scraper will stop when this limit is reached. <br><br>If set to <code>0</code>, there is no limit.",
                        "default": 0
                    },
                    "maxCrawlingDepth": {
                        "title": "Max crawling depth",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Specifies how many links away from the <b>Start URLs</b> the scraper will descend. This value is a safeguard against infinite crawling depths for misconfigured scrapers. Note that pages added using <code>context.enqueuePage()</code> in <b>Page function</b> are not subject to the maximum depth constraint. <br><br>If set to <code>0</code>, there is no limit.",
                        "default": 0
                    },
                    "maxConcurrency": {
                        "title": "Max concurrency",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Specifies the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target web server.",
                        "default": 50
                    },
                    "pageLoadTimeoutSecs": {
                        "title": "Page load timeout",
                        "minimum": 1,
                        "type": "integer",
                        "description": "The maximum amount of time the scraper will wait for a web page to load, in seconds. If the web page does not load in this timeframe, it is considered to have failed and will be retried (subject to <b>Max request retries</b>), as with other page load errors.",
                        "default": 60
                    },
                    "pageFunctionTimeoutSecs": {
                        "title": "Page function timeout",
                        "minimum": 1,
                        "type": "integer",
                        "description": "The maximum amount of time the scraper will wait for the <b>Page function</b> to execute, in seconds. It is always a good idea to set this limit, to ensure that unexpected behavior in page function will not get the scraper stuck.",
                        "default": 60
                    },
                    "debugLog": {
                        "title": "Enable debug log",
                        "type": "boolean",
                        "description": "If enabled, the Actor log will include debug messages. Beware that this can be quite verbose. Use <code>context.log.debug('message')</code> to log your own debug messages from the <b>Page function</b>.",
                        "default": false
                    },
                    "customData": {
                        "title": "Custom data",
                        "type": "object",
                        "description": "A custom JSON object that is passed to the <b>Page function</b> as <code>context.customData</code>. This setting is useful when invoking the scraper via API, in order to pass some arbitrary parameters to your code.",
                        "default": {}
                    },
                    "datasetName": {
                        "title": "Dataset name",
                        "type": "string",
                        "description": "Name or ID of the dataset that will be used for storing results. If left empty, the default dataset of the run will be used."
                    },
                    "keyValueStoreName": {
                        "title": "Key-value store name",
                        "type": "string",
                        "description": "Name or ID of the key-value store that will be used for storing records. If left empty, the default key-value store of the run will be used."
                    },
                    "requestQueueName": {
                        "title": "Request queue name",
                        "type": "string",
                        "description": "Name of the request queue that will be used for storing requests. If left empty, the default request queue of the run will be used."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
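
The input schema above requires link filters to be arrays of objects (`{"purl": ...}` for `pseudoUrls`, `{"glob": ...}` for `excludes`) and sets minimums and an enum on several fields. The following is a minimal sketch of how an input could be assembled and sanity-checked locally before it is sent to the Actor. The helper names `build_link_filters` and `check_input` and the example glob are hypothetical and not part of any Apify library; the check only covers constraints visible in the schema fragment above and does not replace the platform's own validation.

```python
import json

# Minimums and enum values copied from the schema fragment above.
MINIMUMS = {
    "maxRequestRetries": 0,
    "maxPagesPerCrawl": 0,
    "maxResultsPerCrawl": 0,
    "maxCrawlingDepth": 0,
    "maxConcurrency": 1,
    "pageLoadTimeoutSecs": 1,
    "pageFunctionTimeoutSecs": 1,
}
PROXY_ROTATIONS = {"RECOMMENDED", "PER_REQUEST", "UNTIL_FAILURE"}


def build_link_filters(pseudo_urls=(), excludes=()):
    """Hypothetical helper: wrap plain strings into the object shapes the schema requires."""
    return {
        "pseudoUrls": [{"purl": p} for p in pseudo_urls],
        "excludes": [{"glob": g} for g in excludes],
    }


def check_input(actor_input):
    """Hypothetical helper: return a list of problems; an empty list means none were found."""
    problems = []
    for field, minimum in MINIMUMS.items():
        value = actor_input.get(field)
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{field} must be an integer")
        elif value < minimum:
            problems.append(f"{field} must be >= {minimum}")
    rotation = actor_input.get("proxyRotation")
    if rotation is not None and rotation not in PROXY_ROTATIONS:
        problems.append(f"proxyRotation must be one of {sorted(PROXY_ROTATIONS)}")
    name = actor_input.get("sessionPoolName")
    if name is not None and not 3 <= len(name) <= 200:
        problems.append("sessionPoolName must be 3 to 200 characters long")
    for field, key in (("pseudoUrls", "purl"), ("excludes", "glob")):
        for item in actor_input.get(field, []):
            if not isinstance(item, dict) or not isinstance(item.get(key), str):
                problems.append(f"each {field} item needs a string '{key}'")
                break
    return problems


# Example input using only fields that appear in the schema fragment above.
# The exclude glob is an illustrative assumption.
example_input = {
    "linkSelector": "a[href]",
    "proxyConfiguration": {"useApifyProxy": True},
    "proxyRotation": "RECOMMENDED",
    "maxRequestRetries": 3,
    "maxPagesPerCrawl": 100,
    "maxConcurrency": 10,
    **build_link_filters(
        pseudo_urls=["http://www.example.com/[.*]"],
        excludes=["http://www.example.com/private/**"],
    ),
}

if __name__ == "__main__":
    print(check_input(example_input))
    print(json.dumps(example_input, indent=2))
```

Setting `maxPagesPerCrawl` to a non-zero value follows the schema's own advice to cap page loads for misconfigured scrapers. The resulting dictionary can be serialized with `json.dumps` and passed as the run input.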
