# Puppeteer Scraper (`apify/puppeteer-scraper`) Actor

Crawls websites with headless Chrome and the Puppeteer library using provided server-side Node.js code. This crawler is an alternative to apify/web-scraper that gives you finer control over the process. Supports both recursive crawling and lists of URLs. Supports logging in to websites.

- **URL**: https://apify.com/apify/puppeteer-scraper.md
- **Developed by:** [Apify](https://apify.com/apify) (Apify)
- **Categories:** Developer tools, Open source
- **Stats:** 14,524 total users, 1,062 monthly users, 99.9% runs succeeded, 297 bookmarks
- **User rating**: 4.98 out of 5 stars

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher your subscription plan.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

Puppeteer Scraper is one of the most powerful scraper tools in our arsenal (aside from developing your own Actors).

It uses the Puppeteer library to programmatically control a headless Chrome browser, and it can make it do almost anything. If using [Web Scraper](https://apify.com/apify/web-scraper) doesn't cut it for your use case, then Puppeteer Scraper is what you need.

[Puppeteer](https://github.com/puppeteer/puppeteer) is a Node.js library, so knowledge of Node.js and its paradigms is expected when working with this Actor.

If you need either a faster or a simpler tool, check out [Cheerio Scraper](https://apify.com/apify/cheerio-scraper) for optimization and speed, or [Web Scraper](https://apify.com/apify/web-scraper) for simplicity.

If you are having any difficulty deciding which of the four main Apify "Scraper" Actors to use, check out the [Web Scraper vs Puppeteer Scraper](https://help.apify.com/en/articles/3195646-when-to-use-puppeteer-scraper), [Cheerio Scraper](https://blog.apify.com/how-to-super-efficiently-scrape-any-website-for-beginners/) and [Playwright Scraper](https://blog.apify.com/how-to-scrape-the-web-with-playwright-ece1ced75f73/) articles on the Apify blog.

### Cost of usage

You can find the average usage cost for this Actor on the [pricing page](https://apify.com/pricing) under the `Which plan do I need?` section. Cheerio Scraper is equivalent to `Simple HTML pages` while Web Scraper, Puppeteer Scraper and Playwright Scraper are equivalent to `Full web pages`. These cost estimates are based on averages and might be lower or higher depending on how heavy the pages you scrape are.

### Usage

To get started with Puppeteer Scraper, you only need a few things. First, with `Start URLs`, tell the scraper which web pages it should load. Then, tell it how to handle each request and extract data from each page.

The scraper starts by loading pages specified in the [**Start URLs**](#start-urls) input setting. You can make the scraper follow page links on the fly by setting a **[Link selector](#link-selector)**, **[Glob Patterns](#glob-patterns)** and/or **[Pseudo-URLs](#pseudo-urls)** to tell the scraper which links it should add to the crawler's request queue. This is useful for the recursive crawling of entire websites (e.g. finding all products available in an online store).

To tell the scraper how to handle requests and extract data, you need to provide a **[Page function](#page-function)**, and optionally arrays of **[Pre-navigation hooks](#pre-navigation-hooks)** and **[Post-navigation hooks](#post-navigation-hooks)**. This is JavaScript code that is executed in the Node.js environment. Since the scraper uses the full-featured Chromium browser, client-side logic can be executed within the context of the web page using the **[`page`](#page)** object within the Page function's context.

In summary, Puppeteer Scraper works as follows:

1. Adds each URL from [Start URLs](#start-urls) to the request queue.
2. For each request:
    - Evaluates all hooks in [Pre-navigation hooks](#pre-navigation-hooks)
    - Executes the [Page function](#page-function) on the loaded page
    - Optionally, finds all links from the page using [Link selector](#link-selector). If a link matches any of the [Glob Patterns](#glob-patterns) and/or [Pseudo URLs](#pseudo-urls) and has not yet been requested, it is added to the queue.
    - Evaluates [Post-navigation hooks](#post-navigation-hooks)
3. If there are more items in the queue, repeats step 2. Otherwise, finishes the crawl.
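The steps above can be modelled with a small in-memory simulation. This is only a sketch of the control flow, not the Actor's implementation: `crawl`, `fakeSite`, and the `matches` callback are hypothetical names, and no browser or network is involved.

```javascript
// Hypothetical in-memory "website": URL -> links found on that page.
const fakeSite = {
    'https://example.com/': ['https://example.com/a', 'https://example.com/b', 'https://other.com/'],
    'https://example.com/a': ['https://example.com/b'],
    'https://example.com/b': ['https://example.com/'],
};

function crawl(startUrls, matches) {
    const queue = [...startUrls]; // step 1: seed the request queue
    const seen = new Set(startUrls); // URLs that were already requested
    const visited = [];

    while (queue.length > 0) { // step 3: repeat while the queue has items
        const url = queue.shift();
        visited.push(url); // stands in for the hooks and the Page function
        for (const link of fakeSite[url] ?? []) {
            // Enqueue only links that pass the filter and were not requested yet.
            if (matches(link) && !seen.has(link)) {
                seen.add(link);
                queue.push(link);
            }
        }
    }
    return visited;
}

const visited = crawl(['https://example.com/'], (url) => url.startsWith('https://example.com/'));
console.log(visited);
```

The `seen` set plays the role of the "has not yet been requested" check, which is what keeps circular links from looping forever.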

Puppeteer Scraper has a number of other configuration settings to improve performance, set cookies for login to websites, mask the web browser, etc. See [Advanced configuration](#advanced-configuration) below for the complete list of settings.

### Limitations

The Actor employs a fully-featured Chromium web browser, which is resource-intensive and might be overkill for websites that do not render the content dynamically using client-side JavaScript. To achieve better performance for scraping such sites, you might prefer to use [**Cheerio Scraper**](https://apify.com/apify/cheerio-scraper), which downloads and processes raw HTML pages without the overhead of a web browser.

For non-seasoned developers, Puppeteer Scraper may be too complex. For a simpler setup process check out [Web Scraper](https://apify.com/apify/web-scraper), which also uses Puppeteer under the hood.

### Input Configuration

On input, the Puppeteer Scraper Actor accepts a number of configuration settings. These can be entered either manually in the user interface in [Apify Console](https://console.apify.com), or programmatically in a JSON object using the [Apify API](https://docs.apify.com/api/v2#/reference/actors/run-collection/run-actor). For a complete list of input fields and their types, please see the outline of the Actor's [Input-schema](https://apify.com/apify/puppeteer-scraper/input-schema).
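As a rough illustration, an input object sent through the API might look like the sketch below. The field names `startUrls`, `linkSelector`, `globs`, `pageFunction`, and `proxyConfiguration` appear in this README; the `{ url }` and `{ glob }` item shapes and passing the Page function as a source-code string are assumptions that should be checked against the input schema.

```javascript
// Sketch of an Actor input. The item shapes ({ url }, { glob }) are assumptions;
// verify them against the Actor's input schema before use.
const input = {
    startUrls: [{ url: 'https://www.example.com/pages/' }],
    linkSelector: 'a[href]',
    globs: [{ glob: 'https://www.example.com/pages/**/*' }],
    // Assumption: the Page function is sent as a string of JavaScript source code.
    pageFunction: `async function pageFunction(context) {
        const { page, request } = context;
        return { url: request.url, title: await page.title() };
    }`,
    proxyConfiguration: { useApifyProxy: true },
};

// The API expects a JSON body, so the object must survive serialization.
const body = JSON.stringify(input, null, 2);
console.log(body);
```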

#### Start URLs

The **Start URLs** (`startUrls`) field represents the initial list of URLs of pages that the scraper will visit. You can either enter these URLs manually one by one, upload them in a CSV file or [link URLs from a Google Sheet](https://help.apify.com/en/articles/2906022-scraping-a-list-of-urls-from-a-google-sheets-document) document. Note that each URL must start with either an `http://` or `https://` protocol prefix.

The scraper supports adding new URLs to scrape on the fly, either using the **[Link selector](#link-selector)** and **[Glob Patterns](#glob-patterns)**/**[Pseudo-URLs](#pseudo-urls)** options, or by calling `await context.enqueueRequest()` inside the **[Page function](#page-function)**.

Optionally, each URL can be associated with custom user data - a JSON object that can be referenced from your JavaScript code in **[Page function](#page-function)** under `context.request.userData`. This is useful for determining which start URL is currently loaded, allowing the ability to perform some page-specific actions. For example, when crawling an online store, you might want to perform different actions on a page listing the products vs. a product detail page. For details, refer to **[Web scraping tutorial](https://docs.apify.com/tutorials/apify-scrapers/getting-started#the-start-url)** within the Apify documentation.
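A minimal sketch of the page-specific branching described above, assuming each start URL was given user data with a hypothetical `label` key (`'LISTING'` or `'DETAIL'`). The label values and the returned fields are illustrative only.

```javascript
// Start URLs with custom user data (the `label` key and its values are hypothetical).
const startUrls = [
    { url: 'https://www.example.com/products', userData: { label: 'LISTING' } },
    { url: 'https://www.example.com/products/1', userData: { label: 'DETAIL' } },
];

// Page function that picks an action based on the user data of the current request.
async function pageFunction(context) {
    const { request, page } = context;
    const { label } = request.userData;

    if (label === 'LISTING') {
        // On a listing page, count the links found on the page.
        const linkCount = await page.$$eval('a[href]', (links) => links.length);
        return { type: 'listing', url: request.url, linkCount };
    }
    // Otherwise treat the page as a product detail page.
    return { type: 'detail', url: request.url, title: await page.title() };
}
```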


#### Link selector

The **Link selector** (`linkSelector`) field contains a CSS selector that is used to find links to other web pages (items with `href` attributes, e.g. `<div class="my-class" href="...">`).

On every page loaded, the scraper looks for all links matching **Link selector**, and checks that the target URL matches one of the [**Glob Patterns**](#glob-patterns)/[**Pseudo-URLs**](#pseudo-urls). If it is a match, it then adds the URL to the request queue so that it's loaded by the scraper later on.

By default, new scrapers are created with the following selector that matches all links on any page:

```
a[href]
```

If **Link selector** is empty, the page links are ignored, and the scraper only loads pages that were specified in **[Start URLs](#start-urls)** or that were manually added to the request queue by calling `await context.enqueueRequest()` in **[Page function](#page-function)**.

#### Glob Patterns

The **Glob Patterns** (`globs`) field specifies which types of URLs found by **[Link selector](#link-selector)** should be added to the request queue.

A glob pattern is simply a string with wildcard characters.

For example, a glob pattern `http://www.example.com/pages/**/*` will match all the
following URLs:

- `http://www.example.com/pages/deeper-level/page`
- `http://www.example.com/pages/my-awesome-page`
- `http://www.example.com/pages/something`

Note that you don't need to use the **Glob Patterns** setting at all, because you can completely control which pages the scraper will access by calling `await context.enqueueRequest()` from the **[Page function](#page-function)**.
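To make the wildcard semantics concrete, here is a simplified sketch that converts a glob into a regular expression. It assumes `**/` matches zero or more path segments and `*` matches anything except `/`. The Actor's real glob matching may support more syntax, so treat `globToRegExp` as a hypothetical helper.

```javascript
// Hypothetical helper: converts a simple glob into a RegExp.
// Only `**/`, `**` and `*` are handled; everything else is matched literally.
function globToRegExp(glob) {
    let source = '';
    let i = 0;
    while (i < glob.length) {
        if (glob.startsWith('**/', i)) {
            source += '(?:.*/)?'; // zero or more path segments
            i += 3;
        } else if (glob.startsWith('**', i)) {
            source += '.*'; // anything, including slashes
            i += 2;
        } else if (glob[i] === '*') {
            source += '[^/]*'; // anything within one path segment
            i += 1;
        } else {
            source += glob[i].replace(/[.+?^${}()|[\]\\]/g, '\\$&'); // escape literals
            i += 1;
        }
    }
    return new RegExp(`^${source}$`);
}

const pagesGlob = globToRegExp('http://www.example.com/pages/**/*');
console.log(pagesGlob.test('http://www.example.com/pages/deeper-level/page'));
console.log(pagesGlob.test('http://www.example.com/other/page'));
```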

#### Pseudo URLs

The **Pseudo-URLs** (`pseudoUrls`) field specifies which types of URLs found by **[Link selector](#link-selector)** should be added to the request queue.

A pseudo-URL is simply a URL with special directives enclosed in `[]` brackets.
Currently, the only supported directive is `[regexp]`, which defines
a JavaScript-style regular expression to match against the URL.

For example, a pseudo-URL `http://www.example.com/pages/[(\w|-)*]` will match all the
following URLs:

- `http://www.example.com/pages/`
- `http://www.example.com/pages/my-awesome-page`
- `http://www.example.com/pages/something`

If either "`[`" or "`]`" are part of the normal query string, the symbol must be encoded as `[\x5B]` or `[\x5D]`, respectively. For example, the following pseudo-URL:

```
http://www.example.com/search?do[\x5B]load[\x5D]=1
```

will match the URL:

```
http://www.example.com/search?do[load]=1
```

Optionally, each pseudo-URL can be associated with user data that can be referenced from your **[Page function](#page-function)** using `context.request.label` to determine which kind of page is currently loaded in the browser.

Note that you don't need to use the **Pseudo-URLs** setting at all, because you can completely control which pages the scraper will access by calling `await context.enqueueRequest()` from the **[Page function](#page-function)**.
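The bracket directive can be illustrated with a small converter from a pseudo-URL to a regular expression. This is a sketch under the rules stated in this section (text inside `[]` is a regular expression, everything else is literal); `pseudoUrlToRegExp` is a hypothetical helper, not the scraper's own code.

```javascript
// Hypothetical helper: turns a pseudo-URL into a RegExp.
function pseudoUrlToRegExp(purl) {
    let pattern = '^';
    let depth = 0; // nesting level of [] brackets
    for (const ch of purl) {
        if (ch === '[' && ++depth === 1) {
            pattern += '('; // a directive starts
        } else if (ch === ']' && depth > 0 && --depth === 0) {
            pattern += ')'; // the directive ends
        } else if (depth > 0) {
            pattern += ch; // inside a directive: keep the regex source as-is
        } else {
            pattern += ch.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'); // literal part
        }
    }
    return new RegExp(`${pattern}$`);
}

// String.raw keeps the backslashes exactly as written in the pseudo-URL.
const pagesPurl = pseudoUrlToRegExp(String.raw`http://www.example.com/pages/[(\w|-)*]`);
const searchPurl = pseudoUrlToRegExp(String.raw`http://www.example.com/search?do[\x5B]load[\x5D]=1`);

console.log(pagesPurl.test('http://www.example.com/pages/my-awesome-page'));
console.log(searchPurl.test('http://www.example.com/search?do[load]=1'));
```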

#### Clickable elements selector

For pages where the links you want to add to the crawler's request queue aren't included in elements with `href` attributes, you can pass a CSS Selector to the **Clickable elements selector**. This CSS selector should match elements that lead to the URL you want to queue up.

The scraper will mouse click the specified CSS selector after the page function finishes. Any triggered requests, navigations, or open tabs will be intercepted, and the target URLs will be filtered using Globs and/or Pseudo URLs. Finally, these filtered URLs will be added to the request queue. Leave this field empty to prevent the scraper from clicking in the page.

It's important to note that _using this setting can impact performance._

#### Page function

The `context` object as it appears within the **Page function**:

```JavaScript
const context = {
    // USEFUL DATA
    input, // Input data in JSON format.
    env, // Contains information about the run, such as actorId and runId.
    customData, // Value of the 'Custom data' scraper option.

    // EXPOSED OBJECTS
    page, // Puppeteer.Page object.
    request, // Crawlee.Request object.
    response, // Response object holding the status code and headers.
    session, // Reference to the currently used session.
    proxyInfo, // Object holding the url and other information about currently used Proxy.
    crawler, // Reference to the crawler object, with access to `browserPool`, `autoscaledPool`, and more.
    globalStore, // Represents an in memory store that can be used to share data across pageFunction invocations.
    log, // Reference to Crawlee.utils.log.
    Actor, // Reference to the Actor class of Apify SDK.
    Apify, // Alias to the Actor class for back compatibility.

    // EXPOSED FUNCTIONS
    setValue, // Reference to the Actor.setValue() function.
    getValue, // Reference to the Actor.getValue() function.
    saveSnapshot, // Saves a screenshot and full HTML of the current page to the key-value store.
    skipLinks, // Prevents enqueueing more links via Glob patterns/Pseudo URLs on the current page.
    enqueueRequest, // Adds a page to the request queue.

    // PUPPETEER CONTEXT-AWARE UTILITY FUNCTIONS
    injectJQuery, // Injects the jQuery library into a Puppeteer page.
    sendRequest, // Sends request using got-scraping.
    parseWithCheerio, // Returns a Cheerio handle for page.content(), so you can work with the data the same way as with CheerioCrawler.
};
```
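For orientation, a minimal Page function using a few of these context keys might look like the sketch below. The `h1` selector and the returned field names are assumptions chosen for illustration.

```javascript
// Minimal Page function sketch: reads the title and the first <h1> of the page.
async function pageFunction(context) {
    const { page, request, log, customData } = context;
    log.info(`Processing ${request.url}`);

    const title = await page.title();
    // page.$eval rejects when nothing matches the selector, so fall back to null.
    const heading = await page.$eval('h1', (el) => el.textContent.trim()).catch(() => null);

    return { url: request.url, title, heading, customData };
}
```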

##### **`input`**

| Type   | Arguments | Returns      |
| ------ | --------- | ------------ |
| Object | -         | Input object |

The Actor's input as it was received from the UI. Each `pageFunction` invocation gets a fresh copy. Note that the Actor's input cannot be modified by changing the values in this object.

##### **`env`**

| Type   | Arguments | Returns                                                                                |
| ------ | --------- | -------------------------------------------------------------------------------------- |
| Object | -         | Return value of [`Actor.getEnv()`](https://sdk.apify.com/api/apify/class/Actor#getEnv) |

A map of all the relevant environment variables that you may want to use.

##### **`customData`**

| Type   | Arguments | Returns            |
| ------ | --------- | ------------------ |
| Object | -         | Custom data object |

Since the input UI is fixed, it does not support adding other fields that may be needed for specific use cases. If you need to pass arbitrary data to the scraper, use the [Custom data](#custom-data) input field within [Advanced configuration](#advanced-configuration) and its contents will be available under the `customData` context key as an object.

##### **`page`**

| Type   | Arguments | Returns                                                      |
| ------ | --------- | ------------------------------------------------------------ |
| Object | -         | [Puppeteer Page](https://pptr.dev/api/puppeteer.page) object |

This is a reference to the Puppeteer Page object, which enables you to use the full power of Puppeteer in your Page functions. If you are not familiar with the Page API already, you can refer to [their documentation](https://pptr.dev/api/puppeteer.page).

##### **`request`**

| Type   | Arguments | Returns                                                            |
| ------ | --------- | ------------------------------------------------------------------ |
| Object | -         | Apify [Request](https://crawlee.dev/api/core/class/Request) object |

An object with metadata about the currently crawled page, such as its URL, headers, and the number of retries.

```JavaScript
const request = {
    id,
    url,
    loadedUrl,
    uniqueKey,
    method,
    payload,
    noRetry,
    retryCount,
    errorMessages,
    headers,
    userData,
    handledAt
}
```

See the [Request class](https://crawlee.dev/api/core/class/Request) for a preview of the structure and full documentation.

##### **`response`**

| Type   | Arguments | Returns         |
| ------ | --------- | --------------- |
| Object | -         | Response object |

The response object is produced by Puppeteer. Currently, we only pass the response's HTTP status code and headers to the `response` object.

##### **`session`**

| Type   | Arguments | Returns                                                      |
| ------ | --------- | ------------------------------------------------------------ |
| Object | -         | [Session](https://crawlee.dev/api/core/class/Session) object |

Reference to the currently used session. See the [official documentation](https://crawlee.dev/api/core/class/Session) for more information.

##### **`proxyInfo`**

| Type   | Arguments | Returns                                                              |
| ------ | --------- | -------------------------------------------------------------------- |
| Object | -         | [ProxyInfo](https://crawlee.dev/api/core/interface/ProxyInfo) object |

An object holding the URL and other information about the currently used proxy. See the [official documentation](https://crawlee.dev/api/core/interface/ProxyInfo) for more information.

##### **`crawler`**

| Type   | Arguments | Returns                                                                                     |
| ------ | --------- | ------------------------------------------------------------------------------------------- |
| Object | -         | [PuppeteerCrawler](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler) object |

To access the current `AutoscaledPool` or `BrowserPool` instance, we can use the `crawler` object. This object includes the following properties:

```JavaScript
const crawler = {
    stats,
    requestList,
    requestQueue,
    sessionPool,
    proxyConfiguration,
    browserPool,
    autoscaledPool
}
```

Refer to the [official documentation](https://crawlee.dev/api/puppeteer-crawler/class/PuppeteerCrawler) for more information.

##### **`globalStore`**

| Type   | Arguments | Returns               |
| ------ | --------- | --------------------- |
| Object | -         | Global store contents |

`globalStore` represents an instance of a very simple in-memory store that is not scoped to the individual `pageFunction` invocation. This enables you to easily share global data such as API responses and tokens between all requests. Since the stored data needs to cross from the browser to the Node.js process, it must be formatted into JSON stringifiable objects. You cannot store DOM objects, functions, circular objects, etc.

##### **`log`**

| Type   | Arguments | Returns                                                            |
| ------ | --------- | ------------------------------------------------------------------ |
| Object | -         | [Crawlee.utils.log](https://crawlee.dev/api/core/class/Log) object |

This should be used instead of JavaScript's built-in `console.log` when logging in the Node.js context, as it automatically color-tags your logs and lets you toggle the visibility of log messages using options such as [Debug log](#debug-log) in [Advanced configuration](#advanced-configuration).

The most common `log` methods include:

- `context.log.info()`
- `context.log.debug()`
- `context.log.warning()`
- `context.log.error()`
- `context.log.exception()`

##### **`Actor`**

| Type   | Arguments | Returns                                                           |
| ------ | --------- | ----------------------------------------------------------------- |
| Object | -         | [Actor class](https://sdk.apify.com/api/apify/class/Actor) object |

A reference to the full power of the Actor class of Apify SDK. See [the docs](https://sdk.apify.com/api/apify/class/Actor) for more information.

> Caution: Since we're making the Actor class available with this option, and Puppeteer Scraper already runs using the Actor class, some edge case manipulations may lead to inconsistencies. Use `Actor` class with caution, and avoid making global changes unless you know what you're doing.

##### **`Apify`**

An alias for [`Actor`](#actor) class for back compatibility.

##### **`setValue`**

| Type     | Arguments                                          | Returns          |
| -------- | -------------------------------------------------- | ---------------- |
| Function | (key: *string*, data: *object*, options: *object*) | *Promise\<void>* |

> This function is async! Don't forget the `await` keyword!

Allows you to save data to the default key-value store. The `key` is the name of the item in the store (which can later be used to retrieve this stored data), and the `data` is an object containing all the data you want to store.

Usage:

```JavaScript
await context.setValue('my-value', { message: 'hello' })
```

Refer to [Key-Value store documentation](https://crawlee.dev/api/core/class/KeyValueStore#setValue) for more information.

##### **`getValue`**

| Type     | Arguments       | Returns            |
| -------- | --------------- | ------------------ |
| Function | (key: *string*) | *Promise\<object>* |

> This function is async! Don't forget the `await` keyword!

Retrieve previously saved data in the key-value store via the `key` specified when using the [`setValue`](#setvalue) function.

Usage:

```JavaScript
const { message } = await context.getValue('my-value')
```

Refer to [Key-Value store documentation](https://crawlee.dev/api/core/class/KeyValueStore#getValue) for more information.
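As a sketch of how `setValue` and `getValue` can work together, the helper below keeps a running counter in the key-value store. The key name `page-count` and the helper `countPage` are hypothetical, and the sketch assumes `getValue` resolves to `null` for a missing key.

```javascript
// Sketch: keep a running page counter in the default key-value store.
async function countPage(context) {
    // Assumption: a missing key resolves to null, so start from zero.
    const previous = (await context.getValue('page-count')) ?? { count: 0 };
    const next = { count: previous.count + 1 };
    await context.setValue('page-count', next);
    return next.count;
}
```

Note that this read-modify-write pattern can lose updates when several pages are processed concurrently, so it is best suited to low-concurrency runs or approximate counters.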

##### **`saveSnapshot`**

| Type     | Arguments | Returns          |
| -------- | --------- | ---------------- |
| Function | ()        | *Promise\<void>* |

> This function is async! Don't forget the `await` keyword!

A helper function that enables saving a snapshot of the current page's HTML and a screenshot of the current page into the default key-value store. Each snapshot overwrites the previous one, and the `pageFunction`'s invocations will also be throttled if `saveSnapshot` is invoked more than once in 2 seconds (this is a measure put in place to prevent abuse). *Make sure you don't call it for every single request.*

Usage:

```JavaScript
await context.saveSnapshot()
```

You can find the latest screenshot under the `SNAPSHOT-SCREENSHOT` key and the HTML under the `SNAPSHOT-BODY` key.

##### **`skipLinks`**

| Type     | Arguments | Returns          |
| -------- | --------- | ---------------- |
| Function | ()        | *Promise\<void>* |

> This function is async! Don't forget the `await` keyword!

With each invocation of the pageFunction, the scraper attempts to extract new URLs from the page using the Link selector and Glob patterns/Pseudo-URLs provided in the input UI. If you want to prevent this behavior in certain cases, call the `skipLinks` function, and no URLs will be added to the queue for the given page.

Usage:

```JavaScript
await context.skipLinks()
```

##### **`enqueueRequest`**

| Type     | Arguments                    | Returns          |
| -------- | ---------------------------- | ---------------- |
| Function | (request: *Request\|object*) | *Promise\<void>* |

> This function is async! Don't forget the `await` keyword!

To enqueue a specific URL manually instead of automatically by a combination of a Link selector and a Pseudo-URL/Glob pattern, use the `enqueueRequest` function. It accepts a plain object as its argument, which needs to have the structure required to construct a [Request object](https://crawlee.dev/api/core/class/Request), but frankly, you just need an object with a `url` key.

Usage:

```JavaScript
await context.enqueueRequest({ url: 'https://www.example.com' })
```

This method is a nice shorthand for

```JavaScript
await context.crawler.requestQueue.addRequest({ url: 'https://www.example.com' })
```
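A slightly fuller sketch: on a listing page, collect product links and enqueue each of them with user data so the Page function can recognize them later. The selector `a.product`, the `'DETAIL'` label, and the helper name `enqueueDetailPages` are hypothetical.

```javascript
// Sketch: enqueue every product link found on the current page.
async function enqueueDetailPages(context) {
    const { page, enqueueRequest } = context;
    // Runs in the browser and returns the absolute URLs of the matched links.
    const urls = await page.$$eval('a.product', (links) => links.map((a) => a.href));
    for (const url of urls) {
        await enqueueRequest({ url, userData: { label: 'DETAIL' } });
    }
    return urls.length;
}
```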

##### **`injectJQuery`**

| Type     | Arguments | Returns          |
| -------- | --------- | ---------------- |
| Function | ()        | *Promise\<void>* |

> This function is async! Don't forget the `await` keyword!

Injects the [jQuery](https://jquery.com/) library into a Puppeteer page. The injected jQuery will be set to the `window.$` variable, and will survive page navigations and reloads. Note that `injectJQuery()` does not affect the Puppeteer [`page.$()`](https://pptr.dev/api/puppeteer.page._) function in any way.

Usage:

```JavaScript
await context.injectJQuery();
```
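Once jQuery is injected, it can be used inside `page.evaluate()`, which runs its callback in the browser. The sketch below collects the text of all `h2` elements; the selector and the helper name `extractHeadings` are assumptions for illustration.

```javascript
// Sketch: inject jQuery, then use it in the browser context.
async function extractHeadings(context) {
    await context.injectJQuery();
    // This callback is executed inside the page, where jQuery is available as window.$.
    return context.page.evaluate(() => $('h2').map((i, el) => $(el).text().trim()).get());
}
```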

##### **`sendRequest`**

| Type     | Arguments                                    | Returns          |
| -------- | -------------------------------------------- | ---------------- |
| Function | (overrideOptions?: Partial\<GotOptionsInit>) | *Promise\<void>* |

> This function is async! Don't forget the `await` keyword!

This is a helper function that allows processing the context-bound `Request` object through [`got-scraping`](https://github.com/apify/got-scraping). Some options, such as `url` or `method`, can be overridden by providing `overrideOptions`. See the [official documentation](https://crawlee.dev/docs/guides/got-scraping#sendrequest-api) for the full list of possible `overrideOptions` and more information.

Usage:

```JavaScript
// Without overrideOptions
await context.sendRequest();
// With overrideOptions.url
await context.sendRequest({ url: 'https://www.example.com' });
```

##### **`parseWithCheerio`**

| Type     | Arguments | Returns                 |
| -------- | --------- | ----------------------- |
| Function | ()        | *Promise\<CheerioRoot>* |

Returns a Cheerio handle for `page.content()`, allowing you to work with the data the same way as with CheerioCrawler.

Usage:

```JavaScript
const $ = await context.parseWithCheerio();
```

### Proxy Configuration

The **Proxy configuration** (`proxyConfiguration`) option enables you to set proxies
that will be used by the scraper in order to prevent its detection by target websites.
You can use both [Apify Proxy](https://apify.com/proxy)
and custom HTTP or SOCKS5 proxy servers.

Proxy is required to run the scraper. The following table lists the available options of the proxy configuration setting:

| Option                        | Description                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| ----------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Apify Proxy (automatic)       | The scraper will load all web pages using [Apify Proxy](https://apify.com/proxy) in the automatic mode. In this mode, the proxy uses all proxy groups that are available to the user, and for each new web page it automatically selects the proxy that hasn't been used in the longest time for the specific hostname, in order to reduce the chance of detection by the website. You can view the list of available proxy groups on the [Proxy](https://console.apify.com/proxy) page in Apify Console. |
| Apify Proxy (selected groups) | The scraper will load all web pages using [Apify Proxy](https://apify.com/proxy) with specific groups of target proxy servers.                                                                                                                                                                                                                                                                                                                                                                            |
| Custom proxies                | The scraper will use a custom list of proxy servers. The proxies must be specified in the `scheme://user:password@host:port` format, and multiple proxies should be separated by a space or a new line. The URL scheme can be either `http` or `socks5`. Username and password can be omitted if the proxy doesn't require authorization, but the port must always be present.                                                                                                                            |

Custom proxy example:

```
http://bob:password@proxy1.example.com:8000
http://bobby:password123@proxy2.example.com:3001
```

The proxy configuration can be set programmatically when calling the Actor using the API by setting the `proxyConfiguration` field. It accepts a JSON object with the following structure:

```JavaScript
{
    // Indicates whether to use Apify Proxy or not.
    "useApifyProxy": Boolean,

    // Array of Apify Proxy groups, only used if "useApifyProxy" is true.
    // If missing or null, Apify Proxy will use the automatic mode.
    "apifyProxyGroups": String[],

    // Array of custom proxy URLs, in "scheme://user:password@host:port" format.
    // If missing or null, custom proxies are not used.
    "proxyUrls": String[],
}
```
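Concrete values for the three modes might look like the sketch below. The group name `'RESIDENTIAL'` is only an example, and `isValidProxyUrl` is a hypothetical helper that loosely checks the `scheme://user:password@host:port` format described above.

```javascript
// Apify Proxy in automatic mode (no groups specified).
const automaticProxy = { useApifyProxy: true };

// Apify Proxy limited to selected groups ('RESIDENTIAL' is only an example name).
const groupProxy = { useApifyProxy: true, apifyProxyGroups: ['RESIDENTIAL'] };

// Custom proxies, reusing the URLs from the example above.
const customProxy = {
    useApifyProxy: false,
    proxyUrls: [
        'http://bob:password@proxy1.example.com:8000',
        'http://bobby:password123@proxy2.example.com:3001',
    ],
};

// Hypothetical helper: scheme must be http or socks5, credentials are optional,
// and the port must always be present.
function isValidProxyUrl(url) {
    return /^(http|socks5):\/\/(?:[^:@\/]+:[^:@\/]+@)?[^:@\/]+:\d+$/.test(url);
}

console.log(customProxy.proxyUrls.every(isValidProxyUrl));
```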

### Advanced Configuration

#### Pre-navigation hooks

This is an array of functions that will be executed **BEFORE** the main `pageFunction` is run. A similar `context` object is passed into each of these functions as is passed into the `pageFunction`; however, a second "[DirectNavigationOptions](https://crawlee.dev/api/puppeteer-crawler/namespace/puppeteerUtils#DirectNavigationOptions)" object is also passed in. `Apify` is an alias for `Actor` class in this case.

The available options can be seen here:

```JavaScript
preNavigationHooks: [
    async ({ id, request, session, proxyInfo, customData, Actor, Apify }, { timeout, waitUntil, referer }) => {}
]
```

Check out the docs for [Pre-navigation hooks](https://crawlee.dev/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions#preNavigationHooks) and the [PuppeteerHook type](https://crawlee.dev/api/puppeteer-crawler/interface/PuppeteerHook) for more info regarding the objects passed into these functions. The available properties are extended with `Actor` (previously `Apify`) class and `customData` in this scraper.
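As an illustration, a pre-navigation hook can adjust the navigation options before the page is loaded. In this sketch, `gotoOptions` is just a local name for the second argument, and `navigationTimeoutMs` is a hypothetical key read from `customData`; `waitUntil` and `timeout` are the options shown in the signature above.

```javascript
// Sketch: tune navigation options before each page load.
const preNavigationHooks = [
    async ({ request, customData }, gotoOptions) => {
        // Wait until the network is mostly idle before treating the page as loaded.
        gotoOptions.waitUntil = 'networkidle2';
        // Timeout in milliseconds; falls back to 60 seconds when no custom value is given.
        gotoOptions.timeout = customData?.navigationTimeoutMs ?? 60000;
    },
];
```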

#### Post-navigation hooks

An array of functions that will be executed **AFTER** the main `pageFunction` is run. The only available parameter is the [PuppeteerCrawlingContext](https://crawlee.dev/api/puppeteer-crawler/interface/PuppeteerCrawlingContext) object. The available properties are extended with `Actor` (alternatively `Apify`) and `customData` in this scraper. `Apify` is an alias for `Actor` class in this case.

```JavaScript
postNavigationHooks: [
    async ({ id, request, session, proxyInfo, response, customData, Actor, Apify }) => {}
]
```

Check out the docs for [Post-navigation hooks](https://crawlee.dev/api/puppeteer-crawler/interface/PuppeteerCrawlerOptions#postNavigationHooks) and the [PuppeteerHook type](https://crawlee.dev/api/puppeteer-crawler/interface/PuppeteerHook) for more info regarding the objects passed into these functions.
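A minimal sketch of a post-navigation hook that checks whether the navigation succeeded. It relies on the `response` property listed above and on Puppeteer's `response.status()` method. That a thrown error fails the request, so that it can be retried according to `maxRequestRetries`, is an assumption about how the scraper handles hook errors.

```javascript
// Each hook receives the crawling context as its only argument.
const postNavigationHooks = [
    async ({ request, response }) => {
        // `response` may be missing for some navigations, so guard before use.
        const status = response ? response.status() : undefined;
        if (status !== undefined && status >= 400) {
            // Assumption: throwing here marks the request as failed.
            throw new Error(`Navigation to ${request.url} returned HTTP ${status}`);
        }
    },
];
```

As with pre-navigation hooks, only the array literal goes into the input field.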

#### Debug log

*boolean*

When set to true, debug messages will be included in the log. Use `context.log.debug('message')` to log your own debug messages.

#### Browser log

*boolean*

When set to true, console messages from the browser will be included in the Actor's log. This may result in the log being flooded by error messages, warnings and other messages of little value (especially with a high concurrency).

#### Custom data

Since the input UI is fixed, it cannot accommodate additional fields that a specific use case may require. To pass arbitrary data to the scraper, use the [Custom data](#custom-data) input field within [Advanced configuration](#advanced-configuration). Its contents will be available as an object under the `customData` context key within the [pageFunction](#page-function).
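A short sketch of a `pageFunction` that reads a value from `customData`. The `maxTitleLength` key is a hypothetical setting invented for this example; it would be supplied as `{ "maxTitleLength": 20 }` in the Custom data input field.

```javascript
async function pageFunction(context) {
    const { page, request, customData } = context;
    const title = await page.title();
    // `maxTitleLength` is a hypothetical key; without it the title is kept whole.
    const limit = customData?.maxTitleLength ?? Infinity;
    return {
        url: request.url,
        title: title.slice(0, limit),
    };
}
```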

#### Custom namings

With the final three options in the **Advanced configuration**, you can set custom names for the following:

- Dataset
- Key-value store
- Request queue

Leave the storage unnamed if you only want the data within it to be persisted on the Apify platform for a number of days corresponding to your [plan](https://apify.com/pricing) (after which it will expire). Named storages are retained indefinitely. Additionally, using a named storage allows you to share it across multiple runs (e.g. instead of having 10 different unnamed datasets for 10 different runs, all the data from all 10 runs can be accumulated into a single named dataset). Learn more [here](https://docs.apify.com/storage#named-and-unnamed-storages).
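When the Actor is called through the API, these three options correspond to the `datasetName`, `keyValueStoreName`, and `requestQueueName` input fields. The sketch below adds them to an input object; the storage names are placeholders chosen for this example.

```javascript
// Minimal base input; a real run also needs `pageFunction` and `proxyConfiguration`.
const baseInput = {
    startUrls: [{ url: 'https://crawlee.dev/js' }],
};

// Placeholder storage names; replace them with your own.
const inputWithNamedStorages = {
    ...baseInput,
    datasetName: 'my-crawl-results',
    keyValueStoreName: 'my-crawl-store',
    requestQueueName: 'my-crawl-queue',
};
```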

### Results

The scraping results returned by **[Page function](#page-function)** are stored in the default dataset associated with the Actor run, from which you can export them to formats such as JSON, XML, CSV or Excel.
For each object returned by the **[Page function](#page-function)**, Puppeteer Scraper pushes one record into the dataset, and extends it with metadata such as the URL of the web page where the results come from.

For example, if you were scraping the HTML `<title>` of [Apify](https://apify.com) and returning the following object from the `pageFunction`:

```JavaScript
return {
  title: "Web Scraping, Data Extraction and Automation - Apify"
}
```

The full object stored in the dataset would look as follows (in JSON format, including the metadata fields `#error` and `#debug`):

```JSON
{
  "title": "Web Scraping, Data Extraction and Automation - Apify",
  "#error": false,
  "#debug": {
    "requestId": "fvwscO2UJLdr10B",
    "url": "https://apify.com",
    "loadedUrl": "https://apify.com/",
    "method": "GET",
    "retryCount": 0,
    "errorMessages": null,
    "statusCode": 200
  }
}
```
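The metadata fields can be used to inspect a crawl after the fact. The following sketch summarizes an array of dataset records shaped like the example above; the function name `summarizeRecords` and the shape of its return value are assumptions made for illustration.

```javascript
// Summarizes dataset records using the `#error` and `#debug` metadata fields.
function summarizeRecords(items) {
    const failed = items.filter((item) => item['#error'] === true);
    return {
        total: items.length,
        // URLs of records flagged as errors (records without `#debug` are skipped).
        failedUrls: failed
            .filter((item) => item['#debug'])
            .map((item) => item['#debug'].url),
        // Number of records that needed at least one retry.
        retried: items.filter((item) => (item['#debug']?.retryCount ?? 0) > 0).length,
    };
}
```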

To download the results, call the [Get dataset items](https://docs.apify.com/api/v2#/reference/datasets/item-collection) API endpoint:

```
https://api.apify.com/v2/datasets/[DATASET_ID]/items?format=json
```

`[DATASET_ID]` is the ID of the Actor run's dataset, which you can find in the Run object returned when starting the Actor. Alternatively, you'll find the download links for the results in Apify Console.

To omit the `#error` and `#debug` metadata fields from the results and exclude empty result records, add the `clean=true` query parameter to the API URL, or select the **Clean items** option when downloading the dataset in Apify Console.

To get the results in other formats, set the `format` query parameter to `xml`, `xlsx`, `csv`, `html`, etc.
For more information, see [Datasets](https://apify.com/docs/storage#dataset) in the documentation or the [Get dataset items](https://apify.com/docs/api/v2#/reference/datasets/item-collection) endpoint in the Apify API reference.
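As a sketch of how the `format` and `clean` parameters fit together, the helper below builds the download URL, and a second function fetches the items. The function names are assumptions made for this example. `downloadItems` assumes a runtime with a global `fetch` (for example Node.js 18 or later) and passes the API token in an `Authorization: Bearer` header.

```javascript
// Builds the "Get dataset items" URL with the query parameters described above.
function buildDatasetItemsUrl(datasetId, { format = 'json', clean = false } = {}) {
    const url = new URL(
        `https://api.apify.com/v2/datasets/${encodeURIComponent(datasetId)}/items`,
    );
    url.searchParams.set('format', format);
    if (clean) url.searchParams.set('clean', 'true');
    return url.toString();
}

// Downloads clean JSON items; `token` is your Apify API token.
async function downloadItems(datasetId, token) {
    const response = await fetch(buildDatasetItemsUrl(datasetId, { clean: true }), {
        headers: { Authorization: `Bearer ${token}` },
    });
    if (!response.ok) {
        throw new Error(`Request failed with status ${response.status}`);
    }
    return response.json();
}
```

Separating the URL builder from the network call keeps the parameter handling easy to check without making a request.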

### Additional Resources

That's it! You might also want to check out these other resources:

- [Actors documentation](https://apify.com/docs/actor) - Documentation for the Apify Actors cloud computing platform.
- [Apify SDK documentation](https://sdk.apify.com) - Learn more about the tools required to run your own Apify Actors.
- [Crawlee documentation](https://crawlee.dev) - Learn how to build a new web scraping project from scratch using the world's most popular web crawling and scraping library for Node.js.
- [Playwright Scraper](https://apify.com/apify/playwright-scraper) -
  A similar web scraping Actor to Puppeteer Scraper, but using the [Playwright](https://github.com/microsoft/playwright) library instead.
- [Web Scraper](https://apify.com/apify/web-scraper) - A similar web scraping Actor to Puppeteer Scraper, but simpler to use, as it only runs in the context of the browser. Uses the [Puppeteer](https://github.com/puppeteer/puppeteer) library.
- [Cheerio Scraper](https://apify.com/apify/cheerio-scraper) - Another web scraping Actor that downloads and processes pages in raw HTML for much higher performance.

# Actor input Schema

## `startUrls` (type: `array`):

URLs to start with

## `globs` (type: `array`):

Glob patterns to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Glob patterns will cause the scraper to enqueue all links matched by the Link selector.

## `pseudoUrls` (type: `array`):

Pseudo-URLs to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Pseudo-URLs will cause the scraper to enqueue all links matched by the Link selector.

## `excludes` (type: `array`):

Glob patterns to match links in the page that you want to exclude from being enqueued.

## `linkSelector` (type: `string`):

CSS selector matching elements with 'href' attributes that should be enqueued. To enqueue URLs from `<div class="my-class" href=...>` tags, you would enter **div.my-class**. Leave empty to ignore all links.

## `clickableElementsSelector` (type: `string`):

For pages where simple 'href' links are not available, this attribute allows you to specify a CSS selector matching elements that the scraper will mouse click after the page function finishes. Any triggered requests, navigations or open tabs will be intercepted and the target URLs will be filtered using Pseudo URLs and/or Glob patterns and subsequently added to the request queue. Leave empty to prevent the scraper from clicking in the page. Using this setting will have a performance impact.

## `keepUrlFragments` (type: `boolean`):

URL fragments (the parts of URL after a <code>#</code>) are not considered when the scraper determines whether a URL has already been visited. This means that when adding URLs such as <code>https://example.com/#foo</code> and <code>https://example.com/#bar</code>, only the first will be visited. Turn this option on to tell the scraper to visit both.
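The effect of this option can be pictured with a simplified model of URL deduplication. This is a sketch for illustration only, not the scraper's actual implementation; the function names `dedupeKey` and `uniqueUrls` are invented for the example.

```javascript
// Simplified model: the key used to decide whether a URL was already visited.
function dedupeKey(rawUrl, keepUrlFragments = false) {
    const url = new URL(rawUrl);
    // Without the option, the fragment is dropped before comparison.
    if (!keepUrlFragments) url.hash = '';
    return url.toString();
}

// Returns the URLs that would be visited, in their original order.
function uniqueUrls(urls, keepUrlFragments = false) {
    const seen = new Set();
    return urls.filter((rawUrl) => {
        const key = dedupeKey(rawUrl, keepUrlFragments);
        if (seen.has(key)) return false;
        seen.add(key);
        return true;
    });
}
```

With the option off, `https://example.com/#foo` and `https://example.com/#bar` share one key, so only the first is kept; with the option on, both are kept.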

## `respectRobotsTxtFile` (type: `boolean`):

If enabled, the crawler will consult the robots.txt file for the target website before crawling each page. At the moment, the crawler does not use any specific user agent identifier. The crawl-delay directive is also not supported yet.

## `pageFunction` (type: `string`):

Function executed for each request

## `proxyConfiguration` (type: `object`):

Specifies proxy servers that will be used by the scraper in order to hide its origin.<br><br>For details, see <a href='https://apify.com/apify/puppeteer-scraper#proxy-configuration' target='_blank' rel='noopener'>Proxy configuration</a> in README.

## `proxyRotation` (type: `string`):

This property indicates the strategy of proxy rotation and can only be used in conjunction with Apify Proxy. The recommended setting automatically picks the best proxies from your available pool and rotates them evenly, discarding proxies that become blocked or unresponsive. If this strategy does not work for you for any reason, you may configure the scraper to either use a new proxy for each request, or to use one proxy as long as possible, until the proxy fails. IMPORTANT: This setting will only use your available Apify Proxy pool, so if you don't have enough proxies for a given task, no rotation setting will produce satisfactory results.

## `sessionPoolName` (type: `string`):

<b>Use only English alphanumeric characters, dashes, and underscores.</b> A session is a representation of a user. It has its own IP and cookies, which are used together to emulate a real user. Usage of the sessions is controlled by the Proxy rotation option. By providing a session pool name, you enable sharing of those sessions across multiple Actor runs. This is very useful when you need specific cookies for accessing the websites or when a lot of your proxies are already blocked. Instead of trying randomly, a list of working sessions will be saved and a new Actor run can reuse those sessions. Note that the IP lock on sessions expires after 24 hours, unless the session is used again in that window.

## `initialCookies` (type: `array`):

The provided cookies will be pre-set to all pages the scraper opens.

## `useChrome` (type: `boolean`):

The scraper will use a real Chrome browser instead of Chromium masquerading as Chrome. Using this option may help with bypassing certain anti-scraping protections, but the scraper may become unstable or stop working altogether.

## `headless` (type: `boolean`):

By default, browsers run in headless mode. You can toggle this off to run them in headful mode, which can help with certain rare anti-scraping protections but is slower and more costly.

## `ignoreSslErrors` (type: `boolean`):

Scraper will ignore SSL certificate errors.

## `ignoreCorsAndCsp` (type: `boolean`):

Scraper will ignore CSP (content security policy) and CORS (cross origin resource sharing) settings of visited pages and requested domains. This enables you to freely use XHR/Fetch to make HTTP requests from the scraper.

## `downloadMedia` (type: `boolean`):

Scraper will download media such as images, fonts, videos and sounds. Disabling this may speed up the scrape, but certain websites could stop working correctly.

## `downloadCss` (type: `boolean`):

Scraper will download CSS stylesheets. Disabling this may speed up the scrape, but certain websites could stop working correctly.

## `maxRequestRetries` (type: `integer`):

Maximum number of times the request for the page will be retried in case of an error. Setting it to 0 means that the request will be attempted once and will not be retried if it fails.

## `maxPagesPerCrawl` (type: `integer`):

Maximum number of pages that the scraper will open. 0 means unlimited.

## `maxResultsPerCrawl` (type: `integer`):

Maximum number of results that will be saved to dataset. The scraper will terminate afterwards. 0 means unlimited.

## `maxCrawlingDepth` (type: `integer`):

Defines how many links away from the Start URLs the scraper will descend. 0 means unlimited.

## `maxConcurrency` (type: `integer`):

Defines how many pages can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. Use this option to set a hard limit.

## `pageLoadTimeoutSecs` (type: `integer`):

Maximum time, in seconds, that the scraper will allow a web page to load.

## `pageFunctionTimeoutSecs` (type: `integer`):

Maximum time, in seconds, that the scraper will wait for the page function to execute.

## `waitUntil` (type: `array`):

The scraper will wait until the selected events are triggered in the page before executing the page function. Available events are <code>domcontentloaded</code>, <code>load</code>, <code>networkidle2</code> and <code>networkidle0</code>. <a href="https://pptr.dev/#?product=Puppeteer&show=api-pagegotourl-options" target="_blank">See Puppeteer docs</a>.

## `preNavigationHooks` (type: `string`):

Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotoOptions`, which are passed to the `page.goto()` function the crawler calls to navigate.

## `postNavigationHooks` (type: `string`):

Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter.

## `closeCookieModals` (type: `boolean`):

Uses the [I don't care about cookies](https://addons.mozilla.org/en-US/firefox/addon/i-dont-care-about-cookies/) browser extension. When on, the crawler will automatically try to dismiss cookie consent modals. This can be useful when crawling European websites that show cookie consent modals.

## `maxScrollHeightPixels` (type: `integer`):

The crawler will scroll down the page until all content is loaded or the maximum scrolling distance is reached. Setting this to `0` disables scrolling altogether.

## `debugLog` (type: `boolean`):

Debug messages will be included in the log. Use <code>context.log.debug('message')</code> to log your own debug messages.

## `browserLog` (type: `boolean`):

Console messages from the Browser will be included in the log. This may result in the log being flooded by error messages, warnings and other messages of little value, especially with high concurrency.

## `customData` (type: `object`):

This object will be available on pageFunction's context as customData.

## `datasetName` (type: `string`):

Name or ID of the dataset that will be used for storing results. If left empty, the default dataset of the run will be used.

## `keyValueStoreName` (type: `string`):

Name or ID of the key-value store that will be used for storing records. If left empty, the default key-value store of the run will be used.

## `requestQueueName` (type: `string`):

Name of the request queue that will be used for storing requests. If left empty, the default request queue of the run will be used.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://crawlee.dev/js"
    }
  ],
  "globs": [
    {
      "glob": "https://crawlee.dev/js/*/*"
    }
  ],
  "pseudoUrls": [],
  "excludes": [
    {
      "glob": "/**/*.{png,jpg,jpeg,pdf}"
    }
  ],
  "linkSelector": "a",
  "keepUrlFragments": false,
  "respectRobotsTxtFile": true,
  "pageFunction": "async function pageFunction(context) {\n    const { page, request, log } = context;\n    const title = await page.title();\n    log.info(`URL: ${request.url} TITLE: ${title}`);\n    return {\n        url: request.url,\n        title\n    };\n}",
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "proxyRotation": "RECOMMENDED",
  "initialCookies": [],
  "useChrome": false,
  "headless": true,
  "ignoreSslErrors": false,
  "ignoreCorsAndCsp": false,
  "downloadMedia": true,
  "downloadCss": true,
  "maxRequestRetries": 3,
  "maxPagesPerCrawl": 0,
  "maxResultsPerCrawl": 0,
  "maxCrawlingDepth": 0,
  "maxConcurrency": 50,
  "pageLoadTimeoutSecs": 60,
  "pageFunctionTimeoutSecs": 60,
  "waitUntil": [
    "networkidle2"
  ],
  "preNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept two arguments: the \"crawlingContext\" object\n// and \"gotoOptions\".\n[\n    async (crawlingContext, gotoOptions) => {\n        const { page } = crawlingContext;\n        // ...\n    },\n]",
  "postNavigationHooks": "// We need to return array of (possibly async) functions here.\n// The functions accept a single argument: the \"crawlingContext\" object.\n[\n    async (crawlingContext) => {\n        const { page } = crawlingContext;\n        // ...\n    },\n]",
  "closeCookieModals": false,
  "maxScrollHeightPixels": 5000,
  "debugLog": false,
  "browserLog": false,
  "customData": {}
}
```
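Of the fields in this example, only `startUrls`, `pageFunction`, and `proxyConfiguration` are marked as required in the OpenAPI specification further down this page. The sketch below checks an input object for them before a run is started; the helper name `findMissingFields` is an assumption made for this example, and the check is far less thorough than the platform's own validation.

```javascript
// Required fields, as listed in the input schema of the OpenAPI specification.
const REQUIRED_FIELDS = ['startUrls', 'pageFunction', 'proxyConfiguration'];

// Returns the names of required fields that are absent from the input.
function findMissingFields(input) {
    return REQUIRED_FIELDS.filter(
        (field) => input[field] === undefined || input[field] === null,
    );
}

// Example: an input that lacks `proxyConfiguration`.
const incompleteInput = {
    startUrls: [{ url: 'https://crawlee.dev/js' }],
    pageFunction: 'async function pageFunction(context) { return {}; }',
};
const missing = findMissingFields(incompleteInput);
```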

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://crawlee.dev/js"
        }
    ],
    "globs": [
        {
            "glob": "https://crawlee.dev/js/*/*"
        }
    ],
    "pseudoUrls": [],
    "excludes": [
        {
            "glob": "/**/*.{png,jpg,jpeg,pdf}"
        }
    ],
    "linkSelector": "a",
    "respectRobotsTxtFile": true,
    "pageFunction": async function pageFunction(context) {
        const { page, request, log } = context;
        const title = await page.title();
        log.info(`URL: ${request.url} TITLE: ${title}`);
        return {
            url: request.url,
            title
        };
    },
    "proxyConfiguration": {
        "useApifyProxy": true
    },
    "initialCookies": [],
    "waitUntil": [
        "networkidle2"
    ],
    "preNavigationHooks": `// We need to return array of (possibly async) functions here.
// The functions accept two arguments: the "crawlingContext" object
// and "gotoOptions".
[
    async (crawlingContext, gotoOptions) => {
        const { page } = crawlingContext;
        // ...
    },
]`,
    "postNavigationHooks": `// We need to return array of (possibly async) functions here.
// The functions accept a single argument: the "crawlingContext" object.
[
    async (crawlingContext) => {
        const { page } = crawlingContext;
        // ...
    },
]`,
    "customData": {}
};

// Run the Actor and wait for it to finish
const run = await client.actor("apify/puppeteer-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://crawlee.dev/js" }],
    "globs": [{ "glob": "https://crawlee.dev/js/*/*" }],
    "pseudoUrls": [],
    "excludes": [{ "glob": "/**/*.{png,jpg,jpeg,pdf}" }],
    "linkSelector": "a",
    "respectRobotsTxtFile": True,
    "pageFunction": """async function pageFunction(context) {
    const { page, request, log } = context;
    const title = await page.title();
    log.info(`URL: ${request.url} TITLE: ${title}`);
    return {
        url: request.url,
        title
    };
}""",
    "proxyConfiguration": { "useApifyProxy": True },
    "initialCookies": [],
    "waitUntil": ["networkidle2"],
    "preNavigationHooks": """// We need to return array of (possibly async) functions here.
// The functions accept two arguments: the \"crawlingContext\" object
// and \"gotoOptions\".
[
    async (crawlingContext, gotoOptions) => {
        const { page } = crawlingContext;
        // ...
    },
]""",
    "postNavigationHooks": """// We need to return array of (possibly async) functions here.
// The functions accept a single argument: the \"crawlingContext\" object.
[
    async (crawlingContext) => {
        const { page } = crawlingContext;
        // ...
    },
]""",
    "customData": {},
}

# Run the Actor and wait for it to finish
run = client.actor("apify/puppeteer-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://crawlee.dev/js"
    }
  ],
  "globs": [
    {
      "glob": "https://crawlee.dev/js/*/*"
    }
  ],
  "pseudoUrls": [],
  "excludes": [
    {
      "glob": "/**/*.{png,jpg,jpeg,pdf}"
    }
  ],
  "linkSelector": "a",
  "respectRobotsTxtFile": true,
  "pageFunction": "async function pageFunction(context) {\\n    const { page, request, log } = context;\\n    const title = await page.title();\\n    log.info(`URL: ${request.url} TITLE: ${title}`);\\n    return {\\n        url: request.url,\\n        title\\n    };\\n}",
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "initialCookies": [],
  "waitUntil": [
    "networkidle2"
  ],
  "preNavigationHooks": "// We need to return array of (possibly async) functions here.\\n// The functions accept two arguments: the \\"crawlingContext\\" object\\n// and \\"gotoOptions\\".\\n[\\n    async (crawlingContext, gotoOptions) => {\\n        const { page } = crawlingContext;\\n        // ...\\n    },\\n]",
  "postNavigationHooks": "// We need to return array of (possibly async) functions here.\\n// The functions accept a single argument: the \\"crawlingContext\\" object.\\n[\\n    async (crawlingContext) => {\\n        const { page } = crawlingContext;\\n        // ...\\n    },\\n]",
  "customData": {}
}' |
apify call apify/puppeteer-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=apify/puppeteer-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Puppeteer Scraper",
        "description": "Crawls websites with the headless Chrome and Puppeteer library using a provided server-side Node.js code. This crawler is an alternative to apify/web-scraper that gives you finer control over the process. Supports both recursive crawling and list of URLs. Supports login to website.",
        "version": "3.0",
        "x-build-id": "g6G5r98rF5fM6ecm3"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/apify~puppeteer-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-apify-puppeteer-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/apify~puppeteer-scraper/runs": {
            "post": {
                "operationId": "runs-sync-apify-puppeteer-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/apify~puppeteer-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-apify-puppeteer-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls",
                    "pageFunction",
                    "proxyConfiguration"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "URLs to start with",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "globs": {
                        "title": "Glob Patterns",
                        "type": "array",
                        "description": "Glob patterns to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Glob patterns will cause the scraper to enqueue all links matched by the Link selector.",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "glob"
                            ],
                            "properties": {
                                "glob": {
                                    "type": "string",
                                    "title": "Glob of a web page"
                                }
                            }
                        }
                    },
                    "pseudoUrls": {
                        "title": "Pseudo-URLs",
                        "type": "array",
                        "description": "Pseudo-URLs to match links in the page that you want to enqueue. Combine with Link selector to tell the scraper where to find links. Omitting the Pseudo-URLs will cause the scraper to enqueue all links matched by the Link selector.",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "purl"
                            ],
                            "properties": {
                                "purl": {
                                    "type": "string",
                                    "title": "Pseudo-URL of a web page"
                                }
                            }
                        }
                    },
                    "excludes": {
                        "title": "Exclude Glob Patterns",
                        "type": "array",
                        "description": "Glob patterns to match links in the page that you want to exclude from being enqueued.",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "glob"
                            ],
                            "properties": {
                                "glob": {
                                    "type": "string",
                                    "title": "Glob of a web page"
                                }
                            }
                        }
                    },
                    "linkSelector": {
                        "title": "Link selector",
                        "type": "string",
                        "description": "CSS selector matching elements with 'href' attributes that should be enqueued. To enqueue urls from <code><div class=\"my-class\" href=...></code> tags, you would enter <strong>div.my-class</strong>. Leave empty to ignore all links."
                    },
                    "clickableElementsSelector": {
                        "title": "Clickable elements selector",
                        "type": "string",
                        "description": "For pages where simple 'href' links are not available, this attribute allows you to specify a CSS selector matching elements that the scraper will mouse click after the page function finishes. Any triggered requests, navigations or open tabs will be intercepted and the target URLs will be filtered using Pseudo URLs and/or Glob patterns and subsequently added to the request queue. Leave empty to prevent the scraper from clicking in the page. Using this setting will have a performance impact."
                    },
                    "keepUrlFragments": {
                        "title": "Keep URL fragments",
                        "type": "boolean",
                        "description": "URL fragments (the part of a URL after the <code>#</code>) are not considered when the scraper determines whether a URL has already been visited. This means that when adding URLs such as <code>https://example.com/#foo</code> and <code>https://example.com/#bar</code>, only the first will be visited. Turn this option on to tell the scraper to visit both.",
                        "default": false
                    },
                    "respectRobotsTxtFile": {
                        "title": "Respect the robots.txt file",
                        "type": "boolean",
                        "description": "If enabled, the crawler will consult the robots.txt file for the target website before crawling each page. At the moment, the crawler does not use any specific user agent identifier. The crawl-delay directive is also not supported yet.",
                        "default": false
                    },
                    "pageFunction": {
                        "title": "Page function",
                        "type": "string",
                        "description": "Function executed for each request"
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Specifies proxy servers that will be used by the scraper in order to hide its origin.<br><br>For details, see <a href='https://apify.com/apify/puppeteer-scraper#proxy-configuration' target='_blank' rel='noopener'>Proxy configuration</a> in README.",
                        "default": {
                            "useApifyProxy": true
                        }
                    },
                    "proxyRotation": {
                        "title": "Proxy rotation",
                        "enum": [
                            "RECOMMENDED",
                            "PER_REQUEST",
                            "UNTIL_FAILURE"
                        ],
                        "type": "string",
                        "description": "This property indicates the strategy of proxy rotation and can only be used in conjunction with Apify Proxy. The recommended setting automatically picks the best proxies from your available pool and rotates them evenly, discarding proxies that become blocked or unresponsive. If this strategy does not work for you for any reason, you may configure the scraper to either use a new proxy for each request, or to use one proxy as long as possible, until the proxy fails. IMPORTANT: This setting will only use your available Apify Proxy pool, so if you don't have enough proxies for a given task, no rotation setting will produce satisfactory results.",
                        "default": "RECOMMENDED"
                    },
                    "sessionPoolName": {
                        "title": "Session pool name",
                        "pattern": "[0-9A-z-]",
                        "minLength": 3,
                        "maxLength": 200,
                        "type": "string",
                        "description": "<b>Use only English alphanumeric characters, dashes, and underscores.</b> A session is a representation of a user. It has its own IP and cookies, which are then used together to emulate a real user. Usage of the sessions is controlled by the Proxy rotation option. By providing a session pool name, you enable sharing of those sessions across multiple Actor runs. This is very useful when you need specific cookies for accessing the websites or when a lot of your proxies are already blocked. Instead of trying randomly, a list of working sessions will be saved and a new Actor run can reuse those sessions. Note that the IP lock on sessions expires after 24 hours, unless the session is used again in that window."
                    },
                    "initialCookies": {
                        "title": "Initial cookies",
                        "type": "array",
                        "description": "The provided cookies will be pre-set to all pages the scraper opens.",
                        "default": []
                    },
                    "useChrome": {
                        "title": "Use Chrome",
                        "type": "boolean",
                        "description": "The scraper will use a real Chrome browser instead of a Chromium masquerading as Chrome. Using this option may help with bypassing certain anti-scraping protections, but risks that the scraper will be unstable or not work at all.",
                        "default": false
                    },
                    "headless": {
                        "title": "Run browsers in headless mode",
                        "type": "boolean",
                        "description": "By default, browsers run in headless mode. You can toggle this off to run them in headful mode, which can help with certain rare anti-scraping protections but is slower and more costly.",
                        "default": true
                    },
                    "ignoreSslErrors": {
                        "title": "Ignore SSL errors",
                        "type": "boolean",
                        "description": "Scraper will ignore SSL certificate errors.",
                        "default": false
                    },
                    "ignoreCorsAndCsp": {
                        "title": "Ignore CORS and CSP",
                        "type": "boolean",
                        "description": "Scraper will ignore CSP (content security policy) and CORS (cross origin resource sharing) settings of visited pages and requested domains. This enables you to freely use XHR/Fetch to make HTTP requests from the scraper.",
                        "default": false
                    },
                    "downloadMedia": {
                        "title": "Download media",
                        "type": "boolean",
                        "description": "Scraper will download media such as images, fonts, videos and sounds. Disabling this may speed up the scrape, but certain websites could stop working correctly.",
                        "default": true
                    },
                    "downloadCss": {
                        "title": "Download CSS",
                        "type": "boolean",
                        "description": "Scraper will download CSS stylesheets. Disabling this may speed up the scrape, but certain websites could stop working correctly.",
                        "default": true
                    },
                    "maxRequestRetries": {
                        "title": "Max request retries",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum number of times the request for the page will be retried in case of an error. Setting it to 0 means that the request will be attempted once and will not be retried if it fails.",
                        "default": 3
                    },
                    "maxPagesPerCrawl": {
                        "title": "Max pages per run",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum number of pages that the scraper will open. 0 means unlimited.",
                        "default": 0
                    },
                    "maxResultsPerCrawl": {
                        "title": "Max result records",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum number of results that will be saved to dataset. The scraper will terminate afterwards. 0 means unlimited.",
                        "default": 0
                    },
                    "maxCrawlingDepth": {
                        "title": "Max crawling depth",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Defines how many links away from the Start URLs the scraper will descend. 0 means unlimited.",
                        "default": 0
                    },
                    "maxConcurrency": {
                        "title": "Max concurrency",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Defines how many pages can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. Use this option to set a hard limit.",
                        "default": 50
                    },
                    "pageLoadTimeoutSecs": {
                        "title": "Page load timeout",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum time, in seconds, that the scraper will allow a web page to load.",
                        "default": 60
                    },
                    "pageFunctionTimeoutSecs": {
                        "title": "Page function timeout",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum time, in seconds, that the scraper will wait for the page function to execute.",
                        "default": 60
                    },
                    "waitUntil": {
                        "title": "Navigation wait until",
                        "type": "array",
                        "description": "The scraper will wait until the selected events are triggered in the page before executing the page function. Available events are <code>domcontentloaded</code>, <code>load</code>, <code>networkidle2</code> and <code>networkidle0</code>. <a href=\"https://pptr.dev/#?product=Puppeteer&show=api-pagegotourl-options\" target=\"_blank\">See Puppeteer docs</a>.",
                        "default": [
                            "networkidle2"
                        ]
                    },
                    "preNavigationHooks": {
                        "title": "Pre-navigation hooks",
                        "type": "string",
                        "description": "Async functions that are sequentially evaluated before the navigation. Good for setting additional cookies or browser properties before navigation. The function accepts two parameters, `crawlingContext` and `gotoOptions`, which are passed to the `page.goto()` function the crawler calls to navigate."
                    },
                    "postNavigationHooks": {
                        "title": "Post-navigation hooks",
                        "type": "string",
                        "description": "Async functions that are sequentially evaluated after the navigation. Good for checking if the navigation was successful. The function accepts `crawlingContext` as the only parameter."
                    },
                    "closeCookieModals": {
                        "title": "Dismiss cookie modals",
                        "type": "boolean",
                        "description": "Uses the [I don't care about cookies](https://addons.mozilla.org/en-US/firefox/addon/i-dont-care-about-cookies/) browser extension. When enabled, the crawler will automatically try to dismiss cookie consent modals. This can be useful when crawling European websites that show cookie consent modals.",
                        "default": false
                    },
                    "maxScrollHeightPixels": {
                        "title": "Maximum scrolling distance in pixels",
                        "type": "integer",
                        "description": "The crawler will scroll down the page until all content is loaded or the maximum scrolling distance is reached. Setting this to `0` disables scrolling altogether.",
                        "default": 5000
                    },
                    "debugLog": {
                        "title": "Debug log",
                        "type": "boolean",
                        "description": "Debug messages will be included in the log. Use <code>context.log.debug('message')</code> to log your own debug messages.",
                        "default": false
                    },
                    "browserLog": {
                        "title": "Browser log",
                        "type": "boolean",
                        "description": "Console messages from the Browser will be included in the log. This may result in the log being flooded by error messages, warnings and other messages of little value, especially with high concurrency.",
                        "default": false
                    },
                    "customData": {
                        "title": "Custom data",
                        "type": "object",
                        "description": "This object will be available on pageFunction's context as customData.",
                        "default": {}
                    },
                    "datasetName": {
                        "title": "Dataset name",
                        "type": "string",
                        "description": "Name or ID of the dataset that will be used for storing results. If left empty, the default dataset of the run will be used."
                    },
                    "keyValueStoreName": {
                        "title": "Key-value store name",
                        "type": "string",
                        "description": "Name or ID of the key-value store that will be used for storing records. If left empty, the default key-value store of the run will be used."
                    },
                    "requestQueueName": {
                        "title": "Request queue name",
                        "type": "string",
                        "description": "Name of the request queue that will be used for storing requests. If left empty, the default request queue of the run will be used."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
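
The input schema above carries a handful of machine-checkable constraints (integer minimums, the `proxyRotation` enum, the `sessionPoolName` length limits, the required `glob` key in `excludes` items). The following is a minimal sketch of checking an input object against those constraints locally before starting a run. The helper name `check_input` and the example values are hypothetical, and the list of `waitUntil` events is taken from the field's description rather than from a schema enum, so treat it as an assumption.

```python
# Constraints transcribed from the input schema above.
PROXY_ROTATION_VALUES = {"RECOMMENDED", "PER_REQUEST", "UNTIL_FAILURE"}
# Assumption: event names come from the waitUntil description, not a schema enum.
WAIT_UNTIL_EVENTS = {"domcontentloaded", "load", "networkidle2", "networkidle0"}
INTEGER_MINIMUMS = {
    "maxRequestRetries": 0,
    "maxPagesPerCrawl": 0,
    "maxResultsPerCrawl": 0,
    "maxCrawlingDepth": 0,
    "maxConcurrency": 1,
    "pageLoadTimeoutSecs": 1,
    "pageFunctionTimeoutSecs": 1,
}


def check_input(run_input):
    """Return a list of problems found; an empty list means none were found."""
    problems = []

    for field, minimum in INTEGER_MINIMUMS.items():
        value = run_input.get(field)
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < minimum:
            problems.append(f"{field} must be an integer >= {minimum}")

    rotation = run_input.get("proxyRotation")
    if rotation is not None and rotation not in PROXY_ROTATION_VALUES:
        problems.append(f"proxyRotation must be one of {sorted(PROXY_ROTATION_VALUES)}")

    for event in run_input.get("waitUntil", []):
        if event not in WAIT_UNTIL_EVENTS:
            problems.append(f"waitUntil contains unknown event {event!r}")

    pool_name = run_input.get("sessionPoolName")
    if pool_name is not None and not 3 <= len(pool_name) <= 200:
        problems.append("sessionPoolName must be 3 to 200 characters long")

    for index, item in enumerate(run_input.get("excludes", [])):
        if not isinstance(item, dict) or "glob" not in item:
            problems.append(f"excludes[{index}] must be an object with a 'glob' key")

    return problems


# Hypothetical input: every value here is illustrative only.
example_input = {
    "linkSelector": "a[href]",
    "excludes": [{"glob": "https://example.com/admin/**"}],
    "proxyRotation": "PER_REQUEST",
    "maxConcurrency": 0,  # below the schema minimum of 1
    "waitUntil": ["networkidle2"],
}
print(check_input(example_input))
```

This check covers only the constraints visible in this part of the schema; the platform performs its own validation when a run is started, so a local check like this is a convenience, not a replacement.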

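The `runsResponseSchema` above describes the object returned when a run is started: the run itself sits under `data`, with identifiers such as `defaultDatasetId` and cost figures under `usageUsd` and `usageTotalUsd`. Below is a minimal sketch of pulling the commonly needed fields out of such a response. The function name `summarize_run` is hypothetical, and the sample response is hand-written from the example values in the schema (the `id` and `defaultDatasetId` values are placeholders), not output captured from a real run.

```python
def summarize_run(response):
    """Extract a few fields from a response shaped like runsResponseSchema."""
    run = response["data"]
    usage_usd = run.get("usageUsd", {})
    return {
        "id": run.get("id"),
        "status": run.get("status"),
        "dataset_id": run.get("defaultDatasetId"),
        # Sum of the per-item costs listed under usageUsd.
        "usage_by_item_usd": sum(usage_usd.values()),
        "usage_total_usd": run.get("usageTotalUsd"),
    }


# Hand-written sample using the example values from the schema; IDs are placeholders.
sample_response = {
    "data": {
        "id": "RUN_ID_PLACEHOLDER",
        "status": "READY",
        "defaultDatasetId": "DATASET_ID_PLACEHOLDER",
        "usageTotalUsd": 0.00005,
        "usageUsd": {
            "ACTOR_COMPUTE_UNITS": 0,
            "DATASET_READS": 0,
            "DATASET_WRITES": 0,
            "KEY_VALUE_STORE_WRITES": 0.00005,
        },
    }
}

summary = summarize_run(sample_response)
print(summary["status"], summary["dataset_id"], summary["usage_total_usd"])
```

Using `.get()` for everything below `data` keeps the sketch tolerant of missing fields, since the schema does not mark any of them as required.
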