# Website Content Crawler (`apify/website-content-crawler`) Actor

Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.

- **URL**: https://apify.com/apify/website-content-crawler.md
- **Developed by:** [Apify](https://apify.com/apify) (Apify)
- **Categories:** AI, Developer tools
- **Stats:** 133,105 total users, 7,689 monthly users, 99.5% runs succeeded, 2,580 bookmarks
- **User rating**: 4.58 out of 5 stars

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper with higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

Website Content Crawler is an [Apify Actor](https://docs.apify.com/platform/actors) that can perform
a deep crawl of one or more websites and extract text content and files from the web pages.

It is useful for extracting web data from websites such as documentation,
knowledge bases, help sites, or blogs for feeding large language models (LLMs) and AI applications.

Website Content Crawler has a simple input configuration so that it can be easily integrated into customer-facing products, where customers
can enter just a URL of the website they want to have indexed by an AI application.
You can retrieve the results using the API to formats such as JSON or CSV,
which can be fed directly to your LLM,
[vector database](https://blog.apify.com/what-is-a-vector-database/), or RAG pipeline.

![website-content-crawler-diagram.png](https://apify-uploads-prod.s3.us-east-1.amazonaws.com/9fBHq4FpHxdWY7r5c-issue-pndQjRHs9VMpSH6g2-9fiVpyyjNk-Website_content_crawler_diagram.png)

### Main features

Website Content Crawler is built upon [Crawlee](https://crawlee.dev/), Apify's state-of-the-art
open-source library for web scraping and crawling. The Actor can:

- Crawl JavaScript-enabled websites using **headless Firefox** or simple sites using **raw HTTP**.
- Circumvent **anti-scraping protections** using browser fingerprinting and proxies.
- Save web content in plain text, **Markdown**, or HTML.
- Crawl **pages behind a login** by providing cookies.
- **Download files** in PDF, DOC, DOCX, XLS, XLSX, or CSV formats.
- **Remove fluff** from pages like navigation, headers, footers, ads, modals, or cookie warnings to improve the accuracy of the data.
- Load content of pages with **infinite scroll**.
- Use sitemaps to find more URLs on the website.
- **Scale gracefully** from tiny sites to sites with millions of pages by leveraging the Apify platform capabilities.
- Integrate with **🦜🔗LangChain**, **LlamaIndex**, **Haystack**, **Pinecone**, **Qdrant**, or **OpenAI Assistant**.
- and much more...

Learn about the key features and capabilities in the **Website Content Crawler Overview** video:

[Introducing Website Content Crawler](https://www.youtube.com/watch?v=vUMPfIOfXXQ)

Still unsure if the Website Content Crawler can handle your use case? Simply try it for free and see the results for yourself.


### Designed for generative AI and LLMs

The results of Website Content Crawler can help you feed, fine-tune or train your large language models (LLMs)
or provide context for prompts for ChatGPT.
In return, the model will answer questions based on your or your customer's websites and content.

To learn more, check out our **Web Scraping Data for Generative AI** video on this topic, showcasing the Website Content Crawler:

[Web Scraping Data for Generative AI webinar](https://www.youtube.com/watch?v=8uvHH-ocSes)

**Custom chatbots for customer support**

Customer service chatbots personalized on customer websites, such as documentation or knowledge bases,
are one of the most promising use cases of AI and LLMs. Let your
customers easily onboard by typing the URL of their site, and thus give your chatbot detailed
knowledge of their product or service.
Learn more about this use case in our [blog post](https://blog.apify.com/talk-to-your-website-with-large-language-models/).

**Generate personalized content based on customer’s copy**

ChatGPT and LLMs can write articles for you, but they won’t sound like you wrote them. Feed all your old blogs into your
model to make it sound like you. Alternatively, train the model on your customers’ blogs and have it write in their tone of voice.
Or help their technical writers with making first drafts of new documentation pages.

**Retrieval Augmented Generation (RAG) use cases**

Use your website content to create an all-knowing AI assistant. The LLM-powered bot can then answer questions based on your website content,
or even generate new content based on the existing one. This is a great way to provide a personalized experience to your customers
or help your employees find the information they need faster.

**Summarization, translation, proofreading at scale**

Got some old docs or blogs that need to be improved? Use Website Content Crawler to scrape the content, feed it to the ChatGPT API,
and ask it to summarize, proofread, translate, or change the style of the content.

**Enhance your custom GPTs**

Uploading knowledge files gives custom OpenAI GPTs reliable information to refer to when generating answers. With Website Content Crawler, you can scrape data from any website to [provide your GPT with custom knowledge](https://blog.apify.com/custom-gpts-knowledge/).


### How does it work?

Website Content Crawler operates in three stages:

1) **Crawling** - Finds and downloads the right web pages.
2) **HTML processing** - Transforms the DOM of crawled pages to e.g. remove navigation, header, footer, cookie warnings, and other fluff.
3) **Output** - Converts the resulting DOM to plain text or Markdown and saves downloaded files.

For clarity, the input settings of the Actor are organized according to the above stages. Note that input settings
have reasonable defaults—the only mandatory setting is the **Start URLs**.

#### Crawling

Website Content Crawler only needs one or more **Start URLs** to run, typically the top-level URL of the documentation site, blog, or
knowledge base that you want to scrape. The Actor crawls the start URLs, finds links to other pages,
and recursively crawls those pages, too, as long as their URL is under the start URL.

For example, if you enter the start URL `https://example.com/blog/`, the
Actor will crawl pages like `https://example.com/blog/article-1` or `https://example.com/blog/section/article-2`,
but will skip pages like `https://example.com/docs/something-else`.

You can also force the crawler to skip certain URLs using the **Exclude URLs (globs)** input setting,
which specifies an array of glob patterns matching URLs of pages to be skipped.
Note that this setting affects only links found on pages, but not **Start URLs**, which are always crawled.
For example, `https://{store,docs}.example.com/**` will exclude all URLs starting with
`https://store.example.com/` and `https://docs.example.com/`.
Or `https://example.com/**/*\?*foo=*` excludes all URLs that contain the `foo` query parameter with any value.
You can learn more about globs and test them [here](https://www.digitalocean.com/community/tools/glob?comments=true&glob=https%3A%2F%2Fexample.com%2Fdont_scrape_this%2F%2A%2A&matches=false&tests=https%3A%2F%2Fexample.com%2Ftools%2F&tests=https%3A%2F%2Fexample.com%2Fdont_scrape_this%2F&tests=https%3A%2F%2Fexample.com%2Fdont_scrape_this%2F123%3Ftest%3Dabc&tests=https%3A%2F%2Fexample.com%2Fscrape_this).

The Actor automatically skips duplicate pages identified by the same [canonical URL](https://en.wikipedia.org/wiki/Canonical_link_element);
those pages are loaded and counted towards the _Max pages_ limit but not saved to the results.

##### Crawler types

Website Content Crawler provides various input settings to customize the crawling.
For example, you can select the crawler type:
- **Adaptive switching between browser and raw HTTP client** (default) - The crawler automatically switches between raw HTTP for static pages and Firefox browser (via Playwright) for dynamic pages, to get the maximum performance wherever possible.
- **Headless web browser** - Useful for modern websites with
  anti-scraping protections and JavaScript rendering. It recognizes common blocking patterns like CAPTCHAs and
  automatically retries blocked requests through new sessions. However, running web browsers is more expensive as it
  requires more computing resources and is slower.
- **Raw HTTP client** - High-performance crawling mode that uses raw HTTP requests to fetch the pages.
  It is faster and cheaper, but it might not work on all websites.
- **Raw HTTP client with JS execution (JSDOM)** (deprecated) - A compromise between a browser and raw HTTP crawlers. This crawler type is deprecated, use **Raw HTTP client (Cheerio)** instead.
- **Headless browser (Chrome+Playwright)** (deprecated) - This crawler type is deprecated, use **Headless browser (Firefox+Playwright)** instead.

You can also set additional input parameters such as a maximum number of pages, maximum crawling depth,
maximum concurrency, proxy configuration, timeout, etc., to control the behavior and performance of the Actor.

#### HTML processing

The goal of the HTML processing step is to ensure each web page has the right content — neither less nor more.

If you're using a headless browser **Crawler type**,
whenever a web page is loaded,
the Actor can wait a certain time or scroll to a certain height
to ensure all dynamic page content is loaded, using the **Wait for dynamic content** or **Maximum scroll height** input settings, respectively.
If **Expand clickable elements** is enabled, the Actor tries to click various DOM
elements to ensure their content is expanded and visible in the resulting text.

Once the web page is ready, the Actor
transforms its DOM to remove irrelevant content, so that you feed your AI models only relevant data
and keep them accurate.

First, the Actor removes DOM nodes matching the **Remove HTML elements (CSS selector)**. The provided default value attempts
to remove all common types of modals, navigation, headers, or footers, as well as scripts and inline images
to reduce the output HTML size.

If the **Extract HTML elements (CSS selector)** option is specified, the Actor only keeps the contents of the elements targeted by this CSS selector and removes all the other HTML elements from the DOM.

Then, if **Remove cookie warnings** is enabled,
the Actor removes cookie warnings using the [I don't care about cookies](https://addons.mozilla.org/en-US/firefox/addon/i-dont-care-about-cookies/) browser extension.

Finally, the Actor transforms the page using the selected **HTML transformer**, whose goal is to only keep the important content of the page and reduce
its complexity before converting it to text. Basically, to keep just the "meat" of the article or a page.

#### File download

If the `Save files` option is set, the Actor will download "document" files linked from the page. This is limited to PDF, DOC, DOCX, XLS, and XLSX files.

Note that these files are exempt from the URL scoping rules - any file linked from the page will be downloaded, regardless of its URL.
You can change this behaviour by using the `Include / Exclude URLs (globs)` setting.

The hard limit for a single file download time is 1 hour. If the download of a single file takes longer, the Actor will abort the download.

Furthermore, you can specify the minimal file download speed in kilobytes per second. If the download speed stays below this threshold for more than 10 seconds, the download will be aborted.
This is configurable via the `Minimal file download speed (KB/s)` setting.

#### Output

Once the web page HTML is processed, the Actor converts it to the desired output format: plain text, or Markdown to preserve rich formatting.
It can also save the full HTML or a screenshot of the page, which is useful for debugging.
The Actor also saves important metadata about the content, such as author, language, publishing date, etc.

The results of the Actor are stored in the default [Dataset](https://docs.apify.com/platform/storage/dataset) associated
with the Actor run, from where you can access them via API and export them to formats like JSON, XML, or CSV.


### Example

This example shows how to scrape all pages from the Apify documentation at https://docs.apify.com/:

#### Input


![input-screenshot.png](https://apify-uploads-prod.s3.amazonaws.com/d9f0eb8b-ae01-42be-8629-39e659a14ed6_website_content_crawler_input_example.png)

[See full input](https://apify.com/apify/website-content-crawler/input-schema) with description.

#### Output

This is how one crawled page (https://docs.apify.com/academy/web-scraping-for-beginners) looks in a browser:

![page-screenshot.png](https://apify-uploads-prod.s3.amazonaws.com/399ddefd-3877-41e7-86ed-4025cdea46f8_Screenshot2023-03-29at10.56.57.png)

And here is how the crawling result looks in JSON format (note that other formats like CSV or Excel are also supported).
The main page content can be found in the `text` field, and it only contains the valuable
content, without menus and other noise:

```json
{
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "crawl": {
        "loadedUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "loadedTime": "2023-04-05T16:26:51.030Z",
        "referrerUrl": "https://docs.apify.com/academy",
        "depth": 0
    },
    "metadata": {
        "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners",
        "title": "Web scraping for beginners | Apify Documentation",
        "description": "Learn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.",
        "author": null,
        "keywords": null,
        "languageCode": "en"
    },
    "screenshotUrl": null,
    "text": "Skip to main content\nOn this page\nWeb scraping for beginners\nLearn how to develop web scrapers with this comprehensive and practical course. Go from beginner to expert, all in one place.\nWelcome to Web scraping for beginners...",
    "html": null,
    "markdown": "  Web scraping for beginners | Apify Documentation       \n\n[Skip to main content](#docusaurus_skipToContent_fallback)\n\nOn this page\n\n# Web scraping for beginners..."
}
````
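As a rough illustration of consuming such a result item in Python, the sketch below reduces an item to its content and source URL. It assumes the item has the shape shown above; the helper name `item_to_document` is hypothetical and not part of any Apify library.

```python
from typing import Any


def item_to_document(item: dict[str, Any]) -> dict[str, str]:
    """Reduce a crawler result item to its content and source URL."""
    # Prefer Markdown (keeps headings and links); fall back to plain text.
    content = item.get("markdown") or item.get("text") or ""
    metadata = item.get("metadata") or {}
    # Use the canonical URL when the page reports one, else the crawled URL.
    source = metadata.get("canonicalUrl") or item["url"]
    return {"content": content.strip(), "source": source}


# Sample item, abbreviated from the JSON output above.
item = {
    "url": "https://docs.apify.com/academy/web-scraping-for-beginners",
    "metadata": {
        "canonicalUrl": "https://docs.apify.com/academy/web-scraping-for-beginners"
    },
    "text": "Web scraping for beginners\nLearn how to develop web scrapers...",
    "markdown": None,
}
doc = item_to_document(item)
print(doc["source"])
```

Whether to prefer `markdown` or `text` depends on what your downstream model handles best; the fallback order here is only one reasonable choice.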

### Integration with the AI ecosystem

Thanks to the native [Apify platform integrations](https://docs.apify.com/platform/integrations),
Website Content Crawler can seamlessly connect with various third-party
systems and tools.

#### Exporting GPT knowledge files

Apify allows you to seamlessly export the results of Website Content Crawler runs to your custom GPTs.

To do this, go to the Output tab of the Actor run and click the "Export results" button. From here, pick `JSON` and click "Export". You can then upload the JSON file to your custom GPTs.

For a step-by-step guide, see [How to add a knowledge base to your GPTs](https://blog.apify.com/custom-gpts-knowledge/#how-to-add-knowledge-to-gpts-step-by-step-guide).

#### LangChain integration

[LangChain](https://github.com/hwchase17/langchain) is a popular framework for
developing applications powered by language models.
It provides an [integration for Apify](https://python.langchain.com/en/latest/modules/agents/tools/examples/apify.html),
so you can feed Actor results directly to LangChain’s vector databases,
enabling you to easily create ChatGPT-like query interfaces to
websites with documentation, knowledge base, blog, etc.

##### Python example

First, install LangChain with OpenAI LLM and Apify API client for Python:

```bash
pip install apify-client langchain langchain_community langchain_openai openai tiktoken
```

And then create a ChatGPT-powered answering machine:

```python
import os

from langchain.indexes import VectorstoreIndexCreator
from langchain_community.utilities import ApifyWrapper
from langchain_core.documents import Document
from langchain_openai import OpenAI
from langchain_openai.embeddings import OpenAIEmbeddings

# Set up your Apify API token and OpenAI API key
os.environ["OPENAI_API_KEY"] = "Your OpenAI API key"
os.environ["APIFY_API_TOKEN"] = "Your Apify API token"

apify = ApifyWrapper()

# Run the Website Content Crawler on a website, wait for it to finish, and save its results into a LangChain document loader:
loader = apify.call_actor(
    actor_id="apify/website-content-crawler",
    run_input={"startUrls": [{"url": "https://docs.apify.com/"}], "maxCrawlPages": 10},
    dataset_mapping_function=lambda item: Document(
        page_content=item["text"] or "", metadata={"source": item["url"]}
    ),
)
# Initialize the vector database with the text documents:
index = VectorstoreIndexCreator(embedding=OpenAIEmbeddings()).from_loaders([loader])

# Finally, query the vector database:
query = "What is Apify?"
result = index.query_with_sources(query, llm=OpenAI())

print("answer:", result["answer"])
print("source:", result["sources"])
```

The query produces an answer like this:

> *Apify is a platform for developing, running, and sharing serverless cloud programs. It enables users to create web scraping and automation tools and publish them on the Apify platform.*
>
> https://docs.apify.com/platform/actors, https://docs.apify.com/platform/actors/running/actors-in-store, https://docs.apify.com/platform/security, https://docs.apify.com/platform/actors/examples

For details and a Jupyter notebook, see [Apify integration for LangChain](https://python.langchain.com/docs/integrations/tools/apify/).

##### Node.js example

See [detailed example](https://js.langchain.com/docs/modules/indexes/document_loaders/examples/web_loaders/apify_dataset) in LangChain for JavaScript.

#### LlamaIndex integration

[LlamaIndex](https://docs.llamaindex.ai/) is a Python library that provides a central interface to connect LLMs with external data.
The [Apify integration](https://llamahub.ai/l/readers/llama-index-readers-apify?from=) makes it easy to feed LlamaIndex applications with data crawled from the web.

Install all required packages:

```bash
pip install apify-client llama-index-core llama-index-readers-apify
```

```python
from llama_index.core import Document
from llama_index.readers.apify import ApifyActor

reader = ApifyActor("<My Apify API token>")

documents = reader.load_data(
    actor_id="apify/website-content-crawler",
    run_input={
        "startUrls": [{"url": "https://docs.llamaindex.ai/en/latest/"}]
    },
    dataset_mapping_function=lambda item: Document(
        text=item.get("text"),
        metadata={
            "url": item.get("url"),
        },
    ),
)
```

#### Vector database integrations (Pinecone, Qdrant)

Website Content Crawler can be easily integrated with vector databases to store the crawled data for semantic search.
Using Apify's [Pinecone](https://apify.com/apify/pinecone-integration) or [Qdrant](https://apify.com/apify/qdrant-integration) integration Actors, you can upload the results of Website Content Crawler directly into a vector database.
The integrations support incremental updates, updating only the data that has changed since the last crawl.
This helps to reduce costly embedding computation and storage operations, making it suitable for regular updates of large websites.
Just set up the Pinecone integration Actor with Website Content Crawler using this [step-by-step guide](https://docs.apify.com/platform/integrations/pinecone).

#### GPT integration

You can use Website Content Crawler to add knowledge to your GPTs. Crawl a website and upload the scraped dataset to your custom GPT.
The video tutorial below demonstrates how it works.

[How to add a knowledge base to your GPTs](https://www.youtube.com/watch?v=z552gt-3Ce0)

You can also use the Website Content Crawler together with the OpenAI Assistant to update its knowledge base with web content using the [OpenAI VectorStore Integration](https://apify.com/jiri.spilka/openai-vector-store-integration).

### How much does it cost?

Website Content Crawler is free to use—you only pay for the Apify platform usage consumed by the Actor.
The exact price depends on the crawler type and settings, website complexity, network speed,
and other unpredictable factors.

The main cost driver of Website Content Crawler is the compute power, which is measured in Actor compute units (CU):
1 CU corresponds to an Actor with 1 GB of
memory running for 1 hour. With the baseline price of $0.25/CU, our tests show that Actor usage costs **approximately**:

- $0.5 - $5 per 1,000 web pages with a headless browser, depending on the website
- $0.2 per 1,000 web pages with raw HTTP crawler

Note that Apify's free plan gives you $5 free credits every month and access to [Apify Proxy](https://apify.com/proxy),
which is sufficient for testing and low-volume use cases.
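
The compute unit definition above translates into simple arithmetic. The following sketch estimates the compute cost of a run from its memory and duration; the function names are illustrative, the example run is hypothetical, and the $0.25/CU figure is the baseline price quoted above, which may differ on your plan.

```python
PRICE_PER_CU_USD = 0.25  # baseline price quoted above; your plan may differ


def compute_units(memory_gb: float, runtime_hours: float) -> float:
    """1 CU corresponds to 1 GB of memory running for 1 hour."""
    return memory_gb * runtime_hours


def estimate_cost_usd(
    memory_gb: float,
    runtime_hours: float,
    price_per_cu: float = PRICE_PER_CU_USD,
) -> float:
    """Estimate the compute cost of a single Actor run in USD."""
    return compute_units(memory_gb, runtime_hours) * price_per_cu


# A hypothetical run with 4 GB of memory that takes 30 minutes uses 2 CU.
print(compute_units(4, 0.5))      # 2.0
print(estimate_cost_usd(4, 0.5))  # 0.5
```

This covers compute only; other platform usage such as proxy traffic or storage is not included in the estimate.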

### Troubleshooting

- The Actor works best for crawling sites with multiple URLs. For **extracting text or Markdown from a single URL**,
  you might prefer to use [RAG Web Browser](https://apify.com/apify/rag-web-browser) in the Standby mode,
  which is much faster and more efficient.
- If the **extracted text doesn’t contain the expected page content**, try to select another *Crawler type*.
  Generally, a headless browser will extract more text as it loads dynamic page content
  and is less likely to be blocked.
- If the **extracted text has more than expected page content** (e.g. navigation or footer),
  try to select another *HTML transformer*, or use the *Remove HTML elements* setting
  to skip unwanted parts of the page.
- If the **crawler is too slow**, try increasing the Actor memory and/or the *Initial concurrency* setting.
  Note that if you set the concurrency too high, the Actor will run out of memory and crash,
  or potentially overload the target site.
- If the target website is blocking the crawler, make sure to use the **Stealthy web browser (Firefox+Playwright)**
  crawler type and use residential proxies.
- The crawler **automatically restarts on crash**, and continues where it left off.
  But if it crashes more than 3 times per minute, the system fails the Actor run.

### Help & support

Website Content Crawler is under active development.
If you have any feedback or feature ideas, please [submit an issue](https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/issues).

### Is it legal?

Web scraping is generally legal if you scrape publicly available non-personal data. What you do with the data is another question.
Documentation, help articles, or blogs are typically protected by copyright, so you can't republish the content without the owner's permission.

Learn more about the legality of web scraping in this
[blog post](https://blog.apify.com/is-web-scraping-legal/). If you're not sure, please seek professional legal advice.

# Actor input Schema

## `startUrls` (type: `array`):

One or more URLs of pages where the crawler will start.

By default, the Actor will also crawl sub-pages of these URLs.

For example, for start URL `https://example.com/blog`, it will also crawl `https://example.com/blog/post` or `https://example.com/blog/article`.

The **Include URL patterns (globs)** option can override this behavior.
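
To make the default scoping rule concrete, here is a minimal sketch of a "is this URL under the start URL" check. It assumes a plain comparison of scheme, host, and path prefix; the Actor's actual matching logic is not documented here and may differ in edge cases, and the function name is hypothetical.

```python
from urllib.parse import urlsplit


def is_under_start_url(start_url: str, candidate: str) -> bool:
    """Return True if candidate is the start URL or one of its sub-pages."""
    start, cand = urlsplit(start_url), urlsplit(candidate)
    # Sub-pages must share the scheme and host of the start URL.
    if (start.scheme, start.netloc) != (cand.scheme, cand.netloc):
        return False
    # Compare whole path segments so that /blog does not match /blogging.
    base = start.path.rstrip("/")
    return cand.path == base or cand.path.startswith(base + "/")


start = "https://example.com/blog"
print(is_under_start_url(start, "https://example.com/blog/post"))       # True
print(is_under_start_url(start, "https://example.com/docs/something"))  # False
```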

## `crawlerType` (type: `string`):

Select the crawling engine:

- **Adaptive switching** between browser and raw HTTP: Fast and renders JavaScript content if present. Default and recommended option.
- **Headless browser** (Firefox+Playwright): Reliable, renders JavaScript content, best in avoiding blocking, but might be slow.
- **Raw HTTP client** (Cheerio): Fastest, but doesn't render JavaScript content.
- **Raw HTTP client with JavaScript** (JSDOM): Deprecated, use Cheerio instead.
- **Headless browser** (Chrome+Playwright): Deprecated, use Firefox+Playwright instead.

More details about Crawler types are in [readme](https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/information/version-0/readme#crawler-types).

## `includeUrlGlobs` (type: `array`):

Define URL patterns (globs) to extend crawling beyond **Start URLs** and their subpages.

Example: `https://www.example.com/blog/**` matches any blog page — `https://www.example.com/blog/post-title` or `https://www.example.com/blog/category/post` — even if the Start URL is `https://www.example.com/product/some-product`.

It affects only links found on pages, but not **Start URLs** - if you want to crawl a page, make sure to specify its URL in the **Start URLs** field.

Combined with **Exclude URL patterns**, you can precisely control which pages are crawled.

Learn more about globs [here](https://www.digitalocean.com/community/tools/glob?comments=true\&glob=https%3A%2F%2Fexample.com%2Fscrape_this%2F%2A%2A\&matches=false\&tests=https%3A%2F%2Fexample.com%2Ftools%2F\&tests=https%3A%2F%2Fexample.com%2Fscrape_this%2F\&tests=https%3A%2F%2Fexample.com%2Fscrape_this%2F123%3Ftest%3Dabc\&tests=https%3A%2F%2Fexample.com%2Fdont_scrape_this) and test them with our **Glob tester** under this input.

## `excludeUrlGlobs` (type: `array`):

Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not **Start URLs**, which are always crawled.

For example `https://{store,docs}.example.com/**` excludes all URLs starting with `https://store.example.com/` or `https://docs.example.com/`, and `https://example.com/**/*\?*foo=*` excludes all URLs that contain `foo` query parameter with any value.

Learn more about globs [here](https://www.digitalocean.com/community/tools/glob?comments=true\&glob=https%3A%2F%2Fexample.com%2Fscrape_this%2F%2A%2A\&matches=false\&tests=https%3A%2F%2Fexample.com%2Ftools%2F\&tests=https%3A%2F%2Fexample.com%2Fscrape_this%2F\&tests=https%3A%2F%2Fexample.com%2Fscrape_this%2F123%3Ftest%3Dabc\&tests=https%3A%2F%2Fexample.com%2Fdont_scrape_this) and test them with our **Glob tester** under this input.
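
To show how the two example patterns above behave, the sketch below translates a small subset of glob syntax (`**`, `*`, `?`, `{a,b}`, and backslash escapes) into regular expressions. This is a simplified stand-in written for illustration, not the glob library the Actor uses, so edge cases may be matched differently; the helper names are hypothetical.

```python
import re


def glob_to_regex(glob: str) -> re.Pattern:
    """Translate a small subset of glob syntax into a compiled regex."""
    parts = []
    i, in_braces = 0, False
    while i < len(glob):
        ch = glob[i]
        if ch == "\\" and i + 1 < len(glob):
            parts.append(re.escape(glob[i + 1]))  # escaped literal, e.g. \?
            i += 2
        elif glob.startswith("**/", i):
            parts.append("(?:.*/)?")  # zero or more whole path segments
            i += 3
        elif glob.startswith("**", i):
            parts.append(".*")  # anything, including slashes
            i += 2
        else:
            if ch == "*":
                parts.append("[^/]*")  # anything within one path segment
            elif ch == "?":
                parts.append("[^/]")  # exactly one non-slash character
            elif ch == "{":
                parts.append("(?:")
                in_braces = True
            elif ch == "}" and in_braces:
                parts.append(")")
                in_braces = False
            elif ch == "," and in_braces:
                parts.append("|")  # separates alternatives inside braces
            else:
                parts.append(re.escape(ch))
            i += 1
    return re.compile("".join(parts))


def is_excluded(url: str, exclude_globs: list[str]) -> bool:
    """Return True if the whole URL matches any of the glob patterns."""
    return any(glob_to_regex(g).fullmatch(url) for g in exclude_globs)


globs = [
    r"https://{store,docs}.example.com/**",
    r"https://example.com/**/*\?*foo=*",
]
print(is_excluded("https://docs.example.com/guide", globs))       # True
print(is_excluded("https://example.com/blog/post?foo=1", globs))  # True
print(is_excluded("https://example.com/blog/post", globs))        # False
```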

## `maxCrawlDepth` (type: `integer`):

The maximum number of links starting from the start URL that the crawler will recursively follow. The start URLs have depth `0`, the pages linked directly from the start URLs have depth `1`, and so on.

Useful to prevent accidental crawler runaway. By setting it to `0`, the Actor will only crawl the Start URLs.

## `maxCrawlPages` (type: `integer`):

The maximum number of pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.
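
The interplay of `maxCrawlDepth` and `maxCrawlPages` can be illustrated on an in-memory link graph. The sketch below assumes a breadth-first traversal, which is an assumption made for illustration and not a statement about the Actor's internal scheduling; the function name and the link graph are hypothetical.

```python
from collections import deque


def plan_crawl(links, start_urls, max_depth, max_pages):
    """Return (url, depth) pairs in the order a breadth-first crawl visits them."""
    seen = set(start_urls)
    queue = deque((url, 0) for url in start_urls)  # start URLs have depth 0
    visited = []
    while queue and len(visited) < max_pages:
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links from pages at the depth limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, depth + 1))
    return visited


# Hypothetical link graph: page -> pages it links to.
links = {"/": ["/a", "/b"], "/a": ["/a/1"], "/b": ["/"]}
print(plan_crawl(links, ["/"], max_depth=0, max_pages=10))  # [('/', 0)]
print(plan_crawl(links, ["/"], max_depth=1, max_pages=10))  # [('/', 0), ('/a', 1), ('/b', 1)]
print(plan_crawl(links, ["/"], max_depth=2, max_pages=2))   # [('/', 0), ('/a', 1)]
```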

## `useSitemaps` (type: `boolean`):

If enabled, the crawler will look for [Sitemaps](https://en.wikipedia.org/wiki/Sitemaps) at the domains of the provided *Start URLs* and enqueue matching URLs in the same way as links found on crawled pages.

You can also reference a `sitemap.xml` file directly by adding it as another Start URL (e.g. `https://www.example.com/sitemap.xml`).

Crawling can be more robust with Sitemaps, as they include pages that might not be reachable from Start URLs. However, **loading and processing Sitemaps can take a lot of time, especially for large sites**.

Note that if a page is found via Sitemaps, it will have a `depth` of `1`.

## `useLlmsTxt` (type: `boolean`):

If enabled, the crawler will look for `/llms.txt` files at the root of the domains of the provided Start URLs (e.g., `https://example.com/llms.txt`) and enqueue them for crawling. Note that this also enables crawling other Markdown files and enqueueing links from them.

## `respectRobotsTxtFile` (type: `boolean`):

If enabled, the crawler will consult the robots.txt file for the target website before crawling each page. At the moment, the crawler does not use any specific user agent identifier. The crawl-delay directive is also not supported yet.

## `keepUrlFragments` (type: `boolean`):

Indicates that URL fragments (e.g. `http://example.com#fragment`) should be included when checking whether a URL has already been visited or not. Typically, URL fragments are used for page navigation only and therefore they should be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such a case, this option should be enabled.
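
As a small illustration of the idea, the sketch below computes the key used to decide whether a URL was already visited, with and without the fragment. The function name `visit_key` is hypothetical and the Actor's internal deduplication may normalize URLs further.

```python
from urllib.parse import urldefrag


def visit_key(url: str, keep_url_fragments: bool = False) -> str:
    """Return the key used to decide whether a URL was already visited."""
    # urldefrag() splits off the "#fragment" part of the URL.
    return url if keep_url_fragments else urldefrag(url).url


a, b = "http://example.com/#pricing", "http://example.com/#contact"
print(visit_key(a) == visit_key(b))              # True: treated as one page
print(visit_key(a, True) == visit_key(b, True))  # False: treated as two pages
```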

## `ignoreCanonicalUrl` (type: `boolean`):

If enabled, the Actor will ignore the canonical URL or the `ETag` header reported by the page, and use the actual URL instead. You can use this feature for websites that report invalid canonical URLs, which causes the Actor to skip those pages in results.
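
The effect of this option on the results can be sketched as follows. This is an illustration only: it assumes deduplication by canonical URL alone and leaves the `ETag` header out, and the function name and sample pages are hypothetical.

```python
def dedupe_pages(pages, ignore_canonical_url=False):
    """Return URLs of pages kept after skipping duplicates."""
    seen, kept = set(), []
    for page in pages:
        # With the option enabled, fall back to the actual page URL.
        canonical = None if ignore_canonical_url else page.get("canonicalUrl")
        key = canonical or page["url"]
        if key not in seen:
            seen.add(key)
            kept.append(page["url"])
    return kept


# Two pages that both (incorrectly) report the homepage as canonical.
pages = [
    {"url": "https://example.com/a", "canonicalUrl": "https://example.com/"},
    {"url": "https://example.com/b", "canonicalUrl": "https://example.com/"},
]
print(dedupe_pages(pages))                             # ['https://example.com/a']
print(dedupe_pages(pages, ignore_canonical_url=True))  # ['https://example.com/a', 'https://example.com/b']
```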

## `proxyConfiguration` (type: `object`):

Enables loading websites from IP addresses in specific geographies and helps circumvent blocking.

## `initialCookies` (type: `array`):

Cookies that will be pre-set to all pages the scraper opens. This is useful for pages that require login. The value is expected to be a JSON array of objects with `name` and `value` properties. For example:

```json
[
  {
    "name": "cookieName",
    "value": "cookieValue",
    "path": "/",
    "domain": ".apify.com"
  }
]
```

You can use the [EditThisCookie](https://docs.apify.com/academy/tools/edit-this-cookie) browser extension to copy browser cookies in this format, and paste them here.

Note that the value is secret and encrypted to protect your login cookies.
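
If you only have a raw `Cookie` header string (for example, copied from browser developer tools), a small helper can convert it into the array format shown above. This is a sketch under that assumption; the function name, the sample cookie values, and the domain are hypothetical.

```python
import json


def cookies_from_header(cookie_header: str, domain: str, path: str = "/") -> list[dict]:
    """Convert a 'name=value; name2=value2' string into a list of cookie objects."""
    cookies = []
    for part in cookie_header.split(";"):
        # Split on the first "=" only, since cookie values may contain "=".
        name, sep, value = part.strip().partition("=")
        if sep and name:
            cookies.append({"name": name, "value": value, "path": path, "domain": domain})
    return cookies


cookies = cookies_from_header("sessionId=abc123; theme=dark", domain=".example.com")
print(json.dumps(cookies, indent=2))
```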

## `customHttpHeaders` (type: `object`):

HTTP headers that will be added to all requests made by the crawler. This is useful for setting custom authentication headers or other headers required by the target website. The value is expected to be a JSON object that maps header names to their values. For example: `{ "name1": "value1", "Authorization": "Basic a1b2c3d4..." }`.
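
For HTTP Basic authentication, the `Authorization` value is the word `Basic` followed by the Base64 encoding of `username:password`. The sketch below builds such a header object in Python; the credentials and the extra header are placeholders, and the helper name is hypothetical.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP Basic Authorization header."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return f"Basic {token}"


custom_http_headers = {
    "Authorization": basic_auth_header("user", "pass"),
    "X-Example-Header": "value1",  # hypothetical extra header
}
print(custom_http_headers["Authorization"])  # Basic dXNlcjpwYXNz
```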

## `signHttpRequests` (type: `boolean`):

If enabled, the crawler will sign all HTTP requests using its Web Bot Auth private key. This is necessary if you want to use Website Content Crawler as a Cloudflare Signed Agent.

## `initialConcurrency` (type: `integer`):

The initial number of web browsers or HTTP clients running in parallel. The system scales the concurrency up and down based on the current CPU and memory load. If the value is set to 0 (default), the Actor uses the default setting for the specific crawler type.

Note that if you set this value too high, the Actor will run out of memory and crash. If it is too low, the crawl will be slow at the start, before the concurrency scales up.

## `maxConcurrency` (type: `integer`):

The maximum number of web browsers or HTTP clients running in parallel. This setting is useful to avoid overloading the target websites and to avoid getting blocked.

## `requestTimeoutSecs` (type: `integer`):

Timeout in seconds for making the request and processing its response. Defaults to 60s.

## `minFileDownloadSpeedKBps` (type: `integer`):

The minimum viable file download speed in kilobytes per second. If the file download speed is lower than this value for a prolonged duration, the crawler will consider the file download as failing, abort it, and retry it (up to "Maximum number of retries" times). This is useful to avoid your crawls being stuck on slow file downloads.

## `maxRequestRetries` (type: `integer`):

The maximum number of times the crawler will retry the request on network, proxy or server errors. If the (n+1)-th request still fails, the crawler will mark this request as failed.
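
The relationship between retries and total attempts can be sketched as below. This is an illustrative loop, not the crawler's own code: `fetch_with_retries` and `always_failing_fetch` are hypothetical names, and `ConnectionError` stands in for any network, proxy, or server error.

```python
from typing import Callable, Optional


def fetch_with_retries(fetch: Callable[[], str], max_request_retries: int) -> str:
    """Call `fetch` once, then retry up to `max_request_retries` more times."""
    last_error: Optional[Exception] = None
    for _ in range(max_request_retries + 1):  # 1 initial attempt + n retries
        try:
            return fetch()
        except ConnectionError as error:  # stand-in for network/proxy/server errors
            last_error = error
    raise RuntimeError(
        f"Request failed after {max_request_retries + 1} attempts"
    ) from last_error


attempts = 0


def always_failing_fetch() -> str:
    """Hypothetical fetch that always fails, used to count attempts."""
    global attempts
    attempts += 1
    raise ConnectionError("simulated network error")


try:
    fetch_with_retries(always_failing_fetch, max_request_retries=3)
except RuntimeError as error:
    print(error, "| attempts made:", attempts)
```

With `max_request_retries=3`, the request is attempted four times in total before it is marked as failed.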

## `maxSessionRotations` (type: `integer`):

The maximum number of times the crawler will rotate the session (IP address + browser configuration) on anti-scraping measures like CAPTCHAs. If the crawler rotates the session more than this number and the page is still blocked, it will finish with an error.

## `ignoreHttpsErrors` (type: `boolean`):

If enabled, the scraper will ignore HTTPS certificate errors. Use at your own risk.

## `dynamicContentWaitSecs` (type: `integer`):

The maximum time in seconds to wait for dynamic page content to load. By default, it is 10 seconds. The crawler will continue processing the page either if this time elapses, or if it detects the network became idle as there are no more requests for additional resources.

When using the **Wait for selector** option, the crawler will wait for the selector to appear for this amount of time. If the selector doesn't appear within this period, the request will fail and will be retried.

Note that this setting is ignored for the raw HTTP client, because it doesn't execute JavaScript or load any dynamic resources. Similarly, if the value is set to `0`, the crawler doesn't wait for any dynamic content to load and processes the HTML as provided on load.

## `waitForSelector` (type: `string`):

Specify a **CSS selector** to tell the crawler to wait for a specific element to appear before it starts extracting content. This is helpful for pages where the content loads dynamically.

Examples: `div`, `#id-of-an-element`, `.class-name`

This setting disables the default content-load detection. If the element doesn't appear within the **Wait for dynamic content** timeout, the request will fail and be retried.

If **Wait for dynamic content** is set to `0`, the crawler does not wait for late elements. Instead, it checks the selector only against the current page state / HTML snapshot, and fails the request immediately if the selector is not found.

With the raw HTTP client, this option checks for the presence of the selector in the HTML content and throws an error if it's not found.

## `softWaitForSelector` (type: `string`):

If set, the crawler will wait for the specified CSS selector to appear in the page before proceeding with the content extraction. Unlike the `waitForSelector` option, this option doesn't fail the request if the selector doesn't appear within the timeout (the request processing will continue).

## `maxScrollHeightPixels` (type: `integer`):

The crawler will scroll down the page until all content is loaded (and network becomes idle), or until this maximum scrolling height is reached. Setting this value to `0` disables scrolling altogether.

Note that this setting is ignored for the raw HTTP client, because it doesn't execute JavaScript or load any dynamic resources.

## `removeCookieWarnings` (type: `boolean`):

If enabled, the Actor will try to remove cookie consent dialogs or modals, using the [I don't care about cookies](https://addons.mozilla.org/en-US/firefox/addon/i-dont-care-about-cookies/) browser extension, to improve the accuracy of the extracted text. Note that there is a small performance penalty if this feature is enabled.

This setting is ignored when using the raw HTTP crawler type.

## `blockMedia` (type: `boolean`):

If the flag is enabled and the Actor is using a headless browser, it will not load images, fonts, stylesheets and videos to improve performance. It will load scripts as usual - that is after all the point of using a headless browser.

## `expandIframes` (type: `boolean`):

By default, the Actor will extract content from `iframe` elements. If you want to specifically skip `iframe` processing, disable this option. Works only for the `playwright:firefox` crawler type.

## `clickElementsCssSelector` (type: `string`):

A CSS selector matching DOM elements that will be clicked. This is useful for expanding collapsed sections, in order to capture their text content. The value must be a valid CSS selector as accepted by the `document.querySelectorAll()` function.

## `stickyContainerCssSelector` (type: `string`):

This is an **experimental** feature. A CSS selector matching DOM elements that will be prevented from deleting any of their children. This is useful in conjunction with the "Expand clickable elements" option on pages where hidden content is actually removed from the DOM (i.e., some variants of the accordion pattern). Enabling this might corrupt the extracted content, which is why it is disabled by default. It is possible to enable the feature for the whole page with the `*` selector, or you can target specific elements if the former has unwanted side effects.

## `pageFunction` (type: `string`):

A declaration of an asynchronous JS function (e.g. `async function pageFunction({ page }) { await page.click('.submit-button') }`).

The function receives `context` as the only argument. Context is a JavaScript object containing the following properties:

- `page`: Currently loaded Playwright `Page` instance.
- `request`: The request object that triggered the page load.

The function will be executed for each crawled page, after the page is loaded (including all dynamic content) and before the content is extracted and cleaned.

## `keepElementsCssSelector` (type: `string`):

Extract only relevant page content by specifying CSS selectors (e.g. `div`, `#element-id`, `.class-name`). [Learn more about CSS selectors](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Styling_basics/Basic_selectors).

If any selectors are defined, everything else will be removed from the page.

This option runs before the `HTML transformer` option. If you are missing content in the output despite using this option, try disabling the `HTML transformer`.

## `removeElementsCssSelector` (type: `string`):

Specify which HTML elements should be removed from the page before text extraction. This is useful to skip irrelevant page content.

By default, the Actor removes common navigation elements, headers, footers, modals, scripts, and inline images. You can disable the removal by setting this value to some non-existent CSS selector like `dummy_keep_everything`.

## `htmlTransformer` (type: `string`):

Specify how to transform HTML to get meaningful content, removing extra fluff like navigation or pop-ups. This is applied after any HTML elements are removed or clicked.

- **Readable text with fallback**: Uses Mozilla's Readability to extract content, but keeps the original HTML if it's not a clear article. Great for sites with mixed content like articles and product pages.

- **Readable text** (Default): Also uses Mozilla's Readability but is more aggressive, removing headers, footers, and navigation. Best for blogs and article-heavy sites.

- **Extractus**: An alternative content extraction algorithm that might work better for certain news sites or blogs with unique layouts.

- **Defuddle**: More forgiving than Readability, better preserving elements like math, footnotes, and code. It also extracts metadata and uses mobile styles for clean-up.

- **None**: Only performs basic cleaning and removes elements specified by you. This option is best when you need to preserve most of the page's original HTML.

## `readableTextCharThreshold` (type: `integer`):

A configuration option for the "Readable text" HTML transformer. It sets the minimum number of characters an article must have in order to be considered relevant.

## `aggressivePrune` (type: `boolean`):

This is an **experimental feature**. If enabled, the crawler will prune content lines that are very similar to the ones already crawled on other pages, using the Count-Min Sketch algorithm. This is useful to strip repeating content in the scraped data like menus, headers, footers, etc. In some (not very likely) cases, it might remove relevant content from some pages.
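
To give an idea of how a Count-Min Sketch can support this kind of pruning, here is a minimal Python sketch. It is an assumption-laden illustration rather than the Actor's implementation: the class, the `prune_repeated_lines` helper, the table sizes, and the `max_seen` threshold are all hypothetical, and lines are compared by exact match instead of similarity.

```python
import hashlib


class CountMinSketch:
    """Approximate counter: may overestimate counts, never underestimates."""

    def __init__(self, width: int = 1024, depth: int = 4) -> None:
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]

    def _indexes(self, item: str):
        # One hash per row, derived by salting the item with the row number.
        for row in range(self.depth):
            digest = hashlib.sha256(f"{row}:{item}".encode("utf-8")).digest()
            yield row, int.from_bytes(digest[:8], "big") % self.width

    def add(self, item: str) -> None:
        for row, col in self._indexes(item):
            self.table[row][col] += 1

    def estimate(self, item: str) -> int:
        return min(self.table[row][col] for row, col in self._indexes(item))


def prune_repeated_lines(pages: list[str], max_seen: int = 1) -> list[str]:
    """Drop lines already seen on `max_seen` or more earlier pages."""
    sketch = CountMinSketch()
    pruned = []
    for page in pages:
        lines = [line for line in page.splitlines() if line.strip()]
        kept = [line for line in lines if sketch.estimate(line) < max_seen]
        for line in set(lines):  # count each line once per page
            sketch.add(line)
        pruned.append("\n".join(kept))
    return pruned


pages = ["Home | Docs | Blog\nFirst article", "Home | Docs | Blog\nSecond article"]
result = prune_repeated_lines(pages)
print(result)
```

Because the sketch can overestimate counts when hashes collide, a unique line is occasionally treated as repeated, which matches the caveat above that relevant content might be removed in rare cases.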

## `debugMode` (type: `boolean`):

If enabled, the Actor will store the output of all types of HTML transformers, including the ones that are not used by default, and it will also store the HTML to Key-value Store with a link. All this data is stored under the `debug` field in the resulting Dataset.

## `debugLog` (type: `boolean`):

If enabled, the Actor log will include debug messages. Beware that this can be quite verbose.

## `storeSkippedUrls` (type: `boolean`):

If enabled, the crawler will store all URLs that were skipped during the crawl in a Key-Value Store record named `SKIPPED_URLS`. The record will contain a JSON object with reasons for skipping and the URLs that were skipped for each reason. This is useful for debugging and understanding why certain pages were not crawled.

## `saveHtml` (type: `boolean`):

If enabled, the crawler stores the full transformed HTML of all pages found to the output dataset under the `html` field. **This option has been deprecated** in favor of the `saveHtmlAsFile` option, because dataset records have a size limit of approximately 10 MB and it's harder to review the HTML for debugging.

## `saveHtmlAsFile` (type: `boolean`):

If enabled, the crawler stores full transformed HTML of all pages found to the default key-value store and saves links to the files as `htmlUrl` field in the output dataset. Storing HTML in key-value store is preferred to storing it into the dataset with the `saveHtml` option, because there's no size limit and it's easier for debugging as you can easily view the HTML.

## `saveMarkdown` (type: `boolean`):

If enabled, the crawler converts the transformed HTML of all pages found to Markdown, and stores it under the `markdown` field in the output dataset.

## `saveFiles` (type: `boolean`):

Deprecated in favor of the `saveContentTypes` option. Will be removed soon. If enabled, the crawler downloads files linked from the web pages, as long as their URL has one of the following file extensions: PDF, DOC, DOCX, XLS, XLSX, and CSV. Note that unlike web pages, the files are downloaded regardless of whether they are under **Start URLs** or not. The files are stored in the default key-value store, and metadata about them is saved to the output dataset, similarly to web pages.

## `saveContentTypes` (type: `string`):

The crawler downloads files linked from the web pages, as long as their content type matches the provided value. Select predefined [Content-type](https://www.iana.org/assignments/media-types/media-types.xhtml) groups to download common file types, or enter custom HTTP Content-type strings, including wildcards (e.g., `application/pdf`, `text/*`, `image/*`) for specific downloads. Note that unlike web pages, the files are downloaded regardless of whether they are under **Start URLs** or not. The files are stored in the default key-value store, and metadata about them is saved to the output dataset, similarly to web pages.
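
The wildcard matching can be pictured with the following Python sketch. It assumes simple glob-style semantics and is not taken from the Actor's source: `content_type_matches` is a hypothetical helper that ignores parameters such as `; charset=utf-8`.

```python
from fnmatch import fnmatchcase


def content_type_matches(content_type: str, patterns: list[str]) -> bool:
    """Check a Content-Type header value against wildcard patterns.

    Hypothetical helper; parameters such as `; charset=utf-8` are ignored.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    return any(fnmatchcase(media_type, pattern.lower()) for pattern in patterns)


patterns = ["application/pdf", "text/*", "image/*"]

print(content_type_matches("application/pdf", patterns))          # True
print(content_type_matches("text/csv; charset=utf-8", patterns))  # True
print(content_type_matches("application/zip", patterns))          # False
```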

## `saveScreenshots` (type: `boolean`):

If enabled, the crawler stores a screenshot for each article page to the default key-value store. The link to the screenshot is stored under the `screenshotUrl` field in the output dataset. It is useful for debugging, but reduces performance and increases storage costs.

Note that this feature only works with the `playwright:firefox` crawler type.

## `maxResults` (type: `integer`):

The maximum number of web pages and files to store. This setting helps prevent an accidental crawler runaway by automatically stopping the crawl once this limit is reached. Note that the crawler skips pages whose canonical URL matches a page that has already been crawled, so it may crawl more pages than the number of stored results. Similarly, there may be more stored results than crawled web pages because downloaded files also count toward results.

## `clientSideMinChangePercentage` (type: `integer`):

The minimum change in content (as a percentage) after the initial load that is required for a page to be considered client-side rendered.

## `renderingTypeDetectionPercentage` (type: `integer`):

How often the adaptive crawler should attempt to detect the page rendering type.

## `reuseStoredDetectionResults` (type: `boolean`):

If enabled, the crawler (if using `playwright:adaptive`) will reuse the results of rendering type detections done in previous runs to speed up crawling of statically rendered pages.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://docs.apify.com/academy/scraping-basics-javascript"
    }
  ],
  "crawlerType": "playwright:adaptive",
  "includeUrlGlobs": [],
  "excludeUrlGlobs": [],
  "maxCrawlDepth": 20,
  "maxCrawlPages": 9999999,
  "useSitemaps": false,
  "useLlmsTxt": false,
  "respectRobotsTxtFile": true,
  "keepUrlFragments": false,
  "ignoreCanonicalUrl": false,
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "initialCookies": [],
  "customHttpHeaders": {},
  "signHttpRequests": false,
  "initialConcurrency": 0,
  "maxConcurrency": 200,
  "requestTimeoutSecs": 60,
  "minFileDownloadSpeedKBps": 128,
  "maxRequestRetries": 3,
  "maxSessionRotations": 10,
  "ignoreHttpsErrors": false,
  "dynamicContentWaitSecs": 10,
  "waitForSelector": "",
  "softWaitForSelector": "",
  "maxScrollHeightPixels": 5000,
  "removeCookieWarnings": true,
  "blockMedia": true,
  "expandIframes": true,
  "clickElementsCssSelector": "[aria-expanded=\"false\"]",
  "pageFunction": "",
  "keepElementsCssSelector": "",
  "removeElementsCssSelector": "nav, footer, script, style, noscript, svg, img[src^='data:'],\n[role=\"alert\"],\n[role=\"banner\"],\n[role=\"dialog\"],\n[role=\"alertdialog\"],\n[role=\"region\"][aria-label*=\"skip\" i],\n[aria-modal=\"true\"]",
  "htmlTransformer": "readableText",
  "readableTextCharThreshold": 100,
  "aggressivePrune": false,
  "debugMode": false,
  "debugLog": false,
  "storeSkippedUrls": false,
  "saveHtml": false,
  "saveHtmlAsFile": false,
  "saveMarkdown": true,
  "saveFiles": false,
  "saveScreenshots": false,
  "maxResults": 9999999,
  "clientSideMinChangePercentage": 15,
  "renderingTypeDetectionPercentage": 10,
  "reuseStoredDetectionResults": false
}
```

# Actor output Schema

## `crawlResults` (type: `string`):

No description

## `screenshots` (type: `string`):

No description

## `downloadedFiles` (type: `string`):

No description

## `htmlSnapshots` (type: `string`):

No description

## `crawlErrors` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://docs.apify.com/academy/scraping-basics-javascript"
        }
    ],
    "crawlerType": "playwright:adaptive",
    "includeUrlGlobs": [],
    "excludeUrlGlobs": [],
    "useSitemaps": false,
    "useLlmsTxt": false,
    "respectRobotsTxtFile": true,
    "proxyConfiguration": {
        "useApifyProxy": true
    },
    "initialCookies": [],
    "customHttpHeaders": {},
    "signHttpRequests": false,
    "blockMedia": true,
    "clickElementsCssSelector": "[aria-expanded=\"false\"]",
    "keepElementsCssSelector": "",
    "removeElementsCssSelector": `nav, footer, script, style, noscript, svg, img[src^='data:'],
[role="alert"],
[role="banner"],
[role="dialog"],
[role="alertdialog"],
[role="region"][aria-label*="skip" i],
[aria-modal="true"]`,
    "storeSkippedUrls": false
};

// Run the Actor and wait for it to finish
const run = await client.actor("apify/website-content-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://docs.apify.com/academy/scraping-basics-javascript" }],
    "crawlerType": "playwright:adaptive",
    "includeUrlGlobs": [],
    "excludeUrlGlobs": [],
    "useSitemaps": False,
    "useLlmsTxt": False,
    "respectRobotsTxtFile": True,
    "proxyConfiguration": { "useApifyProxy": True },
    "initialCookies": [],
    "customHttpHeaders": {},
    "signHttpRequests": False,
    "blockMedia": True,
    "clickElementsCssSelector": "[aria-expanded=\"false\"]",
    "keepElementsCssSelector": "",
    "removeElementsCssSelector": """nav, footer, script, style, noscript, svg, img[src^='data:'],
[role=\"alert\"],
[role=\"banner\"],
[role=\"dialog\"],
[role=\"alertdialog\"],
[role=\"region\"][aria-label*=\"skip\" i],
[aria-modal=\"true\"]""",
    "storeSkippedUrls": False,
}

# Run the Actor and wait for it to finish
run = client.actor("apify/website-content-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
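
If you prefer not to install the client library, the `run-sync-get-dataset-items` endpoint listed in the OpenAPI specification in this document can also be called over plain HTTP. The following is a minimal sketch using only the Python standard library. It builds the request without sending it; `build_run_request` is a hypothetical helper, and the commented-out part assumes the endpoint responds with JSON.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/apify~website-content-crawler/run-sync-get-dataset-items"


def build_run_request(token: str, run_input: dict) -> Request:
    """Build (but do not send) a POST request that runs the Actor synchronously."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    return Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# `startUrls` and `proxyConfiguration` are the required input fields.
request = build_run_request(
    "<YOUR_API_TOKEN>",
    {
        "startUrls": [{"url": "https://docs.apify.com/academy/scraping-basics-javascript"}],
        "proxyConfiguration": {"useApifyProxy": True},
    },
)
print(request.get_method(), request.full_url)

# To actually start the run and read the dataset items (assuming a JSON response):
# from urllib.request import urlopen
# with urlopen(request) as response:
#     items = json.load(response)
```

Note that a synchronous call keeps the HTTP connection open until the run finishes, so for long crawls the client library examples above are likely the more practical choice.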

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://docs.apify.com/academy/scraping-basics-javascript"
    }
  ],
  "crawlerType": "playwright:adaptive",
  "includeUrlGlobs": [],
  "excludeUrlGlobs": [],
  "useSitemaps": false,
  "useLlmsTxt": false,
  "respectRobotsTxtFile": true,
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "initialCookies": [],
  "customHttpHeaders": {},
  "signHttpRequests": false,
  "blockMedia": true,
  "clickElementsCssSelector": "[aria-expanded=\\"false\\"]",
  "keepElementsCssSelector": "",
  "removeElementsCssSelector": "nav, footer, script, style, noscript, svg, img[src^='\''data:'\''],\\n[role=\\"alert\\"],\\n[role=\\"banner\\"],\\n[role=\\"dialog\\"],\\n[role=\\"alertdialog\\"],\\n[role=\\"region\\"][aria-label*=\\"skip\\" i],\\n[aria-modal=\\"true\\"]",
  "storeSkippedUrls": false
}' |
apify call apify/website-content-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=apify/website-content-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

````json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Website Content Crawler",
        "description": "Crawl websites and extract text content to feed AI models, LLM applications, vector databases, or RAG pipelines. The Actor supports rich formatting using Markdown, cleans the HTML, downloads files, and integrates well with 🦜🔗 LangChain, LlamaIndex, and the wider LLM ecosystem.",
        "version": "0.3",
        "x-build-id": "rafOuxkX6cdWCDm7H"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/apify~website-content-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-apify-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/apify~website-content-crawler/runs": {
            "post": {
                "operationId": "runs-sync-apify-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/apify~website-content-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-apify-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls",
                    "proxyConfiguration"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "One or more URLs of pages where the crawler will start.\n\nBy default, the Actor will also crawl sub-pages of these URLs.\n\nFor example, for start URL `https://example.com/blog`, it will crawl also `https://example.com/blog/post` or `https://example.com/blog/article`.\n\nThe **Include URL patterns (globs)** option can override this behavior.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "crawlerType": {
                        "title": "Crawler type",
                        "enum": [
                            "playwright:adaptive",
                            "playwright:firefox",
                            "cheerio",
                            "jsdom",
                            "playwright:chrome"
                        ],
                        "type": "string",
                        "description": "Select the crawling engine:\n- **Adaptive switching** between browser and raw HTTP: Fast and renders JavaScript content if present. Default and recommended option.\n- **Headless browser** (Firefox+Playwright): Reliable, renders JavaScript content, best in avoiding blocking, but might be slow.\n- **Raw HTTP client** (Cheerio): Fastest, but doesn't render JavaScript content.\n- **Raw HTTP client with JavaScript** (JSDOM): Deprecated, use Cheerio instead.\n- **Headless browser** (Chrome+Playwright): Deprecated, use Firefox+Playwright instead.\n\nMore details about Crawler types are in [readme](https://console.apify.com/actors/aYG0l9s7dbB7j3gbS/information/version-0/readme#crawler-types).",
                        "default": "playwright:firefox"
                    },
                    "includeUrlGlobs": {
                        "title": "Include URL patterns (globs)",
                        "type": "array",
                        "description": "Define URL patterns (globs) to extend crawling beyond **Start URLs** and their subpages.\n\nExample: `https://www.example.com/blog/**` matches any blog page — `https://www.example.com/blog/post-title` or `https://www.example.com/blog/category/post` — even if the Start URL is `https://www.example.com/product/some-product`.\n\nIt affects only links found on pages, but not **Start URLs** - if you want to crawl a page, make sure to specify its URL in the **Start URLs** field.\n\nCombined with **Exclude URL patterns**, you can precisely control which pages are crawled.\n\nLearn more about globs [here](https://www.digitalocean.com/community/tools/glob?comments=true&glob=https%3A%2F%2Fexample.com%2Fscrape_this%2F%2A%2A&matches=false&tests=https%3A%2F%2Fexample.com%2Ftools%2F&tests=https%3A%2F%2Fexample.com%2Fscrape_this%2F&tests=https%3A%2F%2Fexample.com%2Fscrape_this%2F123%3Ftest%3Dabc&tests=https%3A%2F%2Fexample.com%2Fdont_scrape_this) and test them with our **Glob tester** under this input.",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "glob"
                            ],
                            "properties": {
                                "glob": {
                                    "type": "string",
                                    "title": "Glob of a web page"
                                }
                            }
                        }
                    },
                    "excludeUrlGlobs": {
                        "title": "Exclude URL patterns (globs)",
                        "type": "array",
                        "description": "Glob patterns matching URLs of pages that will be excluded from crawling. Note that this affects only links found on pages, but not **Start URLs**, which are always crawled. \n\nFor example `https://{store,docs}.example.com/**` excludes all URLs starting with `https://store.example.com/` or `https://docs.example.com/`, and `https://example.com/**/*\\?*foo=*` excludes all URLs that contain `foo` query parameter with any value.\n\nLearn more about globs [here](https://www.digitalocean.com/community/tools/glob?comments=true&glob=https%3A%2F%2Fexample.com%2Fscrape_this%2F%2A%2A&matches=false&tests=https%3A%2F%2Fexample.com%2Ftools%2F&tests=https%3A%2F%2Fexample.com%2Fscrape_this%2F&tests=https%3A%2F%2Fexample.com%2Fscrape_this%2F123%3Ftest%3Dabc&tests=https%3A%2F%2Fexample.com%2Fdont_scrape_this) and test them with our **Glob tester** under this input.",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "glob"
                            ],
                            "properties": {
                                "glob": {
                                    "type": "string",
                                    "title": "Glob of a web page"
                                }
                            }
                        }
                    },
                    "maxCrawlDepth": {
                        "title": "Max crawling depth",
                        "minimum": 0,
                        "type": "integer",
                        "description": "The maximum number of links starting from the start URL that the crawler will recursively follow. The start URLs have depth `0`, the pages linked directly from the start URLs have depth `1`, and so on.\n\nUseful to prevent accidental crawler runaway. By setting it to `0`, the Actor will only crawl the Start URLs.",
                        "default": 20
                    },
                    "maxCrawlPages": {
                        "title": "Max pages",
                        "minimum": 0,
                        "type": "integer",
                        "description": "The maximum number pages to crawl. It includes the start URLs, pagination pages, pages with no content, etc. The crawler will automatically finish after reaching this number. This setting is useful to prevent accidental crawler runaway.",
                        "default": 9999999
                    },
                    "useSitemaps": {
                        "title": "Load URLs from Sitemaps",
                        "type": "boolean",
                        "description": "If enabled, the crawler will look for [Sitemaps](https://en.wikipedia.org/wiki/Sitemaps) at the domains of the provided *Start URLs* and enqueue matching URLs similarly as the links found on crawled pages.\n\nYou can also reference a `sitemap.xml` file directly by adding it as another Start URL (e.g. `https://www.example.com/sitemap.xml`)\n\nThe crawling could be more robust with Sitemaps, as it includes pages that might be not reachable from Start URLs. However, **loading and processing Sitemaps can take a lot of time, especially for large sites**.\n\nNote that if a page is found via Sitemaps, it will have `depth` of `1`.",
                        "default": false
                    },
                    "useLlmsTxt": {
                        "title": "Crawl /llms.txt and Markdown files",
                        "type": "boolean",
                        "description": "If enabled, the crawler will look for `/llms.txt` files at the root of the domains of the provided Start URLs (e.g., `https://example.com/llms.txt`) and enqueue them for crawling. Note that this also enables crawling other Markdown files and enqueueing links from them.",
                        "default": false
                    },
                    "respectRobotsTxtFile": {
                        "title": "Respect the robots.txt file",
                        "type": "boolean",
                        "description": "If enabled, the crawler will consult the robots.txt file for the target website before crawling each page. At the moment, the crawler does not use any specific user agent identifier. The crawl-delay directive is also not supported yet.",
                        "default": false
                    },
                    "keepUrlFragments": {
                        "title": "URL #fragments identify unique pages",
                        "type": "boolean",
                        "description": "Indicates that URL fragments (e.g. <code>http://example.com<b>#fragment</b></code>) should be included when checking whether a URL has already been visited or not. Typically, URL fragments are used for page navigation only and therefore they should be ignored, as they don't identify separate pages. However, some single-page websites use URL fragments to display different pages; in such a case, this option should be enabled.",
                        "default": false
                    },
                    "ignoreCanonicalUrl": {
                        "title": "Ignore canonical URLs",
                        "type": "boolean",
                        "description": "If enabled, the Actor will ignore the canonical URL or the `ETag` header reported by the page, and use the actual URL instead. You can use this feature for websites that report invalid canonical URLs, which causes the Actor to skip those pages in results.",
                        "default": false
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Enables loading the websites from IP addresses in specific geographies and to circumvent blocking.",
                        "default": {
                            "useApifyProxy": true
                        }
                    },
                    "initialCookies": {
                        "title": "Custom cookies",
                        "type": "array",
                        "description": "Cookies that will be pre-set to all pages the scraper opens. This is useful for pages that require login. The value is expected to be a JSON array of objects with `name` and `value` properties. For example: \n\n```json\n[\n  {\n    \"name\": \"cookieName\",\n    \"value\": \"cookieValue\",\n    \"path\": \"/\",\n    \"domain\": \".apify.com\"\n  }\n]\n```\n\nYou can use the [EditThisCookie](https://docs.apify.com/academy/tools/edit-this-cookie) browser extension to copy browser cookies in this format, and paste it here.\n\nNote that the value is secret and encrypted to protect your login cookies."
                    },
                    "customHttpHeaders": {
                        "title": "Custom HTTP headers",
                        "type": "object",
                        "description": "HTTP headers that will be added to all requests made by the crawler. This is useful for setting custom authentication headers or other headers required by the target website. The value is expected to be a JSON object with `name` and `value` properties pairs. For example: `{ \"name1\": \"value1\", \"Authorization\": \"Basic a1b2c3d4...\" }`.",
                        "default": {}
                    },
                    "signHttpRequests": {
                        "title": "Sign HTTP requests (experimental)",
                        "type": "boolean",
                        "description": "If enabled, the crawler will sign all HTTP requests using its Web Bot Auth private key. This is necessary if you want to use Website Content Crawler as a Cloudflare Signed Agent.",
                        "default": false
                    },
                    "initialConcurrency": {
                        "title": "Initial concurrency",
                        "minimum": 0,
                        "maximum": 999,
                        "type": "integer",
                        "description": "The initial number of web browsers or HTTP clients running in parallel. The system scales the concurrency up and down based on the current CPU and memory load. If the value is set to 0 (default), the Actor uses the default setting for the specific crawler type.\n\nNote that if you set this value too high, the Actor will run out of memory and crash. If too low, it will be slow at start before it scales the concurrency up.",
                        "default": 0
                    },
                    "maxConcurrency": {
                        "title": "Max concurrency",
                        "minimum": 1,
                        "maximum": 999,
                        "type": "integer",
                        "description": "The maximum number of web browsers or HTTP clients running in parallel. This setting is useful to avoid overloading the target websites and to avoid getting blocked.",
                        "default": 200
                    },
                    "requestTimeoutSecs": {
                        "title": "Page request timeout",
                        "minimum": 1,
                        "maximum": 600,
                        "type": "integer",
                        "description": "Timeout in seconds for making the request and processing its response. Defaults to 60s.",
                        "default": 60
                    },
                    "minFileDownloadSpeedKBps": {
                        "title": "Minimum file download speed",
                        "type": "integer",
                        "description": "The minimum viable file download speed in kilobytes per seconds. If the file download speed is lower than this value for a prolonged duration, the crawler will consider the file download as failing, abort it, and retry it again (up to \"Maximum number of retries\" times). This is useful to avoid your crawls being stuck on slow file downloads.",
                        "default": 128
                    },
                    "maxRequestRetries": {
                        "title": "Maximum number of retries on network / server errors",
                        "minimum": 0,
                        "maximum": 20,
                        "type": "integer",
                        "description": "The maximum number of times the crawler will retry the request on network, proxy or server errors. If the (n+1)-th request still fails, the crawler will mark this request as failed.",
                        "default": 3
                    },
                    "maxSessionRotations": {
                        "title": "Maximum number of session rotations",
                        "minimum": 0,
                        "maximum": 20,
                        "type": "integer",
                        "description": "The maximum number of times the crawler will rotate the session (IP address + browser configuration) on anti-scraping measures like CAPTCHAs. If the crawler rotates the session more than this number and the page is still blocked, it will finish with an error.",
                        "default": 10
                    },
                    "ignoreHttpsErrors": {
                        "title": "Ignore HTTPS errors",
                        "type": "boolean",
                        "description": "If enabled, the scraper will ignore HTTPS certificate errors. Use at your own risk.",
                        "default": false
                    },
                    "dynamicContentWaitSecs": {
                        "title": "Wait for dynamic content",
                        "type": "integer",
                        "description": "The maximum time in seconds to wait for dynamic page content to load. By default, it is 10 seconds. The crawler will continue processing the page either if this time elapses, or if it detects the network became idle as there are no more requests for additional resources.\n\nWhen using the **Wait for selector** option, the crawler will wait for the selector to appear for this amount of time. If the selector doesn't appear within this period, the request will fail and will be retried.\n\nNote that this setting is ignored for the raw HTTP client, because it doesn't execute JavaScript or loads any dynamic resources. Similarly, if the value is set to `0`, the crawler doesn't wait for any dynamic to load and processes the HTML as provided on load.",
                        "default": 10
                    },
                    "waitForSelector": {
                        "title": "Wait for selector",
                        "type": "string",
                        "description": "Specify a **CSS selector** to tell the crawler to wait for a specific element to appear before it starts extracting content. This is helpful for pages where the content loads dynamically.\n\nExamples: `div`, `#id-of-an-element`, `.class-name`\n\nThis setting disables the default content-load detection. If the element doesn't appear within the **Wait for dynamic content** timeout, the request will fail and be retried.\n\nIf **Wait for dynamic content** is set to `0`, the crawler does not wait for late elements. Instead, it checks the selector only against the current page state / HTML snapshot, and fails the request immediately if the selector is not found.\n\nWith the raw HTTP client, this option checks for the presence of the selector in the HTML content and throws an error if it's not found.",
                        "default": ""
                    },
                    "softWaitForSelector": {
                        "title": "Soft wait for selector",
                        "type": "string",
                        "description": "If set, the crawler will wait for the specified CSS selector to appear in the page before proceeding with the content extraction. Unlike the `waitForSelector` option, this option doesn't fail the request if the selector doesn't appear within the timeout (the request processing will continue).",
                        "default": ""
                    },
                    "maxScrollHeightPixels": {
                        "title": "Maximum scroll height",
                        "minimum": 0,
                        "type": "integer",
                        "description": "The crawler will scroll down the page until all content is loaded (and network becomes idle), or until this maximum scrolling height is reached. Setting this value to `0` disables scrolling altogether.\n\nNote that this setting is ignored for the raw HTTP client, because it doesn't execute JavaScript or loads any dynamic resources.",
                        "default": 5000
                    },
                    "removeCookieWarnings": {
                        "title": "Remove cookie warnings",
                        "type": "boolean",
                        "description": "If enabled, the Actor will try to remove cookies consent dialogs or modals, using the [I don't care about cookies](https://addons.mozilla.org/en-US/firefox/addon/i-dont-care-about-cookies/) browser extension, to improve the accuracy of the extracted text. Note that there is a small performance penalty if this feature is enabled.\n\nThis setting is ignored when using the raw HTTP crawler type.",
                        "default": true
                    },
                    "blockMedia": {
                        "title": "Block loading of images and videos",
                        "type": "boolean",
                        "description": "If the flag is enabled and the Actor is using a headless browser, it will not load images, fonts, stylesheets and videos to improve performance. It will load scripts as usual - that is after all the point of using a headless browser.",
                        "default": false
                    },
                    "expandIframes": {
                        "title": "Expand iframe elements",
                        "type": "boolean",
                        "description": "By default, the Actor will extract content from `iframe` elements. If you want to specifically skip `iframe` processing, disable this option. Works only for the `playwright:firefox` crawler type.",
                        "default": true
                    },
                    "clickElementsCssSelector": {
                        "title": "Expand clickable elements",
                        "type": "string",
                        "description": "A CSS selector matching DOM elements that will be clicked. This is useful for expanding collapsed sections, in order to capture their text content. The value must be a valid CSS selector as accepted by the `document.querySelectorAll()` function. ",
                        "default": "[aria-expanded=\"false\"]"
                    },
                    "stickyContainerCssSelector": {
                        "title": "Make containers sticky",
                        "type": "string",
                        "description": "This is an **experimental** feature. A CSS selector matching DOM elements that will be prevented from deleting any of their children. This is useful in conjunction with the \"Expand clickable elements\" option on pages where hidden content is actually removed from the DOM (i.e., some variants of the accordion pattern). Enabling this might corrupt the extracted content, which is why it is disabled by default. It is possible to enable the feature for the whole page with the `*` selector, or you can target specific elements if the former has unwanted side effects."
                    },
                    "pageFunction": {
                        "title": "Page function",
                        "type": "string",
                        "description": "A declaration of an asynchronous JS function (e.g. `async function pageFunction({ page }) { await page.click('.submit-button') }`).\n\nThe function receives `context` as the only argument. Context is a JavaScript object containing the following properties:\n- `page`: Currently loaded Playwright `Page` instance.\n- `request`: The request object that triggered the page load.\n\nThe function will be executed in the browser context for each crawled page, after the page is loaded (included all dynamic content) and before the content is extracted and cleaned.",
                        "default": ""
                    },
                    "keepElementsCssSelector": {
                        "title": "Keep HTML elements (CSS selector)",
                        "type": "string",
                        "description": "Extract only relevant page content by specifying CSS selectors (e.g. `div`, `#element-id`, `.class-name`). [Learn more about CSS selectors](https://developer.mozilla.org/en-US/docs/Learn_web_development/Core/Styling_basics/Basic_selectors).\n\nIf any selectors are defined, everything else will be removed from the page.\n\nThis option runs before the `HTML transformer` option. If you are missing content in the output despite using this option, try disabling the `HTML transformer`.",
                        "default": ""
                    },
                    "removeElementsCssSelector": {
                        "title": "Remove HTML elements (CSS selector)",
                        "type": "string",
                        "description": "Specify which HTML elements should be removed from the page before text extraction. This is useful to skip irrelevant page content.\n\nBy default, the Actor removes common navigation elements, headers, footers, modals, scripts, and inline image. You can disable the removal by setting this value to some non-existent CSS selector like `dummy_keep_everything`.",
                        "default": "nav, footer, script, style, noscript, svg, img[src^='data:'],\n[role=\"alert\"],\n[role=\"banner\"],\n[role=\"dialog\"],\n[role=\"alertdialog\"],\n[role=\"region\"][aria-label*=\"skip\" i],\n[aria-modal=\"true\"]"
                    },
                    "htmlTransformer": {
                        "title": "HTML transformer",
                        "enum": [
                            "readableTextIfPossible",
                            "readableText",
                            "extractus",
                            "defuddle",
                            "none"
                        ],
                        "type": "string",
                        "description": "Specify how to transform HTML to get meaningful content, removing extra fluff like navigation or pop-ups. This is applied after any HTML elements are removed or clicked.\n\n- **Readable text with fallback**: Uses Mozilla's Readability to extract content, but keeps the original HTML if it's not a clear article. Great for sites with mixed content like articles and product pages.\n\n- **Readable text** (Default): Also uses Mozilla's Readability but is more aggressive, removing headers, footers, and navigation. Best for blogs and article-heavy sites.\n\n- **Extractus**: An alternative content extraction algorithm that might work better for certain news sites or blogs with unique layouts.\n\n- **Defuddle**: More forgiving than Readability, better preserving elements like math and footnotes, code. It also extracts metadata and uses mobile styles for clean-up.\n\n- **None**: Only performs basic cleaning and removes elements specified by you. This option is best when you need to preserve most of the page's original HTML.",
                        "default": "readableText"
                    },
                    "readableTextCharThreshold": {
                        "title": "Readable text extractor character threshold",
                        "type": "integer",
                        "description": "A configuration options for the \"Readable text\" HTML transformer. It contains the minimum number of characters an article must have in order to be considered relevant.",
                        "default": 100
                    },
                    "aggressivePrune": {
                        "title": "Remove duplicate text lines",
                        "type": "boolean",
                        "description": "This is an **experimental feature**. If enabled, the crawler will prune content lines that are very similar to the ones already crawled on other pages, using the Count-Min Sketch algorithm. This is useful to strip repeating content in the scraped data like menus, headers, footers, etc. In some (not very likely) cases, it might remove relevant content from some pages.",
                        "default": false
                    },
                    "debugMode": {
                        "title": "Debug mode (stores output of all HTML transformers)",
                        "type": "boolean",
                        "description": "If enabled, the Actor will store the output of all types of HTML transformers, including the ones that are not used by default, and it will also store the HTML to Key-value Store with a link. All this data is stored under the `debug` field in the resulting Dataset.",
                        "default": false
                    },
                    "debugLog": {
                        "title": "Debug log",
                        "type": "boolean",
                        "description": "If enabled, the actor log will include debug messages. Beware that this can be quite verbose.",
                        "default": false
                    },
                    "storeSkippedUrls": {
                        "title": "Store skipped URLs",
                        "type": "boolean",
                        "description": "If enabled, the crawler will store all URLs that were skipped during the crawl in a Key-Value Store record named `SKIPPED_URLS`. The record will contain a JSON object with reasons for skipping and the URLs that were skipped for each reason. This is useful for debugging and understanding why certain pages were not crawled.",
                        "default": false
                    },
                    "saveHtml": {
                        "title": "Save HTML to dataset (deprecated)",
                        "type": "boolean",
                        "description": "If enabled, the crawler stores full transformed HTML of all pages found to the output dataset under the `html` field. **This option has been deprecated** in favor of the `saveHtmlAsFile` option, because the dataset records have a size of approximately 10MB and it's harder to review the HTML for debugging.",
                        "default": false
                    },
                    "saveHtmlAsFile": {
                        "title": "Save HTML to key-value store",
                        "type": "boolean",
                        "description": "If enabled, the crawler stores full transformed HTML of all pages found to the default key-value store and saves links to the files as `htmlUrl` field in the output dataset. Storing HTML in key-value store is preferred to storing it into the dataset with the `saveHtml` option, because there's no size limit and it's easier for debugging as you can easily view the HTML.",
                        "default": false
                    },
                    "saveMarkdown": {
                        "title": "Save Markdown",
                        "type": "boolean",
                        "description": "If enabled, the crawler converts the transformed HTML of all pages found to Markdown, and stores it under the `markdown` field in the output dataset.",
                        "default": true
                    },
                    "saveFiles": {
                        "title": "Save files",
                        "type": "boolean",
                        "description": "Deprecated in favor of the `saveContentTypes` option. Will be removed soon. If enabled, the crawler downloads files linked from the web pages, as long as their URL has one of the following file extensions: PDF, DOC, DOCX, XLS, XLSX, and CSV. Note that unlike web pages, the files are downloaded regardless if they are under **Start URLs** or not. The files are stored to the default key-value store, and metadata about them to the output dataset, similarly as for web pages.",
                        "default": false
                    },
                    "saveContentTypes": {
                        "title": "Save linked files with Content-Type",
                        "type": "string",
                        "description": "The crawler downloads files linked from the web pages, as long as their content type matches the provided value. Select predefined <a href=\"https://www.iana.org/assignments/media-types/media-types.xhtml\">Content-type</a> groups to download common file types, or enter custom HTTP Content-type strings, including wildcards (e.g., application/pdf, text/\\*, image/\\*) for specific downloads. Note that unlike web pages, the files are downloaded regardless if they are under **Start URLs** or not. The files are stored to the default key-value store, and metadata about them to the output dataset, similarly as for web pages."
                    },
                    "saveScreenshots": {
                        "title": "Save screenshots (headless browser only)",
                        "type": "boolean",
                        "description": "If enabled, the crawler stores a screenshot for each article page to the default key-value store. The link to the screenshot is stored under the `screenshotUrl` field in the output dataset. It is useful for debugging, but reduces performance and increases storage costs.\n\nNote that this feature only works with the `playwright:firefox` crawler type.",
                        "default": false
                    },
                    "maxResults": {
                        "title": "Max results",
                        "minimum": 0,
                        "type": "integer",
                        "description": "The maximum number of web pages and files to store. This setting helps prevent an accidental crawler runaway by automatically stopping the crawl once this limit is reached. Note that the crawler skips pages whose canonical URL matches a page that has already been crawled, so it may crawl more pages than the number of stored results. Similarly, there may be more stored results than crawled web pages because downloaded files also count toward results.",
                        "default": 9999999
                    },
                    "clientSideMinChangePercentage": {
                        "title": "(Adaptive crawling only) Minimum client-side content change percentage",
                        "minimum": 1,
                        "type": "integer",
                        "description": "The least amount of content (as a percentage) change after the initial load required to consider the pages client-side rendered",
                        "default": 15
                    },
                    "renderingTypeDetectionPercentage": {
                        "title": "(Adaptive crawling only) How often should the crawler attempt to detect page rendering type",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "How often should the adaptive attempt to detect page rendering type",
                        "default": 10
                    },
                    "reuseStoredDetectionResults": {
                        "title": "Reuse stored detections results (experimental)",
                        "type": "boolean",
                        "description": "If enabled, the crawler (if using playwright:adaptive) will reuse results of rendering type detections done in previous runs to speed up crawling of statically rendered pages",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "number",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
````
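
The run object described by the schema above carries both an itemized cost breakdown (`usageUsd`) and a total (`usageTotalUsd`). The following is a minimal Python sketch of how a client might read those fields once it has a run object as a plain dictionary. The sample values are copied from the `example` entries in the schema. The helper name `summarize_usage` is hypothetical, and the idea that the itemized costs add up to the total is an assumption made for illustration, not something the schema guarantees.

```python
from math import isclose

# Sample run object shaped like the schema above; values are taken from
# the schema's "example" entries, not from a real run.
run = {
    "options": {
        "build": "latest",
        "timeoutSecs": 300,
        "memoryMbytes": 1024,
        "diskMbytes": 2048,
    },
    "usageTotalUsd": 0.00005,
    "usageUsd": {
        "ACTOR_COMPUTE_UNITS": 0,
        "DATASET_READS": 0,
        "DATASET_WRITES": 0,
        "KEY_VALUE_STORE_READS": 0,
        "KEY_VALUE_STORE_WRITES": 0.00005,
        "KEY_VALUE_STORE_LISTS": 0,
        "REQUEST_QUEUE_READS": 0,
        "REQUEST_QUEUE_WRITES": 0,
        "DATA_TRANSFER_INTERNAL_GBYTES": 0,
        "DATA_TRANSFER_EXTERNAL_GBYTES": 0,
        "PROXY_RESIDENTIAL_TRANSFER_GBYTES": 0,
        "PROXY_SERPS": 0,
    },
}


def summarize_usage(run_obj: dict) -> dict:
    """Hypothetical helper: summarize the cost fields of a run object."""
    usage_usd = run_obj.get("usageUsd") or {}
    itemized = sum(usage_usd.values())
    total = run_obj.get("usageTotalUsd", 0.0)
    return {
        "total_usd": total,
        "itemized_usd": itemized,
        # Assumption: the line items are expected to add up to the total.
        "matches_total": isclose(itemized, total, abs_tol=1e-9),
        # Keep only the line items that actually cost something.
        "nonzero_items": {name: cost for name, cost in usage_usd.items() if cost},
    }


summary = summarize_usage(run)
print(summary["nonzero_items"])
```

Treating every cost as a float, even when the example value is `0`, keeps the arithmetic uniform regardless of whether a field arrives as an integer or a fractional number.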
