# Keywords Extractor (`lukaskrivka/keywords-extractor`) Actor

Use our free website keyword extractor to crawl any website and extract keyword counts on each page.

- **URL**: https://apify.com/lukaskrivka/keywords-extractor.md
- **Developed by:** [Lukáš Křivka](https://apify.com/lukaskrivka) (community)
- **Categories:** Other, Open source
- **Stats:** 838 total users, 5 monthly users, 100.0% runs succeeded, 15 bookmarks
- **User rating**: 4.78 out of 5 stars

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor is free to use, and you only pay for the Apify platform usage, which gets cheaper the higher subscription plan you have.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are a software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

````bash
# MacOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```bash

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### Keyword Extractor

Can deeply crawl a website and counts how many times are provided keywords found on the page.

#### How to use
- You can pass in any number of keywords that you want to count.
- You can combine Start URLs, Pseudo Urls and link selector to traverse any number of pages accross websites. Check our [scraping tutorial](https://docs.apify.com/scraping) on how to use these.
- You can specify `maxDepth` and `maxPagesPerCrawl` to limit the scope of the scrape. Start URLs have depth 0. So if you want just the start URLs, set `maxDepth` to 0, etc.
- You can pick case sensitive search and search through scripts.
- You can choose to scrape with or without browser. Browser is more expensive but allows JavaScript rendering and waiting.
- For browser, you can use many additional features

#### How are keywords determined
The text is split into words by [word boundaries](https://www.regular-expressions.info/wordboundaries.html). Each word is then compared with each keyword. In the future, we may add other types of boundaries to choose from.

#### Example Output
For `keywords`:
````

\["watch", "watches", "rolex"]

```
starting on `https://www.chrono24.com/watches/mens-watches--62.htm`
```

\[
{
"url": "https://www.chrono24.com/watches/mens-watches--62.htm",
"depth": 0,
"result": {
"watch": 63,
"watches": 81,
"rolex": 57
}
},
{
"url": "https://www.chrono24.com/user/index.htm",
"depth": 1,
"result": {
"watch": 9,
"watches": 13,
"rolex": 1
}
},
{
"url": "https://www.chrono24.com/info/watch-collection.htm",
"depth": 1,
"result": {
"watch": 56,
"watches": 23,
"rolex": 1
}
},
...
]

````

# Actor input Schema

## `startUrls` (type: `array`):

A static list of URLs to scrape. To be able to add new URLs on the fly, enable the <b>Use request queue</b> option.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#start-urls' target='_blank' rel='noopener'>Start URLs</a> in README.
## `useBrowser` (type: `boolean`):

If on, it will use regular borwser for scraping.
## `keywords` (type: `array`):

List of keywords to search and count on every page
## `caseSensitive` (type: `boolean`):

If on, it will only match keywords with exact upper or lower case.
## `scanScripts` (type: `boolean`):

If on, it will also count keywords appearing inside scripts.
## `linkSelector` (type: `string`):

A CSS selector saying which links on the page (<code>&lt;a&gt;</code> elements with <code>href</code> attribute) shall be followed and added to the request queue. This setting only applies if <b>Use request queue</b> is enabled. To filter the links added to the queue, use the <b>Pseudo-URLs</b> setting.<br><br>If <b>Link selector</b> is empty, the page links are ignored.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#link-selector' target='_blank' rel='noopener'>Link selector</a> in README.
## `pseudoUrls` (type: `array`):

Specifies what kind of URLs found by <b>Link selector</b> should be added to the request queue. A pseudo-URL is a URL with regular expressions enclosed in <code>[]</code> brackets, e.g. <code>http://www.example.com/[.*]</code>. This setting only applies if the <b>Use request queue</b> option is enabled.<br><br>If <b>Pseudo-URLs</b> are omitted, the actor enqueues all links matched by the <b>Link selector</b>.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#pseudo-urls' target='_blank' rel='noopener'>Pseudo-URLs</a> in README.
## `maxDepth` (type: `integer`):

How many links deep from the Start URLs do you want to crawl. Start URLs have depth 0.
## `proxyConfiguration` (type: `object`):

Specifies proxy servers that will be used by the scraper in order to hide its origin.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#proxy-configuration' target='_blank' rel='noopener'>Proxy configuration</a> in README.
## `maxPagesPerCrawl` (type: `integer`):

The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It's always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.<br><br>If set to <code>0</code>, there is no limit.
## `maxConcurrency` (type: `integer`):

Specified the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target website.
## `retireInstanceAfterRequestCount` (type: `integer`):

How often will the browser itself rotate. Pick higher for smaller consumption, pick less to rotate (test) more proxies
## `useChrome` (type: `boolean`):

Only works for puppeteer type. Be careful that Chrome is not guaranteed to work with Puppeteer.
## `waitFor` (type: `string`):

Only works for puppeteer type. Will wait on each page. You can provide number in ms or a selector.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://apify.com"
    }
  ],
  "useBrowser": false,
  "keywords": [
    "apify",
    "web",
    "scraping",
    "automation"
  ],
  "caseSensitive": false,
  "scanScripts": false,
  "linkSelector": "a[href]",
  "pseudoUrls": [
    {
      "purl": "https://apify.com/[.*]"
    }
  ],
  "maxDepth": 5,
  "proxyConfiguration": {
    "useApifyProxy": false
  },
  "maxPagesPerCrawl": 100,
  "maxConcurrency": 50,
  "retireInstanceAfterRequestCount": 50,
  "useChrome": false
}
````

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://apify.com"
        }
    ],
    "keywords": [
        "apify",
        "web",
        "scraping",
        "automation"
    ],
    "linkSelector": "a[href]",
    "pseudoUrls": [
        {
            "purl": "https://apify.com/[.*]"
        }
    ],
    "proxyConfiguration": {
        "useApifyProxy": false
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("lukaskrivka/keywords-extractor").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://apify.com" }],
    "keywords": [
        "apify",
        "web",
        "scraping",
        "automation",
    ],
    "linkSelector": "a[href]",
    "pseudoUrls": [{ "purl": "https://apify.com/[.*]" }],
    "proxyConfiguration": { "useApifyProxy": False },
}

# Run the Actor and wait for it to finish
run = client.actor("lukaskrivka/keywords-extractor").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://apify.com"
    }
  ],
  "keywords": [
    "apify",
    "web",
    "scraping",
    "automation"
  ],
  "linkSelector": "a[href]",
  "pseudoUrls": [
    {
      "purl": "https://apify.com/[.*]"
    }
  ],
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}' |
apify call lukaskrivka/keywords-extractor --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=lukaskrivka/keywords-extractor",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Keywords Extractor",
        "description": "Use our free website keyword extractor to crawl any website and extract keyword counts on each page.",
        "version": "0.0",
        "x-build-id": "irvxpoDkPfnbazYNZ"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/lukaskrivka~keywords-extractor/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-lukaskrivka-keywords-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/lukaskrivka~keywords-extractor/runs": {
            "post": {
                "operationId": "runs-sync-lukaskrivka-keywords-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/lukaskrivka~keywords-extractor/run-sync": {
            "post": {
                "operationId": "run-sync-lukaskrivka-keywords-extractor",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrls",
                    "keywords"
                ],
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "A static list of URLs to scrape. To be able to add new URLs on the fly, enable the <b>Use request queue</b> option.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#start-urls' target='_blank' rel='noopener'>Start URLs</a> in README.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "useBrowser": {
                        "title": "Use Browser",
                        "type": "boolean",
                        "description": "If on, it will use regular borwser for scraping.",
                        "default": false
                    },
                    "keywords": {
                        "title": "Keywords",
                        "type": "array",
                        "description": "List of keywords to search and count on every page",
                        "items": {
                            "type": "string"
                        }
                    },
                    "caseSensitive": {
                        "title": "Case sensitive",
                        "type": "boolean",
                        "description": "If on, it will only match keywords with exact upper or lower case.",
                        "default": false
                    },
                    "scanScripts": {
                        "title": "Scan scripts",
                        "type": "boolean",
                        "description": "If on, it will also count keywords appearing inside scripts.",
                        "default": false
                    },
                    "linkSelector": {
                        "title": "Link selector",
                        "type": "string",
                        "description": "A CSS selector saying which links on the page (<code>&lt;a&gt;</code> elements with <code>href</code> attribute) shall be followed and added to the request queue. This setting only applies if <b>Use request queue</b> is enabled. To filter the links added to the queue, use the <b>Pseudo-URLs</b> setting.<br><br>If <b>Link selector</b> is empty, the page links are ignored.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#link-selector' target='_blank' rel='noopener'>Link selector</a> in README."
                    },
                    "pseudoUrls": {
                        "title": "Pseudo-URLs",
                        "type": "array",
                        "description": "Specifies what kind of URLs found by <b>Link selector</b> should be added to the request queue. A pseudo-URL is a URL with regular expressions enclosed in <code>[]</code> brackets, e.g. <code>http://www.example.com/[.*]</code>. This setting only applies if the <b>Use request queue</b> option is enabled.<br><br>If <b>Pseudo-URLs</b> are omitted, the actor enqueues all links matched by the <b>Link selector</b>.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#pseudo-urls' target='_blank' rel='noopener'>Pseudo-URLs</a> in README.",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "purl"
                            ],
                            "properties": {
                                "purl": {
                                    "type": "string",
                                    "title": "Pseudo-URL of a web page"
                                }
                            }
                        }
                    },
                    "maxDepth": {
                        "title": "Max depth",
                        "minimum": 0,
                        "type": "integer",
                        "description": "How many links deep from the Start URLs do you want to crawl. Start URLs have depth 0.",
                        "default": 5
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Specifies proxy servers that will be used by the scraper in order to hide its origin.<br><br>For details, see <a href='https://apify.com/apify/web-scraper#proxy-configuration' target='_blank' rel='noopener'>Proxy configuration</a> in README.",
                        "default": {}
                    },
                    "maxPagesPerCrawl": {
                        "title": "Max pages per run",
                        "minimum": 0,
                        "type": "integer",
                        "description": "The maximum number of pages that the scraper will load. The scraper will stop when this limit is reached. It's always a good idea to set this limit in order to prevent excess platform usage for misconfigured scrapers. Note that the actual number of pages loaded might be slightly higher than this value.<br><br>If set to <code>0</code>, there is no limit.",
                        "default": 100
                    },
                    "maxConcurrency": {
                        "title": "Max concurrency",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Specified the maximum number of pages that can be processed by the scraper in parallel. The scraper automatically increases and decreases concurrency based on available system resources. This option enables you to set an upper limit, for example to reduce the load on a target website.",
                        "default": 50
                    },
                    "retireInstanceAfterRequestCount": {
                        "title": "Retire Instance After Request Count",
                        "minimum": 1,
                        "type": "integer",
                        "description": "How often will the browser itself rotate. Pick higher for smaller consumption, pick less to rotate (test) more proxies",
                        "default": 50
                    },
                    "useChrome": {
                        "title": "Use Chrome",
                        "type": "boolean",
                        "description": "Only works for puppeteer type. Be careful that Chrome is not guaranteed to work with Puppeteer.",
                        "default": false
                    },
                    "waitFor": {
                        "title": "Wait for",
                        "type": "string",
                        "description": "Only works for puppeteer type. Will wait on each page. You can provide number in ms or a selector."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
