# Robots Txt Analyzer (`zerobreak/robots-txt-analyzer`) Actor

Robots txt analyzer that fetches and parses crawl rules from any website in bulk, so SEO teams and developers can audit blocked paths, user agents, and sitemap locations across hundreds of domains without manual work.

- **URL**: https://apify.com/zerobreak/robots-txt-analyzer.md
- **Developed by:** [ZeroBreak](https://apify.com/zerobreak) (community)
- **Categories:** SEO tools
- **Stats:** 3 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

$2.99/month + usage

To use this Actor, you pay a monthly rental fee to the developer. The rent is subtracted from your prepaid usage every month after the free trial period. You also pay for Apify platform usage, which gets cheaper the higher your Apify subscription plan.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#rental-actors

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Robots.txt Analyzer: Parse and Validate Crawl Rules for Any Website

Robots.txt Analyzer fetches and parses robots.txt files for any website. Give it one domain or a list of hundreds and get back every directive: blocked paths, allowed paths, crawl delays, and sitemap URLs, organized by user agent. Most tools check robots.txt for a single site; this Actor handles bulk analysis, so you can audit dozens of domains in a single run.

### Use cases

- **SEO auditing**: check which pages Googlebot or Bingbot can access before pushing new content live
- **Technical SEO review**: audit robots.txt across dozens of client domains without opening each file manually
- **Bot access testing**: verify whether a specific URL path is blocked for any crawler before deployment
- **Competitive analysis**: compare robots.txt configurations across competitor domains to see what they protect from indexing
- **Site monitoring**: schedule regular runs to catch unexpected changes that could block search engine crawlers
- **QA validation**: confirm that robots.txt deployments match intended crawl rules after each release

### Input

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `url` | string | | Single website URL to analyze |
| `urls` | array | | List of website URLs for bulk analysis. One URL per line. |
| `userAgent` | string | `*` | Crawler user agent to check rules for (e.g. `Googlebot`, `Bingbot`, `*`) |
| `checkPath` | string | | Specific URL path to check (e.g. `/admin/`) |
| `maxUrls` | integer | `100` | Maximum number of URLs to process per run |
| `timeoutSecs` | integer | `300` | Overall Actor timeout in seconds |
| `requestTimeoutSecs` | integer | `30` | Per-request timeout in seconds |
| `proxyConfiguration` | object | Datacenter (Anywhere) | Proxy type and location for requests. Supports Datacenter, Residential, Special, and custom proxies. Optional. |

#### Example input

```json
{
    "urls": ["https://apify.com", "https://news.ycombinator.com"],
    "userAgent": "Googlebot",
    "checkPath": "/admin/",
    "maxUrls": 100,
    "proxyConfiguration": { "useApifyProxy": true }
}
````

### What data does this Actor extract?

The Actor stores one result per URL in the Apify dataset. Each entry contains:

```json
{
    "url": "https://apify.com",
    "robotsTxtUrl": "https://apify.com/robots.txt",
    "httpStatus": 200,
    "isAccessible": true,
    "rawContent": "User-agent: *\nDisallow: /api/\nDisallow: /admin/\nSitemap: https://apify.com/sitemap.xml",
    "userAgentsFound": ["*"],
    "sitemapUrls": ["https://apify.com/sitemap.xml"],
    "crawlDelay": null,
    "disallowedPaths": ["/api/", "/admin/"],
    "allowedPaths": [],
    "checkedUserAgent": "Googlebot",
    "checkedPath": "/admin/",
    "isPathBlocked": true,
    "matchingRule": "Disallow: /admin/",
    "error": null,
    "scrapedAt": "2025-03-08T12:00:00+00:00"
}
```

| Field | Type | Description |
|-------|------|-------------|
| `url` | string | Original website URL |
| `robotsTxtUrl` | string | URL of the fetched robots.txt file |
| `httpStatus` | integer | HTTP status code returned for the robots.txt request |
| `isAccessible` | boolean | Whether robots.txt was found and returned HTTP 200 |
| `rawContent` | string | Full raw text of the robots.txt file |
| `userAgentsFound` | array | All user agents declared in the file |
| `sitemapUrls` | array | Sitemap URLs declared in the file |
| `crawlDelay` | number | Crawl delay in seconds for the checked user agent, if declared |
| `disallowedPaths` | array | Paths disallowed for the checked user agent |
| `allowedPaths` | array | Paths explicitly allowed for the checked user agent |
| `checkedUserAgent` | string | User agent checked against the robots.txt rules |
| `checkedPath` | string | Specific path checked for access, if provided |
| `isPathBlocked` | boolean | Whether the checked path is blocked. Null if no path was provided. |
| `matchingRule` | string | The specific rule that determined the access result |
| `error` | string | Error message if the fetch failed |
| `scrapedAt` | string | ISO 8601 timestamp of the analysis |
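
When consuming dataset items in typed Python code, it can help to mirror the field table in a `TypedDict`. The sketch below is illustrative only: the names `RobotsResult` and `summarize` are hypothetical, and which fields can be `null` beyond those documented above (`crawlDelay`, `isPathBlocked`) is an assumption.

```python
from typing import List, Optional, TypedDict


class RobotsResult(TypedDict, total=False):
    """One dataset item, mirroring the field table (hypothetical type name)."""
    url: str
    robotsTxtUrl: str
    httpStatus: Optional[int]       # assumption: may be null if the request never completed
    isAccessible: bool
    rawContent: Optional[str]       # assumption: may be null when no file was fetched
    userAgentsFound: List[str]
    sitemapUrls: List[str]
    crawlDelay: Optional[float]     # null when no Crawl-delay is declared
    disallowedPaths: List[str]
    allowedPaths: List[str]
    checkedUserAgent: str
    checkedPath: Optional[str]
    isPathBlocked: Optional[bool]   # null when no check path was provided
    matchingRule: Optional[str]
    error: Optional[str]
    scrapedAt: str


def summarize(item: RobotsResult) -> str:
    """Return a one-line, human-readable summary of a single result."""
    if item.get("error"):
        return f"{item['url']}: fetch failed ({item['error']})"
    disallowed = item.get("disallowedPaths") or []
    blocked = item.get("isPathBlocked")
    if blocked is None:
        verdict = "no path checked"
    else:
        state = "blocked" if blocked else "allowed"
        verdict = f"{item.get('checkedPath')} is {state} for {item.get('checkedUserAgent')}"
    return f"{item['url']}: {len(disallowed)} disallowed path(s), {verdict}"
```

Calling `summarize(item)` on each item returned by the Python client gives a quick textual overview before exporting the full dataset.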

### How it works

1. The Actor reads the input URL or list of URLs
2. For each domain, it builds the robots.txt URL by appending `/robots.txt` to the root
3. It fetches the file with an HTTP GET request
4. The parser groups directives by user agent, reading each line in order
5. It matches the configured user agent against the parsed groups, checking for an exact match first and falling back to the wildcard `*` group
6. If a check path is provided, it applies the longest-match rule to determine access
7. All results are pushed to the Apify dataset
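
The steps above can be approximated in a few lines of Python. This is a minimal sketch, not the Actor's source code: the function names are hypothetical, user agents are compared case-insensitively (an assumption), and ties between equally long `Allow` and `Disallow` prefixes are resolved in favor of `Allow` (also an assumption). Fetching is left out so the example runs offline.

```python
from typing import Dict, List, Optional, Tuple
from urllib.parse import urlsplit

Rule = Tuple[str, str]  # ("allow" | "disallow", path prefix)


def robots_txt_url(site_url: str) -> str:
    """Step 2: append /robots.txt to the root of the site."""
    parts = urlsplit(site_url)
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


def parse_groups(raw: str) -> Dict[str, List[Rule]]:
    """Step 4: group Allow/Disallow rules by user agent, reading lines in order."""
    groups: Dict[str, List[Rule]] = {}
    current_agents: List[str] = []
    reading_agents = False  # True while consecutive User-agent lines are read
    for line in raw.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:
                current_agents = []  # a new group starts here
            reading_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def check_path(groups: Dict[str, List[Rule]], user_agent: str,
               path: str) -> Tuple[bool, Optional[str]]:
    """Steps 5-6: pick the group, then apply longest-prefix matching."""
    rules = groups.get(user_agent.lower())
    if rules is None:
        rules = groups.get("*", [])  # fall back to the wildcard group
    best: Optional[Rule] = None
    for kind, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule value restricts nothing
        longer = best is None or len(prefix) > len(best[1])
        tie_allow = best is not None and len(prefix) == len(best[1]) and kind == "allow"
        if longer or tie_allow:
            best = (kind, prefix)
    if best is None:
        return False, None
    return best[0] == "disallow", f"{best[0].capitalize()}: {best[1]}"
```

For example, with the rules `Disallow: /admin/` and `Allow: /admin/public/` in the `*` group, `check_path(groups, "Googlebot", "/admin/public/page")` reports the path as not blocked, because the `Allow` prefix is the longer match.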

### Integrations

Connect Robots.txt Analyzer with other apps and services using [Apify integrations](https://apify.com/integrations). You can pipe results to Google Sheets, Airtable, or trigger Slack alerts via Make or Zapier whenever a path becomes blocked. You can also use [webhooks](https://docs.apify.com/integrations/webhooks) to act on results as soon as a run finishes.
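
An alert "whenever a path becomes blocked" implies comparing the latest run with an earlier one. The following sketch shows one way to do that on dataset items that have already been downloaded. The helper names `newly_blocked` and `alert_text` are hypothetical, and sending the message to Slack or another service is intentionally left out.

```python
from typing import Iterable, List


def newly_blocked(previous: Iterable[dict], current: Iterable[dict]) -> List[dict]:
    """Return current items whose checked path is blocked now but was not before.

    URLs missing from the previous run are reported when they are blocked,
    since there is no earlier state to compare against.
    """
    was_blocked = {item["url"]: item.get("isPathBlocked") for item in previous}
    return [
        item for item in current
        if item.get("isPathBlocked") is True
        and was_blocked.get(item["url"]) is not True
    ]


def alert_text(item: dict) -> str:
    """Format a short alert line from the documented output fields."""
    return (
        f"{item['url']}: {item.get('checkedPath')} is now blocked for "
        f"{item.get('checkedUserAgent')} ({item.get('matchingRule')})"
    )
```

Both runs should use the same `userAgent` and `checkPath` inputs; otherwise the comparison mixes unrelated checks.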

### FAQ

**Does this Actor handle robots.txt files with multiple user-agent groups?**

Yes. The parser reads every user-agent block and applies the correct rules for the configured user agent, with automatic fallback to the wildcard `*` group when no exact match is found.

**What happens if a site has no robots.txt file?**

The Actor records an HTTP 404 status and sets `isAccessible` to `false`. A missing robots.txt means no restrictions, so `isPathBlocked` is set to `false` when a check path is provided.

**How many URLs can I process per run?**

Up to 1000 per run, controlled by the `maxUrls` input. The default is 100 to avoid accidental large runs on first use.
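
For lists longer than the per-run limit, one option is to split the URLs into several run inputs. The sketch below assumes the 1000-URL ceiling stated above. `batch_inputs` is a hypothetical helper, and setting `maxUrls` to the batch length is a design choice rather than a requirement.

```python
from typing import Iterator, Sequence

MAX_URLS_PER_RUN = 1000  # upper bound of `maxUrls` stated in this README


def batch_inputs(urls: Sequence[str],
                 batch_size: int = MAX_URLS_PER_RUN) -> Iterator[dict]:
    """Yield one Actor input object per batch of at most `batch_size` URLs."""
    if not 1 <= batch_size <= MAX_URLS_PER_RUN:
        raise ValueError(f"batch_size must be between 1 and {MAX_URLS_PER_RUN}")
    for start in range(0, len(urls), batch_size):
        chunk = list(urls[start:start + batch_size])
        # Raise maxUrls above its default of 100 so the whole batch is processed.
        yield {"urls": chunk, "maxUrls": len(chunk)}
```

Each yielded dictionary can be merged with shared options such as `userAgent` or `checkPath` and passed as the input of a separate run.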

**Can this Actor check if Googlebot can access a specific page?**

Yes. Set `userAgent` to `Googlebot` and `checkPath` to the path you want to check. The output includes `isPathBlocked` and `matchingRule` showing exactly which directive made the decision.

**Does it handle robots.txt with wildcard path patterns like `*` and `$`?**

The Actor handles the standard robots.txt directives `Disallow`, `Allow`, `Crawl-delay`, and `Sitemap`. Wildcards in path patterns (`*`, and `$` as an end-of-path anchor) are not currently supported; rules are matched by prefix only.
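
To see what prefix-only matching means in practice, the sketch below treats `*` and `$` as literal characters and flags rules that contain them, so results for those sites can be reviewed by hand. It assumes that `disallowedPaths` and `allowedPaths` hold rule values exactly as written in the file; the function names are hypothetical.

```python
from typing import List


def prefix_match(path: str, rule_path: str) -> bool:
    """Prefix-only matching: `*` and `$` get no special meaning."""
    return bool(rule_path) and path.startswith(rule_path)


def uses_wildcards(rule_path: str) -> bool:
    """True if a rule relies on `*` or a trailing `$` anchor."""
    return "*" in rule_path or rule_path.endswith("$")


def wildcard_rules(item: dict) -> List[str]:
    """List the rules in one result that prefix matching cannot honor."""
    paths = list(item.get("disallowedPaths") or []) + list(item.get("allowedPaths") or [])
    return [p for p in paths if uses_wildcards(p)]
```

Under these rules, `prefix_match("/files/report.pdf", "/*.pdf$")` is `False`, even though a crawler with wildcard support would treat that URL as matched by the pattern.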

Use Robots.txt Analyzer for single-site spot checks or scheduled bulk audits across hundreds of domains. Export to Google Sheets and plug into your existing SEO workflow through the Apify platform.

# Actor input Schema

## `url` (type: `string`):

A single website URL to analyze. The actor will fetch the robots.txt from this domain.

## `urls` (type: `array`):

List of website URLs to analyze in bulk. One URL per line. Combined with the single URL field above.

## `userAgent` (type: `string`):

The crawler user agent to check rules for. Use '\*' for catch-all wildcard rules. Common values: Googlebot, Bingbot, \*.

## `checkPath` (type: `string`):

A specific URL path to check for access (e.g. /admin/ or /private/). If provided, the output will include whether this path is blocked for the chosen user agent.

## `maxUrls` (type: `integer`):

Maximum number of URLs to process per run. Prevents accidental large runs.

## `timeoutSecs` (type: `integer`):

Maximum time in seconds the actor can run before it stops.

## `requestTimeoutSecs` (type: `integer`):

Maximum time in seconds to wait for each robots.txt request before timing out.

## `proxyConfiguration` (type: `object`):

Select proxies to use for requests. Helps avoid IP blocking and rate limits. Datacenter proxies are fastest; Residential proxies are harder to detect.

## Actor input object example

```json
{
  "url": "https://apify.com",
  "urls": [
    "https://apify.com",
    "https://news.ycombinator.com"
  ],
  "userAgent": "Googlebot",
  "checkPath": "/admin/",
  "maxUrls": 100,
  "timeoutSecs": 300,
  "requestTimeoutSecs": 30,
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```
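
Before starting a run, the input can be checked locally against the numeric limits published in the OpenAPI specification's `inputSchema` (`maxUrls` 1 to 1000, `timeoutSecs` 30 to 3600, `requestTimeoutSecs` 5 to 120). This is a sketch with a hypothetical `validate_input` helper. The rule that at least one of `url` or `urls` must be present is an assumption, since the schema does not mark either field as required.

```python
from typing import List

# (minimum, maximum) pairs copied from the Actor's published input schema
BOUNDS = {
    "maxUrls": (1, 1000),
    "timeoutSecs": (30, 3600),
    "requestTimeoutSecs": (5, 120),
}


def validate_input(run_input: dict) -> List[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    problems: List[str] = []
    # Assumption: a run without any URL has nothing to analyze.
    if not run_input.get("url") and not run_input.get("urls"):
        problems.append("provide `url`, `urls`, or both")
    for name, (low, high) in BOUNDS.items():
        value = run_input.get(name)
        if value is None:
            continue  # the Actor falls back to its default
        is_int = isinstance(value, int) and not isinstance(value, bool)
        if not is_int or not low <= value <= high:
            problems.append(f"`{name}` must be an integer between {low} and {high}")
    return problems
```

Running `validate_input` on the example object above returns an empty list, while an out-of-range value such as `"maxUrls": 5000` produces a descriptive message.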

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "url": "https://apify.com",
    "userAgent": "*",
    "proxyConfiguration": {
        "useApifyProxy": true
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("zerobreak/robots-txt-analyzer").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "url": "https://apify.com",
    "userAgent": "*",
    "proxyConfiguration": { "useApifyProxy": True },
}

# Run the Actor and wait for it to finish
run = client.actor("zerobreak/robots-txt-analyzer").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "url": "https://apify.com",
  "userAgent": "*",
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}' |
apify call zerobreak/robots-txt-analyzer --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=zerobreak/robots-txt-analyzer",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Robots Txt Analyzer",
        "description": "Robots txt analyzer that fetches and parses crawl rules from any website in bulk, so SEO teams and developers can audit blocked paths, user agents, and sitemap locations across hundreds of domains without manual work.",
        "version": "0.0",
        "x-build-id": "p2xQwcCgl4eZYTFuY"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/zerobreak~robots-txt-analyzer/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-zerobreak-robots-txt-analyzer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/zerobreak~robots-txt-analyzer/runs": {
            "post": {
                "operationId": "runs-sync-zerobreak-robots-txt-analyzer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/zerobreak~robots-txt-analyzer/run-sync": {
            "post": {
                "operationId": "run-sync-zerobreak-robots-txt-analyzer",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "url": {
                        "title": "Website URL",
                        "type": "string",
                        "description": "A single website URL to analyze. The actor will fetch the robots.txt from this domain."
                    },
                    "urls": {
                        "title": "Website URLs (bulk)",
                        "type": "array",
                        "description": "List of website URLs to analyze in bulk. One URL per line. Combined with the single URL field above.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "userAgent": {
                        "title": "User agent to check",
                        "type": "string",
                        "description": "The crawler user agent to check rules for. Use '*' for catch-all wildcard rules. Common values: Googlebot, Bingbot, *.",
                        "default": "*"
                    },
                    "checkPath": {
                        "title": "Path to check",
                        "type": "string",
                        "description": "A specific URL path to check for access (e.g. /admin/ or /private/). If provided, the output will include whether this path is blocked for the chosen user agent."
                    },
                    "maxUrls": {
                        "title": "Max URLs",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Maximum number of URLs to process per run. Prevents accidental large runs.",
                        "default": 100
                    },
                    "timeoutSecs": {
                        "title": "Overall timeout (seconds)",
                        "minimum": 30,
                        "maximum": 3600,
                        "type": "integer",
                        "description": "Maximum time in seconds the actor can run before it stops.",
                        "default": 300
                    },
                    "requestTimeoutSecs": {
                        "title": "Request timeout (seconds)",
                        "minimum": 5,
                        "maximum": 120,
                        "type": "integer",
                        "description": "Maximum time in seconds to wait for each robots.txt request before timing out.",
                        "default": 30
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Select proxies to use for requests. Helps avoid IP blocking and rate limits. Datacenter proxies are fastest; Residential proxies are harder to detect."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
