# Sitemap Generator — Full-Site URL Discovery & Crawling (`junipr/sitemap-generator`) Actor

Generate XML sitemaps by crawling websites. Link following, robots.txt respect, configurable depth/limits. Valid XML with lastmod, changefreq, priority. URL inventory with status codes. Ideal for SEO and migrations.

- **URL**: https://apify.com/junipr/sitemap-generator.md
- **Developed by:** [junipr](https://apify.com/junipr) (community)
- **Categories:** SEO tools, Developer tools
- **Stats:** 3 total users, 0 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $1.30 / 1,000 pages crawled

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Sitemap Generator

### Introduction

Sitemap Generator is a production-grade Apify Actor that crawls a website and generates a standards-compliant XML sitemap. It discovers pages reachable through link following, respects configurable depth limits and URL patterns, and detects existing sitemaps from `robots.txt` and common paths like `/sitemap.xml`. The Actor outputs a ready-to-submit XML sitemap plus a structured page list with metadata including last modified dates, change frequency estimates, and calculated priority values.

**Primary use cases:**
- SEO professionals auditing and generating sitemaps for client sites
- Web developers building sitemaps for sites without CMS-generated ones
- DevOps teams automating sitemap generation in CI/CD pipelines
- Content teams verifying all pages are indexed and discoverable
- Migration specialists mapping old site structure for redirects

**Key differentiators:** JavaScript rendering support via Playwright for SPAs, automatic existing sitemap detection and merging, depth-based priority estimation with inbound link boosting, lastmod detection from HTTP headers and meta tags, and auto-split at 50K URLs per sitemap protocol spec.

### Why Use This Actor

| Feature | Sitemap Generator (this Actor) | Sitemap Generator (Apify) | XML Sitemap Creator | Screaming Frog |
|---------|-------------------|--------------------------|--------------------|-----------------------|
| JS-rendered pages | Yes (Playwright) | No | No | Yes (desktop) |
| Existing sitemap detection | Yes (robots.txt + paths) | No | Partial | Yes |
| Priority estimation | Depth + link count | None | Static values | Heuristic |
| lastmod from headers | Yes (multi-source) | No | No | Yes |
| changefreq estimation | Yes (content heuristic) | No | No | No |
| Output: XML + JSON | Both | XML only | XML only | XML + CSV |
| Auto-split >50K URLs | Yes with index | No | No | Yes |
| Canonical URL handling | Full support | No | No | Yes |
| PPE pricing | $2/1K pages | Compute-based | Compute-based | License fee |
| Zero-config | Yes | Yes | Mostly | No |

This Actor addresses the most common pain points with existing sitemap generators: failure on JavaScript-rendered pages, no detection of existing sitemaps, poor deduplication of query-parameterized URLs, and lack of meaningful priority values.

### How to Use

#### Zero-Config Quick Start

Just provide a start URL and run. Everything else has sensible defaults:

```json
{
    "startUrl": "https://example.com"
}
````

The Actor will crawl up to 500 pages (the `maxPages` default), generate an XML sitemap, and store it in the Key-Value Store under the `SITEMAP_XML` key.
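The stored sitemap can also be retrieved programmatically. The following is a minimal sketch, assuming the official `apify-client` Python package and that the run's default key-value store holds the `SITEMAP_XML` record. The helper name `fetch_sitemap_xml` and the local file name `sitemap.xml` are illustrative, not part of the Actor.

```python
def fetch_sitemap_xml(client, run, key="SITEMAP_XML"):
    """Return the sitemap text stored by a finished run, or None if the record is absent."""
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    record = store.get_record(key)
    if record is None:
        return None
    value = record["value"]
    # The client may hand back bytes or str depending on the record's content type.
    return value.decode("utf-8") if isinstance(value, bytes) else value


def main():
    # Imported here so the helper above stays usable without the package installed.
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient("<YOUR_API_TOKEN>")
    run = client.actor("junipr/sitemap-generator").call(
        run_input={"startUrl": "https://example.com"}
    )
    xml_text = fetch_sitemap_xml(client, run)
    if xml_text is not None:
        with open("sitemap.xml", "w", encoding="utf-8") as fh:
            fh.write(xml_text)
```

Call `main()` with a real API token to run the Actor and save the result locally.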

#### Step-by-Step

1. Go to the Actor's page in Apify Console
2. Enter your website URL in the **Start URL** field
3. (Optional) Adjust max pages, depth, or enable Playwright for JS-heavy sites
4. Click **Start** to run the Actor
5. When complete, download the XML sitemap from the **Key-Value Store** tab (`SITEMAP_XML`)
6. Upload the sitemap to Google Search Console or place it at your site's root

#### Common Configuration Recipes

**Quick Audit** — Default settings for a fast overview:

```json
{
    "startUrl": "https://example.com",
    "maxPages": 500,
    "crawlerType": "cheerio"
}
```

**Full Site Map** — Comprehensive crawl of the entire site:

```json
{
    "startUrl": "https://example.com",
    "maxPages": 50000,
    "maxDepth": 10,
    "crawlerType": "cheerio"
}
```

**Blog Only** — Generate sitemap for just the blog section:

```json
{
    "startUrl": "https://example.com/blog",
    "includePatterns": ["/blog/*"],
    "stayWithinPath": true
}
```

**SPA Site** — JavaScript-rendered single page application:

```json
{
    "startUrl": "https://app.example.com",
    "crawlerType": "playwright",
    "maxConcurrency": 5
}
```

**Compare with Existing** — See what your existing sitemap is missing:

```json
{
    "startUrl": "https://example.com",
    "existingSitemapAction": "compare"
}
```

### Input Configuration

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `startUrl` | string | *required* | Root URL to start crawling |
| `maxPages` | integer | 500 | Max pages to crawl (1-100,000) |
| `maxDepth` | integer | 5 | Link-following depth (0 = start URL only) |
| `includePatterns` | string\[] | `[]` | Glob patterns for URLs to include |
| `excludePatterns` | string\[] | file extensions | Glob patterns for URLs to exclude |
| `crawlerType` | string | "cheerio" | Engine: "cheerio" (fast) or "playwright" (JS) |
| `includeLastmod` | boolean | true | Include last modified dates |
| `includeChangefreq` | boolean | true | Include change frequency |
| `includePriority` | boolean | true | Include calculated priority |
| `checkExistingSitemap` | boolean | true | Detect existing sitemaps |
| `existingSitemapAction` | string | "merge" | merge, replace, or compare |
| `respectRobotsTxt` | boolean | true | Honor robots.txt directives |
| `sitemapFormat` | string | "xml" | Output: xml, txt, or both |
| `splitAtCount` | integer | 50000 | Auto-split threshold |

See the **Input Schema** tab for the complete list of parameters with detailed descriptions.

### Output Format

#### XML Sitemap (Key-Value Store: `SITEMAP_XML`)

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-01-15T10:30:00Z</lastmod>
    <changefreq>daily</changefreq>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2025-12-01T08:00:00Z</lastmod>
    <changefreq>monthly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```
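To consume this XML downstream, the sitemap namespace has to be supplied when querying elements. A small sketch using only the Python standard library follows; the function name `parse_sitemap` is illustrative, and the sample string is a shortened copy of the output above.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemap protocol; required for element lookups.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def parse_sitemap(xml_text):
    """Return one dict per <url> entry with loc, lastmod, changefreq and priority."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    entries = []
    for url in root.findall("sm:url", SITEMAP_NS):
        entries.append({
            "loc": url.findtext("sm:loc", default=None, namespaces=SITEMAP_NS),
            "lastmod": url.findtext("sm:lastmod", default=None, namespaces=SITEMAP_NS),
            "changefreq": url.findtext("sm:changefreq", default=None, namespaces=SITEMAP_NS),
            "priority": url.findtext("sm:priority", default=None, namespaces=SITEMAP_NS),
        })
    return entries


SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <priority>1.0</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <priority>0.8</priority>
  </url>
</urlset>"""

entries = parse_sitemap(SAMPLE)
```

Optional elements that are missing from an entry come back as `None`, which matches runs where `includeLastmod`, `includeChangefreq`, or `includePriority` is disabled.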

#### Dataset Item (one per page crawled)

```json
{
    "url": "https://example.com/about",
    "statusCode": 200,
    "depth": 1,
    "title": "About Us",
    "lastModified": "2025-12-01T08:00:00.000Z",
    "changefreq": "monthly",
    "priority": 0.8,
    "contentType": "text/html",
    "responseTimeMs": 245,
    "inSitemap": true,
    "excludeReason": null
}
```
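Dataset items in this shape lend themselves to quick audits. The sketch below counts status codes and lists URLs left out of the sitemap. It relies only on the field names shown above; the function name and the sample `excludeReason` value are made up for illustration.

```python
from collections import Counter


def summarize_items(items):
    """Count HTTP status codes and collect URLs that were left out of the sitemap."""
    status_counts = Counter(item.get("statusCode") for item in items)
    excluded = [
        (item["url"], item.get("excludeReason"))
        for item in items
        if not item.get("inSitemap", False)
    ]
    return {"statusCounts": dict(status_counts), "excluded": excluded}


# Hypothetical items; "noindex" is an invented excludeReason value.
sample_items = [
    {"url": "https://example.com/about", "statusCode": 200, "inSitemap": True, "excludeReason": None},
    {"url": "https://example.com/old", "statusCode": 404, "inSitemap": False, "excludeReason": "noindex"},
]
summary = summarize_items(sample_items)
```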

#### Run Summary (Key-Value Store: `RUN_SUMMARY`)

```json
{
    "startUrl": "https://example.com",
    "totalPagesCrawled": 347,
    "totalPagesInSitemap": 312,
    "pagesExcluded": 35,
    "pagesSkippedByRobots": 8,
    "pagesFailed": 3,
    "duplicatesRemoved": 12,
    "existingSitemapFound": true,
    "existingSitemapUrls": 290,
    "durationMs": 45200,
    "sitemapSplitCount": 1
}
```

### Tips and Advanced Usage

#### Optimizing Crawl Speed

- Use `crawlerType: "cheerio"` for static sites — it is 5-10x faster than Playwright and uses far less memory
- Increase `maxConcurrency` for faster crawls on sites that handle high request rates
- Set `excludeQueryParams: true` (default) to avoid crawling the same page with different query strings
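As an illustration of query-parameter deduplication, the sketch below normalizes URLs by dropping the query string before comparing them. This is one plausible reading of `excludeQueryParams`, not the Actor's confirmed implementation; dropping the fragment as well is an extra assumption.

```python
from urllib.parse import urlsplit, urlunsplit


def strip_query(url):
    """Return the URL without its query string (and, by assumption, without its fragment)."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))


def dedupe(urls):
    """Keep the first occurrence of each URL after query stripping, preserving order."""
    seen = set()
    unique = []
    for url in urls:
        key = strip_query(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique


unique_urls = dedupe([
    "https://example.com/p?sort=asc",
    "https://example.com/p?sort=desc",
    "https://example.com/q",
])
```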

#### URL Pattern Filtering

- Use `includePatterns` to limit the sitemap to specific sections: `["/blog/*", "/products/*"]`
- Use `excludePatterns` to skip admin pages, API endpoints, or file downloads
- Patterns use glob syntax: `*` matches anything except `/`, `**` matches anything including `/`
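The glob rules above can be sketched as a translation to regular expressions. This is a minimal illustration, assuming patterns are matched against the whole URL path and that exclude patterns are applied after include patterns, as the input schema states. The Actor's actual matcher may differ, for example in whether it matches the full URL or only the path.

```python
import re


def glob_to_regex(pattern):
    """Translate a glob where `*` stops at `/` and `**` crosses `/` into a regex."""
    parts = []
    i = 0
    while i < len(pattern):
        if pattern.startswith("**", i):
            parts.append(".*")       # `**` matches anything, including `/`
            i += 2
        elif pattern[i] == "*":
            parts.append("[^/]*")    # `*` matches anything except `/`
            i += 1
        else:
            parts.append(re.escape(pattern[i]))
            i += 1
    return re.compile("".join(parts))


def path_matches(pattern, path):
    """True when the whole path matches the glob pattern."""
    return glob_to_regex(pattern).fullmatch(path) is not None


def is_included(path, include_patterns, exclude_patterns):
    """An empty include list means include everything; excludes are checked afterwards."""
    if include_patterns and not any(path_matches(p, path) for p in include_patterns):
        return False
    return not any(path_matches(p, path) for p in exclude_patterns)
```

Under these assumptions `/blog/*` matches `/blog/post` but not `/blog/2024/post`, while `/blog/**` matches both.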

#### Existing Sitemap Workflows

- **Merge** (default): Combines crawled URLs with the existing sitemap for a complete picture
- **Compare**: Generates a diff report showing URLs missing from your current sitemap and URLs in the sitemap that are no longer accessible
- **Replace**: Ignores the existing sitemap entirely and generates a fresh one from the crawl
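Conceptually, the compare workflow is a pair of set differences. The sketch below is a simplified model, not the Actor's report format: the function and key names are hypothetical, and "no longer accessible" is approximated as "not fetched with a 2xx or 3xx status during the crawl".

```python
def compare_sitemaps(crawled_status, existing_urls):
    """Diff crawl results (url -> HTTP status) against URLs listed in an existing sitemap."""
    reachable = {url for url, status in crawled_status.items() if 200 <= status < 400}
    existing = set(existing_urls)
    return {
        # Found by the crawl but absent from the current sitemap.
        "missingFromSitemap": sorted(reachable - existing),
        # Listed in the current sitemap but not reachable in this crawl.
        "inSitemapButUnreachable": sorted(existing - reachable),
    }


diff = compare_sitemaps(
    {"https://example.com/": 200, "https://example.com/new": 200, "https://example.com/gone": 404},
    ["https://example.com/", "https://example.com/gone"],
)
```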

#### Submitting to Search Engines

After generating your sitemap, download it from the Key-Value Store and either upload it to Google Search Console or place it at your site root and reference it with a `Sitemap:` line in `robots.txt`. Note that Google retired its sitemap ping endpoint (`https://www.google.com/ping?sitemap=...`) in 2023, so pinging Google programmatically no longer works.

### Pricing

This Actor uses **Pay-Per-Event (PPE)** pricing at **$2.00 per 1,000 pages crawled**.

A billable event occurs when the Actor successfully fetches a URL, processes it, and records the result. You are NOT charged for URLs blocked by robots.txt, duplicate URLs filtered before request, failed requests, or the initial robots.txt and existing sitemap fetches.

#### Cost Examples

| Scenario | Pages | Cost |
|----------|-------|------|
| Small blog (50 pages) | 50 | $0.10 |
| Business site (200 pages) | 200 | $0.40 |
| E-commerce (5,000 pages) | 5,000 | $10.00 |
| News site (50,000 pages) | 50,000 | $100.00 |

Plus standard Apify platform compute costs based on memory and runtime.
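The figures in the table follow directly from the per-page rate. A short sketch of the arithmetic is below, using the $2.00 per 1,000 pages rate stated in this README. Since the store listing quotes a price "from $1.30", the rate is kept as a parameter rather than a fixed constant.

```python
from decimal import Decimal


def estimate_cost(pages, price_per_1000=Decimal("2.00")):
    """Return the event cost in USD for a number of billable pages, rounded to cents."""
    cost = Decimal(pages) / Decimal(1000) * price_per_1000
    return cost.quantize(Decimal("0.01"))


# Reproduce the table rows above.
costs = {pages: estimate_cost(pages) for pages in (50, 200, 5000, 50000)}
```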

### FAQ

#### Does it handle JavaScript-rendered pages?

Yes. Set `crawlerType` to `"playwright"` to enable full browser rendering. This handles React, Next.js, Vue, Angular, and any other SPA framework. The Playwright mode uses a real Chromium browser to render pages before extracting links, so it discovers routes that only exist in client-side JavaScript.

#### How does it estimate priority values?

Priority is calculated from two factors: page depth (distance from the homepage) and inbound link count. The homepage always gets priority 1.0. Each additional level of depth reduces priority by 0.2, down to a minimum of 0.1. Pages with many inbound links from other pages on the site receive a boost of up to +0.2.
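A rough sketch of this rule follows. The depth penalty, floor, and maximum boost come from the description above. How the boost scales with inbound links is not specified, so the linear ramp and the `links_for_full_boost` parameter are assumptions, as is capping the result at 1.0 (the upper bound allowed by the sitemap protocol).

```python
def estimate_priority(depth, inbound_links=0, links_for_full_boost=10):
    """Estimate sitemap priority from crawl depth and inbound link count."""
    # 1.0 at the homepage, minus 0.2 per level, never below 0.1.
    base = max(0.1, 1.0 - 0.2 * depth)
    # Assumed: the boost grows linearly and reaches +0.2 at `links_for_full_boost` links.
    boost = 0.2 * min(inbound_links, links_for_full_boost) / links_for_full_boost
    # Priority values in the sitemap protocol range from 0.0 to 1.0.
    return round(min(1.0, base + boost), 1)
```

With no inbound-link boost, a depth-1 page gets 0.8, which matches the `/about` entry in the sample output.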

#### Can it detect my existing sitemap?

Yes. When `checkExistingSitemap` is enabled (the default), the Actor checks `robots.txt` for `Sitemap` directives and probes common paths like `/sitemap.xml` and `/sitemap_index.xml`. It supports sitemap index files and will recursively fetch all sub-sitemaps.
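The discovery step can be approximated with the Python standard library. The sketch below collects `Sitemap` directives from a robots.txt body and appends the common paths named above. It only builds the candidate list; fetching and recursing into sitemap index files is left out. The function name and sample robots.txt are illustrative.

```python
from urllib.parse import urljoin
from urllib.robotparser import RobotFileParser

COMMON_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]


def candidate_sitemap_urls(base_url, robots_txt):
    """Return sitemap URLs declared in robots.txt followed by common fallback paths."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    declared = parser.site_maps() or []  # site_maps() returns None when nothing is declared
    probed = [urljoin(base_url, path) for path in COMMON_PATHS]
    candidates = []
    for url in [*declared, *probed]:
        if url not in candidates:
            candidates.append(url)
    return candidates


robots = "User-agent: *\nDisallow: /admin/\nSitemap: https://example.com/news-sitemap.xml\n"
candidates = candidate_sitemap_urls("https://example.com", robots)
```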

#### What happens if my site has more than 50,000 URLs?

The sitemap protocol limits each sitemap file to 50,000 URLs. When this limit is exceeded, the Actor automatically splits the output into multiple sitemap files and generates a sitemap index file (`SITEMAP_INDEX_XML` in the Key-Value Store) that references all the individual sitemaps.
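The splitting logic can be pictured as chunking the URL list and writing an index that points at each chunk. This is a sketch under assumptions: the chunk file URLs (`sitemap-1.xml` and so on) are invented for the example, and the Actor's own naming of the split files is not documented here.

```python
import xml.etree.ElementTree as ET

SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def split_urls(urls, split_at=50000):
    """Split a URL list into chunks of at most `split_at` entries."""
    return [urls[i:i + split_at] for i in range(0, len(urls), split_at)]


def build_sitemap_index(sitemap_urls):
    """Build a sitemap index document that references each sitemap file."""
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for url in sitemap_urls:
        entry = ET.SubElement(root, "sitemap")
        ET.SubElement(entry, "loc").text = url
    return ET.tostring(root, encoding="utf-8", xml_declaration=True).decode("utf-8")


all_urls = [f"https://example.com/page-{n}" for n in range(120000)]
chunks = split_urls(all_urls)
# Hypothetical public locations for the split files.
index_xml = build_sitemap_index(
    [f"https://example.com/sitemap-{n}.xml" for n in range(1, len(chunks) + 1)]
)
```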

#### Does it respect robots.txt?

Yes. The `respectRobotsTxt` option is enabled by default. The Actor parses robots.txt for `Disallow` directives and `Crawl-delay` values. Disallowed paths are skipped entirely (never requested), and crawl delay is honored by reducing concurrency.
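For a sense of what these checks involve, the sketch below uses Python's `urllib.robotparser` to decide whether a URL may be requested and to read the crawl delay. It returns the delay in seconds rather than adjusting concurrency as the Actor does, and the function names are illustrative.

```python
from urllib.robotparser import RobotFileParser


def load_rules(robots_txt):
    """Parse a robots.txt body into a reusable rule set."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def plan_request(parser, user_agent, url):
    """Return (allowed, delay_seconds) for a URL under the parsed rules."""
    allowed = parser.can_fetch(user_agent, url)
    delay = parser.crawl_delay(user_agent)  # None when no Crawl-delay applies
    return allowed, float(delay) if delay is not None else 0.0


rules = load_rules("User-agent: *\nDisallow: /admin/\nCrawl-delay: 2\n")
blocked = plan_request(rules, "JuniprSitemapBot/1.0", "https://example.com/admin/users")
allowed = plan_request(rules, "JuniprSitemapBot/1.0", "https://example.com/about")
```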

#### Can I filter which pages are included?

Yes. Use `includePatterns` to specify glob patterns for URLs that should appear in the sitemap (e.g., `["/blog/*"]`). Use `excludePatterns` to exclude specific paths or file types. Common binary file extensions are excluded by default.

#### How often should I regenerate my sitemap?

For most sites, weekly or monthly regeneration is sufficient. For news sites or frequently updated content, consider daily runs. You can schedule the Actor on Apify to run automatically at any interval.

#### What's a "page crawled" for pricing purposes?

A page crawled is any unique URL that the Actor fetches successfully, meaning it receives an HTTP 2xx or 3xx response. Pages that fail to load, URLs blocked by robots.txt, and duplicates filtered before the request is made are not counted as billable events.

# Actor input Schema

## `startUrl` (type: `string`):

The root URL to start crawling from. Must include protocol (https://).

## `maxPages` (type: `integer`):

Maximum number of pages to crawl. Higher values take longer but produce a more complete sitemap.

## `maxDepth` (type: `integer`):

Maximum link-following depth from start URL. 0 means only the start URL. Higher values discover deeper pages.

## `includePatterns` (type: `array`):

URL patterns to include (glob syntax). Empty means include all URLs. Examples: "/blog/*", "*.html", "/products/\*".

## `excludePatterns` (type: `array`):

URL patterns to exclude (glob syntax). Applied after include patterns. Common file extensions are excluded by default.

## `excludeQueryParams` (type: `boolean`):

Treat URLs with different query parameters as the same page for deduplication.

## `respectCanonical` (type: `boolean`):

Follow `<link rel="canonical">` tags. Index the canonical URL and skip duplicates.

## `stayWithinDomain` (type: `boolean`):

Only crawl pages on the same domain as the start URL.

## `stayWithinPath` (type: `boolean`):

Only crawl pages under the same path prefix as the start URL.

## `includeLastmod` (type: `boolean`):

Include `<lastmod>` in the sitemap. Detected from Last-Modified header, `<meta>` tags, or schema.org dateModified.
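As an illustration of the header-based source, an HTTP `Last-Modified` value has to be converted to the W3C datetime form used in sitemaps. The sketch below shows that conversion only; it is not the Actor's code, and the `<meta>` and schema.org sources are not covered.

```python
from datetime import timezone
from email.utils import parsedate_to_datetime


def http_date_to_lastmod(header_value):
    """Convert an HTTP date such as a Last-Modified header into a UTC lastmod string."""
    try:
        parsed = parsedate_to_datetime(header_value)
    except (TypeError, ValueError):
        return None  # unparseable header: leave lastmod out
    if parsed.tzinfo is None:
        # Treat dates without an explicit offset as UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


lastmod = http_date_to_lastmod("Thu, 15 Jan 2026 10:30:00 GMT")
```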

## `includeChangefreq` (type: `boolean`):

Include `<changefreq>` in the sitemap. Estimated from content type heuristics.

## `includePriority` (type: `boolean`):

Include `<priority>` in the sitemap. Calculated from page depth and inbound link count.

## `defaultChangefreq` (type: `string`):

Default changefreq value when not estimable from content type.

## `sitemapFormat` (type: `string`):

Output format: xml (standard XML sitemap), txt (plain URL list), or both.

## `splitAtCount` (type: `integer`):

Auto-split sitemap at this many URLs and generate a sitemap index. Per sitemap protocol spec, max is 50,000.

## `checkExistingSitemap` (type: `boolean`):

Check for existing sitemaps at robots.txt and common paths before crawling.

## `existingSitemapAction` (type: `string`):

What to do when an existing sitemap is found: merge (combine with crawled), replace (ignore existing), or compare (output diff).

## `sitemapPaths` (type: `array`):

Additional paths to check for existing sitemaps, besides robots.txt.

## `crawlerType` (type: `string`):

Crawler engine: cheerio (fast, static HTML) or playwright (JS rendering for SPAs).

## `maxConcurrency` (type: `integer`):

Maximum concurrent requests (Cheerio) or browser pages (Playwright). Higher values crawl faster but use more resources.

## `requestTimeout` (type: `integer`):

Timeout per page request in milliseconds.

## `maxRetries` (type: `integer`):

Retry failed page requests up to this many times.

## `respectRobotsTxt` (type: `boolean`):

Honor robots.txt crawl directives and delays. Recommended to leave enabled.

## `userAgent` (type: `string`):

Custom User-Agent string sent with all requests.

## `proxyConfiguration` (type: `object`):

Proxy settings. Defaults to Apify datacenter proxies.

## `httpHeaders` (type: `object`):

Custom HTTP headers for all requests.

## `cookies` (type: `array`):

Cookies to set for all requests. Each object needs name, value, and domain.

## Actor input object example

```json
{
  "startUrl": "https://crawlee.dev",
  "maxPages": 500,
  "maxDepth": 5,
  "includePatterns": [],
  "excludePatterns": [
    "*.pdf",
    "*.zip",
    "*.jpg",
    "*.png",
    "*.gif",
    "*.svg",
    "*.css",
    "*.js"
  ],
  "excludeQueryParams": true,
  "respectCanonical": true,
  "stayWithinDomain": true,
  "stayWithinPath": false,
  "includeLastmod": true,
  "includeChangefreq": true,
  "includePriority": true,
  "defaultChangefreq": "weekly",
  "sitemapFormat": "xml",
  "splitAtCount": 50000,
  "checkExistingSitemap": true,
  "existingSitemapAction": "merge",
  "sitemapPaths": [
    "/sitemap.xml",
    "/sitemap_index.xml",
    "/sitemap/"
  ],
  "crawlerType": "cheerio",
  "maxConcurrency": 20,
  "requestTimeout": 30000,
  "maxRetries": 3,
  "respectRobotsTxt": true,
  "userAgent": "JuniprSitemapBot/1.0",
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "httpHeaders": {}
}
```

# Actor output Schema

## `results` (type: `string`):

Discovered URLs with metadata including last modified date, change frequency, priority, and HTTP status.

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrl": "https://crawlee.dev",
    "maxPages": 500,
    "maxDepth": 5,
    "defaultChangefreq": "weekly",
    "sitemapFormat": "xml",
    "splitAtCount": 50000,
    "existingSitemapAction": "merge",
    "crawlerType": "cheerio",
    "maxConcurrency": 20,
    "requestTimeout": 30000,
    "maxRetries": 3,
    "userAgent": "JuniprSitemapBot/1.0",
    "proxyConfiguration": {
        "useApifyProxy": true
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("junipr/sitemap-generator").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrl": "https://crawlee.dev",
    "maxPages": 500,
    "maxDepth": 5,
    "defaultChangefreq": "weekly",
    "sitemapFormat": "xml",
    "splitAtCount": 50000,
    "existingSitemapAction": "merge",
    "crawlerType": "cheerio",
    "maxConcurrency": 20,
    "requestTimeout": 30000,
    "maxRetries": 3,
    "userAgent": "JuniprSitemapBot/1.0",
    "proxyConfiguration": { "useApifyProxy": True },
}

# Run the Actor and wait for it to finish
run = client.actor("junipr/sitemap-generator").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrl": "https://crawlee.dev",
  "maxPages": 500,
  "maxDepth": 5,
  "defaultChangefreq": "weekly",
  "sitemapFormat": "xml",
  "splitAtCount": 50000,
  "existingSitemapAction": "merge",
  "crawlerType": "cheerio",
  "maxConcurrency": 20,
  "requestTimeout": 30000,
  "maxRetries": 3,
  "userAgent": "JuniprSitemapBot/1.0",
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}' |
apify call junipr/sitemap-generator --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=junipr/sitemap-generator",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Sitemap Generator — Full-Site URL Discovery & Crawling",
        "description": "Generate XML sitemaps by crawling websites. Link following, robots.txt respect, configurable depth/limits. Valid XML with lastmod, changefreq, priority. URL inventory with status codes. Ideal for SEO and migrations.",
        "version": "1.0",
        "x-build-id": "ZOMLlMDf19gEUDRIi"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/junipr~sitemap-generator/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-junipr-sitemap-generator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/junipr~sitemap-generator/runs": {
            "post": {
                "operationId": "runs-sync-junipr-sitemap-generator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/junipr~sitemap-generator/run-sync": {
            "post": {
                "operationId": "run-sync-junipr-sitemap-generator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrl": {
                        "title": "Start URL",
                        "type": "string",
                        "description": "The root URL to start crawling from. Must include protocol (https://).",
                        "default": "https://crawlee.dev"
                    },
                    "maxPages": {
                        "title": "Max Pages",
                        "minimum": 1,
                        "maximum": 100000,
                        "type": "integer",
                        "description": "Maximum number of pages to crawl. Higher values take longer but produce a more complete sitemap.",
                        "default": 500
                    },
                    "maxDepth": {
                        "title": "Max Crawl Depth",
                        "minimum": 0,
                        "maximum": 20,
                        "type": "integer",
                        "description": "Maximum link-following depth from start URL. 0 means only the start URL. Higher values discover deeper pages.",
                        "default": 5
                    },
                    "includePatterns": {
                        "title": "Include URL Patterns",
                        "type": "array",
                        "description": "URL patterns to include (glob syntax). Empty means include all URLs. Examples: \"/blog/*\", \"*.html\", \"/products/*\".",
                        "items": {
                            "type": "string"
                        },
                        "default": []
                    },
                    "excludePatterns": {
                        "title": "Exclude URL Patterns",
                        "type": "array",
                        "description": "URL patterns to exclude (glob syntax). Applied after include patterns. Common file extensions are excluded by default.",
                        "items": {
                            "type": "string"
                        },
                        "default": [
                            "*.pdf",
                            "*.zip",
                            "*.jpg",
                            "*.png",
                            "*.gif",
                            "*.svg",
                            "*.css",
                            "*.js"
                        ]
                    },
                    "excludeQueryParams": {
                        "title": "Exclude Query Parameters",
                        "type": "boolean",
                        "description": "Treat URLs with different query parameters as the same page for deduplication.",
                        "default": true
                    },
                    "respectCanonical": {
                        "title": "Respect Canonical URLs",
                        "type": "boolean",
                        "description": "Follow <link rel=\"canonical\"> tags. Index the canonical URL and skip duplicates.",
                        "default": true
                    },
                    "stayWithinDomain": {
                        "title": "Stay Within Domain",
                        "type": "boolean",
                        "description": "Only crawl pages on the same domain as the start URL.",
                        "default": true
                    },
                    "stayWithinPath": {
                        "title": "Stay Within Path",
                        "type": "boolean",
                        "description": "Only crawl pages under the same path prefix as the start URL.",
                        "default": false
                    },
                    "includeLastmod": {
                        "title": "Include Last Modified",
                        "type": "boolean",
                        "description": "Include <lastmod> in the sitemap. Detected from Last-Modified header, <meta> tags, or schema.org dateModified.",
                        "default": true
                    },
                    "includeChangefreq": {
                        "title": "Include Change Frequency",
                        "type": "boolean",
                        "description": "Include <changefreq> in the sitemap. Estimated from content type heuristics.",
                        "default": true
                    },
                    "includePriority": {
                        "title": "Include Priority",
                        "type": "boolean",
                        "description": "Include <priority> in the sitemap. Calculated from page depth and inbound link count.",
                        "default": true
                    },
                    "defaultChangefreq": {
                        "title": "Default Change Frequency",
                        "enum": [
                            "always",
                            "hourly",
                            "daily",
                            "weekly",
                            "monthly",
                            "yearly",
                            "never"
                        ],
                        "type": "string",
                        "description": "Default changefreq value when not estimable from content type.",
                        "default": "weekly"
                    },
                    "sitemapFormat": {
                        "title": "Sitemap Format",
                        "enum": [
                            "xml",
                            "txt",
                            "both"
                        ],
                        "type": "string",
                        "description": "Output format: xml (standard XML sitemap), txt (plain URL list), or both.",
                        "default": "xml"
                    },
                    "splitAtCount": {
                        "title": "Split at URL Count",
                        "minimum": 1000,
                        "maximum": 50000,
                        "type": "integer",
                        "description": "Auto-split sitemap at this many URLs and generate a sitemap index. Per sitemap protocol spec, max is 50,000.",
                        "default": 50000
                    },
                    "checkExistingSitemap": {
                        "title": "Check Existing Sitemap",
                        "type": "boolean",
                        "description": "Check for existing sitemaps referenced in robots.txt and at common paths before crawling.",
                        "default": true
                    },
                    "existingSitemapAction": {
                        "title": "Existing Sitemap Action",
                        "enum": [
                            "merge",
                            "replace",
                            "compare"
                        ],
                        "type": "string",
                        "description": "What to do when an existing sitemap is found: merge (combine with crawled), replace (ignore existing), or compare (output diff).",
                        "default": "merge"
                    },
                    "sitemapPaths": {
                        "title": "Sitemap Paths to Check",
                        "type": "array",
                        "description": "Additional paths to check for existing sitemaps, besides robots.txt.",
                        "items": {
                            "type": "string"
                        },
                        "default": [
                            "/sitemap.xml",
                            "/sitemap_index.xml",
                            "/sitemap/"
                        ]
                    },
                    "crawlerType": {
                        "title": "Crawler Type",
                        "enum": [
                            "cheerio",
                            "playwright"
                        ],
                        "type": "string",
                        "description": "Crawler engine: cheerio (fast, static HTML) or playwright (JS rendering for SPAs).",
                        "default": "cheerio"
                    },
                    "maxConcurrency": {
                        "title": "Max Concurrency",
                        "minimum": 1,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Maximum concurrent requests (Cheerio) or browser pages (Playwright). Higher values crawl faster but use more resources.",
                        "default": 20
                    },
                    "requestTimeout": {
                        "title": "Request Timeout (ms)",
                        "minimum": 5000,
                        "maximum": 120000,
                        "type": "integer",
                        "description": "Timeout per page request in milliseconds.",
                        "default": 30000
                    },
                    "maxRetries": {
                        "title": "Max Retries",
                        "minimum": 0,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Retry failed page requests up to this many times.",
                        "default": 3
                    },
                    "respectRobotsTxt": {
                        "title": "Respect robots.txt",
                        "type": "boolean",
                        "description": "Honor robots.txt crawl directives and delays. Recommended to leave enabled.",
                        "default": true
                    },
                    "userAgent": {
                        "title": "User Agent",
                        "type": "string",
                        "description": "Custom User-Agent string sent with all requests.",
                        "default": "JuniprSitemapBot/1.0"
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Proxy settings. Defaults to Apify datacenter proxies.",
                        "default": {
                            "useApifyProxy": true
                        }
                    },
                    "httpHeaders": {
                        "title": "HTTP Headers",
                        "type": "object",
                        "description": "Custom HTTP headers for all requests.",
                        "default": {}
                    },
                    "cookies": {
                        "title": "Cookies",
                        "type": "array",
                        "description": "Cookies to set for all requests. Each object needs name, value, and domain.",
                        "items": {
                            "type": "object",
                            "properties": {
                                "name": {
                                    "type": "string",
                                    "title": "Cookie Name",
                                    "description": "Name of the cookie"
                                },
                                "value": {
                                    "type": "string",
                                    "title": "Cookie Value",
                                    "description": "Value of the cookie"
                                },
                                "domain": {
                                    "type": "string",
                                    "title": "Cookie Domain",
                                    "description": "Domain the cookie applies to"
                                }
                            },
                            "required": [
                                "name",
                                "value",
                                "domain"
                            ]
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
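
The numeric limits and enum values in the input schema above can be checked locally before a run is started. The following is a minimal sketch, not part of the Actor or of the Apify client: `build_input`, `RANGES`, and `ENUMS` are hypothetical names, and only the fields visible in this part of the schema are covered (the field that carries the start URL is defined elsewhere and is left out here).

```python
# Limits and enum values copied from the input schema above.
# build_input is a hypothetical helper, not an Apify API.
RANGES = {
    "splitAtCount": (1000, 50000),
    "maxConcurrency": (1, 100),
    "requestTimeout": (5000, 120000),
    "maxRetries": (0, 10),
}
ENUMS = {
    "defaultChangefreq": {
        "always", "hourly", "daily", "weekly", "monthly", "yearly", "never",
    },
    "sitemapFormat": {"xml", "txt", "both"},
    "existingSitemapAction": {"merge", "replace", "compare"},
    "crawlerType": {"cheerio", "playwright"},
}


def build_input(**overrides):
    """Return the overrides as a dict if they satisfy the schema limits."""
    for key, value in overrides.items():
        if key in RANGES:
            low, high = RANGES[key]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                raise ValueError(f"{key} must be an integer")
            if not low <= value <= high:
                raise ValueError(f"{key} must be between {low} and {high}")
        elif key in ENUMS and value not in ENUMS[key]:
            raise ValueError(f"{key} must be one of {sorted(ENUMS[key])}")
        # Any other key (booleans, objects, arrays) passes through unchecked.
    return dict(overrides)


run_input = build_input(
    crawlerType="playwright",
    maxConcurrency=5,
    sitemapFormat="both",
)
```

Fields that are omitted fall back to the defaults listed in the schema, so the dictionary only needs the values that differ from them.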

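A response shaped like `runsResponseSchema` nests everything under `data`. As a rough sketch of how a caller might pull out the run status and the IDs of the run's default storages, assuming the response body is already available as a JSON string: `summarize_run` is a hypothetical helper, and the sample payload below is made up for illustration (only its field names come from the schema).

```python
import json

# Made-up payload for illustration; only the field names are taken
# from runsResponseSchema. The ID values are placeholders.
SAMPLE_RESPONSE = json.dumps({
    "data": {
        "id": "run-id-placeholder",
        "status": "READY",
        "defaultDatasetId": "dataset-id-placeholder",
        "defaultKeyValueStoreId": "store-id-placeholder",
        "usageTotalUsd": 0.00005,
    }
})


def summarize_run(raw):
    """Extract the fields needed to track a run and find its storages."""
    data = json.loads(raw).get("data", {})
    return {
        "run_id": data.get("id"),
        "status": data.get("status"),
        "dataset_id": data.get("defaultDatasetId"),
        "key_value_store_id": data.get("defaultKeyValueStoreId"),
    }


summary = summarize_run(SAMPLE_RESPONSE)
```

Using `.get()` throughout means a partial or empty response yields `None` values instead of raising `KeyError`, which is convenient when polling a run that has not populated every field yet.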