# Website Content Crawler for LLM's (`salesblaster-ai/website-content-crawler`) Actor

Extract contact information and turn any website into clean, structured content ready for LLMs (e.g., AI lead magnets, RAG pipelines, and outbound personalization).

Most web scrapers dump raw HTML or unstructured text. This crawler is purpose-built for LLMs and optimized for lead generation.

- **URL**: https://apify.com/salesblaster-ai/website-content-crawler.md
- **Developed by:** [SalesBlaster AI](https://apify.com/salesblaster-ai) (community)
- **Categories:** AI, Lead generation, Agents
- **Stats:** 7 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use; you only pay for the Apify platform resources it consumes, and those rates get cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## LLM-Optimized Website Content Crawler

Extract contact information and turn any website into clean, structured content ready for LLMs (AI lead magnets, RAG pipelines, and outbound personalization).

### Why This Actor?

Most web scrapers dump raw HTML or unstructured text. This crawler is purpose-built for AI workflows — it extracts only the meaningful content, splits it into semantically coherent chunks with heading context, and scores each chunk for quality. The result: content your LLM can actually use without drowning in nav menus, cookie banners, and boilerplate.

**Built for agency owners and outbound teams who use AI lead magnets to start conversations with prospects.**

### Use Cases

#### AI Lead Magnets
Crawl a prospect's website before generating a personalized audit, report, or strategy doc. Feed the chunks directly into your LLM to produce a lead magnet that references real details from their site — not generic filler.

- **AI Automation Agency**: Crawl their site and generate a custom n8n workflow or automation map personalized to their business processes
- **Paid Ads Agency**: Crawl their brand and product pages to generate AI video/picture Meta ad creatives tailored to their offer
- **Web Design Agency**: Crawl their existing site and generate a fully custom landing page based on their real content and messaging
- **SEO Agency**: Crawl their site to produce a personalized SEO audit and competitor analysis with page-level recommendations
- **Lead Gen Agency**: Crawl their offer and ICP pages to generate sample cold email scripts and LinkedIn outbound sequences
- **Sales Agency**: Crawl their sales pages to build a free AI voice mock call agent or custom sales scripts for their offer
- **Content Agency**: Crawl their brand voice and existing content to generate a custom content calendar with sample carousel posts

#### RAG Knowledge Bases
Build a searchable knowledge base from any website. Chunks come pre-tagged with heading paths and content types, so you can filter by topic before stuffing your context window.

#### Outbound Personalization
Extract key details from a prospect's website to personalize cold outreach at scale. The contact extraction feature pulls emails, phone numbers, and social profiles automatically.

### How It Works

````

Website URL → Sitemap Discovery → Page Crawling → Content Extraction → Semantic Chunking → Quality Scoring
→ Contact Extraction (optional)

````

1. **Discover pages** — Finds pages via sitemap.xml or by following links (configurable strategy)
2. **Extract content** — Uses Mozilla Readability to strip nav, footer, ads, and boilerplate from each page
3. **Chunk by headings** — Splits content along the heading hierarchy so each chunk has semantic context (e.g., "About > Team > Leadership")
4. **Score quality** — Assigns a quality score, content type, and link density metric to each chunk
5. **Extract contacts** — Deduplicates emails, phone numbers, and social links across all crawled pages
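
To make step 3 more concrete, here is a minimal sketch of heading-based chunking. It is an illustration only, not the Actor's actual code: the function name `chunk_by_headings` is hypothetical, it handles only ATX-style (`#`) headings, and it does not special-case `#` lines inside fenced code blocks.

```python
import re

# Matches ATX headings such as "## Team" (1 to 6 leading '#').
HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)$")


def chunk_by_headings(markdown: str) -> list[dict]:
    """Split Markdown into one chunk per heading section, tracking the heading path."""
    chunks = []
    stack = []  # (level, title) pairs for the headings currently "open"
    body = []   # lines collected since the last heading

    def flush():
        text = "\n".join(body).strip()
        if text:
            chunks.append({
                "headingPath": " > ".join(title for _, title in stack),
                "markdown": text,
            })
        body.clear()

    for line in markdown.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Close headings at the same or a deeper level before opening the new one.
            while stack and stack[-1][0] >= level:
                stack.pop()
            stack.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks
```

Each returned chunk carries a `headingPath` such as `About > Team > Leadership`, mirroring the field described in the output section.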

### Input

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `startUrl` | string | *required* | Website URL to crawl |
| `maxPages` | integer | 20 | Maximum pages to crawl (1 to 1000) |
| `maxConcurrency` | integer | 5 | Concurrent page requests (1 to 20) |
| `sitemapStrategy` | enum | `"AUTO"` | `"AUTO"` / `"SITEMAP_FIRST"` / `"CRAWL_LINKS"` |
| `includePaths` | string[] | `[]` | Only crawl URLs matching these path prefixes (e.g., `["/blog"]`) |
| `excludePaths` | string[] | common defaults | Skip URLs matching these path prefixes |
| `excludeUrlRegex` | string | media/binary files | Regex pattern to exclude URLs |
| `chunkingOptions.maxChars` | number | 2000 | Max characters per chunk |
| `chunkingOptions.overlapChars` | number | 200 | Overlap between consecutive chunks |
| `extractContacts` | boolean | `true` | Extract emails, phones, and social links |
| `datasetName` | string | `"default"` | Name for the output dataset |
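
The interaction between `chunkingOptions.maxChars` and `chunkingOptions.overlapChars` may be easier to see in code. The sketch below is an assumption about how a character-window splitter with overlap typically works; the Actor's real splitter chunks by headings first and may behave differently. The function name `split_with_overlap` is hypothetical.

```python
def split_with_overlap(text: str, max_chars: int = 2000, overlap_chars: int = 200) -> list[str]:
    """Cut text into windows of at most max_chars, each sharing overlap_chars with the previous one."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    if not 0 <= overlap_chars < max_chars:
        raise ValueError("overlap_chars must be >= 0 and smaller than max_chars")

    step = max_chars - overlap_chars  # how far the window advances each time
    pieces = []
    start = 0
    while start < len(text):
        pieces.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return pieces
```

With the defaults, each window advances by 1800 characters, so the last 200 characters of one chunk reappear at the start of the next.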

### Output

#### Content Chunks (Dataset)

Each crawled page produces one or more chunk records:

```json
{
  "site": "example.com",
  "url": "https://example.com/about",
  "title": "About Us",
  "chunkIndex": 0,
  "chunkCount": 3,
  "headingPath": "About > Team > Leadership",
  "markdown": "# Team\n\nOur leadership team...",
  "contentType": "marketing",
  "quality": {
    "score": 85,
    "textLength": 1500,
    "linkDensity": 0.03,
    "hasStructure": true
  },
  "crawledAt": "2026-01-09T12:00:00Z",
  "datasetName": "my-crawl"
}
````

**Content types**: `blog`, `docs`, `legal`, `product`, `marketing`, `other`

The `headingPath` field gives your LLM the section context without needing to process the entire page — useful for filtering chunks by topic or building hierarchical summaries.
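
As a rough illustration of topic filtering, the sketch below selects chunk records by `headingPath` prefix and `contentType`. The helper name `filter_chunks` and its parameters are hypothetical; only the record fields come from the output format shown above.

```python
def filter_chunks(chunks, heading_prefix=None, content_types=None):
    """Keep chunks whose headingPath starts with heading_prefix and whose contentType is allowed."""
    selected = []
    for chunk in chunks:
        if heading_prefix and not chunk.get("headingPath", "").startswith(heading_prefix):
            continue
        if content_types and chunk.get("contentType") not in content_types:
            continue
        selected.append(chunk)
    return selected
```

For example, `filter_chunks(items, heading_prefix="About > Team")` would keep only the chunks under that section, where `items` is the list of dataset records from a run.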

#### Contact Summary (Key-Value Store)

Aggregated contact info across all crawled pages, stored under the `OUTPUT` key:

```json
{
  "summary": {
    "totalEmails": 5,
    "totalPhones": 3,
    "totalSocialLinks": 8,
    "socialBreakdown": {
      "linkedin": 3,
      "twitter": 2,
      "facebook": 3
    }
  },
  "contacts": {
    "emails": ["contact@example.com", "support@example.com"],
    "phones": ["+14155552671", "+14155552672"],
    "social": [
      {
        "platform": "linkedin",
        "url": "https://linkedin.com/company/example"
      }
    ]
  },
  "crawlStats": {
    "pagesVisited": 20,
    "pagesSkipped": 0,
    "errors": 0
  }
}
```
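
For outbound tooling it can be handy to flatten this summary into simple rows (for a CSV export or CRM import, say). The sketch below assumes the `OUTPUT` record has exactly the shape shown above; the helper name `contact_rows` is hypothetical.

```python
def contact_rows(output: dict) -> list[tuple[str, str]]:
    """Flatten the contact summary into (kind, value) pairs."""
    contacts = output.get("contacts", {})
    rows = [("email", email) for email in contacts.get("emails", [])]
    rows += [("phone", phone) for phone in contacts.get("phones", [])]
    # Social links are labeled with their platform name, e.g. "linkedin".
    rows += [(link["platform"], link["url"]) for link in contacts.get("social", [])]
    return rows
```

With the official Python client, the record itself would typically be read from the run's default key-value store, for example via `client.key_value_store(run["defaultKeyValueStoreId"]).get_record("OUTPUT")`, and its `value` passed to `contact_rows`.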

### Examples

#### Lead Magnet: Crawl a Prospect's Blog

Crawl their blog content to generate a personalized content audit.

```json
{
  "startUrl": "https://prospect-company.com/blog",
  "maxPages": 50,
  "includePaths": ["/blog"],
  "chunkingOptions": {
    "maxChars": 3000,
    "overlapChars": 300
  }
}
```

#### Lead Magnet: Full Site Audit

Crawl their entire site for a comprehensive UX or SEO review.

```json
{
  "startUrl": "https://prospect-company.com",
  "maxPages": 100,
  "sitemapStrategy": "SITEMAP_FIRST",
  "chunkingOptions": {
    "maxChars": 1500,
    "overlapChars": 150
  }
}
```

#### Outbound: Extract Contact Info

Quick crawl focused on finding emails and social profiles.

```json
{
  "startUrl": "https://prospect-company.com",
  "maxPages": 20,
  "extractContacts": true,
  "chunkingOptions": {
    "maxChars": 500,
    "overlapChars": 0
  }
}
```

### Tips

- **Start small**: Set `maxPages` to 10-20 for your first run, then increase once you see the output quality
- **Use `includePaths`** to focus on the most valuable sections (e.g., `/blog`, `/services`, `/case-studies`)
- **Larger chunks** (3000+ chars) work better for lead magnet generation; **smaller chunks** (1000-1500) work better for RAG retrieval
- **`SITEMAP_FIRST`** is faster and more complete for well-structured sites; **`CRAWL_LINKS`** is better for sites with missing or incomplete sitemaps
- **Quality scores** above 70 generally indicate high-value content worth including in your LLM prompts
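
The quality-score tip can be applied with a few lines of post-processing. The sketch below keeps chunks scoring at least 70, orders them best-first, and stops adding once a character budget is used up. The helper name `build_context` and the 8000-character budget are assumptions for illustration; tune both to your model's context window.

```python
def build_context(chunks, min_score=70, max_total_chars=8000):
    """Join the highest-scoring chunks into one prompt context string within a size budget."""
    good = [c for c in chunks if c.get("quality", {}).get("score", 0) >= min_score]
    good.sort(key=lambda c: c["quality"]["score"], reverse=True)

    parts, used = [], 0
    for chunk in good:
        # Prefix each chunk with its heading path so the LLM sees the section context.
        block = f"[{chunk.get('headingPath', '')}]\n{chunk['markdown']}"
        if used + len(block) > max_total_chars:
            continue  # skip chunks that would exceed the budget; smaller ones may still fit
        parts.append(block)
        used += len(block)
    return "\n\n".join(parts)
```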

## Contact

For more information or help, feel free to reach out to the creator:

- https://maxforbang.com/about
- max@salesblaster.ai

# Actor input Schema

## `startUrl` (type: `string`):

Single website URL to crawl (e.g., https://example.com)

## `maxPages` (type: `integer`):

Maximum number of pages to crawl from this site

## `maxConcurrency` (type: `integer`):

Maximum concurrent page requests

## `sitemapStrategy` (type: `string`):

How to discover URLs: AUTO (try sitemap first), SITEMAP\_FIRST (sitemap only), CRAWL\_LINKS (follow links only)

## `includePaths` (type: `array`):

Only crawl URLs matching these path prefixes (e.g., \['/blog', '/docs']). Empty = all paths.

## `excludePaths` (type: `array`):

Skip URLs matching these path prefixes (e.g., \['/admin', '/login'])

## `excludeUrlRegex` (type: `string`):

Skip URLs matching this regex pattern (e.g., `\.(pdf|zip|jpg|png)$`)
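
To show how `includePaths`, `excludePaths`, and `excludeUrlRegex` could combine, here is a minimal sketch of a URL filter. It is an assumption about the semantics based on the field descriptions (path-prefix matching, then a regex search on the full URL), not the Actor's actual implementation; the function name `should_crawl` is hypothetical.

```python
import re
from urllib.parse import urlparse


def should_crawl(url, include_paths=(), exclude_paths=(), exclude_url_regex=None):
    """Return True if the URL passes the include, exclude, and regex filters."""
    path = urlparse(url).path or "/"
    # An empty include list means every path is allowed.
    if include_paths and not any(path.startswith(prefix) for prefix in include_paths):
        return False
    if any(path.startswith(prefix) for prefix in exclude_paths):
        return False
    if exclude_url_regex and re.search(exclude_url_regex, url):
        return False
    return True
```

Note that the dot in the pattern should be escaped (`\.`). An unescaped `.` matches any character, so a pattern like `.(zip)$` would also exclude a page whose URL merely ends in `-zip`.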

## `chunkingOptions` (type: `object`):

Configuration for content chunking (JSON format: {"maxChars": 2000, "overlapChars": 200})

## `extractContacts` (type: `boolean`):

Extract emails, phone numbers, and social links from all pages

## Actor input object example

```json
{
  "startUrl": "https://salesblaster.ai",
  "maxPages": 20,
  "maxConcurrency": 5,
  "sitemapStrategy": "AUTO",
  "includePaths": [],
  "excludePaths": [
    "/admin",
    "/login",
    "/wp-admin",
    "/cart",
    "/checkout"
  ],
  "excludeUrlRegex": "\\.(pdf|zip|jpg|png|gif|svg|mp4|mp3|css|js)$",
  "chunkingOptions": {
    "maxChars": 2000,
    "overlapChars": 200
  },
  "extractContacts": true
}
```
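
Before starting a run, it can be useful to check an input object locally. The sketch below covers only the constraints stated in the OpenAPI specification later in this document (the `startUrl` pattern, the integer ranges, and the `sitemapStrategy` enum). The function name `validate_input` is hypothetical, and the platform performs its own validation regardless.

```python
import re

SITEMAP_STRATEGIES = {"AUTO", "SITEMAP_FIRST", "CRAWL_LINKS"}


def validate_input(actor_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input passes these checks."""
    errors = []

    start_url = actor_input.get("startUrl")
    if not isinstance(start_url, str) or not re.match(r"^https?://", start_url):
        errors.append("startUrl is required and must start with http:// or https://")

    # Integer fields with the defaults and ranges listed in the schema.
    for field, default, low, high in (("maxPages", 20, 1, 1000), ("maxConcurrency", 5, 1, 20)):
        value = actor_input.get(field, default)
        if not isinstance(value, int) or isinstance(value, bool) or not low <= value <= high:
            errors.append(f"{field} must be an integer between {low} and {high}")

    strategy = actor_input.get("sitemapStrategy", "AUTO")
    if strategy not in SITEMAP_STRATEGIES:
        errors.append(f"sitemapStrategy must be one of {sorted(SITEMAP_STRATEGIES)}")

    return errors
```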

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

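// Prepare Actor input (startUrl is required; other fields fall back to their defaults)
const input = {
    startUrl: 'https://salesblaster.ai',
    maxPages: 20,
};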

// Run the Actor and wait for it to finish
const run = await client.actor("salesblaster-ai/website-content-crawler").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

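# Prepare the Actor input (startUrl is required; other fields fall back to their defaults)
run_input = {
    "startUrl": "https://salesblaster.ai",
    "maxPages": 20,
}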

# Run the Actor and wait for it to finish
run = client.actor("salesblaster-ai/website-content-crawler").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{"startUrl": "https://salesblaster.ai"}' |
apify call salesblaster-ai/website-content-crawler --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=salesblaster-ai/website-content-crawler",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Website Content Crawler for LLM's",
        "description": "Extract contact information + turn any website into clean, structured content ready for LLM's (e.g. AI lead magnets, RAG pipelines, and outbound personalization).\n\nMost web scrapers dump raw HTML or unstructured text. This crawler is purpose-built for LLM's, and optimized for lead generation.",
        "version": "0.0",
        "x-build-id": "O2EbLNxaCQMKopVsJ"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/salesblaster-ai~website-content-crawler/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-salesblaster-ai-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/salesblaster-ai~website-content-crawler/runs": {
            "post": {
                "operationId": "runs-sync-salesblaster-ai-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/salesblaster-ai~website-content-crawler/run-sync": {
            "post": {
                "operationId": "run-sync-salesblaster-ai-website-content-crawler",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrl"
                ],
                "properties": {
                    "startUrl": {
                        "title": "Start URL",
                        "pattern": "^https?://",
                        "type": "string",
                        "description": "Single website URL to crawl (e.g., https://example.com)"
                    },
                    "maxPages": {
                        "title": "Maximum Pages",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Maximum number of pages to crawl from this site",
                        "default": 20
                    },
                    "maxConcurrency": {
                        "title": "Max Concurrency",
                        "minimum": 1,
                        "maximum": 20,
                        "type": "integer",
                        "description": "Maximum concurrent page requests",
                        "default": 5
                    },
                    "sitemapStrategy": {
                        "title": "Sitemap Strategy",
                        "enum": [
                            "AUTO",
                            "SITEMAP_FIRST",
                            "CRAWL_LINKS"
                        ],
                        "type": "string",
                        "description": "How to discover URLs: AUTO (try sitemap first), SITEMAP_FIRST (sitemap only), CRAWL_LINKS (follow links only)",
                        "default": "AUTO"
                    },
                    "includePaths": {
                        "title": "Include Paths",
                        "type": "array",
                        "description": "Only crawl URLs matching these path prefixes (e.g., ['/blog', '/docs']). Empty = all paths.",
                        "default": [],
                        "items": {
                            "type": "string"
                        }
                    },
                    "excludePaths": {
                        "title": "Exclude Paths",
                        "type": "array",
                        "description": "Skip URLs matching these path prefixes (e.g., ['/admin', '/login'])",
                        "default": [
                            "/admin",
                            "/login",
                            "/wp-admin",
                            "/cart",
                            "/checkout"
                        ],
                        "items": {
                            "type": "string"
                        }
                    },
                    "excludeUrlRegex": {
                        "title": "Exclude URL Regex",
                        "type": "string",
                        "description": "Skip URLs matching this regex pattern (e.g., '\\.(pdf|zip|jpg|png)$')",
                        "default": "\\.(pdf|zip|jpg|png|gif|svg|mp4|mp3|css|js)$"
                    },
                    "chunkingOptions": {
                        "title": "Chunking Options",
                        "type": "object",
                        "description": "Configuration for content chunking (JSON format: {\"maxChars\": 2000, \"overlapChars\": 200})",
                        "default": {
                            "maxChars": 2000,
                            "overlapChars": 200
                        }
                    },
                    "extractContacts": {
                        "title": "Extract Contacts",
                        "type": "boolean",
                        "description": "Extract emails, phone numbers, and social links from all pages",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
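
If you prefer to call the REST endpoint directly rather than use a client library, the `run-sync-get-dataset-items` path from the specification above can be invoked with the Python standard library. This is a minimal sketch: the helper names `build_request` and `run_and_get_items` are hypothetical, the 300-second timeout is an arbitrary choice, and the code assumes the endpoint returns the dataset items as a JSON array.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "salesblaster-ai~website-content-crawler"  # '~' separates username and Actor name


def build_request(actor_input: dict, token: str) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_get_items(actor_input: dict, token: str, timeout: float = 300):
    """Run the Actor synchronously and return the parsed dataset items."""
    request = build_request(actor_input, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

A call such as `run_and_get_items({"startUrl": "https://salesblaster.ai"}, "<YOUR_API_TOKEN>")` blocks until the run finishes, so for large crawls the asynchronous `/runs` endpoint is likely the better fit.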
