AI / RAG Web Crawler
Pricing
from $0.50 / 1,000 results
Crawl any website and extract clean, LLM-ready Markdown chunks to feed AI agents, chatbots, and RAG pipelines. One row per embeddable chunk.
Developer
Group Oject
Maintained by Community
Crawl any website and get clean, LLM-ready Markdown chunks — ready to feed AI agents, chatbots, and RAG pipelines.
Point it at a docs site, knowledge base, or blog. It crawls the pages, strips the navigation/ads/boilerplate, converts the main content to clean Markdown, and (optionally) splits it into overlapping chunks. One dataset row per chunk — pipe it straight into a vector database.
⚡ Fast HTTP crawler (no headless browser). No API key required.
What it does
- Crawls from your start URLs, following links up to a depth/page limit you set (same-domain by default).
- Extracts the main content — removes nav, header, footer, sidebars, scripts, ads.
- Converts it to clean Markdown (headings, lists, links, code preserved).
- Chunks it into overlapping, embeddings-sized pieces for RAG.
Output is one row per chunk, each tagged with its source URL, title, and chunk position — exactly the shape you want for an embeddings/vector pipeline.
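The chunking step can be pictured as a sliding character window. The following is a minimal sketch, not the Actor's actual splitter: the function name `chunk_text` is hypothetical, and it assumes plain character windows, whereas the real implementation may prefer to break on headings or sentence boundaries. The parameters mirror `chunkSize`, `chunkOverlap`, and `minChunkChars` from the input table.

```python
def chunk_text(text, chunk_size=1000, overlap=100, min_chars=50):
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap  # how far the window advances each time
    pieces = []
    start = 0
    while start < len(text):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chars:  # drop fragments that are too small
            pieces.append(piece)
        if start + chunk_size >= len(text):  # this window reached the end
            break
        start += step
    # Tag each chunk with its position, mirroring the output row fields.
    return [
        {
            "chunkIndex": i,
            "chunkCount": len(pieces),
            "content": piece,
            "contentChars": len(piece),
        }
        for i, piece in enumerate(pieces)
    ]
```

With the defaults, a 2,500-character page yields three chunks starting at offsets 0, 900, and 1800, so each chunk repeats the last 100 characters of the previous one.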
Who it's for
- AI/RAG builders — turn a docs site or knowledge base into a clean corpus for retrieval.
- Chatbot makers — feed your support docs into a customer-facing assistant.
- Agent developers — give an agent a fresh, structured snapshot of a site.
- Data teams — bulk-convert web content to Markdown without writing a parser.
Input
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | — | URLs to crawl (plain strings or `{ "url": "..." }`) |
| `maxCrawlPages` | integer | 50 | Total page cap |
| `maxCrawlDepth` | integer | 1 | Link-hops from start URLs (0 = start URLs only) |
| `sameDomainOnly` | boolean | true | Only follow links on the start domain(s) |
| `includeUrlGlobs` | array | — | Only crawl URLs matching these globs (e.g. `https://site.com/docs/*`) |
| `excludeUrlGlobs` | array | — | Skip URLs matching these globs (e.g. `*.pdf`) |
| `chunkContent` | boolean | true | Split pages into RAG chunks (one row each) |
| `chunkSize` | integer | 1000 | Target characters per chunk |
| `chunkOverlap` | integer | 100 | Overlap chars between chunks |
| `minChunkChars` | integer | 50 | Drop chunks smaller than this |
| `saveHtml` | boolean | false | Also include cleaned HTML |
| `maxConcurrency` | integer | 10 | Pages crawled in parallel |
| `proxyConfiguration` | object | — | Optional Apify Proxy |
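
To make the interplay of `maxCrawlDepth` and `maxCrawlPages` concrete, here is a rough sketch of a breadth-first crawl plan over a toy link graph. It is an assumption about how such limits typically behave, not the Actor's code: `plan_crawl` and the `links` mapping are hypothetical stand-ins for real fetching, and details such as visit order or whether failed pages count toward the cap may differ in practice.

```python
from collections import deque


def plan_crawl(start_urls, links, max_pages=50, max_depth=1):
    """Return (url, depth) pairs in breadth-first visit order.

    `links` maps a URL to the URLs it links to; it replaces real HTTP
    fetching so the limits can be shown in isolation.
    """
    starts = list(dict.fromkeys(start_urls))  # de-duplicate, keep order
    queue = deque((url, 0) for url in starts)
    seen = set(starts)
    visited = []
    while queue and len(visited) < max_pages:  # total page cap
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:  # depth 0 means start URLs only
            continue
        for nxt in links.get(url, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return visited
```

For a graph where page A links to B and C, and both link to D, a depth of 1 visits A, B, and C; raising the depth to 2 adds D, and a page cap of 2 stops after A and B.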
Example input
    {
      "startUrls": [{ "url": "https://docs.apify.com/" }],
      "maxCrawlPages": 30,
      "maxCrawlDepth": 2,
      "includeUrlGlobs": ["https://docs.apify.com/*"],
      "chunkContent": true,
      "chunkSize": 1000,
      "chunkOverlap": 100
    }
More in examples/.
Output
One dataset row per chunk:
    {
      "url": "https://docs.apify.com/platform/actors",
      "title": "Actors | Apify Docs",
      "description": "Learn how Apify Actors work.",
      "chunkIndex": 0,
      "chunkCount": 4,
      "content": "# Actors\n\nActors are serverless programs...",
      "contentChars": 980,
      "depth": 1,
      "crawledAt": "2026-06-15T12:00:00.000Z"
    }
To build a vector index: embed the `content` field and store `url`, `title`, and `chunkIndex` as metadata. Done.
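As a hedged illustration of that mapping, the sketch below turns dataset rows into generic vector records. The names `to_vector_records` and `fake_embed` are hypothetical, the record layout (`id`, `vector`, `metadata`) is an assumption rather than any particular vector database's schema, and `fake_embed` is a placeholder that you would swap for a real embedding model.

```python
import hashlib


def to_vector_records(rows, embed):
    """Map crawler rows to generic {id, vector, metadata} records."""
    records = []
    for row in rows:
        # A stable ID per (url, chunkIndex) lets re-runs overwrite
        # existing entries instead of duplicating them.
        key = f'{row["url"]}#{row["chunkIndex"]}'
        records.append({
            "id": hashlib.sha1(key.encode("utf-8")).hexdigest(),
            "vector": embed(row["content"]),
            "metadata": {
                "url": row["url"],
                "title": row["title"],
                "chunkIndex": row["chunkIndex"],
            },
        })
    return records


def fake_embed(text):
    """Placeholder embedding; replace with a real model call."""
    return [float(len(text)), float(text.count(" "))]
```

Passing the embedding function in as an argument keeps the mapping independent of whichever model or vector store you end up using.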
Key-value store outputs
- `SUMMARY`: pages crawled/failed, total chunks, average chunk size, settings
Tips for clean RAG data
- Use `includeUrlGlobs` to stay inside the section you care about (e.g. `.../docs/*`) and skip marketing pages.
- `chunkSize` 800–1200 chars suits most embedding models; bump `chunkOverlap` to 150–200 for prose-heavy sites.
- Turn off `chunkContent` if you want whole pages (one row each) and prefer to chunk in your own pipeline.
- Exclude noise with `excludeUrlGlobs` (`*.pdf`, `*/tag/*`, `*/author/*`).
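To preview which URLs a set of globs would keep, a small filter like the one below can help. This is only an approximation under stated assumptions: `should_crawl` is a hypothetical helper, it uses Python's `fnmatch` rules (where `*` also matches `/`), and the Actor's own glob matching and same-domain check may behave differently at the edges.

```python
from fnmatch import fnmatchcase
from urllib.parse import urlsplit


def should_crawl(url, start_hosts, include_globs=(), exclude_globs=(),
                 same_domain_only=True):
    """Approximate the URL scoping rules with fnmatch-style globs."""
    host = urlsplit(url).hostname or ""
    if same_domain_only and host not in start_hosts:
        return False
    # When include globs are given, the URL must match at least one.
    if include_globs and not any(fnmatchcase(url, g) for g in include_globs):
        return False
    # Any matching exclude glob rejects the URL.
    return not any(fnmatchcase(url, g) for g in exclude_globs)
```

Running your candidate URLs through a check like this before a large crawl is a cheap way to confirm that marketing, tag, and author pages are filtered out.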
Limitations & compliance
- HTTP crawler — it reads server-rendered HTML. Pages that render content purely client-side (heavy SPA) may yield little; those need a browser-based crawler.
- Main-content extraction is heuristic (prefers `<article>`/`<main>`, strips common boilerplate). Unusual layouts may include or drop some content.
- You choose the targets. Crawl only sites you're permitted to, respect each site's terms and robots policy, and don't collect private or paywalled data. This Actor accesses publicly reachable pages only.
Changelog
See CHANGELOG.md.