AI / RAG Web Crawler

Pricing

from $0.50 / 1,000 results

Crawl any website and extract clean, LLM-ready Markdown chunks to feed AI agents, chatbots, and RAG pipelines. One row per embeddable chunk.

Developer

Group Oject

Maintained by Community

Actor stats

  • Bookmarked: 0
  • Total users: 2
  • Monthly active users: 1
  • Last modified: 19 hours ago

Point it at a docs site, knowledge base, or blog. It crawls the pages, strips navigation, ads, and boilerplate, converts the main content to clean Markdown, and by default splits it into overlapping chunks. Each chunk becomes one dataset row that you can pipe straight into a vector database.

⚡ Fast HTTP crawler (no headless browser). No API key required.


What it does

  1. Crawls from your start URLs, following links up to a depth/page limit you set (same-domain by default).
  2. Extracts the main content — removes nav, header, footer, sidebars, scripts, ads.
  3. Converts it to clean Markdown (headings, lists, links, code preserved).
  4. Chunks it into overlapping, embeddings-sized pieces for RAG.

Output is one row per chunk, each tagged with its source URL, title, and chunk position — exactly the shape you want for an embeddings/vector pipeline.
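
The chunking step can be pictured with a small sketch. This is not the Actor's actual implementation, which the page does not show; it is a plain fixed-window splitter that assumes chunkSize, chunkOverlap, and minChunkChars behave as simple character counts. The function names chunk_text and to_rows are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100,
               min_chunk_chars: int = 50) -> list[str]:
    """Split text into overlapping character windows (illustrative only)."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    start = 0
    while start < len(text):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chunk_chars:  # drop fragments that are too small
            chunks.append(piece)
        if start + chunk_size >= len(text):  # this window reached the end
            break
        start += step
    return chunks


def to_rows(url: str, title: str, markdown: str) -> list[dict]:
    """Tag each chunk with its source and position, mirroring the output shape."""
    chunks = chunk_text(markdown)
    return [
        {"url": url, "title": title, "chunkIndex": i, "chunkCount": len(chunks),
         "content": chunk, "contentChars": len(chunk)}
        for i, chunk in enumerate(chunks)
    ]
```

With the defaults, each window advances 900 characters, so consecutive chunks share 100 characters. A real splitter would more likely cut on headings or paragraph boundaries than at raw character offsets.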


Who it's for

  • AI/RAG builders — turn a docs site or knowledge base into a clean corpus for retrieval.
  • Chatbot makers — feed your support docs into a customer-facing assistant.
  • Agent developers — give an agent a fresh, structured snapshot of a site.
  • Data teams — bulk-convert web content to Markdown without writing a parser.

Input

| Field | Type | Default | Description |
|---|---|---|---|
| startUrls | array | | URLs to crawl (plain strings or { "url": "..." }) |
| maxCrawlPages | integer | 50 | Total page cap |
| maxCrawlDepth | integer | 1 | Link-hops from start URLs (0 = start URLs only) |
| sameDomainOnly | boolean | true | Only follow links on the start domain(s) |
| includeUrlGlobs | array | | Only crawl URLs matching these globs (e.g. https://site.com/docs/*) |
| excludeUrlGlobs | array | | Skip URLs matching these globs (e.g. *.pdf) |
| chunkContent | boolean | true | Split pages into RAG chunks (one row each) |
| chunkSize | integer | 1000 | Target characters per chunk |
| chunkOverlap | integer | 100 | Overlap characters between chunks |
| minChunkChars | integer | 50 | Drop chunks smaller than this |
| saveHtml | boolean | false | Also include cleaned HTML |
| maxConcurrency | integer | 10 | Pages crawled in parallel |
| proxyConfiguration | object | | Optional Apify Proxy |

Example input

{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "maxCrawlPages": 30,
  "maxCrawlDepth": 2,
  "includeUrlGlobs": ["https://docs.apify.com/*"],
  "chunkContent": true,
  "chunkSize": 1000,
  "chunkOverlap": 100
}

More in examples/.
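
One way to submit an input like the one above from Python is sketched here. It assumes the official apify-client package and an Apify API token in an APIFY_TOKEN environment variable; neither is stated on this page. The Actor ID is not given here either, so it is passed in as a parameter. The helpers build_run_input and run_crawler are hypothetical names.

```python
import os


def build_run_input(start_url: str, max_pages: int = 30, max_depth: int = 2) -> dict:
    """Build an input object using the field names from the Input table."""
    return {
        "startUrls": [{"url": start_url}],
        "maxCrawlPages": max_pages,
        "maxCrawlDepth": max_depth,
        # Keep the crawl inside the start URL's path.
        "includeUrlGlobs": [start_url.rstrip("/") + "/*"],
        "chunkContent": True,
        "chunkSize": 1000,
        "chunkOverlap": 100,
    }


def run_crawler(actor_id: str, run_input: dict) -> list[dict]:
    """Start the Actor, wait for it to finish, and return its dataset rows."""
    # Third-party dependency (assumed): pip install apify-client
    from apify_client import ApifyClient

    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

Calling run_crawler with the Actor's ID (copied from its store page) and build_run_input("https://docs.apify.com/") should return one dict per chunk, in the shape shown under Output.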


Output

One dataset row per chunk:

{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors | Apify Docs",
  "description": "Learn how Apify Actors work.",
  "chunkIndex": 0,
  "chunkCount": 4,
  "content": "# Actors\n\nActors are serverless programs...",
  "contentChars": 980,
  "depth": 1,
  "crawledAt": "2026-06-15T12:00:00.000Z"
}

To build a vector index: embed the content field, store url + title + chunkIndex as metadata. Done.
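
The sketch below shows that mapping from dataset rows to vector records. The record layout (id, vector, text, metadata) is an assumption loosely modeled on common vector stores, not something this Actor prescribes. toy_embed is a hypothetical stand-in that produces deterministic numbers from a hash; it carries no semantic meaning and should be replaced with a real embedding model.

```python
import hashlib
from typing import Callable


def toy_embed(text: str, dims: int = 8) -> list[float]:
    """Placeholder embedding: deterministic values from a hash, NOT semantic."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [byte / 255.0 for byte in digest[:dims]]


def to_vector_records(rows: list[dict],
                      embed: Callable[[str], list[float]] = toy_embed) -> list[dict]:
    """Embed each row's content and keep url, title, and chunkIndex as metadata."""
    records = []
    for row in rows:
        records.append({
            # url + chunkIndex gives a stable, unique ID per chunk
            "id": f'{row["url"]}#chunk-{row["chunkIndex"]}',
            "vector": embed(row["content"]),
            "text": row["content"],
            "metadata": {
                "url": row["url"],
                "title": row["title"],
                "chunkIndex": row["chunkIndex"],
            },
        })
    return records
```

Deriving the ID from url and chunkIndex means that re-crawling a site overwrites existing chunks instead of duplicating them, provided the chunk boundaries stay the same.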

Key-value store outputs

  • SUMMARY — pages crawled/failed, total chunks, average chunk size, settings

Tips for clean RAG data

  • Use includeUrlGlobs to stay inside the section you care about (e.g. .../docs/*) and skip marketing pages.
  • chunkSize 800–1200 chars suits most embedding models; bump chunkOverlap to 150–200 for prose-heavy sites.
  • Turn off chunkContent if you want whole pages (one row each) and prefer to chunk in your own pipeline.
  • Exclude noise with excludeUrlGlobs (*.pdf, */tag/*, */author/*).
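
To preview how include and exclude globs interact before starting a crawl, a rough local check can help. This sketch uses Python's fnmatch, where * also matches "/". The Actor's own glob engine is not documented here and may treat * and ** differently, so use this as an approximation only. should_crawl is a hypothetical helper.

```python
from fnmatch import fnmatchcase


def should_crawl(url: str, include_globs: list[str], exclude_globs: list[str]) -> bool:
    """Return True if url matches an include glob (when any are given) and no exclude glob."""
    if include_globs and not any(fnmatchcase(url, glob) for glob in include_globs):
        return False
    return not any(fnmatchcase(url, glob) for glob in exclude_globs)


include = ["https://site.com/docs/*"]
exclude = ["*.pdf", "*/tag/*", "*/author/*"]

for candidate in [
    "https://site.com/docs/intro",
    "https://site.com/docs/guide.pdf",
    "https://site.com/pricing",
]:
    print(candidate, should_crawl(candidate, include, exclude))
```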

Limitations & compliance

  • HTTP crawler — it reads server-rendered HTML. Pages that render content purely client-side (heavy SPA) may yield little; those need a browser-based crawler.
  • Main-content extraction is heuristic (prefers <article>/<main>, strips common boilerplate). Unusual layouts may include or drop some content.
  • You choose the targets. Crawl only sites you're permitted to, respect each site's terms and robots policy, and don't collect private or paywalled data. This Actor accesses publicly reachable pages only.
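
This page does not say whether the Actor checks robots.txt itself, so it can be worth checking your start URLs beforehand. A minimal sketch using Python's standard urllib.robotparser follows. The helper allowed_by_robots is hypothetical, and the robots.txt text is passed in as a string so that the example needs no network access.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check a URL against robots.txt content that was fetched separately."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Example rules; a real check would use the target site's own robots.txt.
sample_rules = "User-agent: *\nDisallow: /private/\n"
print(allowed_by_robots(sample_rules, "https://example.com/docs/intro"))
print(allowed_by_robots(sample_rules, "https://example.com/private/report"))
```

A robots.txt check covers only the crawl policy. The site's terms of use still need to be read separately.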

Changelog

See CHANGELOG.md.