Pricing

from $0.50 / 1,000 results

AI / RAG Web Crawler

Crawl any website and extract clean, LLM-ready Markdown chunks to feed AI agents, chatbots, and RAG pipelines. One row per embeddable chunk.

Developer

Group Oject

19 hours ago

Last modified

What it does

Crawls from your start URLs, following links up to a depth/page limit you set (same-domain by default).
Extracts the main content — removes nav, header, footer, sidebars, scripts, ads.
Converts it to clean Markdown (headings, lists, links, code preserved).
Chunks it into overlapping, embeddings-sized pieces for RAG.

Output is one row per chunk, each tagged with its source URL, title, and chunk position — exactly the shape you want for an embeddings/vector pipeline.
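
The chunking step can be pictured with a short Python sketch. It assumes a plain sliding character window driven by `chunkSize`, `chunkOverlap`, and `minChunkChars`; the Actor's real splitter is not documented here and may well respect Markdown boundaries instead. The function names `chunk_text` and `to_rows` are hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1000, chunk_overlap: int = 100,
               min_chunk_chars: int = 50) -> list[str]:
    """Split text into overlapping character windows (illustrative assumption)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be in [0, chunk_size)")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size]
        if len(piece) >= min_chunk_chars:  # drop fragments that are too small
            chunks.append(piece)
        if start + chunk_size >= len(text):  # this window reached the end
            break
    return chunks


def to_rows(url: str, title: str, text: str, **chunk_options) -> list[dict]:
    """Tag each chunk with its source and position, mirroring the output shape."""
    chunks = chunk_text(text, **chunk_options)
    return [
        {"url": url, "title": title, "chunkIndex": i, "chunkCount": len(chunks),
         "content": chunk, "contentChars": len(chunk)}
        for i, chunk in enumerate(chunks)
    ]
```

With the defaults, each window advances 900 characters, so consecutive chunks share their last and first 100 characters.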

Who it's for

AI/RAG builders — turn a docs site or knowledge base into a clean corpus for retrieval.
Chatbot makers — feed your support docs into a customer-facing assistant.
Agent developers — give an agent a fresh, structured snapshot of a site.
Data teams — bulk-convert web content to Markdown without writing a parser.

Input

Field	Type	Default	Description
`startUrls`	array	—	URLs to crawl (plain strings or `{ "url": "..." }`)
`maxCrawlPages`	integer	`50`	Total page cap
`maxCrawlDepth`	integer	`1`	Link-hops from start URLs (0 = start URLs only)
`sameDomainOnly`	boolean	`true`	Only follow links on the start domain(s)
`includeUrlGlobs`	array	—	Only crawl URLs matching these globs (e.g. `https://site.com/docs/*`)
`excludeUrlGlobs`	array	—	Skip URLs matching these globs (e.g. `*.pdf`)
`chunkContent`	boolean	`true`	Split pages into RAG chunks (one row each)
`chunkSize`	integer	`1000`	Target characters per chunk
`chunkOverlap`	integer	`100`	Characters shared between consecutive chunks
`minChunkChars`	integer	`50`	Drop chunks shorter than this many characters
`saveHtml`	boolean	`false`	Also include cleaned HTML
`maxConcurrency`	integer	`10`	Pages crawled in parallel
`proxyConfiguration`	object	—	Optional Apify Proxy
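
How `maxCrawlPages`, `maxCrawlDepth`, and `sameDomainOnly` interact may be easier to see in code. The Python sketch below assumes a breadth-first traversal over a hypothetical in-memory `links` map (page URL to outgoing links) and compares exact hostnames; the Actor's actual traversal order and its handling of subdomains are not stated in this README, so treat this as an illustration only.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_urls, links, max_crawl_pages=50, max_crawl_depth=1,
               same_domain_only=True):
    """Return (url, depth) pairs in the order a breadth-first crawl would visit them."""
    start_urls = list(dict.fromkeys(start_urls))  # de-duplicate, keep order
    allowed_hosts = {urlparse(u).netloc for u in start_urls}
    queue = deque((u, 0) for u in start_urls)
    seen = set(start_urls)
    visited = []
    while queue and len(visited) < max_crawl_pages:  # total page cap
        url, depth = queue.popleft()
        visited.append((url, depth))
        if depth >= max_crawl_depth:  # depth 0 means start URLs only
            continue
        for nxt in links.get(url, []):
            if nxt in seen:
                continue
            if same_domain_only and urlparse(nxt).netloc not in allowed_hosts:
                continue  # skip links that leave the start domain(s)
            seen.add(nxt)
            queue.append((nxt, depth + 1))
    return visited
```

Under these assumptions, a depth of 0 visits only the start URLs, and the page cap stops the crawl even if the depth limit has not been reached.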

Example input

{
  "startUrls": [{ "url": "https://docs.apify.com/" }],
  "maxCrawlPages": 30,
  "maxCrawlDepth": 2,
  "includeUrlGlobs": ["https://docs.apify.com/*"],
  "chunkContent": true,
  "chunkSize": 1000,
  "chunkOverlap": 100
}

Output

One dataset row per chunk:

{
  "url": "https://docs.apify.com/platform/actors",
  "title": "Actors | Apify Docs",
  "description": "Learn how Apify Actors work.",
  "chunkIndex": 0,
  "chunkCount": 4,
  "content": "# Actors\n\nActors are serverless programs...",
  "contentChars": 980,
  "depth": 1,
  "crawledAt": "2026-06-15T12:00:00.000Z"
}

To build a vector index: embed the `content` field and store `url`, `title`, and `chunkIndex` as metadata. Done.
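
As a rough illustration of that recipe, the Python sketch below turns dataset rows into index records and runs a cosine-similarity search in memory. Everything here is an assumption for demonstration: `toy_embed` is a hashed bag-of-words stand-in for a real embedding model, and `build_index` and `search` are hypothetical helpers, not part of the Actor or of any vector database.

```python
import math
import zlib
from collections import Counter


def toy_embed(text: str, dims: int = 64) -> list[float]:
    """Stand-in for a real embedding model: hashed bag-of-words, L2-normalised."""
    vec = [0.0] * dims
    for word, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(word.encode("utf-8")) % dims] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # avoid dividing by zero
    return [v / norm for v in vec]


def build_index(rows, embed=toy_embed):
    """Embed each row's content; keep url, title, and chunkIndex as metadata."""
    return [
        {
            "id": f'{row["url"]}#{row["chunkIndex"]}',
            "vector": embed(row["content"]),
            "metadata": {"url": row["url"], "title": row["title"],
                         "chunkIndex": row["chunkIndex"]},
        }
        for row in rows
    ]


def search(index, query, embed=toy_embed, top_k=3):
    """Rank records by dot product (cosine similarity, since vectors are unit length)."""
    q = embed(query)
    scored = [(sum(a * b for a, b in zip(q, rec["vector"])), rec) for rec in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [{**rec["metadata"], "score": score} for score, rec in scored[:top_k]]
```

In a real pipeline, `toy_embed` would be replaced by a call to your embedding model, and the records would be written to your vector store rather than kept in a list.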

Key-value store outputs

SUMMARY — pages crawled/failed, total chunks, average chunk size, settings

Tips for clean RAG data

Use `includeUrlGlobs` to stay inside the section you care about (e.g. `.../docs/*`) and skip marketing pages.
A `chunkSize` of 800–1200 characters suits most embedding models; raise `chunkOverlap` to 150–200 for prose-heavy sites.
Turn off `chunkContent` if you want whole pages (one row each) and prefer to chunk in your own pipeline.
Exclude noise with `excludeUrlGlobs` (`*.pdf`, `*/tag/*`, `*/author/*`).
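
To preview which URLs a set of include and exclude globs would keep, a small Python check like the one below can help. It is only an approximation: it uses the standard library's `fnmatch`, where `*` also matches `/`, while the Actor's glob dialect is not specified here and may differ. The helper `should_crawl` and the rule that excludes win over includes are assumptions.

```python
from fnmatch import fnmatchcase


def should_crawl(url, include_globs=None, exclude_globs=None):
    """Approximate include/exclude filtering with fnmatch-style globs."""
    # Assumption: an exclude match always wins over an include match.
    if exclude_globs and any(fnmatchcase(url, g) for g in exclude_globs):
        return False
    # With include globs set, only matching URLs are kept.
    if include_globs:
        return any(fnmatchcase(url, g) for g in include_globs)
    return True  # no include globs: everything not excluded is allowed


include = ["https://site.com/docs/*"]
exclude = ["*.pdf", "*/tag/*", "*/author/*"]
for candidate in ["https://site.com/docs/intro",
                  "https://site.com/docs/guide.pdf",
                  "https://site.com/pricing"]:
    print(candidate, should_crawl(candidate, include, exclude))
```

Running a handful of representative URLs through a check like this before a large crawl is a cheap way to catch a glob that is too broad or too narrow.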

Limitations & compliance

HTTP crawler — it reads server-rendered HTML. Pages that render content purely client-side (heavy SPA) may yield little; those need a browser-based crawler.
Main-content extraction is heuristic (prefers <article>/<main>, strips common boilerplate). Unusual layouts may include or drop some content.
You choose the targets. Crawl only sites you're permitted to, respect each site's terms and robots policy, and don't collect private or paywalled data. This Actor accesses publicly reachable pages only.
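
The first two limitations above can be illustrated with a deliberately simple extractor. This Python sketch assumes a heuristic that drops text inside common boilerplate tags and prefers text inside `<article>` or `<main>`, falling back to the remaining page text; it is not the Actor's implementation, and the tag lists and the name `extract_main_text` are hypothetical.

```python
from html.parser import HTMLParser

SKIP_TAGS = {"nav", "header", "footer", "aside", "script", "style"}
MAIN_TAGS = {"article", "main"}


class MainTextExtractor(HTMLParser):
    """Collect visible text, tracking whether we are inside skipped or main tags."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # > 0 while inside a boilerplate element
        self.main_depth = 0   # > 0 while inside <article> or <main>
        self.main_parts = []
        self.all_parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in MAIN_TAGS:
            self.main_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth:
            self.skip_depth -= 1
        elif tag in MAIN_TAGS and self.main_depth:
            self.main_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if not text or self.skip_depth:
            return
        self.all_parts.append(text)
        if self.main_depth:
            self.main_parts.append(text)


def extract_main_text(html: str) -> str:
    """Prefer text from <article>/<main>; otherwise use all non-boilerplate text."""
    parser = MainTextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.main_parts or parser.all_parts)
```

A page whose body is only an empty root element plus a script yields an empty string here, which is the client-side rendering case described above. A heuristic this blunt would also drop a `<header>` nested inside an `<article>`, one example of how unusual layouts can lose content.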

Changelog

See CHANGELOG.md.
