RAG Website Crawler - Clean Markdown for LLMs & AI

Crawl any website and extract clean, chunked Markdown ready for RAG pipelines and LLM context. Returns page text, titles and URLs. No API key. Works in Claude, ChatGPT & any MCP-compatible AI agent.

Pricing

Pay per usage

Developer

The Mine Works

Maintained by Community

RAG-Ready Website Crawler — Pre-Chunked Markdown with Token Counts

Turn any website into a clean, chunked, token-counted markdown corpus — ready to drop into a RAG pipeline, vector database, or LLM context window. No wasted spend: you are only charged for pages that successfully crawl and produce usable content.


What It Does

Most website-to-markdown scrapers dump raw HTML noise into your pipeline and charge you regardless of output quality. RAG-Crawler is designed from the ground up for AI-native ingestion workflows:

  • Crawls one or more seed URLs and follows internal links up to your page limit.
  • Strips boilerplate — nav, header, footer, ads, sidebars — using Mozilla Readability before converting to clean Markdown.
  • Splits each page into heading-based chunks with an approximate token count per chunk, so you can feed them directly into embedding or retrieval pipelines without a pre-processing step.
  • Returns structured JSON ready for vector databases (Pinecone, Weaviate, Qdrant, pgvector) or MCP tool servers.
  • Supports SPA / JavaScript-rendered pages via Playwright and static HTML via Cheerio.
  • Never charges for a failed page. If a page times out, errors, or returns no content, it is recorded in the dataset as status: "failed" with a charged: false flag.

Key Features

  • SPA / JS rendering — Playwright handles React, Vue, Next.js and other JS-rendered sites. Disable for pure static HTML to cut cost and latency.
  • Heading-based chunking — Splits content along # / ## / ### hierarchy. Oversized sections are further split by paragraph to respect your token limit. Chunk heading path is preserved so retrievers know the document context.
  • Token counts per chunk — Every chunk carries a token_count field (word-count × 1.35 approximation — no external tokenizer dependency). Accurate to ±8% on English prose.
  • Boilerplate stripping — Mozilla Readability extracts the article body before conversion. CSS selector override available for non-article pages (docs, product pages).
  • Configurable output formats: chunks (RAG-ready split output), full (single markdown blob), or both.
  • Fail-loud shortfall reporting — Every run emits a _summary item with pages_crawled, pages_failed, total_tokens, and charged_for. If zero pages crawled, the actor fails with a plain-English shortfall_reason so you know exactly what went wrong.
  • Zero charge on failure — Charging happens only after content is confirmed extracted. Every failed-page record carries "charged": false for your audit trail.
  • URL exclusion patterns — Regex-based exclude list. Skip PDFs, author pages, tag archives, or any URL pattern before they are even enqueued.
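
To illustrate the URL exclusion feature, here is a minimal sketch of filtering candidate links against a regex exclude list before they would be enqueued. The patterns, the `should_enqueue` name, and the choice of "match anywhere in the URL" semantics are assumptions for illustration; the actor's actual input field and matching rules are not described on this page.

```python
import re

# Hypothetical exclude list: PDFs, author pages, tag archives.
EXCLUDE_PATTERNS = [r"\.pdf($|\?)", r"/author/", r"/tag/"]
_COMPILED = [re.compile(p, re.IGNORECASE) for p in EXCLUDE_PATTERNS]


def should_enqueue(url: str) -> bool:
    """Return False when any exclude pattern is found anywhere in the URL."""
    return not any(rx.search(url) for rx in _COMPILED)


candidates = [
    "https://docs.example.com/guide/install",
    "https://docs.example.com/files/manual.pdf",
    "https://docs.example.com/author/jane",
    "https://docs.example.com/tag/release-notes",
]
# Only URLs that survive the filter would be added to the crawl queue.
queue = [u for u in candidates if should_enqueue(u)]
```

Anchoring the PDF pattern to the end of the path (or a query string) avoids accidentally excluding pages that merely mention "pdf" in their slug.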

Output Schema

Each successfully crawled page produces one item. Example with outputFormat: "chunks":

{
  "url": "https://docs.example.com/getting-started/installation",
  "canonical": "https://docs.example.com/getting-started/installation",
  "title": "Installation — Example Docs",
  "description": "How to install the Example SDK in your project.",
  "language": "en",
  "word_count": 412,
  "token_count": 556,
  "crawled_at": "2026-06-05T09:14:22.000Z",
  "status": "success",
  "chunks": [
    {
      "heading_path": ["Getting Started", "Installation"],
      "text": "## Installation\n\nRun the following command to install the SDK:\n\n```bash\nnpm install example-sdk\n```",
      "token_count": 28,
      "chunk_index": 0
    },
    {
      "heading_path": ["Getting Started", "Installation", "Requirements"],
      "text": "### Requirements\n\nNode.js 18 or later is required. The SDK does not support CommonJS — use ESM (`\"type\": \"module\"` in package.json).",
      "token_count": 41,
      "chunk_index": 1
    }
  ]
}

Failed pages:

{
  "url": "https://docs.example.com/legacy-page",
  "status": "failed",
  "reason": "timeout",
  "message": "Page did not reach load state within timeout.",
  "charged": false
}

Run summary (always the final item):

{
  "_type": "summary",
  "pages_requested": 25,
  "pages_crawled": 23,
  "pages_failed": 2,
  "total_tokens": 48210,
  "charged_for": 23
}
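
A consumer of the dataset will usually want to separate the three item types and sanity-check the billing fields. The sketch below assumes the items have already been loaded as Python dicts (for example from a JSON export); the function names are hypothetical, and the field names are taken from the examples above.

```python
from typing import Any, Optional

Item = dict[str, Any]


def split_items(items: list[Item]) -> tuple[list[Item], list[Item], Optional[Item]]:
    """Separate dataset items into successes, failures, and the run summary."""
    ok = [i for i in items if i.get("status") == "success"]
    failed = [i for i in items if i.get("status") == "failed"]
    summary = next((i for i in items if i.get("_type") == "summary"), None)
    return ok, failed, summary


def billable_pages(items: list[Item]) -> int:
    """Cross-check the summary against per-page records; return charged_for."""
    ok, failed, summary = split_items(items)
    if summary is None:
        raise ValueError("run has no summary item")
    if summary["pages_crawled"] != len(ok) or summary["pages_failed"] != len(failed):
        raise ValueError("summary counts disagree with page records")
    if any(page.get("charged") is not False for page in failed):
        raise ValueError("a failed page is not flagged charged: false")
    return summary["charged_for"]


# Tiny illustrative dataset (not real crawl output).
items = [
    {"url": "https://docs.example.com/a", "status": "success", "chunks": []},
    {"url": "https://docs.example.com/b", "status": "success", "chunks": []},
    {"url": "https://docs.example.com/legacy-page", "status": "failed",
     "reason": "timeout", "charged": False},
    {"_type": "summary", "pages_requested": 3, "pages_crawled": 2,
     "pages_failed": 1, "total_tokens": 0, "charged_for": 2},
]
```

Calling `billable_pages(items)` on a real export gives a quick audit that failed pages were not charged.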

Pricing

You are charged per successfully crawled page — nothing else.

| Scenario | Charge |
| --- | --- |
| Page crawled, content extracted | 1 credit per page |
| Page timed out | 0 credits |
| Page returned no content | 0 credits |
| Page blocked / 403 | 0 credits |
| Entire run produces zero pages | 0 credits + actor fails with reason |

No hidden platform fees, no per-token surcharges. Compare this to Firecrawl, which charges per scrape attempt regardless of content quality, and other Apify scrapers that charge on request rather than on result. RAG-Crawler only charges when your pipeline actually gets something useful.


Use Cases

  • RAG pipelines — Ingest product docs, knowledge bases, or competitor sites into your retrieval system. Chunks arrive pre-sized and pre-labelled with heading paths — no custom splitting code required.
  • LLM context injection — Feed a full documentation site into an LLM in one run. Use outputFormat: "full" for single-page context or outputFormat: "chunks" for precise retrieval.
  • AI agents via MCP — RAG-Crawler is MCP-native. Trigger it from any MCP tool server and pipe the JSON output directly into your agent's knowledge tool.
  • Vector database ingestion — Output maps directly to Pinecone, Weaviate, Qdrant, or pgvector upsert payloads. token_count helps you stay under embedding model input limits.
  • Documentation indexing — Crawl versioned docs sites. Heading paths give you natural document hierarchy for structured retrieval.
  • Competitor content analysis — Crawl competitor sites to LLM-analyse content gaps, SEO positioning, or product messaging.
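
As a rough sketch of the vector database use case, the snippet below flattens one page item into generic `{id, text, metadata}` records. The record shape, the `chunk_records` name, and the `EMBED_TOKEN_LIMIT` value are assumptions for illustration and do not correspond to any particular vendor's upsert API; embedding the text is left to your own pipeline.

```python
import hashlib

EMBED_TOKEN_LIMIT = 8000  # hypothetical embedding-model input limit


def chunk_records(page: dict) -> list[dict]:
    """Flatten one page item into generic {id, text, metadata} records."""
    records = []
    for chunk in page.get("chunks", []):
        # Stable id derived from the page URL and the chunk position.
        raw_id = f'{page["url"]}#{chunk["chunk_index"]}'
        records.append({
            "id": hashlib.sha1(raw_id.encode("utf-8")).hexdigest(),
            "text": chunk["text"],
            "metadata": {
                "url": page["url"],
                "title": page.get("title", ""),
                "heading_path": " > ".join(chunk["heading_path"]),
                "token_count": chunk["token_count"],
            },
        })
    return records


page = {
    "url": "https://docs.example.com/getting-started/installation",
    "title": "Installation",
    "chunks": [
        {"heading_path": ["Getting Started", "Installation"],
         "text": "## Installation\n\nRun the install command.",
         "token_count": 28, "chunk_index": 0},
        {"heading_path": ["Getting Started", "Installation", "Requirements"],
         "text": "### Requirements\n\nNode.js 18 or later is required.",
         "token_count": 41, "chunk_index": 1},
    ],
}
records = chunk_records(page)
# Keep only chunks whose approximate size fits the assumed model limit.
embeddable = [r for r in records if r["metadata"]["token_count"] <= EMBED_TOKEN_LIMIT]
```

Hashing the URL plus `chunk_index` keeps ids stable across re-crawls, so a repeated upsert overwrites rather than duplicates.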

Technical Notes

SPA Handling

When renderJs: true, the actor uses Playwright's Chromium in headless mode. Wait strategy options:

  • networkidle (default) — waits until there have been no network connections for at least 500 ms. Most reliable for SPAs that load data via API.
  • domcontentloaded — fires when the HTML is parsed. Faster but may miss dynamically rendered content.
  • load — waits for all resources including images. Slowest; rarely needed.

If the primary wait strategy times out (30s), the actor automatically retries with domcontentloaded. If document.body is null (which can happen on route-transition frames in some SPAs), the actor waits 2 seconds and retries HTML extraction once before marking the page as failed.
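
The timeout fallback described above can be sketched as follows. This is not the actor's code: `goto` is a hypothetical adapter that loads a URL with a given wait strategy and raises the built-in `TimeoutError` on timeout, so a real browser library's own timeout exception would need to be translated inside it. Re-raising when the primary strategy is already `domcontentloaded` is also an assumption.

```python
from typing import Callable

FALLBACK_WAIT = "domcontentloaded"


def load_with_fallback(
    goto: Callable[[str, str, int], str],
    url: str,
    wait_until: str = "networkidle",
    timeout_ms: int = 30_000,
) -> str:
    """Try the primary wait strategy; on timeout, retry once with the fallback.

    `goto(url, wait_until, timeout_ms)` is a hypothetical callable that
    returns the page HTML or raises TimeoutError.
    """
    try:
        return goto(url, wait_until, timeout_ms)
    except TimeoutError:
        if wait_until == FALLBACK_WAIT:
            raise  # assumed behaviour: nothing weaker to fall back to
        return goto(url, FALLBACK_WAIT, timeout_ms)


# Fake loader standing in for a browser: networkidle never settles.
calls: list[str] = []


def fake_goto(url: str, wait_until: str, timeout_ms: int) -> str:
    calls.append(wait_until)
    if wait_until == "networkidle":
        raise TimeoutError("network never went idle")
    return "<html><body>ok</body></html>"


html = load_with_fallback(fake_goto, "https://example.com/app")
```

Injecting the loader keeps the retry logic testable without launching a browser.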

Chunking Algorithm

  1. The markdown is walked line-by-line. Each ATX heading (# through ######) opens a new section.
  2. Content before the first heading is grouped as a section with an empty heading path.
  3. If a section's token count is within maxTokensPerChunk, it becomes one chunk.
  4. If a section exceeds the limit, its body is split by paragraph (blank-line boundaries) and paragraphs are greedily packed into sub-chunks. The heading line is repeated at the top of each sub-chunk to preserve retrieval context.
  5. A single paragraph that exceeds maxTokensPerChunk is emitted whole as its own chunk, so that chunk can exceed the limit (paragraphs are never split mid-sentence).
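
The five steps above can be sketched in Python as follows. This is an illustrative port under stated assumptions, not the actor's source: the function names and the default `max_tokens` are hypothetical, heading markers are counted as words, and the sketch does not special-case `#` lines or blank lines inside fenced code blocks.

```python
import math
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+)$")


def approx_tokens(text: str) -> int:
    """Word-count x 1.35 approximation described in this document."""
    return math.ceil(len(text.split()) * 1.35)


def _join(heading_line: str, paragraphs: list[str]) -> str:
    return "\n\n".join(([heading_line] if heading_line else []) + paragraphs)


def split_sections(markdown: str) -> list[tuple[list[str], str, str]]:
    """Steps 1-2: return (heading_path, heading_line, body) for each section."""
    sections, path, heading_line, body = [], [], "", []

    def flush() -> None:
        text = "\n".join(body).strip()
        if heading_line or text:
            sections.append((list(path), heading_line, text))

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Keep ancestors above this level, then append the new heading.
            path = path[: level - 1] + [match.group(2).strip()]
            heading_line, body = line, []
        else:
            body.append(line)
    flush()
    return sections


def pack_paragraphs(heading_line: str, body: str, max_tokens: int) -> list[str]:
    """Steps 4-5: greedy packing; a lone oversized paragraph stays whole."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", body) if p.strip()]
    out, current = [], []
    for para in paragraphs:
        if current and approx_tokens(_join(heading_line, current + [para])) > max_tokens:
            out.append(_join(heading_line, current))
            current = []
        current.append(para)
    if current:
        out.append(_join(heading_line, current))
    return out or [heading_line]


def chunk_markdown(markdown: str, max_tokens: int = 500) -> list[dict]:
    chunks = []
    for path, heading_line, body in split_sections(markdown):
        whole = _join(heading_line, [body] if body else [])
        if approx_tokens(whole) <= max_tokens:  # step 3
            texts = [whole]
        else:
            texts = pack_paragraphs(heading_line, body, max_tokens)
        for text in texts:
            chunks.append({
                "heading_path": path,
                "text": text,
                "token_count": approx_tokens(text),
                "chunk_index": len(chunks),
            })
    return chunks
```

Repeating the heading line in every sub-chunk costs a few tokens per chunk but means each chunk is self-describing when retrieved in isolation.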

Token Approximation

RAG-Crawler uses Math.ceil(wordCount × 1.35) — no tiktoken or external tokenizer dependency. Accuracy:

| Content type | Error vs. cl100k_base |
| --- | --- |
| English prose | ±8% |
| Mixed code + prose | ±15% |
| Dense code | ±20% |

For RAG chunking the error is acceptable — at worst a chunk boundary lands one paragraph off the "true" token boundary. For billing-critical token counting, run a local tiktoken pass on the output.
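
For reference, the approximation is simple enough to reproduce when you need to size text on your side before sending it to the actor's consumers. The sketch below is a Python equivalent of the formula above; splitting on whitespace to count words and the `fits_limit` helper (which treats the quoted error as a fraction of the estimate) are assumptions for illustration.

```python
import math

TOKENS_PER_WORD = 1.35


def approx_tokens(text: str) -> int:
    """Approximate token count as ceil(word_count * 1.35)."""
    return math.ceil(len(text.split()) * TOKENS_PER_WORD)


def fits_limit(estimate: int, limit: int, error: float = 0.08) -> bool:
    """Conservative check: assume the true count may exceed the estimate by `error`."""
    return estimate * (1 + error) <= limit


sample = "Node.js 18 or later is required for this SDK"
estimate = approx_tokens(sample)  # 9 words
```

Pass a larger `error` (for example 0.20) when the content is mostly code, in line with the table above.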

Custom Content Selector

If Readability misidentifies the main content area (common on documentation sites with complex layouts), set customCss to a CSS selector:

customCss: "article.docs-content"
customCss: "#main-content"
customCss: ".prose"

The selector takes precedence over Readability. If the selector matches nothing, the actor falls back to Readability, then to full-body conversion.
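
The precedence order above amounts to a first-non-empty fallback chain. In the minimal sketch below, the three extractors are stand-in stubs, not real Readability or CSS selector calls, and `extract_main` is a hypothetical name.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]


def extract_main(html: str, steps: list[tuple[str, Extractor]]) -> tuple[str, str]:
    """Run extractors in order; return (step_name, content) for the first non-empty result."""
    for name, extractor in steps:
        content = extractor(html)
        if content and content.strip():
            return name, content
    raise ValueError("no extractor produced content")


# Stubs for illustration only.
def by_selector(html: str) -> Optional[str]:
    return None  # pretend the customCss selector matched nothing


def by_readability(html: str) -> Optional[str]:
    return "Main article text"


def full_body(html: str) -> Optional[str]:
    return html


result = extract_main(
    "<html><body>Main article text</body></html>",
    [("customCss", by_selector), ("readability", by_readability), ("body", full_body)],
)
```

Returning the step name alongside the content makes it easy to log which extraction path a page took.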


Keywords

website to markdown · RAG · LLM · Firecrawl alternative · MCP · vector database · chunking · token count · web scraper · SPA crawler · Playwright · Cheerio · Readability · AI pipeline · document ingestion · Apify actor