Pricing

Pay per usage

RAG Website Crawler - Clean Markdown for LLMs & AI

Crawl any website and extract clean, chunked Markdown ready for RAG pipelines and LLM context. Returns page text, titles and URLs. No API key. Works in Claude, ChatGPT & any MCP-compatible AI agent.

Developer

The Mine Works

RAG-Ready Website Crawler — Pre-Chunked Markdown with Token Counts

Turn any website into a clean, chunked, token-counted markdown corpus — ready to drop into a RAG pipeline, vector database, or LLM context window. No wasted spend: you are only charged for pages that successfully crawl and produce usable content.

What It Does

Most website-to-markdown scrapers dump raw HTML noise into your pipeline and charge you regardless of output quality. RAG-Crawler is designed from the ground up for AI-native ingestion workflows:

Crawls one or more seed URLs and follows internal links up to your page limit.
Strips boilerplate — nav, header, footer, ads, sidebars — using Mozilla Readability before converting to clean Markdown.
Splits each page into heading-based chunks with an approximate token count per chunk, so you can feed them directly into embedding or retrieval pipelines without a pre-processing step.
Returns structured JSON ready for vector databases (Pinecone, Weaviate, Qdrant, pgvector) or MCP tool servers.
Supports SPA / JavaScript-rendered pages via Playwright and static HTML via Cheerio.
Never charges for a failed page. If a page times out, errors, or returns no content, it is recorded in the dataset as status: "failed" with a charged: false flag.

Key Features

SPA / JS rendering — Playwright handles React, Vue, Next.js and other JS-rendered sites. Disable for pure static HTML to cut cost and latency.
Heading-based chunking — Splits content along the Markdown heading hierarchy (# through ######). Sections that exceed your token limit are further split by paragraph. Each chunk keeps its heading path so retrievers know the document context.
Token counts per chunk — Every chunk carries a token_count field (word-count × 1.35 approximation — no external tokenizer dependency). Accurate to ±8% on English prose.
Boilerplate stripping — Mozilla Readability extracts the article body before conversion. CSS selector override available for non-article pages (docs, product pages).
Configurable output formats — chunks (RAG-ready split output), full (single markdown blob), or both.
Fail-loud shortfall reporting — Every run ends with a summary item ("_type": "summary") containing pages_requested, pages_crawled, pages_failed, total_tokens, and charged_for. If zero pages are crawled, the actor fails with a plain-English shortfall_reason so you know exactly what went wrong.
Zero charge on failure — Charging happens only after content is confirmed extracted. Every failed-page record carries "charged": false for your audit trail.
URL exclusion patterns — Regex-based exclude list. Skip PDFs, author pages, tag archives, or any URL pattern before they are even enqueued.
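
The exclusion feature can be pictured as a filter that runs before a URL is enqueued. The following is a minimal sketch, assuming each pattern is applied as a regex search against the full URL. The function name should_enqueue and the sample patterns are illustrative and are not taken from the actor's code or input schema.

```python
import re

# Hypothetical exclude list: PDFs, author pages, tag archives.
EXCLUDE_PATTERNS = [r"\.pdf($|\?)", r"/author/", r"/tag/"]

def should_enqueue(url: str, patterns: list[str] = EXCLUDE_PATTERNS) -> bool:
    """Return False if any exclude pattern matches anywhere in the URL."""
    return not any(re.search(p, url) for p in patterns)

urls = [
    "https://docs.example.com/guide/intro",
    "https://docs.example.com/files/manual.pdf",
    "https://docs.example.com/author/jane",
    "https://docs.example.com/tag/release-notes",
]

# Only URLs that pass the filter would be added to the crawl queue.
kept = [u for u in urls if should_enqueue(u)]
print(kept)
```

Filtering before enqueueing means an excluded URL is never fetched, so it cannot appear as either a crawled or a failed page.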

Output Schema

Each successfully crawled page produces one item. Example with outputFormat: "chunks":

{
  "url": "https://docs.example.com/getting-started/installation",
  "canonical": "https://docs.example.com/getting-started/installation",
  "title": "Installation — Example Docs",
  "description": "How to install the Example SDK in your project.",
  "language": "en",
  "word_count": 412,
  "token_count": 557,
  "crawled_at": "2026-06-05T09:14:22.000Z",
  "status": "success",
  "chunks": [
    {
      "heading_path": ["Getting Started", "Installation"],
      "text": "## Installation\n\nRun the following command to install the SDK:\n\n```bash\nnpm install example-sdk\n```",
      "token_count": 28,
      "chunk_index": 0
    },
    {
      "heading_path": ["Getting Started", "Installation", "Requirements"],
      "text": "### Requirements\n\nNode.js 18 or later is required. The SDK does not support CommonJS — use ESM (`\"type\": \"module\"` in package.json).",
      "token_count": 41,
      "chunk_index": 1
    }
  ]
}

Failed pages:

{
  "url": "https://docs.example.com/legacy-page",
  "status": "failed",
  "reason": "timeout",
  "message": "Page did not reach load state within timeout.",
  "charged": false
}

Run summary (always the final item):

{
  "_type": "summary",
  "pages_requested": 25,
  "pages_crawled": 23,
  "pages_failed": 2,
  "total_tokens": 48210,
  "charged_for": 23
}
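
On the consumer side, the three item shapes above can be separated by their "status" and "_type" fields. This is a minimal sketch over an in-memory list of items. The helper name split_items is hypothetical, and the consistency check assumes that charged_for always equals the number of successful pages, which follows from the pricing description but is not stated as a guarantee.

```python
def split_items(items):
    """Separate a run's dataset items into pages, failures, and the summary."""
    pages, failures, summary = [], [], None
    for item in items:
        if item.get("_type") == "summary":
            summary = item
        elif item.get("status") == "failed":
            failures.append(item)
        elif item.get("status") == "success":
            pages.append(item)
    return pages, failures, summary

# Trimmed-down sample items mirroring the shapes shown above.
items = [
    {"url": "https://docs.example.com/a", "status": "success", "chunks": []},
    {"url": "https://docs.example.com/legacy-page", "status": "failed",
     "reason": "timeout", "charged": False},
    {"_type": "summary", "pages_requested": 2, "pages_crawled": 1,
     "pages_failed": 1, "total_tokens": 0, "charged_for": 1},
]

pages, failures, summary = split_items(items)

# Sanity check: the summary should agree with what is actually in the dataset.
consistent = (
    summary["pages_crawled"] == len(pages)
    and summary["pages_failed"] == len(failures)
    and summary["charged_for"] == len(pages)  # assumption: one credit per success
)
print(consistent)
```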

Pricing

You are charged per successfully crawled page — nothing else.

Scenario	Charge
Page crawled, content extracted	1 credit per page
Page timed out	0 credits
Page returned no content	0 credits
Page blocked / 403	0 credits
Entire run produces zero pages	0 credits + actor fails with reason

No hidden platform fees, no per-token surcharges. Compare this to Firecrawl, which charges per scrape attempt regardless of content quality, and other Apify scrapers that charge on request rather than on result. RAG-Crawler only charges when your pipeline actually gets something useful.

Use Cases

RAG pipelines — Ingest product docs, knowledge bases, or competitor sites into your retrieval system. Chunks arrive pre-sized and pre-labelled with heading paths — no custom splitting code required.
LLM context injection — Feed a full documentation site into an LLM in one run. Use outputFormat: "full" for single-page context or outputFormat: "chunks" for precise retrieval.
AI agents via MCP — RAG-Crawler is MCP-native. Call it as a tool from any MCP-compatible client or agent and pipe the JSON output directly into your agent's knowledge tool.
Vector database ingestion — Output maps directly to Pinecone, Weaviate, Qdrant, or pgvector upsert payloads. token_count helps you stay under embedding model input limits.
Documentation indexing — Crawl versioned docs sites. Heading paths give you natural document hierarchy for structured retrieval.
Competitor content analysis — Crawl competitor sites to LLM-analyse content gaps, SEO positioning, or product messaging.
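
For the vector database use case, one way to prepare chunks for upserting is to flatten each page into one record per chunk. The sketch below is an assumption about a convenient record shape (id, text, metadata) and is not the payload format of any particular database. The function name to_records and the MAX_EMBED_TOKENS value are illustrative.

```python
import hashlib

def to_records(page: dict) -> list[dict]:
    """Flatten one successful page item into one record per chunk."""
    records = []
    for chunk in page.get("chunks", []):
        # Deterministic ID from URL + chunk index, so a re-crawl overwrites
        # the previous version of the same chunk instead of duplicating it.
        raw_id = f"{page['url']}#{chunk['chunk_index']}"
        records.append({
            "id": hashlib.sha1(raw_id.encode("utf-8")).hexdigest(),
            "text": chunk["text"],
            "metadata": {
                "url": page["url"],
                "title": page.get("title", ""),
                "heading_path": " > ".join(chunk["heading_path"]),
                "token_count": chunk["token_count"],
            },
        })
    return records

page = {
    "url": "https://docs.example.com/getting-started/installation",
    "title": "Installation",
    "status": "success",
    "chunks": [
        {"heading_path": ["Getting Started", "Installation"],
         "text": "## Installation\n\nRun the installer.",
         "token_count": 28, "chunk_index": 0},
        {"heading_path": ["Getting Started", "Installation", "Requirements"],
         "text": "### Requirements\n\nNode.js 18 or later is required.",
         "token_count": 41, "chunk_index": 1},
    ],
}

records = to_records(page)

# Hypothetical embedding-model input limit; use token_count to filter early.
MAX_EMBED_TOKENS = 8000
embeddable = [r for r in records if r["metadata"]["token_count"] <= MAX_EMBED_TOKENS]
print(len(embeddable))
```

The heading path is joined into a single string here because many vector stores restrict metadata values to scalars; keep it as a list if your store accepts one.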

Technical Notes

SPA Handling

When renderJs: true, the actor uses Playwright's Chromium in headless mode. Wait strategy options:

networkidle (default) — waits until there have been no network connections for at least 500 ms. Most reliable for SPAs that load data via API.
domcontentloaded — fires when the HTML is parsed. Faster but may miss dynamically rendered content.
load — waits for all resources including images. Slowest; rarely needed.

If the primary wait strategy times out (30s), the actor automatically retries with domcontentloaded. If document.body is null (which can happen on route-transition frames in some SPAs), the actor waits 2 seconds and retries HTML extraction once before marking the page as failed.
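
The timeout fallback described above can be sketched independently of any browser library. In this sketch, goto is a hypothetical stand-in for the real navigation call: it takes a URL, a wait strategy and a timeout, and either returns HTML or raises TimeoutError. Re-raising when the primary strategy is already domcontentloaded is an assumption, since the text does not say what happens in that case.

```python
from typing import Callable

def navigate_with_fallback(
    goto: Callable[[str, str, float], str],
    url: str,
    primary: str = "networkidle",
    timeout_s: float = 30.0,
) -> tuple[str, str]:
    """Try the primary wait strategy; on timeout, retry with domcontentloaded.

    Returns (html, strategy_used).
    """
    try:
        return goto(url, primary, timeout_s), primary
    except TimeoutError:
        if primary == "domcontentloaded":
            raise  # assumption: nothing lighter to fall back to
        return goto(url, "domcontentloaded", timeout_s), "domcontentloaded"

# Fake navigator simulating a page that never becomes network-idle.
def fake_goto(url: str, wait_until: str, timeout_s: float) -> str:
    if wait_until == "networkidle":
        raise TimeoutError(f"{url} did not reach {wait_until} in {timeout_s}s")
    return "<html><body>ok</body></html>"

html, used = navigate_with_fallback(fake_goto, "https://app.example.com/")
print(used)
```

With a real browser library, the except clause would need to catch that library's own timeout exception rather than the built-in TimeoutError.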

Chunking Algorithm

The markdown is walked line-by-line. Each ATX heading (# through ######) opens a new section.
Content before the first heading is grouped as a section with an empty heading path.
If a section's token count is within maxTokensPerChunk, it becomes one chunk.
If a section exceeds the limit, its body is split by paragraph (blank-line boundaries) and paragraphs are greedily packed into sub-chunks. The heading line is repeated at the top of each sub-chunk to preserve retrieval context.
A single paragraph that exceeds maxTokensPerChunk is emitted as its own chunk, so that chunk can exceed the limit (paragraphs are never split, so no chunk breaks mid-sentence).
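
The steps above can be sketched in a few dozen lines. This is an illustrative reimplementation written from the description, not the actor's source. It assumes words are split on whitespace for the token estimate, and it does not special-case lines starting with # inside fenced code blocks, which a production version would need to handle.

```python
import math
import re

HEADING = re.compile(r"^(#{1,6})\s+(.+?)\s*$")  # ATX headings, levels 1-6

def approx_tokens(text):
    """Word count x 1.35, rounded up."""
    return math.ceil(len(text.split()) * 1.35)

def split_sections(markdown):
    """Walk line by line; return (heading_path, heading_line, body) per section."""
    sections, stack = [], []  # stack holds (level, title) of open headings
    path, heading_line, body = [], "", []
    for line in markdown.splitlines():
        m = HEADING.match(line)
        if not m:
            body.append(line)
            continue
        if heading_line or any(l.strip() for l in body):
            sections.append((path, heading_line, "\n".join(body).strip()))
        level, title = len(m.group(1)), m.group(2)
        while stack and stack[-1][0] >= level:  # close same-or-deeper headings
            stack.pop()
        stack.append((level, title))
        path, heading_line, body = [t for _, t in stack], line, []
    if heading_line or any(l.strip() for l in body):
        sections.append((path, heading_line, "\n".join(body).strip()))
    return sections

def chunk_markdown(markdown, max_tokens):
    chunks = []

    def emit(path, parts):
        text = "\n\n".join(p for p in parts if p)
        chunks.append({"heading_path": path, "text": text,
                       "token_count": approx_tokens(text),
                       "chunk_index": len(chunks)})

    for path, heading_line, body in split_sections(markdown):
        if approx_tokens(heading_line + "\n\n" + body) <= max_tokens:
            emit(path, [heading_line, body])  # section fits: one chunk
            continue
        packed = []  # oversized: greedily pack paragraphs under a repeated heading
        for para in re.split(r"\n\s*\n", body):
            candidate = "\n\n".join([heading_line, *packed, para])
            if packed and approx_tokens(candidate) > max_tokens:
                emit(path, [heading_line, *packed])
                packed = []
            packed.append(para)  # a lone oversized paragraph is kept whole
        emit(path, [heading_line, *packed])
    return chunks

doc = """Intro text before any heading.

# Guide

## Install

Run the installer.

## Usage

one two three four five six seven eight.

nine ten eleven twelve thirteen fourteen fifteen sixteen.
"""

for c in chunk_markdown(doc, max_tokens=16):
    print(c["chunk_index"], c["heading_path"], c["token_count"])
```

In this sample the "Usage" section is over the limit, so it is emitted as two sub-chunks that each begin with the "## Usage" heading line.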

Token Approximation

RAG-Crawler uses Math.ceil(wordCount × 1.35) — no tiktoken or external tokenizer dependency. Accuracy:

Content type	Error vs. cl100k_base
English prose	±8%
Mixed code + prose	±15%
Dense code	±20%

For RAG chunking the error is acceptable — at worst a chunk boundary lands one paragraph off the "true" token boundary. For billing-critical token counting, run a local tiktoken pass on the output.
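
A Python equivalent of the approximation is shown below as a minimal sketch. Splitting words on whitespace is an assumption, since the text does not say how wordCount is computed. The relative_error helper is illustrative and is only there to compare the estimate against a reference count from a real tokenizer.

```python
import math

def approx_tokens(text: str) -> int:
    """ceil(word_count * 1.35), with words split on whitespace (an assumption)."""
    return math.ceil(len(text.split()) * 1.35)

def relative_error(approx: int, actual: int) -> float:
    """Signed error of the approximation against a reference token count."""
    return (approx - actual) / actual

sample = "Node.js 18 or later is required."
print(approx_tokens(sample))  # 6 words -> ceil(8.1) -> 9
```

To measure the real error on your own corpus, pass a count from a tokenizer as actual, for example the length of tiktoken.get_encoding("cl100k_base").encode(text) if tiktoken is installed.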

Custom Content Selector

If Readability misidentifies the main content area (common on documentation sites with complex layouts), set customCss to a CSS selector:

customCss: "article.docs-content"
customCss: "#main-content"
customCss: ".prose"

The selector takes precedence over Readability. If the selector matches nothing, the actor falls back to Readability, then to full-body conversion.
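
The precedence order described above amounts to trying extractors in sequence and taking the first non-empty result. The sketch below shows only that control flow. The three extractor functions are hypothetical stand-ins with hard-coded results and do not call a real CSS engine or Readability.

```python
from typing import Callable, Optional

Extractor = Callable[[str], Optional[str]]

def extract_content(html: str, extractors: list[tuple[str, Extractor]]) -> tuple[str, str]:
    """Run extractors in priority order; return (method, content) for the
    first one that yields non-empty content."""
    for name, extract in extractors:
        content = extract(html)
        if content and content.strip():
            return name, content
    raise ValueError("no extractor produced content")

# Hypothetical stand-ins for the three stages.
def by_selector(html: str) -> Optional[str]:
    return None  # simulates a customCss selector that matched nothing

def by_readability(html: str) -> Optional[str]:
    return "Main article text"

def full_body(html: str) -> Optional[str]:
    return html

method, content = extract_content(
    "<body><p>Main article text</p></body>",
    [("customCss", by_selector), ("readability", by_readability), ("full-body", full_body)],
)
print(method)
```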

Keywords

website to markdown · RAG · LLM · Firecrawl alternative · MCP · vector database · chunking · token count · web scraper · SPA crawler · Playwright · Cheerio · Readability · AI pipeline · document ingestion · Apify actor
