RAG Website Crawler - Clean Markdown for LLMs & AI
Pricing
Pay per usage
Crawl any website and extract clean, chunked Markdown ready for RAG pipelines and LLM context. Returns page text, titles and URLs. No API key. Works in Claude, ChatGPT & any MCP-compatible AI agent.
Developer
The Mine Works
Maintained by Community
RAG-Ready Website Crawler — Pre-Chunked Markdown with Token Counts
Turn any website into a clean, chunked, token-counted markdown corpus — ready to drop into a RAG pipeline, vector database, or LLM context window. No wasted spend: you are only charged for pages that successfully crawl and produce usable content.
What It Does
Most website-to-markdown scrapers dump raw HTML noise into your pipeline and charge you regardless of output quality. RAG-Crawler is designed from the ground up for AI-native ingestion workflows:
- Crawls one or more seed URLs and follows internal links up to your page limit.
- Strips boilerplate — nav, header, footer, ads, sidebars — using Mozilla Readability before converting to clean Markdown.
- Splits each page into heading-based chunks with an approximate token count per chunk, so you can feed them directly into embedding or retrieval pipelines without a pre-processing step.
- Returns structured JSON ready for vector databases (Pinecone, Weaviate, Qdrant, pgvector) or MCP tool servers.
- Supports SPA / JavaScript-rendered pages via Playwright and static HTML via Cheerio.
- Never charges for a failed page. If a page times out, errors, or returns no content, it is recorded in the dataset as status: "failed" with a charged: false flag.
Key Features
- SPA / JS rendering: Playwright handles React, Vue, Next.js and other JS-rendered sites. Disable for pure static HTML to cut cost and latency.
- Heading-based chunking: Splits content along the #/##/### hierarchy. Oversized sections are further split by paragraph to respect your token limit. The chunk heading path is preserved so retrievers know the document context.
- Token counts per chunk: Every chunk carries a token_count field (word count × 1.35 approximation, no external tokenizer dependency). Accurate to ±8% on English prose.
- Boilerplate stripping: Mozilla Readability extracts the article body before conversion. A CSS selector override is available for non-article pages (docs, product pages).
- Configurable output formats: chunks (RAG-ready split output), full (single markdown blob), or both.
- Fail-loud shortfall reporting: Every run emits a _summary item with pages_crawled, pages_failed, total_tokens, and charged_for. If zero pages are crawled, the actor fails with a plain-English shortfall_reason so you know exactly what went wrong.
- Zero charge on failure: Charging happens only after content is confirmed extracted. Every failed-page record carries "charged": false for your audit trail.
- URL exclusion patterns: Regex-based exclude list. Skip PDFs, author pages, tag archives, or any URL pattern before they are even enqueued.
Output Schema
Each successfully crawled page produces one item. Example with outputFormat: "chunks":

    {
      "url": "https://docs.example.com/getting-started/installation",
      "canonical": "https://docs.example.com/getting-started/installation",
      "title": "Installation — Example Docs",
      "description": "How to install the Example SDK in your project.",
      "language": "en",
      "word_count": 412,
      "token_count": 556,
      "crawled_at": "2026-06-05T09:14:22.000Z",
      "status": "success",
      "chunks": [
        {
          "heading_path": ["Getting Started", "Installation"],
          "text": "## Installation\n\nRun the following command to install the SDK:\n\n```bash\nnpm install example-sdk\n```",
          "token_count": 28,
          "chunk_index": 0
        },
        {
          "heading_path": ["Getting Started", "Installation", "Requirements"],
          "text": "### Requirements\n\nNode.js 18 or later is required. The SDK does not support CommonJS — use ESM (`\"type\": \"module\"` in package.json).",
          "token_count": 41,
          "chunk_index": 1
        }
      ]
    }

Failed pages:

    {
      "url": "https://docs.example.com/legacy-page",
      "status": "failed",
      "reason": "timeout",
      "message": "Page did not reach load state within timeout.",
      "charged": false
    }

Run summary (always the final item):

    {
      "_type": "summary",
      "pages_requested": 25,
      "pages_crawled": 23,
      "pages_failed": 2,
      "total_tokens": 48210,
      "charged_for": 23
    }

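
For consumers who want to type-check the dataset, the three item kinds above could be modelled roughly as follows. This is a sketch, not a published schema: the field names are copied from the examples, only a subset of page fields is typed, and `partition` and `billingLooksConsistent` are hypothetical helper names. The check that `charged_for` equals the number of successful items is an assumption drawn from the pricing rule, not a documented guarantee.

```typescript
// Field names come from the example items; the types are an assumption.
interface Chunk {
  heading_path: string[];
  text: string;
  token_count: number;
  chunk_index: number;
}

interface PageItem {
  url: string;
  status: "success";
  title: string;
  token_count: number;
  chunks: Chunk[];
}

interface FailedItem {
  url: string;
  status: "failed";
  reason: string;
  message: string;
  charged: false;
}

interface SummaryItem {
  _type: "summary";
  pages_requested: number;
  pages_crawled: number;
  pages_failed: number;
  total_tokens: number;
  charged_for: number;
}

type DatasetItem = PageItem | FailedItem | SummaryItem;

// Split a dataset into successful pages, failed pages, and the run summary.
function partition(items: DatasetItem[]) {
  const pages: PageItem[] = [];
  const failed: FailedItem[] = [];
  let summary: SummaryItem | undefined;
  for (const item of items) {
    if ("_type" in item) summary = item;
    else if (item.status === "success") pages.push(item);
    else failed.push(item);
  }
  return { pages, failed, summary };
}

// Assumed invariant: only successfully crawled pages are charged.
function billingLooksConsistent(items: DatasetItem[]): boolean {
  const { pages, failed, summary } = partition(items);
  if (!summary) return false;
  return (
    summary.pages_crawled === pages.length &&
    summary.pages_failed === failed.length &&
    summary.charged_for === pages.length
  );
}

const items: DatasetItem[] = [
  {
    url: "https://docs.example.com/getting-started/installation",
    status: "success",
    title: "Installation",
    token_count: 556,
    chunks: [],
  },
  {
    url: "https://docs.example.com/legacy-page",
    status: "failed",
    reason: "timeout",
    message: "Page did not reach load state within timeout.",
    charged: false,
  },
  { _type: "summary", pages_requested: 2, pages_crawled: 1, pages_failed: 1, total_tokens: 556, charged_for: 1 },
];

console.log(billingLooksConsistent(items));
```
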
Pricing
You are charged per successfully crawled page — nothing else.
| Scenario | Charge |
|---|---|
| Page crawled, content extracted | 1 credit per page |
| Page timed out | 0 credits |
| Page returned no content | 0 credits |
| Page blocked / 403 | 0 credits |
| Entire run produces zero pages | 0 credits + actor fails with reason |
No hidden platform fees, no per-token surcharges. Compare this to Firecrawl, which charges per scrape attempt regardless of content quality, and other Apify scrapers that charge on request rather than on result. RAG-Crawler only charges when your pipeline actually gets something useful.
Use Cases
- RAG pipelines: Ingest product docs, knowledge bases, or competitor sites into your retrieval system. Chunks arrive pre-sized and pre-labelled with heading paths, so no custom splitting code is required.
- LLM context injection: Feed a full documentation site into an LLM in one run. Use outputFormat: "full" for single-page context or outputFormat: "chunks" for precise retrieval.
- AI agents via MCP: RAG-Crawler is MCP-native. Trigger it from any MCP tool server and pipe the JSON output directly into your agent's knowledge tool.
- Vector database ingestion: Output maps directly to Pinecone, Weaviate, Qdrant, or pgvector upsert payloads. token_count helps you stay under embedding model input limits.
- Documentation indexing: Crawl versioned docs sites. Heading paths give you a natural document hierarchy for structured retrieval.
- Competitor content analysis: Crawl competitor sites and have an LLM analyse content gaps, SEO positioning, or product messaging.
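
As an illustration of the vector database use case, the sketch below flattens one page item into per-chunk records. The `ChunkRecord` shape, the `toRecords` name, and the `maxInputTokens` parameter are all hypothetical: each vector database has its own upsert format, and the embedding step is left out entirely.

```typescript
interface Chunk {
  heading_path: string[];
  text: string;
  token_count: number;
  chunk_index: number;
}

interface Page {
  url: string;
  title: string;
  chunks: Chunk[];
}

// Hypothetical record shape, to be adapted to the target database.
interface ChunkRecord {
  id: string;
  text: string;
  metadata: { url: string; title: string; heading_path: string; token_count: number };
}

// Turn one page into records, skipping chunks above the embedding model's input limit.
// A real pipeline might re-split oversized chunks instead of dropping them.
function toRecords(page: Page, maxInputTokens: number): ChunkRecord[] {
  return page.chunks
    .filter((chunk) => chunk.token_count <= maxInputTokens)
    .map((chunk) => ({
      id: `${page.url}#${chunk.chunk_index}`,
      text: chunk.text,
      metadata: {
        url: page.url,
        title: page.title,
        heading_path: chunk.heading_path.join(" > "),
        token_count: chunk.token_count,
      },
    }));
}

const page: Page = {
  url: "https://docs.example.com/getting-started/installation",
  title: "Installation",
  chunks: [
    { heading_path: ["Getting Started", "Installation"], text: "## Installation", token_count: 28, chunk_index: 0 },
    {
      heading_path: ["Getting Started", "Installation", "Requirements"],
      text: "### Requirements",
      token_count: 41,
      chunk_index: 1,
    },
  ],
};

// 512 is an arbitrary example limit, not a property of any particular model.
const records = toRecords(page, 512);
console.log(records.map((record) => record.id));
```

Deriving the record id from the URL and chunk_index keeps re-crawls idempotent, on the assumption that chunk order is stable between runs.
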
Technical Notes
SPA Handling
When renderJs: true, the actor uses Playwright's Chromium in headless mode. Wait strategy options:
- networkidle (default): waits until there have been no network connections for at least 500 ms. Most reliable for SPAs that load data via API.
- domcontentloaded: fires when the HTML is parsed. Faster but may miss dynamically rendered content.
- load: waits for all resources including images. Slowest; rarely needed.
If the primary wait strategy times out (30s), the actor automatically retries with domcontentloaded. If document.body is null (which can happen on route-transition frames in some SPAs), the actor waits 2 seconds and retries HTML extraction once before marking the page as failed.
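
The fallback from the primary wait strategy to domcontentloaded could look roughly like the sketch below. It is not the actor's code: `loadWithFallback` and the `Navigate` type are hypothetical, and navigation is injected as a function so the example runs without a browser. In practice `navigate` would wrap a call such as Playwright's `page.goto(url, { waitUntil, timeout })`. The null-body retry is not covered here.

```typescript
type WaitUntil = "networkidle" | "domcontentloaded" | "load";

// Stand-in for a real navigation call; rejects when the page does not settle in time.
type Navigate = (url: string, waitUntil: WaitUntil, timeoutMs: number) => Promise<void>;

// Try the primary strategy first, then retry once with domcontentloaded.
// Returns the strategy that succeeded.
async function loadWithFallback(
  navigate: Navigate,
  url: string,
  primary: WaitUntil = "networkidle",
  timeoutMs: number = 30000,
): Promise<WaitUntil> {
  try {
    await navigate(url, primary, timeoutMs);
    return primary;
  } catch (error) {
    // Simplification: any error triggers the fallback. A stricter version
    // would check that the error really is a timeout.
    if (primary === "domcontentloaded") throw error;
    await navigate(url, "domcontentloaded", timeoutMs);
    return "domcontentloaded";
  }
}

// Fake navigation that never reaches network idle, to exercise the fallback path.
const neverIdle: Navigate = async (_url, waitUntil) => {
  if (waitUntil === "networkidle") throw new Error("timeout");
};

loadWithFallback(neverIdle, "https://docs.example.com/").then((used) => {
  console.log(`loaded with ${used}`);
});
```
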
Chunking Algorithm
- The markdown is walked line-by-line. Each ATX heading (# through ######) opens a new section.
- Content before the first heading is grouped as a section with an empty heading path.
- If a section's token count is within maxTokensPerChunk, it becomes one chunk.
- If a section exceeds the limit, its body is split by paragraph (blank-line boundaries) and paragraphs are greedily packed into sub-chunks. The heading line is repeated at the top of each sub-chunk to preserve retrieval context.
- A single paragraph that exceeds maxTokensPerChunk is emitted as its own chunk (paragraphs are never split mid-sentence).
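
The steps above can be sketched as a small standalone function. This is an illustrative reimplementation under stated assumptions, not the actor's source: the names `splitSections` and `chunkMarkdown` are hypothetical, the heading line is assumed to count toward the token limit, and fenced code blocks get no special treatment (a `#` comment line inside a fence would be read as a heading).

```typescript
interface Chunk {
  heading_path: string[];
  text: string;
  token_count: number;
  chunk_index: number;
}

interface Section {
  headingPath: string[];
  headingLine: string; // "" for content before the first heading
  body: string[];
}

// Word count × 1.35, rounded up; whitespace splitting is an assumption.
function estimateTokens(text: string): number {
  return Math.ceil(text.split(/\s+/).filter((w) => w.length > 0).length * 1.35);
}

// Walk the markdown line by line; each ATX heading opens a new section.
function splitSections(markdown: string): Section[] {
  const sections: Section[] = [];
  let path: { level: number; title: string }[] = [];
  let current: Section = { headingPath: [], headingLine: "", body: [] };
  for (const line of markdown.split("\n")) {
    const match = /^(#{1,6})\s+(.*)$/.exec(line);
    if (!match) {
      current.body.push(line);
      continue;
    }
    sections.push(current);
    const [, hashes = "", title = ""] = match;
    const level = hashes.length;
    // Keep only ancestor headings, then append the new heading.
    path = path.filter((h) => h.level < level);
    path.push({ level, title: title.trim() });
    current = { headingPath: path.map((h) => h.title), headingLine: line, body: [] };
  }
  sections.push(current);
  // Drop a preamble that has neither a heading nor any text.
  return sections.filter((s) => s.headingLine !== "" || s.body.join("").trim() !== "");
}

function chunkMarkdown(markdown: string, maxTokensPerChunk: number): Chunk[] {
  const out: { headingPath: string[]; text: string }[] = [];
  for (const section of splitSections(markdown)) {
    const body = section.body.join("\n").trim();
    const withHeading = (t: string) => [section.headingLine, t].filter((p) => p !== "").join("\n\n");
    if (estimateTokens(withHeading(body)) <= maxTokensPerChunk) {
      out.push({ headingPath: section.headingPath, text: withHeading(body) });
      continue;
    }
    // Oversized section: greedily pack paragraphs, repeating the heading line.
    let packed: string[] = [];
    for (const paragraph of body.split(/\n\s*\n/)) {
      const candidate = withHeading([...packed, paragraph].join("\n\n"));
      if (packed.length > 0 && estimateTokens(candidate) > maxTokensPerChunk) {
        out.push({ headingPath: section.headingPath, text: withHeading(packed.join("\n\n")) });
        packed = [];
      }
      packed.push(paragraph); // an oversized lone paragraph still becomes its own chunk
    }
    if (packed.length > 0) {
      out.push({ headingPath: section.headingPath, text: withHeading(packed.join("\n\n")) });
    }
  }
  return out.map((c, i) => ({
    heading_path: c.headingPath,
    text: c.text,
    token_count: estimateTokens(c.text),
    chunk_index: i,
  }));
}

const sample = [
  "Intro text here.", "", "# Guide", "", "## Install", "",
  "one two three four five six", "", "seven eight nine ten eleven twelve", "",
  "## Usage", "", "Short.",
].join("\n");

const chunks = chunkMarkdown(sample, 12);
for (const chunk of chunks) console.log(chunk.chunk_index, chunk.heading_path, chunk.token_count);
```

With a limit of 12, the Install section is too large for one chunk, so each of its two paragraphs ends up in its own sub-chunk with the "## Install" line repeated on top.
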
Token Approximation
RAG-Crawler uses Math.ceil(wordCount × 1.35) — no tiktoken or external tokenizer dependency. Accuracy:
| Content type | Error vs. cl100k_base |
|---|---|
| English prose | ±8% |
| Mixed code + prose | ±15% |
| Dense code | ±20% |
For RAG chunking the error is acceptable — at worst a chunk boundary lands one paragraph off the "true" token boundary. For billing-critical token counting, run a local tiktoken pass on the output.
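
A minimal sketch of the approximation, for anyone who wants to reproduce the counts locally. Splitting on whitespace is an assumption about how words are counted, and `fitsWithMargin` is a hypothetical helper that pads the estimate by the error figure from the table before comparing it with a limit.

```typescript
// Math.ceil(wordCount × 1.35), with words taken as whitespace-separated runs.
function estimateTokens(text: string): number {
  const wordCount = text.split(/\s+/).filter((word) => word.length > 0).length;
  return Math.ceil(wordCount * 1.35);
}

// Conservative fit check: inflate the estimate by the expected error
// (for example 0.08 for English prose) before comparing with the limit.
function fitsWithMargin(text: string, limit: number, margin: number): boolean {
  return estimateTokens(text) * (1 + margin) <= limit;
}

const sentence = "Node.js 18 or later is required.";
console.log(estimateTokens(sentence)); // 6 words × 1.35 = 8.1, rounded up to 9
console.log(fitsWithMargin(sentence, 10, 0.08));
```
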
Custom Content Selector
If Readability misidentifies the main content area (common on documentation sites with complex layouts), set customCss to a CSS selector:
- customCss: "article.docs-content"
- customCss: "#main-content"
- customCss: ".prose"
The selector takes precedence over Readability. If the selector matches nothing, the actor falls back to Readability, then to full-body conversion.
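
The precedence order described here amounts to a simple fallback chain, sketched below with stand-in extractors. Nothing in it calls Cheerio or Readability: `extractContent`, `Extractor`, and the three stub functions are hypothetical, and a real implementation would wrap the selector lookup, the Readability pass, and the full-body conversion respectively.

```typescript
// An extractor returns content, or null when it finds nothing usable.
type Extractor = (html: string) => string | null;

interface Extracted {
  source: string;
  content: string;
}

// Run extractors in order and return the first non-empty result.
function extractContent(html: string, steps: { source: string; run: Extractor }[]): Extracted | null {
  for (const step of steps) {
    const content = step.run(html);
    if (content !== null && content.trim() !== "") return { source: step.source, content };
  }
  return null;
}

// Stub extractors for illustration only.
const steps: { source: string; run: Extractor }[] = [
  { source: "customCss", run: () => null }, // selector matched nothing
  { source: "readability", run: (html) => (html.includes("<article>") ? "article text" : null) },
  { source: "body", run: (html) => html },
];

console.log(extractContent("<body><article>Hi</article></body>", steps));
console.log(extractContent("<body>plain</body>", steps));
```
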
Keywords
website to markdown · RAG · LLM · Firecrawl alternative · MCP · vector database · chunking · token count · web scraper · SPA crawler · Playwright · Cheerio · Readability · AI pipeline · document ingestion · Apify actor