Pricing

from $0.01 / 1,000 results

DomainForge LLM Dataset Builder

Crawls websites and transforms web content into clean, structured datasets optimized for LLM fine-tuning, RAG applications, and knowledge base construction

Developer

fanio zilla

Last modified

7 hours ago

DomainForge LLM Dataset Builder 🛠️

Turn Any Website into Production-Ready LLM Training & RAG Data with One Click.

💡 Stop Scraping, Start Training

AI engineers and researchers can spend as much as 70-80% of their time building web scrapers, stripping boilerplate HTML, deduplicating content, and chunking files.

DomainForge does all of this for you automatically.

Whether you are fine-tuning a custom LLM, building a Retrieval-Augmented Generation (RAG) pipeline, or feeding a vector database, DomainForge crawls your target websites and processes them into clean, structured, and deduplicated datasets instantly.

✨ Superpowers
🚀 How It Works
🧩 Input Parameters
📊 Output Dataset Schema
📁 Example Configurations
🛠️ Usage
🩺 Troubleshooting
⚖️ Legal & Ethical Use
🔗 Try DomainForge Now

✨ Superpowers

🕷️ Smart Website Crawling: Crawl recursively, target specific URL paths (using globs or regex), or ingest entire sites instantly using sitemap.xml files.
🧼 High-Fidelity Noise Removal: Powered by Mozilla's Readability engine. We strip headers, footers, sidebars, cookie banners, navigation menus, and ads—leaving only the pristine, valuable content.
🧩 LLM-Optimized Auto-Chunking: Automatically splits long articles into clean, overlapping chunks (customizable chunk sizes and overlaps) optimized for vector store embeddings.
⚡ Exact Deduplication: Removes duplicate pages and identical content blocks using SHA-256 hashing, collapsing republished copies with identical text into a single canonical record. Edited or near-identical copies produce a different hash and are kept. (Semantic, embedding-based near-duplicate deduplication is configurable and on the roadmap; see Input Parameters.)
📂 Multi-Format Export: Download your clean dataset as JSONL (Hugging Face ready), JSON, CSV, or raw Markdown files.

🚀 How It Works

DomainForge operates as a state-of-the-art data refinery:

graph LR
    A[Raw URL / Sitemap] --> B[Smart Crawler]
    B --> C[Noise & Boilerplate Stripper]
    C --> D[SHA-256 Deduplication]
    D --> E[Semantic Chunking]
    E --> F[LLM-Ready Dataset]
    style A fill:#4dabf7,stroke:#228be6,stroke-width:2px,color:#fff
    style F fill:#37b24d,stroke:#2b8a3e,stroke-width:2px,color:#fff
    style B fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style C fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style D fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style E fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff

Input a URL: Just enter a start URL (e.g. https://docs.yourcompany.com) or a sitemap.
Forge processes the data: The Actor extracts metadata (author, publish date, language, word and token counts), cleans the markup, deduplicates content, and splits it into overlapping chunks.
Deploy: Directly ingest the output into your vector databases (Pinecone, Chroma, Qdrant) or Hugging Face datasets.
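
The exact-deduplication stage can be pictured with a short sketch. This is an illustration under stated assumptions, not the Actor's source: the `normalise` step (whitespace collapsing only) and the helper names are hypothetical, and the real normalisation may differ.

```python
import hashlib
import re


def normalise(text: str) -> str:
    """Collapse whitespace runs and trim (an assumed normalisation)."""
    return re.sub(r"\s+", " ", text).strip()


def dedup_hash(text: str) -> str:
    """Return the 64-character hex SHA-256 digest of the normalised text."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def drop_exact_duplicates(pages: list[dict]) -> list[dict]:
    """Keep the first page seen for each hash, preserving input order."""
    seen: set[str] = set()
    unique = []
    for page in pages:
        digest = dedup_hash(page["text"])
        if digest not in seen:
            seen.add(digest)
            unique.append({**page, "dedupHash": digest})
    return unique
```

Two pages whose text differs only in whitespace collapse to one record here, while a page with a single edited character keeps its own hash. That is why exact hashing cannot catch lightly edited republications.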

🧩 Input Parameters

DomainForge is built for both developers and non-technical builders. You can configure it with a simple JSON object or run it through the Apify Console UI. Only startUrls is required; every other field has a sensible default.

Simple Configuration (Just the basics)

{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxCrawlPages": 100
}

Full Enterprise Settings (Complete control)

{
  "startUrls": [
    { "url": "https://example.com/blog" },
    { "url": "https://example.com/sitemap.xml" }
  ],
  "maxCrawlPages": 1000,
  "maxDepth": 3,
  "includePatterns": ["*/blog/*", "*/docs/*"],
  "excludePatterns": ["*/admin/*", "*/login*"],
  "respectRobotsTxt": true,
  "requestDelay": 1000,
  "saveMarkdown": true,
  "saveHtml": false,
  "enableDeduplication": true,
  "embeddingProvider": "none",
  "chunkSize": 1024,
  "chunkOverlap": 200
}

Parameter Reference

Parameter	Type	Default	Description
`startUrls` *required	array of `{ url }`	—	One or more HTTP(S) URLs to begin crawling. Sitemap URLs (`sitemap.xml`) are detected and expanded automatically.
`maxCrawlPages`	integer	`1000`	Hard cap on pages crawled. Crawl stops once reached. (1–1,000,000)
`maxDepth`	integer	`3`	Link levels to follow from start URLs. `0` crawls only the start URLs / sitemap entries — ideal for sitemap-driven ingestion. (0–10)
`includePatterns`	string[]	`[]`	Glob (`*/blog/*`) or regex (`/\/blog\//`) patterns. When set, only matching URLs are crawled.
`excludePatterns`	string[]	`[]`	Glob/regex patterns to skip. Excludes take precedence over includes.
`respectRobotsTxt`	boolean	`true`	Honor `robots.txt`. Disallowed URLs are skipped and counted in the final statistics.
`requestDelay`	integer	`1000`	Milliseconds to wait between consecutive requests to the same domain. Be polite. (100–60,000 ms)
`saveMarkdown`	boolean	`true`	Include a cleaned Markdown version of each page in the output.
`saveHtml`	boolean	`false`	Include the original raw HTML. Off by default to keep datasets lean.
`enableDeduplication`	boolean	`true`	Remove exact duplicates (identical SHA-256 hash) from the dataset.
`embeddingProvider`	enum	`"none"`	Semantic-dedup provider: `"none"`, `"bge"` (on-device), or `"openai"` (API). See note below.
`semanticDeduplicationThreshold`	number	`0.85`	Cosine-similarity cutoff (0.0–1.0) above which two pages are treated as near-duplicates.
`chunkSize`	integer	`1024`	Target chunk size in characters for RAG splitting. (100–10,000 chars)
`chunkOverlap`	integer	`200`	Character overlap between consecutive chunks. Must be less than `chunkSize`. (0–9,999 chars)
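
To make the interaction between `chunkSize` and `chunkOverlap` concrete, here is a minimal sliding-window sketch in Python. It assumes a plain character window; the Actor's splitter may respect sentence or paragraph boundaries, and the function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows that share chunk_overlap characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and strictly less than chunk_size")
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks
```

With the defaults, each window advances 824 characters (1024 minus 200), so a 3,000-character page yields four chunks. The guard at the top mirrors the rule that `chunkOverlap` must be less than `chunkSize`; without it the window would never advance.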

ℹ️ About embeddingProvider: The semantic (embedding-based) deduplication path is configurable now but currently stubbed — when a provider other than "none" is selected, the pipeline runs normally and emits similarityScore: null on each item until the embedding engine ships. Exact SHA-256 deduplication (controlled by enableDeduplication) is fully active regardless. Set the provider now to keep your runs future-ready.
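
For intuition about what `semanticDeduplicationThreshold` would control once the embedding engine ships, the sketch below compares two embedding vectors by cosine similarity. It is purely illustrative: the function names are hypothetical, the vectors are stand-ins for real embeddings, and nothing here reflects the Actor's eventual implementation.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(a: list[float], b: list[float], threshold: float = 0.85) -> bool:
    """Treat two pages as near-duplicates when similarity exceeds the threshold."""
    return cosine_similarity(a, b) > threshold
```

For example, the vectors `[3, 4]` and `[4, 3]` have a similarity of 0.96 and would be flagged at the default 0.85 cutoff, while orthogonal vectors score 0.0 and would not.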

📊 Output Dataset Schema

Each crawled page is returned as one structured item in the default dataset, ready for database storage or model ingestion.

{
  "url": "https://example.com/blog/getting-started",
  "title": "Getting Started with AI Datasets",
  "markdown": "# Getting Started with AI Datasets\n\nHigh quality data is all you need...",
  "text": "Getting Started with AI Datasets High quality data is all you need...",
  "html": "<html><body>...</body></html>",
  "metadata": {
    "author": "Jane Doe",
    "publishDate": "2026-06-21T00:00:00.000Z",
    "language": "en",
    "wordCount": 745,
    "tokenCountApprox": 931,
    "crawledAt": "2026-06-21T12:00:00.000Z"
  },
  "chunks": [
    { "text": "High quality data is all you need to train custom LLMs...", "tokenCount": 256 }
  ],
  "dedupHash": "8f3b2a9ec1d74f6e0b9a2c5d8e1f4a7b3c9d0e2f5a8b1c4d7e0a3b6c9d2e5f8a",
  "similarityScore": null
}

Field Reference

Field	Type	Required	Description
`url`	string	✅	Fully-qualified URL of the crawled page.
`title`	string	✅	Page title extracted from content.
`markdown`	string	when `saveMarkdown: true`	Cleaned Markdown of the page content.
`text`	string	✅	Normalised plain text of the cleaned content.
`html`	string	when `saveHtml: true`	Original raw HTML of the page.
`metadata`	object	✅	Enriched metadata (see below).
`metadata.author`	string \| null	—	Author from meta tags, or `null`.
`metadata.publishDate`	string \| null	—	ISO date from meta tags / `<time>`, or `null`.
`metadata.language`	string	—	Language code from `<html lang>` (e.g. `en`), or `unknown`.
`metadata.wordCount`	integer	—	Word count of the plain text.
`metadata.tokenCountApprox`	integer	—	Estimated token count (`wordCount × 1.25`).
`metadata.crawledAt`	string	—	ISO 8601 timestamp when the page was processed.
`chunks`	array	✅	Overlapping text chunks for RAG ingestion.
`chunks[].text`	string	—	Text content of the chunk.
`chunks[].tokenCount`	integer	—	Approximate token count for the chunk.
`dedupHash`	string	✅	64-char SHA-256 hash of normalised text.
`similarityScore`	number \| null	—	Cosine similarity for semantic dedup (`null` until the embedding engine ships).

The Apify Console renders the dataset with three pre-built views: Overview, Content, and Chunks (RAG-ready). The Chunks view flattens chunks into one row each for direct vector-store ingestion.
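
If you work with the raw dataset items rather than the Console views, a similar one-row-per-chunk shape can be produced in a few lines. This is a sketch under assumptions: the `id` and `chunkIndex` fields are hypothetical additions for illustration, not part of the Actor's output schema.

```python
def flatten_chunks(items: list[dict]) -> list[dict]:
    """Turn page-level items into one row per chunk, carrying page context along."""
    rows = []
    for item in items:
        for index, chunk in enumerate(item.get("chunks", [])):
            rows.append({
                # Hypothetical stable id: page hash plus chunk position.
                "id": f"{item['dedupHash']}-{index}",
                "url": item["url"],
                "title": item["title"],
                "chunkIndex": index,
                "text": chunk["text"],
                "tokenCount": chunk["tokenCount"],
            })
    return rows
```

Each row then carries enough context (`url`, `title`) to be stored as vector-store metadata alongside the embedded `text`.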

📁 Example Configurations

Pre-built input presets for common scenarios live in the ./examples directory. Copy one into the Apify Console (or pass it as the Actor input) and adjust the URLs.

Preset	Best for	Highlights
./examples/blog-crawl.json	Harvesting a blog	Path-scoped to `/blog/`, skips tag/author/pagination noise, exact-dedup on
./examples/documentation-sitemap.json	Ingesting a whole docs site	Sitemap-driven, `maxDepth: 0` crawls only listed pages; fast and complete
./examples/rag-chunking.json	Building a RAG / vector corpus	Small 512-char chunks, tight overlap, markdown omitted for a lean dataset
./examples/semantic-dedup.json	Near-duplicate news/PR corpus	Enables the `bge` embedding path + 0.85 threshold (future-ready)

🛠️ Usage

Run on the Apify Console

Get started in seconds, with no install required. Open the Actor's page in the Apify Store, enter a startUrls value in the input form, and click Start.

Run from the CLI

# Run with the default input (set startUrls in the Console form)
apify call domainforge-llm-dataset-builder

# Run with a local input file and fetch the dataset as JSONL (Hugging Face ready)
apify call domainforge-llm-dataset-builder --input-file examples/blog-crawl.json
apify datasets get-items <DATASET_ID> --format=jsonl > dataset.jsonl

Run programmatically (JavaScript / Node.js)

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('domainforge-llm-dataset-builder').call({
    startUrls: [{ url: 'https://docs.apify.com' }],
    maxCrawlPages: 100,
    enableDeduplication: true,
    chunkSize: 1024,
    chunkOverlap: 200,
});

// Fetch the cleaned dataset items from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`${items.length} pages forged`);

Run programmatically (Python)

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("domainforge-llm-dataset-builder").call(run_input={
    "startUrls": [{"url": "https://docs.apify.com"}],
    "maxCrawlPages": 100,
    "enableDeduplication": True,
    "chunkSize": 1024,
    "chunkOverlap": 200,
})

dataset = client.dataset(run["defaultDatasetId"]).list_items()
print(f"{dataset.count} pages forged")

Load into Hugging Face

Download the dataset as JSONL from the Console (or apify datasets get-items <DATASET_ID> --format=jsonl), then:

from datasets import load_dataset
ds = load_dataset("json", data_files="dataset.jsonl", split="train")

💡 Quick Tips for Best Results

Sitemaps are your friend: Use the sitemap URL (e.g. https://example.com/sitemap.xml) to crawl an entire site instantly. Set maxDepth to 0 to only crawl pages listed in the sitemap.
Targeted Ingestion: Use includePatterns (like */docs/* or */help/*) to prevent wasting compute on irrelevant pages like contacts or terms of service.
Hugging Face Friendly: Download the dataset in JSONL format using the Apify API for native integration with Hugging Face pipelines.
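
Before spending a run on a large site, it can help to preview which URLs a pattern set would admit. The sketch below is one possible reading of the documented rules (empty `includePatterns` admits everything, excludes take precedence, slash-delimited strings are regexes). The helper names are hypothetical and the Actor's own matcher may differ in details such as case sensitivity.

```python
import re
from fnmatch import fnmatchcase


def matches(url: str, pattern: str) -> bool:
    """Match a URL against a glob, or a regex written as /.../ (assumed convention)."""
    if len(pattern) > 2 and pattern.startswith("/") and pattern.endswith("/"):
        return re.search(pattern[1:-1], url) is not None
    return fnmatchcase(url, pattern)


def should_crawl(url: str, include_patterns: list[str], exclude_patterns: list[str]) -> bool:
    """Apply excludes first, then includes; an empty include list admits every URL."""
    if any(matches(url, p) for p in exclude_patterns):
        return False
    if not include_patterns:
        return True
    return any(matches(url, p) for p in include_patterns)
```

For instance, with `includePatterns` of `["*/docs/*"]` and `excludePatterns` of `["*/admin/*"]`, a URL under `/docs/admin/` is skipped even though it matches the include pattern.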

🩺 Troubleshooting

Symptom	Likely cause	Fix
Zero pages crawled / "Pages crawled: 0"	All start URLs blocked by `robots.txt` or outside `includePatterns`	Set `respectRobotsTxt: false` only if you have permission, or widen `includePatterns`. Check the run log for the count of robots-disallowed URLs.
Empty `text` / pages "skipped (empty/short)"	Readability couldn't find an article body (JS-rendered content, paywall, login wall)	DomainForge uses static HTTP (Cheerio) crawling. For heavily JS-rendered sites, pre-render the HTML or point at a server-rendered/AMP version.
Dataset missing `markdown` or `html`	`saveMarkdown` / `saveHtml` toggles	Set `saveMarkdown: true` (default on) or `saveHtml: true` (default off) in your input, depending on which field is missing.
`similarityScore` is always `null`	Expected today — the embedding engine is not yet shipped	Exact SHA-256 dedup is active; semantic dedup is configured-but-stubbed. See the `embeddingProvider` note above.
`chunkOverlap` rejected	Overlap ≥ `chunkSize`	Lower `chunkOverlap` so it is strictly less than `chunkSize`.
Crawl hits `maxCrawlPages` too early	Default cap is 1000	Raise `maxCrawlPages`. For very large sites, prefer sitemap mode (`maxDepth: 0`) for predictable, complete coverage.
Run times out or runs out of memory	Very large crawl + default resources	Defaults are 2048 MB / 3600 s. Raise memory in the Console run settings, and/or split the crawl across multiple runs by path.
Lots of duplicate-looking pages survive	Republished content with edits won't match an exact hash	Enable `embeddingProvider` so near-duplicates are flagged once the semantic engine ships; until then, deduplicate downstream on `dedupHash` and `title`.
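
The downstream deduplication suggested in the last row could look like the following sketch. It assumes items shaped like the output schema (top-level `title` and `dedupHash`), and the function name is hypothetical. Matching on title alone is aggressive, so treat this as a starting point rather than a recommended policy.

```python
def dedupe_downstream(items: list[dict]) -> list[dict]:
    """Keep the first item for each dedupHash and each case-insensitive title."""
    seen_hashes: set[str] = set()
    seen_titles: set[str] = set()
    kept = []
    for item in items:
        # Lowercase and collapse whitespace so trivially different titles compare equal.
        title_key = " ".join(item["title"].lower().split())
        if item["dedupHash"] in seen_hashes or title_key in seen_titles:
            continue
        seen_hashes.add(item["dedupHash"])
        seen_titles.add(title_key)
        kept.append(item)
    return kept
```

Distinct pages that legitimately share a title (for example, paginated archives) would be dropped by this rule, so review a sample of the removed items before relying on it.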

⚖️ Legal & Ethical Use

DomainForge is a general-purpose web crawler. You are responsible for using it lawfully and ethically.

Respect Terms of Service. Review the target site's ToS before crawling. Some sites prohibit automated access; honoring that is your responsibility.
Honor robots.txt. respectRobotsTxt is on by default. Keep it on unless you have explicit authorization, and even then slow down with requestDelay.
Rate-limit politely. Use a sensible requestDelay (default 1000 ms) to avoid overloading the target server. Treat shared infrastructure as you would want yours treated.
Personal data & GDPR/CCPA. Do not crawl, store, or republish personal data without a lawful basis. If your target contains PII, redact it before storage and avoid passing it into training pipelines.
Copyright & licensing. Crawled content may be copyrighted. Building a dataset for internal RAG is typically different from redistributing the text publicly — verify the license, attribute authors (metadata.author, metadata.publishDate are captured to help), and seek permission before publishing derived datasets.
Don't republish raw scraped content as if it were your own. Use output for legitimate training/retrieval purposes consistent with fair use and the source's license.

If you are unsure whether a use case is permitted, ask the data owner. Apify's Acceptable Use Policy applies to all Actors on the platform.
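
As a pre-flight check before a crawl, you can test a handful of target URLs against a site's robots.txt with Python's standard library. This sketch parses robots.txt text you have already fetched. The helper names and the idea of raising the delay to match `Crawl-delay` are illustrative assumptions, not a description of how the Actor behaves.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def allowed_urls(urls: list[str], robots_txt: str, user_agent: str = "*") -> list[str]:
    """Return only the URLs the given user agent may fetch."""
    parser = build_parser(robots_txt)
    return [url for url in urls if parser.can_fetch(user_agent, url)]


def polite_delay_ms(robots_txt: str, request_delay_ms: int = 1000, user_agent: str = "*") -> int:
    """Use the larger of the configured delay and any Crawl-delay directive."""
    crawl_delay = build_parser(robots_txt).crawl_delay(user_agent)
    if crawl_delay is None:
        return request_delay_ms
    return max(request_delay_ms, int(crawl_delay * 1000))
```

The result of `polite_delay_ms` is a candidate value for `requestDelay`. A site that declares `Crawl-delay: 2` would suggest 2000 ms rather than the 1000 ms default.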

🔗 Try DomainForge Now

Get started in seconds! Run this Actor from the Apify Console, or call it from the Apify CLI:

$ apify call domainforge-llm-dataset-builder

For custom builds, bug reports, or feature requests, feel free to check the project LICENSE and source code.
