DomainForge LLM Dataset Builder
Pricing
from $0.01 / 1,000 results
Crawls websites and transforms web content into clean, structured datasets optimized for LLM fine-tuning, RAG applications, and knowledge base construction
Developer
fanio zilla
Maintained by Community
DomainForge LLM Dataset Builder 🛠️
Turn Any Website into Production-Ready LLM Training & RAG Data with One Click.
💡 Stop Scraping, Start Training
AI Engineers and researchers spend 70-80% of their time building web scrapers, stripping boilerplate HTML, deduplicating content, and chunking files.
DomainForge does all of this for you automatically.
Whether you are fine-tuning a custom LLM, building a Retrieval-Augmented Generation (RAG) pipeline, or feeding a vector database, DomainForge crawls your target websites and processes them into clean, structured, and deduplicated datasets instantly.
Table of Contents
- ✨ Superpowers
- 🚀 How It Works
- 🧩 Input Parameters
- 📊 Output Dataset Schema
- 📁 Example Configurations
- 🛠️ Usage
- 🩺 Troubleshooting
- ⚖️ Legal & Ethical Use
- 🔗 Try DomainForge Now
✨ Superpowers
- 🕷️ Smart Website Crawling: Crawl recursively, target specific URL paths (using globs or regex), or ingest entire sites instantly using sitemap.xml files.
- 🧼 High-Fidelity Noise Removal: Powered by Mozilla's Readability engine. We strip headers, footers, sidebars, cookie banners, navigation menus, and ads, leaving only the pristine, valuable content.
- 🧩 LLM-Optimized Auto-Chunking: Automatically splits long articles into clean, overlapping chunks (customizable chunk sizes and overlaps) optimized for vector store embeddings.
- ⚡ Exact Deduplication: Removes duplicate pages and identical content blocks using SHA-256 hashing, collapsing republished content whose normalised text is identical into a single canonical record. Near-identical pages (for example, republished content with edits) are not caught by exact hashing; semantic, embedding-based near-duplicate deduplication is configurable but not yet active (see Input Parameters).
- 📂 Multi-Format Export: Download your clean dataset as JSONL (Hugging Face ready), JSON, CSV, or raw Markdown files.
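To make the auto-chunking idea concrete, here is a minimal sketch of overlapping, character-based chunking. It is an illustration only, not DomainForge's actual implementation: the function name chunk_text is hypothetical, and the sliding-window strategy is an assumption based on the chunkSize and chunkOverlap parameters described in this README.

```python
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 200) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    Hypothetical helper: the real Actor may split on different boundaries.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and less than chunk_size")

    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reaches the end of the text
    return chunks


# Example: 100 characters, 40-character windows, 10 characters shared
chunks = chunk_text("abcdefghij" * 10, chunk_size=40, chunk_overlap=10)
```

Each chunk repeats the tail of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.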
🚀 How It Works
DomainForge operates as a state-of-the-art data refinery:
graph LR
  A[Raw URL / Sitemap] --> B[Smart Crawler]
  B --> C[Noise & Boilerplate Stripper]
  C --> D[SHA-256 Deduplication]
  D --> E[Semantic Chunking]
  E --> F[LLM-Ready Dataset]
  style A fill:#4dabf7,stroke:#228be6,stroke-width:2px,color:#fff
  style F fill:#37b24d,stroke:#2b8a3e,stroke-width:2px,color:#fff
  style B fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
  style C fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
  style D fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
  style E fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
- Input a URL: Just enter a start URL (e.g. https://docs.yourcompany.com) or a sitemap.
- Forge processes the data: The actor extracts metadata (author, publish date, language, counts), cleans the markup, deduplicates content, and splits it into semantic chunks.
- Deploy: Directly ingest the output into your vector databases (Pinecone, Chroma, Qdrant) or Hugging Face datasets.
🧩 Input Parameters
DomainForge is built for both developers and non-technical builders. You can configure it with a simple JSON object or run it through the Apify Console UI. Only startUrls is required; every other field has a sensible default.
Simple Configuration (Just the basics)
{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxCrawlPages": 100
}
Full Enterprise Settings (Complete control)
{
  "startUrls": [
    { "url": "https://example.com/blog" },
    { "url": "https://example.com/sitemap.xml" }
  ],
  "maxCrawlPages": 1000,
  "maxDepth": 3,
  "includePatterns": ["*/blog/*", "*/docs/*"],
  "excludePatterns": ["*/admin/*", "*/login*"],
  "respectRobotsTxt": true,
  "requestDelay": 1000,
  "saveMarkdown": true,
  "saveHtml": false,
  "enableDeduplication": true,
  "embeddingProvider": "none",
  "chunkSize": 1024,
  "chunkOverlap": 200
}
Parameter Reference
| Parameter | Type | Default | Description |
|---|---|---|---|
| startUrls (required) | array of { url } | - | One or more HTTP(S) URLs to begin crawling. Sitemap URLs (sitemap.xml) are detected and expanded automatically. |
| maxCrawlPages | integer | 1000 | Hard cap on pages crawled. Crawl stops once reached. (1–1,000,000) |
| maxDepth | integer | 3 | Link levels to follow from start URLs. 0 crawls only the start URLs / sitemap entries, which is ideal for sitemap-driven ingestion. (0–10) |
| includePatterns | string[] | [] | Glob (*/blog/*) or regex (/\/blog\//) patterns. When set, only matching URLs are crawled. |
| excludePatterns | string[] | [] | Glob/regex patterns to skip. Excludes take precedence over includes. |
| respectRobotsTxt | boolean | true | Honor robots.txt. Disallowed URLs are skipped and counted in the final statistics. |
| requestDelay | integer | 1000 | Milliseconds to wait between consecutive requests to the same domain. Be polite. (100–60,000 ms) |
| saveMarkdown | boolean | true | Include a cleaned Markdown version of each page in the output. |
| saveHtml | boolean | false | Include the original raw HTML. Off by default to keep datasets lean. |
| enableDeduplication | boolean | true | Remove exact duplicates (identical SHA-256 hash) from the dataset. |
| embeddingProvider | enum | "none" | Semantic-dedup provider: "none", "bge" (on-device), or "openai" (API). See note below. |
| semanticDeduplicationThreshold | number | 0.85 | Cosine-similarity cutoff (0.0–1.0) above which two pages are treated as near-duplicates. |
| chunkSize | integer | 1024 | Target chunk size in characters for RAG splitting. (100–10,000 chars) |
| chunkOverlap | integer | 200 | Character overlap between consecutive chunks. Must be less than chunkSize. (0–9,999 chars) |
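As a rough illustration of how includePatterns and excludePatterns interact, the sketch below filters URLs so that excludes take precedence over includes. It is not the Actor's code: the helper names are hypothetical, and treating a pattern wrapped in slashes as a regex (and anything else as a glob) is an assumption inferred from the examples in the table.

```python
import re
from fnmatch import fnmatchcase


def matches(url: str, pattern: str) -> bool:
    """Assumed convention: /.../ is a regex, anything else is a glob."""
    if len(pattern) >= 2 and pattern.startswith("/") and pattern.endswith("/"):
        return re.search(pattern[1:-1], url) is not None
    return fnmatchcase(url, pattern)


def should_crawl(url: str, include_patterns: list[str], exclude_patterns: list[str]) -> bool:
    # Excludes win, even if an include pattern also matches.
    if any(matches(url, p) for p in exclude_patterns):
        return False
    # With no include patterns, every non-excluded URL is allowed.
    if include_patterns:
        return any(matches(url, p) for p in include_patterns)
    return True


allowed = should_crawl("https://example.com/blog/post-1", ["*/blog/*"], ["*/admin/*"])
blocked = should_crawl("https://example.com/blog/login", ["*/blog/*"], ["*/login*"])
```

Here allowed is True and blocked is False: the second URL matches an include pattern, but the exclude pattern overrides it.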
ℹ️ About embeddingProvider: The semantic (embedding-based) deduplication path is configurable now but currently stubbed. When a provider other than "none" is selected, the pipeline runs normally and emits similarityScore: null on each item until the embedding engine ships. Exact SHA-256 deduplication (controlled by enableDeduplication) is fully active regardless. Set the provider now to keep your runs future-ready.
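For intuition about what exact SHA-256 deduplication does and does not catch, here is a small sketch. The normalisation step (collapsing whitespace and lowercasing) is an assumption made for illustration; this README does not specify how DomainForge normalises text before hashing, and both function names are hypothetical.

```python
import hashlib
import re


def dedup_hash(text: str) -> str:
    """SHA-256 of normalised text (normalisation rules assumed, not documented)."""
    normalised = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def drop_exact_duplicates(pages: list[dict]) -> list[dict]:
    """Keep the first page seen for each hash and attach the hash to it."""
    seen = set()
    unique = []
    for page in pages:
        digest = dedup_hash(page["text"])
        if digest not in seen:
            seen.add(digest)
            unique.append({**page, "dedupHash": digest})
    return unique


pages = [
    {"url": "https://example.com/a", "text": "High quality data is all you need."},
    {"url": "https://example.com/b", "text": "High  quality data\nis all you need."},
    {"url": "https://example.com/c", "text": "High quality data is almost all you need."},
]
unique_pages = drop_exact_duplicates(pages)
```

The second page differs only in whitespace and is dropped; the third has an edited sentence, so its hash differs and it survives. That gap is what semantic deduplication is meant to close.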
📊 Output Dataset Schema
Each crawled page is returned as one structured item in the default dataset, ready for database storage or model ingestion.
{
  "url": "https://example.com/blog/getting-started",
  "title": "Getting Started with AI Datasets",
  "markdown": "# Getting Started with AI Datasets\n\nHigh quality data is all you need...",
  "text": "Getting Started with AI Datasets High quality data is all you need...",
  "html": "<html><body>...</body></html>",
  "metadata": {
    "author": "Jane Doe",
    "publishDate": "2026-06-21T00:00:00.000Z",
    "language": "en",
    "wordCount": 745,
    "tokenCountApprox": 931,
    "crawledAt": "2026-06-21T12:00:00.000Z"
  },
  "chunks": [
    { "text": "High quality data is all you need to train custom LLMs...", "tokenCount": 256 }
  ],
  "dedupHash": "8f3b2a9ec1d74f6e0b9a2c5d8e1f4a7b3c9d0e2f5a8b1c4d7e0a3b6c9d2e5f8a",
  "similarityScore": null
}
Field Reference
| Field | Type | Required | Description |
|---|---|---|---|
| url | string | ✅ | Fully-qualified URL of the crawled page. |
| title | string | ✅ | Page title extracted from content. |
| markdown | string | when saveMarkdown: true | Cleaned Markdown of the page content. |
| text | string | ✅ | Normalised plain text of the cleaned content. |
| html | string | when saveHtml: true | Original raw HTML of the page. |
| metadata | object | ✅ | Enriched metadata (see below). |
| metadata.author | string or null | - | Author from meta tags, or null. |
| metadata.publishDate | string or null | - | ISO date from meta tags / <time>, or null. |
| metadata.language | string | - | Language code from <html lang> (e.g. en), or unknown. |
| metadata.wordCount | integer | - | Word count of the plain text. |
| metadata.tokenCountApprox | integer | - | Estimated token count (wordCount × 1.25). |
| metadata.crawledAt | string | - | ISO 8601 timestamp when the page was processed. |
| chunks | array | ✅ | Overlapping text chunks for RAG ingestion. |
| chunks[].text | string | - | Text content of the chunk. |
| chunks[].tokenCount | integer | - | Approximate token count for the chunk. |
| dedupHash | string | ✅ | 64-char SHA-256 hash of normalised text. |
| similarityScore | number or null | - | Cosine similarity for semantic dedup (null until the embedding engine ships). |
The Apify Console renders the dataset with three pre-built views: Overview, Content, and Chunks (RAG-ready). The Chunks view flattens chunks into one row each for direct vector-store ingestion.
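If you work with the raw items outside the Console, a similar one-row-per-chunk layout can be produced locally. The sketch below assumes items shaped like the schema above; the flatten_chunks helper and the composite id field are hypothetical conveniences, not part of the Actor's output.

```python
def flatten_chunks(items: list[dict]) -> list[dict]:
    """Turn page-level items into one record per chunk."""
    rows = []
    for item in items:
        for index, chunk in enumerate(item.get("chunks", [])):
            rows.append({
                # Hypothetical id: page hash plus chunk position
                "id": f'{item["dedupHash"]}-{index}',
                "url": item["url"],
                "title": item["title"],
                "text": chunk["text"],
                "tokenCount": chunk["tokenCount"],
            })
    return rows


items = [{
    "url": "https://example.com/blog/getting-started",
    "title": "Getting Started with AI Datasets",
    "dedupHash": "8f3b2a9e",  # shortened for the example
    "chunks": [
        {"text": "High quality data is all you need...", "tokenCount": 256},
        {"text": "...to train custom LLMs.", "tokenCount": 120},
    ],
}]
rows = flatten_chunks(items)
```

Each row carries the source URL and title alongside the chunk text, which is usually what a vector store needs as metadata for citation at retrieval time.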
📁 Example Configurations
Pre-built input presets for common scenarios live in the ./examples directory. Copy one into the Apify Console (or pass it as the Actor input) and adjust the URLs.
| Preset | Best for | Highlights |
|---|---|---|
| ./examples/blog-crawl.json | Harvesting a blog | Path-scoped to /blog/, skips tag/author/pagination noise, exact-dedup on |
| ./examples/documentation-sitemap.json | Ingesting a whole docs site | Sitemap-driven, maxDepth 0 crawls only listed pages — fast and complete |
| ./examples/rag-chunking.json | Building a RAG / vector corpus | Small 512-char chunks, tight overlap, markdown omitted for a lean dataset |
| ./examples/semantic-dedup.json | Near-duplicate news/PR corpus | Enables the bge embedding path + 0.85 threshold (future-ready) |
🛠️ Usage
Run on the Apify Console
Get started in seconds — no install required. Click Start with an input on the Actor's Apify Store page, paste a startUrls value, and hit Start.
Run from the CLI
# Run with the default input (set startUrls in the Console form)
apify call domainforge-llm-dataset-builder

# Run with a local input file and fetch the dataset as JSONL (Hugging Face ready)
apify call domainforge-llm-dataset-builder --input-file examples/blog-crawl.json
apify get-dataset --dataset-id <DATASET_ID> --format=jsonl > dataset.jsonl
Run programmatically (JavaScript / Node.js)
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

const run = await client.actor('domainforge-llm-dataset-builder').call({
  startUrls: [{ url: 'https://docs.apify.com' }],
  maxCrawlPages: 100,
  enableDeduplication: true,
  chunkSize: 1024,
  chunkOverlap: 200,
});

// Fetch the cleaned dataset items
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`${items.length} pages forged`);
Run programmatically (Python)
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

run = client.actor("domainforge-llm-dataset-builder").call(run_input={
    "startUrls": [{"url": "https://docs.apify.com"}],
    "maxCrawlPages": 100,
    "enableDeduplication": True,
    "chunkSize": 1024,
    "chunkOverlap": 200,
})

dataset = client.dataset(run["defaultDatasetId"]).list_items()
print(f"{dataset.count} pages forged")
Load into Hugging Face
Download the dataset as JSONL from the Console (or apify get-dataset --format=jsonl), then:
from datasets import load_dataset

ds = load_dataset("json", data_files="dataset.jsonl", split="train")
💡 Quick Tips for Best Results
- Sitemaps are your friend: Use the sitemap URL (e.g. https://example.com/sitemap.xml) to crawl an entire site instantly. Set maxDepth to 0 to only crawl pages listed in the sitemap.
- Targeted Ingestion: Use includePatterns (like */docs/* or */help/*) to prevent wasting compute on irrelevant pages like contacts or terms of service.
- Hugging Face Friendly: Download the dataset in JSONL format using the Apify API for native integration with Hugging Face pipelines.
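To preview which pages a sitemap-driven run would cover, you can list the sitemap entries yourself before starting a crawl. This is a hedged sketch of what sitemap expansion roughly involves, not the Actor's implementation: the sitemap_urls helper is hypothetical, and the sample XML is inlined so the snippet runs without network access.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc") if loc.text]


sample = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
</urlset>"""

urls = sitemap_urls(sample)

# Equivalent explicit input: crawl exactly these pages and follow no links
actor_input = {"startUrls": [{"url": u} for u in urls], "maxDepth": 0}
```

Note that a sitemap index file also uses loc elements, but there they point at further sitemaps rather than pages, so a fuller version would need to fetch and expand those as well.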
🩺 Troubleshooting
| Symptom | Likely cause | Fix |
|---|---|---|
| Zero pages crawled / "Pages crawled: 0" | All start URLs blocked by robots.txt or outside includePatterns | Set respectRobotsTxt: false only if you have permission, or widen includePatterns. Check the run log for the count of robots-disallowed URLs. |
| Empty text / pages "skipped (empty/short)" | Readability couldn't find an article body (JS-rendered content, paywall, login wall) | DomainForge uses static HTTP (Cheerio) crawling. For heavily JS-rendered sites, pre-render the HTML or point at a server-rendered/AMP version. |
| Dataset missing markdown or html | saveMarkdown / saveHtml toggles | Set saveMarkdown: true (default on) and saveHtml: true (default off) in your input. |
| similarityScore is always null | Expected today: the embedding engine is not yet shipped | Exact SHA-256 dedup is active; semantic dedup is configured but stubbed. See the embeddingProvider note above. |
| chunkOverlap rejected | Overlap ≥ chunkSize | Lower chunkOverlap so it is strictly less than chunkSize. |
| Crawl hits maxCrawlPages too early | Default cap is 1000 | Raise maxCrawlPages. For very large sites, prefer sitemap mode (maxDepth: 0) for predictable, complete coverage. |
| Run times out or runs out of memory | Very large crawl + default resources | Defaults are 2048 MB / 3600 s. Raise memory in the Console run settings, and/or split the crawl across multiple runs by path. |
| Lots of duplicate-looking pages survive | Republished content with edits won't match an exact hash | Enable embeddingProvider so near-duplicates are flagged once the semantic engine ships; until then, deduplicate downstream on dedupHash and title. |
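The downstream deduplication suggested in the last row can be sketched as follows. This is one possible approach under stated assumptions, not a feature of the Actor: the dedupe_downstream helper is hypothetical, and matching on a lowercased, whitespace-collapsed title is a deliberately blunt heuristic that will also drop distinct pages that happen to share a title.

```python
def dedupe_downstream(items: list[dict]) -> list[dict]:
    """Keep the first item for each dedupHash and each normalised title."""
    seen_hashes = set()
    seen_titles = set()
    kept = []
    for item in items:
        title_key = " ".join(item["title"].lower().split())
        if item["dedupHash"] in seen_hashes or title_key in seen_titles:
            continue  # exact duplicate, or an edited copy with the same title
        seen_hashes.add(item["dedupHash"])
        seen_titles.add(title_key)
        kept.append(item)
    return kept


items = [
    {"url": "https://example.com/news/launch", "title": "Product Launch", "dedupHash": "aaa"},
    {"url": "https://example.com/pr/launch", "title": "Product  launch", "dedupHash": "bbb"},
    {"url": "https://example.com/news/hiring", "title": "We Are Hiring", "dedupHash": "ccc"},
]
kept = dedupe_downstream(items)
```

The second item has a different hash because its body was edited, but its title matches the first after normalisation, so only the first and third items are kept.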
⚖️ Legal & Ethical Use
DomainForge is a general-purpose web crawler. You are responsible for using it lawfully and ethically.
- Respect Terms of Service. Review the target site's ToS before crawling. Some sites prohibit automated access; honoring that is your responsibility.
- Honor robots.txt. respectRobotsTxt is on by default. Keep it on unless you have explicit authorization, and even then slow down with requestDelay.
- Rate-limit politely. Use a sensible requestDelay (default 1000 ms) to avoid overloading the target server. Treat shared infrastructure as you would want yours treated.
- Personal data & GDPR/CCPA. Do not crawl, store, or republish personal data without a lawful basis. If your target contains PII, redact it before storage and avoid passing it into training pipelines.
- Copyright & licensing. Crawled content may be copyrighted. Building a dataset for internal RAG is typically different from redistributing the text publicly: verify the license, attribute authors (metadata.author and metadata.publishDate are captured to help), and seek permission before publishing derived datasets.
- Don't republish raw scraped content as if it were your own. Use output for legitimate training/retrieval purposes consistent with fair use and the source's license.
If you are unsure whether a use case is permitted, ask the data owner. Apify's Acceptable Use Policy applies to all Actors on the platform.
🔗 Try DomainForge Now
Get started in seconds! Run this Actor directly on the Apify Console, or from the CLI:
$ apify call domainforge-llm-dataset-builder
For custom builds, bug reports, or feature requests, feel free to check the project LICENSE and source code.