DomainForge LLM Dataset Builder

Pricing

from $0.01 / 1,000 results

Crawls websites and transforms web content into clean, structured datasets optimized for LLM fine-tuning, RAG applications, and knowledge base construction

Rating

0.0

(0)

Developer

fanio zilla

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 2

Monthly active users: 1

Last modified: 7 hours ago

DomainForge LLM Dataset Builder 🛠️

Turn Any Website into Production-Ready LLM Training & RAG Data with One Click.

💡 Stop Scraping, Start Training

AI engineers and researchers often spend more time building web scrapers, stripping boilerplate HTML, deduplicating content, and chunking files than they spend on the models themselves.

DomainForge does all of this for you automatically.

Whether you are fine-tuning a custom LLM, building a Retrieval-Augmented Generation (RAG) pipeline, or feeding a vector database, DomainForge crawls your target websites and processes them into clean, structured, and deduplicated datasets instantly.


✨ Superpowers

  • 🕷️ Smart Website Crawling: Crawl recursively, target specific URL paths (using globs or regex), or ingest entire sites instantly using sitemap.xml files.
  • 🧼 High-Fidelity Noise Removal: Powered by Mozilla's Readability engine. We strip headers, footers, sidebars, cookie banners, navigation menus, and ads—leaving only the pristine, valuable content.
  • 🧩 LLM-Optimized Auto-Chunking: Automatically splits long articles into clean, overlapping chunks (customizable chunk sizes and overlaps) optimized for vector store embeddings.
  • ⚡ Exact Deduplication: Removes duplicate pages and identical content blocks using SHA-256 hashing of the normalised text, collapsing pages with identical content into a single canonical record. Republished content that has been edited does not share a hash and is kept. (Semantic, embedding-based near-duplicate deduplication is configurable and on the roadmap; see Input Parameters.)
  • 📂 Multi-Format Export: Download your clean dataset as JSONL (Hugging Face ready), JSON, CSV, or raw Markdown files.
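
To make the exact-deduplication bullet concrete, here is a minimal sketch of hash-based deduplication. It is not DomainForge's own code: the normalisation rule (collapse whitespace, trim) and the helper names `dedup_hash` and `drop_exact_duplicates` are assumptions chosen for illustration.

```python
import hashlib
import re


def dedup_hash(text: str) -> str:
    """Return the 64-character SHA-256 hex digest of whitespace-normalised text."""
    # Assumption: normalisation means collapsing whitespace runs and trimming the ends.
    normalised = re.sub(r"\s+", " ", text).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()


def drop_exact_duplicates(pages: list[dict]) -> list[dict]:
    """Keep the first page seen for each hash and drop later identical pages."""
    seen: set[str] = set()
    unique = []
    for page in pages:
        digest = dedup_hash(page["text"])
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique


pages = [
    {"url": "https://example.com/a", "text": "Hello   world"},
    {"url": "https://example.com/b", "text": "Hello world\n"},   # same text after normalisation
    {"url": "https://example.com/c", "text": "Hello, world"},    # edited, so a different hash
]
unique_pages = drop_exact_duplicates(pages)
print([page["url"] for page in unique_pages])
```

The third page survives because a single added comma changes the hash, which is why edited republications need a similarity-based approach rather than an exact one.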

🚀 How It Works

DomainForge processes every crawl as a linear pipeline:

graph LR
    A[Raw URL / Sitemap] --> B[Smart Crawler]
    B --> C[Noise & Boilerplate Stripper]
    C --> D[SHA-256 Deduplication]
    D --> E[Semantic Chunking]
    E --> F[LLM-Ready Dataset]
    style A fill:#4dabf7,stroke:#228be6,stroke-width:2px,color:#fff
    style F fill:#37b24d,stroke:#2b8a3e,stroke-width:2px,color:#fff
    style B fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style C fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style D fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style E fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff

  1. Input a URL: Just enter a start URL (e.g. https://docs.yourcompany.com) or a sitemap.
  2. Forge processes the data: The actor extracts metadata (author, publish date, language, counts), cleans the markup, deduplicates content, and splits it into semantic chunks.
  3. Deploy: Directly ingest the output into your vector databases (Pinecone, Chroma, Qdrant) or Hugging Face datasets.
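
The chunking step can be pictured with a short sketch. This is an assumption about how character-based overlapping chunks might be produced, using the chunkSize and chunkOverlap semantics described in this README; the function name `chunk_text` is hypothetical and the Actor's real splitter may respect sentence or paragraph boundaries.

```python
def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 200) -> list[str]:
    """Split text into chunks of up to chunk_size characters that overlap by chunk_overlap."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and strictly less than chunk_size")
    step = chunk_size - chunk_overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the window already reached the end of the text
    return chunks


print(chunk_text("abcdefghij", chunk_size=4, chunk_overlap=1))
# ['abcd', 'defg', 'ghij']
```

Each chunk repeats the last character of the previous one, so context that straddles a boundary appears in both chunks.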

🧩 Input Parameters

DomainForge is built for both developers and non-technical builders. You can configure it with a short JSON input or through the form in the Apify Console. Only startUrls is required; every other field has a sensible default.

Simple Configuration (Just the basics)

{
    "startUrls": [{ "url": "https://docs.apify.com" }],
    "maxCrawlPages": 100
}

Full Enterprise Settings (Complete control)

{
    "startUrls": [
        { "url": "https://example.com/blog" },
        { "url": "https://example.com/sitemap.xml" }
    ],
    "maxCrawlPages": 1000,
    "maxDepth": 3,
    "includePatterns": ["*/blog/*", "*/docs/*"],
    "excludePatterns": ["*/admin/*", "*/login*"],
    "respectRobotsTxt": true,
    "requestDelay": 1000,
    "saveMarkdown": true,
    "saveHtml": false,
    "enableDeduplication": true,
    "embeddingProvider": "none",
    "chunkSize": 1024,
    "chunkOverlap": 200
}
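
How includePatterns and excludePatterns interact can be sketched as follows. The rule that excludes take precedence comes from the parameter reference in this README; everything else is an assumption for illustration, in particular the convention that a pattern wrapped in slashes is a regex and anything else is a glob, and the helper names `matches` and `should_crawl`.

```python
import re
from fnmatch import fnmatchcase


def matches(url: str, pattern: str) -> bool:
    """Match a URL against one pattern, treating /.../ as a regex and the rest as a glob."""
    # Assumption: slash-wrapped patterns such as /\/blog\// are regular expressions.
    if len(pattern) > 2 and pattern.startswith("/") and pattern.endswith("/"):
        return re.search(pattern[1:-1], url) is not None
    return fnmatchcase(url, pattern)


def should_crawl(url: str, include_patterns: list[str], exclude_patterns: list[str]) -> bool:
    """Apply excludes first, then includes; an empty include list allows everything."""
    if any(matches(url, pattern) for pattern in exclude_patterns):
        return False
    if include_patterns:
        return any(matches(url, pattern) for pattern in include_patterns)
    return True


include = ["*/blog/*", "*/docs/*"]
exclude = ["*/admin/*", "*/login*"]
for url in (
    "https://example.com/blog/post",
    "https://example.com/blog/admin/drafts",
    "https://example.com/about",
):
    print(url, should_crawl(url, include, exclude))
```

The second URL matches an include pattern but is still skipped, because the exclude check runs first.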

Parameter Reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| startUrls (required) | array of { url } | n/a | One or more HTTP(S) URLs to begin crawling. Sitemap URLs (sitemap.xml) are detected and expanded automatically. |
| maxCrawlPages | integer | 1000 | Hard cap on pages crawled. Crawl stops once reached. (1–1,000,000) |
| maxDepth | integer | 3 | Link levels to follow from start URLs. 0 crawls only the start URLs / sitemap entries, ideal for sitemap-driven ingestion. (0–10) |
| includePatterns | string[] | [] | Glob (*/blog/*) or regex (/\/blog\//) patterns. When set, only matching URLs are crawled. |
| excludePatterns | string[] | [] | Glob/regex patterns to skip. Excludes take precedence over includes. |
| respectRobotsTxt | boolean | true | Honor robots.txt. Disallowed URLs are skipped and counted in the final statistics. |
| requestDelay | integer | 1000 | Milliseconds to wait between consecutive requests to the same domain. Be polite. (100–60,000 ms) |
| saveMarkdown | boolean | true | Include a cleaned Markdown version of each page in the output. |
| saveHtml | boolean | false | Include the original raw HTML. Off by default to keep datasets lean. |
| enableDeduplication | boolean | true | Remove exact duplicates (identical SHA-256 hash) from the dataset. |
| embeddingProvider | enum | "none" | Semantic-dedup provider: "none", "bge" (on-device), or "openai" (API). See note below. |
| semanticDeduplicationThreshold | number | 0.85 | Cosine-similarity cutoff (0.0–1.0) above which two pages are treated as near-duplicates. |
| chunkSize | integer | 1024 | Target chunk size in characters for RAG splitting. (100–10,000 chars) |
| chunkOverlap | integer | 200 | Character overlap between consecutive chunks. Must be less than chunkSize. (0–9,999 chars) |
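
The ranges and defaults in the reference above can be checked locally before starting a run. The following is a sketch of such a pre-flight check, not the Actor's own validation; the `LIMITS` table and the `validate_input` helper are hypothetical names, and the numbers are copied from the parameter reference.

```python
# (minimum, maximum, default) for each numeric parameter in the reference.
LIMITS = {
    "maxCrawlPages": (1, 1_000_000, 1000),
    "maxDepth": (0, 10, 3),
    "requestDelay": (100, 60_000, 1000),
    "semanticDeduplicationThreshold": (0.0, 1.0, 0.85),
    "chunkSize": (100, 10_000, 1024),
    "chunkOverlap": (0, 9_999, 200),
}


def validate_input(actor_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    if not actor_input.get("startUrls"):
        problems.append("startUrls is required")
    values = {}
    for name, (low, high, default) in LIMITS.items():
        value = actor_input.get(name, default)  # fall back to the documented default
        values[name] = value
        if not low <= value <= high:
            problems.append(f"{name}={value} is outside {low}-{high}")
    if values["chunkOverlap"] >= values["chunkSize"]:
        problems.append("chunkOverlap must be strictly less than chunkSize")
    return problems


print(validate_input({"startUrls": [{"url": "https://docs.apify.com"}], "maxCrawlPages": 100}))
print(validate_input({"startUrls": [{"url": "https://docs.apify.com"}], "chunkSize": 200, "chunkOverlap": 200}))
```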

ℹ️ About embeddingProvider: The semantic (embedding-based) deduplication path is configurable now but currently stubbed — when a provider other than "none" is selected, the pipeline runs normally and emits similarityScore: null on each item until the embedding engine ships. Exact SHA-256 deduplication (controlled by enableDeduplication) is fully active regardless. Set the provider now to keep your runs future-ready.
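
Since the embedding engine is stubbed, the sketch below only illustrates what a cosine-similarity cutoff such as semanticDeduplicationThreshold means. It is not DomainForge's implementation: the toy two-dimensional vectors stand in for real embeddings, and `cosine_similarity` and `is_near_duplicate` are hypothetical helper names.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_near_duplicate(a: list[float], b: list[float], threshold: float = 0.85) -> bool:
    """Treat two pages as near-duplicates when their similarity is above the cutoff."""
    return cosine_similarity(a, b) > threshold


# Toy vectors standing in for page embeddings (assumption for illustration only).
original = [1.0, 0.0]
light_edit = [0.9, 0.1]
different = [1.0, 1.0]
print(is_near_duplicate(original, light_edit))  # similarity is about 0.99
print(is_near_duplicate(original, different))   # similarity is about 0.71
```

Raising the threshold towards 1.0 keeps more lightly edited pages; lowering it merges pages more aggressively.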


📊 Output Dataset Schema

Each crawled page is returned as one structured item in the default dataset, ready for database storage or model ingestion.

{
    "url": "https://example.com/blog/getting-started",
    "title": "Getting Started with AI Datasets",
    "markdown": "# Getting Started with AI Datasets\n\nHigh quality data is all you need...",
    "text": "Getting Started with AI Datasets High quality data is all you need...",
    "html": "<html><body>...</body></html>",
    "metadata": {
        "author": "Jane Doe",
        "publishDate": "2026-06-21T00:00:00.000Z",
        "language": "en",
        "wordCount": 745,
        "tokenCountApprox": 931,
        "crawledAt": "2026-06-21T12:00:00.000Z"
    },
    "chunks": [
        { "text": "High quality data is all you need to train custom LLMs...", "tokenCount": 256 }
    ],
    "dedupHash": "8f3b2a9ec1d74f6e0b9a2c5d8e1f4a7b3c9d0e2f5a8b1c4d7e0a3b6c9d2e5f8a",
    "similarityScore": null
}

Field Reference

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| url | string | | Fully-qualified URL of the crawled page. |
| title | string | | Page title extracted from content. |
| markdown | string | when saveMarkdown: true | Cleaned Markdown of the page content. |
| text | string | | Normalised plain text of the cleaned content. |
| html | string | when saveHtml: true | Original raw HTML of the page. |
| metadata | object | | Enriched metadata (see below). |
| metadata.author | string or null | | Author from meta tags, or null. |
| metadata.publishDate | string or null | | ISO date from meta tags / <time>, or null. |
| metadata.language | string | | Language code from <html lang> (e.g. en), or unknown. |
| metadata.wordCount | integer | | Word count of the plain text. |
| metadata.tokenCountApprox | integer | | Estimated token count (wordCount × 1.25). |
| metadata.crawledAt | string | | ISO 8601 timestamp when the page was processed. |
| chunks | array | | Overlapping text chunks for RAG ingestion. |
| chunks[].text | string | | Text content of the chunk. |
| chunks[].tokenCount | integer | | Approximate token count for the chunk. |
| dedupHash | string | | 64-char SHA-256 hash of normalised text. |
| similarityScore | number or null | | Cosine similarity for semantic dedup (null until the embedding engine ships). |
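
The token estimate is plain arithmetic on the word count, which the sketch below reproduces for the sample item (745 words, 931 approximate tokens). Splitting on whitespace and rounding to the nearest integer are assumptions; the README only states the wordCount × 1.25 factor.

```python
def word_count(text: str) -> int:
    """Count whitespace-separated words (assumed tokenisation)."""
    return len(text.split())


def token_count_approx(words: int) -> int:
    """Estimate tokens as wordCount x 1.25, rounded to the nearest integer (assumed rounding)."""
    return round(words * 1.25)


print(token_count_approx(745))  # 931, matching the sample item above
print(word_count("High quality data is all you need"))  # 7
```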

The Apify Console renders the dataset with three pre-built views: Overview, Content, and Chunks (RAG-ready). The Chunks view flattens chunks into one row each for direct vector-store ingestion.
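
If you work with the raw items instead of the Console views, a similar one-row-per-chunk shape can be produced locally. This is a sketch under assumptions: `flatten_chunks` is a hypothetical helper, and the `chunkIndex` field is added here for illustration and is not part of the documented schema.

```python
def flatten_chunks(items: list[dict]) -> list[dict]:
    """Turn page-level items into one row per chunk, carrying the page URL and title along."""
    rows = []
    for item in items:
        for index, chunk in enumerate(item.get("chunks", [])):
            rows.append({
                "url": item["url"],
                "title": item.get("title"),
                "chunkIndex": index,  # hypothetical field, not in the documented schema
                "text": chunk["text"],
                "tokenCount": chunk["tokenCount"],
            })
    return rows


items = [{
    "url": "https://example.com/blog/getting-started",
    "title": "Getting Started with AI Datasets",
    "chunks": [
        {"text": "High quality data is all you need...", "tokenCount": 256},
        {"text": "...to train custom LLMs.", "tokenCount": 120},
    ],
}]
rows = flatten_chunks(items)
print(len(rows), rows[1]["chunkIndex"])
```

Keeping the URL on every row lets a vector store return the source page alongside each retrieved chunk.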


📁 Example Configurations

Pre-built input presets for common scenarios live in the ./examples directory. Copy one into the Apify Console (or pass it as the Actor input) and adjust the URLs.

| Preset | Best for | Highlights |
| --- | --- | --- |
| ./examples/blog-crawl.json | Harvesting a blog | Path-scoped to /blog/, skips tag/author/pagination noise, exact-dedup on |
| ./examples/documentation-sitemap.json | Ingesting a whole docs site | Sitemap-driven; maxDepth 0 crawls only the listed pages, fast and complete |
| ./examples/rag-chunking.json | Building a RAG / vector corpus | Small 512-char chunks, tight overlap, markdown omitted for a lean dataset |
| ./examples/semantic-dedup.json | Near-duplicate news/PR corpus | Enables the bge embedding path + 0.85 threshold (future-ready) |

🛠️ Usage

Run on the Apify Console

Get started in seconds — no install required. Click Start with an input on the Actor's Apify Store page, paste a startUrls value, and hit Start.

Run from the CLI

# Run with the default input (set startUrls in the Console form)
apify call domainforge-llm-dataset-builder

# Run with a local input file, then fetch the dataset as JSONL (Hugging Face ready)
apify call domainforge-llm-dataset-builder --input-file examples/blog-crawl.json
apify datasets get-items <DATASET_ID> --format=jsonl > dataset.jsonl

Run programmatically (JavaScript / Node.js)

import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Start the Actor and wait for the run to finish
const run = await client.actor('domainforge-llm-dataset-builder').call({
    startUrls: [{ url: 'https://docs.apify.com' }],
    maxCrawlPages: 100,
    enableDeduplication: true,
    chunkSize: 1024,
    chunkOverlap: 200,
});

// Fetch the cleaned items from the run's default dataset
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`${items.length} pages forged`);

Run programmatically (Python)

from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the Actor and wait for the run to finish
run = client.actor("domainforge-llm-dataset-builder").call(run_input={
    "startUrls": [{"url": "https://docs.apify.com"}],
    "maxCrawlPages": 100,
    "enableDeduplication": True,
    "chunkSize": 1024,
    "chunkOverlap": 200,
})

# Fetch the cleaned items from the run's default dataset
dataset = client.dataset(run["defaultDatasetId"]).list_items()
print(f"{dataset.count} pages forged")

Load into Hugging Face

Download the dataset as JSONL from the Console (or with apify datasets get-items <DATASET_ID> --format=jsonl), then:

from datasets import load_dataset
ds = load_dataset("json", data_files="dataset.jsonl", split="train")
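
If the items were fetched programmatically rather than downloaded, a JSONL file for the loader above can be written with the standard library. This is a minimal sketch: `write_jsonl` and `read_jsonl` are hypothetical helpers, the sample item is made up, and the file goes to the system temp directory purely for the demonstration.

```python
import json
import tempfile
from pathlib import Path


def write_jsonl(items: list[dict], path: Path) -> int:
    """Write one JSON object per line and return the number of records written."""
    with open(path, "w", encoding="utf-8") as handle:
        for item in items:
            handle.write(json.dumps(item, ensure_ascii=False) + "\n")
    return len(items)


def read_jsonl(path: Path) -> list[dict]:
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


sample_items = [
    {"url": "https://example.com/a", "title": "A", "text": "First page"},
    {"url": "https://example.com/b", "title": "B", "text": "Second page"},
]
out_path = Path(tempfile.gettempdir()) / "domainforge-sample.jsonl"
written = write_jsonl(sample_items, out_path)
print(f"wrote {written} records to {out_path}")
```

The resulting path can then be passed as data_files to load_dataset in place of dataset.jsonl.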

💡 Quick Tips for Best Results

  • Sitemaps are your friend: Use the sitemap URL (e.g. https://example.com/sitemap.xml) to crawl an entire site instantly. Set maxDepth to 0 to only crawl pages listed in the sitemap.
  • Targeted Ingestion: Use includePatterns (like */docs/* or */help/*) to prevent wasting compute on irrelevant pages like contacts or terms of service.
  • Hugging Face Friendly: Download the dataset in JSONL format using the Apify API for native integration with Hugging Face pipelines.
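
To preview which pages a sitemap-driven run would cover, the sitemap can be parsed locally. The sketch below uses the standard sitemaps.org namespace; `sitemap_urls` is a hypothetical helper and the sample document is invented, so it says nothing about how DomainForge itself expands sitemaps.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text: str) -> list[str]:
    """Return every <loc> value found in a sitemap or sitemap index document."""
    root = ET.fromstring(xml_text.strip())
    return [loc.text.strip() for loc in root.findall(".//sm:loc", SITEMAP_NS) if loc.text]


sample_sitemap = """
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/docs/intro</loc></url>
  <url><loc>https://example.com/docs/setup</loc></url>
</urlset>
"""
urls = sitemap_urls(sample_sitemap)
print(urls)
```

Counting the returned URLs gives a quick lower bound for maxCrawlPages when maxDepth is 0.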

🩺 Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| Zero pages crawled / "Pages crawled: 0" | All start URLs blocked by robots.txt or outside includePatterns | Set respectRobotsTxt: false only if you have permission, or widen includePatterns. Check the run log for the count of robots-disallowed URLs. |
| Empty text / pages "skipped (empty/short)" | Readability couldn't find an article body (JS-rendered content, paywall, login wall) | DomainForge uses static HTTP (Cheerio) crawling. For heavily JS-rendered sites, pre-render the HTML or point at a server-rendered/AMP version. |
| Dataset missing markdown or html | saveMarkdown / saveHtml toggles | Set saveMarkdown: true (default on) and saveHtml: true (default off) in your input. |
| similarityScore is always null | Expected today: the embedding engine is not yet shipped | Exact SHA-256 dedup is active; semantic dedup is configured-but-stubbed. See the embeddingProvider note above. |
| chunkOverlap rejected | Overlap ≥ chunkSize | Lower chunkOverlap so it is strictly less than chunkSize. |
| Crawl hits maxCrawlPages too early | Default cap is 1000 | Raise maxCrawlPages. For very large sites, prefer sitemap mode (maxDepth: 0) for predictable, complete coverage. |
| Run times out or runs out of memory | Very large crawl + default resources | Defaults are 2048 MB / 3600 s. Raise memory in the Console run settings, and/or split the crawl across multiple runs by path. |
| Lots of duplicate-looking pages survive | Republished content with edits won't match an exact hash | Enable embeddingProvider so near-duplicates are flagged once the semantic engine ships; until then, deduplicate downstream on dedupHash and title. |
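
The downstream deduplication suggested in the last row could look like the sketch below. It treats two items as duplicates when they share a dedupHash or a normalised title, with title read from the top level of the item as in the output schema. The helper name `dedupe_downstream` and the title normalisation (lowercase, collapsed whitespace) are assumptions, and title matching can merge distinct pages that happen to share a title.

```python
def dedupe_downstream(items: list[dict]) -> list[dict]:
    """Keep the first item for each dedupHash and for each normalised title."""
    seen_hashes: set[str] = set()
    seen_titles: set[str] = set()
    kept = []
    for item in items:
        # Assumption: titles are compared case-insensitively with whitespace collapsed.
        title_key = " ".join((item.get("title") or "").lower().split())
        if item["dedupHash"] in seen_hashes or (title_key and title_key in seen_titles):
            continue
        seen_hashes.add(item["dedupHash"])
        if title_key:
            seen_titles.add(title_key)
        kept.append(item)
    return kept


items = [
    {"url": "https://example.com/news/launch", "title": "Product Launch", "dedupHash": "aaa"},
    {"url": "https://example.com/press/launch", "title": "product  launch", "dedupHash": "bbb"},
    {"url": "https://example.com/news/launch?ref=x", "title": "Launch (copy)", "dedupHash": "aaa"},
    {"url": "https://example.com/news/other", "title": "Other News", "dedupHash": "ccc"},
]
kept = dedupe_downstream(items)
print([item["url"] for item in kept])
```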

DomainForge is a general-purpose web crawler. You are responsible for using it lawfully and ethically.

  • Respect Terms of Service. Review the target site's ToS before crawling. Some sites prohibit automated access; honoring that is your responsibility.
  • Honor robots.txt. respectRobotsTxt is on by default. Keep it on unless you have explicit authorization, and even then slow down with requestDelay.
  • Rate-limit politely. Use a sensible requestDelay (default 1000 ms) to avoid overloading the target server. Treat shared infrastructure as you would want yours treated.
  • Personal data & GDPR/CCPA. Do not crawl, store, or republish personal data without a lawful basis. If your target contains PII, redact it before storage and avoid passing it into training pipelines.
  • Copyright & licensing. Crawled content may be copyrighted. Building a dataset for internal RAG is typically different from redistributing the text publicly — verify the license, attribute authors (metadata.author, metadata.publishDate are captured to help), and seek permission before publishing derived datasets.
  • Don't republish raw scraped content as if it were your own. Use output for legitimate training/retrieval purposes consistent with fair use and the source's license.

If you are unsure whether a use case is permitted, ask the data owner. Apify's Acceptable Use Policy applies to all Actors on the platform.
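
Before pointing a crawl at a site, you can check its robots.txt rules yourself with Python's standard library. This is a sketch for manual verification, not a description of how respectRobotsTxt is implemented; the robots.txt content and the ExampleBot user agent are made up for the example.

```python
from urllib.robotparser import RobotFileParser


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt content that has already been downloaded."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


# Invented robots.txt content for illustration.
robots_txt = """
User-agent: *
Disallow: /admin/
Crawl-delay: 2
"""
parser = build_parser(robots_txt)
user_agent = "ExampleBot"  # hypothetical user agent
print(parser.can_fetch(user_agent, "https://example.com/blog/post"))
print(parser.can_fetch(user_agent, "https://example.com/admin/users"))
print(parser.crawl_delay(user_agent))
```

If the site declares a Crawl-delay, setting requestDelay to at least that many milliseconds is a reasonable starting point.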


🔗 Try DomainForge Now

Get started in seconds! Run this Actor from the Apify CLI:

$ apify call domainforge-llm-dataset-builder

For custom builds, bug reports, or feature requests, feel free to check the project LICENSE and source code.