# DomainForge LLM Dataset Builder (`actorify/domainforge-llm-dataset-builder`) Actor

Crawls websites and transforms web content into clean, structured datasets optimized for LLM fine-tuning, RAG applications, and knowledge base construction

- **URL**: https://apify.com/actorify/domainforge-llm-dataset-builder.md
- **Developed by:** [fanio zilla](https://apify.com/actorify) (community)
- **Categories:** AI, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

from $0.01 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases. In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours, and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store. In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server. Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects. You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready. The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash

# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).

# README

## DomainForge LLM Dataset Builder 🛠️

Turn Any Website into Production-Ready LLM Training & RAG Data with One Click.

---

### 💡 Stop Scraping, Start Training

AI Engineers and researchers spend **70-80% of their time** building web scrapers, stripping boilerplate HTML, deduplicating content, and chunking files. **DomainForge** does all of this for you automatically.

Whether you are fine-tuning a custom LLM, building a Retrieval-Augmented Generation (RAG) pipeline, or feeding a vector database, DomainForge crawls your target websites and processes them into clean, structured, and deduplicated datasets in a single run.

---

### Table of Contents

- [✨ Superpowers](#-superpowers)
- [🚀 How It Works](#-how-it-works)
- [🧩 Input Parameters](#-input-parameters)
- [📊 Output Dataset Schema](#-output-dataset-schema)
- [📁 Example Configurations](#-example-configurations)
- [🛠️ Usage](#-usage)
- [🩺 Troubleshooting](#-troubleshooting)
- [⚖️ Legal & Ethical Use](#-legal--ethical-use)
- [🔗 Try DomainForge Now](#-try-domainforge-now)

---

### ✨ Superpowers

* **🕷️ Smart Website Crawling**: Crawl recursively, target specific URL paths (using globs or regex), or ingest entire sites from their `sitemap.xml` files.
* **🧼 High-Fidelity Noise Removal**: Powered by Mozilla's Readability engine. We strip headers, footers, sidebars, cookie banners, navigation menus, and ads, leaving only the main content.
* **🧩 LLM-Optimized Auto-Chunking**: Automatically splits long articles into clean, overlapping chunks (customizable chunk sizes and overlaps) optimized for vector store embeddings.
* **⚡ Exact Deduplication**: Removes duplicate pages and identical content blocks using SHA-256 hashing, collapsing republished copies whose normalised text is identical into a single canonical record. Pages that differ by even a small edit hash differently and are kept. *(Semantic, embedding-based near-duplicate deduplication is configurable and on the roadmap; see [Input Parameters](#-input-parameters).)*
* **📂 Multi-Format Export**: Download your clean dataset as **JSONL (Hugging Face ready)**, **JSON**, **CSV**, or raw **Markdown files**.

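
To make the chunking and exact-deduplication bullets above concrete, here is a minimal sketch of how character-based overlapping chunks and SHA-256 content hashes could be computed. It is not the Actor's actual implementation: the function names (`normalise`, `dedup_hash`, `chunk_text`, `deduplicate`) and the whitespace-collapsing normalisation are assumptions made for illustration.

```python
import hashlib
import re


def normalise(text: str) -> str:
    """Collapse whitespace runs (assumed normalisation, not necessarily the Actor's)."""
    return re.sub(r"\s+", " ", text).strip()


def dedup_hash(text: str) -> str:
    """Return the 64-character SHA-256 hex digest of the normalised text."""
    return hashlib.sha256(normalise(text).encode("utf-8")).hexdigest()


def chunk_text(text: str, chunk_size: int = 1024, chunk_overlap: int = 200) -> list[str]:
    """Split text into character windows that share `chunk_overlap` characters."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and strictly less than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks


def deduplicate(pages: list[str]) -> list[str]:
    """Keep the first page seen for each distinct content hash."""
    seen = set()
    unique = []
    for page in pages:
        digest = dedup_hash(page)
        if digest not in seen:
            seen.add(digest)
            unique.append(page)
    return unique
```

With `chunk_size=4` and `chunk_overlap=1`, the string `"abcdefghij"` splits into `"abcd"`, `"defg"`, and `"ghij"`: each chunk starts with the last character of the previous one.
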

---

### 🚀 How It Works

DomainForge operates as a four-stage data refinery:

```mermaid
graph LR
    A[Raw URL / Sitemap] --> B[Smart Crawler]
    B --> C[Noise & Boilerplate Stripper]
    C --> D[SHA-256 Deduplication]
    D --> E[Semantic Chunking]
    E --> F[LLM-Ready Dataset]

    style A fill:#4dabf7,stroke:#228be6,stroke-width:2px,color:#fff
    style F fill:#37b24d,stroke:#2b8a3e,stroke-width:2px,color:#fff
    style B fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style C fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style D fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
    style E fill:#1c7ed6,stroke:#1971c2,stroke-width:1px,color:#fff
```

1. **Input a URL**: Just enter a start URL (e.g. `https://docs.yourcompany.com`) or a sitemap.
2. **Forge processes the data**: The Actor extracts metadata (author, publish date, language, counts), cleans the markup, deduplicates content, and splits it into overlapping chunks.
3. **Deploy**: Directly ingest the output into your vector databases (Pinecone, Chroma, Qdrant) or Hugging Face datasets.

***

### 🧩 Input Parameters

DomainForge is built for both developers and non-technical builders. You can configure it with a simple JSON input or run it through the Apify Console UI. Only `startUrls` is required; every other field has a sensible default.

#### Simple Configuration (Just the basics)

```json
{
  "startUrls": [{ "url": "https://docs.apify.com" }],
  "maxCrawlPages": 100
}
```

#### Full Enterprise Settings (Complete control)

```json
{
  "startUrls": [
    { "url": "https://example.com/blog" },
    { "url": "https://example.com/sitemap.xml" }
  ],
  "maxCrawlPages": 1000,
  "maxDepth": 3,
  "includePatterns": ["*/blog/*", "*/docs/*"],
  "excludePatterns": ["*/admin/*", "*/login*"],
  "respectRobotsTxt": true,
  "requestDelay": 1000,
  "saveMarkdown": true,
  "saveHtml": false,
  "enableDeduplication": true,
  "embeddingProvider": "none",
  "chunkSize": 1024,
  "chunkOverlap": 200
}
```

#### Parameter Reference

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| `startUrls` **\*required** | array of `{ url }` | - | One or more HTTP(S) URLs to begin crawling. Sitemap URLs (`sitemap.xml`) are detected and expanded automatically. |
| `maxCrawlPages` | integer | `1000` | Hard cap on pages crawled. Crawl stops once reached. (1–1,000,000) |
| `maxDepth` | integer | `3` | Link levels to follow from start URLs. `0` crawls **only** the start URLs / sitemap entries, which is ideal for sitemap-driven ingestion. (0–10) |
| `includePatterns` | string\[] | `[]` | Glob (`*/blog/*`) or regex (`/\/blog\//`) patterns. When set, **only** matching URLs are crawled. |
| `excludePatterns` | string\[] | `[]` | Glob/regex patterns to skip. **Excludes take precedence over includes.** |
| `respectRobotsTxt` | boolean | `true` | Honor `robots.txt`. Disallowed URLs are skipped and counted in the final statistics. |
| `linkEnqueueStrategy` | enum | `"same-domain"` | Which discovered links are added to the queue: `"same-domain"` restricts the crawl to the start URL domains, `"all"` also follows links to external websites (useful for news hub crawls). `"same-hostname"` is accepted as well. |
| `requestDelay` | integer | `1000` | Milliseconds to wait between consecutive requests to the same domain. Be polite. (100–60,000 ms) |
| `saveMarkdown` | boolean | `true` | Include a cleaned **Markdown** version of each page in the output. |
| `saveHtml` | boolean | `false` | Include the original raw **HTML**. Off by default to keep datasets lean. |
| `enableDeduplication` | boolean | `true` | Remove exact duplicates (identical SHA-256 hash) from the dataset. |
| `embeddingProvider` | enum | `"none"` | Semantic-dedup provider: `"none"`, `"bge"` (on-device), or `"openai"` (API). See note below. |
| `semanticDeduplicationThreshold` | number | `0.85` | Cosine-similarity cutoff (0.0–1.0) above which two pages are treated as near-duplicates. |
| `chunkSize` | integer | `1024` | Target chunk size in characters for RAG splitting. (100–10,000 chars) |
| `chunkOverlap` | integer | `200` | Character overlap between consecutive chunks. Must be less than `chunkSize`. (0–9,999 chars) |

> **ℹ️ About `embeddingProvider`:** The semantic (embedding-based) deduplication path is **configurable now but currently stubbed**. When a provider other than `"none"` is selected, the pipeline runs normally and emits `similarityScore: null` on each item until the embedding engine ships. Exact SHA-256 deduplication (controlled by `enableDeduplication`) is fully active regardless. Set the provider now to keep your runs future-ready.

***

### 📊 Output Dataset Schema

Each crawled page is returned as one structured item in the default dataset, ready for database storage or model ingestion.

```json
{
  "url": "https://example.com/blog/getting-started",
  "title": "Getting Started with AI Datasets",
  "markdown": "# Getting Started with AI Datasets\n\nHigh quality data is all you need...",
  "text": "Getting Started with AI Datasets High quality data is all you need...",
  "html": "...",
  "metadata": {
    "author": "Jane Doe",
    "publishDate": "2026-06-21T00:00:00.000Z",
    "language": "en",
    "wordCount": 745,
    "tokenCountApprox": 931,
    "crawledAt": "2026-06-21T12:00:00.000Z"
  },
  "chunks": [
    {
      "text": "High quality data is all you need to train custom LLMs...",
      "tokenCount": 256
    }
  ],
  "dedupHash": "8f3b2a9ec1d74f6e0b9a2c5d8e1f4a7b3c9d0e2f5a8b1c4d7e0a3b6c9d2e5f8a",
  "similarityScore": null
}
```

#### Field Reference

| Field | Type | Required | Description |
| --- | --- | --- | --- |
| `url` | string | ✅ | Fully-qualified URL of the crawled page. |
| `title` | string | ✅ | Page title extracted from content. |
| `markdown` | string | when `saveMarkdown: true` | Cleaned Markdown of the page content. |
| `text` | string | ✅ | Normalised plain text of the cleaned content. |
| `html` | string | when `saveHtml: true` | Original raw HTML of the page. |
| `metadata` | object | ✅ | Enriched metadata (see below). |
| `metadata.author` | string \| null | - | Author from meta tags, or `null`. |
| `metadata.publishDate` | string \| null | - | ISO date from meta tags or other page markup, or `null`. |
| `metadata.language` | string | - | Language code declared by the page (e.g. `en`), or `unknown`. |
| `metadata.wordCount` | integer | - | Word count of the plain text. |
| `metadata.tokenCountApprox` | integer | - | Estimated token count (`wordCount × 1.25`). |
| `metadata.crawledAt` | string | - | ISO 8601 timestamp when the page was processed. |
| `chunks` | array | ✅ | Overlapping text chunks for RAG ingestion. |
| `chunks[].text` | string | - | Text content of the chunk. |
| `chunks[].tokenCount` | integer | - | Approximate token count for the chunk. |
| `dedupHash` | string | ✅ | 64-char SHA-256 hash of normalised text. |
| `similarityScore` | number \| null | - | Cosine similarity for semantic dedup (`null` until the embedding engine ships). |

The Apify Console renders the dataset with three pre-built views: **Overview**, **Content**, and **Chunks (RAG-ready)**. The last one flattens chunks into one row each for direct vector-store ingestion.

***

### 📁 Example Configurations

Pre-built input presets for common scenarios live in the [`examples/`](./examples) directory. Copy one into the Apify Console (or pass it as the Actor input) and adjust the URLs.

| Preset | Best for | Highlights |
| --- | --- | --- |
| [`blog-crawl.json`](./examples/blog-crawl.json) | Harvesting a blog | Path-scoped to `/blog/`, skips tag/author/pagination noise, exact-dedup on |
| [`documentation-sitemap.json`](./examples/documentation-sitemap.json) | Ingesting a whole docs site | Sitemap-driven, `maxDepth 0` crawls only listed pages (fast and complete) |
| [`rag-chunking.json`](./examples/rag-chunking.json) | Building a RAG / vector corpus | Small 512-char chunks, tight overlap, markdown omitted for a lean dataset |
| [`semantic-dedup.json`](./examples/semantic-dedup.json) | Near-duplicate news/PR corpus | Enables the `bge` embedding path + 0.85 threshold (future-ready) |

***

### 🛠️ Usage

#### Run on the Apify Console

Get started in seconds, no install required. Click **Start with an input** on the Actor's Apify Store page, paste a `startUrls` value, and hit **Start**.

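
Once a run has finished, each dataset item follows the schema described in the Output Dataset Schema section. If you fetch items through the API instead of the Console's **Chunks (RAG-ready)** view, you can flatten them yourself. The sketch below shows one way to do that; the `flatten_chunks` helper and the `id` format (`dedupHash` plus chunk index) are hypothetical choices for illustration, not something the Actor emits.

```python
def flatten_chunks(items: list[dict]) -> list[dict]:
    """Turn page-level dataset items into one row per chunk."""
    rows = []
    for item in items:
        for index, chunk in enumerate(item.get("chunks", [])):
            rows.append({
                # Hypothetical row id: content hash plus chunk position.
                "id": f"{item['dedupHash']}-{index}",
                "url": item["url"],
                "title": item["title"],
                "text": chunk["text"],
                "tokenCount": chunk["tokenCount"],
            })
    return rows


# A made-up item with the same shape as the schema example.
sample_items = [
    {
        "url": "https://example.com/blog/getting-started",
        "title": "Getting Started with AI Datasets",
        "dedupHash": "abc123",
        "chunks": [
            {"text": "High quality data is all you need...", "tokenCount": 9},
            {"text": "...to train custom LLMs.", "tokenCount": 6},
        ],
    }
]

rows = flatten_chunks(sample_items)
```

Each row carries the source `url` and `title`, so a vector store can return attribution alongside the matched chunk.
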
#### Run from the CLI

```bash
# Run with the default input (set startUrls in the Console form)
apify call actorify/domainforge-llm-dataset-builder

# Run with a local input file and fetch the dataset as JSONL (Hugging Face ready)
apify call actorify/domainforge-llm-dataset-builder --input-file examples/blog-crawl.json
apify datasets get-items <DATASET_ID> --format=jsonl > dataset.jsonl
```

#### Run programmatically (JavaScript / Node.js)

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_APIFY_TOKEN' });

// Actors in the Store are addressed as "username/actor-name"
const run = await client.actor('actorify/domainforge-llm-dataset-builder').call({
    startUrls: [{ url: 'https://docs.apify.com' }],
    maxCrawlPages: 100,
    enableDeduplication: true,
    chunkSize: 1024,
    chunkOverlap: 200,
});

// Fetch the cleaned dataset items
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(`${items.length} pages forged`);
```

#### Run programmatically (Python)

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Actors in the Store are addressed as "username/actor-name"
run = client.actor("actorify/domainforge-llm-dataset-builder").call(run_input={
    "startUrls": [{"url": "https://docs.apify.com"}],
    "maxCrawlPages": 100,
    "enableDeduplication": True,
    "chunkSize": 1024,
    "chunkOverlap": 200,
})

dataset = client.dataset(run["defaultDatasetId"]).list_items()
print(f"{dataset.count} pages forged")
```

#### Load into Hugging Face

Download the dataset as **JSONL** from the Console (or with `apify datasets get-items <DATASET_ID> --format=jsonl`), then:

```python
from datasets import load_dataset

ds = load_dataset("json", data_files="dataset.jsonl", split="train")
```

***

### 💡 Quick Tips for Best Results

- **Sitemaps are your friend**: Use the sitemap URL (e.g. `https://example.com/sitemap.xml`) to cover an entire site without link discovery. Set `maxDepth` to `0` to only crawl pages listed in the sitemap.
- **Targeted Ingestion**: Use `includePatterns` (like `*/docs/*` or `*/help/*`) to avoid wasting compute on irrelevant pages such as contact or terms-of-service pages.
- **Hugging Face Friendly**: Download the dataset in `JSONL` format using the Apify API for native integration with Hugging Face pipelines.

***

### 🩺 Troubleshooting

| Symptom | Likely cause | Fix |
| --- | --- | --- |
| **Zero pages crawled / "Pages crawled: 0"** | All start URLs blocked by `robots.txt` or outside `includePatterns` | Set `respectRobotsTxt: false` **only** if you have permission, or widen `includePatterns`. Check the run log for the count of robots-disallowed URLs. |
| **Empty `text` / pages "skipped (empty/short)"** | Readability couldn't find an article body (JS-rendered content, paywall, login wall) | DomainForge uses static HTTP (Cheerio) crawling. For heavily JS-rendered sites, pre-render the HTML or point at a server-rendered/AMP version. |
| **Dataset missing `markdown` or `html`** | `saveMarkdown` / `saveHtml` toggles | Set `saveMarkdown: true` (default on) and `saveHtml: true` (default off) in your input. |
| **`similarityScore` is always `null`** | Expected today: the embedding engine is not yet shipped | Exact SHA-256 dedup is active; semantic dedup is configured-but-stubbed. See the `embeddingProvider` note above. |
| **`chunkOverlap` rejected** | Overlap ≥ `chunkSize` | Lower `chunkOverlap` so it is strictly less than `chunkSize`. |
| **Crawl hits `maxCrawlPages` too early** | Default cap is 1000 | Raise `maxCrawlPages`. For very large sites, prefer sitemap mode (`maxDepth: 0`) for predictable, complete coverage. |
| **Run times out or runs out of memory** | Very large crawl + default resources | Defaults are 2048 MB / 3600 s. Raise memory in the Console run settings, and/or split the crawl across multiple runs by path. |
| **Lots of duplicate-looking pages survive** | Republished content with edits won't match an exact hash | Enable `embeddingProvider` so near-duplicates are flagged once the semantic engine ships; until then, deduplicate downstream on `dedupHash` and `title`. |

***

### ⚖️ Legal & Ethical Use

DomainForge is a general-purpose web crawler. You are responsible for using it lawfully and ethically.

- **Respect Terms of Service.** Review the target site's ToS before crawling. Some sites prohibit automated access; honoring that is your responsibility.
- **Honor `robots.txt`.** `respectRobotsTxt` is on by default. Keep it on unless you have explicit authorization, and even then slow down with `requestDelay`.
- **Rate-limit politely.** Use a sensible `requestDelay` (default 1000 ms) to avoid overloading the target server. Treat shared infrastructure as you would want yours treated.
- **Personal data & GDPR/CCPA.** Do not crawl, store, or republish personal data without a lawful basis. If your target contains PII, redact it before storage and avoid passing it into training pipelines.
- **Copyright & licensing.** Crawled content may be copyrighted. Building a dataset for internal RAG is typically different from redistributing the text publicly: verify the license, attribute authors (`metadata.author`, `metadata.publishDate` are captured to help), and seek permission before publishing derived datasets.
- **Don't republish raw scraped content** as if it were your own. Use output for legitimate training/retrieval purposes consistent with fair use and the source's license.

If you are unsure whether a use case is permitted, ask the data owner. Apify's [Acceptable Use Policy](https://docs.apify.com/terms/policies) applies to all Actors on the platform.

***

### 🔗 Try DomainForge Now

Get started in seconds! Run this Actor directly on the **[Apify Console](https://apify.com/store)**:

```bash
apify call actorify/domainforge-llm-dataset-builder
```

*For custom builds, bug reports, or feature requests, feel free to check the project [License](LICENSE) and source code.*

# Actor input Schema

## `startUrls` (type: `array`):

One or more URLs to begin crawling. Each entry must be a valid HTTP or HTTPS URL.

## `maxCrawlPages` (type: `integer`):

Maximum number of pages to crawl during this run. Crawling stops when this limit is reached.

## `maxDepth` (type: `integer`):

Maximum number of link levels to follow from start URLs. Depth 0 means only the start URLs are crawled.

## `includePatterns` (type: `array`):

Glob or regex patterns for URLs to include. When non-empty, only matching URLs are crawled. Use `/regex/` syntax or glob patterns like `*/blog/*`.

## `excludePatterns` (type: `array`):

Glob or regex patterns for URLs to exclude. Exclude patterns take precedence over include patterns.

## `respectRobotsTxt` (type: `boolean`):

Follow robots.txt rules when crawling. Disallowed URLs are skipped and counted in crawl statistics.

## `linkEnqueueStrategy` (type: `string`):

Controls which links discovered on crawled pages are added to the queue. 'same-domain' restricts the crawl to the start URL domains. 'all' allows following links to external websites (useful for news hub crawls).

## `requestDelay` (type: `integer`):

Delay between consecutive requests to the same domain. Helps avoid overloading target servers.

## `saveMarkdown` (type: `boolean`):

Include Markdown formatted content in the output dataset item.

## `saveHtml` (type: `boolean`):

Include original HTML content in the output dataset item.

## `enableDeduplication` (type: `boolean`):

Remove exact duplicate content (identical SHA-256 hash) from the dataset.

## `embeddingProvider` (type: `string`):

Provider for generating embeddings used in semantic deduplication. Set to 'none' to disable semantic deduplication.

## `semanticDeduplicationThreshold` (type: `number`):

Cosine similarity threshold above which two pages are considered semantically duplicate (0.0-1.0).

## `chunkSize` (type: `integer`):

Target size for content chunks in characters. Used by the Chunking Engine for RAG applications.

## `chunkOverlap` (type: `integer`):

Number of characters to overlap between consecutive chunks. Must be less than chunkSize.

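
The constraints described above (numeric ranges, `chunkOverlap` below `chunkSize`, excludes taking precedence over includes) can be checked locally before starting a run. The following is a rough sketch under stated assumptions: `validate_input`, `matches`, and `should_crawl` are hypothetical helpers, the ranges are copied from this page's schema, and the glob matching uses Python's `fnmatch`, which may differ in detail from the Actor's own matcher.

```python
import re
from fnmatch import fnmatchcase

# Defaults and allowed ranges as documented in the input schema on this page.
DEFAULTS = {
    "maxCrawlPages": 1000,
    "maxDepth": 3,
    "requestDelay": 1000,
    "semanticDeduplicationThreshold": 0.85,
    "chunkSize": 1024,
    "chunkOverlap": 200,
}
RANGES = {
    "maxCrawlPages": (1, 1_000_000),
    "maxDepth": (0, 10),
    "requestDelay": (100, 60_000),
    "semanticDeduplicationThreshold": (0, 1),
    "chunkSize": (100, 10_000),
    "chunkOverlap": (0, 9_999),
}


def validate_input(run_input: dict) -> list[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    start_urls = run_input.get("startUrls") or []
    if not start_urls:
        problems.append("startUrls must contain at least one entry")
    for entry in start_urls:
        if not re.match(r"^https?://", entry.get("url", "")):
            problems.append(f"not an HTTP(S) URL: {entry.get('url')!r}")
    merged = {**DEFAULTS, **run_input}  # unspecified fields fall back to defaults
    for name, (low, high) in RANGES.items():
        if not low <= merged[name] <= high:
            problems.append(f"{name} must be between {low} and {high}")
    if merged["chunkOverlap"] >= merged["chunkSize"]:
        problems.append("chunkOverlap must be less than chunkSize")
    return problems


def matches(pattern: str, url: str) -> bool:
    """Treat /.../ as a regex and anything else as a glob (assumed semantics)."""
    if len(pattern) > 2 and pattern.startswith("/") and pattern.endswith("/"):
        return re.search(pattern[1:-1], url) is not None
    return fnmatchcase(url, pattern)


def should_crawl(url: str, include: list[str], exclude: list[str]) -> bool:
    """Apply exclude patterns first, then require an include match if any are set."""
    if any(matches(p, url) for p in exclude):
        return False  # excludes take precedence over includes
    return not include or any(matches(p, url) for p in include)
```

For example, `should_crawl("https://example.com/blog/admin/x", ["*/blog/*"], ["*/admin/*"])` returns `False`, because the exclude pattern wins even though the include pattern also matches.
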
## Actor input object example ```json { "startUrls": [ { "url": "https://docs.apify.com/academy/web-scraping-for-beginners" } ], "maxCrawlPages": 10, "maxDepth": 1, "includePatterns": [], "excludePatterns": [], "respectRobotsTxt": true, "linkEnqueueStrategy": "same-domain", "requestDelay": 300, "saveMarkdown": true, "saveHtml": false, "enableDeduplication": true, "embeddingProvider": "none", "semanticDeduplicationThreshold": 0.85, "chunkSize": 1024, "chunkOverlap": 200 } ``` # Actor output Schema ## `dataset` (type: `string`): Cleaned, deduplicated, and chunked content items ready for LLM fine-tuning, RAG, or knowledge bases # API You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup. ## JavaScript example ```javascript import { ApifyClient } from 'apify-client'; // Initialize the ApifyClient with your Apify API token // Replace the '' with your token const client = new ApifyClient({ token: '', }); // Prepare Actor input const input = { "startUrls": [ { "url": "https://docs.apify.com/academy/web-scraping-for-beginners" } ], "maxCrawlPages": 10, "maxDepth": 1, "requestDelay": 300 }; // Run the Actor and wait for it to finish const run = await client.actor("actorify/domainforge-llm-dataset-builder").call(input); // Fetch and print Actor results from the run's dataset (if any) console.log('Results from dataset'); console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`); const { items } = await client.dataset(run.defaultDatasetId).listItems(); items.forEach((item) => { console.dir(item); }); // 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs ``` ## Python example ```python from apify_client import ApifyClient # Initialize the ApifyClient with your Apify API token # Replace '' with your token. 
client = ApifyClient("") # Prepare the Actor input run_input = { "startUrls": [{ "url": "https://docs.apify.com/academy/web-scraping-for-beginners" }], "maxCrawlPages": 10, "maxDepth": 1, "requestDelay": 300, } # Run the Actor and wait for it to finish run = client.actor("actorify/domainforge-llm-dataset-builder").call(run_input=run_input) # Fetch and print Actor results from the run's dataset (if there are any) print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"]) for item in client.dataset(run["defaultDatasetId"]).iterate_items(): print(item) # 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start ``` ## CLI example ```bash echo '{ "startUrls": [ { "url": "https://docs.apify.com/academy/web-scraping-for-beginners" } ], "maxCrawlPages": 10, "maxDepth": 1, "requestDelay": 300 }' | apify call actorify/domainforge-llm-dataset-builder --silent --output-dataset ``` ## MCP server setup ```json { "mcpServers": { "apify": { "command": "npx", "args": [ "mcp-remote", "https://mcp.apify.com/?tools=actorify/domainforge-llm-dataset-builder", "--header", "Authorization: Bearer " ] } } } ``` ## OpenAPI specification ```json { "openapi": "3.0.1", "info": { "title": "DomainForge LLM Dataset Builder", "description": "Crawls websites and transforms web content into clean, structured datasets optimized for LLM fine-tuning, RAG applications, and knowledge base construction", "version": "1.0", "x-build-id": "dUzaKA5csbsEb1SCV" }, "servers": [ { "url": "https://api.apify.com/v2" } ], "paths": { "/acts/actorify~domainforge-llm-dataset-builder/run-sync-get-dataset-items": { "post": { "operationId": "run-sync-get-dataset-items-actorify-domainforge-llm-dataset-builder", "x-openai-isConsequential": false, "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": { 
"schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": "string" }, "description": "Enter your Apify token here" } ], "responses": { "200": { "description": "OK" } } } }, "/acts/actorify~domainforge-llm-dataset-builder/runs": { "post": { "operationId": "runs-sync-actorify-domainforge-llm-dataset-builder", "x-openai-isConsequential": false, "summary": "Executes an Actor and returns information about the initiated run in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": { "schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": "string" }, "description": "Enter your Apify token here" } ], "responses": { "200": { "description": "OK", "content": { "application/json": { "schema": { "$ref": "#/components/schemas/runsResponseSchema" } } } } } } }, "/acts/actorify~domainforge-llm-dataset-builder/run-sync": { "post": { "operationId": "run-sync-actorify-domainforge-llm-dataset-builder", "x-openai-isConsequential": false, "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.", "tags": [ "Run Actor" ], "requestBody": { "required": true, "content": { "application/json": { "schema": { "$ref": "#/components/schemas/inputSchema" } } } }, "parameters": [ { "name": "token", "in": "query", "required": true, "schema": { "type": "string" }, "description": "Enter your Apify token here" } ], "responses": { "200": { "description": "OK" } } } } }, "components": { "schemas": { "inputSchema": { "type": "object", "required": [ "startUrls" ], "properties": { "startUrls": { "title": "Start URLs", "minItems": 1, "type": "array", "description": "One or more URLs to begin crawling. 
Each entry must be a valid HTTP or HTTPS URL.", "items": { "type": "object", "required": [ "url" ], "properties": { "url": { "type": "string", "title": "URL", "description": "A valid HTTP or HTTPS URL to crawl.", "pattern": "^https?://" } }, "additionalProperties": false } }, "maxCrawlPages": { "title": "Maximum pages to crawl", "minimum": 1, "maximum": 1000000, "type": "integer", "description": "Maximum number of pages to crawl during this run. Crawling stops when this limit is reached.", "default": 1000 }, "maxDepth": { "title": "Maximum crawl depth", "minimum": 0, "maximum": 10, "type": "integer", "description": "Maximum number of link levels to follow from start URLs. Depth 0 means only the start URLs are crawled.", "default": 3 }, "includePatterns": { "title": "URL include patterns", "type": "array", "description": "Glob or regex patterns for URLs to include. When non-empty, only matching URLs are crawled. Use /regex/ syntax or glob patterns like */blog/*.", "items": { "type": "string", "minLength": 1 }, "default": [] }, "excludePatterns": { "title": "URL exclude patterns", "type": "array", "description": "Glob or regex patterns for URLs to exclude. Exclude patterns take precedence over include patterns.", "items": { "type": "string", "minLength": 1 }, "default": [] }, "respectRobotsTxt": { "title": "Respect robots.txt", "type": "boolean", "description": "Follow robots.txt rules when crawling. Disallowed URLs are skipped and counted in crawl statistics.", "default": true }, "linkEnqueueStrategy": { "title": "Link enqueueing strategy", "enum": [ "same-domain", "same-hostname", "all" ], "type": "string", "description": "Controls which links discovered on crawled pages are added to the queue. 'same-domain' restricts the crawl to the start URL domains. 
'all' allows following links to external websites (useful for news hub crawls).", "default": "same-domain" }, "requestDelay": { "title": "Request delay (milliseconds)", "minimum": 100, "maximum": 60000, "type": "integer", "description": "Delay between consecutive requests to the same domain. Helps avoid overloading target servers.", "default": 1000 }, "saveMarkdown": { "title": "Save Markdown", "type": "boolean", "description": "Include Markdown formatted content in the output dataset item.", "default": true }, "saveHtml": { "title": "Save HTML", "type": "boolean", "description": "Include original HTML content in the output dataset item.", "default": false }, "enableDeduplication": { "title": "Enable deduplication", "type": "boolean", "description": "Remove exact duplicate content (identical SHA-256 hash) from the dataset.", "default": true }, "embeddingProvider": { "title": "Embedding provider for semantic deduplication", "enum": [ "none", "bge", "openai" ], "type": "string", "description": "Provider for generating embeddings used in semantic deduplication. Set to 'none' to disable semantic deduplication.", "default": "none" }, "semanticDeduplicationThreshold": { "title": "Semantic deduplication threshold", "minimum": 0, "maximum": 1, "type": "number", "description": "Cosine similarity threshold above which two pages are considered semantically duplicate (0.0-1.0).", "default": 0.85 }, "chunkSize": { "title": "Chunk size (characters)", "minimum": 100, "maximum": 10000, "type": "integer", "description": "Target size for content chunks in characters. Used by the Chunking Engine for RAG applications.", "default": 1024 }, "chunkOverlap": { "title": "Chunk overlap (characters)", "minimum": 0, "maximum": 9999, "type": "integer", "description": "Number of characters to overlap between consecutive chunks. 
Must be less than chunkSize.", "default": 200 } } }, "runsResponseSchema": { "type": "object", "properties": { "data": { "type": "object", "properties": { "id": { "type": "string" }, "actId": { "type": "string" }, "userId": { "type": "string" }, "startedAt": { "type": "string", "format": "date-time", "example": "2025-01-08T00:00:00.000Z" }, "finishedAt": { "type": "string", "format": "date-time", "example": "2025-01-08T00:00:00.000Z" }, "status": { "type": "string", "example": "READY" }, "meta": { "type": "object", "properties": { "origin": { "type": "string", "example": "API" }, "userAgent": { "type": "string" } } }, "stats": { "type": "object", "properties": { "inputBodyLen": { "type": "integer", "example": 2000 }, "rebootCount": { "type": "integer", "example": 0 }, "restartCount": { "type": "integer", "example": 0 }, "resurrectCount": { "type": "integer", "example": 0 }, "computeUnits": { "type": "integer", "example": 0 } } }, "options": { "type": "object", "properties": { "build": { "type": "string", "example": "latest" }, "timeoutSecs": { "type": "integer", "example": 300 }, "memoryMbytes": { "type": "integer", "example": 1024 }, "diskMbytes": { "type": "integer", "example": 2048 } } }, "buildId": { "type": "string" }, "defaultKeyValueStoreId": { "type": "string" }, "defaultDatasetId": { "type": "string" }, "defaultRequestQueueId": { "type": "string" }, "buildNumber": { "type": "string", "example": "1.0.0" }, "containerUrl": { "type": "string" }, "usage": { "type": "object", "properties": { "ACTOR_COMPUTE_UNITS": { "type": "integer", "example": 0 }, "DATASET_READS": { "type": "integer", "example": 0 }, "DATASET_WRITES": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_READS": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_WRITES": { "type": "integer", "example": 1 }, "KEY_VALUE_STORE_LISTS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_READS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_WRITES": { "type": "integer", "example": 0 }, 
"DATA_TRANSFER_INTERNAL_GBYTES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_EXTERNAL_GBYTES": { "type": "integer", "example": 0 }, "PROXY_RESIDENTIAL_TRANSFER_GBYTES": { "type": "integer", "example": 0 }, "PROXY_SERPS": { "type": "integer", "example": 0 } } }, "usageTotalUsd": { "type": "number", "example": 0.00005 }, "usageUsd": { "type": "object", "properties": { "ACTOR_COMPUTE_UNITS": { "type": "integer", "example": 0 }, "DATASET_READS": { "type": "integer", "example": 0 }, "DATASET_WRITES": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_READS": { "type": "integer", "example": 0 }, "KEY_VALUE_STORE_WRITES": { "type": "number", "example": 0.00005 }, "KEY_VALUE_STORE_LISTS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_READS": { "type": "integer", "example": 0 }, "REQUEST_QUEUE_WRITES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_INTERNAL_GBYTES": { "type": "integer", "example": 0 }, "DATA_TRANSFER_EXTERNAL_GBYTES": { "type": "integer", "example": 0 }, "PROXY_RESIDENTIAL_TRANSFER_GBYTES": { "type": "integer", "example": 0 }, "PROXY_SERPS": { "type": "integer", "example": 0 } } } } } } } } } } ```