# AI Synthetic Data Generator (`ruv/ai-synthetic-data-generator`) Actor

Generate unlimited, high-quality synthetic data for training AI models, testing systems, and building robust agentic applications.

- **URL**: https://apify.com/ruv/ai-synthetic-data-generator.md
- **Developed by:** [Reuven Cohen](https://apify.com/ruv) (community)
- **Categories:** Agents, AI, Automation
- **Stats:** 2 total users, 1 monthly user, 100.0% runs succeeded, 1 bookmark
- **User rating**: No ratings yet

## Pricing

from $0.01 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action that can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server that can be used as a website, API, or an MCP server.
"Actor" is always written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows (PowerShell)
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

<p align="center">
  <img src="https://raw.githubusercontent.com/ruvnet/ruvector/main/assets/synth-logo.png" alt="Agentic Synth" width="140" height="140" />
</p>

<h1 align="center">Agentic Synth</h1>

<p align="center">
  <strong>Enterprise-Grade Simulation Engine with Self-Learning AI</strong>
</p>

<p align="center">
  <a href="https://apify.com/ruv/ai-synthetic-data-generator"><img src="https://img.shields.io/badge/Apify-Actor-FF9900?style=for-the-badge&logo=apify&logoColor=white" alt="Apify Actor"></a>
  <a href="https://github.com/ruvnet/ruvector"><img src="https://img.shields.io/badge/RuVector-Powered-4A90D9?style=for-the-badge" alt="RuVector"></a>
</p>

<p align="center">
  <img src="https://img.shields.io/badge/50K_records-215ms-brightgreen?style=flat-square" alt="50K in 215ms">
  <img src="https://img.shields.io/badge/232K_records/sec-purple?style=flat-square" alt="232K/sec">
  <img src="https://img.shields.io/badge/37_data_types-orange?style=flat-square" alt="37 Data Types">
  <img src="https://img.shields.io/badge/SONA_Self--Learning-blue?style=flat-square" alt="Self-Learning">
  <img src="https://img.shields.io/badge/version-3.0-green?style=flat-square" alt="Version">
</p>

---

### Overview

Agentic Synth is a self-learning simulation engine that generates realistic synthetic data at scale. Unlike static generators that produce random values, this engine learns from every run, extracting patterns from your data to improve quality over time. Generate 100 records in 1ms or 50,000 records in 215ms across **37 data types**.

**Self-Optimizing Neural Architecture (SONA)** powers the engine with three learning tiers:

| Tier | What It Does | Example |
|------|--------------|---------|
| **Instant** | Learns patterns during generation | "Electronics products cluster around $200-500" |
| **Background** | Trains on batch completion | "Bloomberg buy ratings correlate with sector performance" |
| **Deep** | Cross-session pattern retention | "Medical diagnoses improve ICD-10 code accuracy over time" |

The engine extracts data-type specific patterns: price distributions correlate with product categories, analyst recommendations match rating distributions, medical billing codes align with procedures, and supply chain lead times reflect regional logistics.

**Key Capabilities:**
- **150x faster** than JavaScript generators (Rust/WASM powered by RuVector)
- **5 embedding models** for semantic search (all-MiniLM-L6-v2, bge-small, all-mpnet, e5-small, gte-small)
- **Real brand matching** per category (Samsung for Electronics, Nike for Sports, LEGO for Toys)
- **Consistent data logic** (stock counts match availability, shipping prices match free flags)
- **Neural pattern training** per data type with EWC++ memory protection

For developers, it eliminates rate limits and captchas. For enterprises, it provides compliant test data without legal risks. For AI teams, it generates unlimited training data with semantic embeddings.

The simulation mode streams data in batches: for example, it can push 50 records every 2 seconds for real-time pipeline testing. Seeds ensure reproducible results for CI/CD. Agentic Synth pairs with [AI Memory Engine](https://apify.com/ruv/ai-memory-engine) for semantic search and RAG applications.
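
As a minimal sketch of the simulation mode described above, the helper below builds an input that streams 50 records every 2 seconds and estimates how long the stream takes. The field names come from the Actor input schema; the helper names, the example seed, and the assumption that a delay occurs only between consecutive batches (not after the last one) are illustrative and not confirmed by the Actor.

```python
import math


def simulation_input(data_type: str, count: int, batch_size: int = 50,
                     delay_ms: int = 2000, seed: str = "ci-run-v1") -> dict:
    """Build an Actor input that streams `count` records in timed batches."""
    return {
        "dataType": data_type,
        "count": count,
        "simulationMode": True,
        "batchSize": batch_size,
        "delayBetweenBatches": delay_ms,  # milliseconds
        "seed": seed,  # hypothetical seed value; any string works
    }


def estimated_stream_seconds(run_input: dict) -> float:
    """Estimate stream duration, assuming one delay between consecutive batches."""
    batches = math.ceil(run_input["count"] / run_input["batchSize"])
    return max(batches - 1, 0) * run_input["delayBetweenBatches"] / 1000


run_input = simulation_input("events", count=500)
print(estimated_stream_seconds(run_input))  # 10 batches, 9 delays of 2 s each
```

The resulting dictionary can be passed as `run_input` to the Python client call shown in the API Integration section.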

**Benchmarks:** 100 records in 1ms | 1,000 in 7ms | 10,000 in 53ms | 50,000 in 215ms (232K records/sec)

---

### What's New in v3.0

- **4 Tier-1 Premium APIs**: Bloomberg, ZoomInfo, FactSet, LSEG/Reuters clones ($70K+/year value)
- **5 Biosignal/Security**: EEG brainwaves, CGM glucose, SIEM logs, threat intel, NetFlow
- **5 Industrial/Scientific**: SCADA, LiDAR, CAN bus, genomic VCF, satellite imagery
- **5 Exotic/Research**: fMRI brain scans, protein PDB, power grid, AIS maritime, radar
- **Crunchbase Clone**: Real company data via Gemini Grounding API with web search
- **Memory Session Persistence**: Cross-session data sharing between Actors
- **37 total data types** covering web, finance, healthcare, security, industrial, and scientific domains

---

### 37 Data Types - Complete Reference

#### Core Web Data (10 types)
| Type | Description | Use Case |
|------|-------------|----------|
| `ecommerce` | Amazon/eBay style products, reviews, sellers | Scraper testing |
| `social` | Twitter/TikTok posts, likes, comments | Social dashboards |
| `jobs` | LinkedIn/Indeed listings, salaries | Job board testing |
| `real_estate` | Zillow properties, addresses, prices | Real estate apps |
| `search_results` | Google SERPs, snippets, rankings | SEO tools |
| `news` | Articles, authors, engagement | News aggregators |
| `api_response` | REST API mock responses, pagination | Backend mocking |
| `timeseries` | Time-stamped metrics, trends | IoT dashboards |
| `events` | Page views, clicks, form submissions | Analytics testing |
| `embeddings` | Vector data (384-768 dimensions) | ML/RAG training |

#### Tier 1: Ultra-Premium Financial APIs (4 types) - $70K+/year value
| Type | Real API Cost | What You Get |
|------|---------------|--------------|
| `bloomberg` | $24-32K/year | Full terminal data: quotes, fundamentals, analytics, news, consensus |
| `zoominfo` | $15K+/year | B2B contacts, technographics, intent signals, org charts |
| `factset` | $12K/year | Financial analytics, estimates, ownership, supply chain |
| `lseg` | $3.6-22K/year | Reuters news, M&A deals, ESG scores, analyst research |

#### Priority 1: Biosignal & Security (5 types)
| Type | Description | Real-World Application |
|------|-------------|------------------------|
| `eeg` | 5-band neural oscillations, 10-20 electrode system | BCI research, wellness apps |
| `cgm` | Continuous glucose with meal events, trends | Diabetes management ML |
| `siem` | Security events, MITRE ATT&CK, correlations | SOC training, SIEM testing |
| `threat_intel` | IOCs (IPs, domains, hashes), malware families | Threat detection ML |
| `netflow` | Network flows, 5-tuple, application detection | Network security analysis |

#### Priority 2: Industrial & Scientific (5 types)
| Type | Description | Real-World Application |
|------|-------------|------------------------|
| `scada` | PLC registers, process variables, OPC UA format | Digital twin development |
| `lidar` | 3D point clouds, object detection, bounding boxes | Autonomous vehicle ML |
| `canbus` | Vehicle ECU messages, DBC signals | Automotive development |
| `genomic_vcf` | Genetic variants, annotations, population frequencies | Bioinformatics pipelines |
| `satellite` | Multi-spectral bands, NDVI, cloud masks | Remote sensing analysis |

#### Priority 3: Exotic & Research (5 types)
| Type | Description | Real-World Application |
|------|-------------|------------------------|
| `fmri` | BOLD signal voxels, connectivity matrices | Neuroscience research |
| `protein_pdb` | Molecular 3D structures, binding sites | Drug discovery ML |
| `power_grid` | 3-phase electrical, PMU phasors, harmonics | Grid simulation |
| `ais` | Maritime ship tracking, collision risk | Logistics optimization |
| `radar` | Weather reflectivity, vehicle detection | Autonomous systems |

#### Enterprise & Healthcare (4 types)
| Type | Description | Use Case |
|------|-------------|----------|
| `medical` | Patient records, ICD-10, billing | EHR testing |
| `company` | Org structure, financials, leadership | CRM development |
| `supply_chain` | Shipments, inventory, logistics | SCM systems |
| `financial` | Transactions, accounts, fraud detection | Banking apps |

#### Utility Types (2 types)
| Type | Description | Use Case |
|------|-------------|----------|
| `structured` | Custom schema definition | Any specialized need |
| `demo` | Mix of all types | Quick exploration |

---

### Quick Start

#### Basic Usage
```json
{ "dataType": "demo", "count": 100 }
````

#### Premium Financial Data

```json
{ "dataType": "bloomberg", "count": 500 }
```

#### Biosignal Streaming

```json
{ "dataType": "eeg", "count": 1000 }
```

#### Security Operations

```json
{ "dataType": "siem", "count": 500 }
```

#### Industrial Telemetry

```json
{ "dataType": "scada", "count": 200 }
```

***

### Tutorials

#### Tutorial 1: Bloomberg Terminal Alternative

Generate synthetic Bloomberg-style financial data (a real terminal subscription costs roughly $24K/year):

```json
{
  "dataType": "bloomberg",
  "count": 1000,
  "seed": "financial-test-v1"
}
```

**Sample Output:**

```json
{
  "terminalId": "BBG1734012345678",
  "security": {
    "ticker": "AAPL",
    "name": "Apple Inc",
    "assetClass": "equity",
    "sector": "Technology",
    "exchange": "NASDAQ"
  },
  "pricing": {
    "last": 178.50,
    "bid": 178.45,
    "ask": 178.55,
    "volume": 45000000,
    "vwap": 177.82
  },
  "fundamentals": {
    "marketCap": "2.8T",
    "peRatio": 28.5,
    "eps": 6.26,
    "dividendYield": 0.52
  },
  "analytics": {
    "beta": 1.25,
    "volatility": 22.5,
    "sharpeRatio": 1.45
  },
  "consensus": {
    "recommendation": "buy",
    "targetPrice": 210.00,
    "numAnalysts": 45
  }
}
```
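
One way to sanity-check generated quotes before feeding them to a pipeline is sketched below. It assumes the record shape shown in the sample output; the `quote_checks` helper and its derived field names are illustrative and not part of the Actor.

```python
def quote_checks(record: dict) -> dict:
    """Derive spread metrics and basic sanity flags from one quote record."""
    pricing = record["pricing"]
    bid, ask, last = pricing["bid"], pricing["ask"], pricing["last"]
    mid = (bid + ask) / 2
    spread = ask - bid
    target = record["consensus"]["targetPrice"]
    return {
        "mid": round(mid, 4),
        "spread": round(spread, 4),
        "spread_bps": round(spread / mid * 10_000, 2),  # spread in basis points
        "last_inside_quote": bid <= last <= ask,
        "upside_pct": round((target / last - 1) * 100, 2),  # target vs last price
    }


# Values copied from the sample output above
sample = {
    "pricing": {"last": 178.50, "bid": 178.45, "ask": 178.55},
    "consensus": {"targetPrice": 210.00},
}
checks = quote_checks(sample)
print(checks)
```

Records where `last_inside_quote` is false would be worth inspecting before they are used as test fixtures.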

#### Tutorial 2: EEG Brainwave Data for BCI Research

Generate neural oscillation data for brain-computer interface development:

```json
{
  "dataType": "eeg",
  "count": 500,
  "seed": "bci-research-v1"
}
```

**Sample Output:**

```json
{
  "sessionId": "EEG_1734012345678",
  "samplingRate": 250,
  "channels": ["Fp1", "Fp2", "F3", "F4", "C3", "C4", "P3", "P4", "O1", "O2"],
  "epoch": {
    "startTime": "2024-12-14T10:30:00Z",
    "duration": 4000,
    "samples": 1000
  },
  "bands": {
    "delta": { "power": 15.2, "range": "0.5-4Hz" },
    "theta": { "power": 8.7, "range": "4-8Hz" },
    "alpha": { "power": 25.3, "range": "8-13Hz" },
    "beta": { "power": 12.1, "range": "13-30Hz" },
    "gamma": { "power": 5.8, "range": "30-100Hz" }
  },
  "mentalState": "focus",
  "quality": {
    "impedance": "good",
    "artifacts": ["blink_detected"],
    "signalQuality": 0.92
  }
}
```
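
A short sketch of how such a record might be post-processed: it computes each band's share of total power and checks that the epoch metadata is internally consistent. It assumes `duration` is expressed in milliseconds, which matches the sample (250 Hz over 4,000 ms gives 1,000 samples) but is not stated by the Actor; the helper names are hypothetical.

```python
def band_shares(bands: dict) -> dict:
    """Return each band's share of total power (shares sum to 1)."""
    total = sum(band["power"] for band in bands.values())
    return {name: band["power"] / total for name, band in bands.items()}


def epoch_is_consistent(sampling_rate_hz: int, epoch: dict) -> bool:
    """Check samples == samplingRate * duration, assuming duration is in ms."""
    return epoch["samples"] == sampling_rate_hz * epoch["duration"] // 1000


# Band powers and epoch values copied from the sample output above
bands = {
    "delta": {"power": 15.2},
    "theta": {"power": 8.7},
    "alpha": {"power": 25.3},
    "beta": {"power": 12.1},
    "gamma": {"power": 5.8},
}
shares = band_shares(bands)
dominant = max(shares, key=shares.get)
consistent = epoch_is_consistent(250, {"duration": 4000, "samples": 1000})
print(dominant, round(shares[dominant], 3), consistent)
```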

#### Tutorial 3: SIEM Security Logs for SOC Training

Generate realistic security event logs with MITRE ATT&CK mapping:

```json
{
  "dataType": "siem",
  "count": 1000,
  "seed": "soc-training-v1"
}
```

**Sample Output:**

```json
{
  "eventId": "SIEM_1734012345678",
  "timestamp": "2024-12-14T10:30:45.123Z",
  "source": "firewall",
  "eventType": "intrusion_attempt",
  "severity": "high",
  "riskScore": 85,
  "mitre": {
    "tactic": "Initial Access",
    "technique": "T1190",
    "techniqueName": "Exploit Public-Facing Application"
  },
  "network": {
    "srcIp": "185.234.xx.xx",
    "dstIp": "10.0.1.50",
    "srcPort": 45678,
    "dstPort": 443,
    "protocol": "TCP"
  },
  "enrichment": {
    "geoLocation": "Russia",
    "threatIntel": "known_scanner",
    "asn": "AS12345"
  },
  "incident": {
    "correlated": true,
    "incidentId": "INC-2024-1234",
    "attackChain": ["reconnaissance", "initial_access"]
  }
}
```
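
For SOC training exercises, a typical first step is to triage events and group them by tactic. The sketch below assumes the record shape shown in the sample output. Only the `high` severity value appears in that sample, so the other severity labels, their ordering, and the helper names are assumptions.

```python
from collections import Counter

# Assumed severity scale; only "high" is confirmed by the sample output
SEVERITY_ORDER = {"low": 0, "medium": 1, "high": 2, "critical": 3}


def triage(events: list, min_severity: str = "high", min_risk: int = 80) -> list:
    """Keep events at or above both the severity and risk-score thresholds."""
    floor = SEVERITY_ORDER[min_severity]
    return [
        event for event in events
        if SEVERITY_ORDER.get(event["severity"], -1) >= floor
        and event["riskScore"] >= min_risk
    ]


def count_by_tactic(events: list) -> Counter:
    """Count events per MITRE ATT&CK tactic."""
    return Counter(event["mitre"]["tactic"] for event in events)


# Hypothetical, trimmed-down events in the shape of the sample output
events = [
    {"eventId": "SIEM_1", "severity": "high", "riskScore": 85,
     "mitre": {"tactic": "Initial Access"}},
    {"eventId": "SIEM_2", "severity": "low", "riskScore": 20,
     "mitre": {"tactic": "Discovery"}},
    {"eventId": "SIEM_3", "severity": "critical", "riskScore": 95,
     "mitre": {"tactic": "Initial Access"}},
]
urgent = triage(events)
print(count_by_tactic(urgent))
```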

#### Tutorial 4: LiDAR Point Clouds for Autonomous Vehicles

Generate 3D point cloud data for perception system development:

```json
{
  "dataType": "lidar",
  "count": 100,
  "seed": "av-perception-v1"
}
```

**Sample Output:**

```json
{
  "frameId": "LIDAR_1734012345678",
  "timestamp": "2024-12-14T10:30:00.000Z",
  "sensor": {
    "type": "velodyne_vlp32",
    "scanPattern": "rotating",
    "horizontalFov": 360,
    "verticalFov": 40
  },
  "pointCloud": {
    "numPoints": 65536,
    "format": "XYZI",
    "points": [
      { "x": 10.5, "y": 2.3, "z": 0.8, "intensity": 45, "classification": "vehicle" },
      { "x": 15.2, "y": -1.1, "z": 1.2, "intensity": 78, "classification": "pedestrian" }
    ]
  },
  "detections": [
    {
      "objectId": "OBJ_001",
      "class": "vehicle",
      "confidence": 0.95,
      "boundingBox": { "x": 10.5, "y": 2.3, "z": 0.8, "length": 4.5, "width": 1.8, "height": 1.5 },
      "velocity": { "vx": 12.5, "vy": 0.1, "vz": 0 }
    }
  ]
}
```
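
The sketch below shows one way to reduce a detection to the two numbers a perception test usually asserts on: range from the sensor and speed. It assumes bounding-box coordinates are expressed in the sensor frame in metres and velocities in metres per second; the sample output does not state units, and `detection_summary` is a hypothetical helper.

```python
import math


def detection_summary(detection: dict) -> dict:
    """Summarize one detection: range from the sensor origin and speed."""
    box = detection["boundingBox"]
    vel = detection["velocity"]
    return {
        "objectId": detection["objectId"],
        "class": detection["class"],
        # Euclidean norms; units are assumed, see the note above
        "range": round(math.hypot(box["x"], box["y"], box["z"]), 2),
        "speed": round(math.hypot(vel["vx"], vel["vy"], vel["vz"]), 2),
    }


# Detection copied from the sample output above
detection = {
    "objectId": "OBJ_001",
    "class": "vehicle",
    "confidence": 0.95,
    "boundingBox": {"x": 10.5, "y": 2.3, "z": 0.8,
                    "length": 4.5, "width": 1.8, "height": 1.5},
    "velocity": {"vx": 12.5, "vy": 0.1, "vz": 0},
}
summary = detection_summary(detection)
print(summary)
```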

#### Tutorial 5: Threat Intelligence IOC Feeds

Generate malware IOCs and threat actor data for security ML:

```json
{
  "dataType": "threat_intel",
  "count": 500,
  "seed": "threat-ml-v1"
}
```

**Sample Output:**

```json
{
  "iocId": "IOC_1734012345678",
  "type": "ip",
  "value": "185.234.xx.xx",
  "threatType": "c2_server",
  "confidence": 95,
  "firstSeen": "2024-11-01T00:00:00Z",
  "lastSeen": "2024-12-14T10:30:00Z",
  "tlpMarking": "amber",
  "malwareFamily": "Cobalt Strike",
  "threatActor": {
    "name": "APT29",
    "aliases": ["Cozy Bear", "The Dukes"],
    "country": "RU",
    "motivation": "espionage"
  },
  "mitre": {
    "tactics": ["Command and Control"],
    "techniques": ["T1071.001"]
  },
  "actions": ["block", "alert", "investigate"],
  "sources": ["internal_sandbox", "osint_feed"]
}
```
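
A feed like this is often reduced to a blocklist before it reaches detection tooling. The following sketch assumes the record shape shown in the sample output; the confidence threshold, the `build_blocklist` helper, and the second example record are illustrative.

```python
def build_blocklist(iocs: list, min_confidence: int = 90) -> dict:
    """Group high-confidence IOC values by type when 'block' is a listed action."""
    blocklist = {}
    for ioc in iocs:
        if ioc["confidence"] >= min_confidence and "block" in ioc["actions"]:
            blocklist.setdefault(ioc["type"], set()).add(ioc["value"])
    return blocklist


iocs = [
    # Masked value copied from the sample output above
    {"type": "ip", "value": "185.234.xx.xx", "confidence": 95,
     "actions": ["block", "alert", "investigate"]},
    # Hypothetical low-confidence record that should be skipped
    {"type": "domain", "value": "example.invalid", "confidence": 60,
     "actions": ["alert"]},
]
blocklist = build_blocklist(iocs)
print(blocklist)
```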

#### Tutorial 6: Genomic Variant Data for Bioinformatics

Generate VCF-format genetic variant data:

```json
{
  "dataType": "genomic_vcf",
  "count": 1000,
  "seed": "genomics-v1"
}
```

**Sample Output:**

```json
{
  "variantId": "VAR_1734012345678",
  "chromosome": "chr17",
  "position": 7577120,
  "rsId": "rs28934578",
  "reference": "G",
  "alternate": "A",
  "quality": 99,
  "filter": "PASS",
  "genotype": "0/1",
  "annotations": {
    "gene": "TP53",
    "consequence": "missense_variant",
    "impact": "HIGH",
    "aminoAcidChange": "R248W"
  },
  "population": {
    "gnomAD_AF": 0.00001,
    "clinvar": "Pathogenic",
    "dbSNP": true
  },
  "clinical": {
    "significance": "pathogenic",
    "disease": "Li-Fraumeni syndrome",
    "inheritance": "AD"
  }
}
```
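
The Actor returns variants as JSON objects rather than VCF text. If a downstream tool expects VCF data lines, a conversion along the following lines may help. The column order follows the VCF fixed fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO, then FORMAT and one sample); the `GENE` and `CSQ` INFO keys are hypothetical names chosen for this sketch, and a complete VCF file would also need header lines declaring them.

```python
def to_vcf_line(variant: dict) -> str:
    """Render one variant record as a tab-separated VCF data line."""
    info = ";".join([
        f"GENE={variant['annotations']['gene']}",        # hypothetical INFO key
        f"CSQ={variant['annotations']['consequence']}",  # hypothetical INFO key
    ])
    fields = [
        variant["chromosome"], variant["position"], variant["rsId"],
        variant["reference"], variant["alternate"], variant["quality"],
        variant["filter"], info, "GT", variant["genotype"],
    ]
    return "\t".join(str(field) for field in fields)


# Values copied from the sample output above
variant = {
    "chromosome": "chr17", "position": 7577120, "rsId": "rs28934578",
    "reference": "G", "alternate": "A", "quality": 99, "filter": "PASS",
    "genotype": "0/1",
    "annotations": {"gene": "TP53", "consequence": "missense_variant"},
}
line = to_vcf_line(variant)
print(line)
```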

***

### Memory Session Persistence

v3.0 introduces cross-session memory for data accumulation and sharing between Actors:

```json
{
  "dataType": "bloomberg",
  "count": 1000,
  "memorySessionEnabled": true,
  "memorySessionId": "financial-data-2024",
  "appendToSession": true
}
```

**Benefits:**

- Accumulate data across multiple runs
- Share data between Agentic Synth and AI Memory Engine
- Build persistent datasets over time
- Enable cross-Actor workflows
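
To accumulate data across runs, each run needs the same `memorySessionId`. The sketch below prepares one input per run. Starting the session fresh on the first run and appending afterwards is a workflow choice made for this example, based on the `appendToSession` description in the input schema; the helper name is hypothetical.

```python
def session_inputs(data_type: str, session_id: str, counts: list) -> list:
    """Build one Actor input per run; the first replaces the session, the rest append."""
    return [
        {
            "dataType": data_type,
            "count": count,
            "memorySessionEnabled": True,
            "memorySessionId": session_id,
            "appendToSession": index > 0,  # False on the first run only
        }
        for index, count in enumerate(counts)
    ]


inputs = session_inputs("bloomberg", "financial-data-2024", [1000, 500, 500])
for run_input in inputs:
    print(run_input["count"], run_input["appendToSession"])
```

Each dictionary can then be passed to a separate Actor run, in order.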

***

### Self-Learning (SONA)

The Self-Optimizing Neural Architecture learns patterns from generated data:

```json
{
  "dataType": "bloomberg",
  "count": 1000,
  "sonaEnabled": true,
  "ewcLambda": 2000,
  "patternThreshold": 0.7
}
```

| Tier | What It Learns | Example |
|------|----------------|---------|
| **Instant** | Real-time patterns | "Tech stocks correlate with NASDAQ" |
| **Background** | Batch patterns | "Q4 retail volume increases 40%" |
| **Deep** | Cross-session | "Pharma P/E ratios range 15-25" |

#### Deep Training & Optimization

For production workloads, use swarm-orchestrated deep training to maximize pattern learning:

```json
{
  "dataType": "bloomberg",
  "count": 1000,
  "sonaEnabled": true,
  "ewcLambda": 2000,
  "patternThreshold": 0.7,
  "seed": "deep-training-financial-v1"
}
```

##### Optimization Strategies

| Strategy | Description | Best For | EWC Lambda |
|----------|-------------|----------|------------|
| **Rapid Learning** | Low protection, fast adaptation | New data types, exploration | 500-1000 |
| **Balanced** | Moderate protection, steady learning | General production use | 2000 |
| **Conservative** | High protection, stable patterns | Critical financial data | 5000+ |
| **Deep Training** | Extended runs with cross-session memory | Enterprise pattern libraries | 2000 + memory persistence |
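
The strategies in the table can be captured as a small lookup, sketched below. Where the table gives a range (500-1000) or an open bound (5000+), one representative value is picked; these picks, the strategy keys, and the `sona_input` helper are assumptions for illustration.

```python
from typing import Optional

# Representative values taken from the table above
EWC_LAMBDA_BY_STRATEGY = {
    "rapid": 1000,         # table: 500-1000
    "balanced": 2000,
    "conservative": 5000,  # table: 5000+
}


def sona_input(data_type: str, strategy: str = "balanced", count: int = 1000,
               session_id: Optional[str] = None) -> dict:
    """Build a SONA-enabled input; pass session_id for the deep-training setup."""
    if strategy not in EWC_LAMBDA_BY_STRATEGY:
        raise ValueError(f"unknown strategy: {strategy!r}")
    run_input = {
        "dataType": data_type,
        "count": count,
        "sonaEnabled": True,
        "ewcLambda": EWC_LAMBDA_BY_STRATEGY[strategy],
        "patternThreshold": 0.7,
    }
    if session_id is not None:
        # Deep training: balanced lambda plus memory persistence
        run_input.update(memorySessionEnabled=True,
                         memorySessionId=session_id,
                         appendToSession=True)
    return run_input


print(sona_input("bloomberg", "conservative"))
```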

##### Concurrent Training Results

| Configuration | Runs | Records | Patterns | Duration | Records/sec |
|---------------|------|---------|----------|----------|-------------|
| Single data type | 10 | 1,000 | ~100 | 12s | 83 |
| 5 types parallel | 50 | 5,000 | ~500 | 15s | 333 |
| 20 types parallel | 200 | 20,000 | ~2,000 | 45s | 444 |
| Full swarm (37 types) | 370 | 37,000 | ~3,700 | 90s | 411 |

##### Pattern Learning by Data Type

| Category | Data Types | Patterns/1K Records | Learning Focus |
|----------|------------|---------------------|----------------|
| **Financial** | bloomberg, factset, lseg | 150-200 | Price correlations, sector patterns |
| **Biosignal** | eeg, cgm, fmri | 100-150 | Waveform characteristics, temporal patterns |
| **Security** | siem, threat\_intel | 120-180 | Attack signatures, IOC relationships |
| **Industrial** | scada, lidar, canbus | 80-120 | Sensor correlations, anomaly patterns |
| **Scientific** | genomic\_vcf, protein\_pdb | 90-140 | Sequence patterns, structural motifs |

##### Swarm Training Command

Run deep training across several data types with concurrent execution. The loop below covers five types; extend the list to cover all 37:

```bash
# Using the Apify CLI with parallel execution: one background run per data type
for type in bloomberg eeg siem lidar genomic_vcf; do
  apify call ruv/ai-synthetic-data-generator -s \
    --input='{"dataType":"'"$type"'","count":100,"sonaEnabled":true,"ewcLambda":2000}' &
done
wait  # block until every background run has finished
```

##### Training Script (Node.js)

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: process.env.APIFY_TOKEN });
const DATA_TYPES = ['bloomberg', 'eeg', 'siem', 'lidar', 'genomic_vcf'];

// Run concurrent training batches
const results = await Promise.all(
  DATA_TYPES.map(type =>
    client.actor('ruv/ai-synthetic-data-generator').call({
      dataType: type,
      count: 100,
      sonaEnabled: true,
      ewcLambda: 2000
    })
  )
);
```

#### SONA Learning Benchmark Results

Comprehensive benchmarks measuring SONA's learning capabilities across multiple dimensions:

##### Quantitative Metrics

| Metric | Value | Description |
|--------|-------|-------------|
| **Generation Speed** | 232K records/sec | Peak throughput on Rust/WASM engine |
| **Pattern Detection Rate** | 10-20% | Patterns extracted per 1K records |
| **Learning Convergence** | 3-5 iterations | Iterations to stable pattern set |
| **Memory Retention** | 85-95% | Cross-session pattern preservation |
| **Cross-Domain Transfer** | 60-80% | Pattern applicability across types |

##### EWC Lambda Performance Matrix

| Lambda | Learning Speed | Memory Retention | Stability | Use Case |
|--------|---------------|------------------|-----------|----------|
| 500 | Very Fast | Low (40%) | Volatile | Rapid prototyping |
| 1000 | Fast | Medium (65%) | Moderate | Exploration |
| 2000 | Balanced | High (85%) | Stable | Production |
| 5000 | Slow | Very High (95%) | Very Stable | Critical data |

##### Data Type Learning Profiles

| Category | Types | Pattern Complexity | Learning Rate | Quality Score |
|----------|-------|-------------------|---------------|---------------|
| **Core Web** | 10 | Low-Medium | Fast (1-2 iter) | 90-95% |
| **Financial** | 6 | High | Medium (3-4 iter) | 85-92% |
| **Biosignal** | 3 | Very High | Slow (4-5 iter) | 82-88% |
| **Security** | 3 | High | Medium (3-4 iter) | 85-90% |
| **Industrial** | 3 | Medium-High | Medium (3 iter) | 87-92% |
| **Scientific** | 5 | Very High | Slow (4-5 iter) | 80-88% |
| **Exotic** | 4 | Very High | Slow (5 iter) | 78-85% |

##### Swarm Training Performance

| Topology | Agents | Throughput | Efficiency | Best For |
|----------|--------|------------|------------|----------|
| Sequential | 1 | 30 rec/s | 100% (baseline) | Small batches |
| Parallel (5) | 5 | 140 rec/s | 93% | Standard workloads |
| Parallel (10) | 10 | 260 rec/s | 87% | Large training |
| Parallel (20) | 20 | 440 rec/s | 73% | Deep training |
| Full Swarm (37) | 37 | 720 rec/s | 65% | Comprehensive |
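
The Efficiency column is consistent with throughput divided by ideal linear scaling of the sequential baseline (30 rec/s per agent). That reading is inferred from the numbers rather than stated by the author; the sketch below reproduces the column under that assumption.

```python
BASELINE_RATE = 30  # rec/s for one sequential agent, from the table above


def scaling_efficiency(agents: int, throughput: float,
                       baseline: float = BASELINE_RATE) -> float:
    """Measured throughput as a fraction of ideal linear scaling."""
    return throughput / (agents * baseline)


# (agents, throughput in rec/s) pairs copied from the table
rows = [(1, 30), (5, 140), (10, 260), (20, 440), (37, 720)]
percentages = [f"{scaling_efficiency(a, t):.0%}" for a, t in rows]
print(percentages)  # ['100%', '93%', '87%', '73%', '65%']
```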

##### Qualitative Learning Capabilities

**Pattern Recognition:**

- Price/value distributions by category
- Temporal correlations in time-series
- Hierarchical relationships in nested data
- Statistical distributions per field type

**Memory Features:**

- EWC++ (Elastic Weight Consolidation) prevents catastrophic forgetting
- Cross-session pattern persistence via Apify KeyValueStore
- Data-type specific pattern libraries
- Trajectory tracking for reward-based learning

**Adaptation Capabilities:**

- Real-time pattern adjustment during generation
- Domain transfer between similar data types
- Quality improvement over successive runs
- Anomaly detection for edge cases

##### Benchmark Methodology

Tests performed on Apify cloud infrastructure:

- **Hardware**: 4GB RAM containers
- **Build**: v3.0.4 with SONA enabled
- **Configuration**: EWC Lambda 2000, Pattern Threshold 0.7
- **Dataset**: 1,000 records per data type, 20 concurrent runs
- **Measurement**: Duration, patterns extracted, quality scores

***

### Performance

#### Benchmark Results (Rust/WASM Engine)

| Records | Time | Records/sec | Use Case |
|---------|------|-------------|----------|
| 100 | 1ms | 100,000 | Unit tests |
| 1,000 | 7ms | 142,857 | Integration tests |
| 10,000 | 53ms | 188,679 | Stress tests |
| 50,000 | 215ms | 232,558 | Load tests |

#### By Data Type Complexity

| Category | Example Type | 1K Records | Complexity |
|----------|--------------|------------|------------|
| Core | ecommerce | 7ms | Low |
| Premium | bloomberg | 15ms | High |
| Biosignal | eeg | 25ms | Very High |
| Scientific | lidar | 30ms | Very High |

***

### API Integration

#### Python

```python
from apify_client import ApifyClient

client = ApifyClient("your-api-token")
run = client.actor("ruv/ai-synthetic-data-generator").call(run_input={
    "dataType": "bloomberg",
    "count": 1000,
    "sonaEnabled": True
})
data = client.dataset(run["defaultDatasetId"]).list_items().items
```

#### JavaScript

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'your-api-token' });
const run = await client.actor('ruv/ai-synthetic-data-generator').call({
    dataType: 'siem',
    count: 500,
    sonaEnabled: true
});
const { items } = await client.dataset(run.defaultDatasetId).listItems();
```

#### cURL

```bash
curl -X POST "https://api.apify.com/v2/acts/ruv~ai-synthetic-data-generator/runs?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"dataType": "threat_intel", "count": 500}'
```

***

### Pricing

#### Core Data Types

| Event | Price | Description |
|-------|-------|-------------|
| E-commerce Record | $0.001 | Products, reviews |
| Social Media Post | $0.001 | Posts, engagement |
| Job/News/Real Estate | $0.001 | Listings |

#### Premium Data Types

| Event | Price | Description |
|-------|-------|-------------|
| Bloomberg Record | $0.005 | Full terminal data |
| ZoomInfo/FactSet/LSEG | $0.005 | Enterprise financial |
| SIEM/Threat Intel | $0.003 | Security data |
| EEG/CGM Biosignal | $0.003 | Medical streams |
| LiDAR/Satellite | $0.004 | Scientific data |

**Example Costs:**

- 1,000 Bloomberg records: ~$5.00 (vs $24K/year real Bloomberg)
- 500 SIEM events: ~$1.50 (vs $50K/year SIEM platform)
- 1,000 EEG epochs: ~$3.00 (vs $50K research equipment)
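
The example costs follow from multiplying the per-event price by the record count. A minimal estimator is sketched below; mapping the event names in the tables to `dataType` identifiers is an assumption, and `Decimal` is used to avoid floating-point rounding in currency arithmetic.

```python
from decimal import Decimal

# Per-record prices from the tables above, keyed by an assumed dataType mapping
PRICE_PER_RECORD = {
    "ecommerce": Decimal("0.001"),
    "social": Decimal("0.001"),
    "bloomberg": Decimal("0.005"),
    "siem": Decimal("0.003"),
    "eeg": Decimal("0.003"),
    "lidar": Decimal("0.004"),
}


def estimate_cost(data_type: str, count: int) -> Decimal:
    """Estimate the event charge in USD for `count` records of one type."""
    return PRICE_PER_RECORD[data_type] * count


for data_type, count in [("bloomberg", 1000), ("siem", 500), ("eeg", 1000)]:
    print(f"{count} {data_type} records: ${estimate_cost(data_type, count):.2f}")
```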

***

### Links

- [Agentic Synth on Apify](https://apify.com/ruv/ai-synthetic-data-generator)
- [AI Memory Engine](https://apify.com/ruv/ai-memory-engine) - Companion Actor for persistent AI memory
- [GitHub Repository](https://github.com/ruvnet/ruvector)
- [Report Issues](https://github.com/ruvnet/ruvector/issues)

***

**Built with [RuVector](https://github.com/ruvnet/ruvector).** Enterprise-grade synthetic data generation with 37 data types and SONA self-learning. Pairs with [AI Memory Engine](https://apify.com/ruv/ai-memory-engine) for complete AI data solutions.

# Actor input Schema

## `mode` (type: `string`):

Operation mode: generate synthetic data or integrate with other Apify Actors

## `dataType` (type: `string`):

What type of website or API do you want to simulate?

## `count` (type: `integer`):

How many records to generate (1-10,000). Start small to test, then scale up.

## `schema` (type: `object`):

Define your own data structure for 'structured' type. Example: {"url": "string (url)", "title": "string", "price": "number (10-500)"}
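
A minimal sketch of preparing an input for the `structured` type follows. The type strings are taken from the example above and the 1-10,000 limit from the `count` description; the helper name and the check for a non-empty schema are assumptions.

```python
from typing import Optional


def structured_input(schema: dict, count: int, seed: Optional[str] = None) -> dict:
    """Build an input for the 'structured' data type with basic validation."""
    if not 1 <= count <= 10_000:
        raise ValueError("count must be between 1 and 10,000")
    if not schema:
        raise ValueError("schema must define at least one field")
    run_input = {"dataType": "structured", "count": count, "schema": schema}
    if seed is not None:
        run_input["seed"] = seed
    return run_input


product_schema = {
    "url": "string (url)",
    "title": "string",
    "price": "number (10-500)",
}
print(structured_input(product_schema, count=50, seed="catalog-v1"))
```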

## `apiEndpoint` (type: `string`):

Base endpoint path for API simulation. Example: /api/products

## `eventTypes` (type: `array`):

Types of web events to generate

## `timeSeriesConfig` (type: `object`):

Settings for time-series data (prices, metrics, etc.)

## `integrateActorId` (type: `string`):

Actor ID to pull data from. 21 Actors supported. Used in 'integrate' mode.

## `integrateRunId` (type: `string`):

Specific run ID to pull data from, or 'latest' for most recent run

## `integrateDatasetId` (type: `string`):

Direct dataset ID to pull data from (alternative to Run ID)

## `memorizeFields` (type: `array`):

Which fields to extract for RAG/memory. Leave empty for defaults.

## `useTemplate` (type: `string`):

Pre-built template for common use cases. 12 templates available. Used in 'template' mode.

## `generateEmbeddings` (type: `boolean`):

Generate vector embeddings for all output records (useful for RAG systems)

## `useOnnxEmbeddings` (type: `boolean`):

Use real semantic embeddings via ONNX (slower but more accurate) vs random vectors (fast for testing)

## `embeddingModel` (type: `string`):

Choose embedding model. Smaller models are faster, larger models are more accurate.

## `embeddingDimensions` (type: `integer`):

Vector size for embeddings. 384 (fast), 768 (accurate), 1536 (OpenAI compatible)

## `provider` (type: `string`):

AI provider for enhanced generation. DeepSeek via OpenRouter is default (low cost). Works great without AI too!

## `model` (type: `string`):

AI model to use. DeepSeek is extremely low cost ($0.14/1M input, $0.28/1M output).

## `openrouterApiKey` (type: `string`):

Get your key at https://openrouter.ai/keys - Access DeepSeek, GPT-4, Claude, Llama via single API

## `geminiApiKey` (type: `string`):

Get your key at https://aistudio.google.com/apikey - Free tier available

## `anthropicApiKey` (type: `string`):

Get your key at https://console.anthropic.com - Direct Claude API access

## `quality` (type: `number`):

Quality level (0.1-1.0). Higher = more realistic but uses more AI tokens.

## `sonaEnabled` (type: `boolean`):

Enable TRM/SONA self-learning for intelligent pattern recognition in data generation. Learns from generation patterns to improve data quality.

## `ewcLambda` (type: `number`):

Elastic Weight Consolidation strength for pattern preservation. Higher values maintain more learned patterns across generations.

## `patternThreshold` (type: `number`):

Minimum confidence threshold for pattern recognition in data generation (0-1)

## `sonaLearningTiers` (type: `array`):

SONA learning tiers: instant (real-time), background (async optimization), deep (comprehensive analysis)

## `simulationMode` (type: `boolean`):

Enable simulation mode for testing scrapers that poll for updates. Data is pushed in batches with delays.

## `batchSize` (type: `integer`):

Number of records per batch in simulation mode

## `delayBetweenBatches` (type: `integer`):

Milliseconds to wait between batches in simulation mode. Use 1000 for 1 second delays.

## `seed` (type: `string`):

Set a seed for reproducible results. Same seed = same data every time.
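
One way to verify reproducibility in CI is to fingerprint the dataset items from two runs that used the same seed and compare the digests. The sketch below works on plain lists of records and makes no Actor calls; `dataset_fingerprint` is a hypothetical helper. If records carry run-specific fields such as generation timestamps, those would need to be removed first.

```python
import hashlib
import json


def dataset_fingerprint(items: list) -> str:
    """Return a SHA-256 digest of the items that ignores key order within records."""
    canonical = json.dumps(items, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical items standing in for two seeded runs
run_a = [{"id": 1, "price": 9.5}, {"id": 2, "price": 12.0}]
run_b = [{"price": 9.5, "id": 1}, {"price": 12.0, "id": 2}]
print(dataset_fingerprint(run_a) == dataset_fingerprint(run_b))
```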

## `outputFormat` (type: `string`):

Format for output data

## `webhookUrl` (type: `string`):

URL to POST results to when generation completes (for async workflows)

## `crunchbaseCompanies` (type: `array`):

Specific company names to research. Leave empty to auto-generate based on industry.

## `crunchbaseIndustry` (type: `string`):

Filter companies by industry (e.g., 'fintech', 'healthcare', 'AI/ML'). Leave empty for all industries.

## `memorySessionEnabled` (type: `boolean`):

Save generated data to a persistent memory session. Data persists across runs and can be shared with AI Memory Engine.

## `memorySessionId` (type: `string`):

Unique session identifier. Use the same ID across runs to accumulate data. Can be accessed by AI Memory Engine.

## `appendToSession` (type: `boolean`):

If enabled, new data is added to existing session. If disabled, session is replaced.

## Actor input object example

```json
{
  "mode": "generate",
  "dataType": "ecommerce",
  "count": 100,
  "schema": {
    "url": "string (url)",
    "title": "string",
    "price": "number (10-500)",
    "rating": "number (1-5)",
    "inStock": "boolean",
    "category": "string (Electronics, Clothing, Home, Sports)"
  },
  "apiEndpoint": "/api/products",
  "eventTypes": [
    "page_view",
    "click",
    "scroll",
    "form_submit",
    "api_call"
  ],
  "timeSeriesConfig": {
    "interval": "1h",
    "trend": "upward",
    "seasonality": true,
    "noise": 0.1,
    "startDate": "2024-01-01"
  },
  "integrateRunId": "latest",
  "memorizeFields": [],
  "generateEmbeddings": false,
  "useOnnxEmbeddings": true,
  "embeddingModel": "all-MiniLM-L6-v2",
  "embeddingDimensions": 384,
  "provider": "none",
  "model": "deepseek/deepseek-chat",
  "quality": 0.8,
  "sonaEnabled": true,
  "ewcLambda": 2000,
  "patternThreshold": 0.7,
  "sonaLearningTiers": [
    "instant",
    "background"
  ],
  "simulationMode": false,
  "batchSize": 100,
  "delayBetweenBatches": 0,
  "outputFormat": "json",
  "crunchbaseCompanies": [],
  "memorySessionEnabled": false,
  "appendToSession": true
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "schema": {
        "url": "string (url)",
        "title": "string",
        "price": "number (10-500)",
        "rating": "number (1-5)",
        "inStock": "boolean",
        "category": "string (Electronics, Clothing, Home, Sports)"
    },
    "eventTypes": [
        "page_view",
        "click",
        "scroll",
        "form_submit",
        "api_call"
    ],
    "timeSeriesConfig": {
        "interval": "1h",
        "trend": "upward",
        "seasonality": true,
        "noise": 0.1,
        "startDate": "2024-01-01"
    },
    "memorizeFields": [],
    "crunchbaseCompanies": []
};

// Run the Actor and wait for it to finish
const run = await client.actor("ruv/ai-synthetic-data-generator").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "count": 100,  # listed as required in the input schema
    "schema": {
        "url": "string (url)",
        "title": "string",
        "price": "number (10-500)",
        "rating": "number (1-5)",
        "inStock": "boolean",
        "category": "string (Electronics, Clothing, Home, Sports)",
    },
    "eventTypes": [
        "page_view",
        "click",
        "scroll",
        "form_submit",
        "api_call",
    ],
    "timeSeriesConfig": {
        "interval": "1h",
        "trend": "upward",
        "seasonality": True,
        "noise": 0.1,
        "startDate": "2024-01-01",
    },
    "memorizeFields": [],
    "crunchbaseCompanies": [],
}

# Run the Actor and wait for it to finish
run = client.actor("ruv/ai-synthetic-data-generator").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
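
The same run can also be started without the client library, through the `run-sync-get-dataset-items` endpoint listed in the OpenAPI specification. The following is a minimal sketch using only the Python standard library. The helper names `build_run_request` and `run_and_fetch_items` are hypothetical, the 300-second timeout is an arbitrary choice, and the sketch assumes the response body is JSON. The token is sent as the `token` query parameter, as the specification describes.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
# The API path uses "~" where the Actor ID uses "/".
ACTOR_PATH = "ruv~ai-synthetic-data-generator"


def build_run_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_fetch_items(token: str, run_input: dict, timeout: float = 300.0):
    """Send the request and decode the JSON body (performs a real API call)."""
    req = build_run_request(token, run_input)
    with urllib.request.urlopen(req, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Building the request does not contact the API; only run_and_fetch_items does.
request = build_run_request("<YOUR_API_TOKEN>", {"dataType": "ecommerce", "count": 10})
print(request.get_method(), request.full_url.split("?")[0])
```

Because the endpoint waits for the run to finish, it suits short runs; for longer ones, the `/runs` endpoint returns run information immediately.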

## CLI example

```bash
echo '{
  "count": 100,
  "schema": {
    "url": "string (url)",
    "title": "string",
    "price": "number (10-500)",
    "rating": "number (1-5)",
    "inStock": "boolean",
    "category": "string (Electronics, Clothing, Home, Sports)"
  },
  "eventTypes": [
    "page_view",
    "click",
    "scroll",
    "form_submit",
    "api_call"
  ],
  "timeSeriesConfig": {
    "interval": "1h",
    "trend": "upward",
    "seasonality": true,
    "noise": 0.1,
    "startDate": "2024-01-01"
  },
  "memorizeFields": [],
  "crunchbaseCompanies": []
}' |
apify call ruv/ai-synthetic-data-generator --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=ruv/ai-synthetic-data-generator",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```
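
To avoid pasting the token into the configuration by hand, the same structure can be generated from an environment variable. This is a sketch only: the helper `mcp_config` is hypothetical, the variable name `APIFY_TOKEN` is an assumption, and where the resulting JSON must be saved depends on the MCP client in use.

```python
import json
import os


def mcp_config(token: str, actor: str = "ruv/ai-synthetic-data-generator") -> dict:
    """Return the MCP client configuration with the token filled in."""
    return {
        "mcpServers": {
            "apify": {
                "command": "npx",
                "args": [
                    "mcp-remote",
                    f"https://mcp.apify.com/?tools={actor}",
                    "--header",
                    f"Authorization: Bearer {token}",
                ],
            }
        }
    }


# APIFY_TOKEN is an assumed variable name; the placeholder is kept if it is unset.
token = os.environ.get("APIFY_TOKEN", "<YOUR_API_TOKEN>")
print(json.dumps(mcp_config(token), indent=4))
```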

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Ai Synthetic Data Generator",
        "description": "Generate unlimited, high-quality synthetic data for training AI models, testing systems, and building robust agentic applications",
        "version": "3.0",
        "x-build-id": "NhFpT4KL2tq3gbN4g"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/ruv~ai-synthetic-data-generator/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-ruv-ai-synthetic-data-generator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/ruv~ai-synthetic-data-generator/runs": {
            "post": {
                "operationId": "runs-sync-ruv-ai-synthetic-data-generator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/ruv~ai-synthetic-data-generator/run-sync": {
            "post": {
                "operationId": "run-sync-ruv-ai-synthetic-data-generator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "count"
                ],
                "properties": {
                    "mode": {
                        "title": "Mode",
                        "enum": [
                            "generate",
                            "integrate",
                            "template"
                        ],
                        "type": "string",
                        "description": "Operation mode: generate synthetic data, integrate with other Apify Actors, or apply a use case template",
                        "default": "generate"
                    },
                    "dataType": {
                        "title": "Website/API Type",
                        "enum": [
                            "ecommerce",
                            "social",
                            "api_response",
                            "search_results",
                            "real_estate",
                            "jobs",
                            "news",
                            "stock_trading",
                            "medical",
                            "company",
                            "supply_chain",
                            "financial",
                            "bloomberg",
                            "zoominfo",
                            "factset",
                            "lseg",
                            "crunchbase",
                            "eeg",
                            "cgm",
                            "siem",
                            "threat_intel",
                            "netflow",
                            "scada",
                            "lidar",
                            "canbus",
                            "genomic_vcf",
                            "satellite",
                            "fmri",
                            "protein_pdb",
                            "power_grid",
                            "ais",
                            "radar",
                            "structured",
                            "timeseries",
                            "events",
                            "embeddings",
                            "demo"
                        ],
                        "type": "string",
                        "description": "What type of website or API do you want to simulate?",
                        "default": "ecommerce"
                    },
                    "count": {
                        "title": "Number of Records",
                        "minimum": 1,
                        "maximum": 10000,
                        "type": "integer",
                        "description": "How many records to generate (1-10,000). Start small to test, then scale up.",
                        "default": 100
                    },
                    "schema": {
                        "title": "Custom Data Schema",
                        "type": "object",
                        "description": "Define your own data structure for 'structured' type. Example: {\"url\": \"string (url)\", \"title\": \"string\", \"price\": \"number (10-500)\"}"
                    },
                    "apiEndpoint": {
                        "title": "API Endpoint (for API Response)",
                        "type": "string",
                        "description": "Base endpoint path for API simulation. Example: /api/products",
                        "default": "/api/products"
                    },
                    "eventTypes": {
                        "title": "Event Types (for Web Events)",
                        "type": "array",
                        "description": "Types of web events to generate",
                        "items": {
                            "type": "string"
                        }
                    },
                    "timeSeriesConfig": {
                        "title": "Time-Series Configuration",
                        "type": "object",
                        "description": "Settings for time-series data (prices, metrics, etc.)"
                    },
                    "integrateActorId": {
                        "title": "Apify Actor to Integrate",
                        "enum": [
                            "apify/google-maps-scraper",
                            "apify/google-search-scraper",
                            "apify/instagram-scraper",
                            "apify/tiktok-scraper",
                            "apify/youtube-scraper",
                            "apify/twitter-scraper",
                            "apify/amazon-scraper",
                            "apify/shopify-scraper",
                            "apify/web-scraper",
                            "apify/website-content-crawler",
                            "apify/cheerio-scraper",
                            "apify/news-scraper",
                            "apify/linkedin-scraper",
                            "trudax/tripadvisor-scraper",
                            "maxcopell/yelp-scraper",
                            "trudax/booking-scraper",
                            "petr_cermak/zillow-scraper",
                            "epctex/craigslist-scraper",
                            "apify/reddit-scraper",
                            "apify/facebook-posts-scraper",
                            "compass/google-places-api"
                        ],
                        "type": "string",
                        "description": "Actor ID to pull data from. 21 Actors supported. Used in 'integrate' mode."
                    },
                    "integrateRunId": {
                        "title": "Run ID",
                        "type": "string",
                        "description": "Specific run ID to pull data from, or 'latest' for most recent run",
                        "default": "latest"
                    },
                    "integrateDatasetId": {
                        "title": "Dataset ID (Alternative)",
                        "type": "string",
                        "description": "Direct dataset ID to pull data from (alternative to Run ID)"
                    },
                    "memorizeFields": {
                        "title": "Fields to Memorize",
                        "type": "array",
                        "description": "Which fields to extract for RAG/memory. Leave empty for defaults.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "useTemplate": {
                        "title": "Use Case Template",
                        "enum": [
                            "lead-intelligence",
                            "competitor-monitor",
                            "support-knowledge",
                            "research-assistant",
                            "content-library",
                            "product-catalog",
                            "review-aggregator",
                            "price-tracker",
                            "social-listening",
                            "talent-sourcing",
                            "real-estate-intel",
                            "travel-planner"
                        ],
                        "type": "string",
                        "description": "Pre-built template for common use cases. 12 templates available. Used in 'template' mode."
                    },
                    "generateEmbeddings": {
                        "title": "Generate Embeddings",
                        "type": "boolean",
                        "description": "Generate vector embeddings for all output records (useful for RAG systems)",
                        "default": false
                    },
                    "useOnnxEmbeddings": {
                        "title": "Use ONNX Semantic Embeddings",
                        "type": "boolean",
                        "description": "Use real semantic embeddings via ONNX (slower but more accurate) vs random vectors (fast for testing)",
                        "default": true
                    },
                    "embeddingModel": {
                        "title": "ONNX Embedding Model",
                        "enum": [
                            "all-MiniLM-L6-v2",
                            "bge-small-en-v1.5",
                            "all-mpnet-base-v2",
                            "e5-small-v2",
                            "gte-small"
                        ],
                        "type": "string",
                        "description": "Choose embedding model. Smaller models are faster, larger models are more accurate.",
                        "default": "all-MiniLM-L6-v2"
                    },
                    "embeddingDimensions": {
                        "title": "Embedding Dimensions",
                        "minimum": 32,
                        "maximum": 4096,
                        "type": "integer",
                        "description": "Vector size for embeddings. 384 (fast), 768 (accurate), 1536 (OpenAI compatible)",
                        "default": 384
                    },
                    "provider": {
                        "title": "AI Provider",
                        "enum": [
                            "none",
                            "openrouter",
                            "gemini",
                            "anthropic"
                        ],
                        "type": "string",
                        "description": "AI provider for enhanced generation. DeepSeek via OpenRouter is default (low cost). Works great without AI too!",
                        "default": "none"
                    },
                    "model": {
                        "title": "Model",
                        "enum": [
                            "deepseek/deepseek-chat",
                            "deepseek/deepseek-reasoner",
                            "meta-llama/llama-3.3-70b-instruct",
                            "mistralai/mistral-large-2411",
                            "openai/gpt-4o-mini",
                            "openai/gpt-4o",
                            "anthropic/claude-sonnet-4",
                            "anthropic/claude-opus-4",
                            "anthropic/claude-3.5-sonnet-20241022",
                            "anthropic/claude-3.5-haiku-20241022",
                            "gemini-2.0-flash",
                            "gemini-2.0-flash-exp",
                            "gemini-1.5-pro",
                            "gemini-1.5-flash"
                        ],
                        "type": "string",
                        "description": "AI model to use. DeepSeek is extremely low cost ($0.14/1M input, $0.28/1M output).",
                        "default": "deepseek/deepseek-chat"
                    },
                    "openrouterApiKey": {
                        "title": "OpenRouter API Key",
                        "type": "string",
                        "description": "Get your key at https://openrouter.ai/keys - Access DeepSeek, GPT-4, Claude, Llama via single API"
                    },
                    "geminiApiKey": {
                        "title": "Gemini API Key",
                        "type": "string",
                        "description": "Get your key at https://aistudio.google.com/apikey - Free tier available"
                    },
                    "anthropicApiKey": {
                        "title": "Anthropic API Key",
                        "type": "string",
                        "description": "Get your key at https://console.anthropic.com - Direct Claude API access"
                    },
                    "quality": {
                        "title": "Data Quality",
                        "minimum": 0.1,
                        "maximum": 1,
                        "type": "number",
                        "description": "Quality level (0.1-1.0). Higher = more realistic but uses more AI tokens.",
                        "default": 0.8
                    },
                    "sonaEnabled": {
                        "title": "Enable SONA Learning",
                        "type": "boolean",
                        "description": "Enable TRM/SONA self-learning for intelligent pattern recognition in data generation. Learns from generation patterns to improve data quality.",
                        "default": true
                    },
                    "ewcLambda": {
                        "title": "EWC Lambda",
                        "minimum": 100,
                        "maximum": 10000,
                        "type": "number",
                        "description": "Elastic Weight Consolidation strength for pattern preservation. Higher values maintain more learned patterns across generations.",
                        "default": 2000
                    },
                    "patternThreshold": {
                        "title": "Pattern Threshold",
                        "minimum": 0.1,
                        "maximum": 1,
                        "type": "number",
                        "description": "Minimum confidence threshold for pattern recognition in data generation (0.1-1)",
                        "default": 0.7
                    },
                    "sonaLearningTiers": {
                        "title": "Learning Tiers",
                        "type": "array",
                        "description": "SONA learning tiers: instant (real-time), background (async optimization), deep (comprehensive analysis)",
                        "default": [
                            "instant",
                            "background"
                        ]
                    },
                    "simulationMode": {
                        "title": "Long-Running Simulation",
                        "type": "boolean",
                        "description": "Enable simulation mode for testing scrapers that poll for updates. Data is pushed in batches with delays.",
                        "default": false
                    },
                    "batchSize": {
                        "title": "Batch Size (Simulation)",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Number of records per batch in simulation mode",
                        "default": 100
                    },
                    "delayBetweenBatches": {
                        "title": "Delay Between Batches (ms)",
                        "minimum": 0,
                        "maximum": 60000,
                        "type": "integer",
                        "description": "Milliseconds to wait between batches in simulation mode. Use 1000 for 1 second delays.",
                        "default": 0
                    },
                    "seed": {
                        "title": "Random Seed",
                        "type": "string",
                        "description": "Set a seed for reproducible results. Same seed = same data every time."
                    },
                    "outputFormat": {
                        "title": "Output Format",
                        "enum": [
                            "json",
                            "jsonl",
                            "csv"
                        ],
                        "type": "string",
                        "description": "Format for output data",
                        "default": "json"
                    },
                    "webhookUrl": {
                        "title": "Webhook URL",
                        "type": "string",
                        "description": "URL to POST results to when generation completes (for async workflows)"
                    },
                    "crunchbaseCompanies": {
                        "title": "Company Names (Crunchbase)",
                        "type": "array",
                        "description": "Specific company names to research. Leave empty to auto-generate based on industry.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "crunchbaseIndustry": {
                        "title": "Industry Filter (Crunchbase)",
                        "type": "string",
                        "description": "Filter companies by industry (e.g., 'fintech', 'healthcare', 'AI/ML'). Leave empty for all industries."
                    },
                    "memorySessionEnabled": {
                        "title": "Enable Memory Session",
                        "type": "boolean",
                        "description": "Save generated data to a persistent memory session. Data persists across runs and can be shared with AI Memory Engine.",
                        "default": false
                    },
                    "memorySessionId": {
                        "title": "Memory Session ID",
                        "type": "string",
                        "description": "Unique session identifier. Use the same ID across runs to accumulate data. Can be accessed by AI Memory Engine."
                    },
                    "appendToSession": {
                        "title": "Append to Existing Session",
                        "type": "boolean",
                        "description": "If enabled, new data is added to existing session. If disabled, session is replaced.",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "number",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "number",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
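
The run object described by the schema above reports cost twice: itemized per resource in `usageUsd`, and as a single figure in `usageTotalUsd`. The following is a minimal Python sketch of reading those two fields from an already-fetched run object, represented as a plain dictionary. The helper name `summarize_run_cost` and its `tolerance` parameter are hypothetical, and the idea that the itemized values add up to the total is an assumption drawn from the example values in the schema, not a documented guarantee. These fields describe platform usage; under pay-per-event pricing the amount you are charged is determined separately.

```python
from typing import Any, Mapping


def summarize_run_cost(run: Mapping[str, Any], tolerance: float = 1e-9) -> dict:
    """Sum the per-resource `usageUsd` values and compare with `usageTotalUsd`.

    Hypothetical helper: `run` is a dict shaped like the run object in the
    schema. Missing fields are treated as zero rather than raising.
    """
    usage_usd = run.get("usageUsd") or {}
    # Keep only the resources that actually incurred a cost.
    nonzero = {name: float(cost) for name, cost in usage_usd.items() if cost}
    itemized = sum(nonzero.values())
    reported = float(run.get("usageTotalUsd") or 0.0)
    return {
        "itemized_usd": itemized,
        "reported_usd": reported,
        "nonzero_items": nonzero,
        # Assumption: the total is expected to equal the itemized sum.
        "matches": abs(itemized - reported) <= tolerance,
    }


# Sample values copied from the `example` fields in the schema.
sample_run = {
    "usageTotalUsd": 0.00005,
    "usageUsd": {
        "ACTOR_COMPUTE_UNITS": 0,
        "KEY_VALUE_STORE_WRITES": 0.00005,
        "PROXY_SERPS": 0,
    },
}

summary = summarize_run_cost(sample_run)
print(summary)
```

Because the helper takes a plain mapping, it works on the parsed JSON response regardless of which client fetched the run.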
