# ScraperCodeGenerator (`ohlava/scrapercodegenerator`) Actor

An intelligent web scraping tool that automatically generates custom scraping code for any website.

- **URL**: https://apify.com/ohlava/scrapercodegenerator.md
- **Developed by:** [Ondřej Hlava](https://apify.com/ohlava) (community)
- **Categories:** Automation, Other, Open source
- **Stats:** 21 total users, 0 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use; you pay only for Apify platform usage, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 🧠 AI-Powered Web Scraper & Code Generator

**Stop writing scraping code manually!** This intelligent Actor doesn't just scrape websites: it **automatically generates custom Python scraping code** tailored to your specific needs.

You get both the *extracted data* AND the *code* to replicate it anytime.

### 🚀 What This Actor Does

The actor will automatically:

- **Test multiple scraping methods**: Runs multiple scraping strategies (Cheerio, Web Scraper, Website Content Crawler, Playwright, etc.) **in parallel** for faster results
- **Evaluate which works best using AI**: Claude AI analyzes each result and selects the best extraction
- **Extract your requested data**: Automatically structures the extracted data based on your requirements
- **🔥 Generate custom Python code that scrapes YOUR website**: Creates personalized Python scraping code that you can run independently
- **Provide the code as a downloadable script you can run anywhere**: Complete, ready-to-use BeautifulSoup script saved to key-value store

### ✨ Key Benefits

- **No Technical Knowledge Required**: Just describe what data you want in plain English
- **Resilient Scraping**: Multiple strategies ensure success even if one method fails
- **AI-Powered**: Uses Claude AI to understand content context and select optimal results
- **🎯 Custom Code Generation**: Get personalized Python code that scrapes YOUR specific website
- **Production Ready**: Generated code is clean, documented, and ready to run independently
- **Reusable**: Use the generated code in your own projects, scripts, or applications

### 📊 Output Data

The Actor saves comprehensive results to your default dataset and saves the generated script to the key-value store.

> **💡 How to Access**: After the Actor finishes, go to the "Key-value store" tab in your run details and download the `GENERATED_SCRIPT` file. Rename it so that it has the `.py` extension.
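
The same download can be scripted. The sketch below assumes the script is stored under the `GENERATED_SCRIPT` key in the run's default key-value store, as described above, and that `run` is the dictionary returned by the Python client's `call()`. The helper names and the default file name are illustrative, not part of the Actor.

```python
from pathlib import Path


def save_generated_script(record, path="generated_scraper.py"):
    """Write a key-value store record's value to a .py file and return the path."""
    if record is None:
        raise ValueError("GENERATED_SCRIPT record not found in the key-value store")
    value = record["value"]
    # The value may arrive as text or as raw bytes depending on its content type.
    data = value.encode("utf-8") if isinstance(value, str) else bytes(value)
    target = Path(path)
    target.write_bytes(data)
    return target


def download_generated_script(token, run, path="generated_scraper.py"):
    """Fetch GENERATED_SCRIPT from the run's default key-value store (needs network)."""
    from apify_client import ApifyClient  # pip install apify-client

    client = ApifyClient(token)
    store = client.key_value_store(run["defaultKeyValueStoreId"])
    return save_generated_script(store.get_record("GENERATED_SCRIPT"), path)
```

Calling `download_generated_script("<YOUR_API_TOKEN>", run)` after a finished run should leave a `generated_scraper.py` file in the working directory.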

#### 🎯 What You Get

- **Extracted Data**: The actual data from the website, structured according to your goal
- **🔥 Generated Python Code**: Ready-to-use BeautifulSoup script that you can run on your own computer
- **💾 Separate Script File**: The Python code is also saved as a downloadable file in the key-value store
- **Quality Scores**: Performance ratings for each scraping method (0-10 scale)
- **Best Method**: Which scraping approach worked best for your website

> **💡 Pro Tip**: The generated Python code is completely standalone: you can copy it, modify it, and use it in your own projects without needing this Actor again!

### 🎯 Usage Examples

#### E-commerce Product Scraping

```json
{
    "targetUrl": "https://books.toscrape.com/",
    "userGoal": "Get me a list of all the books on the first page. For each book, I want its title, price, star rating, and whether it is in stock.",
    "claudeApiKey": "sk-ant-..."
}
````

#### News Website Scraping

```json
{
    "targetUrl": "https://www.theverge.com/",
    "userGoal": "I want to scrape the main articles from The Verge homepage. For each article, get me the headline, the author's name, and the link to the full article.",
    "claudeApiKey": "sk-ant-..."
}
```

#### Job Listings Scraping

```json
{
    "targetUrl": "https://www.python.org/jobs/",
    "userGoal": "List all the jobs posted. For each job, I want the job title, the company name, the location, and the date it was posted.",
    "claudeApiKey": "sk-ant-..."
}
```

#### Quote Collection

```json
{
    "targetUrl": "https://quotes.toscrape.com/",
    "userGoal": "I want a list of all quotes on this page. For each one, get the quote text itself, the name of the author, and a list of the tags associated with it.",
    "claudeApiKey": "sk-ant-..."
}
```

#### Business Directory Scraping

```json
{
    "targetUrl": "https://directory.com/restaurants",
    "userGoal": "Get restaurant names, addresses, phone numbers, and ratings",
    "claudeApiKey": "sk-ant-..."
}
```

### 🔧 How to Use

1. **Enter Target URL**: Paste the website URL you want to scrape
2. **Describe Your Goal**: Be specific about what data you need (e.g., "product names and prices" not just "products")
3. **Add Claude API Key**: Your Anthropic API key for AI analysis
4. **Configure Advanced Settings** (optional): Customize Claude model, HTML processing, and actor selection
5. **Run the Actor**: Click "Start" and watch the magic happen!
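
When the Actor is started from code rather than the Console, steps 1 to 3 amount to filling in three input fields. The following is a minimal sketch of a helper that assembles that input and fails early when a required value is missing. `build_input` is a hypothetical helper name; only the field names `targetUrl`, `userGoal`, and `claudeApiKey` come from this README.

```python
def build_input(target_url, user_goal, claude_api_key, **advanced):
    """Assemble the Actor input, rejecting missing required values."""
    required = {
        "targetUrl": target_url,
        "userGoal": user_goal,
        "claudeApiKey": claude_api_key,
    }
    missing = [name for name, value in required.items() if not str(value or "").strip()]
    if missing:
        raise ValueError(f"Missing required input: {', '.join(missing)}")
    if not target_url.startswith(("http://", "https://")):
        raise ValueError("targetUrl must start with http:// or https://")
    # Optional settings (for example claudeModel or testScript) pass through unchanged.
    return {**required, **advanced}


run_input = build_input(
    "https://books.toscrape.com/",
    "For each book on the first page, get its title and price.",
    "sk-ant-...",
    testScript=True,
)
```

The resulting dictionary can be passed as `run_input` to the Python client.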

### ⚙️ Advanced Configuration

#### 🤖 Claude Model Selection

Choose the AI model that best fits your needs:

- **Claude 4 Sonnet** (Default): Latest and most capable model
- **Claude 4 Opus**: Maximum quality for the most complex tasks
- **Claude 3.7 Sonnet**: Enhanced capabilities over 3.5
- **Claude 3.5 Sonnet**: Reliable and well-tested
- **Claude 3.5 Haiku**: Fastest and most cost-effective
- **Claude 3 Sonnet**: Good balance for most tasks
- **Claude 3 Haiku**: Basic tasks with minimal cost

#### 🔧 HTML Processing Settings

Fine-tune how HTML content is processed:

- **Enable HTML Pruning**: Reduces processing time by removing unnecessary content
- **Max List Items**: Controls how many items to keep in lists/tables (1-20)
- **Max Text Length**: Maximum text length in any element (100-2000 chars)
- **Prune Percentage**: How much content to keep (10%-100%)

#### 🎯 Actor Selection

Choose which scraping methods to use:

- **Cheerio Scraper**: Fast jQuery-like scraping (enabled by default)
- **Web Scraper**: Versatile with JavaScript support (enabled by default)
- **Website Content Crawler**: Advanced Playwright crawler (enabled by default)
- **Playwright Scraper**: Modern browser automation (disabled by default)
- **Puppeteer Scraper**: Chrome-based scraping (disabled by default)

> **💡 Pro Tip**: Enable 2-3 actors for the best balance of speed and reliability. More actors = better chances of success but slower execution.
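
As a sketch, the `actors` input array can be generated from a list of scraper names to enable. The entry shape (`name`, `enabled`, `input`) follows the configuration examples in this README. `select_actors` and `KNOWN_ACTORS` are hypothetical names, and only the four scraper names that appear in those examples are included.

```python
KNOWN_ACTORS = (
    "cheerio-scraper",
    "web-scraper",
    "website-content-crawler",
    "playwright-scraper",
)


def select_actors(enabled, overrides=None):
    """Return an `actors` array with the given names enabled and the rest disabled."""
    overrides = overrides or {}
    unknown = (set(enabled) | set(overrides)) - set(KNOWN_ACTORS)
    if unknown:
        raise ValueError(f"Unknown actor name(s): {sorted(unknown)}")
    return [
        {
            "name": name,
            "enabled": name in enabled,
            # Per-scraper input overrides; an empty dict leaves the input unspecified.
            "input": dict(overrides.get(name, {})),
        }
        for name in KNOWN_ACTORS
    ]


actors = select_actors(
    ["cheerio-scraper", "website-content-crawler"],
    overrides={"cheerio-scraper": {"maxRequestRetries": 5}},
)
```

Listing every scraper explicitly, including the disabled ones, avoids relying on how the Actor treats entries that are left out, which this README does not specify.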

#### 🚀 Performance Settings

- **Concurrent Actors**: Run multiple actors simultaneously for faster results
- **Test Generated Script**: Validate the generated code before saving

#### Common Use Cases

- **Market Research**: Track competitor pricing and products + get code to monitor them daily
- **Content Aggregation**: Collect news articles or blog posts + get code to update your database
- **Lead Generation**: Extract business contact information + get code to scrape new listings
- **Data Analysis**: Gather data for research projects + get code to repeat the process
- **Price Monitoring**: Track product prices over time + get code to check prices automatically

### 🔍 Troubleshooting

#### "No content found" errors

- Try different goal descriptions
- Some websites block automated scraping
- Check if the URL is accessible

#### Poor quality scores

- Be more specific in your goal description
- The website might have complex structure
- Try simpler pages first

#### 🔑 Getting Your Claude API Key

1. Go to [Anthropic Console](https://console.anthropic.com/)
2. Sign up or log in
3. Navigate to API Keys section
4. Create a new API key
5. Copy and paste it into the "Claude API Key" field

#### Claude API errors

- Verify your API key is correct
- Check your Claude API usage limits
- Ensure you have sufficient API credits

### 📋 Input Parameters

| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| **Target URL** | String | Yes | The website URL you want to scrape |
| **User Goal** | String | Yes | Describe what data you want (e.g., "Extract all product names, prices, and ratings") |
| **Claude API Key** | String | Yes | Your Anthropic Claude API key ([Get one here](https://console.anthropic.com/)) |
| **Test Generated Script** | Boolean | No | Whether to test the generated script (default: true) |
| **Claude Model** | String | No | AI model to use (default: Claude 4 Sonnet) |
| **Max Retries** | Number | No | Maximum retry attempts (default: 3) |
| **Timeout** | Number | No | Timeout per attempt in seconds (default: 60) |
| **HTML Pruning Enabled** | Boolean | No | Enable HTML content processing (default: true) |
| **HTML Max List Items** | Number | No | Maximum items in lists to keep (1-20, default: 3) |
| **HTML Max Text Length** | Number | No | Maximum text length in elements (50-2000, default: 200) |
| **HTML Prune Before Evaluation** | Boolean | No | Apply pruning before AI evaluation (default: true) |
| **HTML Prune Percentage** | Number | No | Percentage of content to keep (0-100, default: 80) |
| **Actors** | Array | No | Detailed actor configurations with custom inputs |
| **Concurrent Actors** | Boolean | No | Run actors simultaneously (default: true) |
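
The numeric limits in the table can be checked before a run is started. This is a minimal sketch that uses the ranges exactly as listed in the table above; the Actor's real input schema may enforce different bounds, and `check_ranges` is a hypothetical helper.

```python
# Ranges copied from the Input Parameters table (assumed to match the Actor's schema).
RANGES = {
    "htmlMaxListItems": (1, 20),
    "htmlMaxTextLength": (50, 2000),
    "htmlPrunePercentage": (0, 100),
}


def check_ranges(actor_input):
    """Return a list of problems; an empty list means every present value is in range."""
    problems = []
    for key, (low, high) in RANGES.items():
        if key not in actor_input:
            continue  # optional field, the Actor applies its default
        value = actor_input[key]
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{key} must be an integer")
        elif not low <= value <= high:
            problems.append(f"{key}={value} is outside {low}-{high}")
    return problems
```

For example, `check_ranges({"htmlMaxListItems": 25})` reports that the value is outside 1-20, while an input that omits all three fields passes.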

#### Advanced Configuration Examples

##### Custom Claude Model

```json
{
    "targetUrl": "https://example.com",
    "userGoal": "Extract product data",
    "claudeApiKey": "sk-ant-...",
    "claudeModel": "claude-sonnet-4-20250514"
}
```

##### Custom HTML Processing

```json
{
    "targetUrl": "https://example.com",
    "userGoal": "Extract product data",
    "claudeApiKey": "sk-ant-...",
    "htmlPruningEnabled": true,
    "htmlMaxListItems": 10,
    "htmlMaxTextLength": 1000,
    "htmlPrunePercentage": 90
}
```

##### Custom Actor Selection

```json
{
    "targetUrl": "https://example.com",
    "userGoal": "Extract product data",
    "claudeApiKey": "sk-ant-...",
    "actors": [
        {
            "name": "cheerio-scraper",
            "enabled": true,
            "input": {
                "maxRequestRetries": 5,
                "requestTimeoutSecs": 60,
                "maxPagesPerCrawl": 1,
                "proxyConfiguration": {"useApifyProxy": true}
            }
        },
        {
            "name": "web-scraper",
            "enabled": false,
            "input": {}
        },
        {
            "name": "playwright-scraper",
            "enabled": true,
            "input": {
                "maxRequestRetries": 3,
                "requestTimeoutSecs": 90,
                "maxPagesPerCrawl": 1
            }
        }
    ],
    "concurrentActors": true
}
```

##### Full Configuration Example

```json
{
    "targetUrl": "https://books.toscrape.com/",
    "userGoal": "Get me a list of all the books on the first page. For each book, I want its title, price, star rating, and whether it is in stock.",
    "claudeApiKey": "sk-ant-...",
    "claudeModel": "claude-sonnet-4-20250514",
    "testScript": true,
    "maxRetries": 3,
    "timeout": 60,
    "htmlPruningEnabled": true,
    "htmlMaxListItems": 5,
    "htmlMaxTextLength": 500,
    "htmlPruneBeforeEvaluation": true,
    "htmlPrunePercentage": 80,
    "concurrentActors": true,
    "actors": [
        {
            "name": "cheerio-scraper",
            "enabled": true,
            "input": {
                "maxRequestRetries": 3,
                "requestTimeoutSecs": 30,
                "maxPagesPerCrawl": 1,
                "proxyConfiguration": {"useApifyProxy": true}
            }
        },
        {
            "name": "web-scraper",
            "enabled": true,
            "input": {
                "maxRequestRetries": 3,
                "requestTimeoutSecs": 30,
                "maxPagesPerCrawl": 1,
                "proxyConfiguration": {"useApifyProxy": true}
            }
        },
        {
            "name": "playwright-scraper",
            "enabled": true,
            "input": {
                "maxRequestRetries": 2,
                "requestTimeoutSecs": 45,
                "maxPagesPerCrawl": 1
            }
        }
    ]
}
```

# Actor input Schema

## `targetUrl` (type: `string`):

The URL of the website you want to scrape

## `userGoal` (type: `string`):

Describe what data you want to extract from the website

## `claudeApiKey` (type: `string`):

Your Anthropic Claude API key for AI-powered code generation

## `maxRetries` (type: `integer`):

Maximum number of retry attempts for scraping

## `timeout` (type: `integer`):

Timeout for each scraping attempt in seconds

## `testScript` (type: `boolean`):

Whether to test the generated scraping script before saving it

## `claudeModel` (type: `string`):

Choose which Claude model to use for AI analysis

## `htmlPruningEnabled` (type: `boolean`):

Enable HTML content processing before analysis

## `htmlMaxListItems` (type: `integer`):

Maximum number of items to keep in lists when pruning HTML

## `htmlMaxTextLength` (type: `integer`):

Maximum length of text content to keep when pruning HTML

## `htmlPruneBeforeEvaluation` (type: `boolean`):

Apply HTML pruning before quality evaluation

## `htmlPrunePercentage` (type: `integer`):

Percentage of HTML content to prune (0-100)

## `actors` (type: `array`):

Select and configure which Apify actors to use for scraping

## `concurrentActors` (type: `boolean`):

Run multiple actors simultaneously for faster results

## `forActor` (type: `boolean`):

Choose the output format for the generated script

## Actor input object example

```json
{
  "targetUrl": "https://books.toscrape.com/",
  "userGoal": "Get me a list of all the books on the first page. For each book, I want its title, price, star rating, and whether it is in stock.",
  "maxRetries": 3,
  "timeout": 60,
  "testScript": true,
  "claudeModel": "claude-sonnet-4-20250514",
  "htmlPruningEnabled": true,
  "htmlMaxListItems": 3,
  "htmlMaxTextLength": 200,
  "htmlPruneBeforeEvaluation": true,
  "htmlPrunePercentage": 80,
  "actors": [
    {
      "name": "cheerio-scraper",
      "enabled": true,
      "input": {
        "maxRequestRetries": 3,
        "requestTimeoutSecs": 30,
        "maxPagesPerCrawl": 1,
        "pageFunction": "async function pageFunction(context) {\n    const { request, log, $ } = context;\n    try {\n        const title = $('title').text() || '';\n        const html = $('html').html() || '';\n        return {\n            url: request.url,\n            title: title,\n            html: html\n        };\n    } catch (error) {\n        log.error('Error in pageFunction:', error);\n        return {\n            url: request.url,\n            title: '',\n            html: ''\n        };\n    }\n}",
        "proxyConfiguration": {
          "useApifyProxy": true
        }
      }
    },
    {
      "name": "web-scraper",
      "enabled": false,
      "input": {
        "maxRequestRetries": 3,
        "requestTimeoutSecs": 30,
        "maxPagesPerCrawl": 1,
        "pageFunction": "async function pageFunction(context) {\n    const { request, log, page } = context;\n    try {\n        const title = await page.title();\n        const html = await page.content();\n        return {\n            url: request.url,\n            title: title,\n            html: html\n        };\n    } catch (error) {\n        log.error('Error in pageFunction:', error);\n        return {\n            url: request.url,\n            title: '',\n            html: ''\n        };\n    }\n}",
        "proxyConfiguration": {
          "useApifyProxy": true
        }
      }
    },
    {
      "name": "website-content-crawler",
      "enabled": true,
      "input": {
        "maxCrawlPages": 1,
        "crawler": "playwright",
        "proxyConfiguration": {
          "useApifyProxy": true
        }
      }
    },
    {
      "name": "playwright-scraper",
      "enabled": false,
      "input": {
        "maxRequestRetries": 2,
        "requestTimeoutSecs": 45,
        "maxPagesPerCrawl": 1,
        "pageFunction": "async function pageFunction(context) {\n    const { request, log, page } = context;\n    try {\n        const title = await page.title();\n        const html = await page.content();\n        return {\n            url: request.url,\n            title: title,\n            html: html\n        };\n    } catch (error) {\n        log.error('Error in pageFunction:', error);\n        return {\n            url: request.url,\n            title: '',\n            html: ''\n        };\n    }\n}",
        "proxyConfiguration": {
          "useApifyProxy": true
        }
      }
    }
  ],
  "concurrentActors": true,
  "forActor": true
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.
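
Without a client library, the Actor can be called over plain HTTP. The sketch below builds a request for the `run-sync-get-dataset-items` endpoint listed in the OpenAPI specification, with the token passed as a query parameter as that specification describes. It uses only the Python standard library; `build_sync_request` is a hypothetical helper, and the request is built but not sent.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "ohlava~scrapercodegenerator"


def build_sync_request(token, actor_input):
    """Build (but do not send) a POST request that runs the Actor synchronously."""
    query = urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_sync_request(
    "<YOUR_API_TOKEN>",
    {"targetUrl": "https://books.toscrape.com/", "userGoal": "List book titles and prices."},
)
```

Sending it with `urllib.request.urlopen(request)` and parsing the response body as JSON should yield the dataset items once the run finishes. A long run may exceed the time limit of a synchronous request, in which case the asynchronous `/runs` endpoint is the safer choice.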

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "targetUrl": "https://books.toscrape.com/",
    // Anthropic API key; the README's input table lists it as required
    "claudeApiKey": "sk-ant-...",
    "userGoal": "Get me a list of all the books on the first page. For each book, I want its title, price, star rating, and whether it is in stock.",
    "actors": [
        {
            "name": "cheerio-scraper",
            "enabled": true,
            "input": {
                "maxRequestRetries": 3,
                "requestTimeoutSecs": 30,
                "maxPagesPerCrawl": 1,
                "pageFunction": "async function pageFunction(context) {\n    const { request, log, $ } = context;\n    try {\n        const title = $('title').text() || '';\n        const html = $('html').html() || '';\n        return {\n            url: request.url,\n            title: title,\n            html: html\n        };\n    } catch (error) {\n        log.error('Error in pageFunction:', error);\n        return {\n            url: request.url,\n            title: '',\n            html: ''\n        };\n    }\n}",
                "proxyConfiguration": {
                    "useApifyProxy": true
                }
            }
        },
        {
            "name": "web-scraper",
            "enabled": false,
            "input": {
                "maxRequestRetries": 3,
                "requestTimeoutSecs": 30,
                "maxPagesPerCrawl": 1,
                "pageFunction": "async function pageFunction(context) {\n    const { request, log, page } = context;\n    try {\n        const title = await page.title();\n        const html = await page.content();\n        return {\n            url: request.url,\n            title: title,\n            html: html\n        };\n    } catch (error) {\n        log.error('Error in pageFunction:', error);\n        return {\n            url: request.url,\n            title: '',\n            html: ''\n        };\n    }\n}",
                "proxyConfiguration": {
                    "useApifyProxy": true
                }
            }
        },
        {
            "name": "website-content-crawler",
            "enabled": true,
            "input": {
                "maxCrawlPages": 1,
                "crawler": "playwright",
                "proxyConfiguration": {
                    "useApifyProxy": true
                }
            }
        },
        {
            "name": "playwright-scraper",
            "enabled": false,
            "input": {
                "maxRequestRetries": 2,
                "requestTimeoutSecs": 45,
                "maxPagesPerCrawl": 1,
                "pageFunction": "async function pageFunction(context) {\n    const { request, log, page } = context;\n    try {\n        const title = await page.title();\n        const html = await page.content();\n        return {\n            url: request.url,\n            title: title,\n            html: html\n        };\n    } catch (error) {\n        log.error('Error in pageFunction:', error);\n        return {\n            url: request.url,\n            title: '',\n            html: ''\n        };\n    }\n}",
                "proxyConfiguration": {
                    "useApifyProxy": true
                }
            }
        }
    ]
};

// Run the Actor and wait for it to finish
const run = await client.actor("ohlava/scrapercodegenerator").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "targetUrl": "https://books.toscrape.com/",
    # Anthropic API key; the README's input table lists it as required
    "claudeApiKey": "sk-ant-...",
    "userGoal": "Get me a list of all the books on the first page. For each book, I want its title, price, star rating, and whether it is in stock.",
    "actors": [
        {
            "name": "cheerio-scraper",
            "enabled": True,
            "input": {
                "maxRequestRetries": 3,
                "requestTimeoutSecs": 30,
                "maxPagesPerCrawl": 1,
                "pageFunction": """async function pageFunction(context) {
    const { request, log, $ } = context;
    try {
        const title = $('title').text() || '';
        const html = $('html').html() || '';
        return {
            url: request.url,
            title: title,
            html: html
        };
    } catch (error) {
        log.error('Error in pageFunction:', error);
        return {
            url: request.url,
            title: '',
            html: ''
        };
    }
}""",
                "proxyConfiguration": { "useApifyProxy": True },
            },
        },
        {
            "name": "web-scraper",
            "enabled": False,
            "input": {
                "maxRequestRetries": 3,
                "requestTimeoutSecs": 30,
                "maxPagesPerCrawl": 1,
                "pageFunction": """async function pageFunction(context) {
    const { request, log, page } = context;
    try {
        const title = await page.title();
        const html = await page.content();
        return {
            url: request.url,
            title: title,
            html: html
        };
    } catch (error) {
        log.error('Error in pageFunction:', error);
        return {
            url: request.url,
            title: '',
            html: ''
        };
    }
}""",
                "proxyConfiguration": { "useApifyProxy": True },
            },
        },
        {
            "name": "website-content-crawler",
            "enabled": True,
            "input": {
                "maxCrawlPages": 1,
                "crawler": "playwright",
                "proxyConfiguration": { "useApifyProxy": True },
            },
        },
        {
            "name": "playwright-scraper",
            "enabled": False,
            "input": {
                "maxRequestRetries": 2,
                "requestTimeoutSecs": 45,
                "maxPagesPerCrawl": 1,
                "pageFunction": """async function pageFunction(context) {
    const { request, log, page } = context;
    try {
        const title = await page.title();
        const html = await page.content();
        return {
            url: request.url,
            title: title,
            html: html
        };
    } catch (error) {
        log.error('Error in pageFunction:', error);
        return {
            url: request.url,
            title: '',
            html: ''
        };
    }
}""",
                "proxyConfiguration": { "useApifyProxy": True },
            },
        },
    ],
}

# Run the Actor and wait for it to finish
run = client.actor("ohlava/scrapercodegenerator").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "targetUrl": "https://books.toscrape.com/",
  "claudeApiKey": "sk-ant-...",
  "userGoal": "Get me a list of all the books on the first page. For each book, I want its title, price, star rating, and whether it is in stock.",
  "actors": [
    {
      "name": "cheerio-scraper",
      "enabled": true,
      "input": {
        "maxRequestRetries": 3,
        "requestTimeoutSecs": 30,
        "maxPagesPerCrawl": 1,
        "pageFunction": "async function pageFunction(context) {\\n    const { request, log, $ } = context;\\n    try {\\n        const title = $('\''title'\'').text() || '\'''\'';\\n        const html = $('\''html'\'').html() || '\'''\'';\\n        return {\\n            url: request.url,\\n            title: title,\\n            html: html\\n        };\\n    } catch (error) {\\n        log.error('\''Error in pageFunction:'\'', error);\\n        return {\\n            url: request.url,\\n            title: '\'''\'',\\n            html: '\'''\''\\n        };\\n    }\\n}",
        "proxyConfiguration": {
          "useApifyProxy": true
        }
      }
    },
    {
      "name": "web-scraper",
      "enabled": false,
      "input": {
        "maxRequestRetries": 3,
        "requestTimeoutSecs": 30,
        "maxPagesPerCrawl": 1,
        "pageFunction": "async function pageFunction(context) {\\n    const { request, log, page } = context;\\n    try {\\n        const title = await page.title();\\n        const html = await page.content();\\n        return {\\n            url: request.url,\\n            title: title,\\n            html: html\\n        };\\n    } catch (error) {\\n        log.error('\''Error in pageFunction:'\'', error);\\n        return {\\n            url: request.url,\\n            title: '\'''\'',\\n            html: '\'''\''\\n        };\\n    }\\n}",
        "proxyConfiguration": {
          "useApifyProxy": true
        }
      }
    },
    {
      "name": "website-content-crawler",
      "enabled": true,
      "input": {
        "maxCrawlPages": 1,
        "crawler": "playwright",
        "proxyConfiguration": {
          "useApifyProxy": true
        }
      }
    },
    {
      "name": "playwright-scraper",
      "enabled": false,
      "input": {
        "maxRequestRetries": 2,
        "requestTimeoutSecs": 45,
        "maxPagesPerCrawl": 1,
        "pageFunction": "async function pageFunction(context) {\\n    const { request, log, page } = context;\\n    try {\\n        const title = await page.title();\\n        const html = await page.content();\\n        return {\\n            url: request.url,\\n            title: title,\\n            html: html\\n        };\\n    } catch (error) {\\n        log.error('\''Error in pageFunction:'\'', error);\\n        return {\\n            url: request.url,\\n            title: '\'''\'',\\n            html: '\'''\''\\n        };\\n    }\\n}",
        "proxyConfiguration": {
          "useApifyProxy": true
        }
      }
    }
  ]
}' |
apify call ohlava/scrapercodegenerator --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=ohlava/scrapercodegenerator",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "ScraperCodeGenerator",
        "description": "An intelligent web scraping tool that automatically generates custom scraping code for any website.",
        "version": "1.1",
        "x-build-id": "bF9DDedBOF6H3shA4"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/ohlava~scrapercodegenerator/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-ohlava-scrapercodegenerator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/ohlava~scrapercodegenerator/runs": {
            "post": {
                "operationId": "runs-sync-ohlava-scrapercodegenerator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/ohlava~scrapercodegenerator/run-sync": {
            "post": {
                "operationId": "run-sync-ohlava-scrapercodegenerator",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "targetUrl",
                    "userGoal",
                    "claudeApiKey"
                ],
                "properties": {
                    "targetUrl": {
                        "title": "Target URL",
                        "type": "string",
                        "description": "The URL of the website you want to scrape"
                    },
                    "userGoal": {
                        "title": "Scraping Goal",
                        "type": "string",
                        "description": "Describe what data you want to extract from the website"
                    },
                    "claudeApiKey": {
                        "title": "Claude API Key",
                        "type": "string",
                        "description": "Your Anthropic Claude API key for AI-powered code generation"
                    },
                    "maxRetries": {
                        "title": "Max Retries",
                        "minimum": 1,
                        "maximum": 10,
                        "type": "integer",
                        "description": "Maximum number of retry attempts for scraping",
                        "default": 3
                    },
                    "timeout": {
                        "title": "Timeout (seconds)",
                        "minimum": 10,
                        "maximum": 300,
                        "type": "integer",
                        "description": "Timeout for each scraping attempt in seconds",
                        "default": 60
                    },
                    "testScript": {
                        "title": "Test Generated Script",
                        "type": "boolean",
                        "description": "Whether to test the generated scraping script before saving it",
                        "default": true
                    },
                    "claudeModel": {
                        "title": "Claude Model",
                        "enum": [
                            "claude-sonnet-4-20250514",
                            "claude-opus-4-20250514",
                            "claude-3-7-sonnet-20250219",
                            "claude-3-5-sonnet-20241022",
                            "claude-3-5-haiku-20241022",
                            "claude-3-sonnet-20240229",
                            "claude-3-haiku-20240307"
                        ],
                        "type": "string",
                        "description": "Choose which Claude model to use for AI analysis",
                        "default": "claude-sonnet-4-20250514"
                    },
                    "htmlPruningEnabled": {
                        "title": "Enable HTML Pruning",
                        "type": "boolean",
                        "description": "Enable HTML content processing before analysis",
                        "default": true
                    },
                    "htmlMaxListItems": {
                        "title": "Max List Items",
                        "minimum": 1,
                        "maximum": 20,
                        "type": "integer",
                        "description": "Maximum number of items to keep in lists when pruning HTML",
                        "default": 3
                    },
                    "htmlMaxTextLength": {
                        "title": "Max Text Length",
                        "minimum": 50,
                        "maximum": 2000,
                        "type": "integer",
                        "description": "Maximum length of text content to keep when pruning HTML",
                        "default": 200
                    },
                    "htmlPruneBeforeEvaluation": {
                        "title": "Prune Before Evaluation",
                        "type": "boolean",
                        "description": "Apply HTML pruning before quality evaluation",
                        "default": true
                    },
                    "htmlPrunePercentage": {
                        "title": "Prune Percentage",
                        "minimum": 0,
                        "maximum": 100,
                        "type": "integer",
                        "description": "Percentage of HTML content to prune (0-100)",
                        "default": 80
                    },
                    "actors": {
                        "title": "Scraping Actors Configuration",
                        "type": "array",
                        "description": "Select and configure which Apify Actors to use for scraping"
                    },
                    "concurrentActors": {
                        "title": "Concurrent Actors",
                        "type": "boolean",
                        "description": "Run multiple Actors simultaneously for faster results",
                        "default": true
                    },
                    "forActor": {
                        "title": "Generate for Apify Actor",
                        "type": "boolean",
                        "description": "Choose the output format for the generated script",
                        "default": true
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
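
The required fields and numeric bounds in `inputSchema` above can be checked on the client before a run is started. The following is a minimal sketch using only the Python standard library. The helper names `validate_input` and `build_run_sync_url` are hypothetical and not part of any Apify client; the bounds and the endpoint path are copied from the definition above, and treating an empty string as a missing value is an assumption of this sketch.

```python
import json
from urllib.parse import urlencode

# Base URL and path taken from "servers" and "paths" in the definition above.
API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/ohlava~scrapercodegenerator/run-sync-get-dataset-items"

REQUIRED_FIELDS = ("targetUrl", "userGoal", "claudeApiKey")

# (minimum, maximum) bounds copied from inputSchema.
INTEGER_BOUNDS = {
    "maxRetries": (1, 10),
    "timeout": (10, 300),
    "htmlMaxListItems": (1, 20),
    "htmlMaxTextLength": (50, 2000),
    "htmlPrunePercentage": (0, 100),
}


def validate_input(actor_input: dict) -> list:
    """Return a list of problems; an empty list means these checks passed."""
    problems = []
    for field in REQUIRED_FIELDS:
        # Assumption: an empty value is treated the same as a missing one.
        if not actor_input.get(field):
            problems.append(f"missing required field: {field}")
    for field, (low, high) in INTEGER_BOUNDS.items():
        if field not in actor_input:
            continue  # optional field, the Actor applies its default
        value = actor_input[field]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            problems.append(f"{field} must be an integer")
        elif not low <= value <= high:
            problems.append(f"{field} must be between {low} and {high}")
    return problems


def build_run_sync_url(token: str) -> str:
    """Build the endpoint URL; the definition passes the token as a query parameter."""
    return f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"


if __name__ == "__main__":
    # Placeholder values for illustration only.
    example_input = {
        "targetUrl": "https://example.com",
        "userGoal": "Extract article titles",
        "claudeApiKey": "<YOUR_CLAUDE_API_KEY>",
        "maxRetries": 3,
    }
    print(validate_input(example_input))
    print(build_run_sync_url("<YOUR_APIFY_TOKEN>"))
    print(json.dumps(example_input))
```

The validated dictionary would be sent as the JSON body of a `POST` request to the generated URL. Because the token travels in the query string, it is worth keeping the full URL out of logs.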

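The `runsResponseSchema` in the definition above wraps the run object in a top-level `data` property. As a rough sketch of how a client might pull out the fields needed to follow up on a run, the hypothetical helper `summarize_run` below reads a response body and returns the run ID, status, default storage IDs, and total cost. The sample payload is made up for illustration and only uses field names that appear in the schema.

```python
import json


def summarize_run(response_body: str) -> dict:
    """Extract a few fields from a /runs response shaped like runsResponseSchema."""
    run = json.loads(response_body).get("data", {})
    return {
        "id": run.get("id"),
        "status": run.get("status"),
        "datasetId": run.get("defaultDatasetId"),
        "keyValueStoreId": run.get("defaultKeyValueStoreId"),
        "usageTotalUsd": run.get("usageTotalUsd"),
    }


if __name__ == "__main__":
    # Hypothetical response body; the IDs are placeholders, not real values.
    sample_body = json.dumps(
        {
            "data": {
                "id": "<RUN_ID>",
                "status": "READY",
                "defaultDatasetId": "<DATASET_ID>",
                "defaultKeyValueStoreId": "<STORE_ID>",
                "usageTotalUsd": 0.00005,
            }
        }
    )
    print(summarize_run(sample_body))
```

Missing fields come back as `None` rather than raising, which keeps the helper usable on partial responses.
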