# Universal Web Scraper - Extract Any URL (`lazymac/web-scraper-toolkit`) Actor

Pay-per-result web scraper with JS rendering, CSS selector / XPath / regex extraction, schema validation, retry on failure. Use for product catalogs, competitor pricing, news aggregation, lead generation. Fast (<2s/page), respects robots.txt by default.

- **URL**: https://apify.com/lazymac/web-scraper-toolkit.md
- **Developed by:** [2x lazymac](https://apify.com/lazymac) (community)
- **Categories:** Business
- **Stats:** 8 total users, 2 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

$30.00 / 1,000 web scrape results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Web Scraper Toolkit

Extract structured data from any webpage -- metadata, links, headlines, images, tables, full text, or custom CSS selectors. Scrape up to 10 URLs in a single run with 8 flexible extraction modes. No browser required, no API keys needed.

Built for developers, data analysts, content marketers, and anyone who needs to pull structured data from the web quickly and reliably.

---

### What It Does

Web Scraper Toolkit fetches public web pages and extracts data in one of 8 modes. You can grab just the metadata (title, description, OG tags), extract all links on a page, pull out headlines, collect images with alt text, parse HTML tables into structured rows, extract clean body text, or target specific elements using custom CSS selectors. The "full" mode combines metadata, headlines, links, images, and tables in a single pass.

Each URL is processed independently, and results are pushed to the Apify dataset one by one. If a URL fails, the others still succeed -- you never lose partial results.

#### Key Capabilities

- **8 Extraction Modes**: full, metadata, links, headlines, images, tables, text, custom
- **Batch Processing**: Scrape up to 10 URLs per run
- **Custom CSS Selectors**: Target any element on the page with standard CSS selector syntax
- **Automatic Redirect Handling**: Follows HTTP redirects transparently
- **Link Resolution**: Relative URLs are automatically resolved to absolute URLs
- **Deduplication**: Link extraction removes duplicate URLs automatically
- **Graceful Error Handling**: Failed URLs are reported with error messages, other URLs continue processing
- **Lightweight**: No browser rendering -- pure HTTP + HTML parsing for fast, cost-effective execution

---

### What Data You Get

#### Common Fields (All Modes)

| Field | Type | Description |
|-------|------|-------------|
| `url` | string | The URL that was scraped |
| `status` | number | HTTP status code |
| `timestamp` | number | Unix timestamp in milliseconds of when the scrape occurred |

#### Metadata Mode

| Field | Type | Description |
|-------|------|-------------|
| `metadata.title` | string | Page title from `<title>` tag |
| `metadata.description` | string | Meta description content |
| `metadata.ogImage` | string | Open Graph image URL |
| `metadata.ogTitle` | string | Open Graph title |
| `metadata.canonical` | string | Canonical URL |
| `metadata.language` | string | Page language from `lang` attribute |
| `metadata.url` | string | The requested URL |

#### Links Mode

| Field | Type | Description |
|-------|------|-------------|
| `links` | array | Array of link objects |
| `links[].url` | string | Absolute URL of the link |
| `links[].text` | string | Anchor text (null if empty) |
| `count` | number | Total number of unique links found |

#### Headlines Mode

| Field | Type | Description |
|-------|------|-------------|
| `headlines` | array | Array of headline objects |
| `headlines[].tag` | string | HTML tag (h1, h2, or h3) |
| `headlines[].text` | string | Headline text content |
| `count` | number | Total number of headlines found |

#### Images Mode

| Field | Type | Description |
|-------|------|-------------|
| `images` | array | Array of image objects |
| `images[].url` | string | Absolute URL of the image |
| `images[].alt` | string | Alt text (null if missing) |
| `count` | number | Total number of images found |

#### Tables Mode

| Field | Type | Description |
|-------|------|-------------|
| `tables` | array | Array of table objects |
| `tables[].headers` | array | Column headers from `<th>` elements |
| `tables[].rows` | array | Array of row arrays (each row is an array of cell text) |
| `tables[].rowCount` | number | Number of data rows |
| `count` | number | Total number of tables found |

#### Text Mode

| Field | Type | Description |
|-------|------|-------------|
| `text` | string | Clean body text with scripts, styles, nav, footer, and header removed |
| `length` | number | Character count of the extracted text |

#### Custom Mode

| Field | Type | Description |
|-------|------|-------------|
| `results` | array | Array of matched element objects |
| `results[].text` | string | Text content of the matched element |
| `results[].tag` | string | HTML tag name of the matched element |
| `count` | number | Total number of matched elements |

#### Full Mode

Returns `metadata`, `headlines`, `links` (top 50), `images` (top 20), and `tables` all in one result object.

---

### How to Use

#### Basic Usage

1. Open the Web Scraper Toolkit on Apify
2. Enter your URLs as a JSON array (e.g., `["https://example.com"]`)
3. Select a scraping mode (default: `full`)
4. Click "Start"
5. View results in the "Dataset" tab

#### Custom CSS Selector

1. Set mode to `custom`
2. Enter your CSS selector in the "CSS Selector" field (e.g., `.article-title`, `#main-content p`, `table.data-table tr`)
3. The actor extracts text content and tag name for every matching element

---

### Input Configuration

| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| `urls` | array | Yes | -- | JSON array of URLs to scrape. Maximum 10 URLs per run. Each URL must be a publicly accessible webpage. Example: `["https://example.com", "https://github.com"]` |
| `mode` | string | No | `full` | Extraction mode. One of: `full` (metadata + headlines + links + images + tables), `metadata` (page title, description, OG tags, canonical, language), `links` (all unique links with anchor text), `headlines` (H1, H2, H3 headings), `images` (all images with alt text), `tables` (HTML tables parsed into rows), `text` (clean body text), `custom` (elements matching a CSS selector). |
| `selector` | string | No | -- | CSS selector for `custom` mode. Supports any valid CSS selector syntax: element selectors (`div`, `p`), class selectors (`.class-name`), ID selectors (`#id`), attribute selectors (`[data-type="value"]`), combinators (`div > p`, `ul li`), pseudo-classes (`:first-child`, `:nth-child(2)`). Ignored when mode is not `custom`. |

---

### Output Example

#### Full Mode

```json
{
  "url": "https://example.com",
  "status": 200,
  "timestamp": 1713264000000,
  "metadata": {
    "title": "Example Domain",
    "description": null,
    "ogImage": null,
    "ogTitle": null,
    "canonical": null,
    "language": null,
    "url": "https://example.com"
  },
  "headlines": [
    { "tag": "h1", "text": "Example Domain" }
  ],
  "links": [
    { "url": "https://www.iana.org/domains/example", "text": "More information..." }
  ],
  "images": [],
  "tables": []
}
````

#### Links Mode

```json
{
  "url": "https://news.ycombinator.com",
  "status": 200,
  "timestamp": 1713264000000,
  "links": [
    { "url": "https://news.ycombinator.com/newest", "text": "new" },
    { "url": "https://news.ycombinator.com/front", "text": "past" },
    { "url": "https://news.ycombinator.com/newcomments", "text": "comments" },
    { "url": "https://some-article.com/post", "text": "Show HN: My new project" }
  ],
  "count": 187
}
```
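The `links` array above reflects two behaviors described for links mode: relative URLs are resolved against the page URL, and duplicates are dropped so that only the first occurrence of each URL is kept. The following is a minimal Python sketch of that logic, not the Actor's actual implementation; the function name `resolve_and_dedupe` and the `(href, text)` input shape are assumptions made for illustration.

```python
from urllib.parse import urljoin


def resolve_and_dedupe(base_url, raw_links):
    """Resolve hrefs against base_url and keep the first occurrence of each URL.

    raw_links is a hypothetical iterable of (href, anchor_text) pairs,
    in the order they appear in the page.
    """
    seen = set()
    links = []
    for href, text in raw_links:
        absolute = urljoin(base_url, href)  # relative -> absolute
        if absolute in seen:
            continue  # later duplicates are dropped, even with different text
        seen.add(absolute)
        # Empty anchor text is reported as None (null in the JSON output)
        links.append({"url": absolute, "text": text or None})
    return links


links = resolve_and_dedupe(
    "https://example.com/blog/",
    [
        ("/about", "About"),
        ("post-1", "Post"),
        ("https://example.com/about", "About us"),  # duplicate of /about
    ],
)
print(len(links))
```

Keeping the first occurrence means the anchor text of later duplicates is lost, which matters if you rely on `links[].text` for labeling.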

#### Tables Mode

```json
{
  "url": "https://en.wikipedia.org/wiki/List_of_countries",
  "status": 200,
  "timestamp": 1713264000000,
  "tables": [
    {
      "headers": ["Country", "Population", "Area (km2)"],
      "rows": [
        ["China", "1,425,671,352", "9,596,961"],
        ["India", "1,428,627,663", "3,287,263"]
      ],
      "rowCount": 195
    }
  ],
  "count": 1
}
```
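Because each table object already separates `headers` from `rows`, converting it to CSV takes only a few lines. The sketch below uses Python's standard `csv` module; the helper name `table_to_csv` is an assumption for illustration, and the sample data mirrors the example above.

```python
import csv
import io


def table_to_csv(table):
    """Serialize one item of the `tables` array to a CSV string."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    if table.get("headers"):
        writer.writerow(table["headers"])
    writer.writerows(table.get("rows", []))
    return buffer.getvalue()


sample = {
    "headers": ["Country", "Population", "Area (km2)"],
    "rows": [
        ["China", "1,425,671,352", "9,596,961"],
        ["India", "1,428,627,663", "3,287,263"],
    ],
}
print(table_to_csv(sample))
```

The `csv` module quotes cells that contain commas, so values such as `1,425,671,352` stay in a single column.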

#### Custom Mode

```json
{
  "url": "https://example.com",
  "status": 200,
  "timestamp": 1713264000000,
  "results": [
    { "text": "Example Domain", "tag": "h1" },
    { "text": "This domain is for use in illustrative examples.", "tag": "p" }
  ],
  "count": 2
}
```

***

### Cost Estimation

This actor uses the pay-per-event pricing model. You are charged per URL scraped.

| Action | Event | Estimated Cost |
|--------|-------|----------------|
| Scrape 1 URL | 1 event | ~$0.01 - $0.03 per URL |
| Scrape 10 URLs | 10 events | ~$0.10 - $0.30 per run |

Typical run uses minimal compute (128 MB RAM, 1-3 seconds per URL) because there is no browser involved.
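As a rough budgeting aid, the per-URL charge can be multiplied out before a run. The sketch below assumes the list price from the Pricing section ($30.00 per 1,000 results) and one event per URL; the function name is hypothetical, and whether failed URLs are billed is not stated here, so treat the result as an upper-bound estimate.

```python
def estimate_cost_usd(url_count, price_per_1000=30.00):
    """Estimate the event charges for scraping url_count URLs."""
    if url_count < 0:
        raise ValueError("url_count must be non-negative")
    return url_count * price_per_1000 / 1000


# Ten runs of 10 URLs each
print(f"${estimate_cost_usd(10 * 10):.2f}")
```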

***

### Integration Guide

#### Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")

# Scrape metadata from multiple URLs
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://github.com", "https://gitlab.com", "https://bitbucket.org"],
    "mode": "metadata"
})

dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(f"{item['url']}: {item['metadata']['title']}")
```

```python
# Extract all links from a page
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://news.ycombinator.com"],
    "mode": "links"
})

dataset = client.dataset(run["defaultDatasetId"])
for item in dataset.iterate_items():
    print(f"Found {item['count']} links")
    for link in item['links'][:10]:
        print(f"  {link['text']}: {link['url']}")
```

```python
# Custom CSS selector extraction
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://example.com"],
    "mode": "custom",
    "selector": "h1, h2, p"
})
```

#### Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });

// Full extraction
const run = await client.actor('lazymac/web-scraper-toolkit').call({
    urls: ['https://github.com'],
    mode: 'full',
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items[0].metadata.title);
console.log(`Headlines: ${items[0].headlines.length}`);
console.log(`Links: ${items[0].links.length}`);
```

```javascript
// Extract tables
const tableRun = await client.actor('lazymac/web-scraper-toolkit').call({
    urls: ['https://en.wikipedia.org/wiki/List_of_programming_languages'],
    mode: 'tables',
});

const { items: tableItems } = await client.dataset(tableRun.defaultDatasetId).listItems();
tableItems[0].tables.forEach(table => {
    console.log(`Table with ${table.rowCount} rows, headers: ${table.headers.join(', ')}`);
});
```

#### Apify API (cURL)

```bash
# Start a run
curl -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/runs" \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://example.com"], "mode": "metadata"}'

# Get results
curl "https://api.apify.com/v2/datasets/DATASET_ID/items?format=json" \
  -H "Authorization: Bearer YOUR_API_TOKEN"
```

***

### Use Cases

- **Content Monitoring**: Track headlines and text changes on competitor websites
- **Link Analysis**: Extract all internal and outbound links from a page for SEO research
- **Data Collection**: Scrape HTML tables from Wikipedia, government sites, or any public data source
- **Social Media Preview**: Check OG tags and metadata before sharing links
- **Research Automation**: Collect structured data from multiple pages in one run
- **Image Auditing**: Find all images on a page and check for missing alt text
- **Custom Extraction**: Use CSS selectors to target specific page elements for any use case
- **Price Monitoring**: Extract product prices from e-commerce pages on a schedule to track price changes
- **News Aggregation**: Scrape headlines from multiple news sources and compile a daily digest
- **Accessibility Auditing**: Extract all images and check for missing alt text across your site's pages
- **Sitemap Verification**: Scrape links from key pages to verify your internal linking structure matches your sitemap
- **Academic Research**: Collect structured data from public data portals and government websites
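For the image and accessibility audits listed above, the `images` mode output can be filtered for entries whose `alt` is null or blank. This is a minimal sketch that works on dataset items shaped like the Images Mode fields documented earlier; the helper name `images_missing_alt` is an assumption.

```python
def images_missing_alt(items):
    """Map each page URL to the image URLs that have no usable alt text."""
    report = {}
    for item in items:
        missing = [
            image["url"]
            for image in item.get("images", [])
            if not (image.get("alt") or "").strip()
        ]
        if missing:
            report[item["url"]] = missing
    return report


items = [
    {
        "url": "https://example.com",
        "images": [
            {"url": "https://example.com/logo.png", "alt": "Logo"},
            {"url": "https://example.com/hero.jpg", "alt": None},
        ],
    }
]
print(images_missing_alt(items))
```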

***

### Integration with Other Tools

#### Zapier

1. Create a Zap with your desired trigger (schedule, new spreadsheet row, webhook, etc.)
2. Add an action: **Apify -- Run Actor**
3. Select `lazymac/web-scraper-toolkit` and configure URLs and mode
4. Add downstream actions to send extracted data to Google Sheets, Slack, email, Airtable, or any Zapier-connected app
5. Map fields like `metadata.title`, `links[].url`, or `headlines[].text` to your destination columns

#### Make (Integromat)

1. Create a new scenario with the **Apify** module
2. Select "Run an Actor" and choose `lazymac/web-scraper-toolkit`
3. Use an iterator to process each scraped URL result individually
4. Route data to Google Sheets, a REST API, database, or notification service based on conditions

#### Google Sheets Integration

```python
from apify_client import ApifyClient
import gspread

# Scrape metadata from multiple pages
client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("lazymac/web-scraper-toolkit").call(run_input={
    "urls": ["https://example.com", "https://github.com"],
    "mode": "metadata"
})

dataset = client.dataset(run["defaultDatasetId"])
results = list(dataset.iterate_items())

# Write to Google Sheets using a service-account key file.
# gspread's default scopes cover both the Sheets and Drive APIs,
# which is needed to open a spreadsheet by title.
gc = gspread.service_account(filename="creds.json")
sheet = gc.open("Web Data").sheet1

sheet.append_row(["URL", "Title", "Description", "OG Image", "Language"])
for r in results:
    m = r.get("metadata", {})
    sheet.append_row([r["url"], m.get("title"), m.get("description"), m.get("ogImage"), m.get("language")])
```

#### Webhooks

Set up an Apify webhook with event `ACTOR.RUN.SUCCEEDED` to receive a notification when scraping completes. The webhook payload includes the run ID and dataset ID, allowing you to fetch results immediately from your backend.
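On the receiving side, the handler only needs to pull the run ID and dataset ID out of the JSON body. The sketch below assumes Apify's default webhook payload template, in which `eventData` carries `actorRunId` and `resource` is the run object with `id` and `defaultDatasetId`; if you customize the payload template, adjust the keys accordingly. The function name `extract_run_info` is hypothetical.

```python
def extract_run_info(payload):
    """Extract the identifiers needed to fetch results from a webhook body."""
    resource = payload.get("resource") or {}
    event_data = payload.get("eventData") or {}
    return {
        "event_type": payload.get("eventType"),
        # Prefer eventData, fall back to the run object itself
        "run_id": event_data.get("actorRunId") or resource.get("id"),
        "dataset_id": resource.get("defaultDatasetId"),
    }


payload = {
    "eventType": "ACTOR.RUN.SUCCEEDED",
    "eventData": {"actorId": "ACTOR_ID", "actorRunId": "RUN_ID"},
    "resource": {"id": "RUN_ID", "defaultDatasetId": "DATASET_ID"},
}
print(extract_run_info(payload))
```

The returned `dataset_id` can then be passed to `client.dataset(...)` as in the Python examples above.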

#### Scheduled Monitoring Pipeline

```bash
# Schedule this as a daily cron job to track headline changes
RESULT=$(curl -s -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/run-sync-get-dataset-items" \
  -H "Authorization: Bearer YOUR_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"urls": ["https://news.ycombinator.com"], "mode": "headlines"}')

echo "$RESULT" | jq '.[0].headlines[] | .text' > today_headlines.txt
diff yesterday_headlines.txt today_headlines.txt > headline_changes.txt
cp today_headlines.txt yesterday_headlines.txt
```
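The same comparison can be done in Python when you want added and removed headlines reported separately, or when no previous snapshot exists yet (the shell `diff` above expects `yesterday_headlines.txt` to be present). This is a sketch with a hypothetical helper name, operating on plain lists of headline text.

```python
def diff_headlines(previous, current):
    """Return (added, removed) headlines, preserving their original order."""
    previous_set = set(previous)
    current_set = set(current)
    added = [text for text in current if text not in previous_set]
    removed = [text for text in previous if text not in current_set]
    return added, removed


yesterday = ["Show HN: My new project", "Ask HN: Favorite editor?"]
today = ["Ask HN: Favorite editor?", "Launch HN: Another startup"]
added, removed = diff_headlines(yesterday, today)
print("Added:", added)
print("Removed:", removed)
```

On the first run, pass an empty list as `previous`; every headline is then reported as added.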

#### CI/CD Pipeline Integration

Add link validation to your deployment pipeline:

```yaml
# GitHub Actions example
- name: Check Links After Deploy
  run: |
    RESULT=$(curl -s -X POST "https://api.apify.com/v2/acts/lazymac~web-scraper-toolkit/run-sync-get-dataset-items" \
      -H "Authorization: Bearer ${{ secrets.APIFY_TOKEN }}" \
      -H "Content-Type: application/json" \
      -d '{"urls": ["${{ env.DEPLOY_URL }}"], "mode": "links"}')
    COUNT=$(echo "$RESULT" | jq '.[0].count')
    echo "Found $COUNT links on deployed page"
```

***

### Tips and Tricks

1. **Use metadata mode for quick page audits.** If you only need the title, description, and OG tags, the metadata mode is the fastest option. It skips link, image, and table extraction entirely.

2. **Combine modes across multiple runs.** Run once with `headlines` mode and once with `links` mode to get targeted datasets. This is more efficient than `full` mode if you only need specific data types.

3. **Use custom CSS selectors for precision extraction.** Instead of parsing the entire page, target exactly the elements you need. For example, `.product-price` on e-commerce pages or `article p` for blog content.

4. **Batch related URLs together.** Scrape up to 10 URLs per run to minimize API calls and overhead. Group URLs by site or purpose for cleaner dataset organization.

5. **Check the status code in results.** A 200 status means the page loaded successfully. A 301/302 means it was redirected. A 403/404 means access was denied or the page does not exist. Always filter by status in your downstream processing.

6. **Export tables directly to CSV.** The tables mode output is already structured with headers and rows, making it trivial to convert to CSV format for spreadsheet import.

7. **Use text mode for content analysis.** The text extraction removes nav, footer, header, scripts, and styles, giving you clean body content. This is ideal for word count analysis, sentiment analysis, or content comparison.

8. **Schedule regular scrapes for monitoring.** Use Apify's built-in scheduler to run the actor daily or weekly on specific pages. Track changes by comparing datasets over time.
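
The status check from tip 5 can be applied as a small filtering step before any downstream processing. The sketch below treats 2xx statuses as successes and everything else, including items with no `status` at all, as failures; that split and the helper name are assumptions, since the exact shape of error items is not documented here.

```python
def split_by_status(items):
    """Separate successfully scraped items from failed ones."""
    ok, failed = [], []
    for item in items:
        status = item.get("status")
        if isinstance(status, int) and 200 <= status < 300:
            ok.append(item)
        else:
            failed.append(item)
    return ok, failed


items = [
    {"url": "https://example.com", "status": 200},
    {"url": "https://example.com/missing", "status": 404},
    {"url": "https://unreachable.invalid"},  # e.g. a timeout with no status
]
ok, failed = split_by_status(items)
print(len(ok), "succeeded;", len(failed), "failed")
```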

***

### Frequently Asked Questions

**Q: Does this actor render JavaScript?**
A: No. It fetches raw HTML using a lightweight HTTP client. For JavaScript-rendered pages (SPAs built with React, Vue, Angular), you may not get the full content. Consider using a browser-based scraper for such sites.

**Q: What is the maximum number of URLs per run?**
A: 10 URLs per run. For larger batches, run the actor multiple times programmatically using the Apify API or schedule multiple runs.

**Q: How does the actor handle failed URLs?**
A: Each URL is processed independently. If one URL fails (timeout, DNS error, HTTP error), it is reported with an error message, and the remaining URLs continue processing normally.

**Q: Can I scrape pages behind authentication?**
A: No. The actor can only access publicly available URLs. Pages requiring login will return the login page instead of the actual content.

**Q: What CSS selectors are supported in custom mode?**
A: All standard CSS selectors are supported, including element (`div`), class (`.class`), ID (`#id`), attribute (`[href]`), combinators (`div > p`, `ul li`), and pseudo-classes (`:first-child`, `:nth-of-type(2)`). The selector is passed to cheerio, which implements the CSS Selectors Level 3 specification.

**Q: Are relative URLs in link extraction resolved to absolute?**
A: Yes. All relative URLs are automatically resolved to full absolute URLs using the page's base URL.

**Q: How does deduplication work in links mode?**
A: Links are deduplicated by URL. If the same URL appears multiple times with different anchor text, only the first occurrence is kept.

**Q: What content is removed in text mode?**
A: Script tags, style tags, `<nav>`, `<footer>`, and `<header>` elements are removed before extracting body text. This gives you the main content without navigation, boilerplate, or code.
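
To illustrate the idea, here is a minimal sketch of that kind of filtering using Python's standard `html.parser`. It is not the actor's implementation (which, per the selector FAQ above, is built on cheerio); the class name is hypothetical, and `<head>` is also skipped here as an extra assumption so that only body content remains.

```python
from html.parser import HTMLParser

# Elements whose content is dropped; "head" is an added assumption
SKIPPED_TAGS = {"script", "style", "nav", "footer", "header", "head"}


class BodyTextExtractor(HTMLParser):
    """Collect text that is not nested inside any skipped element."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return " ".join(self._chunks)


def extract_text(html):
    parser = BodyTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()


page = "<body><nav>Menu</nav><p>Hello <b>world</b></p><footer>Contact</footer></body>"
print(extract_text(page))
```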

**Q: Can I export results to CSV or Excel?**
A: Yes. Apify datasets support export to JSON, CSV, XML, and Excel formats. After the run completes, use the dataset export API or download directly from the Apify Console.

**Q: Is there a timeout per URL?**
A: Yes, each URL has a 15-second timeout. If a page does not respond within 15 seconds, it is skipped with an error message.

**Q: Can I use this actor with the Apify CLI?**
A: Yes. Install the Apify CLI (`npm install -g apify-cli`), then run: `apify call lazymac/web-scraper-toolkit -i '{"urls": ["https://example.com"], "mode": "metadata"}'`. The run executes on the Apify platform, and its results are stored in the run's default dataset.

**Q: Does the actor handle rate limiting?**
A: The actor processes URLs sequentially, which naturally avoids hitting rate limits. For sites with aggressive rate limiting, consider adding a proxy configuration or reducing the number of URLs per run.

**Q: Can I scrape PDF files or images?**
A: No. The actor is designed for HTML web pages only. It sends `Accept: text/html` headers and parses the response as HTML. Non-HTML responses (PDFs, images, JSON APIs) will either fail or return empty results.

**Q: What happens if a URL returns a redirect loop?**
A: The actor follows redirects up to a reasonable limit. If a redirect loop is detected (too many redirects), the URL is reported with an error message and processing continues with the next URL.

**Q: Can I extract data from iframes?**
A: No. The actor fetches and parses only the main page HTML. Content inside iframes (including embedded videos, maps, and third-party widgets) is not included in the extraction.

**Q: How do I scrape more than 10 URLs?**
A: Run the actor multiple times with batches of 10 URLs each. You can automate this using the Apify API in a loop, or set up multiple scheduled runs with different URL batches.

**Q: What encoding does the actor support?**
A: The actor handles UTF-8 encoded pages by default. Most modern web pages use UTF-8. Pages with other encodings (ISO-8859-1, Shift\_JIS, etc.) may have character display issues in the output.

***

### Limitations

- Does not execute JavaScript (static HTML analysis only)
- Maximum 10 URLs per run
- Cannot access authenticated or paywalled pages
- 15-second timeout per URL
- Full mode limits links to 50 and images to 20 per URL
- Custom mode returns text content only, not HTML
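
One way to work within the 10-URL limit is to split a longer list into batches and start one run per batch. The sketch below keeps the batching logic separate from the API call: `run_batch` is a hypothetical callable you supply, which receives a run input dict and returns that run's dataset items.

```python
def chunk_urls(urls, batch_size=10):
    """Split urls into consecutive lists of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    return [urls[i:i + batch_size] for i in range(0, len(urls), batch_size)]


def run_in_batches(urls, run_batch, mode="metadata"):
    """Call run_batch once per batch and concatenate the returned items."""
    results = []
    for batch in chunk_urls(list(urls)):
        results.extend(run_batch({"urls": batch, "mode": mode}))
    return results


# Stand-in for a real Actor call, used here only to show the flow
def fake_run_batch(run_input):
    return [{"url": url, "status": 200} for url in run_input["urls"]]


urls = [f"https://example.com/page/{n}" for n in range(23)]
print(len(run_in_batches(urls, fake_run_batch)))
```

With the official Python client, `run_batch` could wrap `client.actor("lazymac/web-scraper-toolkit").call(run_input=...)` followed by `iterate_items()` on the run's default dataset, as in the Integration Guide examples.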

***

### Changelog

- **v1.0** - Initial release with 8 extraction modes and pay-per-event pricing

# Actor input Schema

## `urls` (type: `array`):

JSON array of URLs to scrape (maximum 10 per run). Each URL must be publicly accessible. Example: `["https://example.com", "https://github.com"]`. Each URL is processed independently; if one fails, the others still succeed.

## `mode` (type: `string`):

What data to extract from each page. 'full' returns metadata + headlines + links + images + tables. 'metadata' returns title, description, OG tags, canonical, language. 'links' returns all unique links with anchor text. 'headlines' returns H1, H2, H3 headings. 'images' returns all images with alt text. 'tables' parses HTML tables into structured rows. 'text' returns clean body text (no scripts/styles/nav/footer). 'custom' extracts elements matching a CSS selector.

## `selector` (type: `string`):

CSS selector for custom mode extraction. Supports all standard CSS selectors: element (`div`, `p`), class (`.class-name`), ID (`#id`), attribute (`[href]`), combinators (`div > p`, `ul li`), pseudo-classes (`:first-child`, `:nth-of-type(2)`). Only used when mode is 'custom'. Examples: `.article-title`, `#main-content p`, `table.data-table tr td`.

## `timeout` (type: `integer`):

Maximum time in milliseconds to wait for each URL to respond. Default is 15000 (15 seconds). Increase for slow servers or large pages. Each URL is timed independently.

## `maxLinksPerPage` (type: `integer`):

Maximum number of links to extract per page in 'links' and 'full' modes. Default is 50 for full mode, unlimited for links mode. Set a lower value to limit output size.

## `maxImagesPerPage` (type: `integer`):

Maximum number of images to extract per page in 'images' and 'full' modes. Default is 20 for full mode, unlimited for images mode. Set a lower value to limit output size.

## `userAgent` (type: `string`):

Custom User-Agent header for HTTP requests. By default, uses a standard browser-like User-Agent. Set a custom value to simulate a specific bot or browser.

## `includeHtml` (type: `boolean`):

When enabled, includes the raw HTML of each matched element in custom mode, or the full page HTML in other modes. Useful for debugging. Disabled by default to keep output compact.

## `proxyConfiguration` (type: `object`):

Apify proxy configuration. Use residential or datacenter proxies to avoid IP blocking. Format: `{"useApifyProxy": true, "apifyProxyGroups": ["RESIDENTIAL"]}`.

## Actor input object example

```json
{
  "urls": [
    "https://example.com",
    "https://github.com"
  ],
  "mode": "links",
  "selector": ".article-title",
  "timeout": 15000,
  "maxLinksPerPage": 500,
  "maxImagesPerPage": 200,
  "includeHtml": false
}
```
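
An input object like the one above can be checked on the client side before a run is started, so that obvious mistakes do not cost a run. The sketch below enforces the documented constraints (at most 10 URLs, one of the 8 modes); requiring `selector` in `custom` mode is an assumption of this helper rather than a documented rule, and the function name is hypothetical.

```python
VALID_MODES = {
    "full", "metadata", "links", "headlines",
    "images", "tables", "text", "custom",
}
MAX_URLS_PER_RUN = 10


def build_run_input(urls, mode="full", selector=None):
    """Validate the arguments and return a run input dict."""
    urls = list(urls)
    if not urls:
        raise ValueError("at least one URL is required")
    if len(urls) > MAX_URLS_PER_RUN:
        raise ValueError(f"at most {MAX_URLS_PER_RUN} URLs are allowed per run")
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if mode == "custom" and not selector:
        raise ValueError("custom mode needs a CSS selector")  # helper's own rule
    run_input = {"urls": urls, "mode": mode}
    if mode == "custom":
        run_input["selector"] = selector  # ignored by the Actor in other modes
    return run_input


print(build_run_input(["https://example.com"], mode="custom", selector="h1, h2, p"))
```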

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://example.com"
    ],
    "mode": "links"
};

// Run the Actor and wait for it to finish
const run = await client.actor("lazymac/web-scraper-toolkit").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "urls": ["https://example.com"],
    "mode": "links",
}

# Run the Actor and wait for it to finish
run = client.actor("lazymac/web-scraper-toolkit").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://example.com"
  ],
  "mode": "links"
}' |
apify call lazymac/web-scraper-toolkit --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=lazymac/web-scraper-toolkit",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Universal Web Scraper - Extract Any URL",
        "description": "Pay-per-result web scraper with JS rendering, CSS selector / XPath / regex extraction, schema validation, retry on failure. Use for product catalogs, competitor pricing, news aggregation, lead generation. Fast (<2s/page), respects robots.txt by default.",
        "version": "1.0",
        "x-build-id": "Vq3oc9mhkcDnln8wG"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/lazymac~web-scraper-toolkit/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-lazymac-web-scraper-toolkit",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/lazymac~web-scraper-toolkit/runs": {
            "post": {
                "operationId": "runs-sync-lazymac-web-scraper-toolkit",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/lazymac~web-scraper-toolkit/run-sync": {
            "post": {
                "operationId": "run-sync-lazymac-web-scraper-toolkit",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls"
                ],
                "properties": {
                    "urls": {
                        "title": "URLs to Scrape",
                        "type": "array",
                        "description": "JSON array of URLs to scrape (maximum 10 per run). Each URL must be publicly accessible. Example: [\"https://example.com\", \"https://github.com\"]. Each URL is processed independently — if one fails, the others still succeed."
                    },
                    "mode": {
                        "title": "Extraction Mode",
                        "enum": [
                            "full",
                            "metadata",
                            "links",
                            "headlines",
                            "images",
                            "tables",
                            "text",
                            "custom"
                        ],
                        "type": "string",
                        "description": "What data to extract from each page. 'full' returns metadata + headlines + links + images + tables. 'metadata' returns title, description, OG tags, canonical, language. 'links' returns all unique links with anchor text. 'headlines' returns H1, H2, H3 headings. 'images' returns all images with alt text. 'tables' parses HTML tables into structured rows. 'text' returns clean body text (no scripts/styles/nav/footer). 'custom' extracts elements matching a CSS selector.",
                        "default": "full"
                    },
                    "selector": {
                        "title": "CSS Selector (for Custom Mode)",
                        "type": "string",
                        "description": "CSS selector for custom mode extraction. Supports all standard CSS selectors: element (div, p), class (.class-name), ID (#id), attribute ([href]), combinators (div > p, ul li), pseudo-classes (:first-child, :nth-of-type(2)). Only used when mode is 'custom'. Examples: '.article-title', '#main-content p', 'table.data-table tr td'."
                    },
                    "timeout": {
                        "title": "Request Timeout (ms)",
                        "minimum": 1000,
                        "maximum": 60000,
                        "type": "integer",
                        "description": "Maximum time in milliseconds to wait for each URL to respond. Default is 15000 (15 seconds). Increase for slow servers or large pages. Each URL is timed independently.",
                        "default": 15000
                    },
                    "maxLinksPerPage": {
                        "title": "Max Links Per Page",
                        "minimum": 1,
                        "maximum": 5000,
                        "type": "integer",
                        "description": "Maximum number of links to extract per page in 'links' and 'full' modes. Default is 50 for full mode, unlimited for links mode. Set a lower value to limit output size.",
                        "default": 500
                    },
                    "maxImagesPerPage": {
                        "title": "Max Images Per Page",
                        "minimum": 1,
                        "maximum": 2000,
                        "type": "integer",
                        "description": "Maximum number of images to extract per page in 'images' and 'full' modes. Default is 20 for full mode, unlimited for images mode. Set a lower value to limit output size.",
                        "default": 200
                    },
                    "userAgent": {
                        "title": "Custom User-Agent",
                        "type": "string",
                        "description": "Custom User-Agent header for HTTP requests. By default, uses a standard browser-like User-Agent. Set a custom value to simulate a specific bot or browser."
                    },
                    "includeHtml": {
                        "title": "Include Raw HTML",
                        "type": "boolean",
                        "description": "When enabled, includes the raw HTML of each matched element in custom mode, or the full page HTML in other modes. Useful for debugging. Disabled by default to keep output compact.",
                        "default": false
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Apify proxy configuration. Use residential or datacenter proxies to avoid IP blocking. Format: {\"useApifyProxy\": true, \"apifyProxyGroups\": [\"RESIDENTIAL\"]}."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
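
The `inputSchema` above can be checked client-side before a run is started, which avoids paying for a request the Actor would reject. The following is a minimal sketch, not the Actor's own validation logic. The helper names `build_input` and `run_sync_url` are hypothetical, and the base URL `https://api.apify.com/v2` is assumed to be the standard Apify API host. The non-empty `urls` check and the rule that `custom` mode needs a `selector` are assumptions inferred from the field descriptions. The enum values, integer bounds, and the 10-URL limit are copied from the schema. The sketch enforces only the `minimum`/`maximum` bounds and does not fill in defaults, because the `maxLinksPerPage` and `maxImagesPerPage` descriptions mention different defaults than their `default` fields.

```python
from urllib.parse import urlencode

# Assumption: the standard Apify API base URL (not stated in this section).
API_BASE = "https://api.apify.com/v2"
RUN_SYNC_PATH = "/acts/lazymac~web-scraper-toolkit/run-sync"

# Enum values of the "mode" property in inputSchema.
MODES = {"full", "metadata", "links", "headlines", "images", "tables", "text", "custom"}

# (minimum, maximum) bounds of the integer properties in inputSchema.
INT_BOUNDS = {
    "timeout": (1000, 60000),
    "maxLinksPerPage": (1, 5000),
    "maxImagesPerPage": (1, 2000),
}


def build_input(urls, mode="full", selector=None, **options):
    """Return a dict shaped like inputSchema; raise ValueError on a violation."""
    # Assumption: an empty list is treated as invalid because "urls" is required.
    if not isinstance(urls, list) or not urls:
        raise ValueError("urls must be a non-empty list")
    if len(urls) > 10:
        raise ValueError("at most 10 URLs per run")
    if mode not in MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    # Assumption: custom mode is not useful without a selector.
    if mode == "custom" and not selector:
        raise ValueError("custom mode needs a CSS selector")

    payload = {"urls": urls, "mode": mode}
    if selector is not None:
        payload["selector"] = selector

    for name, value in options.items():
        if name in INT_BOUNDS:
            low, high = INT_BOUNDS[name]
            # bool is a subclass of int in Python, so reject it explicitly.
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if not is_int or not low <= value <= high:
                raise ValueError(f"{name} must be an integer in [{low}, {high}]")
        payload[name] = value
    return payload


def run_sync_url(token):
    """Build the run-sync endpoint URL with the token as a query parameter."""
    return f"{API_BASE}{RUN_SYNC_PATH}?{urlencode({'token': token})}"
```

The returned dict can be sent as the JSON body of a `POST` request to the URL from `run_sync_url`, for example with the `apify-client` package or any HTTP library. Options other than the three bounded integers (such as `userAgent`, `includeHtml`, or `proxyConfiguration`) are passed through unchecked.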
