# Rust Scraper (`lukaskrivka/rust-scraper`) Actor

Speed-of-light scraping with the Rust programming language! This is an early alpha version for experimentation; use at your own risk!

- **URL**: https://apify.com/lukaskrivka/rust-scraper.md
- **Developed by:** [Lukáš Křivka](https://apify.com/lukaskrivka) (community)
- **Categories:** Developer tools, Open source
- **Stats:** 60 total users, 0 monthly users, 100.0% runs succeeded, 3 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use; you only pay for Apify platform usage, which gets cheaper the higher your subscription plan.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

<!-- toc start -->
### Rust Scraper

<!-- toc end -->
**This is a very early version for experimentation. Use at your own risk!**

Speed-of-light scraping with the Rust programming language. This is meant to be a faster (but less flexible) version of Apify's JavaScript-based [Cheerio Scraper](https://apify.com/apify/cheerio-scraper).

Rust is one of the fastest programming languages out there. In many cases, it matches the speed of C. Although JavaScript offers huge flexibility and development speed, we can use Rust to significantly speed up crawling and/or reduce costs. Rust Scraper is both faster and requires less memory.

#### Changelog
You can read about fixes and updates in the detailed [changelog file](https://github.com/metalwarrior665/actor-rust-scraper/blob/master/CHANGELOG.md).

#### WARNING!!! Don't DDoS a website!
Because this scraper is so fast, you can easily take a website down. This matters especially if you scrape **more than a few hundred URLs** and use the **async** scraping mode.
How to prevent that:
- Set the `max_concurrency` input field to a reasonable value. You can still scrape very fast and with a tiny memory footprint if you set it below `10`.
- If you want to set a high `max_concurrency`, only scrape large websites that can handle a load of 1000 requests/second or more.
- Use a large pool of proxies so they are not immediately banned.

**If we see you abusing this scraper for attacks on the Apify platform, your account can be banned**.
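
To illustrate what a concurrency cap like `max_concurrency` achieves, here is a minimal Python sketch of bounded concurrency using `asyncio.Semaphore`. It is not the Actor's Rust implementation; `fetch_all` and `fake_fetch` are hypothetical names, and `fake_fetch` only simulates a request.

```python
import asyncio


async def fetch_all(urls, fetch, max_concurrency=10):
    """Run `fetch(url)` for every URL with at most `max_concurrency` in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded_fetch(url):
        # Wait for a free slot so the target site never sees more than
        # `max_concurrency` simultaneous requests from this process.
        async with semaphore:
            return await fetch(url)

    # gather() returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded_fetch(url) for url in urls))


async def fake_fetch(url):
    # Hypothetical stand-in for a real HTTP request.
    await asyncio.sleep(0.01)
    return f"fetched {url}"


urls = [f"https://example.com/{i}" for i in range(5)]
pages = asyncio.run(fetch_all(urls, fake_fetch, max_concurrency=2))
```

With `max_concurrency=2`, the five simulated requests run at most two at a time, which is the kind of throttling the advice above relies on.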

#### Why is it faster/cheaper than Cheerio Scraper?
Rust is a statically typed language compiled directly into machine code. Because of this, the compiler can optimize the code into highly efficient structures and algorithms. Of course, it is also the programmer's job to write efficient code, so we expect further improvements to this scraper.

- HTML parsing is about 3 times faster because of efficient data structures.
- HTTP requests are also faster.
- Very efficient async implementation with futures (promises in JS).
- Can offload work to other CPU cores via system threads and scales to the full Actor memory (native JS doesn't support user-created threads).
- Much lower memory usage due to efficient data structures.

#### Limitations of this Actor (some will be solved in the future)
- This Actor only works for scraping pure HTML websites (basically an alternative to [Cheerio Scraper](https://apify.com/apify/cheerio-scraper)).
- You can only provide a static list of URLs; it cannot enqueue any more.
- It doesn't have a page function, only a simplified interface (the `extract` object) to define what should be scraped.
- Retries are very simplistic.
- It doesn't have a sophisticated concurrency system. It will grow to `max_concurrency` unless CPU gets overwhelmed.
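
The README does not describe the retry logic in detail. As an assumption only, a "simplistic" strategy could mean retrying a failed request immediately, up to `max_request_retries` times, with no backoff. The Python sketch below shows that idea; `fetch_with_retries` is a hypothetical helper, not part of the Actor.

```python
def fetch_with_retries(fetch, url, max_request_retries=3):
    """Call `fetch(url)`, retrying immediately on failure.

    Makes one initial attempt plus up to `max_request_retries` retries,
    then re-raises the last error if every attempt failed.
    """
    last_error = None
    for _attempt in range(max_request_retries + 1):
        try:
            return fetch(url)
        except Exception as error:  # a real scraper would catch narrower errors
            last_error = error
    raise last_error
```

A more robust scheme would add delays between attempts and distinguish permanent failures (such as a 404) from temporary ones.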

#### Input
The input is a JSON object whose properties are explained in detail on the [Apify Store page](https://apify.com/lukaskrivka/rust-scraper/input-schema). You can also set it up on the Apify platform with a nice UI.

#### Data extraction
You need to provide an [extraction configuration object](https://apify.com/lukaskrivka/rust-scraper/input-schema#extract). This object defines the selectors to find on the page, what to extract from each selector, and the names of the fields the data should be saved as.

`extract` (array) is an array of objects, each with the following properties:
- `field_name` (string): The field in the resulting dataset that the extracted data is assigned to.
- `selector` (string): CSS selector used to find the data to extract.
- `extract_type` (object): What to extract.
    - `type` (string): Either `Text` or `Attribute`.
    - `content` (string): Name of the attribute to extract, e.g. `href`. Provide only when `type` is `Attribute`.
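
If you build the input in code, a small helper can keep the `extract` entries consistent with the shape described above. This is a minimal Python sketch; `make_extract_rule` is a hypothetical helper, not part of any Apify library.

```python
def make_extract_rule(field_name, selector, attribute=None):
    """Build one entry of the `extract` array.

    With `attribute=None` the element's text is extracted (`type: "Text"`);
    otherwise the named attribute is extracted (`type: "Attribute"`).
    """
    if attribute is None:
        extract_type = {"type": "Text"}
    else:
        extract_type = {"type": "Attribute", "content": attribute}
    return {
        "field_name": field_name,
        "selector": selector,
        "extract_type": extract_type,
    }


extract = [
    make_extract_rule("title", "#productTitle"),
    make_extract_rule("seller_link", "#bylineInfo", attribute="href"),
]
```

The resulting `extract` list can be placed directly under the `extract` key of the Actor input.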

Full input example:
```json
{
  "proxy_settings": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["SHADER"]
  },
  "urls": [
    { "url": "https://www.amazon.com/dp/B01CYYU8YW" },
    { "url": "https://www.amazon.com/dp/B01FXMDA2O" },
    { "url": "https://www.amazon.com/dp/B00UNT0Y2M" }
  ],
  "extract": [
    {
      "field_name": "title",
      "selector": "#productTitle",
      "extract_type": {
        "type": "Text"
      }
    },
    {
      "field_name": "customer_reviews",
      "selector": "#acrCustomerReviewText",
      "extract_type": {
        "type": "Text"
      }
    },
    {
      "field_name": "seller_link",
      "selector": "#bylineInfo",
      "extract_type": {
        "type": "Attribute",
        "content": "href"
      }
    }
  ]
}
```

Output example in JSON (this depends purely on your `extract` config):
```json
[
  {
    "seller_link": "/Propack/b/ref=bl_dp_s_web_3039360011?ie=UTF8&node=3039360011&field-lbr_brands_browse-bin=Propack",
    "customer_reviews": "208 customer reviews",
    "title": "Propack Twist - Tie Gallon Size Storage Bags 100 Bags Pack Of 4"
  },
  {
    "seller_link": "/Ziploc/b/ref=bl_dp_s_web_2581449011?ie=UTF8&node=2581449011&field-lbr_brands_browse-bin=Ziploc",
    "customer_reviews": "561 customer reviews",
    "title": "Ziploc Gallon Slider Storage Bags, 96 Count"
  },
  {
    "seller_link": "/Reynolds/b/ref=bl_dp_s_web_2599601011?ie=UTF8&node=2599601011&field-lbr_brands_browse-bin=Reynolds",
    "customer_reviews": "456 customer reviews",
    "title": "Reynolds Wrap Aluminum Foil (200 Square Foot Roll)"
  }
]
```

#### Local usage
You can run this locally if you have Rust installed. You need to build it before running. If you want to use Apify Proxy, don't forget to set the `APIFY_PROXY_PASSWORD` environment variable, otherwise you will get a nasty error.

# Actor input Schema

## `urls` (type: `array`):

URLs that will be scraped. Must be an array of objects with a `url` property.

## `extract` (type: `array`):

Array that defines what should be scraped from a page's HTML and how. See the README for more info.

## `proxy_settings` (type: `object`):

Select proxies to be used by your crawler. For most use cases we recommend the default Apify automatic proxy.

## `max_concurrency` (type: `integer`):

Sets the maximum concurrency (parallelism) for the crawl. Keep this at a reasonable level because this scraper can go really fast.

## `max_request_retries` (type: `integer`):

Sets the maximum number of retries for each request (URL).

## `debug_log` (type: `boolean`):

Shows when each URL starts and ends scraping with timings. Don't use for larger runs as the log gets filled quickly.

## `push_data_size` (type: `integer`):

Buffers results into a vector (array) before pushing to a dataset. This prevents overwhelming the Apify API. The default number is usually a good choice.

## `force_cloud` (type: `boolean`):

This allows local runs to use cloud storage, mainly for testing. On the Apify platform this has no effect.
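
The buffering described for `push_data_size` can be pictured with a short Python sketch. This is an illustration of the general batching idea, not the Actor's Rust code; `DatasetBuffer` and its `push` callback are hypothetical.

```python
class DatasetBuffer:
    """Collect results and push them in batches of `push_data_size`."""

    def __init__(self, push, push_data_size=500):
        self._push = push  # callable that receives a list of items
        self._size = push_data_size
        self._items = []

    def add(self, item):
        self._items.append(item)
        if len(self._items) >= self._size:
            self.flush()

    def flush(self):
        # Push whatever is buffered; call once more at the end of the run
        # so a final, partially filled batch is not lost.
        if self._items:
            self._push(self._items)
            self._items = []
```

A larger buffer means fewer API calls but more results held in memory (and potentially lost if the run crashes before a flush).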

## Actor input object example

```json
{
  "urls": [
    {
      "url": "http://example.com"
    }
  ],
  "extract": [
    {
      "field_name": "title",
      "selector": "h1",
      "extract_type": {
        "type": "Text"
      }
    },
    {
      "field_name": "description",
      "selector": "p",
      "extract_type": {
        "type": "Text"
      }
    }
  ],
  "proxy_settings": {
    "useApifyProxy": true
  },
  "max_concurrency": 50,
  "max_request_retries": 3,
  "debug_log": false,
  "push_data_size": 500,
  "force_cloud": false
}
````

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        {
            "url": "http://example.com"
        }
    ],
    "extract": [
        {
            "field_name": "title",
            "selector": "h1",
            "extract_type": {
                "type": "Text"
            }
        },
        {
            "field_name": "description",
            "selector": "p",
            "extract_type": {
                "type": "Text"
            }
        }
    ],
    "proxy_settings": {
        "useApifyProxy": true
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("lukaskrivka/rust-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "urls": [{ "url": "http://example.com" }],
    "extract": [
        {
            "field_name": "title",
            "selector": "h1",
            "extract_type": { "type": "Text" },
        },
        {
            "field_name": "description",
            "selector": "p",
            "extract_type": { "type": "Text" },
        },
    ],
    "proxy_settings": { "useApifyProxy": True },
}

# Run the Actor and wait for it to finish
run = client.actor("lukaskrivka/rust-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    {
      "url": "http://example.com"
    }
  ],
  "extract": [
    {
      "field_name": "title",
      "selector": "h1",
      "extract_type": {
        "type": "Text"
      }
    },
    {
      "field_name": "description",
      "selector": "p",
      "extract_type": {
        "type": "Text"
      }
    }
  ],
  "proxy_settings": {
    "useApifyProxy": true
  }
}' |
apify call lukaskrivka/rust-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=lukaskrivka/rust-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Rust Scraper",
        "description": "Speed of light scraping with Rust programming language! This is an early alpha version for experimenting, use at your own risk!",
        "version": "0.0",
        "x-build-id": "wPCaLsaZAUU5C7IRT"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/lukaskrivka~rust-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-lukaskrivka-rust-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/lukaskrivka~rust-scraper/runs": {
            "post": {
                "operationId": "runs-sync-lukaskrivka-rust-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/lukaskrivka~rust-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-lukaskrivka-rust-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls",
                    "extract"
                ],
                "properties": {
                    "urls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "URLs that will be scraped. Must be an array of objects with \"url\" property.",
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "extract": {
                        "title": "Extraction config",
                        "type": "array",
                        "description": "Array that defines what and how should be scraped from a page HTML. See readme for more info."
                    },
                    "proxy_settings": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Select proxies to be used by your crawler. For most use cases we recommend the default Apify automatic proxy."
                    },
                    "max_concurrency": {
                        "title": "Max concurrency",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Sets the maximum concurrency (parallelism) for the crawl. Keep this is reasonable level because this scraper can go really fast.",
                        "default": 50
                    },
                    "max_request_retries": {
                        "title": "Max request retries",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Sets the maximum number of retries for each request(URL).",
                        "default": 3
                    },
                    "debug_log": {
                        "title": "Debug log",
                        "type": "boolean",
                        "description": "Shows when each URL starts and ends scraping with timings. Don't use for larger runs as the log gets filled quickly.",
                        "default": false
                    },
                    "push_data_size": {
                        "title": "Push data buffer size",
                        "type": "integer",
                        "description": "Buffers results into vector (array) before pushing to a dataset. This prevents overwhelming Apify API. The default number is usually a good choice.",
                        "default": 500
                    },
                    "force_cloud": {
                        "title": "Force cloud",
                        "type": "boolean",
                        "description": "This allows local runs to use cloud storage, mainly for testing. On Apify platform this has no effect.",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
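
For projects without a client library, the `run-sync-get-dataset-items` path from the specification above can be called directly. The following Python sketch only builds the request with the standard library and does not send it; `build_run_request` is a hypothetical helper, and the assumption that the response body is a JSON array of dataset items is based on the endpoint summary.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "lukaskrivka~rust-scraper"


def build_run_request(token, actor_input):
    """Build a POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items?{query}"
    body = json.dumps(actor_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "<YOUR_API_TOKEN>",
    {
        "urls": [{"url": "http://example.com"}],
        "extract": [
            {"field_name": "title", "selector": "h1", "extract_type": {"type": "Text"}}
        ],
    },
)

# To actually start a run (this consumes platform usage), send the request:
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```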
