# Shopify Scraper (`pocesar/shopify-scraper`) Actor

Automate price monitoring on Shopify, the most popular solution for building online stores and selling products online. Crawl arbitrary Shopify-powered online stores and extract a list of all products in a structured form, including product title, price, description, etc.

- **URL**: https://apify.com/pocesar/shopify-scraper.md
- **Developed by:** [Paulo Cesar](https://apify.com/pocesar) (community)
- **Categories:** E-commerce, Open source
- **Stats:** 2,289 total users, 26 monthly users, 100.0% runs succeeded, 41 bookmarks
- **User rating**: 1.00 out of 5 stars

## Pricing

Pay per usage

This Actor is paid per platform usage. The Actor itself is free to use, and you only pay for the Apify platform usage, which gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

### What does Shopify Scraper do?

Using this tool, you can automate price monitoring on Shopify, the most popular solution for building online stores and selling products online. Crawl arbitrary Shopify-powered online stores and extract a list of all products in a structured form, including product title, price, description, etc.

### Need to find product pairs between Shopify and another online shop?

Use the [AI Product Matcher](https://apify.com/equidem/ai-product-matcher). This AI model allows you to compare items from different web stores, identifying exact matches and comparing real-time data obtained via web scraping. With the AI Product Matcher, you can use scraped product data to monitor product matches across the industry, implement dynamic pricing for your website, replace or complement manual mapping, and obtain realistic estimates against your competition for upcoming promo campaigns. 

Most importantly, it is relatively easy to get started with (just follow [this guide](https://blog.apify.com/product-matching-ai-pricing-intelligence-web-scraping/)) and it can match thousands of product pairs.

### Extend Scraper and Output Function

The Extend Output Function lets you filter or modify the items that are output:

```js
async ({ item, customData }) => {
    if (!item.title.includes('cuisine')) {
        return null; // omit the output
    }

    delete item.additional; // remove data from output

    item.requestId = customData.requestId; // add data from the outside

    return item;
}
````

The Extend Scraper Function lets you interact with the scraper phases:

```js
async ({ label, url, filter, fns, filteredSitemapUrls, customData }) => {
    switch (label) {
        case 'FILTER_SITEMAP_URL': {
            // product url, like .../products/cooking-for-dummies-2002-289854
            filter(
                url.includes('cooking') || url.includes(customData.filter)
            );
            break;
        }
        case 'SETUP': {
            // filteredSitemapUrls is a `Set` instance and can be edited in-place
            filteredSitemapUrls.add('https://example.com/secret-unlisted-sitemap.xml');
            filteredSitemapUrls.forEach((sitemapURL) => {
                if (!sitemapURL.includes('en-us')) {
                    filteredSitemapUrls.delete(sitemapURL);
                }
            });
            break;
        }
    }
}
```
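Both functions reach the Actor as strings inside the input JSON, and `customData` is how outside values such as `requestId` or `filter` become visible to them. The following is a minimal Python sketch of assembling such an input. The `build_input` helper and the sample values (`job-42`, `cooking`) are hypothetical; only the field names come from the input schema.

```python
import json

# JavaScript source is sent as plain strings inside the input JSON.
EXTEND_OUTPUT = """async ({ item, customData }) => {
    item.requestId = customData.requestId;
    return item;
}"""

EXTEND_SCRAPER = """async ({ label, url, filter, customData }) => {
    if (label === 'FILTER_SITEMAP_URL') {
        filter(url.includes(customData.filter));
    }
}"""


def build_input(shop_url, request_id, url_filter):
    """Hypothetical helper: bundle both functions with the data they read."""
    return {
        "startUrls": [{"url": shop_url}],
        "extendOutputFunction": EXTEND_OUTPUT,
        "extendScraperFunction": EXTEND_SCRAPER,
        # Everything placed here is exposed to both functions as `customData`.
        "customData": {"requestId": request_id, "filter": url_filter},
    }


run_input = build_input("https://www.decathlon.com", "job-42", "cooking")
payload = json.dumps(run_input)  # the JSON document sent as the Actor input
```

Keeping the JavaScript in module-level constants makes it easy to reuse the same functions across runs while varying only `customData`.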

### License

Apache 2.0

# Actor input Schema

## `startUrls` (type: `array`):

Provide Shopify shop URLs as the starting point

## `maxRequestsPerCrawl` (type: `integer`):

Maximum number of items to scrape. Set it to 0 to scrape everything.

## `proxyConfig` (type: `object`):

Use automatic Apify proxies, residential proxies, or your own.

## `checkForBanner` (type: `boolean`):

Ensure that the remote robots.txt file contains the Shopify keyword.

## `extendOutputFunction` (type: `string`):

Add or remove properties on the output object, or omit the output by returning null.

## `extendScraperFunction` (type: `string`):

Advanced function that extends the default scraper functionality and lets you manually perform actions on the page.

## `customData` (type: `object`):

Any data that you want to have available inside the Extend Output/Scraper Function

## `fetchHtml` (type: `boolean`):

If you decide to fetch the HTML of the pages, it will take twice as long. Make sure to only enable this if needed

## `maxConcurrency` (type: `integer`):

Max concurrency to use

## `maxRequestRetries` (type: `integer`):

Set the max request retries

## `debugLog` (type: `boolean`):

Enable more verbose logging to help you understand what's happening during the scraping.

## Actor input object example

```json
{
  "startUrls": [
    {
      "url": "https://www.decathlon.com"
    }
  ],
  "maxRequestsPerCrawl": 10,
  "proxyConfig": {
    "useApifyProxy": true
  },
  "checkForBanner": true,
  "extendOutputFunction": "async ({ data, item, product, images, fns, name, request, variants, context, customData, input, Apify }) => {\n  return item;\n}",
  "extendScraperFunction": "async ({ fns, customData, Apify, label }) => {\n \n}",
  "customData": {},
  "fetchHtml": false,
  "maxConcurrency": 10,
  "maxRequestRetries": 3,
  "debugLog": false
}
```
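A client-side check can catch mistyped fields before a run is started. The sketch below merges overrides onto the defaults listed in the schema and verifies the JSON type of every field. The `prepare_input` helper is hypothetical and not part of any Apify library; the defaults and types are copied from the OpenAPI specification in this document.

```python
import copy

# Defaults copied from the input schema (the two function fields are omitted).
DEFAULTS = {
    "startUrls": [],
    "maxRequestsPerCrawl": 10,
    "proxyConfig": {"useApifyProxy": True},
    "checkForBanner": True,
    "customData": {},
    "fetchHtml": False,
    "maxConcurrency": 10,
    "maxRequestRetries": 3,
    "debugLog": False,
}

# Python equivalents of the JSON types declared in the schema.
EXPECTED_TYPES = {
    "startUrls": list,
    "maxRequestsPerCrawl": int,
    "proxyConfig": dict,
    "checkForBanner": bool,
    "extendOutputFunction": str,
    "extendScraperFunction": str,
    "customData": dict,
    "fetchHtml": bool,
    "maxConcurrency": int,
    "maxRequestRetries": int,
    "debugLog": bool,
}


def prepare_input(overrides):
    """Hypothetical helper: merge overrides onto defaults and check types."""
    merged = {**copy.deepcopy(DEFAULTS), **overrides}
    for key, value in merged.items():
        expected = EXPECTED_TYPES.get(key)
        if expected is None:
            raise KeyError(f"unknown input field: {key}")
        # bool is a subclass of int in Python, so reject it for integer fields.
        if expected is int and isinstance(value, bool):
            raise TypeError(f"{key} must be an integer")
        if not isinstance(value, expected):
            raise TypeError(f"{key} must be of type {expected.__name__}")
    return merged


run_input = prepare_input({
    "startUrls": [{"url": "https://www.decathlon.com"}],
    "maxRequestsPerCrawl": 0,  # 0 means "scrape everything" per the schema
})
```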

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        {
            "url": "https://www.decathlon.com"
        }
    ],
    "maxRequestsPerCrawl": 10,
    "proxyConfig": {
        "useApifyProxy": true
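    },
    // Both functions are passed as strings: the input schema types them as
    // `string`, and function values are omitted by JSON.stringify.
    "extendOutputFunction": `async ({ data, item, product, images, fns, name, request, variants, context, customData, input, Apify }) => {
  return item;
}`,
    "extendScraperFunction": `async ({ fns, customData, Apify, label }) => {

}`,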
    "customData": {},
    "maxConcurrency": 10,
    "maxRequestRetries": 3
};

// Run the Actor and wait for it to finish
const run = await client.actor("pocesar/shopify-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [{ "url": "https://www.decathlon.com" }],
    "maxRequestsPerCrawl": 10,
    "proxyConfig": { "useApifyProxy": True },
    "extendOutputFunction": """async ({ data, item, product, images, fns, name, request, variants, context, customData, input, Apify }) => {
  return item;
}""",
    "extendScraperFunction": """async ({ fns, customData, Apify, label }) => {
 
}""",
    "customData": {},
    "maxConcurrency": 10,
    "maxRequestRetries": 3,
}

# Run the Actor and wait for it to finish
run = client.actor("pocesar/shopify-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    {
      "url": "https://www.decathlon.com"
    }
  ],
  "maxRequestsPerCrawl": 10,
  "proxyConfig": {
    "useApifyProxy": true
  },
  "extendOutputFunction": "async ({ data, item, product, images, fns, name, request, variants, context, customData, input, Apify }) => {\\n  return item;\\n}",
  "extendScraperFunction": "async ({ fns, customData, Apify, label }) => {\\n \\n}",
  "customData": {},
  "maxConcurrency": 10,
  "maxRequestRetries": 3
}' |
apify call pocesar/shopify-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=pocesar/shopify-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Shopify Scraper",
        "description": "Automate monitoring prices on the most popular solution for building online stores and selling products online. Crawl arbitrary Shopify-powered online stores and extract a list of all products in a structured form, including product title, price, description, etc.",
        "version": "0.0",
        "x-build-id": "DtIrdlTLpaaZqtlda"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/pocesar~shopify-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-pocesar-shopify-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/pocesar~shopify-scraper/runs": {
            "post": {
                "operationId": "runs-sync-pocesar-shopify-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/pocesar~shopify-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-pocesar-shopify-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrls": {
                        "title": "Start URLs",
                        "type": "array",
                        "description": "Provide Shopify shop URLs as the starting point",
                        "default": [],
                        "items": {
                            "type": "object",
                            "required": [
                                "url"
                            ],
                            "properties": {
                                "url": {
                                    "type": "string",
                                    "title": "URL of a web page",
                                    "format": "uri"
                                }
                            }
                        }
                    },
                    "maxRequestsPerCrawl": {
                        "title": "Max items",
                        "type": "integer",
                        "description": "Maximum number of items to scrape. Set it to 0 to scrape everything.",
                        "default": 10
                    },
                    "proxyConfig": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Use either automatic Apify proxies, Residentials or your own.",
                        "default": {
                            "useApifyProxy": true
                        }
                    },
                    "checkForBanner": {
                        "title": "Check for Shopify on robots",
                        "type": "boolean",
                        "description": "Ensure that the remote robots.txt file contains the Shopify keyword.",
                        "default": true
                    },
                    "extendOutputFunction": {
                        "title": "Extend Output Function",
                        "type": "string",
                        "description": "Add or remove properties on the output object or omit the output returning null",
                        "default": "async ({ data, item, product, images, fns, name, request, variants, context, customData, input, Apify }) => {\n  return item;\n}"
                    },
                    "extendScraperFunction": {
                        "title": "Extend Scraper Function",
                        "type": "string",
                        "description": "Advanced function that allows you to extend the default scraper functionality, allowing you to manually perform actions on the page",
                        "default": "async ({ fns, customData, Apify, label }) => {\n \n}"
                    },
                    "customData": {
                        "title": "Custom data",
                        "type": "object",
                        "description": "Any data that you want to have available inside the Extend Output/Scraper Function",
                        "default": {}
                    },
                    "fetchHtml": {
                        "title": "Fetch HTML",
                        "type": "boolean",
                        "description": "If you decide to fetch the HTML of the pages, it will take twice as long. Make sure to only enable this if needed",
                        "default": false
                    },
                    "maxConcurrency": {
                        "title": "Max concurrency",
                        "type": "integer",
                        "description": "Max concurrency to use",
                        "default": 10
                    },
                    "maxRequestRetries": {
                        "title": "Max request retries",
                        "type": "integer",
                        "description": "Set the max request retries",
                        "default": 3
                    },
                    "debugLog": {
                        "title": "Debug Log",
                        "type": "boolean",
                        "description": "Enable a more verbose logging to be able to understand what's happening during the scraping",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
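For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification above can be exercised with the Python standard library alone. This is a sketch under assumptions: the `build_request` and `run_and_fetch_items` helpers are hypothetical, the 300-second timeout is an arbitrary choice, and the response body is assumed to be JSON.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/pocesar~shopify-scraper/run-sync-get-dataset-items"


def build_request(token, run_input):
    """Build (but do not send) the POST request described by the spec."""
    query = urllib.parse.urlencode({"token": token})  # token is a query parameter
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_and_fetch_items(token, run_input, timeout=300):
    """Send the request and decode the JSON response body."""
    request = build_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request performs no network I/O; call run_and_fetch_items
# with a real token to actually start a run.
request = build_request(
    "<YOUR_API_TOKEN>",
    {"startUrls": [{"url": "https://www.decathlon.com"}], "maxRequestsPerCrawl": 10},
)
```

Separating request construction from sending keeps the URL and payload easy to inspect or log before any platform usage is incurred.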
