# Check Page Sizes (`zerobreak/check-page-sizes`) Actor

Page size checker that crawls any website and flags HTML pages over 2MB or PDFs over 64MB, the exact thresholds where Google stops indexing — so SEO teams can fix oversized files before they drop from search.

- **URL**: https://apify.com/zerobreak/check-page-sizes.md
- **Developed by:** [ZeroBreak](https://apify.com/zerobreak) (community)
- **Categories:** SEO tools
- **Stats:** 2 total users, 0 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

$4.99/month + usage

To use this Actor, you pay a monthly rental fee to the developer. The rent is subtracted from your prepaid usage every month after the free trial period. You also pay for Apify platform usage, which gets cheaper the higher your Apify subscription plan.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#rental-actors

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## Page Size Checker: Audit HTML and PDF sizes for Google indexing

Google won't index HTML pages over 2 MB or PDF files over 64 MB. Most sites are fine. But content-heavy sites, documentation hubs, and large PDF libraries can get caught off guard, and you won't know until pages stop showing up in search. This Actor crawls every internal page on your site, measures the actual content size, and flags anything that exceeds Google's limits.

### Use cases

- **SEO audits**: rule out page size as a reason Google stopped indexing certain pages
- **Documentation sites**: check whether long-form content pages are pushing past the 2 MB limit
- **PDF libraries**: find oversized PDF files before they fall outside Google's 64 MB indexing range
- **Pre-launch checks**: run a size audit before deploying a new site or major content update
- **Ongoing monitoring**: schedule regular runs to catch newly added pages that grow too large

### Input

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| `startUrl` | string | (required) | The website URL to start crawling from |
| `maxUrls` | integer | 100 | Maximum number of pages to check (1 to 1000) |
| `checkPdfs` | boolean | true | Also check linked PDF files |
| `htmlSizeLimitMb` | number | 2 | Flag HTML pages above this size in MB |
| `pdfSizeLimitMb` | number | 64 | Flag PDF files above this size in MB |
| `requestTimeoutSecs` | integer | 30 | Per-request timeout in seconds (5 to 120) |

#### Example input

```json
{
    "startUrl": "https://apify.com",
    "maxUrls": 500,
    "checkPdfs": true,
    "htmlSizeLimitMb": 2,
    "pdfSizeLimitMb": 64
}
````
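
Before starting a run, it can help to check an input object locally. The sketch below assumes the ranges published in this Actor's OpenAPI specification (`maxUrls` 1 to 1000, `requestTimeoutSecs` 5 to 120, size limits of at least 0.1). `validate_input`, `DEFAULTS`, and `RANGES` are hypothetical names introduced for illustration; they are not part of the Actor or the Apify client.

```python
# Hypothetical local check; defaults and ranges copied from the Actor's input schema.
DEFAULTS = {
    "maxUrls": 100,
    "checkPdfs": True,
    "htmlSizeLimitMb": 2,
    "pdfSizeLimitMb": 64,
    "requestTimeoutSecs": 30,
}

# (minimum, maximum) per numeric field; None means no documented upper bound.
RANGES = {
    "maxUrls": (1, 1000),
    "htmlSizeLimitMb": (0.1, None),
    "pdfSizeLimitMb": (0.1, None),
    "requestTimeoutSecs": (5, 120),
}


def validate_input(actor_input: dict) -> dict:
    """Return the input merged with defaults, or raise ValueError."""
    start_url = actor_input.get("startUrl")
    if not isinstance(start_url, str) or not start_url:
        raise ValueError("startUrl is required and must be a non-empty string")
    unknown = set(actor_input) - set(DEFAULTS) - {"startUrl"}
    if unknown:
        raise ValueError(f"unknown fields: {sorted(unknown)}")
    merged = {**DEFAULTS, **actor_input}
    for field, (low, high) in RANGES.items():
        value = merged[field]
        if value < low or (high is not None and value > high):
            raise ValueError(f"{field}={value} is outside the documented range")
    return merged
```

Calling `validate_input({"startUrl": "https://apify.com", "maxUrls": 500})` returns the full input with the remaining defaults filled in, while `maxUrls: 5000` raises a `ValueError`.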

### Output

The Actor stores one record per page in a dataset. Each entry includes:

```json
{
    "url": "https://apify.com/blog/web-scraping-guide",
    "pageType": "html",
    "sizeBytes": 2458624,
    "sizeMb": 2.345,
    "limitMb": 2,
    "exceedsLimit": true,
    "statusCode": 200,
    "scrapedAt": "2025-06-01T12:34:56.789Z"
}
```

| Field | Type | Description |
|-------|------|-------------|
| `url` | string | Final URL after any redirects |
| `pageType` | string | `html` or `pdf` |
| `sizeBytes` | integer | Decompressed page size in bytes |
| `sizeMb` | number | Page size in megabytes, rounded to 3 decimal places |
| `limitMb` | number | Applicable Google indexing limit in MB |
| `exceedsLimit` | boolean | True if the page exceeds the limit |
| `statusCode` | integer | HTTP response status code |
| `error` | string | Error message if the page could not be fetched |
| `scrapedAt` | string | ISO 8601 timestamp |
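
The sample record above is consistent with `sizeMb` being `sizeBytes` divided by 1,048,576 (1024 × 1024) and rounded to three decimals. The sketch below reproduces that arithmetic. The divisor and the strict "greater than" comparison are assumptions inferred from the sample and from the word "above" in the input table, not confirmed Actor internals; `size_mb` and `exceeds_limit` are hypothetical helper names.

```python
BYTES_PER_MB = 1024 * 1024  # assumption: binary megabytes, inferred from the sample record


def size_mb(size_bytes: int) -> float:
    """Convert a byte count to megabytes, rounded to 3 decimal places."""
    return round(size_bytes / BYTES_PER_MB, 3)


def exceeds_limit(size_bytes: int, limit_mb: float) -> bool:
    """Assumed rule: flag only sizes strictly above the limit."""
    return size_bytes > limit_mb * BYTES_PER_MB


print(size_mb(2458624))           # 2.345, matching the sample record
print(exceeds_limit(2458624, 2))  # True
```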

### How it works

1. The Actor starts at the URL you provide and fetches the page
2. It measures the full decompressed content size using the response body
3. For HTML pages, it extracts all internal links and adds them to the crawl queue
4. PDF files linked from those pages are optionally checked against the 64 MB limit
5. Results are pushed to the dataset as each page is checked, with `exceedsLimit: true` for any pages over the limit
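
The five steps above can be sketched as a small breadth-first crawler. This is an illustration, not the Actor's source: the `crawl` function, the `LinkParser` class, and the injected `fetch` callable (which stands in for an HTTP client that returns an already decompressed body) are all assumptions, and the sketch omits timeouts, redirects, and error records.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin, urlparse


class LinkParser(HTMLParser):
    """Collects href values from <a> tags."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href" and value)


def crawl(start_url, fetch, max_urls=100, html_limit_mb=2, pdf_limit_mb=64, check_pdfs=True):
    """Breadth-first size check. `fetch(url)` must return (status, content_type, body_bytes)."""
    host = urlparse(start_url).netloc
    queue, seen, records = deque([start_url]), {start_url}, []
    while queue and len(records) < max_urls:
        url = queue.popleft()
        status, content_type, body = fetch(url)
        is_pdf = "pdf" in content_type
        if is_pdf and not check_pdfs:
            continue  # PDF checking is optional; skip without recording
        limit_mb = pdf_limit_mb if is_pdf else html_limit_mb
        records.append({
            "url": url,
            "pageType": "pdf" if is_pdf else "html",
            "sizeBytes": len(body),  # fetch is expected to hand back decompressed bytes
            "limitMb": limit_mb,
            "exceedsLimit": len(body) > limit_mb * 1024 * 1024,
            "statusCode": status,
        })
        if is_pdf:
            continue  # only HTML pages contribute links
        parser = LinkParser()
        parser.feed(body.decode("utf-8", errors="replace"))
        for href in parser.hrefs:
            link = urldefrag(urljoin(url, href)).url  # resolve relative links, drop #fragment
            if urlparse(link).netloc == host and link not in seen:
                seen.add(link)
                queue.append(link)
    return records
```

Passing `fetch` in as a parameter keeps the crawl logic testable without network access; in practice it would wrap whichever HTTP client you use.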

### FAQ

**Does Google actually stop indexing large pages?**

Yes. Google updated its indexing rules to skip HTML files over 2 MB and PDFs over 64 MB. Both are fairly high limits. Most sites won't hit them. But large CMS exports, documentation pages, or auto-generated reports occasionally push past 2 MB.

**Does this Actor check JavaScript-rendered content?**

No. It measures the raw HTML size served by the server, which is what Google's crawler sees. JavaScript that expands the DOM after load is not counted.

**Can I adjust the size limits?**

Yes. Use `htmlSizeLimitMb` and `pdfSizeLimitMb` to set custom thresholds. Setting a lower value, say 1.5 MB, lets you catch pages that are getting close before they actually hit the Google limit.
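
As a rough illustration of the early-warning idea, the sketch below pairs a lowered `htmlSizeLimitMb` with a small post-processing step. `triage` and `GOOGLE_HTML_LIMIT_MB` are hypothetical names, and the step assumes dataset items shaped like the sample record in the Output section.

```python
GOOGLE_HTML_LIMIT_MB = 2  # the documented default for htmlSizeLimitMb

# Input with an early-warning threshold below the default.
run_input = {
    "startUrl": "https://apify.com",
    "htmlSizeLimitMb": 1.5,
}


def triage(items, hard_limit_mb=GOOGLE_HTML_LIMIT_MB):
    """Split flagged HTML records into 'over the hard limit' and 'approaching it'."""
    over, approaching = [], []
    for item in items:
        # Skip PDFs and anything the run did not flag (including error-only records).
        if item.get("pageType") != "html" or not item.get("exceedsLimit"):
            continue
        if item["sizeMb"] > hard_limit_mb:
            over.append(item["url"])
        else:
            approaching.append(item["url"])
    return over, approaching
```

`run_input` can be passed to the client call shown in the Python example in the API section, and `triage` can then be applied to the items read from the run's dataset.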

**How many pages can it check?**

Up to 1,000 per run, which is the maximum for `maxUrls` (the default is 100). For larger sites, run the Actor several times, each starting from a different section.
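
One way to organize several runs is sketched below. Because the crawler follows internal links across the domain, runs that start in different sections may revisit the same pages, so the sketch deduplicates by `url`. That overlap is an assumption based on the `startUrl` description, and `section_inputs` and `merge_results` are hypothetical helpers.

```python
def section_inputs(section_urls, max_urls=1000):
    """Build one Actor input per section start URL."""
    return [{"startUrl": url, "maxUrls": max_urls} for url in section_urls]


def merge_results(*runs):
    """Combine dataset items from several runs, keeping the first record per URL."""
    merged = {}
    for items in runs:
        for item in items:
            merged.setdefault(item["url"], item)
    return list(merged.values())
```

Each input from `section_inputs` would be submitted as a separate run, and the resulting item lists passed to `merge_results`.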

### Integrations

Connect Page Size Checker with other apps and services using [Apify integrations](https://apify.com/integrations). You can integrate with Make, Zapier, Slack, Airbyte, GitHub, Google Sheets, Google Drive, and many more. You can also use [webhooks](https://docs.apify.com/integrations/webhooks) to trigger actions whenever results are available.

# Actor input Schema

## `startUrl` (type: `string`):

The website URL to start crawling from. The Actor will follow all internal links on this domain and check each page's size.

## `maxUrls` (type: `integer`):

Maximum number of pages to check per run. Raise this for larger sites.

## `checkPdfs` (type: `boolean`):

When enabled, the Actor also checks PDF files linked from crawled pages against Google's 64 MB PDF indexing limit.

## `htmlSizeLimitMb` (type: `number`):

HTML pages above this threshold are flagged. Default is 2 MB — Google's current HTML indexing limit.

## `pdfSizeLimitMb` (type: `number`):

PDF files above this threshold are flagged. Default is 64 MB — Google's current PDF indexing limit.

## `requestTimeoutSecs` (type: `integer`):

How long to wait for each page to respond before timing out and moving on.

## Actor input object example

```json
{
  "startUrl": "https://apify.com",
  "maxUrls": 100,
  "checkPdfs": true,
  "htmlSizeLimitMb": 2,
  "pdfSizeLimitMb": 64,
  "requestTimeoutSecs": 30
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrl": "https://apify.com"
};

// Run the Actor and wait for it to finish
const run = await client.actor("zerobreak/check-page-sizes").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = { "startUrl": "https://apify.com" }

# Run the Actor and wait for it to finish
run = client.actor("zerobreak/check-page-sizes").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrl": "https://apify.com"
}' |
apify call zerobreak/check-page-sizes --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=zerobreak/check-page-sizes",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Check Page Sizes",
        "description": "Page size checker that crawls any website and flags HTML pages over 2MB or PDFs over 64MB, the exact thresholds where Google stops indexing — so SEO teams can fix oversized files before they drop from search.",
        "version": "0.0",
        "x-build-id": "ra26rILddq2ZKz62N"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/zerobreak~check-page-sizes/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-zerobreak-check-page-sizes",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/zerobreak~check-page-sizes/runs": {
            "post": {
                "operationId": "runs-sync-zerobreak-check-page-sizes",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/zerobreak~check-page-sizes/run-sync": {
            "post": {
                "operationId": "run-sync-zerobreak-check-page-sizes",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "startUrl"
                ],
                "properties": {
                    "startUrl": {
                        "title": "Start URL",
                        "type": "string",
                        "description": "The website URL to start crawling from. The actor will follow all internal links on this domain and check each page's size."
                    },
                    "maxUrls": {
                        "title": "Max pages",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Maximum number of pages to check per run. Raise this for larger sites.",
                        "default": 100
                    },
                    "checkPdfs": {
                        "title": "Check PDF files",
                        "type": "boolean",
                        "description": "When enabled, the actor also checks PDF files linked from crawled pages against Google's 64MB PDF indexing limit.",
                        "default": true
                    },
                    "htmlSizeLimitMb": {
                        "title": "HTML size limit (MB)",
                        "minimum": 0.1,
                        "type": "number",
                        "description": "HTML pages above this threshold are flagged. Default is 2 MB — Google's current HTML indexing limit.",
                        "default": 2
                    },
                    "pdfSizeLimitMb": {
                        "title": "PDF size limit (MB)",
                        "minimum": 0.1,
                        "type": "number",
                        "description": "PDF files above this threshold are flagged. Default is 64 MB — Google's current PDF indexing limit.",
                        "default": 64
                    },
                    "requestTimeoutSecs": {
                        "title": "Request timeout (seconds)",
                        "minimum": 5,
                        "maximum": 120,
                        "type": "integer",
                        "description": "How long to wait for each page to respond before timing out and moving on.",
                        "default": 30
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
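
For projects that call the REST API directly, the `run-sync-get-dataset-items` path from the specification above can be exercised with the Python standard library alone. This is a minimal sketch: `build_request` and `run_sync` are hypothetical helper names, the 300-second timeout is an arbitrary choice, and decoding the response as JSON is an assumption, since the specification only describes the response as "OK".

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ENDPOINT = "/acts/zerobreak~check-page-sizes/run-sync-get-dataset-items"


def build_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request described by the specification above."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ENDPOINT}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_sync(token: str, actor_input: dict, timeout: float = 300.0) -> list:
    """Send the request and decode the response body as JSON (performs a network call)."""
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Separating `build_request` from `run_sync` makes it possible to inspect the URL and body before anything is sent.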
