Website Content Crawler Lite

Pricing

from $0.10 / 1,000 extracted pages

Crawl public website pages and extract clean text, Markdown, metadata, and links for AI, SEO, and monitoring workflows.

Rating

0.0 (0)

Developer

Hanna Nosova

Maintained by Community

Actor stats

Bookmarked: 0

Total users: 2

Monthly active users: 1

Last modified: 20 hours ago

Turn public website pages into clean content for AI, SEO, research, monitoring, and documentation workflows.

Website Content Crawler Lite starts from one or more public URLs, follows links within your chosen scope, and saves page metadata, clean text, Markdown, optional HTML, discovered links, crawl depth, and skip/error context.

It is designed for teams that want a simple, predictable website crawler without heavyweight setup.


What does Website Content Crawler Lite do?

Website Content Crawler Lite crawls public web pages and creates structured page records.

Each successful page can include:

  • ✅ requested URL
  • ✅ final loaded URL
  • ✅ page title
  • ✅ meta description
  • ✅ first H1
  • ✅ clean text
  • ✅ Markdown content
  • ✅ optional cleaned HTML
  • ✅ discovered links
  • ✅ HTTP status
  • ✅ content type
  • ✅ crawl depth
  • ✅ parent URL
  • ✅ timestamp

Skipped pages are also reported with a reason, so you can see what happened to every URL. Skipped pages are not charged as extracted pages.


Who is it for?

Website Content Crawler Lite is for teams that need structured public website content without building and maintaining their own crawler.

AI and RAG teams

Use it to collect clean page text and Markdown for retrieval-augmented generation, internal knowledge bases, chatbot grounding, and document pipelines.

SEO and content teams

Use it to audit page titles, descriptions, H1 tags, internal links, and crawl coverage across a small website or documentation section.

Growth and operations teams

Use it to monitor public pages, collect content snapshots, or build lightweight lead and content enrichment workflows.

Developers and automation builders

Use it as a low-friction page extraction step inside Apify tasks, Make scenarios, Zapier flows, or custom API jobs.


Why use this actor?

  • 🚀 Simple inputs: start URLs, page limit, depth, and filters
  • 🧭 Safe defaults: same-domain crawling is enabled by default
  • 📄 Useful output: text, Markdown, metadata, and links in one row
  • 💸 Predictable pricing: charged per successfully extracted page
  • 🧱 Automation-ready: dataset output works with exports and APIs
  • 🔍 Transparent skips: blocked, filtered, and non-HTML pages are marked

What data can you extract?

| Field | Description |
| --- | --- |
| url | URL originally scheduled for crawling |
| loadedUrl | Final URL after redirects |
| title | HTML title or Open Graph title |
| description | Meta description or Open Graph description |
| h1 | First H1 text on the page |
| text | Clean plain text from the page |
| markdown | Markdown version when outputFormat is markdown |
| html | Cleaned HTML when outputFormat is html |
| links | Public links discovered on the page |
| statusCode | HTTP status code |
| contentType | Response content type |
| depth | Link depth from the start URL |
| parentUrl | Page that discovered this URL |
| fetchedAt | ISO timestamp for the fetch |
| error | Error detail when a request fails |
| skippedReason | Reason a page was skipped |

How much does it cost to crawl website content?

The actor uses pay-per-event pricing.

  • A small one-time start event covers run setup.
  • A page event is charged only for each successfully extracted HTML page.
  • Skipped pages, non-HTML files, robots-disallowed pages, filtered URLs, and request failures are not charged as extracted pages.

For example, if you crawl 100 successful pages, you pay for 100 extracted page events plus the small run start event.

This makes it practical for small tests and recurring website monitoring jobs.
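The arithmetic above can be sketched as a small helper. The per-page rate uses the listed "from $0.10 / 1,000" price. The start-event fee is not stated on this page, so `start_fee_usd` is a hypothetical parameter you would fill in from the actor's pricing details.

```python
def estimate_run_cost(extracted_pages, price_per_1000_usd=0.10, start_fee_usd=0.0):
    """Estimate run cost: one start event plus one event per extracted page.

    start_fee_usd is an assumption; the page only says the start event is "small".
    """
    if extracted_pages < 0:
        raise ValueError("extracted_pages must be non-negative")
    page_cost = extracted_pages * price_per_1000_usd / 1000
    return round(start_fee_usd + page_cost, 6)


# Cost of the page events alone for 100 successfully extracted pages.
print(estimate_run_cost(100))
```

Skipped and failed pages are left out of `extracted_pages`, since only successful HTML extractions trigger a page event.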


Quick start

  1. Open the actor on Apify.
  2. Add one or more public start URLs.
  3. Set maxPages to a small number for the first run.
  4. Keep sameDomainOnly enabled unless you intentionally want broader crawling.
  5. Choose markdown, text, or html output.
  6. Run the actor.
  7. Export the dataset as JSON, CSV, Excel, or connect it to your workflow.

Example input

{
  "startUrls": [
    { "url": "https://example.com" }
  ],
  "maxPages": 5,
  "maxDepth": 1,
  "sameDomainOnly": true,
  "outputFormat": "markdown",
  "respectRobotsTxt": true,
  "requestTimeoutSecs": 20
}

Input options

startUrls

A list of public HTTP or HTTPS pages where crawling should start.

maxPages

The maximum number of pages to fetch in the run. Start low when testing.

maxDepth

How many link levels to follow. Use 0 to extract only the start URLs.

sameDomainOnly

When true, the actor crawls only links on the same domain as the start URL.
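To illustrate how maxPages, maxDepth, and sameDomainOnly interact, here is a minimal breadth-first sketch over an in-memory link graph. It is not the actor's implementation: `plan_crawl` and the sample graph are hypothetical, and "same domain" is approximated by comparing hostnames, which may differ from the actor's own rule (for example, around subdomains).

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_urls, link_graph, max_pages=5, max_depth=1, same_domain_only=True):
    """Return (url, depth, parent_url) tuples in breadth-first order."""
    queue = deque((url, 0, None, urlparse(url).hostname) for url in start_urls)
    seen = set(start_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth, parent, home_host = queue.popleft()
        visited.append((url, depth, parent))
        if depth >= max_depth:
            continue  # do not follow links past the depth limit
        for link in link_graph.get(url, []):
            if link in seen:
                continue
            if same_domain_only and urlparse(link).hostname != home_host:
                continue  # stay on the host of the originating start URL
            seen.add(link)
            queue.append((link, depth + 1, url, home_host))
    return visited


# Hypothetical site used only for illustration.
graph = {
    "https://example.com/": ["https://example.com/docs", "https://other.example.org/"],
    "https://example.com/docs": ["https://example.com/docs/intro"],
}
for row in plan_crawl(["https://example.com/"], graph, max_pages=5, max_depth=1):
    print(row)
```

With `max_depth=0` the sketch returns only the start URLs, which mirrors the "extract only the start URLs" behaviour described above.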

includeGlobs

Optional URL patterns that must match for a URL to be crawled.

Example:

["https://example.com/docs/**"]

excludeGlobs

Optional URL patterns to skip.

Example:

["**/login**", "**/signup**"]
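The include and exclude patterns can be previewed locally before a run. The sketch below assumes a simplified glob dialect: `**` matches any characters including `/`, `*` matches within one path segment, and every other character is literal. The actor's real matcher may treat some patterns differently, so treat `glob_to_regex` and `is_in_scope` as hypothetical helpers.

```python
import re


def glob_to_regex(glob):
    """Translate a URL glob: '**' spans slashes, '*' stays within one segment."""
    parts = []
    i = 0
    while i < len(glob):
        if glob.startswith("**", i):
            parts.append(".*")
            i += 2
        elif glob[i] == "*":
            parts.append("[^/]*")
            i += 1
        else:
            parts.append(re.escape(glob[i]))
            i += 1
    return re.compile("".join(parts))


def is_in_scope(url, include_globs=(), exclude_globs=()):
    """A URL must match no exclude glob and, if includes are given, at least one of them."""
    if any(glob_to_regex(g).fullmatch(url) for g in exclude_globs):
        return False
    if include_globs:
        return any(glob_to_regex(g).fullmatch(url) for g in include_globs)
    return True


print(is_in_scope("https://example.com/docs/intro",
                  include_globs=["https://example.com/docs/**"],
                  exclude_globs=["**/login**", "**/signup**"]))
```

Under this dialect, `https://example.com/docs/**` does not match `https://example.com/docs` itself, because the pattern requires the trailing slash.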

outputFormat

Choose one of:

  • markdown
  • text
  • html

respectRobotsTxt

When enabled, the actor checks robots rules and skips disallowed pages.

requestTimeoutSecs

Maximum time to wait for one page response.
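Before sending an input, a client-side pre-check can catch obvious mistakes. The rules below are inferred from the option descriptions above (HTTP/HTTPS start URLs, non-negative depth, one of three output formats). They are assumptions, not the actor's actual input schema, and `validate_input` is a hypothetical helper.

```python
from urllib.parse import urlparse

ALLOWED_FORMATS = {"markdown", "text", "html"}


def validate_input(run_input):
    """Return a list of problems found in a run input dict (empty if none)."""
    problems = []
    start_urls = run_input.get("startUrls") or []
    if not start_urls:
        problems.append("startUrls must contain at least one URL")
    for entry in start_urls:
        parsed = urlparse(entry.get("url", ""))
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            problems.append(f"not an HTTP(S) URL: {entry.get('url')!r}")
    if run_input.get("maxDepth", 0) < 0:
        problems.append("maxDepth must be 0 or greater")
    if run_input.get("maxPages", 1) < 1:
        problems.append("maxPages must be at least 1")
    if run_input.get("outputFormat", "markdown") not in ALLOWED_FORMATS:
        problems.append("outputFormat must be markdown, text, or html")
    return problems


print(validate_input({
    "startUrls": [{"url": "https://example.com"}],
    "maxPages": 5,
    "maxDepth": 1,
    "outputFormat": "markdown",
}))
```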


Output example

{
  "url": "https://example.com/",
  "loadedUrl": "https://example.com/",
  "title": "Example Domain",
  "description": null,
  "h1": "Example Domain",
  "text": "Example Domain This domain is for use in illustrative examples in documents.",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "links": ["https://www.iana.org/domains/example"],
  "statusCode": 200,
  "contentType": "text/html",
  "depth": 0,
  "parentUrl": null,
  "fetchedAt": "2026-06-20T00:00:00.000Z"
}

Tips for better crawl results

  • Start with maxPages between 5 and 20.
  • Use maxDepth: 0 for page extraction only.
  • Use maxDepth: 1 for small site sections.
  • Keep sameDomainOnly enabled for predictable runs.
  • Use include globs for documentation folders.
  • Use exclude globs for login, cart, account, and search pages.
  • Prefer Markdown for AI and RAG workflows.
  • Prefer text for simple keyword or monitoring jobs.
  • Prefer HTML only when downstream tools need markup.

Common use cases

Build a RAG content feed

Crawl a public documentation site and export Markdown rows to a vector database pipeline.

Audit SEO metadata

Crawl a content section and review titles, descriptions, H1s, and status codes.
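A metadata review like this can be scripted over the exported dataset rows. The sketch below uses only the documented field names. The helper name `audit_seo` and the sample rows (including the `skippedReason` string) are illustrative assumptions.

```python
def audit_seo(items):
    """Return (url, problem) pairs for HTTP errors and missing metadata."""
    findings = []
    for item in items:
        url = item.get("loadedUrl") or item.get("url")
        status = item.get("statusCode")
        if status is not None and status >= 400:
            findings.append((url, f"HTTP {status}"))
            continue  # metadata on error pages is not meaningful
        if item.get("skippedReason") or item.get("error"):
            continue  # nothing was extracted for this row
        for field in ("title", "description", "h1"):
            if not (item.get(field) or "").strip():
                findings.append((url, f"missing {field}"))
    return findings


# Hypothetical dataset rows for illustration.
rows = [
    {"url": "https://example.com/", "title": "Example Domain",
     "description": None, "h1": "Example Domain", "statusCode": 200},
    {"url": "https://example.com/old", "statusCode": 404, "skippedReason": "http-error"},
]
for finding in audit_seo(rows):
    print(finding)
```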

Monitor public website copy

Run the actor on a schedule and compare page text over time.

Extract documentation pages

Use include globs like https://example.com/docs/** to stay inside a docs section.

Use links, depth, and parentUrl to understand how a small site section connects.
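One way to see how pages connect is to group rows by parentUrl and print the result as an indented tree. This is a minimal sketch that assumes start URLs have a parentUrl of null, as in the output example. `build_children` and `render_tree` are hypothetical helper names.

```python
from collections import defaultdict


def build_children(items):
    """Map each parent URL (None for start URLs) to the URLs it discovered."""
    children = defaultdict(list)
    for item in items:
        children[item.get("parentUrl")].append(item["url"])
    return children


def render_tree(children, parent=None, indent=0):
    """Return indented lines, one per page, following parent-to-child order."""
    lines = []
    for url in children.get(parent, []):
        lines.append("  " * indent + url)
        lines.extend(render_tree(children, url, indent + 1))
    return lines


# Hypothetical dataset rows for illustration.
rows = [
    {"url": "https://example.com/", "parentUrl": None, "depth": 0},
    {"url": "https://example.com/docs", "parentUrl": "https://example.com/", "depth": 1},
    {"url": "https://example.com/docs/intro", "parentUrl": "https://example.com/docs", "depth": 2},
    {"url": "https://example.com/about", "parentUrl": "https://example.com/", "depth": 1},
]
print("\n".join(render_tree(build_children(rows))))
```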


Integrations

Apify datasets

Every page result is saved to the default dataset. Export as JSON, CSV, Excel, XML, RSS, or HTML.

Apify API

Start runs from your own backend and fetch the dataset after completion.

Make and Zapier

Use completed runs as triggers for notifications, indexing, spreadsheets, or content review workflows.

Vector databases

Send Markdown or text fields to Pinecone, Weaviate, Qdrant, Chroma, or your internal embeddings pipeline.
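Before embedding, page Markdown is usually split into smaller chunks. The sketch below packs paragraphs into chunks under a character budget and attaches the source URL and title as metadata. The chunk size, the `id` scheme, and the document shape are assumptions for illustration, and no vector database client is called here.

```python
def chunk_markdown(markdown, max_chars=1000):
    """Pack blank-line-separated paragraphs into chunks of up to max_chars.

    A single paragraph longer than max_chars is kept whole as its own chunk.
    """
    chunks, current = [], ""
    for para in (p.strip() for p in markdown.split("\n\n")):
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def to_documents(item, max_chars=1000):
    """Turn one dataset row into chunk documents ready for an embeddings step."""
    return [
        {
            "id": f"{item['url']}#{i}",  # hypothetical ID scheme
            "text": chunk,
            "metadata": {"url": item["url"], "title": item.get("title")},
        }
        for i, chunk in enumerate(chunk_markdown(item.get("markdown") or "", max_chars))
    ]


row = {
    "url": "https://example.com/",
    "title": "Example Domain",
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
}
for doc in to_documents(row, max_chars=20):
    print(doc["id"], repr(doc["text"]))
```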

Monitoring workflows

Schedule recurring runs and compare exported text fields between runs.
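Comparing two runs can be as simple as hashing the text field per URL and diffing the results. This sketch assumes you have already loaded the dataset rows of both runs into lists. `fingerprint` and `diff_runs` are hypothetical helper names.

```python
import hashlib


def fingerprint(items):
    """Map each extracted URL to a SHA-256 hash of its text field."""
    return {
        item["url"]: hashlib.sha256((item.get("text") or "").encode("utf-8")).hexdigest()
        for item in items
        if not item.get("skippedReason")
    }


def diff_runs(previous_items, current_items):
    """Report URLs that were added, removed, or whose text changed between runs."""
    old, new = fingerprint(previous_items), fingerprint(current_items)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(u for u in new.keys() & old.keys() if new[u] != old[u]),
    }


# Hypothetical rows from two scheduled runs.
last_week = [{"url": "https://example.com/", "text": "Welcome"},
             {"url": "https://example.com/pricing", "text": "Plans from $10"}]
today = [{"url": "https://example.com/", "text": "Welcome"},
         {"url": "https://example.com/pricing", "text": "Plans from $12"},
         {"url": "https://example.com/blog", "text": "New post"}]
print(diff_runs(last_week, today))
```

Hashing the text field rather than HTML keeps the comparison focused on visible copy instead of markup changes.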


API usage with Node.js

import { ApifyClient } from 'apify-client';

// Authenticate with the API token from the environment.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the actor and wait for the run to finish.
const run = await client.actor('fetch_cat/website-content-crawler-lite').call({
    startUrls: [{ url: 'https://example.com' }],
    maxPages: 5,
    maxDepth: 1,
    outputFormat: 'markdown',
});

// Read the page records from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

API usage with Python

import os

from apify_client import ApifyClient

# Authenticate with the API token from the environment.
client = ApifyClient(os.environ['APIFY_TOKEN'])

# Start the actor and wait for the run to finish.
run = client.actor('fetch_cat/website-content-crawler-lite').call(run_input={
    'startUrls': [{'url': 'https://example.com'}],
    'maxPages': 5,
    'maxDepth': 1,
    'outputFormat': 'markdown',
})

# Read the page records from the run's default dataset.
items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)

API usage with cURL

curl "https://api.apify.com/v2/acts/fetch_cat~website-content-crawler-lite/runs?token=$APIFY_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "startUrls": [{"url": "https://example.com"}],
    "maxPages": 5,
    "maxDepth": 1,
    "outputFormat": "markdown"
  }'

MCP usage

Use this actor from MCP-enabled tools through Apify MCP Server.

MCP URL:

https://mcp.apify.com/?tools=fetch_cat/website-content-crawler-lite

Claude Code MCP setup

claude mcp add --transport http apify "https://mcp.apify.com/?tools=fetch_cat/website-content-crawler-lite"

Claude Desktop MCP JSON config

{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com/?tools=fetch_cat/website-content-crawler-lite"
    }
  }
}

Example prompts showing MCP usage

  • "Use the Apify MCP tool fetch_cat/website-content-crawler-lite to crawl https://docs.apify.com/academy with 10 pages and summarize the Markdown output."
  • "Run Website Content Crawler Lite through MCP and extract titles and H1s from the first 10 pages on this site."
  • "Use MCP to find all links discovered from this landing page and group them by domain."

Claude Code prompt examples

  • "Use Apify MCP to crawl this public documentation section and return a Markdown summary."
  • "With the Website Content Crawler Lite MCP tool, extract page titles, H1s, and status codes from this website."

Claude Desktop prompt examples

  • "Use Website Content Crawler Lite via Apify MCP to collect clean text from this website section."
  • "Run a small MCP crawl and prepare a content audit table with URL, title, H1, and status code."

Legality and responsible crawling

This actor is intended for public web pages only.

Do not use it to bypass logins, paywalls, CAPTCHA, private content, or access controls.

Respect website terms, robots rules, copyright, privacy rules, and applicable laws.

Use conservative page limits and crawl only content you have the right to process.


Troubleshooting

Why did a page have skippedReason?

The page may have been filtered by your globs, disallowed by robots rules, returned an HTTP error, served non-HTML content, or failed before extraction.
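To see which of these causes dominates in a run, you can tally the skippedReason values across the dataset. The reason strings in the sample rows are illustrative assumptions, since the actor's actual values are not listed here.

```python
from collections import Counter


def summarize_skips(items):
    """Count rows by skippedReason; rows without one are counted as 'extracted'."""
    return Counter(item.get("skippedReason") or "extracted" for item in items)


# Hypothetical rows; the reason strings are illustrative only.
rows = [
    {"url": "https://example.com/", "statusCode": 200},
    {"url": "https://example.com/file.pdf", "skippedReason": "non-html"},
    {"url": "https://example.com/login", "skippedReason": "excluded-by-glob"},
    {"url": "https://example.com/report.pdf", "skippedReason": "non-html"},
]
for reason, count in summarize_skips(rows).items():
    print(reason, count)
```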

Why did I get fewer pages than maxPages?

The site may not have enough in-scope links, your filters may be strict, or the actor may have skipped pages that are not suitable for extraction.

Why is Markdown missing?

Set outputFormat to markdown. The markdown field is populated only in that mode; when outputFormat is text or html, the actor returns that format instead.

Can it crawl private dashboards?

No. This actor is for public pages. It does not accept credentials or browser sessions.


Limits

  • Public HTTP/HTTPS pages only.
  • No login or private account access.
  • No CAPTCHA bypass.
  • No file download extraction in the first version.
  • Very JavaScript-heavy pages may return limited text if the content is not present in the public response.


FAQ

Is this a full website crawler?

It is a lightweight crawler for public website content. It is best for controlled page limits and scoped crawling.

Can I crawl multiple domains?

Yes. Add multiple start URLs. With sameDomainOnly enabled, each start URL stays on its own domain.

Can I crawl only one folder?

Yes. Use includeGlobs, for example https://example.com/docs/**.

Do skipped pages cost the same as extracted pages?

No. The per-page extraction charge is applied only after a successful HTML page extraction.

What format should I use for LLMs?

Markdown is usually the best first choice for LLM, RAG, and documentation workflows.

What should I do before a large crawl?

Run a small crawl first, review the dataset, then increase maxPages gradually.