Pricing

from $0.10 / 1,000 extracted pages

Website Content Crawler Lite

Crawl public website pages and extract clean text, Markdown, metadata, and links for AI, SEO, and monitoring workflows.

Developer

Hanna Nosova

Last modified

20 hours ago

What does Website Content Crawler Lite do?

Website Content Crawler Lite crawls public web pages and creates structured page records.

Each successful page can include:

✅ requested URL
✅ final loaded URL
✅ page title
✅ meta description
✅ first H1
✅ clean text
✅ Markdown content
✅ optional cleaned HTML
✅ discovered links
✅ HTTP status
✅ content type
✅ crawl depth
✅ parent URL
✅ timestamp

Skipped pages are also reported with a reason, so you can see what happened. They are not charged as extracted pages.

Who is it for?

Website Content Crawler Lite is for teams that need structured public website content without building and maintaining their own crawler.

Who is this website crawler for?

AI and RAG teams

Use it to collect clean page text and Markdown for retrieval-augmented generation, internal knowledge bases, chatbot grounding, and document pipelines.

SEO and content teams

Use it to audit page titles, descriptions, H1 tags, internal links, and crawl coverage across a small website or documentation section.

Growth and operations teams

Use it to monitor public pages, collect content snapshots, or build lightweight lead and content enrichment workflows.

Developers and automation builders

Use it as a low-friction page extraction step inside Apify tasks, Make scenarios, Zapier flows, or custom API jobs.

Why use this actor?

🚀 Simple inputs: start URLs, page limit, depth, and filters
🧭 Safe defaults: same-domain crawling is enabled by default
📄 Useful output: text, Markdown, metadata, and links in one row
💸 Predictable pricing: charged per successfully extracted page
🧱 Automation-ready: dataset output works with exports and APIs
🔍 Transparent skips: blocked, filtered, and non-HTML pages are marked

What data can you extract?

Field	Description
`url`	URL originally scheduled for crawling
`loadedUrl`	Final URL after redirects
`title`	HTML title or Open Graph title
`description`	Meta description or Open Graph description
`h1`	First H1 text on the page
`text`	Clean plain text from the page
`markdown`	Markdown version when `outputFormat` is `markdown`
`html`	Cleaned HTML when `outputFormat` is `html`
`links`	Public links discovered on the page
`statusCode`	HTTP status code
`contentType`	Response content type
`depth`	Link depth from the start URL
`parentUrl`	Page that discovered this URL
`fetchedAt`	ISO timestamp for the fetch
`error`	Error detail when a request fails
`skippedReason`	Reason a page was skipped

How much does it cost to crawl website content?

The actor uses pay-per-event pricing.

A small one-time start event covers run setup.
A page event is charged only for each successfully extracted HTML page.
Skipped pages, non-HTML files, robots-disallowed pages, filtered URLs, and request failures are not charged as extracted pages.

For example, if you crawl 100 successful pages, you pay for 100 extracted page events plus the small run start event.

This makes it practical for small tests and recurring website monitoring jobs.
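The arithmetic can be sketched in Python. The per-page rate below is the listed "from" price on this page. The start event price is not stated here, so `start_event_price` is a placeholder argument, and `estimate_cost` is an illustrative helper rather than part of any Apify library.

```python
from decimal import Decimal

# Listed "from" price on this page: $0.10 per 1,000 extracted pages.
PRICE_PER_1000_PAGES = Decimal("0.10")


def estimate_cost(extracted_pages, start_event_price=Decimal("0")):
    """Estimate run cost: per-page events plus the one-time start event.

    start_event_price is a placeholder; its real value is not given here.
    """
    if extracted_pages < 0:
        raise ValueError("extracted_pages must be non-negative")
    page_cost = PRICE_PER_1000_PAGES * extracted_pages / 1000
    return page_cost + start_event_price


# 100 successful pages at the listed rate, excluding the start event.
print(estimate_cost(100))
```

Only successfully extracted pages should be passed in, since skipped and failed pages are not charged as extracted pages.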

Quick start

1. Open the actor on Apify.
2. Add one or more public start URLs.
3. Set `maxPages` to a small number for the first run.
4. Keep `sameDomainOnly` enabled unless you intentionally want broader crawling.
5. Choose `markdown`, `text`, or `html` output.
6. Run the actor.
7. Export the dataset as JSON, CSV, or Excel, or connect it to your workflow.

Example input

{
  "startUrls": [
    { "url": "https://example.com" }
  ],
  "maxPages": 5,
  "maxDepth": 1,
  "sameDomainOnly": true,
  "outputFormat": "markdown",
  "respectRobotsTxt": true,
  "requestTimeoutSecs": 20
}

Input options

`startUrls`

A list of public HTTP or HTTPS pages where crawling should start.

`maxPages`

The maximum number of pages to fetch in the run. Start low when testing.

`maxDepth`

How many link levels to follow. Use 0 to extract only the start URLs.

`sameDomainOnly`

When true, the actor crawls only links on the same domain as the start URL.
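To see how `maxPages`, `maxDepth`, and `sameDomainOnly` might interact, here is a minimal breadth-first sketch over an in-memory link graph. It is not the actor's implementation: `plan_crawl` and `link_graph` are hypothetical names, and "same domain" is assumed to mean an exact hostname match, which may differ from how the actor treats subdomains.

```python
from collections import deque
from urllib.parse import urlparse


def plan_crawl(start_urls, link_graph, max_pages, max_depth, same_domain_only=True):
    """Breadth-first walk over a link graph, honoring page, depth, and domain limits."""
    # Each queue entry remembers the hostname of the start URL it came from.
    queue = deque((url, 0, urlparse(url).hostname) for url in start_urls)
    seen = set(start_urls)
    visited = []
    while queue and len(visited) < max_pages:
        url, depth, start_host = queue.popleft()
        visited.append((url, depth))
        if depth >= max_depth:
            continue  # do not follow links beyond the depth limit
        for link in link_graph.get(url, []):
            if link in seen:
                continue
            # Assumption: same domain means an identical hostname.
            if same_domain_only and urlparse(link).hostname != start_host:
                continue
            seen.add(link)
            queue.append((link, depth + 1, start_host))
    return visited


graph = {
    "https://example.com/": ["https://example.com/a", "https://other.org/x"],
    "https://example.com/a": ["https://example.com/b"],
}
print(plan_crawl(["https://example.com/"], graph, max_pages=5, max_depth=1))
```

With `max_depth=0` the walk returns only the start URLs, which matches the description of `maxDepth` above.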

`includeGlobs`

Optional URL patterns that must match for a URL to be crawled.

Example:

["https://example.com/docs/**"]

`excludeGlobs`

Optional URL patterns to skip.

Example:

["**/login**", "**/signup**"]
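A rough way to preview which URLs a pair of include and exclude patterns would admit is to test them locally. The sketch below uses Python's `fnmatch`, whose wildcard rules are only an approximation of whatever glob library the actor uses (for example, `*` in `fnmatch` also matches `/`). `is_in_scope` is a hypothetical helper, and treating an empty include list as "allow everything" is an assumption.

```python
from fnmatch import fnmatchcase


def is_in_scope(url, include_globs=(), exclude_globs=()):
    """Return True when the URL passes the include and exclude patterns."""
    # Assumption: an empty include list allows every URL.
    if include_globs and not any(fnmatchcase(url, g) for g in include_globs):
        return False
    # Any matching exclude pattern removes the URL from scope.
    return not any(fnmatchcase(url, g) for g in exclude_globs)


include = ["https://example.com/docs/**"]
exclude = ["**/login**", "**/signup**"]
for candidate in (
    "https://example.com/docs/intro",
    "https://example.com/blog/post",
    "https://example.com/docs/login",
):
    print(candidate, is_in_scope(candidate, include, exclude))
```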

`outputFormat`

Choose one of:

markdown
text
html

`respectRobotsTxt`

When enabled, the actor checks robots rules and skips disallowed pages.

`requestTimeoutSecs`

Maximum time to wait for one page response.

Output example

{
  "url": "https://example.com/",
  "loadedUrl": "https://example.com/",
  "title": "Example Domain",
  "description": null,
  "h1": "Example Domain",
  "text": "Example Domain This domain is for use in illustrative examples in documents.",
  "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
  "links": ["https://www.iana.org/domains/example"],
  "statusCode": 200,
  "contentType": "text/html",
  "depth": 0,
  "parentUrl": null,
  "fetchedAt": "2026-06-20T00:00:00.000Z"
}

Tips for better crawl results

Start with maxPages between 5 and 20.
Use maxDepth: 0 for page extraction only.
Use maxDepth: 1 for small site sections.
Keep sameDomainOnly enabled for predictable runs.
Use include globs for documentation folders.
Use exclude globs for login, cart, account, and search pages.
Prefer Markdown for AI and RAG workflows.
Prefer text for simple keyword or monitoring jobs.
Prefer HTML only when downstream tools need markup.

Common use cases

Build a RAG content feed

Crawl a public documentation site and export Markdown rows to a vector database pipeline.
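One possible shape for that pipeline step is sketched below: it splits each page's `markdown` field on blank lines and groups the pieces into chunks before they are embedded. `to_rag_documents`, the chunk ID format, and the `max_chars` default are illustrative choices, not something the actor provides.

```python
def to_rag_documents(items, max_chars=1000):
    """Turn dataset rows into chunked documents for an embeddings pipeline.

    Chunks target max_chars characters; a single paragraph longer than
    that is kept whole rather than cut mid-sentence.
    """
    documents = []
    for item in items:
        markdown = item.get("markdown")
        if not markdown:
            continue  # rows without Markdown (skipped or failed) are ignored
        chunks, chunk = [], ""
        # Split on blank lines so headings and paragraphs stay intact.
        for block in markdown.split("\n\n"):
            candidate = f"{chunk}\n\n{block}" if chunk else block
            if len(candidate) > max_chars and chunk:
                chunks.append(chunk)
                chunk = block
            else:
                chunk = candidate
        if chunk:
            chunks.append(chunk)
        for index, text in enumerate(chunks):
            documents.append({
                "id": f"{item['loadedUrl']}#chunk-{index}",
                "text": text,
                "metadata": {"url": item["loadedUrl"], "title": item.get("title")},
            })
    return documents
```

Each returned dictionary can then be passed to whichever vector database client you use.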

Audit SEO metadata

Crawl a content section and review titles, descriptions, H1s, and status codes.
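A review like this can be automated with a small pass over the dataset rows. The sketch below is a hypothetical helper (`audit_seo` is not part of the actor) that flags pages with a missing `title`, `description`, or `h1`, or with a `statusCode` other than 200. It assumes skipped rows carry a `skippedReason` and should be left out of the audit.

```python
def audit_seo(items):
    """Return one finding per page with missing metadata or a non-200 status."""
    findings = []
    for item in items:
        if item.get("skippedReason"):
            continue  # skipped rows have no extracted metadata to audit
        issues = [
            f"missing {field}"
            for field in ("title", "description", "h1")
            if not item.get(field)
        ]
        status = item.get("statusCode")
        if status != 200:
            issues.append(f"status {status}")
        if issues:
            findings.append({"url": item.get("url"), "issues": issues})
    return findings
```

Run against the row in the output example above, this would report a missing description, since `description` is `null` there.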

Monitor public website copy

Run the actor on a schedule and compare page text over time.
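A simple comparison between two runs can be done by hashing the `text` field of each row. This is a minimal sketch under the assumption that `url` is stable between runs. `text_fingerprints` and `diff_runs` are hypothetical helper names.

```python
import hashlib


def text_fingerprints(items):
    """Map each page URL to a SHA-256 hash of its extracted text."""
    return {
        item["url"]: hashlib.sha256(item["text"].encode("utf-8")).hexdigest()
        for item in items
        if item.get("text")
    }


def diff_runs(previous_items, current_items):
    """Report URLs that were added, removed, or changed between two runs."""
    old = text_fingerprints(previous_items)
    new = text_fingerprints(current_items)
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "changed": sorted(u for u in new.keys() & old.keys() if new[u] != old[u]),
    }
```

Hashing keeps stored snapshots small. If you need to see what changed, keep the full text and run a diff with `difflib` on the URLs listed under `changed`.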

Extract documentation pages

Use include globs like https://example.com/docs/** to stay inside a docs section.

Collect link maps

Use links, depth, and parentUrl to understand how a small site section connects.
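As an illustration, the rows can be grouped by `parentUrl` to recover which page led to which. `build_link_map` is a hypothetical helper. It assumes start URLs have a `parentUrl` of `null`, as in the output example above.

```python
from collections import defaultdict


def build_link_map(items):
    """Group crawled URLs under the page that discovered them."""
    children = defaultdict(list)
    for item in items:
        parent = item.get("parentUrl")
        if parent is not None:  # start URLs have no parent
            children[parent].append(item["url"])
    return dict(children)


def pages_per_depth(items):
    """Count how many crawled pages sit at each link depth."""
    counts = defaultdict(int)
    for item in items:
        counts[item["depth"]] += 1
    return dict(counts)
```

The `links` field on each row lists everything discovered on a page, while `parentUrl` records only the page that first queued a URL, so the two views can differ.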

Integrations

Apify datasets

Every page result is saved to the default dataset. Export as JSON, CSV, Excel, XML, RSS, or HTML.

Apify API

Start runs from your own backend and fetch the dataset after completion.

Make and Zapier

Use completed runs as triggers for notifications, indexing, spreadsheets, or content review workflows.

Vector databases

Send Markdown or text fields to Pinecone, Weaviate, Qdrant, Chroma, or your internal embeddings pipeline.

Monitoring workflows

Schedule recurring runs and compare exported text fields between runs.

API usage with Node.js

import { ApifyClient } from 'apify-client';

// Authenticate with an API token read from the environment.
const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

// Start the actor and wait for the run to finish.
const run = await client.actor('fetch_cat/website-content-crawler-lite').call({
  startUrls: [{ url: 'https://example.com' }],
  maxPages: 5,
  maxDepth: 1,
  outputFormat: 'markdown'
});

// Read the page records from the run's default dataset.
const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);

API usage with Python

import os

from apify_client import ApifyClient

# Authenticate with an API token read from the environment.
client = ApifyClient(os.environ['APIFY_TOKEN'])

# Start the actor and wait for the run to finish.
run = client.actor('fetch_cat/website-content-crawler-lite').call(run_input={
    'startUrls': [{'url': 'https://example.com'}],
    'maxPages': 5,
    'maxDepth': 1,
    'outputFormat': 'markdown',
})

# Read the page records from the run's default dataset.
items = client.dataset(run['defaultDatasetId']).list_items().items
print(items)

API usage with cURL

curl "https://api.apify.com/v2/acts/fetch_cat~website-content-crawler-lite/runs?token=$APIFY_TOKEN" \
  -H 'Content-Type: application/json' \
  -d '{
    "startUrls": [{"url": "https://example.com"}],
    "maxPages": 5,
    "maxDepth": 1,
    "outputFormat": "markdown"
  }'

MCP usage

Use this actor from MCP-enabled tools through Apify MCP Server.

MCP URL:

https://mcp.apify.com/?tools=fetch_cat/website-content-crawler-lite

Claude Code MCP setup

claude mcp add --transport http apify "https://mcp.apify.com/?tools=fetch_cat/website-content-crawler-lite"

Claude Desktop MCP JSON config

{
  "mcpServers": {
    "apify": {
      "url": "https://mcp.apify.com/?tools=fetch_cat/website-content-crawler-lite"
    }
  }
}

Example prompts showing MCP usage

"Use the Apify MCP tool fetch_cat/website-content-crawler-lite to crawl https://docs.apify.com/academy with 10 pages and summarize the Markdown output."
"Run Website Content Crawler Lite through MCP and extract titles and H1s from the first 10 pages on this site."
"Use MCP to find all links discovered from this landing page and group them by domain."

Claude Code prompt examples

"Use Apify MCP to crawl this public documentation section and return a Markdown summary."
"With the Website Content Crawler Lite MCP tool, extract page titles, H1s, and status codes from this website."

Claude Desktop prompt examples

"Use Website Content Crawler Lite via Apify MCP to collect clean text from this website section."
"Run a small MCP crawl and prepare a content audit table with URL, title, H1, and status code."

Legality and responsible crawling

This actor is intended for public web pages only.

Do not use it to bypass logins, paywalls, CAPTCHA, private content, or access controls.

Respect website terms, robots rules, copyright, privacy rules, and applicable laws.

Use conservative page limits and crawl only content you have the right to process.

Troubleshooting

Why did a page have `skippedReason`?

The page may have been filtered by your globs, disallowed by robots rules, returned an HTTP error, served non-HTML content, or failed before extraction.
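To see how a run broke down, you can tally the rows by outcome. The sketch below assumes that skipped rows carry `skippedReason`, failed rows carry `error`, and successful rows carry neither, which is inferred from the field table above. The actual reason strings are not documented here, and `summarize_run` is a hypothetical helper.

```python
from collections import Counter


def summarize_run(items):
    """Count extracted, failed, and skipped rows, grouping skips by reason."""
    skipped = Counter()
    extracted = failed = 0
    for item in items:
        if item.get("skippedReason"):
            skipped[item["skippedReason"]] += 1
        elif item.get("error"):
            failed += 1
        else:
            extracted += 1
    return {"extracted": extracted, "failed": failed, "skipped": dict(skipped)}
```

A large count under one reason usually points to the setting to revisit, such as the glob filters or `respectRobotsTxt`.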

Why did I get fewer pages than `maxPages`?

The site may not have enough in-scope links, your filters may be strict, or the actor may have skipped pages that are not suitable for extraction.

Why is Markdown missing?

Set `outputFormat` to `markdown`. The `markdown` field is populated only in that mode. When `outputFormat` is `text` or `html`, the actor produces that format instead.

Can it crawl private dashboards?

No. This actor is for public pages. It does not accept credentials or browser sessions.

Limits

Public HTTP/HTTPS pages only.
No login or private account access.
No CAPTCHA bypass.
No file download extraction in the first version.
Very JavaScript-heavy pages may return limited text if the content is not present in the public response.

FAQ

Is this a full website crawler?

It is a lightweight crawler for public website content. It is best for controlled page limits and scoped crawling.

Can I crawl multiple domains?

Yes. Add multiple start URLs. With sameDomainOnly enabled, each start URL stays on its own domain.

Can I crawl only one folder?

Yes. Use includeGlobs, for example https://example.com/docs/**.

Do skipped pages cost the same as extracted pages?

No. The per-page extraction charge is applied only after a successful HTML page extraction.

What format should I use for LLMs?

Markdown is usually the best first choice for LLM, RAG, and documentation workflows.

What should I do before a large crawl?

Run a small crawl first, review the dataset, then increase maxPages gradually.
