Website Content Crawler Lite
Pricing
from $0.10 / 1,000 extracted pages
Crawl public website pages and extract clean text, Markdown, metadata, and links for AI, SEO, and monitoring workflows.
Developer
Hanna Nosova
Maintained by Community
Turn public website pages into clean content for AI, SEO, research, monitoring, and documentation workflows.
Website Content Crawler Lite starts from one or more public URLs, follows links within your chosen scope, and saves page metadata, clean text, Markdown, optional HTML, discovered links, crawl depth, and skip/error context.
It is designed for teams that want a simple, predictable website crawler without heavyweight setup.
What does Website Content Crawler Lite do?
Website Content Crawler Lite crawls public web pages and creates structured page records.
Each successful page can include:
- ✅ requested URL
- ✅ final loaded URL
- ✅ page title
- ✅ meta description
- ✅ first H1
- ✅ clean text
- ✅ Markdown content
- ✅ optional cleaned HTML
- ✅ discovered links
- ✅ HTTP status
- ✅ content type
- ✅ crawl depth
- ✅ parent URL
- ✅ timestamp
Skipped pages are also reported with a reason, so you can see what happened and are not charged for an extraction that did not succeed.
Who is it for?
Website Content Crawler Lite is for teams that need structured public website content without building and maintaining their own crawler.
AI and RAG teams
Use it to collect clean page text and Markdown for retrieval-augmented generation, internal knowledge bases, chatbot grounding, and document pipelines.
SEO and content teams
Use it to audit page titles, descriptions, H1 tags, internal links, and crawl coverage across a small website or documentation section.
Growth and operations teams
Use it to monitor public pages, collect content snapshots, or build lightweight lead and content enrichment workflows.
Developers and automation builders
Use it as a low-friction page extraction step inside Apify tasks, Make scenarios, Zapier flows, or custom API jobs.
Why use this actor?
- 🚀 Simple inputs: start URLs, page limit, depth, and filters
- 🧭 Safe defaults: same-domain crawling is enabled by default
- 📄 Useful output: text, Markdown, metadata, and links in one row
- 💸 Predictable pricing: charged per successfully extracted page
- 🧱 Automation-ready: dataset output works with exports and APIs
- 🔍 Transparent skips: blocked, filtered, and non-HTML pages are marked
What data can you extract?
| Field | Description |
|---|---|
| url | URL originally scheduled for crawling |
| loadedUrl | Final URL after redirects |
| title | HTML title or Open Graph title |
| description | Meta description or Open Graph description |
| h1 | First H1 text on the page |
| text | Clean plain text from the page |
| markdown | Markdown version when outputFormat is markdown |
| html | Cleaned HTML when outputFormat is html |
| links | Public links discovered on the page |
| statusCode | HTTP status code |
| contentType | Response content type |
| depth | Link depth from the start URL |
| parentUrl | Page that discovered this URL |
| fetchedAt | ISO timestamp for the fetch |
| error | Error detail when a request fails |
| skippedReason | Reason a page was skipped |
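As a rough illustration of how these fields might be consumed downstream, the sketch below splits dataset rows into extracted pages and skipped or failed ones. It assumes that a skipped row carries a non-empty skippedReason and a failed row a non-empty error, which the table suggests but does not guarantee. The helper `split_rows`, the sample rows, and the sample reason strings are hypothetical.

```python
from typing import Any

Row = dict[str, Any]


def split_rows(rows: list[Row]) -> tuple[list[Row], list[Row]]:
    """Separate extracted pages from skipped or failed ones.

    Assumption: a row is problematic when it has a non-empty
    `skippedReason` or `error` field.
    """
    extracted: list[Row] = []
    problems: list[Row] = []
    for row in rows:
        if row.get("skippedReason") or row.get("error"):
            problems.append(row)
        else:
            extracted.append(row)
    return extracted, problems


# Hypothetical rows; the reason and error strings are made up for the example.
sample_rows = [
    {"url": "https://example.com/", "statusCode": 200, "text": "Example Domain"},
    {"url": "https://example.com/file.pdf", "skippedReason": "non-html"},
    {"url": "https://example.com/broken", "error": "timeout"},
]

extracted, problems = split_rows(sample_rows)
print(len(extracted), "extracted;", len(problems), "skipped or failed")
```

Keeping the problem rows in a separate list makes it easy to review why pages were skipped before widening the crawl.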
How much does it cost to crawl website content?
The actor uses pay-per-event pricing.
- A small one-time start event covers run setup.
- A page event is charged only for each successfully extracted HTML page.
- Skipped pages, non-HTML files, robots-disallowed pages, filtered URLs, and request failures are not charged as extracted pages.
For example, if you crawl 100 successful pages, you pay for 100 extracted page events plus the small run start event.
This makes it practical for small tests and recurring website monitoring jobs.
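The arithmetic can be sketched as follows, using the listed "from" price of $0.10 per 1,000 extracted pages. The start event amount is not stated here, so `start_event_price` is a placeholder the caller must supply, and the actual per-page price may differ by plan.

```python
from decimal import Decimal

# Listed "from" price: $0.10 per 1,000 extracted pages.
PRICE_PER_1000_PAGES = Decimal("0.10")


def estimate_cost(extracted_pages: int, start_event_price: Decimal) -> Decimal:
    """Estimate a run's cost: one start event plus one event per extracted page.

    `start_event_price` is an assumption supplied by the caller; this
    document does not state the start event amount.
    """
    if extracted_pages < 0:
        raise ValueError("extracted_pages must be non-negative")
    page_cost = PRICE_PER_1000_PAGES * extracted_pages / 1000
    return page_cost + start_event_price


# 100 extracted pages with a placeholder start price of $0.005.
print(estimate_cost(100, Decimal("0.005")))
```

Skipped pages do not enter the calculation, so only the count of successfully extracted pages is passed in.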
Quick start
- Open the actor on Apify.
- Add one or more public start URLs.
- Set maxPages to a small number for the first run.
- Keep sameDomainOnly enabled unless you intentionally want broader crawling.
- Choose markdown, text, or html output.
- Run the actor.
- Export the dataset as JSON, CSV, or Excel, or connect it to your workflow.
Example input
    {
      "startUrls": [{ "url": "https://example.com" }],
      "maxPages": 5,
      "maxDepth": 1,
      "sameDomainOnly": true,
      "outputFormat": "markdown",
      "respectRobotsTxt": true,
      "requestTimeoutSecs": 20
    }
Input options
startUrls
A list of public HTTP or HTTPS pages where crawling should start.
maxPages
The maximum number of pages to fetch in the run. Start low when testing.
maxDepth
How many link levels to follow. Use 0 to extract only the start URLs.
sameDomainOnly
When true, the actor crawls only links on the same domain as the start URL.
includeGlobs
Optional URL patterns that must match for a URL to be crawled.
Example:
["https://example.com/docs/**"]
excludeGlobs
Optional URL patterns to skip.
Example:
["**/login**", "**/signup**"]
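To preview which URLs a pair of include and exclude patterns would let through, a local approximation can help. The sketch below uses Python's `fnmatch`, which is not necessarily the glob engine the actor uses, so edge cases may differ. It also assumes that exclude patterns win over include patterns and that an empty include list allows everything. The helper `in_scope` is hypothetical.

```python
from fnmatch import fnmatchcase


def in_scope(url: str, include_globs: list[str], exclude_globs: list[str]) -> bool:
    """Approximate the include/exclude filtering for one URL.

    Assumptions: excludes take priority, and an empty include list
    means "allow everything". fnmatch treats `*` as "any characters",
    so `**` behaves the same as `*` here.
    """
    if any(fnmatchcase(url, pattern) for pattern in exclude_globs):
        return False
    if include_globs and not any(fnmatchcase(url, pattern) for pattern in include_globs):
        return False
    return True


include = ["https://example.com/docs/**"]
exclude = ["**/login**", "**/signup**"]

for candidate in [
    "https://example.com/docs/intro",
    "https://example.com/docs/login",
    "https://example.com/blog/post",
]:
    print(candidate, in_scope(candidate, include, exclude))
```

Running a handful of representative URLs through a check like this before a crawl can catch patterns that are stricter than intended.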
outputFormat
Choose one of:
- markdown
- text
- html
respectRobotsTxt
When enabled, the actor checks robots rules and skips disallowed pages.
requestTimeoutSecs
Maximum time to wait for one page response.
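One way to keep inputs consistent across runs is to assemble them with a small helper. The sketch below builds an input dict using the option names documented above. The validation rules are local sanity checks chosen for this example, not the actor's own validation, and `build_input` is a hypothetical helper.

```python
from typing import Any

ALLOWED_FORMATS = ("markdown", "text", "html")


def build_input(
    start_urls: list[str],
    max_pages: int = 5,
    max_depth: int = 1,
    output_format: str = "markdown",
    same_domain_only: bool = True,
    respect_robots_txt: bool = True,
    request_timeout_secs: int = 20,
) -> dict[str, Any]:
    """Build an actor input dict after basic local checks."""
    if not start_urls:
        raise ValueError("at least one start URL is required")
    for url in start_urls:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an HTTP(S) URL: {url}")
    if max_pages < 1:
        raise ValueError("max_pages must be at least 1")
    if max_depth < 0:
        raise ValueError("max_depth must be 0 or greater")
    if output_format not in ALLOWED_FORMATS:
        raise ValueError(f"output_format must be one of {ALLOWED_FORMATS}")
    return {
        "startUrls": [{"url": url} for url in start_urls],
        "maxPages": max_pages,
        "maxDepth": max_depth,
        "sameDomainOnly": same_domain_only,
        "outputFormat": output_format,
        "respectRobotsTxt": respect_robots_txt,
        "requestTimeoutSecs": request_timeout_secs,
    }


# maxDepth 0 extracts only the start URLs.
run_input = build_input(["https://example.com"], max_depth=0)
print(run_input)
```

The defaults mirror the example input above, so a first test run stays small unless a limit is raised explicitly.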
Output example
    {
      "url": "https://example.com/",
      "loadedUrl": "https://example.com/",
      "title": "Example Domain",
      "description": null,
      "h1": "Example Domain",
      "text": "Example Domain This domain is for use in illustrative examples in documents.",
      "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
      "links": ["https://www.iana.org/domains/example"],
      "statusCode": 200,
      "contentType": "text/html",
      "depth": 0,
      "parentUrl": null,
      "fetchedAt": "2026-06-20T00:00:00.000Z"
    }
Tips for better crawl results
- Start with maxPages between 5 and 20.
- Use maxDepth: 0 for page extraction only.
- Use maxDepth: 1 for small site sections.
- Keep sameDomainOnly enabled for predictable runs.
- Use include globs for documentation folders.
- Use exclude globs for login, cart, account, and search pages.
- Prefer Markdown for AI and RAG workflows.
- Prefer text for simple keyword or monitoring jobs.
- Prefer HTML only when downstream tools need markup.
Common use cases
Build a RAG content feed
Crawl a public documentation site and export Markdown rows to a vector database pipeline.
Audit SEO metadata
Crawl a content section and review titles, descriptions, H1s, and status codes.
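A minimal sketch of such a review, assuming rows shaped like the output example above. The checks and the 400 threshold are choices made for this example, and `audit_row` is a hypothetical helper.

```python
from typing import Any


def audit_row(row: dict[str, Any]) -> list[str]:
    """Return a list of metadata issues found on one page row."""
    issues: list[str] = []
    if not row.get("title"):
        issues.append("missing title")
    if not row.get("description"):
        issues.append("missing description")
    if not row.get("h1"):
        issues.append("missing h1")
    status = row.get("statusCode")
    if status is not None and status >= 400:
        issues.append(f"status {status}")
    return issues


# Row taken from the output example: description is null there.
page = {
    "url": "https://example.com/",
    "title": "Example Domain",
    "description": None,
    "h1": "Example Domain",
    "statusCode": 200,
}

print(page["url"], audit_row(page))
```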
Monitor public website copy
Run the actor on a schedule and compare page text over time.
Extract documentation pages
Use include globs like https://example.com/docs/** to stay inside a docs section.
Collect link maps
Use links, depth, and parentUrl to understand how a small site section connects.
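As a sketch of that idea, the rows can be grouped by parentUrl to show which page led to which. It assumes start URLs have a null parentUrl, as in the output example. `build_link_tree` and the sample rows are hypothetical.

```python
from collections import defaultdict
from typing import Any, Optional


def build_link_tree(rows: list[dict[str, Any]]) -> dict[Optional[str], list[str]]:
    """Map each parent URL to the URLs discovered from it.

    Start URLs have no parent, so they are grouped under the key None.
    """
    children: dict[Optional[str], list[str]] = defaultdict(list)
    for row in rows:
        children[row.get("parentUrl")].append(row["url"])
    return dict(children)


rows = [
    {"url": "https://example.com/", "depth": 0, "parentUrl": None},
    {"url": "https://example.com/docs", "depth": 1, "parentUrl": "https://example.com/"},
    {"url": "https://example.com/blog", "depth": 1, "parentUrl": "https://example.com/"},
]

tree = build_link_tree(rows)
for parent, urls in tree.items():
    print(parent, "->", urls)
```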
Integrations
Apify datasets
Every page result is saved to the default dataset. Export as JSON, CSV, Excel, XML, RSS, or HTML.
Apify API
Start runs from your own backend and fetch the dataset after completion.
Make and Zapier
Use completed runs as triggers for notifications, indexing, spreadsheets, or content review workflows.
Vector databases
Send Markdown or text fields to Pinecone, Weaviate, Qdrant, Chroma, or your internal embeddings pipeline.
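Before sending content to an embeddings pipeline, pages are often split into smaller chunks. The sketch below shows one simple paragraph-based approach. The chunk size, the record shape, and the helpers `chunk_markdown` and `to_records` are assumptions for illustration, not part of the actor or of any vector database client.

```python
from typing import Any


def chunk_markdown(markdown: str, max_chars: int = 500) -> list[str]:
    """Group paragraphs into chunks of at most `max_chars` characters.

    A single paragraph longer than `max_chars` is kept whole.
    """
    chunks: list[str] = []
    current = ""
    for paragraph in markdown.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        candidate = f"{current}\n\n{paragraph}" if current else paragraph
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = paragraph
    if current:
        chunks.append(current)
    return chunks


def to_records(row: dict[str, Any], max_chars: int = 500) -> list[dict[str, Any]]:
    """Turn one page row into records ready for an embeddings step."""
    content = row.get("markdown") or row.get("text") or ""
    return [
        {
            "id": f"{row['url']}#chunk-{index}",
            "text": chunk,
            "metadata": {"url": row["url"], "title": row.get("title")},
        }
        for index, chunk in enumerate(chunk_markdown(content, max_chars))
    ]


page = {
    "url": "https://example.com/",
    "title": "Example Domain",
    "markdown": "# Example Domain\n\nThis domain is for use in illustrative examples in documents.",
}

records = to_records(page, max_chars=30)
for record in records:
    print(record["id"], repr(record["text"]))
```

Carrying the URL and title in each record's metadata lets search results link back to the source page.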
Monitoring workflows
Schedule recurring runs and compare exported text fields between runs.
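One possible way to compare runs is to hash each page's text and diff the hashes by URL. The sketch below assumes both runs were exported as lists of rows with url and text fields. `fingerprint` and `diff_runs` are hypothetical helpers.

```python
import hashlib
from typing import Any


def fingerprint(rows: list[dict[str, Any]]) -> dict[str, str]:
    """Map each URL to a SHA-256 hash of its text field."""
    return {
        row["url"]: hashlib.sha256((row.get("text") or "").encode("utf-8")).hexdigest()
        for row in rows
    }


def diff_runs(
    previous: list[dict[str, Any]], current: list[dict[str, Any]]
) -> dict[str, list[str]]:
    """Report URLs that were added, removed, or changed between two runs."""
    before, after = fingerprint(previous), fingerprint(current)
    return {
        "added": sorted(after.keys() - before.keys()),
        "removed": sorted(before.keys() - after.keys()),
        "changed": sorted(
            url for url in after.keys() & before.keys() if after[url] != before[url]
        ),
    }


last_week = [
    {"url": "https://example.com/", "text": "Old headline"},
    {"url": "https://example.com/pricing", "text": "Plans"},
]
today = [
    {"url": "https://example.com/", "text": "New headline"},
    {"url": "https://example.com/about", "text": "About us"},
]

changes = diff_runs(last_week, today)
print(changes)
```

Hashing keeps the stored state small; the full text is only needed when a change is reported and someone wants to see what differs.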
API usage with Node.js
    import { ApifyClient } from 'apify-client';

    // Read the API token from the environment.
    const client = new ApifyClient({ token: process.env.APIFY_TOKEN });

    // Start the actor and wait for the run to finish.
    const run = await client.actor('fetch_cat/website-content-crawler-lite').call({
        startUrls: [{ url: 'https://example.com' }],
        maxPages: 5,
        maxDepth: 1,
        outputFormat: 'markdown',
    });

    // Fetch the extracted pages from the run's default dataset.
    const { items } = await client.dataset(run.defaultDatasetId).listItems();
    console.log(items);
API usage with Python
    import os

    from apify_client import ApifyClient

    # Read the API token from the environment.
    client = ApifyClient(os.environ['APIFY_TOKEN'])

    # Start the actor and wait for the run to finish.
    run = client.actor('fetch_cat/website-content-crawler-lite').call(run_input={
        'startUrls': [{'url': 'https://example.com'}],
        'maxPages': 5,
        'maxDepth': 1,
        'outputFormat': 'markdown',
    })

    # Fetch the extracted pages from the run's default dataset.
    items = client.dataset(run['defaultDatasetId']).list_items().items
    print(items)
API usage with cURL
    curl "https://api.apify.com/v2/acts/fetch_cat~website-content-crawler-lite/runs?token=$APIFY_TOKEN" \
      -H 'Content-Type: application/json' \
      -d '{"startUrls": [{"url": "https://example.com"}], "maxPages": 5, "maxDepth": 1, "outputFormat": "markdown"}'
MCP usage
Use this actor from MCP-enabled tools through Apify MCP Server.
MCP URL:
https://mcp.apify.com/?tools=fetch_cat/website-content-crawler-lite
Claude Code MCP setup
    claude mcp add --transport http apify "https://mcp.apify.com/?tools=fetch_cat/website-content-crawler-lite"
Claude Desktop MCP JSON config
    {
      "mcpServers": {
        "apify": {
          "url": "https://mcp.apify.com/?tools=fetch_cat/website-content-crawler-lite"
        }
      }
    }
Example prompts showing MCP usage
- "Use the Apify MCP tool fetch_cat/website-content-crawler-lite to crawl https://docs.apify.com/academy with 10 pages and summarize the Markdown output."
- "Run Website Content Crawler Lite through MCP and extract titles and H1s from the first 10 pages on this site."
- "Use MCP to find all links discovered from this landing page and group them by domain."
Claude Code prompt examples
- "Use Apify MCP to crawl this public documentation section and return a Markdown summary."
- "With the Website Content Crawler Lite MCP tool, extract page titles, H1s, and status codes from this website."
Claude Desktop prompt examples
- "Use Website Content Crawler Lite via Apify MCP to collect clean text from this website section."
- "Run a small MCP crawl and prepare a content audit table with URL, title, H1, and status code."
Legality and responsible crawling
This actor is intended for public web pages only.
Do not use it to bypass logins, paywalls, CAPTCHA, private content, or access controls.
Respect website terms, robots rules, copyright, privacy rules, and applicable laws.
Use conservative page limits and crawl only content you have the right to process.
Troubleshooting
Why did a page have skippedReason?
The page may have been filtered by your globs, disallowed by robots rules, returned an HTTP error, served non-HTML content, or failed before extraction.
Why did I get fewer pages than maxPages?
The site may not have enough in-scope links, your filters may be strict, or the actor may have skipped pages that are not suitable for extraction.
Why is Markdown missing?
Set outputFormat to markdown. The markdown field is populated only in that mode; when outputFormat is text or html, the actor returns that format instead.
Can it crawl private dashboards?
No. This actor is for public pages. It does not accept credentials or browser sessions.
Limits
- Public HTTP/HTTPS pages only.
- No login or private account access.
- No CAPTCHA bypass.
- No file download extraction in the first version.
- Very JavaScript-heavy pages may return limited text if the content is not present in the public response.
FAQ
Is this a full website crawler?
It is a lightweight crawler for public website content. It is best for controlled page limits and scoped crawling.
Can I crawl multiple domains?
Yes. Add multiple start URLs. With sameDomainOnly enabled, the crawl from each start URL stays on that URL's domain.
Can I crawl only one folder?
Yes. Use includeGlobs, for example https://example.com/docs/**.
Do skipped pages cost the same as extracted pages?
No. The per-page extraction charge is applied only after a successful HTML page extraction.
What format should I use for LLMs?
Markdown is usually the best first choice for LLM, RAG, and documentation workflows.
What should I do before a large crawl?
Run a small crawl first, review the dataset, then increase maxPages gradually.