Website Content Crawler
Pricing
Pay per usage
Universal website crawler that extracts clean text/markdown content, metadata, links, and images from any URL. Features sitemap parsing, robots.txt compliance, and multi-page breadth-first (BFS) crawling with depth control.
Developer
Ali haydar Karadaş
Maintained by Community
Website Content Crawler extracts clean text and markdown content from any website, along with metadata, links, and images. Whether you need to scrape a single page, crawl an entire site, or parse a sitemap, this actor handles it with minimal setup.
What does Website Content Crawler do?
This actor provides four endpoints that cover different content extraction needs:
- Crawl Page (crawl_page) scrapes a single URL and returns the page content, metadata, links, and images.
- Crawl Site (crawl_site) follows links from a starting URL and crawls multiple pages up to a configurable depth and page limit.
- Get Sitemap (get_sitemap) parses a site's sitemap.xml and returns all listed URLs with their last modified dates, change frequencies, and priorities.
- Extract Content (extract_content) pulls just the main content from a page in either plain text or markdown format.
The crawler respects robots.txt by default (configurable), extracts Open Graph and meta tags, identifies internal vs. external links, and captures image alt text and dimensions. Output is clean and structured -- ready for AI training data, content analysis, SEO audits, or database storage.
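The robots.txt check and the internal vs. external link split described above can be illustrated with Python's standard library. This is a minimal sketch, not the actor's actual implementation: the helper names `is_allowed` and `is_internal`, the user agent string, and the sample rules are all assumptions made for illustration.

```python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Check a URL against robots.txt rules supplied as text (no network access)."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


def is_internal(page_url: str, link_url: str) -> bool:
    """Treat a link as internal when it shares the page's host (an assumed rule)."""
    return urlparse(page_url).netloc == urlparse(link_url).netloc


# Hypothetical robots.txt content for illustration.
ROBOTS_TXT = """User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    page = "https://example.com/blog/post"
    print(is_allowed(ROBOTS_TXT, "example-crawler", page))
    print(is_allowed(ROBOTS_TXT, "example-crawler", "https://example.com/private/page"))
    print(is_internal(page, "https://example.com/about"))
    print(is_internal(page, "https://other.org/"))
```

A real crawler would also have to decide how to treat subdomains and relative links; this sketch compares hosts exactly and expects absolute URLs.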
What data do you get?
Page content:
- url, title, description
- text_content -- extracted plain text
- markdown_content -- content converted to markdown
- author, published_date, language
- word_count, char_count
Page metadata:
- status_code, content_type, response_time_ms
- canonical_url, og_tags, meta_tags
Links found on page:
- url, text, is_internal, is_nofollow
Images found on page:
- url, alt_text, width, height
Sitemap data:
- url, lastmod, changefreq, priority
Crawl summary:
- start_url, pages_crawled, total_links
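To make the sitemap fields listed above concrete, here is a small sketch that reads a sitemap.xml document into records with the same four keys (url, lastmod, changefreq, priority). It uses only the standard library and the public sitemaps.org namespace. The function name `parse_sitemap`, the sample XML, and the choice to return priority as a float are assumptions; the actor's own parsing and value types may differ.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def _child_text(parent: ET.Element, tag: str):
    """Return the stripped text of a namespaced child element, or None if absent."""
    child = parent.find(f"sm:{tag}", SITEMAP_NS)
    if child is None or child.text is None:
        return None
    return child.text.strip()


def parse_sitemap(xml_text: str) -> list:
    """Convert <url> entries into dicts with url, lastmod, changefreq, priority."""
    root = ET.fromstring(xml_text)
    entries = []
    for url_el in root.findall("sm:url", SITEMAP_NS):
        priority = _child_text(url_el, "priority")
        entries.append({
            "url": _child_text(url_el, "loc"),
            "lastmod": _child_text(url_el, "lastmod"),
            "changefreq": _child_text(url_el, "changefreq"),
            "priority": float(priority) if priority is not None else None,
        })
    return entries


# Hypothetical sitemap used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-10</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/blog</loc>
  </url>
</urlset>"""

if __name__ == "__main__":
    for entry in parse_sitemap(SAMPLE_SITEMAP):
        print(entry)
```

Only `loc` is required by the protocol, so the optional fields come back as None when a site omits them.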
Who is this for?
- AI and ML engineers -- collect training data from websites in clean text or markdown format
- SEO professionals -- audit site structure, meta tags, internal linking, and content quality
- Content analysts -- extract and compare content across competitor websites
- Researchers -- build text corpora from web sources for academic or commercial analysis
- Developers -- integrate website content extraction into pipelines, chatbots, or knowledge bases
How to use it
- Open the actor in Apify Console and select an endpoint (crawl_page, crawl_site, get_sitemap, or extract_content).
- Enter the URL you want to crawl or extract content from.
- For crawl_site, set the crawl depth and page limit.
- Click "Start" to run the crawler.
- Export results as JSON from the Dataset tab or use the Apify API.
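For the API route mentioned in the last step, the sketch below builds (but does not send) an HTTP request for Apify's synchronous "run actor and return dataset items" endpoint. The actor ID `username~website-content-crawler` and the token value are placeholders, not values taken from this listing, and `build_run_request` is a hypothetical helper name. Check the actor's API tab in Apify Console for the real ID.

```python
import json
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> Request:
    """Build a POST request that runs an actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_run_request(
        "username~website-content-crawler",  # placeholder actor ID (assumption)
        "YOUR_APIFY_TOKEN",                  # placeholder token
        {"endpoint": "crawl_site", "url": "https://example.com", "depth": 2, "limit": 20},
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request), then json.load the response.
```

Synchronous runs suit short crawls. For longer ones, starting the run and polling its dataset is likely the safer pattern.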
Input parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| endpoint | string | crawl_page | API endpoint: crawl_page, crawl_site, get_sitemap, or extract_content |
| url | string | -- | The URL to crawl or extract content from (required) |
| depth | integer | 1 | Maximum crawl depth, 1-5 (crawl_site only) |
| limit | integer | 10 | Maximum number of pages to crawl, 1-100 (crawl_site only) |
| output_format | string | text | Output format for extract_content: text or markdown |
| respect_robots | boolean | true | Whether to respect robots.txt rules |
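The table above can be turned into a small input builder that applies the documented defaults and ranges before a run is started. This is a sketch under stated assumptions: `build_input` is a hypothetical helper, and the actor may validate its input differently or report errors in another way.

```python
ENDPOINTS = {"crawl_page", "crawl_site", "get_sitemap", "extract_content"}
OUTPUT_FORMATS = {"text", "markdown"}


def build_input(url, endpoint="crawl_page", depth=1, limit=10,
                output_format="text", respect_robots=True) -> dict:
    """Return an input dict using the defaults and ranges from the parameter table."""
    if not url:
        raise ValueError("url is required")
    if endpoint not in ENDPOINTS:
        raise ValueError(f"endpoint must be one of {sorted(ENDPOINTS)}")
    if not 1 <= depth <= 5:
        raise ValueError("depth must be between 1 and 5")
    if not 1 <= limit <= 100:
        raise ValueError("limit must be between 1 and 100")
    if output_format not in OUTPUT_FORMATS:
        raise ValueError("output_format must be 'text' or 'markdown'")
    return {
        "endpoint": endpoint,
        "url": url,
        "depth": depth,                  # only used by crawl_site
        "limit": limit,                  # only used by crawl_site
        "output_format": output_format,  # only used by extract_content
        "respect_robots": bool(respect_robots),
    }


if __name__ == "__main__":
    print(build_input("https://example.com", endpoint="crawl_site", depth=2, limit=50))
```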
Sample output
{
  "url": "https://example.com/blog/intro-to-web-scraping",
  "content": {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "title": "Introduction to Web Scraping",
    "description": "A beginner's guide to web scraping with Python",
    "text_content": "Web scraping is the process of extracting data from websites...",
    "markdown_content": "# Introduction to Web Scraping\n\nWeb scraping is the process...",
    "author": "Jane Smith",
    "published_date": "2026-05-10",
    "language": "en",
    "word_count": 1245,
    "char_count": 7830
  },
  "metadata": {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "status_code": 200,
    "content_type": "text/html",
    "response_time_ms": 234.5,
    "canonical_url": "https://example.com/blog/intro-to-web-scraping",
    "og_tags": {
      "og:title": "Introduction to Web Scraping",
      "og:type": "article"
    },
    "meta_tags": {
      "description": "A beginner's guide to web scraping with Python"
    }
  },
  "links": [
    {
      "url": "https://example.com/blog/advanced-scraping",
      "text": "Advanced Scraping Techniques",
      "is_internal": true,
      "is_nofollow": false
    }
  ],
  "images": [
    {
      "url": "https://example.com/images/scraping-diagram.png",
      "alt_text": "Web scraping workflow diagram",
      "width": 800,
      "height": 450
    }
  ]
}
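A result shaped like the sample above can be post-processed with a few lines of Python, for example to pull out internal links or flag images without alt text for an SEO audit. The helper `summarize_result` and the trimmed record below are illustrative assumptions. Only the field names come from the sample output.

```python
import json


def summarize_result(result: dict) -> dict:
    """Extract a few audit-friendly facts from one crawl_page style record."""
    content = result.get("content", {})
    links = result.get("links", [])
    images = result.get("images", [])
    return {
        "title": content.get("title"),
        "word_count": content.get("word_count"),
        "internal_links": [link["url"] for link in links if link.get("is_internal")],
        "external_links": [link["url"] for link in links if not link.get("is_internal")],
        "images_missing_alt": [img["url"] for img in images if not img.get("alt_text")],
    }


# Trimmed record using the same field names as the sample output.
RECORD = json.loads("""
{
  "url": "https://example.com/blog/intro-to-web-scraping",
  "content": {"title": "Introduction to Web Scraping", "word_count": 1245},
  "links": [
    {"url": "https://example.com/blog/advanced-scraping", "text": "Advanced Scraping Techniques",
     "is_internal": true, "is_nofollow": false}
  ],
  "images": [
    {"url": "https://example.com/images/scraping-diagram.png",
     "alt_text": "Web scraping workflow diagram", "width": 800, "height": 450}
  ]
}
""")

if __name__ == "__main__":
    print(summarize_result(RECORD))
```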
How much does it cost?
Each result costs $0.002. Crawling 1,000 pages costs just $2, and 10,000 pages costs $20.
Apify gives every new user $5 in free monthly credits, so you can crawl about 2,500 pages for free.
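The arithmetic behind these figures is simple enough to check in code. The sketch below assumes the per-result price and free credit quoted above stay fixed and that one crawled page counts as one result. `Decimal` is used so that currency amounts are not affected by floating-point rounding.

```python
from decimal import Decimal

PRICE_PER_RESULT = Decimal("0.002")   # USD per result, as quoted above
FREE_MONTHLY_CREDIT = Decimal("5")    # USD of free monthly credit, as quoted above


def cost_for(pages: int) -> Decimal:
    """Cost in USD for a given number of results."""
    return PRICE_PER_RESULT * pages


def pages_within(budget: Decimal) -> int:
    """Number of whole results that fit into a budget."""
    return int(budget // PRICE_PER_RESULT)


if __name__ == "__main__":
    for pages in (1_000, 10_000):
        print(f"{pages} pages: ${cost_for(pages):.2f}")
    print(f"Pages covered by free credit: {pages_within(FREE_MONTHLY_CREDIT)}")
```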
Common questions
Can I get the content in markdown format? Yes. Use the extract_content endpoint and set output_format to "markdown". The crawl_page endpoint also returns markdown_content alongside the plain text by default.
Does it follow links across different domains? The crawl_site endpoint only follows internal links (same domain). External links are captured in the output but not followed. This prevents the crawl from spiraling across the entire web.
Does it handle JavaScript-rendered pages? The crawler works with server-rendered HTML. Pages that require JavaScript execution to load content may return incomplete results. For heavy SPA sites, consider using a browser-based crawler instead.
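The same-domain rule and the depth and page limits discussed in the questions above can be pictured as a breadth-first traversal. The sketch below runs over an in-memory link graph rather than the network and is not the actor's code. `bfs_crawl`, `get_links`, and the sample graph are hypothetical, and it assumes the start page is depth 0, which may not match how the actor counts depth.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_crawl(start_url, get_links, max_depth=1, limit=10):
    """Visit pages breadth-first, following only links on the start URL's host."""
    start_host = urlparse(start_url).netloc
    visited = {start_url}
    queue = deque([(start_url, 0)])
    crawled = []
    while queue and len(crawled) < limit:
        url, depth = queue.popleft()
        crawled.append(url)
        if depth >= max_depth:
            continue  # do not expand links beyond the depth limit
        for link in get_links(url):
            if urlparse(link).netloc != start_host:
                continue  # external links are recorded elsewhere, never followed
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))
    return crawled


# Hypothetical link graph standing in for real page fetches.
LINKS = {
    "https://example.com/": [
        "https://example.com/a",
        "https://example.com/b",
        "https://other.org/x",
    ],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": [],
}

if __name__ == "__main__":
    fetch = lambda url: LINKS.get(url, [])
    print(bfs_crawl("https://example.com/", fetch, max_depth=2, limit=10))
```

Because the queue is first-in, first-out, every page at one depth is visited before any page at the next, so a low page limit favors pages closest to the start URL.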
Contact & Custom Solutions
Need a custom scraper, higher volume, or a specific integration? We're here to help.
If anything isn't working right or you need support, don't hesitate to reach out.
- Telegram: t.me/novashield_dev
- Email: novashield.dev@gmail.com