Pricing

Pay per usage

Website Content Crawler

Universal website crawler that extracts clean text or markdown content, metadata, links, and images from a URL. Features sitemap parsing, robots.txt compliance, and multi-page breadth-first (BFS) crawling with depth control.

Rating

0.0

(0)

Developer

Ali haydar Karadaş

Last modified

a day ago

What does Website Content Crawler do?

This actor provides four endpoints that cover different content extraction needs. Crawl Page scrapes a single URL and returns the page content, metadata, links, and images. Crawl Site follows links from a starting URL and crawls multiple pages up to a configurable depth and page limit. Get Sitemap parses a site's sitemap.xml and returns all listed URLs with their last modified dates, change frequencies, and priorities. Extract Content pulls just the main content from a page in either plain text or markdown format.

The crawler respects robots.txt by default (configurable), extracts Open Graph and meta tags, identifies internal vs. external links, and captures image alt text and dimensions. Output is clean and structured -- ready for AI training data, content analysis, SEO audits, or database storage.
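To illustrate what respecting robots.txt means in practice, here is a minimal sketch using Python's standard urllib.robotparser. It is not the actor's own implementation; the function name is_allowed and the sample rules are assumptions made for this example.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if the given robots.txt text permits fetching `url`."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules, used only for illustration.
SAMPLE_ROBOTS = """\
User-agent: *
Disallow: /private/
"""

print(is_allowed(SAMPLE_ROBOTS, "https://example.com/blog/post"))
print(is_allowed(SAMPLE_ROBOTS, "https://example.com/private/report"))
```

With respect_robots set to true, a crawler would be expected to skip URLs for which a check like this returns False.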

What data do you get?

Page content:

url, title, description
text_content -- extracted plain text
markdown_content -- content converted to markdown
author, published_date, language
word_count, char_count

Page metadata:

status_code, content_type, response_time_ms
canonical_url, og_tags, meta_tags

Links found on page:

url, text, is_internal, is_nofollow

Images found on page:

url, alt_text, width, height

Sitemap data:

url, lastmod, changefreq, priority

Crawl summary:

start_url, pages_crawled, total_links
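The sitemap fields above (url, lastmod, changefreq, priority) map directly onto the elements of a standard sitemap.xml. The sketch below shows one way such a file could be parsed with Python's standard library. It only illustrates the data shape and is not the actor's code; parse_sitemap and the sample XML are assumptions.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def _text(node, tag):
    """Return the stripped text of a child element, or None if it is absent."""
    value = node.findtext(f"sm:{tag}", default=None, namespaces=NS)
    return value.strip() if value else None


def parse_sitemap(xml_text: str) -> list:
    """Turn sitemap XML into a list of dicts with the four fields listed above."""
    root = ET.fromstring(xml_text)
    entries = []
    for node in root.findall("sm:url", NS):
        priority = _text(node, "priority")
        entries.append({
            "url": _text(node, "loc"),
            "lastmod": _text(node, "lastmod"),
            "changefreq": _text(node, "changefreq"),
            "priority": float(priority) if priority else None,
        })
    return entries


# Hypothetical sitemap, used only for illustration.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2026-05-10</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://example.com/about</loc>
  </url>
</urlset>"""

for entry in parse_sitemap(SAMPLE_SITEMAP):
    print(entry)
```

Optional elements that a sitemap omits come back as None in this sketch; how the actor represents missing values is not stated in the listing.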

Who is this for?

AI and ML engineers -- collect training data from websites in clean text or markdown format
SEO professionals -- audit site structure, meta tags, internal linking, and content quality
Content analysts -- extract and compare content across competitor websites
Researchers -- build text corpora from web sources for academic or commercial analysis
Developers -- integrate website content extraction into pipelines, chatbots, or knowledge bases

How to use it

Open the actor in Apify Console and select an endpoint (crawl_page, crawl_site, get_sitemap, or extract_content).
Enter the URL you want to crawl or extract content from.
For crawl_site, set the crawl depth and page limit.
Click "Start" to run the crawler.
Export results as JSON from the Dataset tab or use the Apify API.
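The same steps can be scripted against the Apify API instead of the Console. The sketch below builds a request for Apify's run-sync-get-dataset-items endpoint using only the standard library. The actor ID shown is a placeholder, because this listing does not state the actor's ID, and build_run_request is a hypothetical helper name.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build a POST request that runs an actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


# Placeholder values: replace with the real actor ID (username~actor-name) and your API token.
request = build_run_request(
    "your-username~website-content-crawler",
    "YOUR_APIFY_TOKEN",
    {"endpoint": "crawl_page", "url": "https://example.com"},
)
print(request.get_method(), request.full_url)

# To actually send the request (requires network access and a valid token):
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```

The synchronous endpoint suits short runs such as a single crawl_page call; for longer crawl_site runs, starting the run and reading the dataset afterwards may be the safer pattern.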

Input parameters

Parameter	Type	Default	Description
endpoint	string	crawl_page	API endpoint: crawl_page, crawl_site, get_sitemap, or extract_content
url	string	--	The URL to crawl or extract content from (required)
depth	integer	1	Maximum crawl depth, 1-5 (crawl_site only)
limit	integer	10	Maximum number of pages to crawl, 1-100 (crawl_site only)
output_format	string	text	Output format for extract_content: text or markdown
respect_robots	boolean	true	Whether to respect robots.txt rules
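As a rough illustration of how the parameters in the table fit together, the sketch below assembles an input object and checks the documented ranges before a run is started. The helper name build_input is hypothetical, and leaving out endpoint-specific fields when they do not apply is an assumption rather than documented behavior.

```python
ENDPOINTS = {"crawl_page", "crawl_site", "get_sitemap", "extract_content"}
OUTPUT_FORMATS = {"text", "markdown"}


def build_input(url, endpoint="crawl_page", depth=1, limit=10,
                output_format="text", respect_robots=True):
    """Validate parameters against the table above and return the input dict."""
    if not url:
        raise ValueError("url is required")
    if endpoint not in ENDPOINTS:
        raise ValueError(f"unknown endpoint: {endpoint}")

    run_input = {"endpoint": endpoint, "url": url, "respect_robots": respect_robots}

    if endpoint == "crawl_site":
        # depth and limit apply to crawl_site only.
        if not 1 <= depth <= 5:
            raise ValueError("depth must be between 1 and 5")
        if not 1 <= limit <= 100:
            raise ValueError("limit must be between 1 and 100")
        run_input["depth"] = depth
        run_input["limit"] = limit

    if endpoint == "extract_content":
        # output_format applies to extract_content only.
        if output_format not in OUTPUT_FORMATS:
            raise ValueError("output_format must be 'text' or 'markdown'")
        run_input["output_format"] = output_format

    return run_input


print(build_input("https://example.com", endpoint="crawl_site", depth=2, limit=50))
```

Validating locally catches out-of-range values such as depth 6 before a run is started.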

Sample output

{
  "url": "https://example.com/blog/intro-to-web-scraping",
  "content": {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "title": "Introduction to Web Scraping",
    "description": "A beginner's guide to web scraping with Python",
    "text_content": "Web scraping is the process of extracting data from websites...",
    "markdown_content": "# Introduction to Web Scraping\n\nWeb scraping is the process...",
    "author": "Jane Smith",
    "published_date": "2026-05-10",
    "language": "en",
    "word_count": 1245,
    "char_count": 7830
  },
  "metadata": {
    "url": "https://example.com/blog/intro-to-web-scraping",
    "status_code": 200,
    "content_type": "text/html",
    "response_time_ms": 234.5,
    "canonical_url": "https://example.com/blog/intro-to-web-scraping",
    "og_tags": {
      "og:title": "Introduction to Web Scraping",
      "og:type": "article"
    },
    "meta_tags": {
      "description": "A beginner's guide to web scraping with Python"
    }
  },
  "links": [
    {
      "url": "https://example.com/blog/advanced-scraping",
      "text": "Advanced Scraping Techniques",
      "is_internal": true,
      "is_nofollow": false
    }
  ],
  "images": [
    {
      "url": "https://example.com/images/scraping-diagram.png",
      "alt_text": "Web scraping workflow diagram",
      "width": 800,
      "height": 450
    }
  ]
}
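A result shaped like the sample above can be post-processed with a few lines of Python. The sketch below splits links into internal and external and flags images without alt text, the kind of check an SEO audit might run. The function name summarize and the trimmed sample item are assumptions for illustration.

```python
def summarize(item: dict) -> dict:
    """Condense one crawl_page result into a few audit-friendly fields."""
    content = item.get("content", {})
    links = item.get("links", [])
    images = item.get("images", [])
    return {
        "title": content.get("title"),
        "word_count": content.get("word_count"),
        "internal_links": [link["url"] for link in links if link.get("is_internal")],
        "external_links": [link["url"] for link in links if not link.get("is_internal")],
        "images_missing_alt": [img["url"] for img in images if not img.get("alt_text")],
    }


# Trimmed, partly hypothetical item following the structure of the sample output.
sample_item = {
    "content": {"title": "Introduction to Web Scraping", "word_count": 1245},
    "links": [
        {"url": "https://example.com/blog/advanced-scraping", "is_internal": True},
        {"url": "https://other.example.org/tools", "is_internal": False},
    ],
    "images": [
        {"url": "https://example.com/images/scraping-diagram.png",
         "alt_text": "Web scraping workflow diagram"},
        {"url": "https://example.com/images/banner.png", "alt_text": ""},
    ],
}

print(summarize(sample_item))
```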

How much does it cost?

Each result costs $0.002. Crawling 1,000 pages costs just $2, and 10,000 pages costs $20.

Apify gives every new user $5 in free monthly credits, so you can crawl about 2,500 pages for free.
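The arithmetic behind these figures can be checked with a short sketch. It assumes one billed result per crawled page, as the figures above imply, and uses Decimal to avoid floating-point rounding. The function names are hypothetical.

```python
from decimal import Decimal

PRICE_PER_RESULT = Decimal("0.002")  # USD per result, from the pricing above


def estimate_cost(results: int) -> Decimal:
    """Cost in USD for a given number of results."""
    return PRICE_PER_RESULT * results


def results_within_budget(budget_usd) -> int:
    """Number of whole results that fit into a budget in USD."""
    return int(Decimal(str(budget_usd)) // PRICE_PER_RESULT)


print(estimate_cost(1_000))
print(estimate_cost(10_000))
print(results_within_budget(5))
```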

Common questions

Can I get the content in markdown format? Yes. Use the extract_content endpoint and set output_format to "markdown". The crawl_page endpoint also returns markdown_content alongside the plain text by default.

Does it follow links across different domains? The crawl_site endpoint only follows internal links (same domain). External links are captured in the output but not followed. This prevents the crawl from spiraling across the entire web.
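The behavior described above can be pictured as a breadth-first search restricted to the starting domain. The sketch below runs over an in-memory link graph rather than the network and is not the actor's implementation. It assumes the start URL sits at depth 0, so depth=1 means following links one hop from it; the actor's exact depth semantics are not stated in this listing.

```python
from collections import deque
from urllib.parse import urlparse


def bfs_crawl(start_url, get_links, depth=1, limit=10):
    """Visit pages breadth-first, following only same-domain links."""
    domain = urlparse(start_url).netloc
    seen = {start_url}
    queue = deque([(start_url, 0)])
    visited = []
    while queue and len(visited) < limit:
        url, level = queue.popleft()
        visited.append(url)
        if level >= depth:
            continue  # do not expand pages at the maximum depth
        for link in get_links(url):
            # External links are ignored here; a real crawler would still record them.
            if urlparse(link).netloc == domain and link not in seen:
                seen.add(link)
                queue.append((link, level + 1))
    return visited


# Hypothetical link graph standing in for real pages.
GRAPH = {
    "https://example.com/": [
        "https://example.com/a",
        "https://other.org/x",
        "https://example.com/b",
    ],
    "https://example.com/a": ["https://example.com/c"],
    "https://example.com/b": [],
    "https://example.com/c": [],
}

print(bfs_crawl("https://example.com/", lambda u: GRAPH.get(u, []), depth=2, limit=10))
```

The seen set keeps each page from being queued twice, and the limit check caps the total number of pages regardless of depth.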

Does it handle JavaScript-rendered pages? The crawler works with server-rendered HTML. Pages that require JavaScript execution to load content may return incomplete results. For heavy SPA sites, consider using a browser-based crawler instead.

Contact & Custom Solutions

Need a custom scraper, higher volume, or a specific integration? We're here to help.

If anything isn't working right or you need support, don't hesitate to reach out.

Telegram: t.me/novashield_dev
Email: novashield.dev@gmail.com
