🧠 Smart Article Extractor

Pricing: from $4.99 / 1,000 results

Rating: 0.0 (0)
Developer: Scraper Engine
Maintained by Community
Actor stats: 0 bookmarked, 2 total users, 1 monthly active user
Last modified: 9 days ago

🧠 Smart Article Extractor — News & Blog Scraper

One-paragraph summary: Smart Article Extractor is an Apify Actor that bulk-extracts clean article content — title, author, publish date, full text, summary, images, videos, in-body links and rich metadata — from any news site, blog or sitemap. Point it at a homepage / section / topic URL and it will discover, classify and extract every article automatically using a BFS crawler, sitemap scanning, and configurable URL-shape heuristics.


🚀 Why Choose Us?

| Feature | Smart Article Extractor | Typical 1-URL article scraper |
|---|---|---|
| Bulk discovery (BFS crawler) | ✅ Yes | ❌ One URL at a time |
| Sitemap & robots.txt scanning | ✅ Built-in | |
| Sub-domain / sub-path scoping | ✅ Per Start URL | |
| onlyNewArticles cross-run dedup | ✅ Per-domain & global | |
| Date filters (dateFrom, lastDays, mustHaveDate) | ✅ All three | ⚠️ Limited |
| Anti-block proxy fallback (none → DC → RES) | ✅ Automatic | |
| Optional Playwright rendering | ✅ Toggle | |
| Extend-output Python hook | ✅ Inline snippet | |
| Live dataset push + state KVS | | ⚠️ |

🔥 Key Features

  • 📰 Clean article extraction: trafilatura + BeautifulSoup combo for high recall.
  • 🌐 Bulk discovery: drop in a homepage URL and the actor discovers articles via BFS.
  • 🗺️ Sitemap & robots.txt: automatic Sitemap: parsing plus common sitemap candidates.
  • 🛡️ Smart proxy fallback: starts direct, then datacenter, then residential.
  • 🎭 Headless browser mode: Playwright + Chromium for JS-heavy or protected sites.
  • 🧠 Cross-run memory: onlyNewArticles and onlyNewArticlesPerDomain.
  • 🪜 Depth / page / article caps: never over-crawl.
  • 📅 Date filters: dateFrom, onlyArticlesForLastDays, mustHaveDate.
  • 🛠️ extendOutputFunction: inject your own Python extend(soup, article, html).
  • 💾 Save HTML / snapshots: full HTML in-record or as a KVS link, plus PNG screenshots.
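
The bulk discovery and cap features above can be pictured as a bounded breadth-first search. The sketch below is an illustration only, not the actor's actual code: `crawl`, `get_links` and `is_article` are hypothetical names, and the link graph is supplied by the caller instead of being fetched over HTTP. The defaults mirror maxDepth, maxPagesPerCrawl and maxArticlesPerCrawl from the Input table.

```python
from collections import deque
from typing import Callable, Iterable


def crawl(
    start_urls: Iterable[str],
    get_links: Callable[[str], list[str]],  # hypothetical: links found on a page
    is_article: Callable[[str], bool],      # hypothetical: URL-shape check
    max_depth: int = 2,
    max_pages: int = 50,
    max_articles: int = 25,
) -> list[str]:
    """Bounded BFS: stops at max_pages fetched or max_articles saved."""
    queue = deque((url, 0) for url in start_urls)
    seen = {url for url, _ in queue}
    articles: list[str] = []
    pages_fetched = 0

    while queue and pages_fetched < max_pages and len(articles) < max_articles:
        url, depth = queue.popleft()
        pages_fetched += 1  # every dequeued URL counts as one fetched page
        if is_article(url):
            articles.append(url)
        if depth >= max_depth:
            continue  # do not enqueue links beyond the depth cap
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return articles
```

Because the queue is first-in, first-out, pages closest to the Start URL are visited first, so the caps cut off the deepest pages rather than the nearest ones.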

📥 Input

| Field | Type | Default | Description |
|---|---|---|---|
| startUrls | array | required | Homepages, sections, topic pages, used as crawl seeds. |
| articleUrls | array | [] | Direct article URLs to extract (no discovery needed). |
| onlyNewArticles | boolean | false | Skip URLs already seen in any previous run. |
| onlyNewArticlesPerDomain | boolean | false | Per-domain dedup memory. |
| onlyInsideArticles | boolean | true | Enqueue only same-domain links from articles. |
| onlySubdomainArticles | boolean | false | Restrict to URLs sharing the Start URL path prefix. |
| enqueueFromArticles | boolean | true | Discover further links inside extracted articles. |
| crawlWholeSubdomain | boolean | true | Treat any same-subdomain link as a category candidate. |
| scanSitemaps | boolean | true | Discover articles from robots.txt and common sitemap paths. |
| useGoogleBotHeaders | boolean | true | Identify as Googlebot. |
| useBrowser | boolean | false | Render with headless Chromium. |
| scrollToBottom | boolean | false | Scroll to force lazy-loaded content (browser mode only). |
| mustHaveDate | boolean | false | Drop articles with no detectable date. |
| dateFrom | string (ISO date) | | Earliest article date. |
| onlyArticlesForLastDays | integer | | Convenience cut-off: keep only articles from the last N days. |
| minWords | integer | 150 | Reject articles shorter than this word count. |
| maxDepth | integer | 2 | BFS depth. |
| maxPagesPerCrawl | integer | 50 | Hard cap on fetched pages. |
| maxArticlesPerCrawl | integer | 25 | Hard cap on saved articles. |
| maxArticlesPerStartUrl | integer | 25 | Cap per Start URL. |
| isUrlArticleDefinition | object | see schema | URL-shape heuristic. |
| linkSelector | string | | CSS selector restricting where links are collected from. |
| pseudoUrls | array | [] | Custom URL patterns for category pages. |
| sitemapUrls | array | [] | Explicit sitemap URLs (skip auto-discovery). |
| saveHtml | boolean | false | Include raw HTML in the dataset record. |
| saveHtmlAsLink | boolean | false | Save HTML to KVS and put a link in the record. |
| saveSnapshots | boolean | false | PNG screenshot (browser mode only). |
| extendOutputFunction | string | | Python snippet; must define extend(soup, article, html) -> dict. |
| proxyConfiguration | object | {useApifyProxy: false} | Default is no proxy; falls back automatically to datacenter, then residential, if blocked. |

Example input:

{
  "startUrls": [{ "url": "https://www.theguardian.com" }],
  "onlyArticlesForLastDays": 2,
  "minWords": 150,
  "maxArticlesPerCrawl": 5,
  "useGoogleBotHeaders": true,
  "scanSitemaps": true,
  "proxyConfiguration": { "useApifyProxy": false }
}
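
To make the scanSitemaps option more concrete, here is a minimal sketch of how sitemap candidates could be gathered from a robots.txt body. It is an assumption-laden illustration: `sitemap_candidates` and `COMMON_SITEMAP_PATHS` are hypothetical names, and the fallback paths are guesses, not the actor's actual list.

```python
from urllib.parse import urljoin

# Assumption: illustrative fallback paths, not the actor's real candidate list.
COMMON_SITEMAP_PATHS = ("/sitemap.xml", "/sitemap_index.xml")


def sitemap_candidates(base_url: str, robots_txt: str) -> list[str]:
    """Collect URLs from 'Sitemap:' lines, then append common fallback paths."""
    found: list[str] = []
    for line in robots_txt.splitlines():
        # Split on the first colon only, so the URL's own colons stay intact.
        key, _, value = line.partition(":")
        if key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    for path in COMMON_SITEMAP_PATHS:
        candidate = urljoin(base_url, path)
        if candidate not in found:
            found.append(candidate)
    return found
```

Passing explicit sitemapUrls would make a step like this unnecessary, since the Input table says that option skips auto-discovery.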

📤 Output

Each pushed record contains:

| Field | Type | Description |
|---|---|---|
| url, loadedUrl | string | Original / resolved URL. |
| domain, loadedDomain | string | Bare host. |
| referrer, startUrl | string | Where the link was discovered. |
| depth | integer | BFS depth at time of crawl. |
| title, softTitle | string | Best-effort headline. |
| date | string (ISO) | Publication date if found. |
| author | array | Author URL(s) or name(s). |
| publisher, copyright, lang, favicon, canonicalLink | string | Site metadata. |
| description, keywords | string | Meta description / keywords. |
| tags | array | article:tag values. |
| image | string | Hero / OG image URL. |
| videos | array | <video> / <iframe> / <source> URLs. |
| links | array of {text, href} | Inner-body links. |
| wordCount | integer | Word count of the extracted text. |
| text | string | Cleaned article body. |
| html | string | Full HTML (only if saveHtml / saveHtmlAsLink). |
| screenshotUrl | string | KVS link (only if saveSnapshots + useBrowser). |

Example output (truncated):

{
  "url": "https://www.theguardian.com/lifeandstyle/2026/may/21/how-often-should-you-go-to-the-toilet…",
  "domain": "theguardian.com",
  "title": "How often should you go to the toilet?…",
  "date": "2026-05-21T04:00:02.000Z",
  "author": ["https://www.theguardian.com/profile/sarahphillips"],
  "publisher": "the Guardian",
  "wordCount": 1620,
  "text": "Think balance, diversity and routine. \"Our gut is a complex machine,\" says…",
  "image": "https://i.guim.co.uk/img/media/…"
}

🚀 How to Use (Apify Console)

  1. Log in at https://console.apify.com → Actors.
  2. Open Smart Article Extractor.
  3. Configure inputs (Start URLs, date filters, caps, proxy).
  4. Click Start.
  5. Watch logs in real time — the actor prints a per-article live feed.
  6. Open the Output tab once the run completes.
  7. Export to JSON / CSV / XLSX or wire to a webhook.

🤖 Use via API / MCP

curl -X POST "https://api.apify.com/v2/acts/<USERNAME>~smart-article-extractor/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "startUrls": [{"url": "https://www.theguardian.com"}],
    "maxArticlesPerCrawl": 5,
    "onlyArticlesForLastDays": 2,
    "proxyConfiguration": {"useApifyProxy": false}
  }'
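
The same request can be sent from Python using only the standard library. This is a sketch of the curl call above, not an official client: `build_request` and `run_actor` are hypothetical helper names, and the token is assumed to be in the APIFY_TOKEN environment variable.

```python
import json
import os
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"


def build_request(actor_id: str, token: str, run_input: dict) -> urllib.request.Request:
    """Build the POST request for the run-sync-get-dataset-items endpoint."""
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{actor_id}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(actor_id: str, run_input: dict) -> list[dict]:
    """Send the request and return the dataset items (makes a network call)."""
    request = build_request(actor_id, os.environ["APIFY_TOKEN"], run_input)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Calling `run_actor("<USERNAME>~smart-article-extractor", {...})` with the JSON body from the curl example should return the extracted records as a list of dicts; replace `<USERNAME>` with the actor owner's username.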

MCP-server tool name: smart-article-extractor.


💡 Best Use Cases

  • 📰 News monitoring on a topic / publisher
  • 📊 NLP / sentiment / summarisation datasets
  • 🏛️ Brand or competitor coverage tracking
  • 🔍 SEO / SERP enrichment with full article text
  • 📚 Knowledge-base construction for RAG / LLMs
  • 🗞️ Press-clipping archives

💰 Pricing

Pay-per-usage. You only pay the Apify platform charges (compute time + proxies + transfer). No separate developer fee.


❓ Frequently Asked Questions

Q: Why are some articles skipped?
A: They failed at least one filter — date cut-off, mustHaveDate, minWords, or onlyNewArticles (already seen in a previous run). The log line states which one.
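
As a rough illustration of the filters named in this answer, the sketch below returns the first filter an article fails. The function names and the order of the checks are assumptions; the actor's real logic may differ.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_date(value: Optional[str]) -> Optional[datetime]:
    """Parse an ISO date or timestamp; treat values without a zone as UTC."""
    if not value:
        return None
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def skip_reason(article: dict, cfg: dict, seen_urls: set, now: datetime) -> Optional[str]:
    """Return the name of the first failed filter, or None to keep the article."""
    published = parse_date(article.get("date"))
    if published is None:
        if cfg.get("mustHaveDate"):
            return "mustHaveDate"
    else:
        last_days = cfg.get("onlyArticlesForLastDays")
        if last_days is not None and published < now - timedelta(days=last_days):
            return "onlyArticlesForLastDays"
        date_from = parse_date(cfg.get("dateFrom"))
        if date_from is not None and published < date_from:
            return "dateFrom"
    if article.get("wordCount", 0) < cfg.get("minWords", 150):
        return "minWords"
    if cfg.get("onlyNewArticles") and article["url"] in seen_urls:
        return "onlyNewArticles"
    return None
```

In this sketch an undated article passes the date cut-offs unless mustHaveDate is set, which is one plausible reading of how the three date options interact.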

Q: The site keeps blocking me.
A: Leave proxyConfiguration.useApifyProxy set to false. The actor then escalates automatically from a direct connection to datacenter proxies and then to residential proxies, retrying up to 3 times on residential. If that still fails, enable useBrowser.
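
The escalation described in this answer can be sketched as a loop over proxy tiers. Everything named here (`TIERS`, `Blocked`, `fetch_with_fallback`, and the `fetch` callback) is hypothetical, and treating "up to 3 times" as three residential attempts in total is an assumption.

```python
from typing import Callable, Optional

# (tier, attempts) in escalation order. One attempt for the first two tiers
# is an assumption; only the residential figure comes from the text above.
TIERS = [("none", 1), ("datacenter", 1), ("residential", 3)]


class Blocked(Exception):
    """Raised by the hypothetical fetch callback when a request is blocked."""


def fetch_with_fallback(url: str, fetch: Callable[[str, str], str]) -> tuple[str, str]:
    """Try each proxy tier in order; return (tier_used, html) on first success."""
    last_error: Optional[Blocked] = None
    for tier, attempts in TIERS:
        for _ in range(attempts):
            try:
                return tier, fetch(url, tier)
            except Blocked as exc:
                last_error = exc  # remember the failure and keep escalating
    raise Blocked(f"all proxy tiers failed for {url}") from last_error
```

Starting without a proxy keeps cost low for sites that do not block, and the more expensive residential tier is only reached when the cheaper options have already failed.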

Q: Will it work for paywalled articles?
A: It can get through some soft paywalls by identifying as Googlebot (useGoogleBotHeaders), but it does not bypass strict authentication.

Q: How do I keep cross-run memory?
A: Toggle onlyNewArticles or onlyNewArticlesPerDomain. The actor keeps state in a named KVS — if that fails (e.g. Store run with limited permissions) it falls back to the run-default store.

Q: Can I customise the output?
A: Yes — supply extendOutputFunction as a Python snippet defining extend(soup, article, html) -> dict. The returned dict is merged into the record.
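
A minimal example of such a snippet might look like the following. It assumes `soup` is a BeautifulSoup object and `article` is a dict holding the fields from the Output table; the added field names (`section`, `readingMinutes`, `htmlBytes`) and the 200 words-per-minute figure are illustrative choices, not part of the actor.

```python
def extend(soup, article, html):
    """Return extra fields to be merged into the output record."""
    # Assumption: `soup` behaves like a BeautifulSoup object, `article` is a dict.
    section = soup.find("meta", attrs={"property": "article:section"})
    words = article.get("wordCount") or 0
    return {
        # Content of <meta property="article:section">, if the page has one.
        "section": section.get("content") if section else None,
        # Rough reading time, rounded up, at an assumed 200 words per minute.
        "readingMinutes": (words + 199) // 200,
        # Size of the raw HTML in bytes.
        "htmlBytes": len(html.encode("utf-8")),
    }
```

Keeping the new keys distinct from the built-in field names avoids accidentally overwriting fields such as title or text when the dict is merged.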


🛟 Support & Feedback

Use the Issues tab on the Actor page, or open a discussion on the Apify community forum. Pull requests are welcome.


  • Data is collected only from publicly available sources.
  • Do not scrape private accounts or content behind authentication unless explicitly authorised.
  • The end user is responsible for legal compliance (GDPR, CCPA, anti-spam laws, target site ToS, etc.).
  • The actor honours robots.txt for sitemap discovery; it does not enforce robots.txt blocks on crawl URLs — please be a good citizen.