🧠 Smart Article Extractor
Pricing
from $4.99 / 1,000 results
Developer: Scraper Engine (maintained by Community)
Actor stats: 0 bookmarks, 2 total users, 1 monthly active user, last modified 9 days ago
🧠 Smart Article Extractor — News & Blog Scraper
One-paragraph summary: Smart Article Extractor is an Apify Actor that bulk-extracts clean article content — title, author, publish date, full text, summary, images, videos, in-body links and rich metadata — from any news site, blog or sitemap. Point it at a homepage / section / topic URL and it will discover, classify and extract every article automatically using a BFS crawler, sitemap scanning, and configurable URL-shape heuristics.
🚀 Why Choose Us?
| Feature | Smart Article Extractor | Typical 1-URL article scraper |
|---|---|---|
| Bulk discovery (BFS crawler) | ✅ Yes | ❌ One URL at a time |
| Sitemap & robots.txt scanning | ✅ Built-in | ❌ |
| Sub-domain / sub-path scoping | ✅ Per Start URL | ❌ |
| `onlyNewArticles` cross-run dedup | ✅ Per-domain & global | ❌ |
| Date filters (`dateFrom`, `onlyArticlesForLastDays`, `mustHaveDate`) | ✅ All three | ⚠️ Limited |
| Anti-block proxy fallback (none → DC → RES) | ✅ Automatic | ❌ |
| Optional Playwright rendering | ✅ Toggle | ❌ |
| Extend-output Python hook | ✅ Inline snippet | ❌ |
| Live dataset push + state KVS | ✅ | ⚠️ |
🔥 Key Features
- 📰 Clean article extraction: trafilatura + BeautifulSoup combo for high recall.
- 🌐 Bulk discovery: drop a homepage URL and the actor discovers articles via BFS.
- 🗺️ Sitemap & robots.txt: automatic `Sitemap:` parsing + common candidates.
- 🛡️ Smart proxy fallback: starts direct, then datacenter, then residential.
- 🎭 Headless browser mode: Playwright + Chromium for JS-heavy or protected sites.
- 🧠 Cross-run memory: `onlyNewArticles` and `onlyNewArticlesPerDomain`.
- 🪜 Depth / page / article caps: never over-crawl.
- 📅 Date filters: `dateFrom`, `onlyArticlesForLastDays`, `mustHaveDate`.
- 🛠️ `extendOutputFunction`: inject your own Python `extend(soup, article, html)`.
- 💾 Save HTML / snapshots: full HTML in-record or as KVS link, PNG screenshots.
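The discovery features listed above (a BFS crawl bounded by depth, page and article caps) can be pictured with the rough sketch below. It is an illustration under stated assumptions, not the actor's implementation: `discover`, `looks_like_article` and `get_links` are hypothetical names, and the URL test is a simple stand-in for the configurable `isUrlArticleDefinition` heuristic, whose schema this README does not show.

```python
import re
from collections import deque
from urllib.parse import urlsplit

# Assumed article test: a dated path (/2026/may/21/ or /2026/05/21/) or a long
# hyphenated slug. This is NOT the actor's isUrlArticleDefinition schema.
DATED_PATH = re.compile(r"/\d{4}/(\d{2}|[a-z]{3})/\d{1,2}/")


def looks_like_article(url):
    path = urlsplit(url).path
    slug = path.rstrip("/").rsplit("/", 1)[-1]
    return bool(DATED_PATH.search(path)) or slug.count("-") >= 3


def discover(start_url, get_links, max_depth=2, max_pages=50, max_articles=25):
    """Breadth-first discovery; get_links(url) is a caller-supplied link fetcher."""
    queue = deque([(start_url, 0)])
    seen = {start_url}
    articles, pages = [], 0
    while queue and pages < max_pages and len(articles) < max_articles:
        url, depth = queue.popleft()
        pages += 1  # every fetched page counts against the page cap
        if looks_like_article(url):
            articles.append(url)
        if depth >= max_depth:
            continue  # depth cap reached: do not enqueue this page's links
        for link in get_links(url):
            if link not in seen:
                seen.add(link)
                queue.append((link, depth + 1))
    return articles
```

The default caps mirror the `maxDepth`, `maxPagesPerCrawl` and `maxArticlesPerCrawl` defaults from the input table; passing a dictionary lookup as `get_links` is enough to try the traversal without any network access.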
📥 Input
| Field | Type | Default | Description |
|---|---|---|---|
| `startUrls` | array | required | Homepages, sections, topic pages; used as crawl seeds. |
| `articleUrls` | array | `[]` | Direct article URLs to extract (no discovery needed). |
| `onlyNewArticles` | boolean | `false` | Skip URLs already seen in any previous run. |
| `onlyNewArticlesPerDomain` | boolean | `false` | Per-domain dedup memory. |
| `onlyInsideArticles` | boolean | `true` | Enqueue only same-domain links from articles. |
| `onlySubdomainArticles` | boolean | `false` | Restrict to URLs sharing the Start URL path prefix. |
| `enqueueFromArticles` | boolean | `true` | Discover further links inside extracted articles. |
| `crawlWholeSubdomain` | boolean | `true` | Treat any same-subdomain link as a category candidate. |
| `scanSitemaps` | boolean | `true` | Discover articles from robots.txt and common sitemap paths. |
| `useGoogleBotHeaders` | boolean | `true` | Identify as Googlebot. |
| `useBrowser` | boolean | `false` | Render with headless Chromium. |
| `scrollToBottom` | boolean | `false` | Scroll to the bottom to force lazy-loaded content (browser mode only). |
| `mustHaveDate` | boolean | `false` | Drop articles with no detectable date. |
| `dateFrom` | string (ISO date) | (none) | Earliest article date to keep. |
| `onlyArticlesForLastDays` | integer | (none) | Convenience cut-off: keep only articles from the last N days. |
| `minWords` | integer | `150` | Reject articles shorter than this word count. |
| `maxDepth` | integer | `2` | BFS depth. |
| `maxPagesPerCrawl` | integer | `50` | Hard cap on fetched pages. |
| `maxArticlesPerCrawl` | integer | `25` | Hard cap on saved articles. |
| `maxArticlesPerStartUrl` | integer | `25` | Cap per Start URL. |
| `isUrlArticleDefinition` | object | see schema | URL-shape heuristic. |
| `linkSelector` | string | (none) | CSS selector restricting where links are collected from. |
| `pseudoUrls` | array | `[]` | Custom URL patterns for category pages. |
| `sitemapUrls` | array | `[]` | Explicit sitemap URLs (skip auto-discovery). |
| `saveHtml` | boolean | `false` | Include raw HTML in the dataset record. |
| `saveHtmlAsLink` | boolean | `false` | Save HTML to KVS and put a link in the record. |
| `saveSnapshots` | boolean | `false` | PNG screenshot (browser mode only). |
| `extendOutputFunction` | string | (none) | Python snippet; must define `extend(soup, article, html) -> dict`. |
| `proxyConfiguration` | object | `{useApifyProxy: false}` | Default = no proxy; auto-fallback to datacenter (DC), then residential (RES) if blocked. |
Example input:
    {
      "startUrls": [{ "url": "https://www.theguardian.com" }],
      "onlyArticlesForLastDays": 2,
      "minWords": 150,
      "maxArticlesPerCrawl": 5,
      "useGoogleBotHeaders": true,
      "scanSitemaps": true,
      "proxyConfiguration": { "useApifyProxy": false }
    }
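With `scanSitemaps` enabled, the actor is described as reading `Sitemap:` entries from robots.txt and probing common sitemap paths. The sketch below shows one plausible way to collect such candidates. The function names and the list of common paths are assumptions made for illustration; the README does not say which paths the actor actually tries.

```python
from urllib.parse import urljoin

# Assumed "common candidates"; the actor's real list is not documented here.
COMMON_SITEMAP_PATHS = ["/sitemap.xml", "/sitemap_index.xml"]


def sitemaps_from_robots(robots_txt):
    """Return the URLs named on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for raw in robots_txt.splitlines():
        line = raw.strip()
        if line.startswith("#"):
            continue  # skip comment lines
        key, sep, value = line.partition(":")
        # The directive name is matched case-insensitively.
        if sep and key.strip().lower() == "sitemap" and value.strip():
            found.append(value.strip())
    return found


def candidate_sitemaps(base_url, robots_txt=""):
    """robots.txt entries first, then the assumed common paths, without duplicates."""
    candidates = sitemaps_from_robots(robots_txt)
    for path in COMMON_SITEMAP_PATHS:
        url = urljoin(base_url, path)
        if url not in candidates:
            candidates.append(url)
    return candidates
```

If the sitemap locations are already known, listing them in `sitemapUrls` skips this auto-discovery step entirely, as the input table notes.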
📤 Output
Each pushed record contains:
| Field | Type | Description |
|---|---|---|
| `url`, `loadedUrl` | string | Original / resolved URL. |
| `domain`, `loadedDomain` | string | Bare host. |
| `referrer`, `startUrl` | string | Where the link was discovered. |
| `depth` | integer | BFS depth at time of crawl. |
| `title`, `softTitle` | string | Best-effort headline. |
| `date` | string (ISO) | Publication date if found. |
| `author` | array | Author URL(s) or name(s). |
| `publisher`, `copyright`, `lang`, `favicon`, `canonicalLink` | string | Site metadata. |
| `description`, `keywords` | string | Meta description / keywords. |
| `tags` | array | `article:tag` values. |
| `image` | string | Hero / OG image URL. |
| `videos` | array | `<video>` / `<iframe>` / `<source>` URLs. |
| `links` | array of `{text, href}` | Inner-body links. |
| `wordCount` | integer | Word count of the extracted text. |
| `text` | string | Cleaned article body. |
| `html` | string | Full HTML (only if `saveHtml` / `saveHtmlAsLink`). |
| `screenshotUrl` | string | KVS link (only if `saveSnapshots` + `useBrowser`). |
Example output (truncated):
    {
      "url": "https://www.theguardian.com/lifeandstyle/2026/may/21/how-often-should-you-go-to-the-toilet…",
      "domain": "theguardian.com",
      "title": "How often should you go to the toilet?…",
      "date": "2026-05-21T04:00:02.000Z",
      "author": ["https://www.theguardian.com/profile/sarahphillips"],
      "publisher": "the Guardian",
      "wordCount": 1620,
      "text": "Think balance, diversity and routine. \"Our gut is a complex machine,\" says…",
      "image": "https://i.guim.co.uk/img/media/…"
    }
🚀 How to Use (Apify Console)
- Log in at https://console.apify.com → Actors.
- Open Smart Article Extractor.
- Configure inputs (Start URLs, date filters, caps, proxy).
- Click Start.
- Watch logs in real time — the actor prints a per-article live feed.
- Open the Output tab once the run completes.
- Export to JSON / CSV / XLSX or wire to a webhook.
🤖 Use via API / MCP
    curl -X POST "https://api.apify.com/v2/acts/<USERNAME>~smart-article-extractor/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
      -H "Content-Type: application/json" \
      -d '{
        "startUrls": [{"url": "https://www.theguardian.com"}],
        "maxArticlesPerCrawl": 5,
        "onlyArticlesForLastDays": 2,
        "proxyConfiguration": {"useApifyProxy": false}
      }'
MCP-server tool name: smart-article-extractor.
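For callers who prefer Python to curl, the same HTTP request from this section can be assembled with the standard library alone. This is a minimal sketch: `build_run_request` and `run_actor` are hypothetical helper names, the 300-second timeout is an arbitrary assumption, and the username still has to be supplied by the caller, exactly like the `<USERNAME>` placeholder in the curl command.

```python
import json
import urllib.request
from urllib.parse import quote

API_BASE = "https://api.apify.com/v2/acts"


def build_run_request(username, token, actor_input):
    """Build (but do not send) the POST request for a synchronous run."""
    actor_id = f"{username}~smart-article-extractor"
    url = (
        f"{API_BASE}/{quote(actor_id, safe='~')}"
        f"/run-sync-get-dataset-items?token={quote(token)}"
    )
    return urllib.request.Request(
        url,
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(username, token, actor_input, timeout=300):
    """Send the request and return the dataset items parsed from JSON."""
    request = build_run_request(username, token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keeping request construction separate from sending makes it easy to inspect the URL and payload before spending any platform credit.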
💡 Best Use Cases
- 📰 News monitoring on a topic / publisher
- 📊 NLP / sentiment / summarisation datasets
- 🏛️ Brand or competitor coverage tracking
- 🔍 SEO / SERP enrichment with full article text
- 📚 Knowledge-base construction for RAG / LLMs
- 🗞️ Press-clipping archives
💰 Pricing
Pay-per-usage. You only pay the Apify platform charges (compute time + proxies + transfer). No separate developer fee.
❓ Frequently Asked Questions
Q: Why are some articles skipped?
A: They failed at least one filter — date cut-off, mustHaveDate, minWords, or onlyNewArticles (already seen in a previous run). The log line states which one.
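To make the filters named in this answer concrete, here is a small sketch of how such checks could be applied to one record. It is not the actor's code: `skip_reason` and `parse_iso` are hypothetical helpers, the order of the checks is an assumption, and so is the choice to let undated articles pass the date filters unless `mustHaveDate` is set.

```python
from datetime import datetime, timedelta, timezone


def parse_iso(value):
    """Parse an ISO date or datetime string into a timezone-aware UTC datetime."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def skip_reason(article, cfg, seen_urls, now):
    """Return the name of the first filter the article fails, or None if it passes."""
    if cfg.get("onlyNewArticles") and article["url"] in seen_urls:
        return "onlyNewArticles"
    if article.get("wordCount", 0) < cfg.get("minWords", 150):
        return "minWords"
    date = article.get("date")
    if not date:
        # Assumption: undated articles are only dropped when mustHaveDate is set.
        return "mustHaveDate" if cfg.get("mustHaveDate") else None
    published = parse_iso(date)
    if cfg.get("dateFrom") and published < parse_iso(cfg["dateFrom"]):
        return "dateFrom"
    days = cfg.get("onlyArticlesForLastDays")
    if days and published < now - timedelta(days=days):
        return "onlyArticlesForLastDays"
    return None
```

Running a skipped record from the log through a helper like this is a quick way to see which input setting to relax.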
Q: The site keeps blocking me.
A: Leave proxyConfiguration.useApifyProxy = false. The actor then auto-escalates from a direct connection to datacenter and then residential proxies, retrying up to 3 times on residential. If that still fails, enable useBrowser.
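The escalation described in this answer can be sketched as a simple ladder. This is an illustration only: `fetch_with_escalation` and the `fetch(url, tier)` callback are hypothetical, one attempt each for the direct and datacenter tiers is an assumption, and "up to 3 times" is read here as three residential attempts in total.

```python
# Tiers in the order the README describes, with assumed attempt counts.
LADDER = [("none", 1), ("datacenter", 1), ("residential", 3)]


def fetch_with_escalation(url, fetch):
    """Try each proxy tier in turn.

    fetch(url, tier) is a caller-supplied function that returns the page body,
    or None when the request was blocked. Returns (body, tiers_tried).
    """
    tried = []
    for tier, attempts in LADDER:
        for _ in range(attempts):
            tried.append(tier)
            body = fetch(url, tier)
            if body is not None:
                return body, tried
    return None, tried  # every tier was blocked; browser mode is the next option
```

Returning the list of tiers tried alongside the body keeps the sketch easy to log and to test with a fake `fetch` function.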
Q: Will it work for paywalled articles?
A: It applies the usual soft-paywall workaround of identifying as Googlebot (useGoogleBotHeaders), but it does not bypass strict authentication.
Q: How do I keep cross-run memory?
A: Toggle onlyNewArticles or onlyNewArticlesPerDomain. The actor keeps state in a named KVS — if that fails (e.g. Store run with limited permissions) it falls back to the run-default store.
Q: Can I customise the output?
A: Yes — supply extendOutputFunction as a Python snippet defining extend(soup, article, html) -> dict. The returned dict is merged into the record.
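A possible `extend` snippet might look like the sketch below. It assumes that `article` behaves like a dict holding the output fields listed earlier and that `soup` is a BeautifulSoup object for the page; the `readingTimeMinutes` and `section` fields and the 200 words-per-minute figure are made up for this example. The snippet avoids imports because the README does not say whether they are allowed inside the hook.

```python
WORDS_PER_MINUTE = 200  # assumed reading speed, not an actor setting


def extend(soup, article, html):
    """Return extra fields; per the README, the returned dict is merged into the record."""
    extra = {}

    # Estimated reading time from the record's own wordCount (ceiling division).
    words = article.get("wordCount") or 0
    extra["readingTimeMinutes"] = (words + WORDS_PER_MINUTE - 1) // WORDS_PER_MINUTE

    # article:section is a common Open Graph meta tag; many sites omit it.
    section = soup.find("meta", attrs={"property": "article:section"})
    extra["section"] = section.get("content") if section else None

    return extra
```

For the Guardian record in the example output (`wordCount` 1620), this sketch would add `readingTimeMinutes: 9`; `section` stays `None` on pages without the meta tag.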
🛟 Support & Feedback
Use the Issues tab on the Actor page, or open a discussion on the Apify community forum. Pull requests are welcome.
⚖️ Cautions / legal
- Data is collected only from publicly available sources.
- Do not scrape private accounts or content behind authentication unless explicitly authorised.
- The end user is responsible for legal compliance (GDPR, CCPA, anti-spam laws, target site ToS, etc.).
- The actor honours `robots.txt` for sitemap discovery; it does not enforce robots.txt blocks on crawl URLs, so please be a good citizen.