RAG Web Browser
Pricing
from $2.99 / 1,000 results
Developer
SimpleAPI
Maintained by Community
🌐 RAG Web Browser — Search & Scrape for AI Agents & LLM Pipelines
One actor. Any question. Clean Markdown back. Search Google → scrape the top results → return polished Markdown / HTML / plain text — ready to drop straight into your RAG pipeline, LangChain / LlamaIndex retriever, OpenAI Assistant, Claude, Gemini, or custom AI agent.
✨ Why Choose This Actor?
| 🔥 | What you get |
|---|---|
| 🚀 | Blazing-fast async pipeline (aiohttp + selectolax + lexbor) |
| 🧠 | LLM-ready output — clean Markdown by default, HTML & plain text on demand |
| 🛡️ | Smart proxy ladder — starts direct, auto-upgrades to datacenter → residential if a site blocks us |
| 🔁 | Resilient retries — 3 residential attempts before giving up |
| 🕸️ | Bulk URLs or one search query — single input, two modes |
| 🍪 | Removes cookie / GDPR banners automatically |
| 📰 | Readability mode — isolates article body for cleaner context |
| 💾 | Live dataset writes — partial results survive crashes |
| 🪟 | Open-source friendly — Apify SDK 3.x, Python 3.13 |
🎯 Key Features
- 🔍 Google Search backbone — paginated, deduped, ranked results
- 🌐 Direct URL mode — paste a list of URLs and skip search entirely
- 🧹 Custom CSS scrub — strip nav, footer, scripts, modals, ads, …
- 📑 Per-page metadata — title, description, language, redirect chain
- 🔢 Per-section dataset views — Results · Metadata · Crawl status · Content
- 🎚️ Tunable concurrency — 1 to 50 parallel fetches
- 🐞 Debug mode — see byte length, final URL, content type
- 💸 Pay-per-usage pricing — no separate per-event charges
📥 Input
The form matches the official RAG Web Browser layout, plus an optional bulk URLs field.
| Field | Type | Default | Description |
|---|---|---|---|
| `query` | string | `web browser for RAG pipelines -site:reddit.com` | Search keywords or a single URL. |
| `urls` | array | `[]` | Optional bulk URLs; skips search when set. |
| `maxResults` | integer | `3` | Top organic results to scrape (1–100). |
| `outputFormats` | array | `["markdown"]` | Any of `text`, `markdown`, `html`. |
| `serpProxyGroup` | string | `GOOGLE_SERP` | Proxy group for Google Search (`GOOGLE_SERP` or `SHADER`). |
| `serpMaxRetries` | integer | `2` | Retries when the SERP fetch fails. |
| `proxyConfiguration` | object | `{ "useApifyProxy": true }` | Target-page proxies; auto-escalates to residential on block. |
| `scrapingTool` | string | `raw-http` | `raw-http` (supported) or `browser-playwright` (falls back to HTTP). |
| `removeElementsCssSelector` | string | (sensible default) | CSS selector for elements to strip before extraction. |
| `htmlTransformer` | enum | `none` | `none` or `readable` (article body). |
| `maxRequestRetries` | integer | `1` | Target-page retries (0–3). |
| `dynamicContentWaitSecs` | integer | `10` | Browser mode only (ignored for raw HTTP). |
| `removeCookieWarnings` | boolean | `true` | Strip cookie and GDPR dialogs. |
| `debugMode` | boolean | `false` | Add per-page debug info. |
Example input
{
  "query": "best web scraping libraries 2026",
  "maxResults": 5,
  "outputFormats": ["markdown"],
  "removeCookieWarnings": true,
  "proxyConfiguration": { "useApifyProxy": false }
}
Or scrape specific URLs:
{
  "urls": [
    "https://apify.com",
    "https://playwright.dev",
    "https://crawlee.dev"
  ],
  "outputFormats": ["markdown", "text"]
}
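The ranges in the input table can be checked client-side before a run is started. The following is a minimal sketch, not part of the actor: `build_input` and its parameter names are hypothetical, and sending only `urls` when both a query and URLs are supplied is an assumption based on the note that `urls` skips search when set.

```python
from typing import Any, Optional

ALLOWED_FORMATS = {"text", "markdown", "html"}


def build_input(
    query: Optional[str] = None,
    urls: Optional[list] = None,
    max_results: int = 3,
    output_formats: Optional[list] = None,
    max_request_retries: int = 1,
) -> dict:
    """Return an actor input dict, raising ValueError on out-of-range values."""
    formats = output_formats or ["markdown"]
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")
    if not 1 <= max_results <= 100:
        raise ValueError("maxResults must be between 1 and 100")
    if not 0 <= max_request_retries <= 3:
        raise ValueError("maxRequestRetries must be between 0 and 3")
    if not query and not urls:
        raise ValueError("provide a query or at least one URL")

    payload: dict = {
        "maxResults": max_results,
        "outputFormats": formats,
        "maxRequestRetries": max_request_retries,
    }
    if urls:
        # Assumption: bulk URL mode takes precedence, so the query is omitted.
        payload["urls"] = list(urls)
    else:
        payload["query"] = query
    return payload
```

Calling `build_input(query="best web scraping libraries 2026", max_results=5)` yields a dict that can be passed as `run_input` to the Apify client or serialized as the request body.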
📤 Output
Each dataset row contains:
{
  "crawl": {
    "httpStatusCode": 200,
    "httpStatusMessage": "OK",
    "loadedAt": "2026-05-19T12:50:40.591Z",
    "uniqueKey": "21f8d32712",
    "requestStatus": "handled"
  },
  "searchResult": {
    "title": "RAG Web Browser",
    "description": "Web search and fetch tool for AI agents and RAG pipelines ...",
    "url": "https://apify.com/apify/rag-web-browser",
    "resultType": "ORGANIC",
    "rank": 1
  },
  "metadata": {
    "title": "RAG Web Browser · Apify",
    "description": "Web search and fetch tool for AI agents and RAG pipelines.",
    "languageCode": "en",
    "url": "https://apify.com/apify/rag-web-browser",
    "redirectedUrl": "https://apify.com/apify/rag-web-browser"
  },
  "query": "web browser for RAG pipelines -site:reddit.com",
  "markdown": "# RAG Web Browser\n\nWeb search and fetch tool for AI agents..."
}
The Apify Console renders the dataset with five tabs:
- 📋 Overview — everything at a glance
- 📄 Search results — rank, title, snippet, URL
- 📑 Page metadata — title, description, language, redirect chain
- 🛰️ Crawl status — HTTP code, request outcome, timestamps
- 📝 Extracted content — Markdown / HTML / plain text per page
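The same sections can be reproduced outside the Console when rows are pulled via the API. This is a rough sketch: `split_row` is a hypothetical helper, and the `html` and `text` keys are assumed by analogy with the `markdown` key in the sample row above.

```python
from typing import Any

# Assumption: each requested output format appears as a top-level key.
CONTENT_KEYS = ("markdown", "html", "text")


def split_row(row: dict) -> dict:
    """Group one dataset row into the per-section views shown in the Console."""
    content: dict = {k: row[k] for k in CONTENT_KEYS if k in row}
    return {
        "search_results": dict(row.get("searchResult") or {}),
        "page_metadata": dict(row.get("metadata") or {}),
        "crawl_status": dict(row.get("crawl") or {}),
        "extracted_content": content,
    }
```

Missing sections come back as empty dicts, so downstream code can read `split_row(row)["crawl_status"].get("httpStatusCode")` without guarding against absent keys.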
🚀 How to Use (Apify Console)
- Go to Apify Console → Actors.
- Open this actor (or import it as a task).
- Set your 🔎 Search query or paste a list of 🔗 URLs.
- Pick which 📝 Output formats you want (Markdown is the default).
- Click ▶ Start.
- Watch the run feed, where you'll see emoji-prefixed live progress: 🔎 Searching…, 📄 Page 1 → +10 new, 🔗 [3] Fetching …, ✅ [3] 200 — Title…, 📊 Progress: 5/10 (50%).
- Open the 📦 Output tab to browse results by section.
- Export as JSON / CSV / XLSX, or pull via the Apify API.
🤖 Use via API / Integration
REST API
curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "vector database benchmarks 2026",
    "maxResults": 5,
    "outputFormats": ["markdown"]
  }'
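For environments without curl or the Apify client, the same request can be assembled with the Python standard library. This is a sketch under assumptions: `build_run_request` is a hypothetical helper, and it only builds the request so that the URL and body can be inspected before anything is sent.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2/acts"


def build_run_request(actor_id: str, token: str, payload: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for a synchronous run."""
    path = urllib.parse.quote(actor_id, safe="~")
    query = urllib.parse.urlencode({"token": token})
    url = f"{API_BASE}/{path}/run-sync-get-dataset-items?{query}"
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Sending the request (network access and a valid token required):
# req = build_run_request("<ACTOR_ID>", "YOUR_APIFY_TOKEN", {"query": "..."})
# with urllib.request.urlopen(req, timeout=300) as resp:
#     items = json.load(resp)
```

The 300-second timeout in the commented usage is a placeholder; choose a value that fits the number of pages being scraped.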
Python SDK
from apify_client import ApifyClient

client = ApifyClient("YOUR_APIFY_TOKEN")

# Start the actor and wait for the run to finish.
run = client.actor("<ACTOR_ID>").call(
    run_input={
        "query": "LangChain vs LlamaIndex",
        "maxResults": 5,
    }
)

# Iterate over the rows the run wrote to its default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item["metadata"]["title"], "→", item["markdown"][:200])
Drop-in for LangChain retrievers
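from langchain_core.documents import Document

# `items` is the list of dataset rows, for example
# list(client.dataset(run["defaultDatasetId"]).iterate_items())
docs = [
    Document(
        page_content=item["markdown"],
        metadata={
            "source": item["metadata"]["url"],
            "rank": item["searchResult"]["rank"],
        },
    )
    for item in items
    if item.get("markdown")  # skip rows without extracted Markdown
]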
🛡️ How blocking & proxies are handled
You don't need to think about proxies — the actor auto-tunes:
- 🟢 Direct by default (fastest, cheapest).
- If a site blocks us → 🟡 Datacenter proxy is engaged.
- Still blocked? → 🔴 Residential proxy with up to 3 retries.
- Once residential kicks in, it sticks for the rest of the run so successive pages don't fight the same wall.
All escalations are logged so you can audit them, e.g. 🛡️ Switching to residential connection (sticky) — reason: site responded with 403.
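The escalation steps above can be pictured as a small state machine. The sketch below is an illustration of the described behaviour, not the actor's source code: `ProxyLadder`, the `fetcher` callback, and the choice of 403 and 429 as "blocked" status codes are all assumptions.

```python
from typing import Callable

LADDER = ("direct", "datacenter", "residential")
RESIDENTIAL_ATTEMPTS = 3
BLOCK_STATUSES = {403, 429}  # assumption: which status codes count as a block


class ProxyLadder:
    """Tracks the sticky proxy tier across the pages of one run."""

    def __init__(self) -> None:
        self.sticky_residential = False

    def fetch(self, url: str, fetcher: Callable[[str, str], int]) -> tuple:
        """Try tiers in order and return (tier_used, last_status)."""
        tiers = ("residential",) if self.sticky_residential else LADDER
        status = 0
        for tier in tiers:
            attempts = 1
            if tier == "residential":
                attempts = RESIDENTIAL_ATTEMPTS
                # Once engaged, residential stays on for the rest of the run.
                self.sticky_residential = True
            for _ in range(attempts):
                status = fetcher(url, tier)
                if status not in BLOCK_STATUSES:
                    return tier, status
        return tiers[-1], status
```

Here `fetcher(url, tier)` stands in for whatever performs the HTTP request through the given tier and returns the status code. Keeping the sticky flag on the instance means later pages skip the direct and datacenter attempts that are already known to fail.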
🎯 Best Use Cases
- 🧠 RAG pipelines — feed fresh web context to your LLM at query time
- 🤖 AI agents — give Claude / GPT / Gemini a real web-browsing skill
- 🔬 Research assistants — bulk-summarize top N results for a topic
- 📈 Competitive intelligence — track competitor pages on a schedule
- 📰 Content monitoring — convert articles to Markdown for analysis
- 🪄 Prompt enrichment — auto-grab fresh facts before generating text
💰 Pricing
This actor is pay-per-usage — you only pay for the Apify platform compute units (CUs) and proxy traffic it actually uses. There are no separate per-event charges.
| Driver | Notes |
|---|---|
| ⏱️ Compute units | Proportional to memory × runtime. Typical 10-result run = a few cents. |
| 🛡️ Datacenter proxy | Used only if a site blocks the direct request. |
| 🛡️ Residential proxy | Used as a last resort. Higher cost but unblocks most walls. |
| 💾 Storage | A few KB per dataset row. |
Want to lower cost further? Set `maxResults` lower, enable `htmlTransformer: "readable"`, or skip `html` output.
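To make the "memory × runtime" driver concrete: on the Apify platform one compute unit corresponds to 1 GB of memory used for one hour. The sketch below applies that arithmetic; the function names are hypothetical, and `usd_per_cu` is a placeholder because the price per compute unit depends on your Apify plan.

```python
def compute_units(memory_mb: int, runtime_secs: float) -> float:
    """Compute units consumed: memory in GB multiplied by runtime in hours."""
    return (memory_mb / 1024) * (runtime_secs / 3600)


def estimate_compute_cost(memory_mb: int, runtime_secs: float, usd_per_cu: float) -> float:
    """Compute-only cost estimate; proxy traffic and storage are billed separately."""
    return compute_units(memory_mb, runtime_secs) * usd_per_cu
```

For example, a run with 4096 MB of memory that finishes in 90 seconds consumes 4 GB × 0.025 h, or about 0.1 CU. Halving either the memory or the runtime halves the compute cost.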
❓ Frequently Asked Questions
Q: Do I need to configure a proxy myself?
A: No. Start with no proxy (the default). If a site blocks the direct request, the actor automatically tries datacenter, then residential. You only need to pick a proxy explicitly if you want a specific geography.
Q: How is the Markdown produced?
A: We parse HTML with selectolax (lexbor backend), strip noise via your CSS selectors, optionally isolate the article body, then convert to Markdown via markdownify with ATX-style headings.
Q: Can I scrape JavaScript-heavy sites?
A: This actor uses HTTP-only fetching for maximum speed. For sites that require a full browser (heavy SPA / login flows), use a Playwright-based actor.
Q: Does it handle redirects?
A: Yes — metadata.redirectedUrl captures the final URL after following redirects.
Q: What happens if half my pages succeed and half fail?
A: You still get the successful ones. Each record is pushed to the dataset live, so a crash mid-run cannot wipe earlier results. Failed pages are saved with crawl.requestStatus: "failed" and the error message.
Q: Can I export results?
A: Yes. JSON, CSV, XLSX, RSS, XML, and HTML table exports are all available in the Output tab and via the Apify API.
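The partial-failure answer above suggests a simple client-side pattern: separate failed rows from successful ones, then resubmit the failed URLs through the `urls` input. The sketch below is one way to do that; both helper names are hypothetical, and reading the URL from `metadata.url` or `searchResult.url` on a failed row is an assumption, since the layout of failed rows is not shown here.

```python
def partition_rows(rows: list) -> tuple:
    """Split dataset rows into (succeeded, failed) by crawl.requestStatus."""
    succeeded, failed = [], []
    for row in rows:
        status = (row.get("crawl") or {}).get("requestStatus")
        (failed if status == "failed" else succeeded).append(row)
    return succeeded, failed


def retry_urls(failed_rows: list) -> list:
    """Collect unique URLs from failed rows, preserving their order."""
    urls = []
    for row in failed_rows:
        # Assumption: failed rows still carry the URL in one of these fields.
        url = (row.get("metadata") or {}).get("url") or (row.get("searchResult") or {}).get("url")
        if url and url not in urls:
            urls.append(url)
    return urls
```

The list returned by `retry_urls` can be passed as the `urls` field of a follow-up run, which skips the search step entirely.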
📜 Cautions / Legal
- The actor scrapes only publicly available web content.
- Don't use it to scrape private, gated, or authenticated content unless you have explicit authorization.
- You are responsible for legal compliance (GDPR, CCPA, site Terms of Service, robots.txt, copyright).
- Be a good citizen: avoid excessive `maxResults` on sites you do not own or operate.
📨 Support & Feedback
Found a bug or have a feature request? Open an issue from the actor page in the Apify Console and we'll take a look. PRs welcome.
Built with 💙 on the Apify platform.