RAG Web Browser

Pricing

from $3.99 / 1,000 results

Developer

Scrapio

Maintained by Community

🌐 RAG Web Browser — Search & Scrape for AI Agents & LLM Pipelines

One actor. Any question. Clean Markdown back. Search Google → scrape the top results → return polished Markdown / HTML / plain text — ready to drop straight into your RAG pipeline, LangChain / LlamaIndex retriever, OpenAI Assistant, Claude, Gemini, or custom AI agent.

✨ Why Choose This Actor?

What you get:

  • 🚀 Blazing-fast async pipeline (aiohttp + selectolax + lexbor)
  • 🧠 LLM-ready output: clean Markdown by default, HTML & plain text on demand
  • 🛡️ Smart proxy ladder: starts direct, auto-upgrades to datacenter → residential if a site blocks us
  • 🔁 Resilient retries: 3 residential attempts before giving up
  • 🕸️ Bulk URLs or one search query: single input, two modes
  • 🍪 Removes cookie / GDPR banners automatically
  • 📰 Readability mode: isolates the article body for cleaner context
  • 💾 Live dataset writes: partial results survive crashes
  • 🪟 Open-source friendly: Apify SDK 3.x, Python 3.13

🎯 Key Features

  • 🔍 Google Search backbone — paginated, deduped, ranked results
  • 🌐 Direct URL mode — paste a list of URLs and skip search entirely
  • 🧹 Custom CSS scrub — strip nav, footer, scripts, modals, ads, …
  • 📑 Per-page metadata — title, description, language, redirect chain
  • 🔢 Per-section dataset views — Results · Metadata · Crawl status · Content
  • 🎚️ Tunable concurrency — 1 to 50 parallel fetches
  • 🐞 Debug mode — see byte length, final URL, content type
  • 💸 Pay-per-usage pricing — no separate per-event charges

📥 Input

The form matches the official RAG Web Browser layout, plus an optional bulk URLs field.

| Field | Type | Default | Description |
|---|---|---|---|
| query | string | web browser for RAG pipelines -site:reddit.com | Search keywords or a single URL. |
| urls | array | [] | Optional bulk URLs; skips search when set. |
| maxResults | integer | 3 | Top organic results to scrape (1–100). |
| outputFormats | array | ["markdown"] | text, markdown, and/or html. |
| serpProxyGroup | string | GOOGLE_SERP | Proxy group for Google Search (GOOGLE_SERP or SHADER). |
| serpMaxRetries | integer | 2 | Retries when SERP fetch fails. |
| proxyConfiguration | object | { "useApifyProxy": true } | Target-page proxies; auto-escalates to residential on block. |
| scrapingTool | string | raw-http | raw-http (supported) or browser-playwright (falls back to HTTP). |
| removeElementsCssSelector | string | (sensible default) | CSS to strip before extraction. |
| htmlTransformer | enum | none | none or readable (article body). |
| maxRequestRetries | integer | 1 | Target page retries (0–3). |
| dynamicContentWaitSecs | integer | 10 | For browser mode only (ignored for Raw HTTP). |
| removeCookieWarnings | boolean | true | Strip cookie & GDPR dialogs. |
| debugMode | boolean | false | Add per-page debug info. |

Example input

{
  "query": "best web scraping libraries 2026",
  "maxResults": 5,
  "outputFormats": ["markdown"],
  "removeCookieWarnings": true,
  "proxyConfiguration": { "useApifyProxy": false }
}

Or scrape specific URLs:

{
  "urls": [
    "https://apify.com",
    "https://playwright.dev",
    "https://crawlee.dev"
  ],
  "outputFormats": ["markdown", "text"]
}
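
If you build inputs in code, it can help to check them against the ranges in the Input table before starting a run. The sketch below is only an illustration: `build_input` and its snake_case parameters are hypothetical helper names, not part of the actor or the Apify client, and it covers just a handful of fields.

```python
from typing import Any, Optional, Sequence

# Output formats listed in the Input table.
ALLOWED_FORMATS = {"text", "markdown", "html"}


def build_input(
    query: Optional[str] = None,
    urls: Optional[Sequence[str]] = None,
    max_results: int = 3,
    output_formats: Sequence[str] = ("markdown",),
    max_request_retries: int = 1,
) -> dict:
    """Hypothetical helper: build an input dict and enforce the documented ranges."""
    if not query and not urls:
        raise ValueError("provide either a query or a list of urls")
    if not 1 <= max_results <= 100:
        raise ValueError("maxResults must be between 1 and 100")
    if not 0 <= max_request_retries <= 3:
        raise ValueError("maxRequestRetries must be between 0 and 3")
    unknown = set(output_formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported output formats: {sorted(unknown)}")

    payload: dict = {
        "maxResults": max_results,
        "outputFormats": list(output_formats),
        "maxRequestRetries": max_request_retries,
    }
    if urls:
        # Bulk URL mode: the table says search is skipped when urls is set.
        payload["urls"] = list(urls)
    else:
        payload["query"] = query
    return payload
```

The returned dict can be passed as the run input in either mode; fields left out fall back to the actor's own defaults.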

📤 Output

Each dataset row contains:

{
  "crawl": {
    "httpStatusCode": 200,
    "httpStatusMessage": "OK",
    "loadedAt": "2026-05-19T12:50:40.591Z",
    "uniqueKey": "21f8d32712",
    "requestStatus": "handled"
  },
  "searchResult": {
    "title": "RAG Web Browser",
    "description": "Web search and fetch tool for AI agents and RAG pipelines ...",
    "url": "https://apify.com/apify/rag-web-browser",
    "resultType": "ORGANIC",
    "rank": 1
  },
  "metadata": {
    "title": "RAG Web Browser · Apify",
    "description": "Web search and fetch tool for AI agents and RAG pipelines.",
    "languageCode": "en",
    "url": "https://apify.com/apify/rag-web-browser",
    "redirectedUrl": "https://apify.com/apify/rag-web-browser"
  },
  "query": "web browser for RAG pipelines -site:reddit.com",
  "markdown": "# RAG Web Browser\n\nWeb search and fetch tool for AI agents..."
}

The Apify Console renders the dataset with five tabs:

  • 📋 Overview — everything at a glance
  • 📄 Search results — rank, title, snippet, URL
  • 📑 Page metadata — title, description, language, redirect chain
  • 🛰️ Crawl status — HTTP code, request outcome, timestamps
  • 📝 Extracted content — Markdown / HTML / plain text per page
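
Outside the Console, the same sections can be flattened per row in client code. This is a minimal sketch based on the sample row shown earlier; `summarize_row` is a hypothetical helper, and the `text` and `html` keys are an assumption (named after the output formats), since the sample only shows `markdown`.

```python
from typing import Any


def summarize_row(row: dict) -> dict:
    """Hypothetical helper: flatten one dataset row into a one-line summary."""
    crawl = row.get("crawl") or {}
    search = row.get("searchResult") or {}
    meta = row.get("metadata") or {}
    # Assumption: content lives under a key named after its output format.
    content: Any = row.get("markdown") or row.get("text") or row.get("html") or ""
    return {
        "rank": search.get("rank"),
        "title": meta.get("title") or search.get("title"),
        # Prefer the final URL after redirects when it is present.
        "url": meta.get("redirectedUrl") or meta.get("url") or search.get("url"),
        "status": crawl.get("httpStatusCode"),
        "handled": crawl.get("requestStatus") == "handled",
        "content_chars": len(content),
    }
```

Every lookup uses `.get`, so rows that lack a section (for example, pages that failed to load) still produce a summary instead of raising `KeyError`.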

🚀 How to Use (Apify Console)

  1. Go to Apify Console → Actors.
  2. Open this actor (or import it as a task).
  3. Set your 🔎 Search query or paste a list of 🔗 URLs.
  4. Pick which 📝 Output formats you want (Markdown is the default).
  5. Click ▶ Start.
  6. Watch the run feed — you'll see emoji-prefixed live progress: 🔎 Searching…, 📄 Page 1 → +10 new, 🔗 [3] Fetching …, ✅ [3] 200 — Title…, 📊 Progress: 5/10 (50%).
  7. Open the 📦 Output tab to browse results by section.
  8. Export as JSON / CSV / XLSX, or pull via the Apify API.

🤖 Use via API / Integration

REST API

curl -X POST "https://api.apify.com/v2/acts/<ACTOR_ID>/run-sync-get-dataset-items?token=$APIFY_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "query": "vector database benchmarks 2026",
    "maxResults": 5,
    "outputFormats": ["markdown"]
  }'

Python SDK

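from apify_client import ApifyClient

# Replace the token and <ACTOR_ID> with your own values.
client = ApifyClient("YOUR_APIFY_TOKEN")
run = client.actor("<ACTOR_ID>").call(run_input={
    "query": "LangChain vs LlamaIndex",
    "maxResults": 5,
})

# Iterate over the rows in the run's default dataset.
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    # Failed pages may lack metadata or markdown, so read both defensively.
    title = (item.get("metadata") or {}).get("title")
    print(title, "→", (item.get("markdown") or "")[:200])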

Drop-in for LangChain retrievers

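from langchain_core.documents import Document

# `items` holds the dataset rows, e.g. collected from iterate_items() above.
# searchResult is read defensively in case a row has no search ranking.
docs = [
    Document(page_content=item["markdown"],
             metadata={"source": item["metadata"]["url"],
                       "rank": (item.get("searchResult") or {}).get("rank")})
    for item in items if item.get("markdown")
]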

🛡️ How blocking & proxies are handled

You don't need to think about proxies — the actor auto-tunes:

  1. 🟢 Direct by default (fastest, cheapest).
  2. If a site blocks us → 🟡 Datacenter proxy is engaged.
  3. Still blocked? → 🔴 Residential proxy with up to 3 retries.
  4. Once residential kicks in, it sticks for the rest of the run so successive pages don't fight the same wall.

All escalations are logged so you can audit them, e.g. 🛡️ Switching to residential connection (sticky) — reason: site responded with 403.
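
The escalation order above can be sketched as a small state machine. This is not the actor's implementation, only an illustration of the described behavior: `ProxyLadder`, `TIERS`, and the `fetch(url, tier)` callable are hypothetical, and treating 403 and 429 as "blocked" is an assumption (the log example only mentions 403).

```python
from typing import Callable, Optional, Tuple

TIERS = ("direct", "datacenter", "residential")
RESIDENTIAL_ATTEMPTS = 3     # "3 residential attempts before giving up"
BLOCK_STATUSES = {403, 429}  # assumption: statuses treated as a block


class ProxyLadder:
    """Hypothetical sketch: direct -> datacenter -> residential (sticky)."""

    def __init__(self, fetch: Callable[[str, str], int]) -> None:
        self.fetch = fetch  # fetch(url, tier) returns an HTTP status code
        self.sticky_residential = False

    def get(self, url: str) -> Tuple[Optional[str], int]:
        """Return (tier that succeeded, or None if all failed; last status)."""
        tiers = ("residential",) if self.sticky_residential else TIERS
        status = 0
        for tier in tiers:
            if tier == "residential":
                # Once residential is engaged it stays on for the rest of the run.
                self.sticky_residential = True
            attempts = RESIDENTIAL_ATTEMPTS if tier == "residential" else 1
            for _ in range(attempts):
                status = self.fetch(url, tier)
                if status not in BLOCK_STATUSES:
                    return tier, status
        return None, status
```

Keeping the sticky flag on the ladder object rather than per URL mirrors step 4: after the first escalation, later pages go straight to residential instead of repeating the direct and datacenter attempts.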


🎯 Best Use Cases

  • 🧠 RAG pipelines — feed fresh web context to your LLM at query time
  • 🤖 AI agents — give Claude / GPT / Gemini a real web-browsing skill
  • 🔬 Research assistants — bulk-summarize top N results for a topic
  • 📈 Competitive intelligence — track competitor pages on a schedule
  • 📰 Content monitoring — convert articles to Markdown for analysis
  • 🪄 Prompt enrichment — auto-grab fresh facts before generating text

💰 Pricing

This actor is pay-per-usage — you only pay for the Apify platform compute units (CUs) and proxy traffic it actually uses. There are no separate per-event charges.

| Driver | Notes |
|---|---|
| ⏱️ Compute units | Proportional to memory × runtime. Typical 10-result run = a few cents. |
| 🛡️ Datacenter proxy | Used only if a site blocks the direct request. |
| 🛡️ Residential proxy | Used as a last resort. Higher cost but unblocks most walls. |
| 💾 Storage | A few KB per dataset row. |
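
To make "memory × runtime" concrete: on Apify, one compute unit corresponds to 1 GB of memory used for one hour. The sketch below turns that into arithmetic; the function names are hypothetical, and the price per CU depends on your Apify plan, so it is passed in rather than assumed.

```python
def compute_units(memory_mb: int, runtime_secs: float) -> float:
    """Compute units = memory in GB multiplied by runtime in hours."""
    return (memory_mb / 1024) * (runtime_secs / 3600)


def estimate_compute_cost(memory_mb: int, runtime_secs: float, usd_per_cu: float) -> float:
    """Rough compute-only estimate; proxy traffic and storage are billed separately."""
    return compute_units(memory_mb, runtime_secs) * usd_per_cu


# Example: a run with 1024 MB of memory that finishes in 60 seconds
# uses 1/60 of a compute unit.
cu = compute_units(1024, 60)
```

This is also why lowering maxResults reduces cost: fewer pages means a shorter runtime at the same memory setting.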

Want to lower cost further? Set maxResults lower, enable htmlTransformer: "readable", or skip html output.


❓ Frequently Asked Questions

Q: Do I need to configure a proxy myself?
A: No. Start with no proxy (the default). If a site blocks the direct request, the actor automatically tries datacenter, then residential. You only need to pick a proxy explicitly if you want a specific geography.

Q: How is the Markdown produced?
A: We parse HTML with selectolax (lexbor backend), strip noise via your CSS selectors, optionally isolate the article body, then convert to Markdown via markdownify with ATX-style headings.

Q: Can I scrape JavaScript-heavy sites?
A: This actor uses HTTP-only fetching for maximum speed. For sites that require a full browser (heavy SPA / login flows), use a Playwright-based actor.

Q: Does it handle redirects?
A: Yes. metadata.redirectedUrl captures the final URL after following redirects.

Q: What happens if half my pages succeed and half fail?
A: You still get the successful ones. Each record is pushed to the dataset live, so a crash mid-run cannot wipe earlier results. Failed pages are saved with crawl.requestStatus: "failed" and the error message.
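
Because failed pages are stored alongside successful ones, a downstream pipeline will usually want to separate them before indexing. A minimal sketch, assuming only the `crawl.requestStatus` values shown in this document ("handled" and "failed"); `split_by_status` is a hypothetical helper name.

```python
from typing import Iterable, List, Tuple


def split_by_status(rows: Iterable[dict]) -> Tuple[List[dict], List[dict]]:
    """Hypothetical helper: partition dataset rows into (handled, not handled)."""
    handled: List[dict] = []
    failed: List[dict] = []
    for row in rows:
        status = (row.get("crawl") or {}).get("requestStatus")
        # Anything that is not explicitly "handled" is treated as a failure.
        (handled if status == "handled" else failed).append(row)
    return handled, failed
```

The failed list can then be logged or re-submitted through the urls field in a follow-up run.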

Q: Can I export results?
A: Yes. JSON, CSV, XLSX, RSS, XML, and HTML table exports are all available in the Output tab and via the Apify API.


  • The actor scrapes only publicly available web content.
  • Don't use it to scrape private, gated, or authenticated content unless you have explicit authorization.
  • You are responsible for legal compliance (GDPR, CCPA, site Terms of Service, robots.txt, copyright).
  • Be a good citizen — avoid excessive maxResults on sites you do not own or operate.

📨 Support & Feedback

Found a bug or have a feature request? Open an issue from the actor page in the Apify Console and we'll take a look. PRs welcome.


Built with 💙 on the Apify platform.