# RSS / XML Scraper (`shahidirfan/rss-xml-scraper`) Actor

Meet the RSS / XML Scraper: the most advanced actor for parsing any RSS feed or XML file. It effortlessly extracts clean, structured data from even the most complex sources. Your ultimate tool for content aggregation, data monitoring, and content analysis.

- **URL:** https://apify.com/shahidirfan/rss-xml-scraper.md
- **Developed by:** [Shahid Irfan](https://apify.com/shahidirfan) (community)
- **Categories:** Developer tools, Automation, News
- **Stats:** 99 total users, 20 monthly users, 98.9% runs succeeded, 5 bookmarks
- **User rating:** 5.00 out of 5 stars

## Pricing

Pay per usage

This Actor is free to use. You pay only for the Apify platform usage it consumes, and that usage gets cheaper on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-usage

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## RSS XML Feed Scraper
Extract RSS and Atom feed data into structured datasets for monitoring, research, and content pipelines. Collect feed metadata, article metadata, tags, media links, and optional full article text at scale. Built for reliable feed ingestion with clean output suitable for automation and analytics.

### Features
- **Feed URL list input** — Add one or many feed URLs quickly with string-list input.
- **Feed discovery mode** — Discover valid feeds from website URLs when needed.
- **Full article expansion** — Expand snippet-only items into richer full text and HTML.
- **Batch processing** — Process feed entries in batches for faster and steadier runs.
- **Batch dataset writes** — Push extracted items in batches for better write throughput.
- **Fallback feed support** — Runs with a default BBC feed when no URL is provided.
- **Proxy-ready input** — Optional proxy configuration with Apify Proxy disabled by default.

### Use Cases
#### News Monitoring
Track breaking stories from multiple publishers in one scheduled run. Store normalized records for alerts, dashboards, and trend tracking.

#### Content Aggregation
Collect article headlines, descriptions, publish dates, and links for newsletters and curation workflows. Expand snippets into fuller article text when needed.

#### Competitive Intelligence
Monitor competitor blogs and media feeds continuously. Compare publishing frequency, topic clusters, and update cadence.

#### Research Datasets
Build structured datasets for NLP, topic modeling, and sentiment workflows. Export in JSON, CSV, Excel, or XML for downstream analysis.

### Input Parameters
| Parameter | Type | Required | Default | Description |
|-----------|------|----------|---------|-------------|
| `urls` | Array | No | `["https://feeds.bbci.co.uk/news/rss.xml"]` | List of feed URLs or website URLs. |
| `extractContent` | Boolean | No | `false` | Forces full article extraction for entries. |
| `autoExpandSnippets` | Boolean | No | `true` | Expands snippet-only feed items to fuller content automatically. |
| `maxEntries` | Integer | No | `20` | Maximum entries per feed. Use `0` to process all entries. |
| `discoverFeeds` | Boolean | No | `false` | If `true`, website URLs are scanned for valid feeds. |
| `userAgent` | String | No | `""` | Optional custom user agent header. |
| `proxyConfiguration` | Object | No | `{ "useApifyProxy": false }` | Optional proxy settings. |

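The defaults in the table can be mirrored in client code so that a run input is checked before it is sent. The following is a minimal sketch, not part of the Actor or of any Apify library: `DEFAULT_INPUT` and `build_input` are hypothetical names, the default values are copied from the table above, and the validation rules (a list of strings for `urls`, a non-negative integer for `maxEntries`) are assumptions derived from the documented types.

```python
from copy import deepcopy

# Defaults copied from the Input Parameters table; adjust if the Actor changes.
DEFAULT_INPUT = {
    "urls": ["https://feeds.bbci.co.uk/news/rss.xml"],
    "extractContent": False,
    "autoExpandSnippets": True,
    "maxEntries": 20,
    "discoverFeeds": False,
    "userAgent": "",
    "proxyConfiguration": {"useApifyProxy": False},
}


def build_input(**overrides):
    """Merge overrides into the documented defaults and run basic checks.

    Hypothetical local helper; it never contacts the Apify platform.
    """
    unknown = set(overrides) - set(DEFAULT_INPUT)
    if unknown:
        raise ValueError(f"Unknown input parameters: {sorted(unknown)}")

    # Deep copy so nested defaults (proxyConfiguration) are never mutated.
    run_input = deepcopy(DEFAULT_INPUT)
    run_input.update(overrides)

    urls = run_input["urls"]
    if not isinstance(urls, list) or not all(isinstance(u, str) for u in urls):
        raise ValueError("urls must be a list of strings")

    max_entries = run_input["maxEntries"]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(max_entries, bool) or not isinstance(max_entries, int) or max_entries < 0:
        raise ValueError("maxEntries must be an integer >= 0 (0 means all entries)")

    return run_input


# Process every entry of one feed while keeping all other defaults.
run_input = build_input(urls=["https://hnrss.org/frontpage"], maxEntries=0)
print(run_input["maxEntries"], run_input["autoExpandSnippets"])
```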
---

### Output Data
Each dataset item contains feed-level or entry-level fields, depending on `item_type`.

| Field | Type | Description |
|-------|------|-------------|
| `item_type` | String | Record type such as `feed_meta` or `entry`. |
| `feed_url` | String | Source feed URL. |
| `title` | String | Feed title or article title. |
| `link` | String | Feed homepage or article URL. |
| `description` | String | Description or teaser text. |
| `summary` | String | Summary text from feed payload. |
| `content` | String | Parsed content text from feed payload. |
| `author` | String | Primary author name when available. |
| `authors` | Array | Author list when available. |
| `published` | String | Published time in ISO format when available. |
| `updated` | String | Updated time in ISO format when available. |
| `tags` | Array | Categories or tags from feed entries. |
| `source_title` | String | Source/channel title for the entry when present. |
| `source_url` | String | Source/channel URL for the entry when present. |
| `enclosure_url` | String | Enclosure/media URL when present. |
| `image_url` | String | Best available image URL from feed fields. |
| `full_text` | String | Expanded full article text. |
| `full_html` | String | Cleaned full article HTML. |
| `meta_description` | String | Article meta description when available. |
| `top_image` | String | Primary article image when available. |
| `publish_date` | String | Article publish timestamp when available. |
| `content_source` | String | Indicates whether full content came from feed payload or article page. |
| `content_error` | String | Full-content extraction error details if any. |
| `error` | String | Processing error details if any. |
| `collected_at` | String | Collection timestamp in ISO format. |

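Because feed-level and entry-level records share one dataset, downstream code usually separates them first. The sketch below shows one way to do that, assuming items arrive as plain dictionaries with the field names from the table above. `group_by_item_type` and `failed_items` are hypothetical helpers, and the sample records (including the `"timeout"` error string) are hand-written for illustration, not real Actor output.

```python
from collections import defaultdict


def group_by_item_type(items):
    """Group dataset items by their item_type field."""
    groups = defaultdict(list)
    for item in items:
        # Items without item_type are kept under "unknown" rather than dropped.
        groups[item.get("item_type", "unknown")].append(item)
    return dict(groups)


def failed_items(items):
    """Return items that report a processing or content-extraction error."""
    return [i for i in items if i.get("error") or i.get("content_error")]


# Hand-written sample records using field names from the Output Data table.
sample = [
    {"item_type": "feed_meta", "feed_url": "https://example.com/rss", "title": "Example Feed"},
    {"item_type": "entry", "feed_url": "https://example.com/rss", "title": "First post"},
    {"item_type": "entry", "feed_url": "https://example.com/rss", "title": "Second post",
     "content_error": "timeout"},
]

groups = group_by_item_type(sample)
print(len(groups["feed_meta"]), len(groups["entry"]), len(failed_items(sample)))
```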
---

### Usage Examples
#### Basic Feed Collection
```json
{
  "urls": [
    "https://feeds.bbci.co.uk/news/rss.xml"
  ]
}
````

#### Multiple Feeds

```json
{
  "urls": [
    "https://feeds.bbci.co.uk/news/rss.xml",
    "https://www.theguardian.com/world/rss",
    "https://hnrss.org/frontpage"
  ],
  "maxEntries": 50
}
```

#### Force Full Article Extraction

```json
{
  "urls": [
    "https://feeds.bbci.co.uk/news/rss.xml"
  ],
  "extractContent": true,
  "maxEntries": 25
}
```

#### Discover Feeds from Website URLs

```json
{
  "urls": [
    "https://example.com"
  ],
  "discoverFeeds": true
}
```

#### Proxy Configuration

```json
{
  "urls": [
    "https://feeds.bbci.co.uk/news/rss.xml"
  ],
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}
```

***

### Sample Output

```json
{
  "item_type": "entry",
  "feed_url": "https://feeds.bbci.co.uk/news/rss.xml",
  "title": "Example Article Title",
  "link": "https://www.example.com/articles/123",
  "description": "Short teaser from the feed.",
  "summary": "Extended summary text from feed payload.",
  "author": "Reporter Name",
  "published": "2026-04-01T12:20:51.000Z",
  "tags": ["world", "politics"],
  "source_title": "Example News",
  "full_text": "Expanded article text...",
  "content_source": "article_page",
  "collected_at": "2026-04-01T15:30:00.000Z"
}
```
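The `published` and `collected_at` timestamps can be compared to see how long after publication an entry was collected. This is a minimal sketch that assumes both values are ISO 8601 strings with an explicit UTC marker, as in the sample above; feeds that omit the time zone would need extra handling. `parse_iso` and `collection_lag_seconds` are hypothetical helper names.

```python
from datetime import datetime


def parse_iso(value):
    """Parse an ISO 8601 timestamp such as 2026-04-01T12:20:51.000Z."""
    # datetime.fromisoformat only accepts a trailing "Z" from Python 3.11 on,
    # so rewrite it as an explicit UTC offset first.
    if value.endswith("Z"):
        value = value[:-1] + "+00:00"
    return datetime.fromisoformat(value)


def collection_lag_seconds(item):
    """Seconds between publication and collection, or None if either is missing."""
    published = item.get("published")
    collected = item.get("collected_at")
    if not published or not collected:
        return None
    return (parse_iso(collected) - parse_iso(published)).total_seconds()


# Timestamps taken from the sample output above.
item = {
    "published": "2026-04-01T12:20:51.000Z",
    "collected_at": "2026-04-01T15:30:00.000Z",
}
print(collection_lag_seconds(item))  # 11349.0 (3 h 9 min 9 s)
```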

***

### Tips for Best Results

#### Use Stable Feed URLs

- Prefer canonical RSS/Atom URLs from publishers.
- Keep feed URL lists clean and deduplicated.

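One way to keep a URL list clean before passing it as `urls` is to normalize and deduplicate it locally. The sketch below is an assumption about what "clean" means here: it trims whitespace, lowercases the scheme and host, drops fragments, and removes duplicates while preserving order. `normalize_url` and `dedupe_urls` are hypothetical helpers, and the Actor itself may or may not apply similar normalization.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url):
    """Lowercase scheme and host and drop the fragment; path and query are kept."""
    parts = urlsplit(url.strip())
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), parts.path, parts.query, ""))


def dedupe_urls(urls):
    """Remove blank entries and duplicates while preserving the original order."""
    seen = set()
    result = []
    for url in urls:
        if not url.strip():
            continue
        key = normalize_url(url)
        if key not in seen:
            seen.add(key)
            result.append(key)
    return result


raw = [
    "https://feeds.bbci.co.uk/news/rss.xml",
    " HTTPS://Feeds.BBCI.co.uk/news/rss.xml ",
    "",
    "https://hnrss.org/frontpage#top",
]
print(dedupe_urls(raw))
```

The path is left case-sensitive on purpose, since many servers treat `/RSS` and `/rss` as different resources.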
#### Start with Moderate Limits

- Use `maxEntries` around `20` to validate data quality quickly.
- Increase limits after verifying target feed behavior.

#### Enable Full Content Only When Needed

- Keep `extractContent` off for lightweight metadata pipelines.
- Turn it on for downstream NLP, summarization, or archiving workflows.

#### Handle Site Restrictions

- Use `proxyConfiguration` when targets rate-limit requests.
- Set a custom `userAgent` for sites with strict header checks.

***

### Integrations

Connect extracted feed data with:

- **Google Sheets** — Build live monitoring sheets.
- **Airtable** — Create searchable editorial databases.
- **Slack** — Trigger alerts for new stories or keywords.
- **Webhooks** — Send records to internal services in real time.
- **Make** — Automate enrichment and routing flows.
- **Zapier** — Connect feeds to business tools without code.

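Keyword alerts like the Slack case above need a filter step between the dataset and the notification. The following sketch shows one possible filter over entry records; `matches_keywords` and `select_alerts` are hypothetical helpers, the matching is a plain case-insensitive substring test, and sending the message to Slack or a webhook is deliberately left out.

```python
def matches_keywords(item, keywords):
    """True if any keyword appears in the entry's title, description, or tags."""
    haystack = " ".join(
        [item.get("title") or "", item.get("description") or ""]
        + [str(tag) for tag in (item.get("tags") or [])]
    ).lower()
    return any(keyword.lower() in haystack for keyword in keywords)


def select_alerts(items, keywords):
    """Keep only entry records (not feed_meta) that match at least one keyword."""
    return [
        item for item in items
        if item.get("item_type") == "entry" and matches_keywords(item, keywords)
    ]


# Hand-written sample records for illustration only.
items = [
    {"item_type": "feed_meta", "title": "World politics feed"},
    {"item_type": "entry", "title": "Budget vote delayed", "tags": ["politics"]},
    {"item_type": "entry", "title": "Local team wins final", "tags": ["sport"]},
]
alerts = select_alerts(items, ["Politics", "election"])
print([a["title"] for a in alerts])
```

Substring matching is simple but coarse: a keyword such as `art` also matches `party`, so word-boundary matching may be preferable for short keywords.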
#### Export Formats

- **JSON** — API and backend workflows.
- **CSV** — Spreadsheet analytics.
- **Excel** — Business reporting.
- **XML** — Legacy system integration.

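When items are post-processed locally rather than exported from the platform, array fields such as `tags` have to be flattened before they fit into a CSV cell. The sketch below is one possible approach using only the standard library; `items_to_csv`, the chosen `COLUMNS`, and the `"; "` separator are assumptions for illustration and say nothing about how the platform's own CSV export lays out arrays.

```python
import csv
import io

# Illustrative column selection; any fields from the output table can be used.
COLUMNS = ["feed_url", "title", "link", "published", "tags"]


def items_to_csv(items, columns=COLUMNS):
    """Serialize records to CSV text, joining list values with '; '."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=columns, extrasaction="ignore", lineterminator="\n"
    )
    writer.writeheader()
    for item in items:
        row = {}
        for column in columns:
            value = item.get(column, "")
            # Array fields such as tags do not fit in one CSV cell as-is.
            row[column] = "; ".join(map(str, value)) if isinstance(value, list) else value
        writer.writerow(row)
    return buffer.getvalue()


# Values taken from the sample output earlier in this README.
entry = {
    "item_type": "entry",
    "feed_url": "https://feeds.bbci.co.uk/news/rss.xml",
    "title": "Example Article Title",
    "link": "https://www.example.com/articles/123",
    "published": "2026-04-01T12:20:51.000Z",
    "tags": ["world", "politics"],
}
csv_text = items_to_csv([entry])
print(csv_text)
```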
***

### Frequently Asked Questions

#### What happens if I do not provide any feed URL?

The Actor uses a default BBC feed and still produces data.

#### Can I scrape multiple feeds in one run?

Yes, provide multiple URLs in the `urls` list.

#### Does it collect feed metadata and entry data?

Yes, output includes feed-level metadata records and entry-level records.

#### Can it expand short snippets to full text?

Yes, use `extractContent: true` or keep `autoExpandSnippets: true`.

#### Will it fail if an article page blocks extraction?

No, it falls back to feed-provided summary/content when possible.

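Consumers can apply the same idea on their side by picking the richest text field that is actually populated. This is a minimal sketch under assumptions: the preference order in `TEXT_FIELDS` is a choice made for this example, `best_text` is a hypothetical helper, and the sample record (including its `content_error` message) is hand-written rather than real output.

```python
# Order of preference is an assumption for this sketch: richest text first.
TEXT_FIELDS = ("full_text", "content", "summary", "description")


def best_text(item):
    """Return (field_name, text) for the first non-empty text field, else (None, "")."""
    for field in TEXT_FIELDS:
        value = item.get(field)
        if isinstance(value, str) and value.strip():
            return field, value
    return None, ""


# An entry whose article page could not be extracted but whose feed had a summary.
blocked = {
    "item_type": "entry",
    "title": "Example Article Title",
    "summary": "Extended summary text from feed payload.",
    "description": "Short teaser from the feed.",
    "content_error": "illustrative error message",
}
print(best_text(blocked))
```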
#### Does it support proxies?

Yes, pass `proxyConfiguration`. By default, Apify Proxy is disabled.

#### Can I use website URLs instead of direct feed URLs?

Yes, enable `discoverFeeds` to discover feed endpoints from websites.

***

### Support

For issues or feature requests, use the discussion and support channels on the Actor's Apify page.

#### Resources

- [Apify Documentation](https://docs.apify.com/)
- [Apify API Reference](https://docs.apify.com/api/v2)
- [Scheduling Runs](https://docs.apify.com/platform/schedules)

***

### Legal Notice

This Actor is intended for legitimate data collection. You are responsible for complying with website terms, robots policies, and applicable laws in your jurisdiction.

# Actor input Schema

## `urls` (type: `array`):

List RSS/Atom feed URLs or website URLs. One URL per line for easy management.

## `extractContent` (type: `boolean`):

Whether to extract full article content. This will fetch and parse full article text, HTML, keywords, and metadata from each feed entry link.

## `autoExpandSnippets` (type: `boolean`):

When true, the Actor automatically fetches full article content if feed entries contain only short snippets or summaries.

## `discoverFeeds` (type: `boolean`):

Automatically discover RSS/Atom feeds from website URLs. When enabled, non-feed URLs will be scanned for feed links.

## `maxEntries` (type: `integer`):

Maximum number of entries to process from each feed. Set to 0 to process all entries.

## `proxyConfiguration` (type: `object`):

Optional proxy settings. Default keeps Apify Proxy disabled for faster direct feed fetching.

## Actor input object example

```json
{
  "urls": [
    "https://feeds.bbci.co.uk/news/rss.xml"
  ],
  "extractContent": false,
  "autoExpandSnippets": true,
  "discoverFeeds": false,
  "maxEntries": 20,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}
```

# Actor output Schema

## `overview` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://feeds.bbci.co.uk/news/rss.xml"
    ],
    "maxEntries": 20
};

// Run the Actor and wait for it to finish
const run = await client.actor("shahidirfan/rss-xml-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "urls": ["https://feeds.bbci.co.uk/news/rss.xml"],
    "maxEntries": 20,
}

# Run the Actor and wait for it to finish
run = client.actor("shahidirfan/rss-xml-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://feeds.bbci.co.uk/news/rss.xml"
  ],
  "maxEntries": 20
}' |
apify call shahidirfan/rss-xml-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=shahidirfan/rss-xml-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "RSS / XML Scraper",
        "description": "Meet the RSS / XML Scraper: the most advanced actor for parsing any RSS feed or XML file. It effortlessly extracts clean, structured data from even the most complex sources. Your ultimate tool for content aggregation, data monitoring, and content analysis.",
        "version": "0.0",
        "x-build-id": "NJd4hPcG3Zx9aPGgV"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/shahidirfan~rss-xml-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-shahidirfan-rss-xml-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/shahidirfan~rss-xml-scraper/runs": {
            "post": {
                "operationId": "runs-sync-shahidirfan-rss-xml-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/shahidirfan~rss-xml-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-shahidirfan-rss-xml-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "urls": {
                        "title": "Feed URLs List",
                        "type": "array",
                        "description": "List RSS/Atom feed URLs or website URLs. One URL per line for easy management.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "extractContent": {
                        "title": "Extract Full Article Content",
                        "type": "boolean",
                        "description": "Whether to extract full article content. This will fetch and parse full article text, HTML, keywords, and metadata from each feed entry link.",
                        "default": false
                    },
                    "autoExpandSnippets": {
                        "title": "Auto Expand Snippet-only Entries",
                        "type": "boolean",
                        "description": "When true, the actor automatically fetches full article content if feed entries contain only short snippets or summaries.",
                        "default": true
                    },
                    "discoverFeeds": {
                        "title": "Discover Feeds from Websites",
                        "type": "boolean",
                        "description": "Automatically discover RSS/Atom feeds from website URLs. When enabled, non-feed URLs will be scanned for feed links.",
                        "default": false
                    },
                    "maxEntries": {
                        "title": "Maximum Entries per Feed",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Maximum number of entries to process from each feed. Set to 0 to process all entries.",
                        "default": 20
                    },
                    "proxyConfiguration": {
                        "title": "Proxy Configuration",
                        "type": "object",
                        "description": "Optional proxy settings. Default keeps Apify Proxy disabled for faster direct feed fetching.",
                        "default": {
                            "useApifyProxy": false
                        }
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
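For projects without a client library, the `run-sync-get-dataset-items` path from the specification above can be called with the Python standard library. The sketch below only builds the request, so it can be inspected without a network call. `build_run_request` is a hypothetical helper, the token is passed as the `token` query parameter because that is what the specification documents, and the commented-out sending step assumes the response body is JSON.

```python
import json
import urllib.parse
import urllib.request

# Server URL and path copied from the OpenAPI specification above.
API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/shahidirfan~rss-xml-scraper/run-sync-get-dataset-items"


def build_run_request(token, run_input):
    """Build (but do not send) the POST request described by the specification."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_run_request(
    "<YOUR_API_TOKEN>",
    {"urls": ["https://feeds.bbci.co.uk/news/rss.xml"], "maxEntries": 20},
)
print(request.get_method(), request.full_url.split("?")[0])

# To actually run the Actor, send the request and decode the response
# (assumes a JSON body; a synchronous run blocks until the Actor finishes):
# with urllib.request.urlopen(request) as response:
#     items = json.load(response)
```

Since the token ends up in the URL with this approach, avoid logging `request.full_url` in full.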
