# arXiv Preprint Scraper (`parseforge/arxiv-scraper`) Actor

Export preprints from arXiv.org. Search 2.5M+ open-access papers across physics, mathematics, computer science, biology, economics, and quantitative finance. Query by keyword, author, category, or date range. Pull titles, authors, abstracts, categories, DOIs, journal refs, and PDF links.

- **URL**: https://apify.com/parseforge/arxiv-scraper.md
- **Developed by:** [ParseForge](https://apify.com/parseforge) (community)
- **Categories:** Other, Automation
- **Stats:** 17 total users, 2 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: 5.00 out of 5 stars

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage, only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the higher your subscription plan, the lower the price.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

![ParseForge Banner](https://github.com/ParseForge/apify-assets/blob/ad35ccc13ddd068b9d6cba33f323962e39aed5b2/banner.jpg?raw=true)

## 📐 arXiv Preprint Scraper

> 🚀 **Export open-access research papers in seconds.** Query **2.5M+ arXiv preprints** across physics, math, computer science, biology, finance, statistics, economics, and electrical engineering by keyword, author, category, or date range. No API key, no registration, no XML parsing.

> 🕒 **Last updated:** 2026-05-27 · **📊 13 fields** per record · **📄 2.5M+ preprints** · **🧠 8 macro disciplines** · **🔖 170+ subject categories**

The **arXiv Preprint Scraper** searches the open-access arXiv archive and returns **13 fields per record**, including arXiv ID, title, authors, abstract, primary category, every secondary category, DOI, journal reference, comment, publication dates, and direct links to both the PDF and the abstract page. arXiv has hosted preprints since 1991 and has become the de facto place where physicists, computer scientists, and mathematicians first publish new work.

The catalog covers **eight macro disciplines and more than 170 subject categories**, from `cs.LG` (machine learning) and `math.AG` (algebraic geometry) to `q-bio.NC` (neurons and cognition) and `econ.EM` (econometrics). This Actor accepts the full arXiv query syntax, so you can filter by title, abstract, author, category, or any boolean combination, and download the dataset as CSV, Excel, JSON, or XML.

| 🎯 Target Audience | 💡 Primary Use Cases |
|---|---|
| ML engineers, academic researchers, literature-review teams, science journalists, R&D groups, librarians, citation tools, AI agents | Paper discovery, trend tracking, author monitoring, citation graphs, RAG/training data, alert pipelines, systematic reviews |

---

### 📋 What the arXiv Scraper does

Four research workflows in a single run:

- 🔍 **Keyword search.** Use `all:transformer` or `ti:attention abs:retrieval` to scope to titles or abstracts.
- 👤 **Author monitoring.** Use `au:hinton` or `au:lecun AND cat:cs.LG` to track an author's output.
- 🧠 **Category feeds.** Use `cat:cs.LG`, `cat:hep-ph`, or `cat:q-bio.NC` for category-specific firehoses.
- 📅 **Recency sort.** Order by `submittedDate` or `lastUpdatedDate`, descending or ascending, to surface the latest work first.

Each record includes the arXiv identifier, full title, every co-author, the abstract, primary and secondary categories, the DOI if assigned, journal reference, author comments, both `publishedDate` and `updatedDate` timestamps, and ready-to-open PDF and abstract URLs.

> 💡 **Why it matters:** scientific output doubles roughly every nine years. Tracking the literature by hand is impossible. Calling the public arXiv interface yourself means writing an XML parser, respecting rate limits, and managing pagination. This Actor turns that into a one-click data pull that returns clean JSON.

---

### 🎬 Full Demo

_🚧 Coming soon: a 3-minute walkthrough showing how to go from sign-up to a downloaded dataset of papers._

---

### ⚙️ Input

<table>
<thead>
<tr><th>Input</th><th>Type</th><th>Default</th><th>Behavior</th></tr>
</thead>
<tbody>
<tr><td>maxItems</td><td>integer</td><td>10</td><td>Records to return. Free plan caps at 10, paid plan at 1,000,000.</td></tr>
<tr><td>searchQuery</td><td>string</td><td>"all:transformer"</td><td>arXiv query syntax. Prefixes: <code>all</code>, <code>ti</code>, <code>abs</code>, <code>au</code>, <code>cat</code>, <code>id</code>. Boolean operators <code>AND</code>, <code>OR</code>, and <code>ANDNOT</code> are supported.</td></tr>
<tr><td>sortBy</td><td>string</td><td>"relevance"</td><td>One of <code>relevance</code>, <code>lastUpdatedDate</code>, or <code>submittedDate</code>.</td></tr>
<tr><td>sortOrder</td><td>string</td><td>"descending"</td><td><code>descending</code> (newest first) or <code>ascending</code> (oldest first).</td></tr>
</tbody>
</table>

**Example: 50 most recent machine-learning preprints.**

```json
{
    "maxItems": 50,
    "searchQuery": "cat:cs.LG",
    "sortBy": "submittedDate",
    "sortOrder": "descending"
}
````

**Example: up to 100 papers with an author matching `lecun` and `neural` in the abstract, newest first.**

```json
{
    "maxItems": 100,
    "searchQuery": "au:lecun AND abs:neural",
    "sortBy": "submittedDate",
    "sortOrder": "descending"
}
```
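
The input examples above can be checked locally before a run is started. The following is a minimal sketch, assuming the limits and enum values listed in this Actor's input schema (`maxItems` between 1 and 1,000,000, three `sortBy` values, two `sortOrder` values). The helper name `validate_input` is hypothetical and is not part of the Actor or the Apify client.

```python
# Allowed values copied from the Actor's published input schema.
ALLOWED_SORT_BY = {"relevance", "lastUpdatedDate", "submittedDate"}
ALLOWED_SORT_ORDER = {"descending", "ascending"}
MAX_ITEMS_LIMIT = 1_000_000


def validate_input(run_input: dict) -> list:
    """Return a list of problems found in an Actor input dict (empty if none)."""
    problems = []

    max_items = run_input.get("maxItems")
    if max_items is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(max_items, bool) or not isinstance(max_items, int):
            problems.append("maxItems must be an integer")
        elif not 1 <= max_items <= MAX_ITEMS_LIMIT:
            problems.append(f"maxItems must be between 1 and {MAX_ITEMS_LIMIT}")

    query = run_input.get("searchQuery")
    if query is not None and (not isinstance(query, str) or not query.strip()):
        problems.append("searchQuery must be a non-empty string")

    sort_by = run_input.get("sortBy")
    if sort_by is not None and sort_by not in ALLOWED_SORT_BY:
        problems.append(f"sortBy must be one of {sorted(ALLOWED_SORT_BY)}")

    sort_order = run_input.get("sortOrder")
    if sort_order is not None and sort_order not in ALLOWED_SORT_ORDER:
        problems.append(f"sortOrder must be one of {sorted(ALLOWED_SORT_ORDER)}")

    return problems
```

An empty list means the input matches the documented constraints; it does not guarantee that arXiv will accept the query string itself.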

> ⚠️ **Good to Know:** arXiv is a preprint server. Most papers are pre-publication and may not yet be peer-reviewed. The `journalRef` field is populated once an author updates the metadata after journal acceptance, and the `doi` field follows the same rule. For systematic reviews, combine this Actor with a peer-review check downstream.

***

### 📊 Output

Each preprint record contains **13 metadata fields** plus a `scrapedAt` timestamp. Download the dataset as CSV, Excel, JSON, or XML.

#### 🧾 Schema

| Field | Type | Example |
|---|---|---|
| 🆔 `arxivId` | string | `"1706.03762v7"` |
| 📄 `title` | string | `"Attention Is All You Need"` |
| 👥 `authors` | string\[] | `["Ashish Vaswani", "Noam Shazeer", "..."]` |
| 📝 `summary` | string | `"The dominant sequence transduction models..."` |
| 🧠 `primaryCategory` | string | `"cs.CL"` |
| 🏷️ `categories` | string\[] | `["cs.CL", "cs.LG"]` |
| 🔗 `doi` | string \| null | `"10.48550/arXiv.1706.03762"` |
| 📚 `journalRef` | string \| null | `"NeurIPS 2017"` |
| 💬 `comment` | string \| null | `"15 pages, 5 figures"` |
| 📅 `publishedDate` | ISO 8601 | `"2017-06-12T17:57:34Z"` |
| 🔁 `updatedDate` | ISO 8601 | `"2023-08-02T00:41:18Z"` |
| 📥 `pdfUrl` | string | `"https://arxiv.org/pdf/1706.03762"` |
| 🔖 `abstractUrl` | string | `"http://arxiv.org/abs/1706.03762v7"` |
| 🕒 `scrapedAt` | ISO 8601 | `"2026-05-27T00:00:00.000Z"` |
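
For downstream code, it can help to load records into a typed structure so that nullable fields and timestamps are handled in one place. The sketch below assumes the field names and types from the schema table above and covers only a subset of the fields. The `Preprint` class and its `was_revised` property are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


def _parse_ts(value: str) -> datetime:
    # Swap the trailing 'Z' for an explicit UTC offset so that
    # datetime.fromisoformat also accepts it on older Python versions.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


@dataclass
class Preprint:
    arxiv_id: str
    title: str
    authors: List[str]
    primary_category: str
    doi: Optional[str]          # null until a DOI is recorded
    journal_ref: Optional[str]  # null until a journal reference is recorded
    published: datetime
    updated: datetime

    @classmethod
    def from_record(cls, record: dict) -> "Preprint":
        """Build a Preprint from one dataset item."""
        return cls(
            arxiv_id=record["arxivId"],
            title=record["title"],
            authors=list(record.get("authors") or []),
            primary_category=record["primaryCategory"],
            doi=record.get("doi"),
            journal_ref=record.get("journalRef"),
            published=_parse_ts(record["publishedDate"]),
            updated=_parse_ts(record["updatedDate"]),
        )

    @property
    def was_revised(self) -> bool:
        """True when the preprint was updated after its first submission."""
        return self.updated > self.published
```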

#### 📦 Sample records

<details>
<summary><strong>🧠 Foundational ML paper: Attention Is All You Need</strong></summary>

```json
{
    "arxivId": "1706.03762v7",
    "title": "Attention Is All You Need",
    "authors": [
        "Ashish Vaswani",
        "Noam Shazeer",
        "Niki Parmar",
        "Jakob Uszkoreit",
        "Llion Jones",
        "Aidan N. Gomez",
        "Lukasz Kaiser",
        "Illia Polosukhin"
    ],
    "summary": "The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms...",
    "primaryCategory": "cs.CL",
    "categories": ["cs.CL", "cs.LG"],
    "doi": null,
    "journalRef": null,
    "comment": "15 pages, 5 figures",
    "publishedDate": "2017-06-12T17:57:34Z",
    "updatedDate": "2023-08-02T00:41:18Z",
    "pdfUrl": "https://arxiv.org/pdf/1706.03762",
    "abstractUrl": "http://arxiv.org/abs/1706.03762v7",
    "scrapedAt": "2026-05-27T00:00:00.000Z"
}
```

</details>

<details>
<summary><strong>🔬 Physics preprint: Higgs boson cross-section measurement</strong></summary>

```json
{
    "arxivId": "2402.15212v2",
    "title": "Measurement of the Higgs boson production cross sections in proton-proton collisions",
    "authors": ["ATLAS Collaboration"],
    "summary": "A measurement of the Higgs boson production cross sections in proton-proton collisions at a centre-of-mass energy of 13 TeV is presented...",
    "primaryCategory": "hep-ex",
    "categories": ["hep-ex"],
    "doi": "10.1016/j.physletb.2024.138853",
    "journalRef": "Phys. Lett. B 854 (2024) 138853",
    "comment": "32 pages, 13 figures",
    "publishedDate": "2024-02-23T11:08:42Z",
    "updatedDate": "2024-07-10T08:14:11Z",
    "pdfUrl": "https://arxiv.org/pdf/2402.15212",
    "abstractUrl": "http://arxiv.org/abs/2402.15212v2",
    "scrapedAt": "2026-05-27T00:00:00.000Z"
}
```

</details>

<details>
<summary><strong>🧬 Quantitative biology: protein structure prediction</strong></summary>

```json
{
    "arxivId": "2310.04318v1",
    "title": "Equivariant Diffusion for Protein Backbone Generation",
    "authors": ["Lin Chen", "Yutong Zhang", "Jinwoo Lee"],
    "summary": "We introduce an equivariant diffusion model that generates protein backbone structures conditioned on sequence motifs...",
    "primaryCategory": "q-bio.BM",
    "categories": ["q-bio.BM", "cs.LG"],
    "doi": null,
    "journalRef": null,
    "comment": "Under review",
    "publishedDate": "2023-10-06T15:42:01Z",
    "updatedDate": "2023-10-06T15:42:01Z",
    "pdfUrl": "https://arxiv.org/pdf/2310.04318",
    "abstractUrl": "http://arxiv.org/abs/2310.04318v1",
    "scrapedAt": "2026-05-27T00:00:00.000Z"
}
```

</details>

***

### ✨ Why choose this Actor

| | Capability |
|---|---|
| 📚 | **2.5M+ preprints.** Every paper hosted on arXiv across physics, math, CS, statistics, quantitative biology, quantitative finance, economics, and electrical engineering. |
| 🎯 | **Full arXiv query syntax.** Title, abstract, author, category, ID, and boolean operators all work. |
| 📅 | **Recency sort.** Sort by submission date or last update for date-bounded discovery. |
| ⚡ | **Fast.** 100 records per page, fully paginated. 1,000 papers in under two minutes. |
| 🧰 | **Ready for downstream pipelines.** Clean JSON with arXiv IDs, DOIs, and direct PDF links for RAG, training, or citation graphs. |
| 🔁 | **Always fresh.** arXiv updates continuously. Every run hits the live archive. |
| 🚫 | **No registration.** Uses only public open-access metadata. No login or API key required. |

> 🧠 Many state-of-the-art results in modern AI appear on arXiv months before they reach a peer-reviewed journal. Skip the lag.

***

### 📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Query power | Setup |
|---|---|---|---|---|---|
| **⭐ arXiv Preprint Scraper** *(this Actor)* | $5 free credit, then pay-per-use | **2.5M+** preprints | **Live per run** | Full arXiv syntax + sort | ⚡ 2 min |
| Google Scholar scraping | Variable | Broad but noisy | Live | Keyword only | ⏳ Hours, captcha-prone |
| Semantic Scholar API | Free tier | 200M+ papers | Daily | Limited operators | 🐢 Days (API key, quotas) |
| Manual arXiv listing pages | Free | All of arXiv | Live | UI clicks only | 🐢 No automation |

Pick this Actor when you want arXiv-quality metadata with a clean schema and zero parser maintenance.

***

### 🚀 How to use

1. 📝 **Sign up.** [Create a free account with $5 credit](https://console.apify.com/sign-up?fpr=vmoqkp) (takes 2 minutes).
2. 🌐 **Open the Actor.** Go to the arXiv Preprint Scraper page on the Apify Store.
3. 🎯 **Set input.** Write an arXiv query (e.g. `cat:cs.LG`), pick a sort order, and set `maxItems`.
4. 🚀 **Run it.** Click **Start** and let the Actor collect your data.
5. 📥 **Download.** Grab your results in the **Dataset** tab as CSV, Excel, JSON, or XML.

> ⏱️ Total time from signup to downloaded dataset: **3-5 minutes.** No coding required.

***

### 💼 Business use cases

<table>
<tr>
<td width="50%" valign="top">

#### 🧠 AI / ML R\&D teams

- Daily firehose of new `cs.LG`, `cs.CL`, and `cs.CV` papers
- Build training corpora and RAG indexes from abstracts
- Track competitor authors and labs by name
- Surface state-of-the-art benchmarks via abstract keywords

</td>
<td width="50%" valign="top">

#### 📊 Investment & VC research

- Monitor deep-tech preprints from portfolio companies
- Track quant finance category `q-fin.*` for new strategies
- Spot academic spin-out candidates before they raise
- Build technology landscape reports from preprint clusters

</td>
</tr>
<tr>
<td width="50%" valign="top">

#### 📰 Science journalism & comms

- Find embargoed-but-public physics and biomed preprints
- Build alerts on senior-author names for explainer pieces
- Pull abstracts for newsletters and round-ups
- Cross-reference DOIs with published-version journals

</td>
<td width="50%" valign="top">

#### 🏥 Pharma & biotech intelligence

- Track `q-bio.BM` and `q-bio.GN` for target-discovery work
- Author-monitor academic collaborators
- Build literature dashboards for therapeutic areas
- Cite-graph upstream papers feeding clinical pipelines

</td>
</tr>
</table>

***

### 🔌 Automating arXiv Scraper

Control the scraper programmatically for scheduled runs and pipeline integrations:

- 🟢 **Node.js.** Install the `apify-client` NPM package.
- 🐍 **Python.** Use the `apify-client` PyPI package.
- 📚 See the [Apify API documentation](https://docs.apify.com/api/v2) for full details.

The [Apify Schedules feature](https://docs.apify.com/platform/schedules) lets you trigger this Actor on any cron interval. A daily refresh on `cat:cs.LG` plus `sortBy: submittedDate` gives you a continuously updated "what's new" feed for any subject category.
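
A scheduled feed may return papers that an earlier run already delivered, including new versions of the same preprint. One way to handle this is to deduplicate on the arXiv ID with its version suffix removed. This is a minimal sketch, assuming `arxivId` values shaped like the examples in this README (`"2402.15212v2"`). The helpers `base_id` and `select_new` are hypothetical, and how you persist the `seen` set between runs is up to you.

```python
def base_id(arxiv_id: str) -> str:
    """Drop a trailing version suffix: '2402.15212v2' -> '2402.15212'."""
    head, sep, tail = arxiv_id.rpartition("v")
    # Only strip when the text after the last 'v' is a pure version number.
    return head if sep and tail.isdigit() else arxiv_id


def select_new(records, seen):
    """Return records whose base ID is not in `seen`, and record them as seen."""
    fresh = []
    for record in records:
        key = base_id(record["arxivId"])
        if key not in seen:
            seen.add(key)
            fresh.append(record)
    return fresh
```

Usage: load the set of IDs from the previous run, call `select_new(items, seen)`, forward only the returned records, then store `seen` again.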

***

### 🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

<table>
<tr>
<td width="50%">

#### 🎓 Research and academia

- Systematic literature reviews with reproducible queries
- Bibliometric analyses and citation-network construction
- Course reading-list generation for graduate seminars
- Thesis-defense literature scans across categories

</td>
<td width="50%">

#### 🎨 Personal and creative

- Hobby ML newsletter and Substack curation
- Personal "papers to read" digest powered by RSS
- Tools that surface arXiv papers to lay readers
- Art projects visualizing the shape of human knowledge

</td>
</tr>
<tr>
<td width="50%">

#### 🤝 Non-profit and civic

- Public-interest tech evaluations of academic claims
- Disinformation researchers tracking preprint origins
- Civic-science explainers for climate and public-health topics
- Education non-profits building free curricula from open papers

</td>
<td width="50%">

#### 🧪 Experimentation

- Train topic classifiers and embedding models on abstracts
- Benchmark retrieval systems against arXiv-scale corpora
- Prototype academic-search frontends and chat assistants
- Build agent pipelines that resolve paper IDs to PDFs

</td>
</tr>
</table>

***

### 🤖 Ask an AI assistant about this scraper

Open a ready-to-send prompt about this ParseForge Actor in the AI of your choice:

- 💬 [**ChatGPT**](https://chat.openai.com/?q=How%20do%20I%20use%20the%20arXiv%20Preprint%20Scraper%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)
- 🧠 [**Claude**](https://claude.ai/new?q=How%20do%20I%20use%20the%20arXiv%20Preprint%20Scraper%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)
- 🔍 [**Perplexity**](https://perplexity.ai/search?q=How%20do%20I%20use%20the%20arXiv%20Preprint%20Scraper%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)
- 🅒 [**Copilot**](https://copilot.microsoft.com/?q=How%20do%20I%20use%20the%20arXiv%20Preprint%20Scraper%20by%20ParseForge%20on%20Apify%3F%20Show%20me%20input%20examples%2C%20output%20fields%2C%20common%20use%20cases%2C%20and%20how%20to%20integrate%20it%20into%20a%20workflow.)

***

### ❓ Frequently Asked Questions

#### 🧩 How does it work?

You write an arXiv query (e.g. `cat:cs.LG`), pick a sort order, and click Start. The Actor hits the public arXiv catalog, paginates through results, and emits one clean JSON record per paper. No setup, no captchas.

#### 🔍 What query syntax can I use?

arXiv's full query language. Prefixes include `all`, `ti` (title), `abs` (abstract), `au` (author), `cat` (category), and `id`. Combine with `AND`, `OR`, and `ANDNOT`. Wrap multi-word phrases in quotes, e.g. `ti:"neural radiance fields"`.
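
Query strings of this form can be assembled programmatically rather than by hand. The sketch below uses only the prefixes and operators listed above and quotes multi-word phrases. The function names `clause` and `combine` are hypothetical helpers, not part of the Actor.

```python
PREFIXES = {"all", "ti", "abs", "au", "cat", "id"}
OPERATORS = {"AND", "OR", "ANDNOT"}


def clause(prefix: str, term: str) -> str:
    """Build one `prefix:term` clause, quoting multi-word phrases."""
    if prefix not in PREFIXES:
        raise ValueError(f"unknown prefix: {prefix!r}")
    term = term.strip()
    if not term:
        raise ValueError("term must not be empty")
    if any(ch.isspace() for ch in term):
        term = f'"{term}"'
    return f"{prefix}:{term}"


def combine(clauses, operator: str = "AND") -> str:
    """Join clauses with a single boolean operator."""
    if operator not in OPERATORS:
        raise ValueError(f"unknown operator: {operator!r}")
    return f" {operator} ".join(clauses)


# Example: papers by an author matching "lecun" in the cs.LG category.
query = combine([clause("au", "lecun"), clause("cat", "cs.LG")])
print(query)  # au:lecun AND cat:cs.LG
```

The resulting string can be passed as `searchQuery` in the Actor input.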

#### 📚 Which subject categories are covered?

All of arXiv: physics (`hep-ph`, `cond-mat`, `gr-qc`, etc.), mathematics (`math.AG`, `math.PR`, etc.), computer science (`cs.LG`, `cs.CV`, etc.), statistics, quantitative biology, quantitative finance, economics, and electrical engineering.

#### 🔁 How often is the data refreshed?

arXiv updates continuously as authors submit new versions. Every run of this Actor fetches the live archive, so your dataset is current at run time.

#### 📅 Can I get only the latest papers?

Yes. Set `sortBy` to `submittedDate` and `sortOrder` to `descending`. The first records returned will be the most recently submitted preprints.

#### 🔗 Do I get the full PDF?

The dataset includes a direct `pdfUrl` to the PDF on arXiv. You can download PDFs separately or pipe the URLs into a downloader. Full-text extraction is not part of this Actor.
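
Piping the URLs into a downloader could look like the sketch below, which uses only the Python standard library. It assumes `pdfUrl` values shaped like the examples in this README. The helper names, the `pdfs` target directory, and the 3-second pause between requests are assumptions for illustration; check arXiv's current access guidelines before downloading in bulk, since requests from scripts may be throttled or refused.

```python
import time
import urllib.request
from pathlib import Path
from urllib.parse import urlparse


def pdf_filename(pdf_url: str) -> str:
    """Derive a local file name from a `pdfUrl`, e.g. '1706.03762.pdf'."""
    paper_id = urlparse(pdf_url).path.rsplit("/pdf/", 1)[-1]
    if paper_id.endswith(".pdf"):
        paper_id = paper_id[: -len(".pdf")]
    # Old-style IDs such as 'hep-th/9901001' contain a slash; make them file-safe.
    return paper_id.replace("/", "_") + ".pdf"


def download_pdfs(records, target_dir="pdfs", delay_seconds=3.0):
    """Download each record's PDF once, pausing between requests."""
    out = Path(target_dir)
    out.mkdir(parents=True, exist_ok=True)
    saved = []
    for record in records:
        destination = out / pdf_filename(record["pdfUrl"])
        if not destination.exists():
            urllib.request.urlretrieve(record["pdfUrl"], destination)
            time.sleep(delay_seconds)  # assumed polite delay, not an official limit
        saved.append(destination)
    return saved
```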

#### 💬 What is the `comment` field?

It's an author-supplied note attached to the preprint, often "20 pages, 5 figures" or "Accepted at NeurIPS 2024". It's optional, so it may be `null` on older or minimally annotated papers.

#### ⚖️ Is it legal to use arXiv metadata?

arXiv's terms permit programmatic access for non-commercial and most commercial research uses. The metadata (titles, authors, abstracts) is publicly viewable. Always review the latest arXiv terms for your specific use case.

#### 💳 Do I need a paid Apify plan?

No. The free Apify plan is enough for testing and small runs (10 records per run). A paid plan lifts the limit and gives you scheduling, higher concurrency, and larger datasets.

#### 🔁 What happens if a run fails?

The Actor automatically retries transient errors and rotates outbound connections. If a run still fails, you can inspect the log, fix the input, and re-run. Partial datasets from failed runs are preserved.
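
When runs are started from code, the returned run object can be inspected before the dataset is read. This is a small sketch, assuming the run is a dictionary with `id`, `status`, and `defaultDatasetId` keys, as in the Python example in the API section, and assuming the standard Apify run statuses. `describe_run` is a hypothetical helper.

```python
# Terminal statuses that indicate the run did not complete normally.
TERMINAL_FAILURES = {"FAILED", "TIMED-OUT", "ABORTED"}


def describe_run(run) -> str:
    """Summarize the outcome of a run dict (or None if no run was returned)."""
    if run is None:
        return "no run information returned"
    status = run.get("status")
    if status == "SUCCEEDED":
        return "ok"
    if status in TERMINAL_FAILURES:
        return (
            f"run {run.get('id')} ended with {status}; "
            f"partial results may be in dataset {run.get('defaultDatasetId')}"
        )
    return f"run {run.get('id')} has status {status}"
```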

#### 🆘 What if I need help?

Contact us through the Apify platform or use the Tally form linked below.

***

### 🔌 Integrate with any app

arXiv Scraper connects to any cloud service via [Apify integrations](https://apify.com/integrations):

- [**Make**](https://docs.apify.com/platform/integrations/make) - Automate multi-step workflows
- [**Zapier**](https://docs.apify.com/platform/integrations/zapier) - Connect with 5,000+ apps
- [**Slack**](https://docs.apify.com/platform/integrations/slack) - Get run notifications in your channels
- [**Airbyte**](https://docs.apify.com/platform/integrations/airbyte) - Pipe paper records into your warehouse
- [**GitHub**](https://docs.apify.com/platform/integrations/github) - Trigger runs from commits and releases
- [**Google Drive**](https://docs.apify.com/platform/integrations/drive) - Export datasets straight to Sheets

You can also use webhooks to trigger downstream actions when a run finishes. Push fresh paper metadata into your RAG index, or alert your team in Slack when a watched author posts a new preprint.

***

### 🔗 Recommended Actors

- [**✈️ OurAirports Scraper**](https://apify.com/parseforge/ourairports-scraper) - Global airport reference database
- [**💼 Greenhouse Jobs Scraper**](https://apify.com/parseforge/greenhouse-jobs-scraper) - Pull research and engineering job postings
- [**📈 LinkedIn Jobs Scraper**](https://apify.com/parseforge/linkedin-jobs-scraper) - Track academic-adjacent industry roles
- [**🔍 Monster Scraper**](https://apify.com/parseforge/monster-scraper) - U.S. job market signal for research talent
- [**🧑‍💼 Lever Jobs Scraper**](https://apify.com/parseforge/lever-jobs-scraper) - Pipeline of startup R\&D openings

> 💡 **Pro Tip:** browse the complete [ParseForge collection](https://apify.com/parseforge) for more research and reference-data scrapers.

***

**🆘 Need Help?** [**Open our contact form**](https://tally.so/r/BzdKgA) to request a new scraper, propose a custom data project, or report an issue.

***

> **⚠️ Disclaimer:** this Actor is an independent tool and is not affiliated with, endorsed by, or sponsored by arXiv, Cornell University, or any of its contributors. All trademarks mentioned are the property of their respective owners. Only publicly available open-access preprint metadata is collected.

# Actor input Schema

## `maxItems` (type: `integer`):

Free users: Limited to 10 items (preview). Paid users: Optional, max 1,000,000

## `searchQuery` (type: `string`):

arXiv query syntax. Examples: 'all:transformer', 'ti:attention abs:machine', 'au:hinton', 'cat:cs.LG'. See https://info.arxiv.org/help/api/user-manual.html.

## `sortBy` (type: `string`):

How to order results.

## `sortOrder` (type: `string`):

Ascending or descending order for the selected sort field.

## Actor input object example

```json
{
  "maxItems": 10,
  "searchQuery": "all:transformer",
  "sortBy": "relevance",
  "sortOrder": "descending"
}
```

# Actor output Schema

## `overview` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "maxItems": 10,
    "searchQuery": "all:transformer",
    "sortBy": "relevance",
    "sortOrder": "descending"
};

// Run the Actor and wait for it to finish
const run = await client.actor("parseforge/arxiv-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "maxItems": 10,
    "searchQuery": "all:transformer",
    "sortBy": "relevance",
    "sortOrder": "descending",
}

# Run the Actor and wait for it to finish
run = client.actor("parseforge/arxiv-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
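
If you prefer to build the CSV yourself rather than download it from the Dataset tab, the items returned by the client can be flattened with the standard library. The sketch below assumes the 14 field names shown in the output schema and joins the list-valued `authors` and `categories` fields with `"; "`. That separator and the helper name `records_to_csv` are illustrative choices, not the format of Apify's own CSV export.

```python
import csv
import io

COLUMNS = [
    "arxivId", "title", "authors", "summary", "primaryCategory", "categories",
    "doi", "journalRef", "comment", "publishedDate", "updatedDate",
    "pdfUrl", "abstractUrl", "scrapedAt",
]


def records_to_csv(records) -> str:
    """Serialize dataset items to CSV text with one row per preprint."""
    buffer = io.StringIO()
    # Unknown keys are ignored; missing keys become empty cells.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    for record in records:
        row = dict(record)
        for key in ("authors", "categories"):
            row[key] = "; ".join(record.get(key) or [])
        # Write null values (doi, journalRef, comment) as empty cells.
        writer.writerow({k: ("" if v is None else v) for k, v in row.items()})
    return buffer.getvalue()
```

Usage: `Path("papers.csv").write_text(records_to_csv(items), encoding="utf-8")`, where `items` is the list collected from `iterate_items()`.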

## CLI example

```bash
echo '{
  "maxItems": 10,
  "searchQuery": "all:transformer",
  "sortBy": "relevance",
  "sortOrder": "descending"
}' |
apify call parseforge/arxiv-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=parseforge/arxiv-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```
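
Without a client library, the `run-sync-get-dataset-items` endpoint from the OpenAPI specification can be called directly over HTTP. The following sketch uses only the Python standard library and takes the server URL, path, and `token` query parameter from that specification. The helper names `build_request` and `fetch_items` and the 300-second timeout are assumptions, as is the expectation that the response body is a JSON array of dataset items.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/parseforge~arxiv-scraper/run-sync-get-dataset-items"


def build_request(token: str, run_input: dict) -> urllib.request.Request:
    """Build (but do not send) the POST request for a synchronous run."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(run_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_items(token: str, run_input: dict, timeout: float = 300.0):
    """Run the Actor synchronously and return the parsed response body."""
    request = build_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Keep the token out of source control, for example by reading it from an environment variable before calling `fetch_items`.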

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "arXiv Preprint Scraper",
        "description": "Export preprints from arXiv.org. Search 2.5M+ open-access papers across physics, mathematics, computer science, biology, economics, and quantitative finance. Query by keyword, author, category, or date range. Pull titles, authors, abstracts, categories, DOIs, journal refs, and PDF links.",
        "version": "1.0",
        "x-build-id": "AoqVsbjfAaO54fO10"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/parseforge~arxiv-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-parseforge-arxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/parseforge~arxiv-scraper/runs": {
            "post": {
                "operationId": "runs-sync-parseforge-arxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/parseforge~arxiv-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-parseforge-arxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "maxItems": {
                        "title": "Max Items",
                        "minimum": 1,
                        "maximum": 1000000,
                        "type": "integer",
                        "description": "Free users: Limited to 10 items (preview). Paid users: Optional, max 1,000,000"
                    },
                    "searchQuery": {
                        "title": "Search Query",
                        "type": "string",
                        "description": "arXiv query syntax. Examples: 'all:transformer', 'ti:attention abs:machine', 'au:hinton', 'cat:cs.LG'. See https://info.arxiv.org/help/api/user-manual.html."
                    },
                    "sortBy": {
                        "title": "Sort By",
                        "enum": [
                            "relevance",
                            "lastUpdatedDate",
                            "submittedDate"
                        ],
                        "type": "string",
                        "description": "How to order results."
                    },
                    "sortOrder": {
                        "title": "Sort Order",
                        "enum": [
                            "descending",
                            "ascending"
                        ],
                        "type": "string",
                        "description": "Ascending or descending order for the selected sort field."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
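
The `inputSchema` above constrains `maxItems` to the range 1 to 1,000,000 and restricts `sortBy` and `sortOrder` to fixed enum values. The following is a minimal sketch of how a caller might check those constraints locally before starting a run, and how the `defaultDatasetId` field could be read from a payload shaped like `runsResponseSchema`. The helper names `build_input` and `dataset_id_from_run` are hypothetical and are not part of any Apify client; the sketch also assumes every input field is optional, since no `required` list appears in the schema shown here.

```python
from typing import Any, Dict, Optional

# Allowed values copied from the inputSchema enums above.
SORT_BY = ("relevance", "lastUpdatedDate", "submittedDate")
SORT_ORDER = ("descending", "ascending")
MAX_ITEMS_MIN, MAX_ITEMS_MAX = 1, 1_000_000


def build_input(
    search_query: Optional[str] = None,
    max_items: Optional[int] = None,
    sort_by: Optional[str] = None,
    sort_order: Optional[str] = None,
) -> Dict[str, Any]:
    """Hypothetical helper: build an Actor input dict, rejecting out-of-schema values."""
    run_input: Dict[str, Any] = {}
    if search_query is not None:
        run_input["searchQuery"] = search_query
    if max_items is not None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(max_items, bool) or not isinstance(max_items, int):
            raise ValueError("maxItems must be an integer")
        if not MAX_ITEMS_MIN <= max_items <= MAX_ITEMS_MAX:
            raise ValueError(
                f"maxItems must be between {MAX_ITEMS_MIN} and {MAX_ITEMS_MAX}"
            )
        run_input["maxItems"] = max_items
    if sort_by is not None:
        if sort_by not in SORT_BY:
            raise ValueError(f"sortBy must be one of {SORT_BY}")
        run_input["sortBy"] = sort_by
    if sort_order is not None:
        if sort_order not in SORT_ORDER:
            raise ValueError(f"sortOrder must be one of {SORT_ORDER}")
        run_input["sortOrder"] = sort_order
    return run_input


def dataset_id_from_run(response: Dict[str, Any]) -> Optional[str]:
    """Hypothetical helper: read data.defaultDatasetId from a run response dict."""
    return response.get("data", {}).get("defaultDatasetId")


if __name__ == "__main__":
    # Example values only; the query string follows the arXiv syntax
    # mentioned in the searchQuery description.
    print(build_input(search_query="cat:cs.LG", max_items=50,
                      sort_by="submittedDate", sort_order="descending"))
```

Validating locally like this only mirrors the constraints visible in the schema; the Actor itself remains the authority on what input it accepts, including the 10-item preview limit for free users.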
