arXiv Scraper — Search & Export Paper Metadata
Pricing
Pay per event
Search arXiv by query, category, or author and export structured paper metadata — title, authors, abstract, primary category, DOI, PDF URL, submitted and updated timestamps — to JSON or CSV. An arXiv API wrapper that handles pagination, retries, and rate-limit pacing for your pipeline.
Developer
DevilScrapes
Maintained by Community
🎯 What this scrapes
arXiv's Atom feed at export.arxiv.org/api/query is the canonical source for preprint paper metadata. It is also paginated, rate-limited, and quick to push back on anything that looks like aggressive bulk access. This Actor wraps it with a polished input schema, paces requests to stay within arXiv's courtesy guidelines, paginates automatically across large result sets, and writes one structured row per paper. We absorb the transient errors and pushback; you get a dataset that drops cleanly into research dashboards, citation-tracking tools, RAG pipelines, or ML training corpora.
Looking to download arXiv paper metadata across an entire category slice (for example, all cs.AI submissions from 2025)? A single sweep like that can exceed 30 000 records, which means hours of hand-rolled pagination that we handle end-to-end.
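For a sense of what one paged request against the feed looks like, here is a minimal sketch that builds query URLs from the public arXiv API parameters (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`). The helper name `build_query_url` is hypothetical and not part of the Actor; it only illustrates the kind of request being wrapped.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def build_query_url(search_query, start=0, max_results=50,
                    sort_by="submittedDate", sort_order="descending"):
    """Return the URL for one page of arXiv results (illustrative helper)."""
    params = {
        "search_query": search_query,  # e.g. "cat:cs.AI"
        "start": start,                # zero-based offset of the first result
        "max_results": max_results,    # page size for this request
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    # urlencode percent-encodes reserved characters such as ":" and spaces.
    return f"{ARXIV_API}?{urlencode(params)}"
```

Calling `build_query_url("cat:cs.AI", start=50)` gives the URL for the second page of 50 results.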
🔥 Features — What We Handle for You
- 🛡️ Browser fingerprint rotation: `curl-cffi` impersonates real Chrome / Firefox / Safari TLS handshakes so the upstream sees a browser, not a Python script.
- 🌐 Residential proxy rotation via Apify Proxy: fresh session ID and exit IP whenever the upstream pushes back.
- 🔁 Retries with exponential backoff on `408 / 429 / 5xx`: up to 5 attempts per page, `Retry-After` header honoured.
- 🧱 Rate-limit-aware pacing: we slow down rather than accumulate blocks; partial progress is always surfaced, never silently dropped.
- 🧊 Clean, typed dataset rows: Pydantic-validated fields, ISO-8601 timestamps, stable IDs. Export as JSON, CSV, or Excel directly from Apify Console.
- 💰 Pay-Per-Event pricing: you pay only for results that land in your dataset. No data, no charge beyond the small start fee.
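The retry behaviour described above can be sketched roughly as follows. This is not the Actor's source: the base delay, the cap, and the jitter factor are assumptions chosen for illustration, and only the numeric (delta-seconds) form of `Retry-After` is handled.

```python
import random
from typing import Optional

MAX_ATTEMPTS = 5            # matches the "up to 5 attempts per page" above
RETRYABLE_STATUSES = {408, 429}


def should_retry(status: int) -> bool:
    """Retry on 408, 429, and any 5xx response."""
    return status in RETRYABLE_STATUSES or 500 <= status <= 599


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    `base` and `cap` are assumed values, not documented Actor settings.
    """
    # Honour a numeric Retry-After header when the server sends one.
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    # Otherwise double the delay on each attempt, up to the cap.
    delay = min(cap, base * 2 ** (attempt - 1))
    # Add up to 10% jitter so parallel runs do not retry in lockstep.
    return delay + random.uniform(0, delay * 0.1)
```

With these assumed defaults the waits grow as roughly 1, 2, 4, and 8 seconds between the five attempts, unless the server asks for a specific pause.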
💡 Use cases
- RAG corpus building: pull every `cs.AI` / `cs.LG` / `cs.CL` abstract from the past year and load it straight into ChromaDB, Pinecone, or Weaviate for semantic search over papers.
- Author tracking: schedule weekly runs for `au:<author-name>` and diff successive datasets to detect new or revised papers by that author. The arXiv API does not expose citation data, so tracking who cites a paper needs a separate source.
- Trend monitoring: daily pull from a specific category to feed a research digest or newsletter.
- Dataset curation: extract all papers matching a topic + date range to seed a systematic literature review or benchmark evaluation.
- Notification pipeline: stream new results into Slack or Discord when a paper matches a saved query.
- VC / competitive intelligence: map research output by lab, author, or topic over time to surface emerging areas.
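Several of these use cases come down to comparing one scheduled run with the previous one. A minimal sketch, assuming both runs have been exported as lists of rows with an `arxiv_id` field shaped like `2401.12345v2`; the helper names `split_id` and `diff_runs` are illustrative only.

```python
import re

_VERSION_SUFFIX = re.compile(r"v(\d+)$")


def split_id(arxiv_id):
    """Split '2401.12345v2' into ('2401.12345', 2); version 0 if absent."""
    match = _VERSION_SUFFIX.search(arxiv_id)
    if match is None:
        return arxiv_id, 0
    return arxiv_id[:match.start()], int(match.group(1))


def diff_runs(previous_rows, current_rows):
    """Return (new, revised) rows in the current run relative to the previous one."""
    seen = dict(split_id(row["arxiv_id"]) for row in previous_rows)
    new, revised = [], []
    for row in current_rows:
        base, version = split_id(row["arxiv_id"])
        if base not in seen:
            new.append(row)          # paper was not in the previous run
        elif version > seen[base]:
            revised.append(row)      # same paper, newer version
    return new, revised
```

The `new` list is what a Slack or Discord notifier would post; `revised` is useful when a digest should also mention updated versions.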
⚙️ How to use it
- Click Try for free at the top of the Store listing.
- Fill in the input form. `searchQuery` is the only required field and ships with a working default (`cat:cs.AI`).
- Click Start. Results stream into the run's dataset in real time.
- Export from Storage → Dataset as JSON, CSV, or Excel, or pull via the Apify API.
For large sweeps (tens of thousands of records), set maxResults to your target count and let the Actor page through automatically. Runs can be scheduled from the Actor's Schedules tab.
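If you would rather pull results programmatically than export from the Console, a minimal standard-library sketch against Apify's dataset items endpoint could look like this. The dataset ID and API token are placeholders you supply from your own run and account; the helper names are illustrative.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

APIFY_API = "https://api.apify.com/v2"


def dataset_items_url(dataset_id, fmt="json", limit=None):
    """Build the items URL for a run's dataset (fmt can be json, csv, xlsx, ...)."""
    params = {"format": fmt, "clean": "true"}
    if limit is not None:
        params["limit"] = limit
    return f"{APIFY_API}/datasets/{dataset_id}/items?{urlencode(params)}"


def fetch_items(dataset_id, token):
    """Download all rows of a dataset as a list of dicts (performs a network call)."""
    request = Request(
        dataset_items_url(dataset_id),
        headers={"Authorization": f"Bearer {token}"},
    )
    with urlopen(request, timeout=30) as response:
        return json.load(response)
```

For very large datasets it is usually kinder to memory to page through with `limit` and an offset instead of fetching everything in one response.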
📥 Input
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `searchQuery` | string | yes | `cat:cs.AI` | arXiv search query. Use field prefixes: `ti:` (title), `au:` (author), `cat:` (category), `abs:` (abstract), `all:` (all fields). Boolean operators `AND`, `OR`, `ANDNOT` are supported. |
| `sortBy` | string | no | `submittedDate` | Order results by `submittedDate`, `lastUpdatedDate`, or `relevance`. |
| `sortOrder` | string | no | `descending` | `ascending` or `descending`. |
| `maxResults` | integer | no | `50` | Total papers to fetch. arXiv recommends staying under 30 000 per query. |
| `pageSize` | integer | no | `50` | Papers per API call. arXiv caps page size at 2 000. |
| `proxyConfiguration` | object | no | `{"useApifyProxy": false}` | Apify Proxy configuration. Enable residential proxies for large-volume or scheduled runs to ensure consistent delivery. |
Example input
{
  "searchQuery": "cat:cs.AI",
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "maxResults": 3,
  "pageSize": 3,
  "proxyConfiguration": {"useApifyProxy": false}
}
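To see how `maxResults` and `pageSize` plausibly translate into API calls, here is a small sketch that plans the `(start, count)` pairs for a sweep. The Actor's internal paging logic is not published, so treat `plan_pages` as a hypothetical illustration that only relies on the 2 000-per-page cap mentioned in the table.

```python
ARXIV_PAGE_CAP = 2000  # page-size cap stated in the input table


def plan_pages(max_results, page_size, page_cap=ARXIV_PAGE_CAP):
    """Return a list of (start, count) pairs covering max_results papers."""
    if max_results < 0 or page_size < 1:
        raise ValueError("max_results must be >= 0 and page_size >= 1")
    size = min(page_size, page_cap)   # never ask for more than the cap
    pages = []
    start = 0
    while start < max_results:
        count = min(size, max_results - start)  # last page may be short
        pages.append((start, count))
        start += count
    return pages
```

The example input above (`maxResults` 3, `pageSize` 3) needs a single call, while a 30 000-record sweep at the maximum page size needs 15.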
📤 Output
One dataset item per paper.
| Field | Type | Notes |
|---|---|---|
| `arxiv_id` | string | arXiv identifier, e.g. `2401.12345v2`. |
| `url` | string | Abstract page URL on arxiv.org. |
| `pdf_url` | string | Direct PDF link. |
| `title` | string | Paper title (whitespace-normalised). |
| `summary` | string | Abstract text. |
| `authors` | array | Author names in submission order. |
| `primary_category` | string | Primary arXiv category slug, e.g. `cs.AI`. |
| `categories` | array | All arXiv categories the paper is tagged with. |
| `doi` | string \| null | DOI if assigned (null for most preprints). |
| `journal_ref` | string \| null | Journal reference if the paper has been published. |
| `comment` | string \| null | Authors' note (page count, conference acceptance, etc.). |
| `published` | string | Original submission timestamp (ISO-8601 UTC). |
| `updated` | string | Last revision timestamp (ISO-8601 UTC). |
| `scraped_at` | string | Timestamp when this row was recorded by the Actor. |
Example output
{
  "arxiv_id": "2401.12345v2",
  "url": "https://arxiv.org/abs/2401.12345v2",
  "pdf_url": "https://arxiv.org/pdf/2401.12345v2",
  "title": "Scaling Laws for Sparse Mixture-of-Experts Language Models",
  "authors": ["Alex Doe", "Jamie Smith"],
  "primary_category": "cs.CL",
  "categories": ["cs.CL", "cs.LG"],
  "doi": null,
  "journal_ref": null,
  "comment": "Accepted at NeurIPS 2025",
  "published": "2026-04-12T16:00:00+00:00",
  "updated": "2026-04-14T09:00:00+00:00",
  "scraped_at": "2026-06-01T10:00:00+00:00"
}
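Downstream code often wants typed values rather than raw strings. A minimal sketch, assuming rows shaped like the example output; the `Paper` class and `paper_from_row` helper are illustrative names, not something the Actor ships.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class Paper:
    arxiv_id: str
    title: str
    primary_category: str
    doi: Optional[str]
    published: datetime
    updated: datetime

    @property
    def was_revised(self) -> bool:
        """True when the paper has been updated since first submission."""
        return self.updated > self.published


def paper_from_row(row: dict) -> Paper:
    """Convert one dataset row into a typed Paper record."""
    return Paper(
        arxiv_id=row["arxiv_id"],
        title=row["title"],
        primary_category=row["primary_category"],
        doi=row.get("doi"),  # None for most preprints
        # ISO-8601 strings with a +00:00 offset parse into aware datetimes.
        published=datetime.fromisoformat(row["published"]),
        updated=datetime.fromisoformat(row["updated"]),
    )
```

Because the timestamps carry an explicit UTC offset, the resulting `datetime` objects are timezone-aware and compare safely with each other.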
💰 Pricing
Pay-Per-Event — you are charged only when these events fire:
| Event | USD | What it covers |
|---|---|---|
| `actor-start` | $0.005 | One-off warm-up charge per run |
| `result` | $0.0015 | Per dataset item written |
Example: 1 000 results at the rates above ≈ $1.50. No subscription, no minimum commitment, no card required to start — Apify gives every new account $5 of free credit.
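The arithmetic behind that estimate, as a small sketch using the two event prices from the table. `estimate_cost` is an illustrative helper; actual charges are whatever Apify bills for the run.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")   # charged once per run
RESULT_USD = Decimal("0.0015")       # charged per dataset item written


def estimate_cost(results, runs=1):
    """Estimated USD cost for `results` items written across `runs` runs."""
    return ACTOR_START_USD * runs + RESULT_USD * results
```

A single run that writes 1 000 results comes to $1.505, and a full 30 000-record sweep to $45.005.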
🚧 Limitations
- Metadata only — the Actor uses the Atom API. Full-text search over PDF content is not supported; queries operate on arXiv metadata fields (title, abstract, authors, categories).
- Author disambiguation — arXiv does not expose canonical author IDs in the public API. Resolving name collisions across similar author strings is left to the caller.
- 30 000-record soft ceiling — arXiv's own documentation recommends keeping single queries under 30 000 results. The Actor enforces a polite inter-request delay to stay within arXiv's rate-limit guidance; very large sweeps will take proportionally longer.
- Preprint freshness window — newly submitted papers typically appear in the feed within 1–2 hours of arXiv ingest, but that window is not guaranteed.
❓ FAQ
Is this legal?
Yes — arXiv publishes the Atom API specifically for programmatic access and bulk metadata retrieval. We respect arXiv's rate-limit guidance and identify the Actor in the request User-Agent per their documentation.
What is the arXiv API and can I use it directly?
The arXiv API is a free public endpoint for querying paper metadata, served as an Atom feed from export.arxiv.org/api/query (arXiv also runs a separate OAI-PMH interface for bulk harvesting). You can query it directly with Python using the arxiv PyPI library or raw HTTP calls, but at scale you will hit pagination complexity, rate-limit pushback, and XML parsing overhead. This Actor handles all of that and writes clean structured rows without you touching a single namespace.
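To illustrate the namespace handling involved in going direct, here is a minimal standard-library sketch that maps Atom entries to a few of the output fields. It is an approximation of the idea, not the Actor's parser, and it covers only a subset of the fields.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}


def parse_feed(xml_data):
    """Parse an arXiv Atom response (str or bytes) into a list of row dicts."""
    root = ET.fromstring(xml_data)
    rows = []
    for entry in root.findall("atom:entry", NS):
        abs_url = entry.findtext("atom:id", default="", namespaces=NS)
        title = entry.findtext("atom:title", default="", namespaces=NS)
        primary = entry.find("arxiv:primary_category", NS)
        rows.append({
            # The entry id is the abstract URL; the arXiv id is its tail.
            "arxiv_id": abs_url.rsplit("/abs/", 1)[-1],
            "url": abs_url,
            # Titles can wrap across lines in the feed; collapse whitespace.
            "title": " ".join(title.split()),
            "authors": [
                author.findtext("atom:name", default="", namespaces=NS)
                for author in entry.findall("atom:author", NS)
            ],
            "primary_category": primary.get("term") if primary is not None else None,
            "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
            # arxiv:doi is absent for most preprints, which yields None.
            "doi": entry.findtext("arxiv:doi", default=None, namespaces=NS),
        })
    return rows
```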
I already know Python — why not just write an arXiv API wrapper?
Writing an arXiv API wrapper in Python is a reasonable weekend project for small queries. For production-grade batch ingestion — scheduled refreshes, thousands of records, retry-safe paging, proxy-backed delivery — the Actor saves meaningful engineering time and runs on Apify's infrastructure without a server to maintain.
Can I download PDFs?
Not directly — the Actor surfaces the pdf_url field for every paper. You can pass that URL to a follow-up Actor or a curl loop to fetch the actual files.
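A follow-up download loop might look like the sketch below. The three-second pause, the output directory, and the User-Agent string are assumptions chosen to keep the loop polite; adjust them to whatever arXiv's current guidance asks for.

```python
import time
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import Request, urlopen


def pdf_filename(pdf_url):
    """Derive a local file name such as '2401.12345v2.pdf' from a pdf_url."""
    # Keep everything after /pdf/ so old-style ids like hep-th/9901001v1 stay unique.
    name = urlparse(pdf_url).path.split("/pdf/", 1)[-1].replace("/", "_")
    return name if name.endswith(".pdf") else f"{name}.pdf"


def download_pdfs(rows, out_dir="pdfs", delay_seconds=3.0):
    """Fetch each row's pdf_url into out_dir, skipping files already present."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    for row in rows:
        target = out / pdf_filename(row["pdf_url"])
        if target.exists():
            continue  # makes the loop safe to resume after an interruption
        request = Request(row["pdf_url"], headers={"User-Agent": "pdf-fetch-sketch/0.1"})
        with urlopen(request, timeout=60) as response:
            target.write_bytes(response.read())
        time.sleep(delay_seconds)  # assumed courtesy pause between downloads
```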
Why do some records have a null DOI?
Most arXiv preprints do not receive a DOI until the paper is formally published in a journal. We surface null for those entries so your pipeline can handle them gracefully.
How do I target a specific date range?
arXiv's query language supports date filters via submittedDate:[YYYYMMDDTTTT TO YYYYMMDDTTTT], where TTTT is the 24-hour time to the minute in GMT. Include that expression in your searchQuery field, e.g. cat:cs.AI AND submittedDate:[202501010000 TO 202512312359].
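If you generate queries from code, a small helper can build the date clause for you. This sketch uses the twelve-digit form with an explicit GMT time, which is how arXiv's API manual documents the filter; the function names are illustrative.

```python
from datetime import date


def date_range_clause(start: date, end: date) -> str:
    """Build a submittedDate filter covering start 00:00 to end 23:59 GMT."""
    if end < start:
        raise ValueError("end date must not be before start date")
    return f"submittedDate:[{start:%Y%m%d}0000 TO {end:%Y%m%d}2359]"


def with_date_range(query: str, start: date, end: date) -> str:
    """Combine an existing query with a date range for the searchQuery field."""
    # Parentheses keep any OR terms in the original query grouped together.
    return f"({query}) AND {date_range_clause(start, end)}"
```

The returned string goes into `searchQuery` unchanged; wrapping the original query in parentheses keeps an expression like `cat:cs.AI OR cat:cs.LG` from being split by the added `AND`.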
💬 Your feedback
Spotted a bug, hit a rate-limit edge case, or need an extra field? Open an issue on the Actor's Issues tab in Apify Console — we ship fixes weekly and we read every report.