Pricing

Pay per event

arXiv Scraper — Search & Export Paper Metadata

Search arXiv by query, category, or author and export structured paper metadata — title, authors, abstract, primary category, DOI, PDF URL, submitted and updated timestamps — to JSON or CSV. An arXiv API wrapper that handles pagination, retries, and rate-limit pacing for your pipeline.

Developer: DevilScrapes

Last modified: 12 days ago

🎯 What this scrapes

arXiv's Atom feed at export.arxiv.org/api/query is the canonical source for preprint paper metadata. It is also paginated, rate-limited, and quick to push back on anything that looks like aggressive bulk access. This Actor wraps it with a polished input schema, paces requests to stay within arXiv's courtesy guidelines, paginates automatically across large result sets, and writes one structured row per paper. We absorb the transient errors and pushback; you get a dataset that drops cleanly into research dashboards, citation-tracking tools, RAG pipelines, or ML training corpora.

Need arXiv paper metadata for an entire category slice (for example, all cs.AI submissions from 2025)? A single sweep like that can exceed 30 000 records, which means hours of hand-rolled pagination that the Actor handles end-to-end.

🔥 Features — What We Handle for You

🛡️ Browser fingerprint rotation — curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so the upstream sees a browser, not a Python script.
🌐 Residential proxy rotation via Apify Proxy — fresh session ID and exit IP whenever the upstream pushes back.
🔁 Retries with exponential backoff on 408 / 429 / 5xx — up to 5 attempts per page, Retry-After header honoured.
🧱 Rate-limit-aware pacing — we slow down rather than accumulate blocks; partial progress is always surfaced, never silently dropped.
🧊 Clean, typed dataset rows — Pydantic-validated fields, ISO-8601 timestamps, stable IDs. Export as JSON, CSV, or Excel directly from Apify Console.
💰 Pay-Per-Event pricing — you pay only for results that land in your dataset. No data, no charge beyond the small start fee.
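
To make the retry behaviour described above concrete, here is a minimal sketch of exponential backoff that honours a Retry-After header. It is an illustration, not the Actor's actual code: the helper names, the 1-second base delay, the 60-second cap, and the 10% jitter are all assumptions. Only the retryable status codes and the 5-attempt limit come from the feature list.

```python
import random
from typing import Optional

# From the feature list: retry on 408 / 429 / 5xx, up to 5 attempts per page.
RETRYABLE_STATUSES = {408, 429} | set(range(500, 600))
MAX_ATTEMPTS = 5


def is_retryable(status: int) -> bool:
    """True when the HTTP status is one the feature list says is retried."""
    return status in RETRYABLE_STATUSES


def backoff_delay(attempt: int, retry_after: Optional[str] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait before retry number `attempt` (1-based).

    `base`, `cap`, and the jitter range are assumed values for illustration.
    Only the delay-in-seconds form of Retry-After is handled here; the
    HTTP-date form falls through to the exponential schedule.
    """
    if retry_after is not None and retry_after.strip().isdigit():
        return float(retry_after.strip())
    delay = min(cap, base * 2 ** (attempt - 1))
    # A little random jitter keeps parallel workers from retrying in lockstep.
    return delay + random.uniform(0, 0.1 * delay)
```

A caller would loop up to `MAX_ATTEMPTS` times, sleeping for `backoff_delay(attempt, response_headers.get("Retry-After"))` whenever `is_retryable(status)` is true.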

💡 Use cases

RAG corpus building — pull every cs.AI / cs.LG / cs.CL abstract from the past year and load it straight into ChromaDB, Pinecone, or Weaviate for semantic search over papers.
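Author tracking — schedule weekly runs for au:<your-name> and diff the results to detect new papers published under that name. The arXiv API does not expose citation data, so tracking who cites your work needs a separate source.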
Trend monitoring — daily pull from a specific category to feed a research digest or newsletter.
Dataset curation — extract all papers matching a topic + date range to seed a systematic literature review or benchmark evaluation.
Notification pipeline — stream new results into Slack or Discord when a paper matches a saved query.
VC / competitive intelligence — map research output by lab, author, or topic over time to surface emerging areas.
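
Several of the use cases above come down to comparing one run's dataset with the previous one. A minimal sketch of that diff, assuming each row carries the `arxiv_id` field from the output schema; `base_id` and `new_papers` are hypothetical helpers, not something the Actor provides:

```python
import re


def base_id(arxiv_id: str) -> str:
    """Strip the version suffix, e.g. '2401.12345v2' -> '2401.12345'."""
    return re.sub(r"v\d+$", "", arxiv_id)


def new_papers(previous: list[dict], current: list[dict]) -> list[dict]:
    """Rows in `current` whose paper was not present in `previous`.

    Comparing on the version-less ID means a revised paper (v1 -> v2)
    is not reported as new.
    """
    seen = {base_id(row["arxiv_id"]) for row in previous}
    return [row for row in current if base_id(row["arxiv_id"]) not in seen]
```

The returned rows are what a notification step would forward to Slack, Discord, or a digest.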

⚙️ How to use it

Click Try for free at the top of the Store listing.
Fill in the input form — searchQuery is the only required field and ships with a working default (cat:cs.AI).
Click Start. Results stream into the run's dataset in real time.
Export from Storage → Dataset as JSON, CSV, or Excel — or pull via the Apify API.

For large sweeps (tens of thousands of records), set maxResults to your target count and let the Actor page through automatically. Runs can be scheduled from the Actor's Schedules tab.
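
As a rough picture of what "paging through automatically" involves, the sketch below turns `maxResults` and `pageSize` into the sequence of (start, count) requests a client would issue. `page_plan` is a hypothetical helper written for illustration; the 2 000 page-size cap is the one stated in the input table.

```python
ARXIV_MAX_PAGE_SIZE = 2000  # page-size cap stated in the input table


def page_plan(max_results: int, page_size: int) -> list[tuple[int, int]]:
    """Return (start, count) pairs that together cover `max_results` items."""
    if max_results < 0 or page_size <= 0:
        raise ValueError("max_results must be >= 0 and page_size must be > 0")
    page_size = min(page_size, ARXIV_MAX_PAGE_SIZE)
    plan = []
    start = 0
    while start < max_results:
        # The last page may be shorter than page_size.
        count = min(page_size, max_results - start)
        plan.append((start, count))
        start += count
    return plan
```

For example, `page_plan(120, 50)` yields three requests: two full pages of 50 and a final page of 20.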

📥 Input

| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `searchQuery` | `string` | yes | `cat:cs.AI` | arXiv search query. Use field prefixes: `ti:` (title), `au:` (author), `cat:` (category), `abs:` (abstract), `all:` (all fields). Boolean operators `AND`, `OR`, `ANDNOT` are supported. |
| `sortBy` | `string` | no | `submittedDate` | Order results by `submittedDate`, `lastUpdatedDate`, or `relevance`. |
| `sortOrder` | `string` | no | `descending` | `ascending` or `descending`. |
| `maxResults` | `integer` | no | `50` | Total papers to fetch. arXiv recommends staying under 30 000 per query. |
| `pageSize` | `integer` | no | `50` | Papers per API call. arXiv caps page size at 2 000. |
| `proxyConfiguration` | `object` | no | `{"useApifyProxy": false}` | Apify Proxy configuration. Enable residential proxies for large-volume or scheduled runs to ensure consistent delivery. |

Example input

{
  "searchQuery": "cat:cs.AI",
  "sortBy": "submittedDate",
  "sortOrder": "descending",
  "maxResults": 3,
  "pageSize": 3,
  "proxyConfiguration": {
    "useApifyProxy": false
  }
}

📤 Output

One dataset item per paper.

| Field | Type | Notes |
|---|---|---|
| `arxiv_id` | `string` | arXiv identifier, e.g. `2401.12345v2`. |
| `url` | `string` | Abstract page URL on arxiv.org. |
| `pdf_url` | `string` | Direct PDF link. |
| `title` | `string` | Paper title (whitespace-normalised). |
| `summary` | `string` | Abstract text. |
| `authors` | `array` | Author names in submission order. |
| `primary_category` | `string` | Primary arXiv category slug, e.g. `cs.AI`. |
| `categories` | `array` | All arXiv categories the paper is tagged with. |
| `doi` | `string \| null` | DOI if assigned (null for most preprints). |
| `journal_ref` | `string \| null` | Journal reference if the paper has been published. |
| `comment` | `string \| null` | Authors' note (page count, conference acceptance, etc.). |
| `published` | `string` | Original submission timestamp (ISO-8601 UTC). |
| `updated` | `string` | Last revision timestamp (ISO-8601 UTC). |
| `scraped_at` | `string` | Timestamp when this row was recorded by the Actor. |

Example output

{
  "arxiv_id": "2604.12345v2",
  "url": "https://arxiv.org/abs/2604.12345v2",
  "pdf_url": "https://arxiv.org/pdf/2604.12345v2",
  "title": "Scaling Laws for Sparse Mixture-of-Experts Language Models",
  "summary": "<abstract text>",
  "authors": ["Alex Doe", "Jamie Smith"],
  "primary_category": "cs.CL",
  "categories": ["cs.CL", "cs.LG"],
  "doi": null,
  "journal_ref": null,
  "comment": "Accepted at NeurIPS 2025",
  "published": "2026-04-12T16:00:00+00:00",
  "updated": "2026-04-14T09:00:00+00:00",
  "scraped_at": "2026-06-01T10:00:00+00:00"
}
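
For readers curious how an Atom entry maps onto a row like the one above, here is a minimal standard-library sketch. It is not the Actor's parser: `entry_to_row` and `text_or_none` are hypothetical helpers, the sample feed is hand-written and trimmed to a few elements, and only a subset of the output fields is filled in. The two namespace URIs are the ones used by arXiv's Atom responses.

```python
import xml.etree.ElementTree as ET

NS = {
    "atom": "http://www.w3.org/2005/Atom",
    "arxiv": "http://arxiv.org/schemas/atom",
}

# Hand-written sample, trimmed for illustration (not a real API response).
SAMPLE = """<feed xmlns="http://www.w3.org/2005/Atom"
      xmlns:arxiv="http://arxiv.org/schemas/atom">
  <entry>
    <id>http://arxiv.org/abs/2401.12345v2</id>
    <title>Scaling Laws for Sparse
      Mixture-of-Experts Language Models</title>
    <author><name>Alex Doe</name></author>
    <author><name>Jamie Smith</name></author>
    <arxiv:primary_category term="cs.CL"/>
    <category term="cs.CL"/>
    <category term="cs.LG"/>
    <link title="pdf" href="http://arxiv.org/pdf/2401.12345v2"
          rel="related" type="application/pdf"/>
  </entry>
</feed>"""


def text_or_none(entry, path):
    """Whitespace-normalised text of the first match, or None if absent."""
    node = entry.find(path, NS)
    if node is None or not node.text:
        return None
    return " ".join(node.text.split())


def entry_to_row(entry):
    """Map one Atom <entry> to a partial row using the output field names."""
    url = text_or_none(entry, "atom:id")
    pdf_url = next(
        (link.get("href") for link in entry.findall("atom:link", NS)
         if link.get("title") == "pdf"),
        None,
    )
    return {
        "arxiv_id": url.rsplit("/abs/", 1)[-1],
        "url": url,
        "pdf_url": pdf_url,
        "title": text_or_none(entry, "atom:title"),
        "authors": [n.text for n in entry.findall("atom:author/atom:name", NS)],
        "primary_category": entry.find("arxiv:primary_category", NS).get("term"),
        "categories": [c.get("term") for c in entry.findall("atom:category", NS)],
        "doi": text_or_none(entry, "arxiv:doi"),  # None when the element is absent
    }


rows = [entry_to_row(e) for e in ET.fromstring(SAMPLE).findall("atom:entry", NS)]
```

The namespace handling is the part most hand-rolled scripts get wrong: unprefixed elements such as `<title>` live in the Atom namespace, while `primary_category` and `doi` live in the arXiv one.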

💰 Pricing

Pay-Per-Event — you are charged only when these events fire:

| Event | USD | What it covers |
|---|---|---|
| `actor-start` | $0.005 | One-off warm-up charge per run |
| `result` | $0.0015 | Per dataset item written |

Example: 1 000 results at the rates above ≈ $1.50. No subscription, no minimum commitment, no card required to start — Apify gives every new account $5 of free credit.
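
The arithmetic behind that example can be reproduced in a few lines. This is a back-of-the-envelope sketch using the two event prices from the table; `estimate_cost` is a hypothetical helper, and it assumes no other events are billed.

```python
from decimal import Decimal

# Event prices from the pricing table above.
ACTOR_START_USD = Decimal("0.005")
RESULT_USD = Decimal("0.0015")


def estimate_cost(results: int, runs: int = 1) -> Decimal:
    """Estimated charge in USD for `results` dataset items across `runs` runs."""
    return runs * ACTOR_START_USD + results * RESULT_USD
```

`estimate_cost(1000)` comes to $1.505: the $1.50 quoted above plus the $0.005 start fee. `Decimal` is used so that sub-cent prices add up exactly rather than accumulating floating-point error.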

🚧 Limitations

Metadata only — the Actor uses the Atom API. Full-text search over PDF content is not supported; queries operate on arXiv metadata fields (title, abstract, authors, categories).
Author disambiguation — arXiv does not expose canonical author IDs in the public API. Resolving name collisions across similar author strings is left to the caller.
30 000-record soft ceiling — arXiv's own documentation recommends keeping single queries under 30 000 results. The Actor enforces a polite inter-request delay to stay within arXiv's rate-limit guidance; very large sweeps will take proportionally longer.
Preprint freshness window — newly submitted papers typically appear in the feed within 1–2 hours of arXiv ingest, but that window is not guaranteed.
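
One common way to stay under a per-query ceiling like the 30 000-record figure above is to split a long sweep into smaller date windows and run one query per window. This is a general technique, not something the listing says the Actor does for you, and `month_windows` is a hypothetical helper:

```python
from datetime import date, timedelta


def month_windows(year: int) -> list[tuple[date, date]]:
    """Return (first_day, last_day) for each calendar month of `year`."""
    windows = []
    for month in range(1, 13):
        first = date(year, month, 1)
        # First day of the following month, rolling over into the next year.
        if month == 12:
            next_first = date(year + 1, 1, 1)
        else:
            next_first = date(year, month + 1, 1)
        windows.append((first, next_first - timedelta(days=1)))
    return windows
```

Each window can then be turned into a `submittedDate` filter in `searchQuery` and run as its own query, keeping every individual result set well below the ceiling.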

❓ FAQ

Is this legal?

Yes — arXiv publishes the Atom API specifically for programmatic access and bulk metadata retrieval. We respect arXiv's rate-limit guidance and identify the Actor in the request User-Agent per their documentation.

What is the arXiv API and can I use it directly?

The arXiv API (the Atom feed at export.arxiv.org/api/query) is a free public endpoint for querying paper metadata. You can query it directly from Python, using the arxiv package on PyPI or raw HTTP calls, but at scale you will run into pagination complexity, rate-limit pushback, and XML parsing overhead. This Actor handles all of that and writes clean structured rows without you touching a single XML namespace.
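
For a sense of what a raw HTTP call looks like, the sketch below builds a query URL with the standard library. `build_query_url` is a hypothetical helper, and the `https` scheme is an assumption; the parameter names (`search_query`, `start`, `max_results`, `sortBy`, `sortOrder`) are those of the public arXiv API.

```python
from urllib.parse import urlencode

# Endpoint named in this listing; the https scheme is an assumption.
ARXIV_ENDPOINT = "https://export.arxiv.org/api/query"


def build_query_url(search_query: str, start: int = 0, max_results: int = 50,
                    sort_by: str = "submittedDate",
                    sort_order: str = "descending") -> str:
    """Build one page request for the arXiv query endpoint."""
    params = {
        "search_query": search_query,
        "start": start,
        "max_results": max_results,
        "sortBy": sort_by,
        "sortOrder": sort_order,
    }
    # urlencode percent-escapes the colon and spaces in the query string.
    return f"{ARXIV_ENDPOINT}?{urlencode(params)}"
```

Fetching that URL returns an Atom XML document. Paging, pacing, retries, and parsing are then left to the caller, which is the work the Actor takes on.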

I already know Python — why not just write an arXiv API wrapper?

Writing an arXiv API wrapper in Python is a reasonable weekend project for small queries. For production-grade batch ingestion — scheduled refreshes, thousands of records, retry-safe paging, proxy-backed delivery — the Actor saves meaningful engineering time and runs on Apify's infrastructure without a server to maintain.

Can I download PDFs?

Not directly — the Actor surfaces the pdf_url field for every paper. You can pass that URL to a follow-up Actor or a curl loop to fetch the actual files.
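
A follow-up download step could look like the sketch below. It is illustrative only: `pdf_filename` and `download_pdfs` are hypothetical helpers, the 3-second pause is an assumed courtesy delay, and `rows` is expected to be a list of dataset items carrying the `pdf_url` field.

```python
import time
import urllib.request
from pathlib import Path


def pdf_filename(pdf_url: str) -> str:
    """Derive a local file name from a pdf_url, adding '.pdf' if missing."""
    name = pdf_url.rstrip("/").rsplit("/", 1)[-1]
    return name if name.endswith(".pdf") else f"{name}.pdf"


def download_pdfs(rows, out_dir: str = "pdfs", pause_seconds: float = 3.0) -> None:
    """Fetch each row's PDF into out_dir, skipping files already present."""
    target = Path(out_dir)
    target.mkdir(parents=True, exist_ok=True)
    for row in rows:
        dest = target / pdf_filename(row["pdf_url"])
        if dest.exists():
            continue  # makes the loop safe to re-run after an interruption
        urllib.request.urlretrieve(row["pdf_url"], str(dest))
        time.sleep(pause_seconds)  # assumed pause between downloads
```

Only the last path segment is used for the file name, so old-style identifiers that contain a slash would need a different naming scheme to avoid collisions.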

Why do some records have a null DOI?

The doi field carries the DOI of the published version, which authors typically add only once the paper has formally appeared in a journal or proceedings. Most preprints have none, so we surface null for those entries and your pipeline can handle them explicitly.

How do I target a specific date range?

arXiv's query language supports date filters via submittedDate:[YYYYMMDDTTTT TO YYYYMMDDTTTT], where TTTT is the 24-hour time in GMT. Include that expression in your searchQuery field, e.g. cat:cs.AI AND submittedDate:[202501010000 TO 202512312359].
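
If you generate these queries programmatically, a small helper avoids hand-typing the timestamps. This is a sketch: `with_date_range` is a hypothetical function, and it uses the 12-digit form from arXiv's API manual (date plus 24-hour time in GMT), covering each day from 00:00 to 23:59.

```python
from datetime import date


def with_date_range(query: str, first: date, last: date) -> str:
    """Append a submittedDate filter covering first..last (inclusive)."""
    if first > last:
        raise ValueError("first must not be later than last")
    window = f"submittedDate:[{first:%Y%m%d}0000 TO {last:%Y%m%d}2359]"
    return f"{query} AND {window}"
```

The result goes straight into `searchQuery`. A query that already contains `OR` would need its own grouping before the date filter is appended, which this helper does not attempt.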

💬 Your feedback

Spotted a bug, hit a rate-limit edge case, or need an extra field? Open an issue on the Actor's Issues tab in Apify Console — we ship fixes weekly and we read every report.
