arXiv Scraper avatar

arXiv Scraper

Pricing

from $2.00 / 1,000 paper returneds

Go to Apify Store
arXiv Scraper

arXiv Scraper

Search arXiv via the official API and get clean, structured paper metadata: title, abstract, authors, categories, DOI, dates, and abstract + PDF links. No key, no login, no anti-bot. Uses arXiv search syntax (all:, cat:, ti:, au:).

Pricing

from $2.00 / 1,000 paper returneds

Rating

5.0

(1)

Developer

Dami's Studio

Dami's Studio

Maintained by Community

Actor stats

0

Bookmarked

3

Total users

1

Monthly active users

2 days ago

Last modified

Share

Search arXiv via the official arXiv API and get clean, structured paper metadata. No API key, no login, no anti-bot — just polite, paginated requests to export.arxiv.org.

What it does

Given an arXiv search query, the actor fetches the Atom feed from the arXiv API, parses every <entry>, dedupes by arXiv ID, and returns one clean record per paper.

Input

FieldTypeDefaultDescription
querystringall:large language modelsarXiv search query (see syntax below). Required.
sortByselectrelevancerelevance, submittedDate, or lastUpdatedDate.
maxItemsinteger50Max papers to return (1–30000). Paginates 100/page.
proxyConfigurationproxynoneOptional — the public arXiv API has no anti-bot, so no proxy is used by default. Only enable one if you hit IP rate limits.

Search query syntax

arXiv uses field prefixes you can combine with AND / OR / ANDNOT:

  • all:transformer — search all fields
  • cat:cs.AI — papers in a category (e.g. cs.CL, cs.LG, stat.ML)
  • ti:attention — title
  • au:hinton — author
  • abs:diffusion — abstract
  • ti:attention AND au:vaswani — combine

Examples: all:large language models, cat:cs.CL, ti:"retrieval augmented generation".

Output

One record per paper:

{
"ok": true,
"arxivId": "2401.12345",
"title": "Paper title",
"abstract": "Whitespace-collapsed abstract…",
"authors": ["First Author", "Second Author"],
"primaryCategory": "cs.CL",
"categories": ["cs.CL", "cs.AI"],
"publishedAt": "2024-01-23T18:00:00Z",
"updatedAt": "2024-02-01T12:00:00Z",
"doi": "10.1000/xyz123",
"absUrl": "http://arxiv.org/abs/2401.12345v1",
"pdfUrl": "http://arxiv.org/pdf/2401.12345v1"
}

Nullable fields: doi is null when the paper has no registered DOI. primaryCategory, publishedAt, updatedAt, and absUrl can also be null if the arXiv feed omits them for an entry. pdfUrl is derived from the arXiv ID when the feed doesn't include a PDF link, so it is null only when the ID itself is missing.

Results are deduplicated by arxivId, falling back to absUrl and then title when a record has no arXiv ID.

Notes

  • Be polite: the actor pauses ~3 seconds between pages, as arXiv requests.
  • arXiv hard-limits the total reachable results per query to about 30000.
  • On an error, empty result, or missing query, the actor writes a single diagnostic row (ok: false, with an errorCode such as NO_RESULTS, BAD_INPUT, RATE_LIMITED, or NETWORK) and does not charge. The run still finishes cleanly so you can inspect the reason in the dataset.

Charging

Charges one paper unit per successfully returned paper. Diagnostic / empty rows are never charged.