# medRxiv Scraper (`parseforge/medrxiv-scraper`) Actor

Extract comprehensive preprint data from medRxiv, including titles, authors, abstracts, full text, DOIs, citations, and metadata. Automate access to health-science preprints with structured outputs, ideal for researchers and analysts who need reliable, large-scale article data without manual work.

- **URL**: https://apify.com/parseforge/medrxiv-scraper.md
- **Developed by:** [ParseForge](https://apify.com/parseforge) (community)
- **Categories:** Developer tools, Automation, Other
- **Stats:** 7 total users, 2 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price drops as your subscription plan tier rises.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

![ParseForge Banner](https://raw.githubusercontent.com/ParseForge/apify-assets/main/banner.jpg)

## 🧬 medRxiv Preprint Scraper

> 🚀 **Scrape medRxiv health-science preprints in seconds.** Filter by topic, subject collection, date range, or author. No API key, no registration, no manual CSV wrangling.

> 🕒 **Last updated:** 2026-05-18 · **📊 18 fields** per record · **210,000+ live preprints** · **50 subject collections** · **3 sort orders**

medRxiv is the leading open preprint server for the health sciences, hosting clinical research, epidemiology, public health, and biomedical work months before formal peer review. This Actor turns any medRxiv search (keyword, subject collection, posted-date window, or author) into a structured dataset of full preprint records, complete with title, all authors, posting date, DOI, abstract, full text, PDF link, license, funding statement, competing-interest declarations, data-availability statement, and any data or code repository link the authors disclose. The output drops straight into Google Sheets, BigQuery, Postgres, Notion, or any other tool your team already uses.

Preprint data is hard to harvest at scale. medRxiv exposes a faceted search interface but no public bulk API for researchers; the site sits behind Cloudflare and a Varnish edge that rate-limits direct datacenter traffic. This Actor closes the gap: pick a search query, narrow it with subject collection, date range, or author filters, set how many records you want, and the data lands in your dataset within minutes. **Systematic reviewers**, **clinical research teams**, **public-health analysts**, **bibliometric researchers**, **science journalists**, and **AI training-set curators** all use this kind of feed for living reviews, evidence surveillance, citation tracking, and topic modelling on emerging health research.

| 👥 Target audience | 🎯 Primary use case |
|---|---|
| Systematic reviewers and meta-analysts | Run living searches across health-science preprints |
| Clinical research and trial teams | Track new evidence in a therapeutic area in near real time |
| Public-health analysts and epidemiologists | Surveil outbreak, vaccine, and intervention literature as it lands |
| Bibliometric and science-policy researchers | Build datasets on topic emergence, author networks, and funding patterns |
| Science journalists and editors | Spot newsworthy preprints in a chosen field within hours of posting |
| AI/ML and NLP teams | Curate domain-specific corpora of biomedical text for training and evaluation |

---

### 📋 What the medRxiv Preprint Scraper does

- 🔍 **Any keyword search.** Drive results from a free-text query just like the medRxiv search box.
- 🗂️ **50 subject collections.** Restrict to a specific medical specialty (Epidemiology, Cardiovascular Medicine, Infectious Diseases, HIV/AIDS, Public and Global Health, and 45 more).
- 📅 **Date range filter.** Slice by posting date with `dateFrom` and `dateTo` (YYYY-MM-DD).
- 👤 **Author filter.** Narrow to preprints where a given author name appears.
- 📰 **Full record per preprint.** Title, full author list, posting date, DOI, abstract, full text, PDF URL, license, funding, competing interests, data-availability statement, and disclosed data or code URLs.
- 🔁 **Sort order.** Choose newest first, oldest first, or relevance-ranked.

Each record returns the article URL, title, author list (with affiliation metadata where exposed), posting date, DOI, abstract, full text body, subject area assignment from medRxiv, PDF link, citation string, license text, funding statement, competing-interest statement, author declarations block, data-availability statement, disclosed data or code repository URL, supplementary materials, and the scrape timestamp.

> 💡 **Why it matters:** preprints often surface findings weeks or months before journal publication. In fast-moving health topics (pandemics, vaccines, drug repurposing) waiting for the indexed-in-MEDLINE version means working from stale evidence. This Actor lets analysts track the live preprint frontier without manual click-through.

---

### 🎬 Full Demo

🚧 Coming soon: a 3-minute walkthrough showing how to filter, run, and export medRxiv preprints into Google Sheets and the BI tool of your choice.

---

### ⚙️ Input

<table>
<tr><th>Field</th><th>Type</th><th>Required</th><th>Description</th></tr>
<tr><td>startUrl</td><td>string</td><td>no</td><td>Direct medRxiv.org search URL. Apply filters on medrxiv.org and paste the URL. Mutually exclusive with the filter fields below.</td></tr>
<tr><td>maxItems</td><td>integer</td><td>no</td><td>Maximum number of preprints to return. Free plan: capped at 10. Paid plan: up to 1,000,000.</td></tr>
<tr><td>searchQuery</td><td>string</td><td>no</td><td>Free-text search query (the same string you would type in the medRxiv search box).</td></tr>
<tr><td>subjectCollection</td><td>enum</td><td>no</td><td>One of 50 medRxiv subject collections (slugs like epidemiology, cardiovascular-medicine, infectious-diseases, hiv-aids).</td></tr>
<tr><td>dateFrom</td><td>string</td><td>no</td><td>Lower bound for posting date, YYYY-MM-DD.</td></tr>
<tr><td>dateTo</td><td>string</td><td>no</td><td>Upper bound for posting date, YYYY-MM-DD.</td></tr>
<tr><td>author</td><td>string</td><td>no</td><td>Restrict to preprints with this author name in the author list.</td></tr>
<tr><td>orderBy</td><td>enum</td><td>no</td><td>relevance (best match), newest (newest first), or oldest (oldest first).</td></tr>
</table>

Example: pull the latest 50 infectious-disease preprints from January 2026 onward, sorted newest first.

```json
{
    "searchQuery": "antimicrobial resistance",
    "subjectCollection": "infectious-diseases",
    "dateFrom": "2026-01-01",
    "orderBy": "newest",
    "maxItems": 50
}
````

Example: paste a pre-filtered medRxiv search URL and grab the first 200 results.

```json
{
    "startUrl": "https://www.medrxiv.org/search/bacterial%20infection",
    "maxItems": 200
}
```
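
The input rules in the table above can be checked before a run is started. The following is a minimal Python sketch, not part of the Actor: `build_input` is a hypothetical helper, and it takes the stricter reading of "mutually exclusive with the filter fields below" by rejecting `orderBy` alongside `startUrl` as well, which is an assumption.

```python
import re
from datetime import date

# Assumption: every field except maxItems counts as a "filter field".
FILTER_FIELDS = ("searchQuery", "subjectCollection", "dateFrom", "dateTo", "author", "orderBy")
ORDER_BY_VALUES = ("relevance", "newest", "oldest")
ISO_DATE = re.compile(r"\d{4}-\d{2}-\d{2}")


def build_input(max_items=10, start_url=None, **filters):
    """Return an Actor input dict, rejecting combinations the input table rules out."""
    unknown = set(filters) - set(FILTER_FIELDS)
    if unknown:
        raise ValueError(f"unknown input fields: {sorted(unknown)}")
    # Drop filters that were passed explicitly as None.
    filters = {key: value for key, value in filters.items() if value is not None}
    if start_url and filters:
        raise ValueError("startUrl cannot be combined with filter fields")
    for key in ("dateFrom", "dateTo"):
        if key in filters:
            if not ISO_DATE.fullmatch(filters[key]):
                raise ValueError(f"{key} must be YYYY-MM-DD")
            date.fromisoformat(filters[key])  # rejects impossible dates such as 2026-02-30
    if "dateFrom" in filters and "dateTo" in filters and filters["dateFrom"] > filters["dateTo"]:
        raise ValueError("dateFrom must not be later than dateTo")
    if "orderBy" in filters and filters["orderBy"] not in ORDER_BY_VALUES:
        raise ValueError(f"orderBy must be one of {ORDER_BY_VALUES}")
    run_input = {"maxItems": max_items, **filters}
    if start_url:
        run_input["startUrl"] = start_url
    return run_input
```

Calling `build_input(50, searchQuery="antimicrobial resistance", subjectCollection="infectious-diseases", dateFrom="2026-01-01", orderBy="newest")` reproduces the first example input; passing `start_url` together with any filter raises `ValueError`.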

> ⚠️ **Good to Know:** medRxiv sits behind Cloudflare and a Varnish edge that rate-limits datacenter IPs. The Actor routes every request through Apify Residential US proxies with per-attempt session rotation, so transient 503s retry transparently. Free-plan runs are capped at 10 preprints; upgrade to a paid plan for full batches.

***

### 📊 Output

Each record is one medRxiv preprint, normalised across subject collections.

#### 🧾 Schema

| Field | Type | Example |
|---|---|---|
| 🔗 `url` | string | `https://www.medrxiv.org/content/10.64898/2026.03.16.26348538v1` |
| 📝 `title` | string | `Genomic epidemiology of ESBL-producing Escherichia coli and Klebsiella pneumoniae...` |
| 👥 `authors` | string\[] | `["Germanie Delaisie Abomo", "Gabriel Cedric Bessala", "..."]` |
| 🧑‍🔬 `authorDetails` | object\[] | `[{ "name": "...", "affiliation": "..." }, ...]` |
| 📅 `publicationDate` | string | `Posted March 18, 2026.` |
| 🔢 `doi` | string | `https://doi.org/10.64898/2026.03.16.26348538` |
| 📄 `abstract` | string | `Background Livestock production systems in peri-urban areas are associated with...` |
| 📰 `fullText` | string | Full article body text |
| 🗂️ `subjectAreas` | string\[] | `["Epidemiology"]` |
| 📑 `pdfUrl` | string | `https://www.medrxiv.org/content/10.64898/2026.03.16.26348538v1.full.pdf` |
| 📚 `citationInformation` | string | Auto-generated citation string |
| 📜 `licenseInformation` | string | `It is made available under a CC-BY 4.0 International license.` |
| 💰 `fundingStatement` | string | Funding sources disclosed by authors |
| ⚖️ `competingInterestStatement` | string | Competing-interest declaration |
| 🪪 `authorDeclarations` | string | Author declarations block (IRB, consent, etc.) |
| 🗄️ `dataAvailability` | string | Data-availability statement |
| 🧪 `dataCodeUrl` | string | `https://data.wastewaterscan.org/about` |
| 📎 `supplementaryMaterials` | string\[] | Links to supplementary files |
| 🕒 `scrapedTimestamp` | ISO date | `2026-05-18T01:30:00.000Z` |

#### 📦 Sample records

<details>
<summary>📦 Typical record: epidemiology preprint with full author list, abstract, full text, license, and disclosures</summary>

```json
{
    "url": "https://www.medrxiv.org/content/10.64898/2026.03.16.26348538v1",
    "title": "Genomic epidemiology of ESBL-producing Escherichia coli and Klebsiella pneumoniae across the human-animal-environment interface in peri-urban pig farms in Yaounde, Cameroon",
    "authors": [
        "Germanie Delaisie Abomo",
        "Gabriel Cedric Bessala",
        "Isaac Dah",
        "Michelle M. C. Buckner",
        "Jan-Ulrich Kreft",
        "Blaise Pascal Bougnom"
    ],
    "authorDetails": [
        { "name": "Germanie Delaisie Abomo" },
        { "name": "Gabriel Cedric Bessala" },
        { "name": "Isaac Dah" },
        { "name": "Michelle M. C. Buckner" },
        { "name": "Jan-Ulrich Kreft" },
        { "name": "Blaise Pascal Bougnom" }
    ],
    "publicationDate": "Posted March 18, 2026.",
    "doi": "https://doi.org/10.64898/2026.03.16.26348538",
    "abstract": "Background Livestock production systems in peri-urban areas are associated with high levels of interaction between humans, animals, and the environment...",
    "fullText": "Background Livestock production systems in peri-urban areas are associated with high levels of...",
    "subjectAreas": ["Epidemiology"],
    "pdfUrl": "https://www.medrxiv.org/content/10.64898/2026.03.16.26348538v1.full.pdf",
    "licenseInformation": "The copyright holder for this preprint is the author/funder, who has granted medRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY 4.0 International license.",
    "fundingStatement": "This work was supported by...",
    "competingInterestStatement": "The authors have declared no competing interest.",
    "dataAvailability": "All sequencing data are available in the European Nucleotide Archive under accession PRJEB...",
    "scrapedTimestamp": "2026-05-18T01:30:00.000Z"
}
```

</details>

<details>
<summary>📦 Edge case: US government public-health preprint with a disclosed data repository URL and CC0 license</summary>

```json
{
    "url": "https://www.medrxiv.org/content/10.64898/2026.03.05.26345726v2",
    "title": "Population-scale tracking of community infections via wastewater monitoring",
    "authors": [
        "Author A",
        "Author B",
        "Author C"
    ],
    "publicationDate": "Posted March 7, 2026.",
    "doi": "https://doi.org/10.64898/2026.03.05.26345726",
    "abstract": "Wastewater monitoring enables non-invasive, population-scale tracking of community infections independent of clinical testing access...",
    "subjectAreas": ["Public and Global Health"],
    "licenseInformation": "This article is a US Government work. It is not subject to copyright under 17 USC 105 and is also made available for use under a CC0 license.",
    "dataAvailability": "Aggregated wastewater monitoring data are available at the project portal.",
    "dataCodeUrl": "https://data.wastewaterscan.org/about",
    "scrapedTimestamp": "2026-05-18T01:30:00.000Z"
}
```

</details>

<details>
<summary>📦 Sparse record: hospital infection-control preprint with no subject collection assigned and minimal disclosures</summary>

```json
{
    "url": "https://www.medrxiv.org/content/10.64898/2026.03.11.26348025v1",
    "title": "Colonization dynamics of vancomycin-resistant Enterococcus faecium in a hospital ICU cohort",
    "authors": [
        "Author X",
        "Author Y"
    ],
    "publicationDate": "Posted March 13, 2026.",
    "doi": "https://doi.org/10.64898/2026.03.11.26348025",
    "abstract": "Colonization of the gastrointestinal (GI) tract by vancomycin-resistant Enterococcus faecium (VREfm) precedes invasive infection in critically ill patients...",
    "subjectAreas": null,
    "licenseInformation": "It is made available under a CC-BY-NC-ND 4.0 International license.",
    "scrapedTimestamp": "2026-05-18T01:30:00.000Z"
}
```

</details>
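
The sample records show three things worth normalising before loading into a table: `publicationDate` arrives as display text (`Posted March 18, 2026.`), `doi` carries the `https://doi.org/` prefix, and `subjectAreas` can be `null`. The sketch below is one way to handle them in Python. The added keys `postedDate` and `doiBare` are hypothetical names chosen for illustration, and the date pattern is inferred only from the samples above.

```python
from datetime import datetime

DOI_PREFIX = "https://doi.org/"


def normalize_record(record):
    """Return a copy of a record with a machine-readable date, bare DOI, and list-typed subjects."""
    out = dict(record)

    # "Posted March 18, 2026." -> "2026-03-18".
    # %B relies on English month names, so this assumes an English/C locale.
    raw_date = record.get("publicationDate") or ""
    cleaned = raw_date.removeprefix("Posted ").rstrip(". ")
    try:
        out["postedDate"] = datetime.strptime(cleaned, "%B %d, %Y").date().isoformat()
    except ValueError:
        out["postedDate"] = None  # leave unparseable dates for manual review

    # "https://doi.org/10.64898/..." -> "10.64898/..."
    doi = record.get("doi") or ""
    out["doiBare"] = doi.removeprefix(DOI_PREFIX) or None

    # Sparse records may carry null instead of an empty list.
    out["subjectAreas"] = record.get("subjectAreas") or []
    return out
```

The original `publicationDate` and `doi` values are kept untouched, so nothing from the Actor's output is lost.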

***

### ✨ Why choose this Actor

| 🪄 | Capability |
|---|---|
| 🧬 | **Full health-science coverage.** All 50 medRxiv subject collections, from Addiction Medicine to Urology, in a single Actor. |
| 🔁 | **Filter or paste.** Drive the search from filter fields, or paste a pre-filtered medrxiv.org URL. |
| 🛡️ | **Cloudflare and Varnish handled.** Apify Residential US proxies with per-attempt session rotation absorb the rate limits that block direct fetches. |
| 📦 | **18 normalised fields.** Same shape across every collection, so an epidemiology preprint and a cardiology preprint drop into the same table. |
| 📅 | **Date and author filters.** Narrow by posting date range or author name without touching the UI. |
| 💸 | **Pay-per-event or flat.** Compatible with both Apify pricing models. |
| 🪵 | **Resilient.** Per-request session rotation plus short exponential backoff retries shrug off the transient 503s that medRxiv's edge throws under load. |

> 📊 medRxiv has hosted 210,000+ preprints across 50 medical specialties since 2019, with hundreds of new posts every week.

***

### 📈 How it compares to alternatives

| Approach | Cost | Coverage | Refresh | Filters | Setup |
|---|---|---|---|---|---|
| **⭐ medRxiv Preprint Scraper** *(this Actor)* | Pay only for runs | All 50 subject collections, 210,000+ preprints | On-demand | Keyword, collection, date range, author, sort | None |
| Official PubMed mirrors | Free | Indexed-after-peer-review only | Days to weeks behind preprints | MeSH terms, not preprint fields | API key, query syntax |
| Paid scholarly aggregators | Subscription | Mixed; preprints often partial | Daily | Vendor-specific | Sales contract |
| Manual browser saves | Time | Limited | Manual | Manual | Slow |
| Generic web-scraping tools | DIY | Brittle, breaks on edge updates | DIY | DIY | High |

Most teams already have a citation database. This Actor is the live-preprint layer that sits in front of it.

***

### 🚀 How to use

1. 🔐 **Sign up.** Create a free [Apify account](https://console.apify.com/sign-up?fpr=vmoqkp) (no credit card needed for the free tier).
2. 🔎 **Open the Actor.** Search the Apify Store for "medRxiv" or open this page.
3. 🧩 **Fill in input.** Either paste a medRxiv search URL into `startUrl`, or set `searchQuery`, `subjectCollection`, `dateFrom` / `dateTo`, and `author` as needed.
4. ▶️ **Click Start.** The Actor uses Apify Residential US proxies to load and parse search and detail pages.
5. 📥 **Export.** Download the dataset as JSON, CSV, or Excel, or push it directly to Google Sheets, BigQuery, a webhook, or S3.

> ⏱️ Total time: about 2 minutes from sign-up to first export.

***

### 💼 Business use cases

<table>
<tr>
<td width="50%">

#### 🏥 Clinical research teams

- Run weekly searches in a therapeutic area to surface new preprints
- Build evidence dashboards for internal review committees
- Track preprints by competing labs or trial sponsors
- Feed a regulatory-affairs intelligence pipeline

</td>
<td width="50%">

#### 🦠 Public-health and epidemiology

- Surveil outbreak literature (respiratory, vector-borne, AMR) in near real time
- Track vaccine effectiveness and uptake research as it posts
- Map study-design trends across regions and pathogens
- Brief ministries and public-health agencies on emerging evidence

</td>
</tr>
<tr>
<td width="50%">

#### 📊 Bibliometric and policy analysts

- Build datasets on topic emergence and growth curves
- Map author networks and institutional output across subject collections
- Compare preprint-to-publication conversion rates over time
- Quantify funding-statement patterns by topic and country

</td>
<td width="50%">

#### 📰 Science journalism and comms

- Spot newsworthy preprints in a beat within hours of posting
- Pull DOIs and abstracts directly into an editorial CMS
- Track corrections and version updates across an evidence story
- Build "what's new on medRxiv this week" digests

</td>
</tr>
</table>

***

### 🌟 Beyond business use cases

Data like this powers more than commercial workflows. The same structured records support research, education, civic projects, and personal initiatives.

<table>
<tr>
<td width="50%">

#### 🎓 Research and academia

- Empirical datasets for papers, thesis work, and coursework
- Longitudinal studies tracking changes across snapshots
- Reproducible research with cited, versioned data pulls
- Classroom exercises on data analysis and ethical scraping

</td>
<td width="50%">

#### 🎨 Personal and creative

- Side projects, portfolio demos, and indie app launches
- Data visualizations, dashboards, and infographics
- Content research for bloggers, YouTubers, and podcasters
- Hobbyist collections and personal trackers

</td>
</tr>
<tr>
<td width="50%">

#### 🤝 Non-profit and civic

- Transparency reporting and accountability projects
- Advocacy campaigns backed by public-interest data
- Community-run databases for local issues
- Investigative journalism on public records

</td>
<td width="50%">

#### 🧪 Experimentation

- Prototype AI and machine-learning pipelines with real data
- Validate product-market hypotheses before engineering spend
- Train small domain-specific models on niche corpora
- Test dashboard concepts with live input

</td>
</tr>
</table>

***

### 🔌 Automating medRxiv Preprint Scraper

Trigger this Actor from any code path or scheduler.

- 🟨 **Node.js / JavaScript:** use the [apify-client npm package](https://docs.apify.com/api/client/js/).
- 🐍 **Python:** use the [apify-client PyPI package](https://docs.apify.com/api/client/python/).
- 📘 **HTTP API:** see the [Apify API docs](https://docs.apify.com/api/v2) for direct REST calls.

Schedule it on Apify with [Schedules](https://docs.apify.com/platform/schedules) to refresh evidence pulls daily, weekly, or hourly. Pipe the dataset to Google Sheets, BigQuery, Snowflake, or your own webhook for downstream analytics.
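
For the HTTP API route, a direct call can be assembled with the Python standard library alone. This sketch builds the request for the `run-sync-get-dataset-items` endpoint listed in this page's OpenAPI specification, with the token passed as a query parameter as that specification describes. `build_run_request` is a hypothetical helper name; sending the request and handling timeouts are left to the caller.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/parseforge~medrxiv-scraper/run-sync-get-dataset-items"


def build_run_request(token, run_input):
    """Build a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}{ACTOR_PATH}?{urlencode({'token': token})}"
    body = json.dumps(run_input).encode("utf-8")
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To execute it, pass the result to `urllib.request.urlopen` and decode the JSON response. Because the synchronous endpoint waits for the run to finish, a small `maxItems` is a sensible starting point.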

***

### ❓ Frequently Asked Questions

<details>
<summary>💼 <b>Can I use this for commercial purposes?</b></summary>

Yes. The Actor returns publicly posted preprint metadata and full text. You are responsible for how you use it. medRxiv preprints are typically released under Creative Commons or CC0 licenses (each record includes the exact `licenseInformation` string). Comply with medRxiv's terms of service and the per-preprint license terms.

</details>

<details>
<summary>💳 <b>Do I need a paid Apify plan?</b></summary>

No. You can run the Actor on the free tier with a 10-preprint preview cap per run. Upgrade to any paid plan to unlock the full 1,000,000 cap per run.

</details>

<details>
<summary>🛑 <b>What happens if a run fails?</b></summary>

The Actor pushes a single `{"error": "..."}` record describing the failure. Re-run with the same input; transient blocks usually resolve on the next attempt thanks to per-request residential proxy rotation.

</details>
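
Since a failed run surfaces as an `{"error": "..."}` item rather than an exception, downstream code may want to separate such items from real preprints before loading them. A minimal sketch, assuming an error item carries an `error` key and no `url` key (the exact shape beyond `error` is not documented here):

```python
def split_results(items):
    """Separate preprint records from failure records in a dataset listing."""
    records, errors = [], []
    for item in items:
        # Assumption: a failure record has "error" and lacks the "url" every preprint has.
        if isinstance(item, dict) and "error" in item and "url" not in item:
            errors.append(item["error"])
        else:
            records.append(item)
    return records, errors
```

If `errors` is non-empty and `records` is empty, re-running with the same input is the documented next step.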

<details>
<summary>⚖️ <b>Is this legal?</b></summary>

The Actor reads public preprint pages the same way a browser visitor does. Most jurisdictions treat public-data scraping as lawful. You should still review medRxiv's terms and the per-preprint license before commercial or republishing use.

</details>

<details>
<summary>🚀 <b>How fast is each run?</b></summary>

A 10-preprint run typically completes in 90 to 180 seconds. Larger runs scale roughly linearly; detail-page fetches are parallelised at concurrency 30.

</details>

<details>
<summary>🗂️ <b>Which subject collections are supported?</b></summary>

All 50 medRxiv collections, exposed as an enum on `subjectCollection`. Examples: Epidemiology, Cardiovascular Medicine, Infectious Diseases, HIV/AIDS, Oncology, Neurology, Public and Global Health, Psychiatry and Clinical Psychology, and 42 more.

</details>

<details>
<summary>🏷️ <b>How fresh is the data?</b></summary>

medRxiv posts new preprints throughout the week. Each run reads live pages, so re-running tomorrow returns whatever the site shows tomorrow.

</details>

<details>
<summary>📅 <b>Can I filter by posting date?</b></summary>

Yes. Set `dateFrom` and `dateTo` in YYYY-MM-DD format. The Actor maps these to medRxiv's `limit_from` and `limit_to` filters.

</details>
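
For recurring pulls, the two date fields can be computed rather than typed by hand. The sketch below produces a rolling window ending today; `rolling_window` is a hypothetical helper, and whether medRxiv treats the bounds as inclusive is not stated here, so overlapping consecutive windows by a day and de-duplicating on `doi` is the cautious choice.

```python
from datetime import date, timedelta


def rolling_window(days, today=None):
    """Return dateFrom/dateTo covering the last `days` days, formatted as YYYY-MM-DD."""
    if days < 0:
        raise ValueError("days must be non-negative")
    end = today or date.today()
    start = end - timedelta(days=days)
    return {"dateFrom": start.isoformat(), "dateTo": end.isoformat()}
```

The returned dict can be merged into the Actor input, for example `{"searchQuery": "antimicrobial resistance", **rolling_window(7)}`.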

<details>
<summary>👤 <b>Can I filter by author?</b></summary>

Yes. Set `author` to a name; the Actor passes it as `author1` in the medRxiv search query.

</details>

<details>
<summary>📰 <b>Does each record include the full paper text?</b></summary>

Yes, when medRxiv exposes a full-text HTML view. The `fullText` field holds the article body; `pdfUrl` is the canonical PDF link for cases where you need the formatted version.

</details>

<details>
<summary>🌐 <b>Do I need to bring my own proxies?</b></summary>

No. The Actor defaults to Apify Residential US proxies, which are required to bypass medRxiv's edge rate limits. You can override with your own proxy URL via the standard Apify proxy environment.

</details>

<details>
<summary>🧪 <b>Can I do a dry run before committing to a long pull?</b></summary>

Yes. Set `maxItems` to 5 or 10 for a quick smoke test. Once the shape looks right, raise the cap and re-run.

</details>
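
One way to judge whether "the shape looks right" after a smoke test is to count how often the fields you depend on come back empty. This is a sketch under assumptions: `EXPECTED_FIELDS` is an illustrative selection from the output schema, not a list the Actor guarantees to populate.

```python
# Illustrative choice of fields a downstream pipeline might require.
EXPECTED_FIELDS = ("url", "title", "authors", "doi", "abstract", "scrapedTimestamp")


def missing_field_report(items, expected=EXPECTED_FIELDS):
    """Map each expected field to the number of items where it is absent or empty."""
    report = {}
    for field in expected:
        missing = sum(1 for item in items if not item.get(field))
        if missing:
            report[field] = missing
    return report
```

An empty report on a 5 or 10 item run suggests the larger pull will fit the same table; a non-empty one shows which columns need a fallback first.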

***

### 🔌 Integrate with any app

Pipe medRxiv data into the tools you already use.

- [**Google Sheets**](https://docs.apify.com/platform/integrations/google-sheets) - drop preprint records into a live sheet for analysts.
- [**BigQuery**](https://docs.apify.com/platform/integrations/bigquery) - warehouse evidence pulls for SQL analytics.
- [**Webhooks**](https://docs.apify.com/platform/integrations/webhooks) - notify Slack, Zapier, or your own service when a run completes.
- [**Airbyte**](https://docs.apify.com/platform/integrations/airbyte) - sync to Snowflake, Postgres, or any Airbyte destination.
- [**Zapier**](https://docs.apify.com/platform/integrations/zapier) - trigger downstream automations on every new run.
- [**Make**](https://docs.apify.com/platform/integrations/make) - build no-code workflows around scraped preprints.

***

### 🔗 Recommended Actors

- [**🧪 bioRxiv & medRxiv Preprint Scraper**](https://apify.com/parseforge/biorxiv-scraper) - sister Actor covering bioRxiv and the broader preprint network.
- [**📚 PubMed Citation Scraper**](https://apify.com/parseforge/pubmed-citation-scraper) - peer-reviewed biomedical citations from PubMed.
- [**🌐 OpenAlex Scholarly Works Scraper**](https://apify.com/parseforge/openalex-scraper) - cross-disciplinary scholarly metadata at scale.
- [**📐 arXiv Preprint Scraper**](https://apify.com/parseforge/arxiv-scraper) - preprint coverage for physics, math, CS, and quantitative biology.
- [**🧠 Semantic Scholar Scraper**](https://apify.com/parseforge/semantic-scholar-scraper) - citation graph and paper metadata across academic disciplines.

> 💡 **Pro Tip:** browse the complete [ParseForge collection](https://apify.com/parseforge) for more research, scholarly, and health-data scrapers.

***

**🆘 Need Help?** [**Open our contact form**](https://tally.so/r/BzdKgA) and we will reply within one business day.

***

> ⚠️ **Disclaimer:** This Actor extracts publicly accessible preprint data from medRxiv.org for legitimate research, evidence-surveillance, and analytics purposes. It does not collect personal user data beyond what authors and institutions publicly publish on their preprints. Use of this Actor is at your own risk and subject to medRxiv.org's terms of service and the per-preprint Creative Commons or CC0 license. ParseForge is not affiliated with, endorsed by, or sponsored by medRxiv, Cold Spring Harbor Laboratory, BMJ, or Yale University.

# Actor input Schema

## `startUrl` (type: `string`):

medRxiv.org search URL to start scraping from. Use this for custom searches or specific URLs. Cannot be used together with searchQuery. Example: https://www.medrxiv.org/search/asd or https://www.medrxiv.org/search/asd?page=1

## `maxItems` (type: `integer`):

Free users: Limited to 10 items (preview). Paid users: Optional, max 1,000,000

## `searchQuery` (type: `string`):

Search term to query medRxiv articles. Cannot be used together with startUrl. The scraper will search for articles matching this query.

## `subjectCollection` (type: `string`):

Restrict results to articles posted in a specific medRxiv subject collection. Maps to the source's subject\_collection\_code filter.

## `dateFrom` (type: `string`):

Lower bound for the posting date, in YYYY-MM-DD format. Maps to the source's limit\_from filter.

## `dateTo` (type: `string`):

Upper bound for the posting date, in YYYY-MM-DD format. Maps to the source's limit\_to filter.

## `author` (type: `string`):

Restrict results to articles where this name appears in the author list. Maps to the source's author1 filter.

## `orderBy` (type: `string`):

How to sort the search results. 'relevance' = Best match, 'oldest' = Oldest First, 'newest' = Newest First

## Actor input object example

```json
{
  "startUrl": "https://www.medrxiv.org/search/asd",
  "maxItems": 10,
  "searchQuery": "bacterial infection",
  "dateFrom": "2025-01-01",
  "dateTo": "2025-12-31",
  "author": "Smith",
  "orderBy": "relevance"
}
```

# Actor output Schema

## `articles` (type: `string`):

Complete dataset with all scraped medRxiv articles including full details

## `overview` (type: `string`):

Overview view of articles with key fields displayed in a table format

## `details` (type: `string`):

Complete article details view with all fields including full text and declarations

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "maxItems": 10,
    "searchQuery": "bacterial infection",
    "orderBy": "relevance"
};

// Run the Actor and wait for it to finish
const run = await client.actor("parseforge/medrxiv-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "maxItems": 10,
    "searchQuery": "bacterial infection",
    "orderBy": "relevance",
}

# Run the Actor and wait for it to finish
run = client.actor("parseforge/medrxiv-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```
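
The items returned by the example above are plain dicts, so they can be written to CSV with the standard library. A minimal sketch: `items_to_csv` and the default `COLUMNS` selection are hypothetical, list-valued fields such as `authors` are joined with `"; "`, and missing or `null` values become empty cells.

```python
import csv
import io

# Illustrative column selection taken from the output schema.
COLUMNS = ("url", "title", "authors", "publicationDate", "doi", "subjectAreas", "pdfUrl")


def items_to_csv(items, columns=COLUMNS):
    """Serialise dataset items to CSV text with one row per preprint."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    for item in items:
        row = []
        for column in columns:
            value = item.get(column)
            if isinstance(value, list):
                value = "; ".join(str(v) for v in value)  # flatten string lists
            row.append("" if value is None else value)
        writer.writerow(row)
    return buffer.getvalue()
```

For example, `items_to_csv(list(client.dataset(run["defaultDatasetId"]).iterate_items()))` returns text that can be written to a `.csv` file. Nested fields such as `authorDetails` are left out of the default columns because they do not flatten cleanly.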

## CLI example

```bash
echo '{
  "maxItems": 10,
  "searchQuery": "bacterial infection",
  "orderBy": "relevance"
}' |
apify call parseforge/medrxiv-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=parseforge/medrxiv-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "medRxiv Scraper",
        "description": "Extract comprehensive preprint data from medRxiv, including titles, authors, abstracts, full text, DOIs, citations, and metadata. Automate access to health-science preprints with structured outputs, ideal for researchers and analysts who need reliable, large-scale article data without manual work.",
        "version": "1.1",
        "x-build-id": "jdmpBGWGx2njPM55K"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/parseforge~medrxiv-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-parseforge-medrxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/parseforge~medrxiv-scraper/runs": {
            "post": {
                "operationId": "runs-sync-parseforge-medrxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in the response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/parseforge~medrxiv-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-parseforge-medrxiv-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns the OUTPUT record from the key-value store in the response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrl": {
                        "title": "Start URL",
                        "type": "string",
                        "description": "medRxiv.org search URL to start scraping from. Use this for custom searches or specific URLs. Cannot be used together with searchQuery. Example: https://www.medrxiv.org/search/asd or https://www.medrxiv.org/search/asd?page=1"
                    },
                    "maxItems": {
                        "title": "Max Items",
                        "minimum": 1,
                        "maximum": 1000000,
                        "type": "integer",
                        "description": "Maximum number of items to scrape. Free users are limited to 10 items (preview). For paid users this field is optional, with a maximum of 1,000,000."
                    },
                    "searchQuery": {
                        "title": "Search Query",
                        "type": "string",
                        "description": "Search term to query medRxiv articles. Cannot be used together with startUrl. The scraper will search for articles matching this query."
                    },
                    "subjectCollection": {
                        "title": "Subject Collection",
                        "enum": [
                            "",
                            "addiction-medicine",
                            "allergy-and-immunology",
                            "anesthesia",
                            "cardiovascular-medicine",
                            "dentistry-and-oral-medicine",
                            "dermatology",
                            "emergency-medicine",
                            "endocrinology",
                            "epidemiology",
                            "forensic-medicine",
                            "gastroenterology",
                            "genetic-and-genomic-medicine",
                            "geriatric-medicine",
                            "health-economics",
                            "health-informatics",
                            "health-policy",
                            "health-systems-and-quality-improvement",
                            "hematology",
                            "hiv-aids",
                            "infectious-diseases",
                            "intensive-care-and-critical-care-medicine",
                            "medical-education",
                            "medical-ethics",
                            "nephrology",
                            "neurology",
                            "nursing",
                            "nutrition",
                            "obstetrics-and-gynecology",
                            "occupational-and-environmental-health",
                            "oncology",
                            "ophthalmology",
                            "orthopedics",
                            "otolaryngology",
                            "pain-medicine",
                            "palliative-medicine",
                            "pathology",
                            "pediatrics",
                            "pharmacology-and-therapeutics",
                            "primary-care-research",
                            "psychiatry-and-clinical-psychology",
                            "public-and-global-health",
                            "radiology-and-imaging",
                            "rehabilitation-medicine-and-physical-therapy",
                            "respiratory-medicine",
                            "rheumatology",
                            "sexual-and-reproductive-health",
                            "sports-medicine",
                            "surgery",
                            "toxicology",
                            "transplantation",
                            "urology"
                        ],
                        "type": "string",
                        "description": "Restrict results to articles posted in a specific medRxiv subject collection. Maps to the source's subject_collection_code filter."
                    },
                    "dateFrom": {
                        "title": "Posted From",
                        "type": "string",
                        "description": "Lower bound for the posting date, in YYYY-MM-DD format. Maps to the source's limit_from filter."
                    },
                    "dateTo": {
                        "title": "Posted To",
                        "type": "string",
                        "description": "Upper bound for the posting date, in YYYY-MM-DD format. Maps to the source's limit_to filter."
                    },
                    "author": {
                        "title": "Author",
                        "type": "string",
                        "description": "Restrict results to articles where this name appears in the author list. Maps to the source's author1 filter."
                    },
                    "orderBy": {
                        "title": "Sort Order",
                        "enum": [
                            "relevance",
                            "oldest",
                            "newest"
                        ],
                        "type": "string",
                        "description": "How to sort the search results. 'relevance' = Best match, 'oldest' = Oldest First, 'newest' = Newest First"
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
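
The input schema in the specification states several constraints that are easy to check before a run is started: `startUrl` and `searchQuery` cannot be combined, `maxItems` must lie between 1 and 1,000,000, `orderBy` accepts three values, and the date bounds use the `YYYY-MM-DD` format. The sketch below checks those constraints and prepares (but does not send) a POST request for the `run-sync-get-dataset-items` path, passing the token as a query parameter as the spec describes. The helper names `validate_input` and `build_request` are hypothetical, the `subjectCollection` enum is not checked, and the Actor may apply further validation of its own.

```python
import json
import re
from datetime import datetime
from urllib.parse import urlencode
from urllib.request import Request

API_BASE = "https://api.apify.com/v2"  # from the spec's "servers" entry
RUN_PATH = "/acts/parseforge~medrxiv-scraper/run-sync-get-dataset-items"
ORDER_BY_VALUES = {"relevance", "oldest", "newest"}
DATE_PATTERN = re.compile(r"[0-9]{4}-[0-9]{2}-[0-9]{2}")


def validate_input(payload: dict) -> list:
    """Return a list of problems found in the payload (empty if none)."""
    problems = []
    if payload.get("startUrl") and payload.get("searchQuery"):
        problems.append("startUrl and searchQuery cannot be used together")
    max_items = payload.get("maxItems")
    if max_items is not None:
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(max_items, bool) or not isinstance(max_items, int):
            problems.append("maxItems must be an integer")
        elif not 1 <= max_items <= 1_000_000:
            problems.append("maxItems must be between 1 and 1000000")
    order_by = payload.get("orderBy")
    if order_by is not None and order_by not in ORDER_BY_VALUES:
        problems.append("orderBy must be one of: newest, oldest, relevance")
    for key in ("dateFrom", "dateTo"):
        value = payload.get(key)
        if value is None:
            continue
        if not isinstance(value, str) or not DATE_PATTERN.fullmatch(value):
            problems.append(f"{key} must use YYYY-MM-DD format")
            continue
        try:
            datetime.strptime(value, "%Y-%m-%d")
        except ValueError:
            problems.append(f"{key} is not a valid calendar date")
    return problems


def build_request(payload: dict, token: str) -> Request:
    """Validate the payload and return an unsent POST request object."""
    problems = validate_input(payload)
    if problems:
        raise ValueError("; ".join(problems))
    url = f"{API_BASE}{RUN_PATH}?{urlencode({'token': token})}"
    return Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
```

To send the request, pass the returned object to `urllib.request.urlopen` and parse the response body as JSON. Because the synchronous endpoint waits for the run to finish, a generous timeout is advisable for large `maxItems` values.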
