# Crossref Scraper — DOI Metadata for Academic Papers (`openclawmara/crossref-scraper`) Actor

Scrape Crossref — largest DOI registry for academic literature. Modes: search works, DOI lookup, journal metadata, funder info, affiliation search. Extracts titles, authors, DOIs, ISSN, references, citations. Official REST API, no auth, 50 req/sec. For research & citation analysis.

- **URL**: https://apify.com/openclawmara/crossref-scraper.md
- **Developed by:** [OpenClaw Mara](https://apify.com/openclawmara) (community)
- **Categories:** AI, Developer tools, Other
- **Stats:** 3 total users, 0 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

$5.00 / 1,000 papers scraped

This Actor is paid per event. You are not charged for the Apify platform usage, but only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 🔗 Crossref Scraper — DOI Metadata for 150M+ Scholarly Works

**Structured metadata, citation counts, and author affiliations from the world's largest DOI registry. $0.004 per article.**

Scrape [Crossref](https://www.crossref.org) — the DOI registration agency for scholarly publishers — for titles, authors, DOIs, publication dates, journal info, citation counts, references, and funder data across 150M+ articles, books, datasets, and preprints.

Uses the official Crossref REST API. No auth required. Free for non-commercial use under Crossref's public data policy.

### 🚀 What does this Actor do?

Crossref is the backbone of academic citation infrastructure — almost every journal article, conference paper, book chapter, and dataset has a Crossref DOI. This Actor turns Crossref into a programmable source in two modes:

- **Search** — Full-text search by keyword across 150M+ works, with year range, citation count, and sort filters.
- **DOI lookup** — Bulk-fetch metadata for a list of DOIs you already have (bibliographies, reference lists, dataset citations).

Unlike Google Scholar (no API) or Semantic Scholar (focused on citations), Crossref is the **authoritative publisher metadata** — it's where the citation count originates, where the DOI resolves, and where the funder data lives.

### 💡 Use Cases

#### 1. Systematic literature review automation
Pull every paper on a topic within a year range and feed the metadata into a structured review table.

```json
{
  "searchQueries": ["retrieval augmented generation"],
  "maxResults": 500,
  "fromYear": 2022,
  "toYear": 2026,
  "sortBy": "published"
}
````

#### 2. Citation graph building

You have a list of DOIs (from a bibliography, a dataset, a search result). Pull full metadata + reference lists.

```json
{
  "dois": [
    "10.1038/s41586-023-06747-5",
    "10.1126/science.aap8731",
    "10.48550/arXiv.1706.03762"
  ]
}
```
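
DOIs copied from bibliographies often arrive as resolver URLs, with a `doi:` prefix, or duplicated in a different letter case. The sketch below shows one way to clean such a list before passing it as `dois`. The helper name `normalize_dois` and the loose shape check are illustrative assumptions, not part of the Actor; whether the Actor normalizes DOIs itself is not documented here.

```python
import re

# Prefixes people commonly paste in front of a DOI.
_DOI_PREFIXES = (
    "https://doi.org/",
    "http://doi.org/",
    "https://dx.doi.org/",
    "http://dx.doi.org/",
    "doi:",
)
# Loose shape check (an assumption): "10.", 4-9 digits, a slash, then a suffix.
_DOI_SHAPE = re.compile(r"^10\.\d{4,9}/\S+$")


def normalize_dois(raw_values):
    """Return cleaned, de-duplicated DOIs in first-seen order."""
    seen = set()
    cleaned = []
    for raw in raw_values:
        doi = raw.strip()
        lowered = doi.lower()
        for prefix in _DOI_PREFIXES:
            if lowered.startswith(prefix):
                doi = doi[len(prefix):].strip()
                break
        key = doi.lower()  # DOIs are case-insensitive, so compare lowercased
        if _DOI_SHAPE.match(doi) and key not in seen:
            seen.add(key)
            cleaned.append(doi)
    return cleaned


actor_input = {
    "dois": normalize_dois([
        "https://doi.org/10.1038/s41586-023-06747-5",
        "doi:10.1126/science.aap8731",
        "10.1038/S41586-023-06747-5",  # same DOI as the first, different case
        "not a doi",                   # dropped by the shape check
    ])
}
print(actor_input)
```

De-duplicating up front also avoids paying twice for the same record under per-article pricing.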

#### 3. Research analytics dashboard

Track publication volume and citation trends for a field, author, or publisher over time.

```json
{
  "searchQueries": ["large language models"],
  "maxResults": 1000,
  "fromYear": 2020,
  "sortBy": "is-referenced-by-count",
  "minCitations": 10
}
```
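
Once the dataset is downloaded, publication volume and citation totals per year can be derived from the `publishedYear` and `isReferencedByCount` fields. The following is a minimal sketch that assumes `items` is a list of output records shaped like the output example in this README; `yearly_summary` is a hypothetical helper name.

```python
from collections import defaultdict


def yearly_summary(items):
    """Group records by publication year and summarize citation counts."""
    buckets = defaultdict(list)
    for item in items:
        year = item.get("publishedYear")
        if year is None:
            continue  # skip records without a usable year
        buckets[year].append(item.get("isReferencedByCount") or 0)
    return [
        {
            "year": year,
            "papers": len(counts),
            "citations": sum(counts),
            "meanCitations": sum(counts) / len(counts),
        }
        for year, counts in sorted(buckets.items())
    ]


sample = [
    {"publishedYear": 2020, "isReferencedByCount": 10},
    {"publishedYear": 2020, "isReferencedByCount": 30},
    {"publishedYear": 2021, "isReferencedByCount": 5},
    {"title": "record without a year"},
]
for row in yearly_summary(sample):
    print(row)
```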

#### 4. Publisher / journal monitoring

Feed a Slack channel new publications in a domain, weekly.

```json
{
  "searchQueries": ["CRISPR gene editing"],
  "maxResults": 100,
  "fromYear": 2026,
  "sortBy": "published"
}
```
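
For a weekly digest, the dataset items need to be turned into a short message first. This sketch only builds the text and leaves delivery (Slack webhook, email, and so on) to whatever integration you use. `format_digest` is a hypothetical helper, and the field names assume records shaped like the output example in this README.

```python
def format_digest(items, topic, limit=5):
    """Build a plain-text digest listing up to `limit` papers."""
    lines = [f"New papers on {topic}: {len(items)} found"]
    for item in items[:limit]:
        title = item.get("title") or "(untitled)"
        venue = item.get("containerTitle") or "unknown venue"
        url = item.get("url") or ""
        # rstrip() removes the trailing space when no URL is present
        lines.append(f"- {title} ({venue}) {url}".rstrip())
    return "\n".join(lines)


digest_items = [
    {"title": "Paper A", "containerTitle": "Nature", "url": "https://doi.org/10.1/a"},
    {"title": "Paper B"},
]
print(format_digest(digest_items, "CRISPR gene editing"))
```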

### 📊 Output Example

```json
{
  "doi": "10.1038/s41586-023-06747-5",
  "title": "A foundation model of transcription across human cell types",
  "authors": [
    { "given": "Xi", "family": "Fu", "affiliation": ["Columbia University"] },
    { "given": "Shentong", "family": "Mo", "affiliation": ["Carnegie Mellon University"] }
  ],
  "publishedDate": "2025-01-15",
  "publishedYear": 2025,
  "containerTitle": "Nature",
  "publisher": "Springer Science and Business Media LLC",
  "type": "journal-article",
  "volume": "637",
  "issue": "8047",
  "page": "965-973",
  "isReferencedByCount": 142,
  "referencesCount": 67,
  "url": "https://doi.org/10.1038/s41586-023-06747-5",
  "abstract": "The human genome contains...",
  "subject": ["Multidisciplinary"],
  "funder": [
    { "name": "National Institutes of Health", "award": ["R01HG012875"] }
  ],
  "issn": ["0028-0836", "1476-4687"],
  "license": [{ "URL": "https://creativecommons.org/licenses/by/4.0/" }]
}
```
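
As an illustration of how these fields fit together, the sketch below turns one record into a single-line reference string. The citation layout is an arbitrary choice for demonstration, not a formal style such as APA, and `format_citation` is a hypothetical helper.

```python
def format_citation(item):
    """Render one output record as a compact reference string."""
    names = []
    for author in item.get("authors") or []:
        family = author.get("family", "")
        given = author.get("given", "")
        names.append(", ".join(part for part in (family, given) if part))
    if len(names) > 2:
        author_text = names[0] + " et al."
    else:
        author_text = " and ".join(names) or "Unknown author"

    # Build "Journal volume(issue), pages" from whichever parts are present.
    location = item.get("containerTitle", "")
    if item.get("volume"):
        location += f" {item['volume']}"
    if item.get("issue"):
        location += f"({item['issue']})"
    if item.get("page"):
        location += f", {item['page']}"

    year = item.get("publishedYear", "n.d.")
    title = item.get("title", "Untitled")
    return f"{author_text} ({year}). {title}. {location}. {item.get('url', '')}".strip()


record = {
    "title": "A foundation model of transcription across human cell types",
    "authors": [
        {"given": "Xi", "family": "Fu"},
        {"given": "Shentong", "family": "Mo"},
    ],
    "publishedYear": 2025,
    "containerTitle": "Nature",
    "volume": "637",
    "issue": "8047",
    "page": "965-973",
    "url": "https://doi.org/10.1038/s41586-023-06747-5",
}
print(format_citation(record))
```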

### ⚙️ Input Parameters

| Parameter | Type | Description |
|-----------|------|-------------|
| `searchQueries` | array | Keywords/phrases (e.g. `["CRISPR gene editing", "large language models"]`) |
| `dois` | array | Specific DOIs to fetch metadata for (e.g. `["10.1038/s41586-023-06747-5"]`) |
| `maxResults` | int | Results per query (default 50, max 1000) |
| `sortBy` | enum | `relevance` (default), `published`, `is-referenced-by-count` |
| `fromYear` | int | Only articles published from this year |
| `toYear` | int | Only articles published up to this year |
| `minCitations` | int | Only include articles with ≥ N citations (default 0) |
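
Validating values locally can catch mistakes before a run starts. The sketch below checks the parameters against the limits given in this document (the `maxResults` range, the `sortBy` options, and the 1900 to 2030 year bounds from the input schema in the OpenAPI specification). `build_input` is a hypothetical helper; the requirement that at least one query or DOI is present and the `fromYear <= toYear` check are assumptions, not documented rules.

```python
ALLOWED_SORTS = ("relevance", "published", "is-referenced-by-count")


def build_input(search_queries=None, dois=None, max_results=50,
                sort_by="relevance", from_year=None, to_year=None,
                min_citations=0):
    """Validate values against the documented limits and return the input dict."""
    if not search_queries and not dois:
        # Assumption: a run with neither queries nor DOIs has nothing to fetch.
        raise ValueError("provide searchQueries, dois, or both")
    if not 1 <= max_results <= 1000:
        raise ValueError("maxResults must be between 1 and 1000")
    if sort_by not in ALLOWED_SORTS:
        raise ValueError(f"sortBy must be one of {ALLOWED_SORTS}")
    if min_citations < 0:
        raise ValueError("minCitations must be >= 0")
    for name, year in (("fromYear", from_year), ("toYear", to_year)):
        if year is not None and not 1900 <= year <= 2030:
            raise ValueError(f"{name} must be between 1900 and 2030")
    if from_year is not None and to_year is not None and from_year > to_year:
        # Assumption: an inverted range is a user error.
        raise ValueError("fromYear must not be later than toYear")

    actor_input = {
        "maxResults": max_results,
        "sortBy": sort_by,
        "minCitations": min_citations,
    }
    if search_queries:
        actor_input["searchQueries"] = list(search_queries)
    if dois:
        actor_input["dois"] = list(dois)
    if from_year is not None:
        actor_input["fromYear"] = from_year
    if to_year is not None:
        actor_input["toYear"] = to_year
    return actor_input


print(build_input(search_queries=["retrieval augmented generation"],
                  max_results=500, sort_by="published",
                  from_year=2022, to_year=2026))
```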

### 📤 Output Fields

| Field | Description |
|-------|-------------|
| `doi` | Digital Object Identifier (canonical) |
| `title` | Article title |
| `authors[]` | `{ given, family, affiliation[] }` per author |
| `publishedDate`, `publishedYear` | ISO date + year |
| `containerTitle` | Journal / conference / book name |
| `publisher` | Publisher name |
| `type` | `journal-article`, `book-chapter`, `proceedings-article`, `dataset`, `preprint`, etc. |
| `volume`, `issue`, `page` | Bibliographic location |
| `isReferencedByCount` | Citation count (from Crossref) |
| `referencesCount` | Number of references in the article |
| `url` | DOI resolver URL |
| `abstract` | When available (~30% of articles) |
| `subject[]` | Subject categories |
| `funder[]` | Funder + award/grant IDs |
| `issn[]` | Journal ISSNs |
| `license[]` | Open-access license info when present |

### 💰 Pricing & Performance

- **Pay-per-event:** **$0.004 per article**.
- **Typical cost:** $4 for 1000 articles — a full systematic review for the price of a coffee.
- **Speed:** ~100–150 articles/minute via the official Crossref REST API.
- **No auth required** — Crossref API is fully open (with polite-pool pacing for sustained usage).
- **Bulk-friendly** — up to 1000 results per search query.
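
The figures above translate into a quick back-of-the-envelope estimate. The sketch below uses the per-article price and throughput range quoted in this section as defaults; both are parameters, so substitute the price shown on the Actor's pricing page if it differs. `estimate_run` is a hypothetical helper.

```python
def estimate_run(articles, price_per_article=0.004, articles_per_minute=(100, 150)):
    """Estimate cost in USD and a run-time range in minutes for a given volume."""
    slow, fast = articles_per_minute
    return {
        "costUsd": round(articles * price_per_article, 2),
        "minMinutes": articles / fast,  # best case, at the fast end of the range
        "maxMinutes": articles / slow,  # worst case, at the slow end of the range
    }


estimate = estimate_run(1000)
print(estimate)
```

For 1000 articles this gives a cost of $4 and a run time of roughly 7 to 10 minutes.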

### 🔌 Integrations

- **Vector DBs (Pinecone, Weaviate, Qdrant, pgvector)** — embed titles + abstracts for semantic scholarly search.
- **LangChain / LlamaIndex** — power a "research assistant" RAG over a year-range corpus.
- **Neo4j / graph DBs** — build a citation network: `author → paper → references → paper`.
- **Zapier / Make / n8n** — weekly "new papers in my field" digest to Slack, email, or Notion.
- **Systematic review tools (Covidence, Rayyan)** — bulk-import metadata from a search.
- **Airbyte / Fivetran** — load structured metadata into a data warehouse for bibliometrics.

### 🧭 DOI Prefix Reference

- `10.1038` — Nature Publishing Group
- `10.1126` — AAAS (Science)
- `10.1145` — ACM
- `10.1109` — IEEE
- `10.48550` — arXiv preprints (arXiv DOIs are registered through DataCite rather than Crossref, so a Crossref lookup may return no metadata for them)
- `10.1101` — bioRxiv / medRxiv
- `10.21105` — Journal of Open Source Software
- Full prefix list: https://www.crossref.org/getting-started/prefix/
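
The prefix is everything before the first slash of a DOI, so grouping results by registrant takes only a small lookup. The sketch below uses the prefixes listed above; `registrant_for` is a hypothetical helper, and the table is deliberately incomplete.

```python
# Prefix table copied from the list above; extend it as needed.
KNOWN_PREFIXES = {
    "10.1038": "Nature Publishing Group",
    "10.1126": "AAAS (Science)",
    "10.1145": "ACM",
    "10.1109": "IEEE",
    "10.48550": "arXiv preprints",
    "10.1101": "bioRxiv / medRxiv",
    "10.21105": "Journal of Open Source Software",
}


def registrant_for(doi):
    """Look up the registrant for a bare DOI, or return 'unknown'."""
    prefix = doi.strip().split("/", 1)[0]
    return KNOWN_PREFIXES.get(prefix, "unknown")


print(registrant_for("10.1126/science.aap8731"))
print(registrant_for("10.9999/example"))
```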

### ❓ FAQ

**Why use Crossref over Google Scholar?**
Google Scholar has no official API and aggressively blocks scrapers. Crossref has a free, well-documented REST API that returns canonical metadata — the same data Google Scholar uses under the hood.

**Crossref vs Semantic Scholar?**
Crossref = **authoritative publisher metadata** (title, authors, DOI, journal, date, basic citation count). Semantic Scholar = **enriched citation graph + influential-citation signal**. Use both for a complete picture — this Actor's `dois[]` mode pairs perfectly with the Semantic Scholar scraper.

**Does every article have an abstract?**
About 30% do. Crossref's abstract coverage depends on the publisher's metadata submission. For guaranteed abstracts, follow up with arXiv / PubMed / Semantic Scholar.
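
To plan such a follow-up, it helps to collect the DOIs of records that came back without an abstract. A minimal sketch, assuming `items` are output records shaped like the output example in this README and that a missing abstract shows up as an absent, null, or empty `abstract` field (an assumption about the output, not a documented guarantee):

```python
def dois_missing_abstract(items):
    """Return DOIs of records whose abstract is absent, null, or blank."""
    return [
        item["doi"]
        for item in items
        if item.get("doi") and not (item.get("abstract") or "").strip()
    ]


results = [
    {"doi": "10.1/a", "abstract": "The human genome contains..."},
    {"doi": "10.1/b", "abstract": ""},
    {"doi": "10.1/c"},
    {"doi": "10.1/d", "abstract": None},
]
print(dois_missing_abstract(results))
```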

**Can I pull references (who this paper cites)?**
Partially — `referencesCount` is always returned; the full reference list is included when the publisher deposits it with Crossref (roughly half of articles). For guaranteed reference lists, use OpenCitations or the publisher's native API.

**What's the difference between `published` sort and `fromYear` filter?**
`sortBy: published` orders results newest-first. `fromYear`/`toYear` restricts the year range. Use them together for "newest papers in the last 3 years."
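
A rolling window like that can be computed from the current date instead of hard-coding years. In this sketch, `recent_papers_input` is a hypothetical helper, and "last 3 years" is interpreted as the current year minus three up to the current year; adjust the arithmetic if you count the window differently.

```python
from datetime import date


def recent_papers_input(query, years_back=3, today=None, max_results=100):
    """Build an input for the newest papers from the last `years_back` years."""
    current_year = (today or date.today()).year
    return {
        "searchQueries": [query],
        "maxResults": max_results,
        "sortBy": "published",                  # newest first
        "fromYear": current_year - years_back,  # lower bound of the window
        "toYear": current_year,                 # upper bound of the window
    }


print(recent_papers_input("CRISPR gene editing", today=date(2026, 3, 1)))
```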

**Rate limits?**
Crossref asks clients to identify themselves for sustained high-volume usage (the "polite pool"). The Actor handles this automatically: you just set `maxResults` and wait.

### 🔗 Companions

- [arXiv Paper Scraper](https://apify.com/Helpermara/arxiv-paper-scraper) — Preprints before they're officially published.
- [Semantic Scholar Scraper](https://apify.com/Helpermara/semantic-scholar-scraper) — Citation graphs and influence metrics.
- [ORCID Scraper](https://apify.com/Helpermara/orcid-scraper) — Author profiles and disambiguated publication histories.
- [DBLP Scraper](https://apify.com/Helpermara/dblp-scraper) — Computer science bibliography with clean author data.

### 🔑 Keywords

Crossref scraper, Crossref API, DOI lookup, DOI metadata, scholarly articles API, academic paper metadata, citation count scraper, bibliometrics data, systematic review automation, literature review scraper, journal article scraper, publisher metadata, publication tracking, citation graph, research analytics, scientific publication data, open science metadata, Crossref bulk export, DOI resolver API, scholarly search API, RAG over research papers.

### 📝 Changelog

- **v1.0** — Initial release. Keyword search + DOI lookup modes, year range filter, citation count filter, 3 sort modes, up to 1000 results per search.

# Actor input Schema

## `searchQueries` (type: `array`):

Search scholarly articles by keyword (e.g. 'CRISPR gene editing', 'large language models')

## `dois` (type: `array`):

Fetch specific articles by DOI (e.g. '10.1038/s41586-023-06747-5')

## `maxResults` (type: `integer`):

Maximum articles to return per search query

## `sortBy` (type: `string`):

How to sort search results

## `fromYear` (type: `integer`):

Only include articles published from this year (e.g. 2020)

## `toYear` (type: `integer`):

Only include articles published up to this year

## `minCitations` (type: `integer`):

Only include articles with at least this many citations (0 = no filter)

## Actor input object example

```json
{
  "maxResults": 50,
  "sortBy": "relevance",
  "minCitations": 0
}
```

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

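// Prepare Actor input (example values; adjust the query and limits to your needs)
const input = {
    searchQueries: ['CRISPR gene editing'],
    maxResults: 50,
    sortBy: 'relevance',
    minCitations: 0,
};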

// Run the Actor and wait for it to finish
const run = await client.actor("openclawmara/crossref-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

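# Prepare the Actor input (example values; adjust the query and limits to your needs)
run_input = {
    "searchQueries": ["CRISPR gene editing"],
    "maxResults": 50,
    "sortBy": "relevance",
    "minCitations": 0,
}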

# Run the Actor and wait for it to finish
run = client.actor("openclawmara/crossref-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{"searchQueries": ["CRISPR gene editing"], "maxResults": 50}' |
apify call openclawmara/crossref-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=openclawmara/crossref-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Crossref Scraper — DOI Metadata for Academic Papers",
        "description": "Scrape Crossref — largest DOI registry for academic literature. Modes: search works, DOI lookup, journal metadata, funder info, affiliation search. Extracts titles, authors, DOIs, ISSN, references, citations. Official REST API, no auth, 50 req/sec. For research & citation analysis.",
        "version": "1.0",
        "x-build-id": "5i5FFcqL5OX2lxMdA"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/openclawmara~crossref-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-openclawmara-crossref-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/openclawmara~crossref-scraper/runs": {
            "post": {
                "operationId": "runs-sync-openclawmara-crossref-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/openclawmara~crossref-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-openclawmara-crossref-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "searchQueries": {
                        "title": "Search Queries",
                        "type": "array",
                        "description": "Search scholarly articles by keyword (e.g. 'CRISPR gene editing', 'large language models')",
                        "items": {
                            "type": "string"
                        }
                    },
                    "dois": {
                        "title": "DOIs",
                        "type": "array",
                        "description": "Fetch specific articles by DOI (e.g. '10.1038/s41586-023-06747-5')",
                        "items": {
                            "type": "string"
                        }
                    },
                    "maxResults": {
                        "title": "Max Results per Query",
                        "minimum": 1,
                        "maximum": 1000,
                        "type": "integer",
                        "description": "Maximum articles to return per search query",
                        "default": 50
                    },
                    "sortBy": {
                        "title": "Sort By",
                        "enum": [
                            "relevance",
                            "published",
                            "is-referenced-by-count"
                        ],
                        "type": "string",
                        "description": "How to sort search results",
                        "default": "relevance"
                    },
                    "fromYear": {
                        "title": "From Year",
                        "minimum": 1900,
                        "maximum": 2030,
                        "type": "integer",
                        "description": "Only include articles published from this year (e.g. 2020)"
                    },
                    "toYear": {
                        "title": "To Year",
                        "minimum": 1900,
                        "maximum": 2030,
                        "type": "integer",
                        "description": "Only include articles published up to this year"
                    },
                    "minCitations": {
                        "title": "Minimum Citations",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Only include articles with at least this many citations (0 = no filter)",
                        "default": 0
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
