GitHub Repo Scraper
Pricing
Pay per event
Fetch full GitHub repository metadata for one or many repos in one call — stars, forks, languages, topics, license, default branch, latest release, contributor count — export to JSON or CSV. A GitHub repo API wrapper; optional token for higher rate limits.
Developer: DevilScrapes (maintained by Community)
Actor stats: 0 bookmarks, 2 total users, 1 monthly active user, last modified 14 days ago
🎯 What this scrapes
GitHub exposes public repository data through its REST API, but turning a list of repos into a reliable dataset is messier than it looks: secondary-rate-limit errors kick in at burst speeds, the languages and release endpoints are separate calls, and unauthenticated requests cap out at 60 per hour. This GitHub repo scraper fans out requests in parallel, handles the retry dance automatically, and delivers one richly typed row per repository, covering everything from stargazers_count through latest_release_tag to scraped_at.
Give it a list of owner/repo slugs or full GitHub URLs. It writes clean, Pydantic-validated rows straight into your Apify dataset. Use it for competitor benchmarking, OSS health checks, DevRel dashboards, AI/RAG corpus building, or any workflow that needs bulk GitHub repository data on demand.
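As a rough illustration of what accepting both input shapes involves, here is a minimal sketch that reduces either an owner/repo slug or a full GitHub URL to a slug. The `normalize_repo` helper and its validation pattern are hypothetical, written for this example; the Actor's own parsing may differ.

```python
import re

# Loose sanity check for an "owner/repo" slug (illustrative, not GitHub's exact rules).
_SLUG_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]*/[A-Za-z0-9._-]+$")


def normalize_repo(ref: str) -> str:
    """Turn 'owner/repo' or a GitHub URL into an 'owner/repo' slug."""
    ref = ref.strip()
    # Strip the scheme and host if a full URL was given.
    ref = re.sub(r"^(?:https?://)?(?:www\.)?github\.com/", "", ref, flags=re.IGNORECASE)
    # Drop any query string or fragment, then surrounding slashes.
    ref = re.split(r"[?#]", ref, maxsplit=1)[0].strip("/")
    parts = ref.split("/")
    if len(parts) < 2:
        raise ValueError(f"not an owner/repo reference: {ref!r}")
    owner, repo = parts[0], parts[1]  # ignore deeper paths such as /tree/main
    if repo.endswith(".git"):
        repo = repo[: -len(".git")]
    slug = f"{owner}/{repo}"
    if not _SLUG_RE.match(slug):
        raise ValueError(f"not an owner/repo reference: {ref!r}")
    return slug
```

Running your list through a helper like this before starting a run also makes it easy to de-duplicate repos that were supplied once as a slug and once as a URL.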
🔥 What we handle for you
- 🛡️ Browser fingerprint rotation: `curl-cffi` impersonates real Chrome / Firefox / Safari TLS handshakes so requests look like a browser, not a Python script.
- 🌐 Residential proxy rotation via Apify Proxy (when enabled in proxyConfiguration): fresh session and exit IP whenever the target pushes back.
- 🔁 Retries with exponential backoff on `408 / 429 / 5xx`: up to 5 attempts per request, `Retry-After` headers honoured.
- 🧱 Rate-limit-aware pacing: when GitHub's secondary rate limit kicks in, we slow down and wait rather than hammering until banned.
- 🧊 Clean, typed dataset rows: Pydantic-validated, ISO-8601 timestamps, stable field names, JSON / CSV / Excel export straight from Apify Console.
- 💰 Pay-Per-Event pricing: beyond a small per-run start fee, you pay only for results that land in your dataset.
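The retry policy described above can be sketched as a pair of small functions. This is an illustration of the general technique (capped exponential backoff with full jitter, preferring a numeric `Retry-After` value when present), not the Actor's actual code. The base delay, the 60-second cap, and the function names are assumptions.

```python
import random
from typing import Optional

RETRYABLE_STATUSES = {408, 429} | set(range(500, 600))
MAX_ATTEMPTS = 5  # from the list above: up to 5 attempts per request


def should_retry(status: int, attempt: int) -> bool:
    """attempt is the 1-based number of the attempt that just failed."""
    return status in RETRYABLE_STATUSES and attempt < MAX_ATTEMPTS


def backoff_delay(
    attempt: int,
    retry_after: Optional[str] = None,
    base: float = 1.0,   # assumed base delay in seconds
    cap: float = 60.0,   # assumed upper bound in seconds
    jitter: bool = True,
) -> float:
    """Seconds to wait before the next attempt."""
    if retry_after is not None:
        try:
            # Honour a Retry-After header given as a number of seconds.
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form: fall back to exponential backoff
    delay = min(cap, base * 2 ** (attempt - 1))
    if jitter:
        # Full jitter spreads retries out so parallel workers do not sync up.
        delay = random.uniform(0.0, delay)
    return delay
```

With jitter disabled, the waits after attempts 1 to 4 are 1, 2, 4, and 8 seconds.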
💡 Use cases
- Competitor OSS benchmarking — track stars and forks across rival projects week-over-week and pipe deltas to Slack or a BI tool.
- Dependency health monitoring — feed your stack's transitive repo list and flag anything archived, disabled, or unmaintained.
- RAG corpus building — pull language breakdowns and README metadata for a curated set of repos to seed a vector store.
- Hiring and M&A research — quantify the open-source surface area of a target company or candidate's personal GitHub activity.
- Newsletter automation — ingest a curated list weekly, diff the star counts, surface the fastest movers.
- DevRel dashboards — track your own org's repos alongside ecosystem repos in one unified dataset.
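For the week-over-week comparisons mentioned above, a minimal sketch of diffing two exported datasets might look like this. It relies only on the `full_name` and `stargazers_count` output fields; the `star_movers` helper and its `top` parameter are hypothetical.

```python
from typing import Dict, List, Tuple


def star_movers(previous: List[Dict], current: List[Dict], top: int = 5) -> List[Tuple[str, int]]:
    """Return (full_name, star delta) pairs, largest gain first."""
    before = {row["full_name"]: row["stargazers_count"] for row in previous}
    deltas = []
    for row in current:
        name = row["full_name"]
        if name in before:  # skip repos that were not in the earlier run
            deltas.append((name, row["stargazers_count"] - before[name]))
    deltas.sort(key=lambda item: item[1], reverse=True)
    return deltas[:top]
```

Repos that appear in only one of the two runs are ignored here; a real pipeline would probably want to report them separately.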
⚙️ How to use it
- Click Try for free at the top of the page (no credit card required).
- Paste one or more `owner/repo` slugs (or full GitHub URLs) into the repos field.
- Optionally add a GitHub Personal Access Token to lift the rate limit from 60 req/hr to 5 000 req/hr.
- Click Start. Output streams into the run's dataset in real time.
- Export from Storage → Dataset as JSON, CSV, or Excel, or pull via the Apify API.
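For the API route in the last step, the sketch below builds a download URL for Apify's dataset items endpoint using its `format` and `fields` query parameters. The `dataset_items_url` helper and the dataset ID in the usage note are placeholders invented for this example.

```python
from typing import Iterable, Optional
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"
SUPPORTED_FORMATS = {"json", "jsonl", "csv", "xlsx"}  # subset used in this sketch


def dataset_items_url(
    dataset_id: str,
    fmt: str = "json",
    fields: Optional[Iterable[str]] = None,
) -> str:
    """Build the items URL for a run's dataset."""
    if fmt not in SUPPORTED_FORMATS:
        raise ValueError(f"unsupported format: {fmt!r}")
    params = {"format": fmt}
    if fields:
        # Restrict the export to selected columns, e.g. for a compact CSV.
        params["fields"] = ",".join(fields)
    return f"{API_BASE}/datasets/{dataset_id}/items?{urlencode(params)}"
```

For example, `dataset_items_url("YOUR_DATASET_ID", "csv", ["full_name", "stargazers_count"])` yields a two-column CSV export link. If the dataset is not public, send your Apify API token in an `Authorization: Bearer` header rather than embedding it in the URL.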
📥 Input
| Field | Type | Required | Default | Notes |
|---|---|---|---|---|
| `repos` | array | yes | `["apify/apify-sdk-python", "apify/crawlee-python"]` | List of repos as owner/repo slugs or full GitHub URLs. |
| `githubToken` | string | no | (none) | Personal access token. Unauthenticated = 60 req/hr; with token = 5 000 req/hr. For public repos, a classic token with no scopes selected is sufficient (it grants read-only access to public data); `public_repo` also works but additionally grants write access. |
| `includeLanguages` | boolean | no | `true` | Adds a languages map (language → bytes) per repo. One extra API call per repo. |
| `includeLatestRelease` | boolean | no | `true` | Adds latest_release_tag and latest_release_published_at. One extra API call per repo. |
| `concurrency` | integer | no | `6` | Parallel API requests. Up to 8 with a token; 2–3 without. |
| `proxyConfiguration` | object | no | `{"useApifyProxy": false}` | Apify Proxy settings. Proxy is optional for the GitHub REST API; enable it if your network routing requires it. |
Example input

    {
        "repos": ["apify/apify-sdk-python", "apify/crawlee-python"],
        "githubToken": "",
        "includeLanguages": true,
        "includeLatestRelease": true,
        "concurrency": 4,
        "proxyConfiguration": {"useApifyProxy": false}
    }
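If you generate the input programmatically, a small builder can keep the concurrency guidance from the table in one place. This is a sketch: the `build_run_input` helper is hypothetical, and the choice of 8 with a token and 2 without simply follows the Notes column above.

```python
from typing import Dict, Iterable, Optional


def build_run_input(
    repos: Iterable[str],
    github_token: Optional[str] = None,
    include_languages: bool = True,
    include_latest_release: bool = True,
) -> Dict:
    """Assemble an input object using the field names from the input table."""
    repo_list = list(repos)
    if not repo_list:
        raise ValueError("repos is required and must not be empty")
    run_input = {
        "repos": repo_list,
        "includeLanguages": include_languages,
        "includeLatestRelease": include_latest_release,
        # Table guidance: up to 8 with a token, 2-3 without.
        "concurrency": 8 if github_token else 2,
        "proxyConfiguration": {"useApifyProxy": False},
    }
    if github_token:
        run_input["githubToken"] = github_token
    return run_input
```

Serialise the result with `json.dumps` to paste it into the Console's JSON input editor or send it through the Apify API.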
📤 Output
Every row is one dataset item representing one GitHub repository.
| Field | Type | Notes |
|---|---|---|
| `owner` | string | Owner or organisation login. |
| `name` | string | Repository name (without owner prefix). |
| `full_name` | string | Full slug: owner/name. |
| `html_url` | string | Canonical GitHub URL. |
| `description` | string \| null | Repository tagline. |
| `fork` | boolean | true if this is a fork. |
| `archived` | boolean | true if the repo is archived (read-only). |
| `disabled` | boolean | true if the repo is disabled. |
| `stargazers_count` | integer | Star count at scrape time. |
| `forks_count` | integer | Fork count. |
| `watchers_count` | integer | Watcher count (subscribers). |
| `open_issues_count` | integer | Open issues + open PRs combined. |
| `size_kb` | integer | Repository size in kilobytes. |
| `language` | string \| null | Primary language (GitHub's classification). |
| `languages` | object \| null | Map of language → bytes. Populated when includeLanguages=true. |
| `topics` | array | Repository topics / tags. |
| `license` | string \| null | SPDX identifier (e.g. MIT, Apache-2.0). |
| `default_branch` | string | Default branch name (usually main). |
| `homepage` | string \| null | User-supplied homepage URL. |
| `created_at` | string | Repo creation timestamp (ISO-8601 UTC). |
| `updated_at` | string | Last metadata update timestamp (ISO-8601 UTC). |
| `pushed_at` | string | Last commit push timestamp (ISO-8601 UTC). |
| `latest_release_tag` | string \| null | Tag of the latest GitHub release. Populated when includeLatestRelease=true. |
| `latest_release_published_at` | string \| null | Publish timestamp of the latest release (ISO-8601 UTC). |
| `scraped_at` | string | When this row was recorded (ISO-8601 UTC). |
Example output

    {
        "owner": "apify",
        "name": "apify-sdk-python",
        "full_name": "apify/apify-sdk-python",
        "html_url": "https://github.com/apify/apify-sdk-python",
        "description": "The Apify SDK for Python.",
        "fork": false,
        "archived": false,
        "stargazers_count": 415,
        "forks_count": 41,
        "watchers_count": 415,
        "open_issues_count": 12,
        "size_kb": 2048,
        "language": "Python",
        "languages": {"Python": 198432, "Shell": 1024},
        "topics": ["apify", "scraping", "sdk"],
        "license": "Apache-2.0",
        "default_branch": "main",
        "homepage": "https://docs.apify.com/sdk/python",
        "created_at": "2022-08-01T10:00:00Z",
        "updated_at": "2026-05-30T14:22:00Z",
        "pushed_at": "2026-05-29T08:11:00Z",
        "latest_release_tag": "v3.4.0",
        "latest_release_published_at": "2026-05-20T12:00:00Z",
        "scraped_at": "2026-06-01T09:00:00Z"
    }
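As one way to consume rows like the example above, the sketch below derives simple health flags for the dependency-monitoring use case. It reads only documented output fields. The `health_flags` helper and the 365-day staleness threshold are assumptions chosen for illustration.

```python
from datetime import datetime
from typing import Dict, List


def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 UTC timestamp such as '2026-06-01T09:00:00Z'."""
    # Older Python versions do not accept a trailing 'Z' in fromisoformat.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def days_since_push(row: Dict) -> int:
    """Whole days between the last push and the moment the row was scraped."""
    return (parse_ts(row["scraped_at"]) - parse_ts(row["pushed_at"])).days


def health_flags(row: Dict, stale_after_days: int = 365) -> List[str]:
    """Return any of 'archived', 'disabled', 'stale' that apply to a row."""
    flags = []
    if row.get("archived"):
        flags.append("archived")
    if row.get("disabled"):
        flags.append("disabled")
    if days_since_push(row) > stale_after_days:
        flags.append("stale")
    return flags
```

Measuring staleness against `scraped_at` rather than the current time keeps the result reproducible when you re-process an old dataset.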
💰 Pricing
Pay-Per-Event — you only pay when these events fire:
| Event | USD | What triggers it |
|---|---|---|
| `actor-start` | $0.005 | One-off warm-up charge per run |
| `result` | $0.002 | Per dataset row written |
Example: 1 000 repos in a single run cost 1 000 × $0.002 + $0.005 = $2.005, or roughly $2.00. No subscription, no minimum, no card required to start; Apify gives every new account $5 of free credit.
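The same arithmetic as a small estimator, using the two event prices from the table. The `estimate_cost` function is a hypothetical helper, and it assumes every requested repo produces exactly one dataset row.

```python
from decimal import Decimal

ACTOR_START_USD = Decimal("0.005")  # charged once per run
RESULT_USD = Decimal("0.002")       # charged per dataset row


def estimate_cost(rows: int, runs: int = 1) -> Decimal:
    """Estimated USD cost for a number of rows spread over a number of runs."""
    return ACTOR_START_USD * runs + RESULT_USD * rows


print(estimate_cost(1000))           # 2.005
print(estimate_cost(1000, runs=4))   # 2.020
```

Splitting the same list across more runs only adds the per-run start fee, so batching to stay under rate limits costs very little extra.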
🚧 Limitations
- Private repos require a token with the matching scopes — this Actor only processes repos the token can read. Do not reuse production tokens here.
- README content, raw code, commit graphs, and pull request history are outside the scope of this Actor. Use GitHub's search API or a dedicated commits scraper for those.
- Large orgs with thousands of repos will hit the 5 000 req/hr authenticated ceiling on long runs. Plan batches or spread runs over multiple hours.
- GitHub caches some counts (stars, forks) for a few minutes. Compare runs at least 5 minutes apart to catch real movement.
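To plan batches around the hourly ceiling, a back-of-the-envelope estimate can help. The sketch assumes one base API call per repo plus one for each enabled option (as stated in the input table) and ignores retries; the function names are hypothetical.

```python
import math

AUTHENTICATED_LIMIT = 5000   # requests per hour with a token
UNAUTHENTICATED_LIMIT = 60   # requests per hour without one


def calls_per_repo(include_languages: bool = True, include_latest_release: bool = True) -> int:
    """One base call plus one per enabled sub-resource (assumption)."""
    return 1 + int(include_languages) + int(include_latest_release)


def repos_per_hour(
    has_token: bool,
    include_languages: bool = True,
    include_latest_release: bool = True,
) -> int:
    limit = AUTHENTICATED_LIMIT if has_token else UNAUTHENTICATED_LIMIT
    return limit // calls_per_repo(include_languages, include_latest_release)


def hours_needed(
    n_repos: int,
    has_token: bool,
    include_languages: bool = True,
    include_latest_release: bool = True,
) -> int:
    per_hour = repos_per_hour(has_token, include_languages, include_latest_release)
    return math.ceil(n_repos / per_hour)
```

Under these assumptions the default options allow about 1 666 repos per hour with a token and about 20 without, so a 5 000-repo list needs to be spread over roughly four hours.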
❓ FAQ
Do I need a GitHub token?
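For small batches, no. The unauthenticated GitHub REST API allows 60 requests per hour, and with the default options each repo costs up to three calls (the repo itself, its languages, and its latest release), so that covers roughly 20 repos per hour, or more if you switch off includeLanguages and includeLatestRelease. That is enough for a quick test. Provide a Personal Access Token to raise the ceiling to 5 000 requests per hour. For public repositories a classic token with no scopes selected is all you need, since it already grants read-only access to public data.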
Is this a GitHub REST API alternative or replacement?
Neither — it is a wrapper that handles authentication, pagination, sub-resource fetching, secondary-rate-limit retries, and structured output so you do not have to write that code yourself. The GitHub API still powers the requests; we handle the operational layer on top of it.
Can I use this to fetch GitHub repo metadata for hundreds of repos at once?
Yes. This is the primary use case — pass a list of hundreds of owner/repo slugs, set your token, and the Actor fans them out in parallel while respecting GitHub's rate limits. Output lands in a clean dataset ready to export or query.
How is this different from calling the GitHub REST API myself?
Writing a reliable GitHub repository scraper yourself means handling secondary rate limits, separate language and release endpoints, pagination, token rotation, and structured output validation. That is 1–2 dev-weeks of plumbing. This Actor handles all of it for about $2 per 1 000 repos.
What if a repo doesn't exist or has been deleted?
The Actor logs a warning for that slug and continues — the rest of your list still processes normally. You get a partial dataset with a clear log entry for every skipped repo.
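If you would rather detect skipped repos from the data than from the log, you can compare your request list with the `full_name` values that came back. This is a minimal sketch with a hypothetical `skipped_repos` helper. It assumes the requested entries are already owner/repo slugs and compares case-insensitively.

```python
from typing import Dict, Iterable, List


def skipped_repos(requested: Iterable[str], rows: Iterable[Dict]) -> List[str]:
    """Return requested slugs that have no matching row in the dataset."""
    returned = {row["full_name"].lower() for row in rows}
    return [slug for slug in requested if slug.lower() not in returned]
```

A renamed or transferred repo may come back under its new `full_name` and would then show up here as skipped, so treat the result as a list to review rather than a definitive list of deletions.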
Can I scrape private repositories?
You can — if you supply a token that has access to those repos. This Actor was designed for bulk public-data extraction; do not use production tokens here.
💬 Your feedback
Spotted a bug, hit a weird edge case, or need a new output field? Open an issue on the Actor's Issues tab in Apify Console — we read every report and ship fixes weekly.