GitHub Repo Scraper

Fetch full GitHub repository metadata for one or many repos in one call (stars, forks, languages, topics, license, default branch, latest release) and export it to JSON or CSV. A GitHub repo API wrapper; optional token for higher rate limits.

Developer: DevilScrapes (maintained by Community)

🎯 What this scrapes

GitHub exposes public repository data through its REST API, but turning a list of repos into a reliable dataset is messier than it looks: secondary-rate-limit errors kick in at burst speeds, the languages and release endpoints are separate calls, and unauthenticated requests cap out at 60 per hour. This GitHub repo scraper fans out requests in parallel, handles the retry dance automatically, and delivers one richly-typed row per repository — covering everything from stargazers_count through latest_release_tag to scraped_at.

Give it a list of owner/repo slugs or full GitHub URLs. It writes clean, Pydantic-validated rows straight into your Apify dataset. Use it for competitor benchmarking, OSS health checks, DevRel dashboards, AI/RAG corpus building, or any workflow that needs bulk GitHub repository data on demand.
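As a rough illustration of that input handling, the sketch below normalises a slug or URL into owner/repo form and lists the GitHub REST endpoints such a repo would map to. The function names `normalize_repo` and `endpoints` are hypothetical, and the Actor's real parsing may differ; the three endpoint paths are the public GitHub REST routes for repository, languages, and latest release.

```python
import re
from urllib.parse import urlparse

# Loose slug check (an assumption; GitHub's own rules are stricter).
SLUG_RE = re.compile(r"^[A-Za-z0-9][A-Za-z0-9-]*/[A-Za-z0-9._-]+$")


def normalize_repo(ref: str) -> str:
    """Turn 'owner/repo' or a full GitHub URL into an 'owner/repo' slug."""
    ref = ref.strip()
    if ref.startswith(("http://", "https://")):
        parsed = urlparse(ref)
        if parsed.netloc.lower() not in {"github.com", "www.github.com"}:
            raise ValueError(f"not a GitHub URL: {ref!r}")
        parts = [p for p in parsed.path.split("/") if p]
        if len(parts) < 2:
            raise ValueError(f"URL has no owner/repo path: {ref!r}")
        ref = "/".join(parts[:2])  # drop /tree/..., /issues, etc.
    if ref.endswith(".git"):
        ref = ref[:-4]
    if not SLUG_RE.match(ref):
        raise ValueError(f"not an owner/repo slug: {ref!r}")
    return ref


def endpoints(slug: str, include_languages: bool = True,
              include_latest_release: bool = True) -> list:
    """List the GitHub REST URLs one repo would need under these options."""
    base = f"https://api.github.com/repos/{slug}"
    urls = [base]
    if include_languages:
        urls.append(base + "/languages")
    if include_latest_release:
        urls.append(base + "/releases/latest")
    return urls
```

Calling `endpoints(normalize_repo("https://github.com/apify/crawlee-python"))` yields three URLs, which is why the optional sub-resources triple the request count per repository.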

🔥 What we handle for you

  • 🛡️ Browser fingerprint rotation: curl-cffi impersonates real Chrome / Firefox / Safari TLS handshakes so requests look like a browser, not a Python script.
  • 🌐 Optional residential proxy rotation via Apify Proxy (disabled by default): when enabled, you get a fresh session and exit IP whenever the target pushes back.
  • 🔁 Retries with exponential backoff on 408 / 429 / 5xx: up to 5 attempts per request, Retry-After headers honoured.
  • 🧱 Rate-limit-aware pacing: when GitHub's secondary rate limit kicks in, we slow down and wait rather than hammering until banned.
  • 🧊 Clean, typed dataset rows: Pydantic-validated, ISO-8601 timestamps, stable field names, JSON / CSV / Excel export straight from Apify Console.
  • 💰 Pay-Per-Event pricing: apart from a small per-run start fee, you pay only for results that land in your dataset.
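The retry behaviour in the list above can be pictured with a minimal sketch. It assumes a 1-second base delay that doubles per attempt and a 60-second cap; those numbers, and the names `backoff_delay` and `fetch_with_retries`, are illustrative rather than the Actor's actual settings. Jitter is left out to keep the example deterministic.

```python
import time
from typing import Callable, Optional, Tuple


def is_retryable(status: int) -> bool:
    """Statuses the list above names as retryable: 408, 429 and any 5xx."""
    return status in (408, 429) or 500 <= status <= 599


def backoff_delay(attempt: int, retry_after: Optional[float] = None,
                  base: float = 1.0, cap: float = 60.0) -> float:
    """Seconds to wait after failed attempt number `attempt` (1-based)."""
    if retry_after is not None:
        return max(0.0, retry_after)  # the server's hint wins
    return min(cap, base * 2 ** (attempt - 1))  # 1, 2, 4, 8, ... capped


def fetch_with_retries(fetch: Callable[[], Tuple[int, Optional[float], object]],
                       max_attempts: int = 5,
                       sleep: Callable[[float], None] = time.sleep):
    """Call `fetch` until it returns a non-retryable status or attempts run out.

    `fetch` is a hypothetical callable returning
    (status_code, retry_after_seconds_or_None, body).
    """
    for attempt in range(1, max_attempts + 1):
        status, retry_after, body = fetch()
        if not is_retryable(status):
            return status, body
        if attempt < max_attempts:
            sleep(backoff_delay(attempt, retry_after))
    return status, body  # still failing after the last attempt
```

Passing `sleep` as a parameter keeps the loop easy to test without real waiting.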

💡 Use cases

  • Competitor OSS benchmarking — track stars and forks across rival projects week-over-week and pipe deltas to Slack or a BI tool.
  • Dependency health monitoring — feed your stack's transitive repo list and flag anything archived, disabled, or unmaintained.
  • RAG corpus building: pull language breakdowns, topics, and descriptions for a curated set of repos to seed a vector store (README content itself is out of scope, see Limitations).
  • Hiring and M&A research — quantify the open-source surface area of a target company or candidate's personal GitHub activity.
  • Newsletter automation — ingest a curated list weekly, diff the star counts, surface the fastest movers.
  • DevRel dashboards — track your own org's repos alongside ecosystem repos in one unified dataset.
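Several of these use cases come down to comparing two exports. A minimal sketch of the week-over-week star diff, assuming each export is a list of rows carrying the `full_name` and `stargazers_count` fields described in the Output section (the function name `star_deltas` is hypothetical):

```python
def star_deltas(previous: list, current: list) -> list:
    """Return (full_name, star_change) pairs, fastest movers first.

    Repos missing from the previous export are skipped, since they have
    no baseline to compare against.
    """
    before = {row["full_name"]: row["stargazers_count"] for row in previous}
    deltas = [
        (row["full_name"], row["stargazers_count"] - before[row["full_name"]])
        for row in current
        if row["full_name"] in before
    ]
    return sorted(deltas, key=lambda item: item[1], reverse=True)
```

The resulting list can be posted to Slack or loaded into a BI tool as is.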

⚙️ How to use it

  1. Click Try for free at the top of the page — no credit card required.
  2. Paste one or more owner/repo slugs (or full GitHub URLs) into the repos field.
  3. Optionally add a GitHub Personal Access Token to lift the rate limit from 60 req/hr to 5 000 req/hr.
  4. Click Start. Output streams into the run's dataset in real time.
  5. Export from Storage → Dataset as JSON, CSV, or Excel — or pull via the Apify API.
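For the API route mentioned in the last step, one option is Apify's synchronous run endpoint, which starts a run and returns its dataset items in a single HTTP call. The sketch below uses only the standard library. The Actor ID is a placeholder (this page does not state it), and the synchronous endpoint suits short runs; longer ones are usually started asynchronously and polled.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"


def build_run_request(actor_id: str, apify_token: str,
                      run_input: dict) -> urllib.request.Request:
    """Prepare a POST that runs the Actor and returns its dataset items.

    `actor_id` uses Apify's 'username~actor-name' form; the value passed
    in the usage example is a placeholder, not the real ID.
    """
    url = f"{API_BASE}/acts/{actor_id}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {apify_token}",
        },
        method="POST",
    )


def run_actor(actor_id: str, apify_token: str, run_input: dict,
              timeout: float = 300.0) -> list:
    """Send the request and decode the JSON array of dataset rows."""
    request = build_run_request(actor_id, apify_token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

A call would look like `run_actor("<username>~<actor-name>", "<APIFY_TOKEN>", {"repos": ["apify/crawlee-python"]})`, with both placeholders replaced by your own values.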

📥 Input

| Field | Type | Required | Default | Notes |
| --- | --- | --- | --- | --- |
| repos | array | yes | ["apify/apify-sdk-python", "apify/crawlee-python"] | List of repos as owner/repo slugs or full GitHub URLs. |
| githubToken | string | no | | Personal access token. Unauthenticated = 60 req/hr; with token = 5 000 req/hr. For public repositories a classic token needs no scopes; public_repo also works but grants write access. |
| includeLanguages | boolean | no | true | Adds a languages map (language → bytes) per repo. One extra API call per repo. |
| includeLatestRelease | boolean | no | true | Adds latest_release_tag and latest_release_published_at. One extra API call per repo. |
| concurrency | integer | no | 6 | Parallel API requests. Up to 8 with a token; 2–3 without. |
| proxyConfiguration | object | no | {"useApifyProxy": false} | Apify Proxy settings. Proxy is optional for the GitHub REST API; enable it if your network routing requires it. |

Example input

{
    "repos": [
        "apify/apify-sdk-python",
        "apify/crawlee-python"
    ],
    "githubToken": "",
    "includeLanguages": true,
    "includeLatestRelease": true,
    "concurrency": 4,
    "proxyConfiguration": {
        "useApifyProxy": false
    }
}
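Before starting a run it can help to estimate how many GitHub requests an input will generate. The sketch below assumes one base call per repo plus one per enabled option, as the Notes column suggests, and ignores retries; `requests_needed` and `fits_hourly_limit` are hypothetical helper names.

```python
def requests_needed(run_input: dict) -> int:
    """Estimate GitHub API calls: 1 per repo, +1 for each enabled option."""
    per_repo = (
        1
        + int(run_input.get("includeLanguages", True))
        + int(run_input.get("includeLatestRelease", True))
    )
    return len(run_input["repos"]) * per_repo


def fits_hourly_limit(run_input: dict) -> bool:
    """Compare the estimate with 60 req/hr (no token) or 5 000 req/hr."""
    limit = 5000 if run_input.get("githubToken") else 60
    return requests_needed(run_input) <= limit
```

For the example input above the estimate is 6 requests, comfortably inside the unauthenticated limit.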

📤 Output

Every row is one dataset item representing one GitHub repository.

| Field | Type | Notes |
| --- | --- | --- |
| owner | string | Owner or organisation login. |
| name | string | Repository name (without owner prefix). |
| full_name | string | Full slug (owner/name). |
| html_url | string | Canonical GitHub URL. |
| description | string or null | Repository tagline. |
| fork | boolean | true if this is a fork. |
| archived | boolean | true if the repo is archived (read-only). |
| disabled | boolean | true if the repo is disabled. |
| stargazers_count | integer | Star count at scrape time. |
| forks_count | integer | Fork count. |
| watchers_count | integer | Watcher count. Note that GitHub's REST API reports the same value here as stargazers_count; true subscribers live in a separate field (subscribers_count). |
| open_issues_count | integer | Open issues + open PRs combined. |
| size_kb | integer | Repository size in kilobytes. |
| language | string or null | Primary language (GitHub's classification). |
| languages | object or null | Map of language → bytes. Populated when includeLanguages=true. |
| topics | array | Repository topics / tags. |
| license | string or null | SPDX identifier (e.g. MIT, Apache-2.0). |
| default_branch | string | Default branch name (usually main). |
| homepage | string or null | User-supplied homepage URL. |
| created_at | string | Repo creation timestamp (ISO-8601 UTC). |
| updated_at | string | Last metadata update timestamp (ISO-8601 UTC). |
| pushed_at | string | Last commit push timestamp (ISO-8601 UTC). |
| latest_release_tag | string or null | Tag of the latest GitHub release. Populated when includeLatestRelease=true. |
| latest_release_published_at | string or null | Publish timestamp of the latest release (ISO-8601 UTC). |
| scraped_at | string | When this row was recorded (ISO-8601 UTC). |

Example output

{
    "owner": "apify",
    "name": "apify-sdk-python",
    "full_name": "apify/apify-sdk-python",
    "html_url": "https://github.com/apify/apify-sdk-python",
    "description": "The Apify SDK for Python.",
    "fork": false,
    "archived": false,
    "disabled": false,
    "stargazers_count": 415,
    "forks_count": 41,
    "watchers_count": 415,
    "open_issues_count": 12,
    "size_kb": 2048,
    "language": "Python",
    "languages": {
        "Python": 198432,
        "Shell": 1024
    },
    "topics": ["apify", "scraping", "sdk"],
    "license": "Apache-2.0",
    "default_branch": "main",
    "homepage": "https://docs.apify.com/sdk/python",
    "created_at": "2022-08-01T10:00:00Z",
    "updated_at": "2026-05-30T14:22:00Z",
    "pushed_at": "2026-05-29T08:11:00Z",
    "latest_release_tag": "v3.4.0",
    "latest_release_published_at": "2026-05-20T12:00:00Z",
    "scraped_at": "2026-06-01T09:00:00Z"
}
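Because the languages map is expressed in bytes, downstream code usually wants percentages. A small sketch, assuming rows shaped like the example above (`language_shares` is a hypothetical helper, not part of the Actor):

```python
def language_shares(row: dict) -> dict:
    """Convert a row's language → bytes map into percentages (1 decimal).

    Returns an empty dict when `languages` is null or empty, which is the
    case when includeLanguages is turned off.
    """
    languages = row.get("languages") or {}
    total = sum(languages.values())
    if total == 0:
        return {}
    return {name: round(100 * size / total, 1) for name, size in languages.items()}
```

Applied to the example row, this gives Python at 99.5 % and Shell at 0.5 %.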

💰 Pricing

Pay-Per-Event — you only pay when these events fire:

| Event | USD | What triggers it |
| --- | --- | --- |
| actor-start | $0.005 | One-off warm-up charge per run |
| result | $0.002 | Per dataset row written |

Example: 1 000 repos in a single run cost 1 000 × $0.002 + $0.005 = $2.005, or about $2.00. No subscription, no minimum, no card required to start: Apify gives every new account $5 of free credit.
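The same arithmetic as a small helper, using the two event prices from the table. `run_cost` is a hypothetical name, and `Decimal` is used only to keep the cents exact.

```python
from decimal import Decimal

ACTOR_START = Decimal("0.005")  # charged once per run
PER_RESULT = Decimal("0.002")   # charged per dataset row written


def run_cost(rows: int, runs: int = 1) -> Decimal:
    """Total USD for `rows` dataset rows spread across `runs` Actor runs."""
    return ACTOR_START * runs + PER_RESULT * rows
```

Splitting the same 1 000 repos across four runs adds three more start fees, so `run_cost(1000, runs=4)` comes to $2.02.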

🚧 Limitations

  • Private repos require a token with the matching scopes — this Actor only processes repos the token can read. Do not reuse production tokens here.
  • README content, raw code, commit graphs, and pull request history are outside the scope of this Actor. Use GitHub's search API or a dedicated commits scraper for those.
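  • Large orgs with thousands of repos will hit the 5 000 req/hr authenticated ceiling on long runs. With both optional calls enabled each repo costs about three requests, so the ceiling arrives at roughly 1 600 repos per hour. Plan batches or spread runs over multiple hours.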
  • GitHub caches some counts (stars, forks) for a few minutes. Compare runs at least 5 minutes apart to catch real movement.
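One way to plan those batches is to chunk the repo list so that each chunk fits inside one hour of quota. The sketch assumes about three requests per repo (the base call plus the two optional ones) and does not budget for retries; `hourly_batches` is a hypothetical helper.

```python
def hourly_batches(repos: list, hourly_limit: int = 5000,
                   requests_per_repo: int = 3) -> list:
    """Split `repos` into chunks that each stay under the hourly quota."""
    size = max(1, hourly_limit // requests_per_repo)  # 1 666 with defaults
    return [repos[i:i + size] for i in range(0, len(repos), size)]
```

Each chunk can then be submitted as its own run, scheduled an hour apart.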

❓ FAQ

Do I need a GitHub token?

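For small batches, no. The unauthenticated GitHub REST API allows 60 requests per hour, which is enough for a quick test: about 20 repos with the default options (roughly three requests each), or up to 60 with includeLanguages and includeLatestRelease turned off. Provide a Personal Access Token to raise that ceiling to 5 000 requests per hour. For public repositories a classic token needs no scopes at all.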

Is this a GitHub REST API alternative or replacement?

Neither — it is a wrapper that handles authentication, pagination, sub-resource fetching, secondary-rate-limit retries, and structured output so you do not have to write that code yourself. The GitHub API still powers the requests; we handle the operational layer on top of it.

Can I use this to fetch GitHub repo metadata for hundreds of repos at once?

Yes. This is the primary use case — pass a list of hundreds of owner/repo slugs, set your token, and the Actor fans them out in parallel while respecting GitHub's rate limits. Output lands in a clean dataset ready to export or query.

How is this different from calling the GitHub REST API myself?

Writing a reliable GitHub repository scraper yourself means handling secondary rate limits, separate language and release endpoints, pagination, token rotation, and structured output validation. That is 1–2 dev-weeks of plumbing. This Actor handles all of it for about $2 per 1 000 repos.

What if a repo doesn't exist or has been deleted?

The Actor logs a warning for that slug and continues — the rest of your list still processes normally. You get a partial dataset with a clear log entry for every skipped repo.
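If you would rather detect skipped repos programmatically than read the log, you can compare the input list with the `full_name` values in the dataset. This is a sketch under two assumptions: the input uses owner/repo slugs rather than URLs, and slugs are matched case-insensitively. A renamed repository may come back under its new name and would then be reported as skipped. `skipped_repos` is a hypothetical helper.

```python
def skipped_repos(requested: list, rows: list) -> list:
    """Return the requested slugs that have no matching dataset row."""
    returned = {row["full_name"].lower() for row in rows}
    return [slug for slug in requested if slug.lower() not in returned]
```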

Can I scrape private repositories?

You can — if you supply a token that has access to those repos. This Actor was designed for bulk public-data extraction; do not use production tokens here.

💬 Your feedback

Spotted a bug, hit a weird edge case, or need a new output field? Open an issue on the Actor's Issues tab in Apify Console — we read every report and ship fixes weekly.