# GitHub Repository Scraper (`vulnv/github-repository-scraper`) Actor

Scrape and extract GitHub repository data, metadata, statistics, stars, forks, issues, and project information from multiple repositories at once.

- **URL**: https://apify.com/vulnv/github-repository-scraper.md
- **Developed by:** [VulnV](https://apify.com/vulnv) (community)
- **Categories:** Developer tools, Automation, Other
- **Stats:** 15 total users, 0 monthly users, 100.0% runs succeeded, 1 bookmarks
- **User rating**: No ratings yet

## Pricing

$10.00/month + usage

To use this Actor, you pay a monthly rental fee to the developer. The rent is subtracted from your prepaid usage every month after the free trial period. You also pay for Apify platform usage, which gets cheaper on higher Apify subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#rental-actors

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with a capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see the Apify documentation as a [Markdown index](https://docs.apify.com/llms.txt) or as [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## GitHub Repository Scraper - Extract Repository Data at Scale

### Overview
The **GitHub Repository Scraper** is an Apify Actor that extracts data from many GitHub repositories in a single run. It is suited to competitive analysis, market research, developer insights, and building repository databases, and it returns repository details, statistics, and project metadata.

✅ Bulk URL processing | ✅ Comprehensive repository data | ✅ Statistics extraction | ✅ Metadata analysis | ✅ Concurrent processing

---

#### **Complete Repository Data Extraction**
- **Basic Information** — Repository name, description, owner, creation date
- **Statistics** — Stars, forks, watchers, usage metrics
- **Technical Details** — Programming languages, file counts, commit information
- **Project Metadata** — Topics, license information, default branch
- **Enhanced Repository Data** — GitHub IDs, clone URLs, file listings, branch info
- **Owner Information** — Detailed owner profiles with avatars and organization status
- **Repository Structure** — File counts, directory listings, README information
- **Access URLs** — Multiple clone formats (HTTPS, SSH, GitHub CLI), download links

#### **Key Features**
- **Bulk Processing** — Process multiple GitHub repository URLs in one run
- **Smart URL Parsing** — Automatically extracts repository paths from full GitHub URLs
- **Proxy Support** — Built-in Apify proxy integration for reliable scraping
- **Error Handling** — Robust error handling with detailed status reporting
- **Clean JSON Output** — Structured, ready-to-use data format
- **Concurrent Processing** — Configurable concurrency for optimal performance
- **Format Flexibility** — Accepts various URL formats and automatically normalizes them

---

### 🧾 Input Configuration

Submit an array of GitHub repository URLs via the input schema:

```json
{
  "urls": [
    "https://github.com/microsoft/vscode",
    "https://github.com/facebook/react",
    "https://github.com/nodejs/node",
    "https://github.com/torvalds/linux"
  ],
  "maxConcurrency": 5,
  "includeNotFound": false,
  "proxyConfiguration": {
    "useApifyProxy": true,
    "apifyProxyGroups": ["RESIDENTIAL"]
  }
}
````

#### **Input Parameters**

1. **URLs** (required):
   - Array of GitHub repository URLs to scrape
   - **Supported formats**: `https://github.com/owner/repo`, `github.com/owner/repo`
   - Invalid URLs will be automatically filtered out with warnings

2. **Max Concurrency** (optional):
   - Number of concurrent requests for scraping (1-20)
   - Default: 5
   - Higher values = faster processing but may increase chance of rate limiting

3. **Include Not Found** (optional):
   - Whether to include repositories that return 404 (not found) in the results
   - Default: false
   - When enabled, includes error information for non-existent repositories

4. **Proxy Configuration** (recommended):
   - Configure Apify proxy settings to avoid rate limiting
   - **Recommended for bulk scraping operations**
   - **Format**:
     ```json
     "proxyConfiguration": {
       "useApifyProxy": true,
       "apifyProxyGroups": ["RESIDENTIAL"]
     }
     ```
   - **Available proxy groups**: `RESIDENTIAL`, `DATACENTER`, `GOOGLE_SERP`
   - Use `RESIDENTIAL` for best reliability when scraping GitHub
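
The two supported URL formats can be reduced to an `owner/repo` path before a run. The following is a minimal sketch of such pre-filtering on the caller's side, not the Actor's own parser: the regular expression and the `to_repo_path` helper are assumptions made for illustration, and the Actor may accept or reject URLs differently.

```python
import re
from typing import Optional

# Hypothetical pattern: optional scheme, optional "www.", then exactly
# two path segments (owner and repository), with an optional ".git" suffix.
_REPO_RE = re.compile(
    r"^(?:https?://)?(?:www\.)?github\.com/"
    r"([A-Za-z0-9_.-]+)/([A-Za-z0-9_.-]+?)(?:\.git)?/?$"
)


def to_repo_path(url: str) -> Optional[str]:
    """Return 'owner/repo' for a repository URL, or None if it does not match."""
    match = _REPO_RE.match(url.strip())
    if match is None:
        return None
    return f"{match.group(1)}/{match.group(2)}"


if __name__ == "__main__":
    for candidate in (
        "https://github.com/microsoft/vscode",
        "github.com/facebook/react",
        "https://example.com/not/github",
    ):
        print(candidate, "->", to_repo_path(candidate))
```

Filtering locally like this surfaces invalid entries before the run starts, instead of relying on the warnings in the Actor log.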

#### **Proxy Configuration Examples**

**For small-scale scraping (< 100 repositories):**

```json
"proxyConfiguration": {
  "useApifyProxy": true,
  "apifyProxyGroups": ["DATACENTER"]
}
```

**For large-scale or production scraping (recommended):**

```json
"proxyConfiguration": {
  "useApifyProxy": true,
  "apifyProxyGroups": ["RESIDENTIAL"]
}
```

**No proxy (not recommended for bulk operations):**

```json
// Omit proxyConfiguration entirely - may result in rate limiting
```
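
The choice between the first two configurations can be made in code. This sketch assumes the 100-repository threshold mentioned above is a reasonable switch point; `build_input` is a hypothetical helper, and only the input field names come from this README.

```python
from typing import Any, Dict, Iterable


def build_input(urls: Iterable[str], max_concurrency: int = 5) -> Dict[str, Any]:
    """Build an Actor input, picking the proxy group from the batch size."""
    # Drop duplicates while preserving the original order.
    unique_urls = list(dict.fromkeys(urls))
    # Assumption: fewer than 100 repositories counts as small-scale.
    group = "DATACENTER" if len(unique_urls) < 100 else "RESIDENTIAL"
    return {
        "urls": unique_urls,
        "maxConcurrency": max_concurrency,
        "includeNotFound": False,
        "proxyConfiguration": {
            "useApifyProxy": True,
            "apifyProxyGroups": [group],
        },
    }


if __name__ == "__main__":
    print(build_input(["https://github.com/facebook/react"]))
```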

***

### 📤 Output Format

Each GitHub repository returns comprehensive structured data including enhanced metadata extracted from GitHub's embedded data:

```json
{
  "url": "https://github.com/microsoft/vscode",
  "repoPath": "microsoft/vscode",
  "success": true,
  "data": {
    "url": "https://github.com/microsoft/vscode",
    "type": "repo",
    "description": "Visual Studio Code",
    "website": "https://code.visualstudio.com",
    "forkedfrom": null,
    "tags": ["editor", "typescript", "electron", "ide"],
    "usedby": 250000,
    "watchers": 3200,
    "stars": 162000,
    "forks": 28500,
    "langs": [
      {"name": "TypeScript", "perc": "93.2%"},
      {"name": "JavaScript", "perc": "4.1%"},
      {"name": "CSS", "perc": "1.5%"}
    ],

    // Enhanced data from GitHub's embedded JSON
    "id": 41881900,
    "name": "vscode",
    "full_name": "microsoft/vscode",
    "owner": "microsoft",
    "default_branch": "main",
    "is_fork": false,
    "is_empty": false,
    "is_private": false,
    "is_org_owned": true,
    "created_at": "2015-09-03T20:23:30.000Z",
    "clone_url": "https://github.com/microsoft/vscode.git",
    "ssh_url": "git@github.com:microsoft/vscode.git",
    "api_url": "https://api.github.com/repos/microsoft/vscode",

    // Owner information
    "owner_info": {
      "login": "microsoft",
      "type": "Organization",
      "url": "https://github.com/microsoft",
      "avatar_url": "https://avatars.githubusercontent.com/u/6154722?v=4"
    },

    // File and repository structure
    "file_count": 15420,
    "files": [
      {"name": "README.md", "path": "README.md", "type": "file"},
      {"name": "package.json", "path": "package.json", "type": "file"},
      {"name": "src", "path": "src", "type": "directory"}
    ],

    // Clone and download URLs
    "clone_urls": {
      "https": "https://github.com/microsoft/vscode.git",
      "ssh": "git@github.com:microsoft/vscode.git",
      "github_cli": "gh repo clone microsoft/vscode"
    },
    "download_url": "/microsoft/vscode/archive/refs/heads/main.zip",

    // Branch and commit information
    "ref_info": {
      "name": "main",
      "type": "branch",
      "current_oid": "585acf48f88e399989d54f001029424b2b7c358a",
      "can_edit": false
    },
    "commit_count": "185,234",

    // README information
    "readme_info": {
      "displayName": "README.md",
      "repoName": "vscode",
      "refName": "main",
      "path": "README.md",
      "loaded": true
    },

    // Metadata
    "enriched_at": "2024-12-29T15:30:45.123Z",
    "data_source": "github_scraper_enhanced"
  }
}
```
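
In the sample above, `commit_count` and the `perc` values are strings rather than numbers. The sketch below shows one way to convert them before analysis, assuming real output uses the same formats as the sample (thousands separators as commas, percentages with a trailing `%`); the helper names are illustrative.

```python
from typing import Dict


def parse_count(text: str) -> int:
    """Convert a count such as '185,234' to an integer."""
    return int(text.replace(",", ""))


def parse_percent(text: str) -> float:
    """Convert a percentage such as '93.2%' to a float."""
    return float(text.rstrip("%"))


def language_shares(data: dict) -> Dict[str, float]:
    """Map each language name in `langs` to its share as a float."""
    return {
        lang["name"]: parse_percent(lang["perc"])
        for lang in data.get("langs", [])
    }


if __name__ == "__main__":
    sample = {
        "commit_count": "185,234",
        "langs": [{"name": "TypeScript", "perc": "93.2%"}],
    }
    print(parse_count(sample["commit_count"]), language_shares(sample))
```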

#### **Error Handling**

Failed repositories return structured error information:

```json
{
  "url": "https://github.com/invalid/repo",
  "repoPath": "invalid/repo",
  "success": false,
  "error": "Repository not found or private"
}
```

When `includeNotFound` is enabled, 404 repositories return structured data:

```json
{
  "url": "https://github.com/nonexistent/repo",
  "repoPath": "nonexistent/repo",
  "success": true,
  "data": {
    "exists": false,
    "error": "Repository not found",
    "statusCode": 404
  }
}
```

**Common Error Cases:**

- `Repository not found or private` — Repository doesn't exist or is private
- `Network error` — Connection issues or scraping errors
- Invalid URLs are filtered out before processing with warning logs
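
The three result shapes shown above can be told apart with the `success` flag and the `data.exists` field. This is a minimal sketch assuming dataset items follow exactly those shapes; `classify` and `partition` are hypothetical helper names.

```python
from typing import Dict, Iterable, List


def classify(item: dict) -> str:
    """Return 'ok', 'not_found', or 'failed' for one dataset item."""
    if not item.get("success"):
        return "failed"
    data = item.get("data") or {}
    # With includeNotFound enabled, a 404 arrives as success=true, exists=false.
    if data.get("exists") is False:
        return "not_found"
    return "ok"


def partition(items: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group dataset items by their classification."""
    buckets: Dict[str, List[dict]] = {"ok": [], "not_found": [], "failed": []}
    for item in items:
        buckets[classify(item)].append(item)
    return buckets


if __name__ == "__main__":
    results = partition([
        {"repoPath": "microsoft/vscode", "success": True, "data": {"stars": 162000}},
        {"repoPath": "invalid/repo", "success": False, "error": "Repository not found or private"},
    ])
    print({name: len(group) for name, group in results.items()})
```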

***

### 💼 Common Use Cases

#### **Competitive Analysis & Market Research**

- Analyze competitor repositories and project activity
- Track technology trends through repository statistics
- Research popular libraries and frameworks in specific domains
- Monitor open source project adoption rates

#### **Developer & Technology Research**

- Study programming language usage patterns
- Analyze repository structures and best practices
- Research active open source projects in specific technologies
- Track development activity and contribution patterns

#### **Portfolio & Investment Analysis**

- Research technology companies and their open source contributions
- Analyze developer productivity and project health metrics
- Track repository growth and community engagement
- Identify trending projects and technologies

#### **Academic & Educational Research**

- Study software development patterns and practices
- Analyze open source community dynamics
- Research programming language evolution
- Track educational resource repositories

***

### 📊 Output & Export Options

#### **Dataset Storage**

- All extracted data stored in Apify dataset
- Each repository becomes one dataset item
- Status tracking for successful and failed extractions

#### **Export Formats**

- **JSON** — Raw structured data for API integration
- **CSV** — Spreadsheet-compatible format for analysis
- **Excel** — Formatted spreadsheet with repository data
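
If you fetch the JSON items yourself and want a compact table, the nested `data` object has to be flattened first. The sketch below is one way to do that locally with the standard library. The column selection in `FIELDS` is an arbitrary assumption, and the platform's own CSV export may lay out columns differently.

```python
import csv
import io
from typing import Iterable

# Hypothetical column selection; the field names are taken from the sample output.
FIELDS = ["repoPath", "stars", "forks", "watchers", "default_branch"]


def to_csv(items: Iterable[dict]) -> str:
    """Flatten successful items into CSV text with one row per repository."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for item in items:
        data = item.get("data") or {}
        # Skip failed items and not-found placeholders.
        if not item.get("success") or data.get("exists") is False:
            continue
        row = {"repoPath": item.get("repoPath")}
        row.update({name: data.get(name) for name in FIELDS[1:]})
        writer.writerow(row)
    return buffer.getvalue()


if __name__ == "__main__":
    print(to_csv([
        {"repoPath": "microsoft/vscode", "success": True,
         "data": {"stars": 162000, "forks": 28500, "watchers": 3200,
                  "default_branch": "main"}},
    ]))
```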

#### **Data Processing**

- Clean, validated URLs
- Structured error reporting
- Comprehensive logging for troubleshooting

***

### ⚡ Quick Start Guide

1. **Configure Input**:
   - Add GitHub repository URLs to the `urls` array
   - Set desired `maxConcurrency` (recommended: 5-10)
   - Configure `proxyConfiguration` with `useApifyProxy: true` and appropriate proxy groups for reliable scraping

2. **Run the Actor**:
   - Execute through Apify Console or API
   - Monitor progress through real-time logs
   - Review extracted data in the dataset

3. **Export Results**:
   - Download data in your preferred format
   - Integrate with your existing tools and workflows

***

### 🆘 Support & Feedback

For questions, feature requests, or technical support:

- Visit the [Apify Community Forum](https://forum.apify.com)
- Contact us through the Apify platform
- Submit issues for improvements and bug reports

***

### 🌟 Explore More Actors

✨ **Need more scraping solutions?** Discover additional Actors for web automation and data extraction: 🌐 [Explore More Actors on Apify](https://apify.com/vulnv).

📧 For inquiries or custom development, reach out at apify@vulnv.com.

# Actor input Schema

## `urls` (type: `array`):

List of GitHub repository URLs to scrape. Each URL should be a full GitHub repository URL (e.g., https://github.com/user/repo). Required; accepts 1 to 100,000 unique URLs.

## `proxyConfiguration` (type: `object`):

Select proxies to be used by your scraper.

## `maxConcurrency` (type: `integer`):

Maximum number of concurrent requests to process at once. Allowed range: 1 to 20; default: 5.

## `includeNotFound` (type: `boolean`):

Include repositories that return 404 (not found) in the results with error information. Default: `false`.

## Actor input object example

```json
{
  "urls": [
    "https://github.com/nelsonic/adoro",
    "https://github.com/microsoft/vscode",
    "https://github.com/facebook/react"
  ],
  "proxyConfiguration": {
    "useApifyProxy": true
  },
  "maxConcurrency": 5,
  "includeNotFound": false
}
```
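
An input object can be checked locally before starting a run. The following sketch applies the limits stated in this document's OpenAPI specification (`urls` required, 1 to 100,000 unique strings; `maxConcurrency` between 1 and 20). The `validate_input` helper is hypothetical, and the platform performs its own validation regardless.

```python
from typing import List


def validate_input(actor_input: dict) -> List[str]:
    """Return a list of problems; an empty list means the input looks valid."""
    problems: List[str] = []

    urls = actor_input.get("urls")
    if not isinstance(urls, list) or not urls:
        problems.append("urls must be a non-empty array")
    elif not all(isinstance(url, str) for url in urls):
        problems.append("every entry in urls must be a string")
    else:
        if len(urls) > 100000:
            problems.append("urls may contain at most 100000 items")
        if len(set(urls)) != len(urls):
            problems.append("urls must be unique")

    concurrency = actor_input.get("maxConcurrency", 5)
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if (isinstance(concurrency, bool) or not isinstance(concurrency, int)
            or not 1 <= concurrency <= 20):
        problems.append("maxConcurrency must be an integer from 1 to 20")

    if not isinstance(actor_input.get("includeNotFound", False), bool):
        problems.append("includeNotFound must be a boolean")

    return problems


if __name__ == "__main__":
    print(validate_input({"urls": ["https://github.com/facebook/react"]}))
```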

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "urls": [
        "https://github.com/nelsonic/adoro",
        "https://github.com/microsoft/vscode",
        "https://github.com/facebook/react"
    ],
    "proxyConfiguration": {
        "useApifyProxy": true
    }
};

// Run the Actor and wait for it to finish
const run = await client.actor("vulnv/github-repository-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "urls": [
        "https://github.com/nelsonic/adoro",
        "https://github.com/microsoft/vscode",
        "https://github.com/facebook/react",
    ],
    "proxyConfiguration": { "useApifyProxy": True },
}

# Run the Actor and wait for it to finish
run = client.actor("vulnv/github-repository-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "urls": [
    "https://github.com/nelsonic/adoro",
    "https://github.com/microsoft/vscode",
    "https://github.com/facebook/react"
  ],
  "proxyConfiguration": {
    "useApifyProxy": true
  }
}' |
apify call vulnv/github-repository-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=vulnv/github-repository-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "GitHub Repository Scraper",
        "description": "Scrape and extract GitHub repository data, metadata, statistics, stars, forks, issues, and project information from multiple repositories at once.",
        "version": "1.0",
        "x-build-id": "1GmQzZN4YM47Dx4M2"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/vulnv~github-repository-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-vulnv-github-repository-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/vulnv~github-repository-scraper/runs": {
            "post": {
                "operationId": "runs-sync-vulnv-github-repository-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/vulnv~github-repository-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-vulnv-github-repository-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "required": [
                    "urls"
                ],
                "properties": {
                    "urls": {
                        "title": "GitHub Repository URLs",
                        "minItems": 1,
                        "maxItems": 100000,
                        "uniqueItems": true,
                        "type": "array",
                        "description": "List of GitHub repository URLs to scrape. Each URL should be a full GitHub repository URL (e.g., https://github.com/user/repo).",
                        "items": {
                            "type": "string"
                        }
                    },
                    "proxyConfiguration": {
                        "title": "Proxy configuration",
                        "type": "object",
                        "description": "Select proxies to be used by your scraper."
                    },
                    "maxConcurrency": {
                        "title": "Max Concurrency",
                        "minimum": 1,
                        "maximum": 20,
                        "type": "integer",
                        "description": "Maximum number of concurrent requests to process at once",
                        "default": 5
                    },
                    "includeNotFound": {
                        "title": "Include Not Found Repositories",
                        "type": "boolean",
                        "description": "Include repositories that return 404 (not found) in the results with error information",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
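
For projects that cannot use a client library, the `run-sync-get-dataset-items` path from the specification above can be called directly. This is a hedged sketch using only the Python standard library. It assumes the endpoint returns the dataset items as a JSON array, and the `build_request` and `run_actor` helpers and the 300-second timeout are illustrative choices, not part of the Actor.

```python
import json
import urllib.parse
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "/acts/vulnv~github-repository-scraper/run-sync-get-dataset-items"


def build_request(token: str, actor_input: dict) -> urllib.request.Request:
    """Build the POST request; the token is passed as a query parameter."""
    query = urllib.parse.urlencode({"token": token})
    return urllib.request.Request(
        url=f"{API_BASE}{ACTOR_PATH}?{query}",
        data=json.dumps(actor_input).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def run_actor(token: str, actor_input: dict, timeout: float = 300.0):
    """Send the request and decode the response body as JSON."""
    request = build_request(token, actor_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Only builds the request; call run_actor() with a real token to execute it.
    req = build_request("<YOUR_API_TOKEN>", {"urls": ["https://github.com/facebook/react"]})
    print(req.get_method(), req.full_url)
```

Because the synchronous endpoint holds the connection open until the run finishes, large batches may be better served by the `/runs` path followed by a separate dataset fetch.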
