# Hacker News Stories & Comments Scraper (`taroyamada/hacker-news-intelligence`) Actor

Extract trending tech discussions, nested comment hierarchies, and post scores from Hacker News directly into structured JSON for custom RAG pipelines.

- **URL**: https://apify.com/taroyamada/hacker-news-intelligence.md
- **Developed by:** [naoki anzai](https://apify.com/taroyamada) (community)
- **Categories:** AI, News, Developer tools
- **Stats:** 3 total users, 1 monthly users, 100.0% runs succeeded, 0 bookmarks
- **User rating**: No ratings yet

## Pricing

Pay per event

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in a key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use the [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).


# README

## 📰 Hacker News Scraper

Feed your AI pipelines and custom RAG applications with vetted tech discussions extracted directly from Hacker News. This scraper is built for AI researchers, data scientists, and developer teams who need structured conversational text to train sentiment analysis models and build search aggregators. Because it queries the official Firebase API instead of parsing page HTML, it avoids brittle selectors and returns consistently structured JSON.

Automate your data collection by scheduling the scraper to run daily or weekly. You can scrape the top 100 trending posts along with their complete, nested comment hierarchies. Filter the results with a minimum score threshold so that you collect only posts that have gained traction within the developer community. This targeted extraction suits teams building AI agents that summarize emerging GitHub repositories, track new developer tools, or analyze sentiment around newly released AI research papers.

The scraped data is delivered in a structured format that gives you programmatic access to nested comment trees, author usernames, post scores, and external URLs. You no longer need to scrape unstructured pages by hand or maintain brittle CSS selectors, and the results can be fed straight into your existing data pipeline.

### Store Quickstart

Start with the **Quickstart** template (top stories, 20 items). For tech trend monitoring, use **Top Trends** with minScore=100 and domain analysis.

### Key Features

- 🔥 **Official Firebase API** — hacker-news.firebaseio.com — 10+ year stable
- 📂 **6 story modes** — top, new, best, ask, show, job
- ⭐ **Score filtering** — Minimum score threshold for quality filtering
- 💬 **Comment threads** — Optional nested comment extraction
- 🏷️ **Top domains analysis** — Which domains dominate the front page
- 🔑 **No API key needed** — Public Firebase API

### Use Cases

| Who | Why |
|-----|-----|
| **Tech journalists** | Daily Hacker News trend reports |
| **Startup founders** | Watch which tools/frameworks gain HN traction |
| **VCs/Investors** | Signal for emerging tech and founder announcements |
| **Developer tool companies** | Monitor HN sentiment on products and competitors |
| **AI/ML researchers** | Discover papers and repos trending in tech community |

### Input

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| mode | string | top | top, new, best, ask, show, job |
| maxItems | integer | 100 | Max stories (1-500) |
| minScore | integer | 0 | Minimum score filter |
| includeComments | boolean | false | Include comment threads |

#### Input Example

```json
{
  "mode": "top",
  "maxItems": 30,
  "minScore": 100,
  "includeComments": false
}
````

### Input Examples

#### Example: Top stories snapshot

```json
{
  "mode": "top",
  "maxItems": 30,
  "minScore": 0
}
```

#### Example: High-scoring Show HN posts

```json
{
  "mode": "show",
  "maxItems": 100,
  "minScore": 100
}
```

#### Example: Stories with comment threads

```json
{
  "mode": "best",
  "maxItems": 10,
  "includeComments": true
}
```

### Output

| Field | Type | Description |
|-------|------|-------------|
| `id` | integer | HN story ID |
| `title` | string | Story title |
| `url` | string | External URL (if any) |
| `author` | string | HN username |
| `score` | integer | Upvote score |
| `numComments` | integer | Comment count |
| `createdAt` | string | ISO timestamp |
| `hnUrl` | string | Hacker News thread URL |
| `comments` | object[] | Top comments (if `includeComments` is enabled) |

#### Output Example

```json
{
  "id": 12345678,
  "title": "Claude 4.5 released with new features",
  "url": "https://anthropic.com/news/claude-4-5",
  "author": "user123",
  "score": 523,
  "numComments": 142,
  "createdAt": "2024-04-05T19:34:38.000Z",
  "hnUrl": "https://news.ycombinator.com/item?id=12345678"
}
```
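
The `url` field is enough to build a top-domains summary on the consumer side. The sketch below is a minimal illustration of that idea, not the Actor's own implementation; `top_domains` is a hypothetical helper and the sample items are made up.

```python
from collections import Counter
from urllib.parse import urlparse


def top_domains(items, limit=5):
    """Count how often each external domain appears in a list of story items."""
    counts = Counter()
    for item in items:
        url = item.get("url")
        if not url:
            continue  # text-only posts (for example Ask HN) have no external URL
        host = urlparse(url).hostname or ""
        if host.startswith("www."):
            host = host[4:]  # treat www.example.com and example.com as one domain
        if host:
            counts[host] += 1
    return counts.most_common(limit)


# Made-up sample items that only carry the fields this helper needs.
stories = [
    {"id": 1, "url": "https://www.example.com/a", "score": 120},
    {"id": 2, "url": "https://example.com/b", "score": 80},
    {"id": 3, "url": "https://blog.example.org/post", "score": 45},
    {"id": 4, "score": 30},
]
print(top_domains(stories))
```

Subdomains such as `blog.example.org` are kept separate here; collapsing them to a registered domain would need a public-suffix list, which is out of scope for this sketch.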

### API Usage

Run this Actor programmatically using the Apify API. Replace `YOUR_API_TOKEN` with your token from [Apify Console → Settings → Integrations](https://console.apify.com/account/integrations).

#### cURL

```bash
curl -X POST "https://api.apify.com/v2/acts/taroyamada~hacker-news-intelligence/run-sync-get-dataset-items?token=YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{ "mode": "top", "maxItems": 30, "minScore": 100, "includeComments": false }'
```

#### Python

```python
from apify_client import ApifyClient

client = ApifyClient("YOUR_API_TOKEN")
run = client.actor("taroyamada/hacker-news-intelligence").call(run_input={
    "mode": "top",
    "maxItems": 30,
    "minScore": 100,
    "includeComments": False,  # Python booleans are capitalized
})

for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)
```

#### JavaScript / Node.js

```javascript
import { ApifyClient } from 'apify-client';

const client = new ApifyClient({ token: 'YOUR_API_TOKEN' });
const run = await client.actor('taroyamada/hacker-news-intelligence').call({
  "mode": "top",
  "maxItems": 30,
  "minScore": 100,
  "includeComments": false
});

const { items } = await client.dataset(run.defaultDatasetId).listItems();
console.log(items);
```
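
If you would rather not install a client library, the same synchronous endpoint used in the cURL example can be called with the Python standard library. The sketch below is one possible shape, with some assumptions: `build_run_request` and `run_actor` are hypothetical helper names, and the token is sent in an `Authorization: Bearer` header instead of the `token` query parameter, which to my understanding the Apify API also accepts (verify against the API reference before relying on it).

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_ID = "taroyamada~hacker-news-intelligence"


def build_run_request(token, run_input):
    """Build (but do not send) a run-sync-get-dataset-items request."""
    url = f"{API_BASE}/acts/{ACTOR_ID}/run-sync-get-dataset-items"
    body = json.dumps(run_input).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={
            "Content-Type": "application/json",
            # Assumption: Bearer auth keeps the token out of URLs and logs.
            "Authorization": f"Bearer {token}",
        },
    )


def run_actor(token, run_input, timeout=300):
    """Send the request and return the dataset items as a list of dicts."""
    request = build_run_request(token, run_input)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Calling `run_actor("YOUR_API_TOKEN", {"mode": "top", "maxItems": 30})` performs the network request; `build_run_request` alone is handy for inspecting the request in tests.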

### Tips & Limitations

- Use `mode: "top"` for the front page, `"new"` for breaking submissions.
- Set `minScore: 50` to filter out noise and focus on signal.
- Schedule daily to track trending dev/startup topics.
- Combine with Article Content Extractor to fetch full content of linked stories.
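
The six modes line up naturally with the feed endpoints of the public HN Firebase API. The mapping below is an assumption about how the Actor resolves a mode, shown only to make the tips concrete; `feed_url`, `item_url`, and `filter_by_score` are hypothetical helpers.

```python
HN_API = "https://hacker-news.firebaseio.com/v0"

# Assumed mapping from the Actor's `mode` values to public HN API feeds.
MODE_TO_FEED = {
    "top": "topstories",
    "new": "newstories",
    "best": "beststories",
    "ask": "askstories",
    "show": "showstories",
    "job": "jobstories",
}


def feed_url(mode):
    """Return the HN API URL that lists story IDs for the given mode."""
    if mode not in MODE_TO_FEED:
        raise ValueError(f"unknown mode: {mode!r}")
    return f"{HN_API}/{MODE_TO_FEED[mode]}.json"


def item_url(item_id):
    """Return the HN API URL for a single story or comment."""
    return f"{HN_API}/item/{item_id}.json"


def filter_by_score(items, min_score):
    """Keep only items whose score is at least min_score."""
    return [item for item in items if item.get("score", 0) >= min_score]


print(feed_url("top"))
print(item_url(8863))
```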

### FAQ

**What does score mean?**

The story's points as reported by the HN API. As a rule of thumb, 100+ is typical of front-page stories and 500+ indicates an unusually popular post.

**How often does the HN front page update?**

Rapidly — rankings shift every few minutes. Scrape hourly for trend tracking.

**Can I get old/archived stories?**

Yes, the 'new' mode iterates chronologically; 'best' returns high-score stories over time.

**What's the comment limit?**

All comments under a story are available via the API. Comment-heavy posts slow down extraction.
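
To see why comment-heavy posts are slow: in the public HN API every comment is its own item, and a parent only lists its children's IDs in `kids`, so a thread costs roughly one request per comment. The sketch below walks such a tree with a depth limit. It is an illustration under assumptions, not the Actor's code: `build_thread` and the output keys `author`, `text`, and `replies` are hypothetical, and an in-memory stub stands in for the network.

```python
import json
import urllib.request

HN_API = "https://hacker-news.firebaseio.com/v0"


def fetch_item(item_id):
    """Fetch one item from the public HN API (one HTTP request per item)."""
    with urllib.request.urlopen(f"{HN_API}/item/{item_id}.json", timeout=15) as resp:
        return json.load(resp)


def build_thread(item, fetch, max_depth=3):
    """Return nested comments below `item`, at most `max_depth` levels deep."""
    if max_depth <= 0:
        return []
    thread = []
    for kid_id in item.get("kids", []):
        kid = fetch(kid_id)
        if not kid or kid.get("deleted") or kid.get("dead"):
            continue  # skip removed comments and their replies
        thread.append({
            "id": kid["id"],
            "author": kid.get("by"),
            "text": kid.get("text", ""),
            "replies": build_thread(kid, fetch, max_depth - 1),
        })
    return thread


# Offline demo: a tiny fake item store replaces the real API.
FAKE_ITEMS = {
    1: {"id": 1, "type": "story", "kids": [2, 3]},
    2: {"id": 2, "by": "alice", "text": "First", "kids": [4]},
    3: {"id": 3, "deleted": True},
    4: {"id": 4, "by": "bob", "text": "Reply"},
}
calls = []


def fake_fetch(item_id):
    calls.append(item_id)
    return FAKE_ITEMS.get(item_id)


thread = build_thread(FAKE_ITEMS[1], fake_fetch)
print(len(calls), "fetches for", len(thread), "top-level comment(s)")
```

Passing `fetch_item` instead of `fake_fetch` runs the same walk against the live API; lowering `max_depth` is the simplest way to bound the number of requests.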

**What's the difference vs the official HN API?**

This Actor handles pagination, deduplication, and comment threading, and writes results to an Apify dataset, with no SDK needed.

**Can I search HN by keyword?**

Use the Algolia HN search API for keyword search. This Actor focuses on top/new/best feeds.
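
For reference, a keyword query against the Algolia HN search API can be built as follows. This is a minimal sketch that is separate from the Actor: `search_url` is a hypothetical helper, and the endpoint and the `query`, `tags`, `hitsPerPage`, and `numericFilters` parameters reflect my understanding of the public HN Search API, so check them against its documentation.

```python
from urllib.parse import urlencode

ALGOLIA_SEARCH = "https://hn.algolia.com/api/v1/search"


def search_url(query, min_points=0, hits_per_page=20):
    """Build a story search URL, optionally restricted to a minimum score."""
    params = {
        "query": query,
        "tags": "story",  # restrict results to stories, not comments
        "hitsPerPage": hits_per_page,
    }
    if min_points > 0:
        params["numericFilters"] = f"points>={min_points}"
    return f"{ALGOLIA_SEARCH}?{urlencode(params)}"


print(search_url("Rust", min_points=100))
```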

### Related Actors

News & Content cluster — explore related Apify tools:

- [📰 Google News Scraper](https://apify.com/taroyamada/google-news-scraper) — Scrape Google News articles for any search query via official RSS feed.
- [📰 Article Extractor](https://apify.com/taroyamada/article-content-extractor) — Extract clean article content with title, author, publish date, images from news and blog pages.
- [📄 Website Content Extractor](https://apify.com/taroyamada/website-content-extractor) — Extract clean main content from any webpage as text, markdown, or HTML.
- [📡 RSS Feed Aggregator](https://apify.com/taroyamada/rss-feed-aggregator) — Aggregate multiple RSS and Atom feeds with keyword filtering and deduplication.
- [📡 Reddit All-in-One Scraper](https://apify.com/taroyamada/reddit-all-in-one-scraper) — Scrape Reddit subreddits, posts, comments, user profiles, and search results via public JSON endpoints.
- [🚨 Reddit Keyword Monitor Alerts](https://apify.com/taroyamada/reddit-keyword-monitor-alerts) — Focused Reddit keyword and subreddit monitor built for recurring alerts, snapshot diffing, and webhook handoff.

### Cost

**Pay Per Event**:

- `actor-start`: $0.01 (flat fee per run)
- `dataset-item`: $0.003 per output item

**Example**: 1,000 items = $0.01 + (1,000 × $0.003) = **$3.01**

No subscription required — you only pay for what you use.
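
The same arithmetic can be wrapped in a small helper for budgeting scheduled runs. This is a sketch that uses only the two event prices listed above; `estimate_cost` is a hypothetical name, and `Decimal` is used to avoid floating-point rounding in currency values.

```python
from decimal import Decimal

ACTOR_START_FEE = Decimal("0.01")   # flat fee per run
PRICE_PER_ITEM = Decimal("0.003")   # fee per output item


def estimate_cost(items_per_run, runs=1):
    """Estimated USD cost for `runs` runs that each return `items_per_run` items."""
    return runs * (ACTOR_START_FEE + PRICE_PER_ITEM * items_per_run)


print(estimate_cost(1000))         # one run with 1,000 items
print(estimate_cost(30, runs=30))  # a daily 30-item run for 30 days
```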

### ⭐ Was this helpful?

If this Actor saved you time, please [**leave a ★ rating**](https://apify.com/taroyamada/hacker-news-intelligence/reviews) on Apify Store. It takes 10 seconds, helps other developers discover it, and keeps updates free.

Bug report or feature request? Open an issue on the [Issues tab](https://apify.com/taroyamada/hacker-news-intelligence/issues) of this Actor.

# Actor input Schema

## `mode` (type: `string`):

Operation mode

## `maxItems` (type: `integer`):

Maximum number of items to return

## `minScore` (type: `integer`):

Minimum score threshold for filtering

## `includeComments` (type: `boolean`):

Include comments in output

## `timeoutMs` (type: `integer`):

Request timeout in milliseconds

## `delivery` (type: `string`):

Where to send results: dataset or webhook

## `webhookUrl` (type: `string`):

Webhook URL to POST results to (if delivery=webhook)

## `dryRun` (type: `boolean`):

Run without saving results (for testing)

## Actor input object example

```json
{
  "mode": "top",
  "maxItems": 100,
  "minScore": 0,
  "includeComments": false,
  "timeoutMs": 15000,
  "delivery": "dataset",
  "dryRun": false
}
```
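
Checking an input object locally before starting a run can save a paid `actor-start` event on an obviously invalid request. The sketch below mirrors the enums, ranges, and defaults from the OpenAPI specification in this document. `validate_input` is a hypothetical helper, and the rule that `webhookUrl` must be set when `delivery` is `webhook` is an assumption, since the schema does not mark it as required.

```python
ALLOWED_MODES = {"top", "new", "best", "ask", "show", "job"}
ALLOWED_DELIVERY = {"dataset", "webhook"}


def _is_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly
    return isinstance(value, int) and not isinstance(value, bool)


def validate_input(run_input):
    """Return a list of problems; an empty list means the input looks valid."""
    problems = []
    if run_input.get("mode", "top") not in ALLOWED_MODES:
        problems.append("mode must be one of: " + ", ".join(sorted(ALLOWED_MODES)))
    max_items = run_input.get("maxItems", 100)
    if not _is_int(max_items) or not 1 <= max_items <= 500:
        problems.append("maxItems must be an integer from 1 to 500")
    min_score = run_input.get("minScore", 0)
    if not _is_int(min_score) or min_score < 0:
        problems.append("minScore must be an integer >= 0")
    timeout_ms = run_input.get("timeoutMs", 15000)
    if not _is_int(timeout_ms) or not 1000 <= timeout_ms <= 30000:
        problems.append("timeoutMs must be an integer from 1000 to 30000")
    delivery = run_input.get("delivery", "dataset")
    if delivery not in ALLOWED_DELIVERY:
        problems.append("delivery must be 'dataset' or 'webhook'")
    # Assumption: webhook delivery is only meaningful with a target URL.
    elif delivery == "webhook" and not run_input.get("webhookUrl"):
        problems.append("webhookUrl is required when delivery is 'webhook'")
    return problems


print(validate_input({"mode": "top", "maxItems": 100, "minScore": 0}))
print(validate_input({"mode": "search", "maxItems": 501}))
```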

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace the '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {};

// Run the Actor and wait for it to finish
const run = await client.actor("taroyamada/hacker-news-intelligence").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {}

# Run the Actor and wait for it to finish
run = client.actor("taroyamada/hacker-news-intelligence").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{}' |
apify call taroyamada/hacker-news-intelligence --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=taroyamada/hacker-news-intelligence",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Hacker News Stories & Comments Scraper",
        "description": "Extract trending tech discussions, nested comment hierarchies, and post scores from Hacker News directly into structured JSON for custom RAG pipelines.",
        "version": "0.1",
        "x-build-id": "LgfzuRQ4HvarmWlus"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/taroyamada~hacker-news-intelligence/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-taroyamada-hacker-news-intelligence",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/taroyamada~hacker-news-intelligence/runs": {
            "post": {
                "operationId": "runs-sync-taroyamada-hacker-news-intelligence",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/taroyamada~hacker-news-intelligence/run-sync": {
            "post": {
                "operationId": "run-sync-taroyamada-hacker-news-intelligence",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "mode": {
                        "title": "Mode",
                        "enum": [
                            "top",
                            "new",
                            "best",
                            "ask",
                            "show",
                            "job"
                        ],
                        "type": "string",
                        "description": "Operation mode",
                        "default": "top"
                    },
                    "maxItems": {
                        "title": "Max Items",
                        "minimum": 1,
                        "maximum": 500,
                        "type": "integer",
                        "description": "Maximum number of items to return",
                        "default": 100
                    },
                    "minScore": {
                        "title": "Min Score",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Minimum score threshold for filtering",
                        "default": 0
                    },
                    "includeComments": {
                        "title": "Include Comments",
                        "type": "boolean",
                        "description": "Include comments in output",
                        "default": false
                    },
                    "timeoutMs": {
                        "title": "Timeout (ms)",
                        "minimum": 1000,
                        "maximum": 30000,
                        "type": "integer",
                        "description": "Request timeout in milliseconds",
                        "default": 15000
                    },
                    "delivery": {
                        "title": "Delivery",
                        "enum": [
                            "dataset",
                            "webhook"
                        ],
                        "type": "string",
                        "description": "Where to send results: dataset or webhook",
                        "default": "dataset"
                    },
                    "webhookUrl": {
                        "title": "Webhook URL",
                        "type": "string",
                        "description": "Webhook URL to POST results to (if delivery=webhook)"
                    },
                    "dryRun": {
                        "title": "Dry Run",
                        "type": "boolean",
                        "description": "Run without saving results (for testing)",
                        "default": false
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
