# Substack Scraper | All-In-One (`fatihtahta/substack-scraper`) Actor

Get full articles, user profiles, and search results with All-in-One Substack Scraper. Extract rich data including titles, bios, subscriber counts, social links, and engagement metrics. Ideal for market research, creator discovery, trend tracking, and audience analysis.

- **URL**: https://apify.com/fatihtahta/substack-scraper.md
- **Developed by:** [Fatih Tahta](https://apify.com/fatihtahta) (community)
- **Categories:** Social media, News, Developer tools
- **Stats:** 135 total users, 37 monthly users, 100.0% runs succeeded, 5 bookmarks
- **User rating**: No ratings yet

## Pricing

From $1.99 / 1,000 results

This Actor is paid per event. You are not charged for Apify platform usage; you pay only a fixed price for specific events.
Because this Actor supports Apify Store discounts, the price decreases on higher subscription plans.

Learn more: https://docs.apify.com/platform/actors/running/actors-in-store#pay-per-event

## What's an Apify Actor?

Actors are software tools running on the Apify platform, for all kinds of web data extraction and automation use cases.
In Batch mode, an Actor accepts a well-defined JSON input, performs an action which can take anything from a few seconds to a few hours,
and optionally produces a well-defined JSON output, datasets with results, or files in key-value store.
In Standby mode, an Actor provides a web server which can be used as a website, API, or an MCP server.
Actors are written with capital "A".

## How to integrate an Actor?

If asked about integration, you help developers integrate Actors into their projects.
You adapt to their stack and deliver integrations that are safe, well-documented, and production-ready.
The best way to integrate Actors is as follows.

In JavaScript/TypeScript projects, use the official [JavaScript/TypeScript client](https://docs.apify.com/api/client/js.md):

```bash
npm install apify-client
```

In Python projects, use the official [Python client library](https://docs.apify.com/api/client/python.md):

```bash
pip install apify-client
```

In shell scripts, use [Apify CLI](https://docs.apify.com/cli/docs.md):

```bash
# macOS / Linux
curl -fsSL https://apify.com/install-cli.sh | bash
# Windows
irm https://apify.com/install-cli.ps1 | iex
```

In AI frameworks, you might use the [Apify MCP server](https://docs.apify.com/platform/integrations/mcp.md).

If your project is in a different language, use the [REST API](https://docs.apify.com/api/v2.md).

For usage examples, see the [API](#api) section below.

For more details, see Apify documentation as [Markdown index](https://docs.apify.com/llms.txt) and [Markdown full-text](https://docs.apify.com/llms-full.txt).
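As a rough illustration of the Python route, the sketch below builds a small input and runs this Actor through `apify-client`. The helper names `build_run_input` and `run_actor` and the `APIFY_TOKEN` environment variable are assumptions made for this example; the input field names are taken from the Input Parameters table in the README.

```python
import os

ACTOR_ID = "fatihtahta/substack-scraper"


def build_run_input(queries, limit=20):
    """Build a small keyword-search input (field names from the input table)."""
    return {
        "queries": list(queries),
        "result_type": "posts",
        "enrich_data": True,
        "limit": limit,
    }


def run_actor(run_input):
    """Run the Actor and return its dataset items as a list of dicts."""
    # Imported lazily so this module loads even without apify-client installed.
    from apify_client import ApifyClient

    # Assumption: the API token is stored in the APIFY_TOKEN environment variable.
    client = ApifyClient(os.environ["APIFY_TOKEN"])
    run = client.actor(ACTOR_ID).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())


# Example usage (needs a valid token and network access):
# items = run_actor(build_run_input(["developer tools"], limit=5))
```

Keeping input construction separate from the network call makes it possible to check the input without starting a paid run.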


# README

## Substack Scraper | All-In-One

**Slug:** `fatihtahta/substack-scraper`

### Overview

Substack Scraper | All-In-One collects structured public data from [https://substack.com](https://substack.com), including search results, publications, people, posts, notes, comments, and reply-thread content where supported by the selected input. Records can include core identifiers, URLs, titles, publication metadata, author information, timestamps, engagement counts, and richer content fields when detail enrichment is enabled. Substack is a widely used publishing and newsletter platform, which makes its public content useful for market intelligence, editorial analysis, research, enrichment, and monitoring workflows. This Actor is designed for repeatable, automated collection with consistent JSON output that can be used directly in downstream systems. It suits recurring data acquisition where users need a dependable, operationally clear workflow for collecting current public records over time.

### Why Use This Actor

- **Market research and analytics teams:** collect structured Substack posts, publications, people, and recent discussions for market intelligence, topic tracking, coverage analysis, and operational reporting.
- **Product and content teams:** monitor subject trends, publication activity, post velocity, and audience-facing content patterns to support editorial planning and content benchmarking.
- **Developers and data engineering teams:** feed normalized JSON records into ETL jobs, search indexes, data warehouses, enrichment pipelines, and other downstream systems with minimal reshaping.
- **Lead generation and enrichment workflows:** gather public creator, publication, and content attributes that can supplement prospecting, segmentation, and account research processes.
- **Monitoring and competitive tracking teams:** run recurring collections to watch for newly published content, public conversation activity, and movement across tracked keywords or known pages.

### Common Use Cases

- **Market intelligence:** track topic coverage, newly published posts, publication activity, and audience-visible conversation around specific themes.
- **Lead generation:** build curated lists of relevant writers, publications, or topic-specific public profiles for outreach research or enrichment.
- **Competitive monitoring:** watch how tracked publications or authors publish, position, and discuss topics over time.
- **Catalog and directory building:** populate internal databases with structured public publication, author, and post records.
- **Data enrichment:** append fresh public content and publication attributes to CRM, BI, or analytics datasets.
- **Recurring reporting:** schedule periodic runs to produce current datasets for dashboards, alerts, and trend reviews.
- **Conversation tracking:** collect notes, comments, and optional reply threads when the goal is to analyze public engagement rather than long-form posts alone.

### Quick Start

1. Choose your input strategy: add known `startUrls`, enter one or more `queries`, or combine both when you need discovery and direct collection in the same run.
2. For a first validation run, set a small `limit` so you can quickly confirm the output shape and record types.
3. Select the appropriate `result_type` for keyword searches, and add optional filters such as `publication_date`, `within_days`, or `language` if you need a narrower scope.
4. Enable `enrich_data` when you want fuller records, and turn on `get_replies` only when reply-thread collection is relevant to explicit note or comment-thread targets.
5. Run the Actor in Apify Console and inspect the first dataset records.
6. Increase coverage, refine filters, or add a schedule after the dataset matches your intended workflow.

### Input Parameters

Use `startUrls` for direct collection from known Substack pages, `queries` for keyword-driven discovery, or combine both in a single run.

| Parameter | Type | Description | Default |
| --- | --- | --- | --- |
| `startUrls` | array of strings | Exact Substack pages to collect from directly. Publication homepages are collected through the archive feed (`sort=new`, paged by `offset`/`limit`), while note/comment pages are treated as thread sources. Useful for known search pages, publication profiles, individual posts, notes, comment pages, or custom-domain pages. | – |
| `queries` | array of strings | One or more keyword searches. Each query becomes its own search target for discovery-oriented runs. | – |
| `result_type` | string | Result set to return for keyword searches only. Allowed values: `top`, `recent`, `posts`, `publications`, `people`. | `top` |
| `publication_date` | string | Optional recency filter for keyword-based post results. Allowed values: `last_day` (24 hours), `last_week` (7 days), `last_month`, `last_year`. | – |
| `within_days` | integer | Keep only posts published within the last N days. Minimum value: `1` day. | – |
| `language` | string | Restrict keyword-based post results to a specific language. Allowed values: `ar`, `zh`, `cs`, `nl`, `en`, `fr`, `de`, `el`, `hi`, `hu`, `id`, `it`, `ja`, `ko`, `la`, `no`, `pl`, `pt`, `ro`, `ru`, `es`, `sv`, `th`, `tr`, `vi`. | – |
| `enrich_data` | boolean | Fetch fuller records for supported posts, publications, or people instead of lighter search-result entries. Useful when detail is more important than run speed. | `true` |
| `get_replies` | boolean | Collect reply threads for supported note or comment detail pages. It does not expand publication homepage discovery into comment-thread output. | `false` |
| `max_replies` | integer | Maximum replies to save per post or thread. Use `0` to suppress replies when reply collection is enabled. | – |
| `limit` | integer | Per-input cap on the number of results saved for each `startUrls` entry or search query. Minimum value: `1`. Leave empty for broader collection. | – |
| `proxyConfiguration` | object | Connection settings for Apify Proxy or custom proxy configuration when you need a different routing path. | `{"useApifyProxy":true,"apifyProxyGroups":["RESIDENTIAL"]}` |
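The constraints in the table can be checked locally before a run is started. The following is a minimal sketch, not the Actor's own validation: `validate_input` is a hypothetical helper, and the rule that at least one of `startUrls` or `queries` must be present is an assumption inferred from the description above.

```python
ALLOWED_RESULT_TYPES = {"top", "recent", "posts", "publications", "people"}
ALLOWED_PUBLICATION_DATES = {"last_day", "last_week", "last_month", "last_year"}


def validate_input(run_input):
    """Return a list of problems found in run_input (empty list means OK)."""
    problems = []
    # Assumption: a run needs at least one of startUrls or queries.
    if not run_input.get("startUrls") and not run_input.get("queries"):
        problems.append("provide startUrls, queries, or both")
    # result_type defaults to "top" according to the table.
    result_type = run_input.get("result_type", "top")
    if result_type not in ALLOWED_RESULT_TYPES:
        problems.append(f"unknown result_type: {result_type!r}")
    publication_date = run_input.get("publication_date")
    if publication_date is not None and publication_date not in ALLOWED_PUBLICATION_DATES:
        problems.append(f"unknown publication_date: {publication_date!r}")
    # Both within_days and limit have a documented minimum of 1.
    for name in ("within_days", "limit"):
        value = run_input.get(name)
        if value is None:
            continue
        if isinstance(value, bool) or not isinstance(value, int) or value < 1:
            problems.append(f"{name} must be an integer >= 1")
    return problems
```

For example, `validate_input({"queries": ["ai"], "limit": 0})` reports the `limit` problem and nothing else.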

### Choosing Inputs

Use `startUrls` when you already know the exact Substack pages you want to collect, such as a specific publication, post, search page, note, or curated list of known targets. Publication homepage `startUrls` collect posts from that publication through the archive API rather than a mixed homepage snapshot. Use `queries` when you want discovery-oriented collection based on topics, names, brands, or keywords.

Narrower filters such as `result_type`, `publication_date`, `within_days`, and `language` produce more targeted datasets, while broader settings improve discovery and surface more varied records. If you are validating a new workflow, start with a small `limit`, inspect the first records, and then increase coverage after confirming that the returned types and fields match your downstream needs.

### Example Inputs

#### Scenario: keyword discovery for recent posts

```json
{
  "queries": ["artificial intelligence", "developer tools"],
  "result_type": "posts",
  "publication_date": "last_week",
  "language": "en",
  "enrich_data": true,
  "limit": 20
}
````

#### Scenario: direct URL collection from known pages

```json
{
  "startUrls": [
    "https://substack.com/search/ai?searching=top",
    "https://www.platformer.news/"
  ],
  "enrich_data": true,
  "get_replies": false,
  "within_days": 30,
  "limit": 15
}
```

#### Scenario: targeted monitoring with reply collection

```json
{
  "queries": ["movie"],
  "result_type": "recent",
  "within_days": 7,
  "enrich_data": true,
  "get_replies": true,
  "max_replies": 25,
  "limit": 10
}
```
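Because `limit` is a per-input cap, the upper bound on saved results grows with the number of `startUrls` entries and queries. The sketch below computes that bound for an input; `max_records` is a hypothetical helper, and it ignores reply records, whose accounting against `limit` is not specified in this README.

```python
def max_records(run_input):
    """Upper bound on saved results, or None when no limit is set."""
    limit = run_input.get("limit")
    if limit is None:
        return None  # No cap: the run collects as broadly as the source allows.
    # Each startUrls entry and each query is capped separately.
    targets = len(run_input.get("startUrls", [])) + len(run_input.get("queries", []))
    return targets * limit
```

Applied to the first scenario above (two queries, `limit` of 20), the bound is 40 records.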

### Output

#### Output destination

The Actor writes results to an Apify dataset as JSON records. The dataset is designed for direct consumption by analytics tools, ETL pipelines, and downstream APIs without post-processing.

Each item contains a stable record envelope plus a type-specific payload when multiple entity types are returned in the same run.

#### Record envelope (all items)

- **type** *(string, required)*: record family such as `post`, `comment`, or `reply`.
- **id** *(string, required)*: primary record identifier. Treat this as an opaque identifier in downstream systems: the dataset serializes it as a string, and some record families use a prefixed form such as `c-247726222`.
- **url** *(string, required)*: canonical public URL for the record.

Recommended idempotency key: `type + ":" + id`

Use that composite key for deduplication and upserts when loading repeated runs into warehouses, search indexes, CRMs, or operational databases. The stable envelope makes records easier to merge, deduplicate, and sync across recurring collections.
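A minimal sketch of the composite key in use, assuming an in-memory `dict` stands in for the warehouse table or index. The helper names `record_key` and `upsert` are hypothetical and not part of the Actor.

```python
def record_key(item):
    """Build the recommended idempotency key: type + ":" + id."""
    return f'{item["type"]}:{item["id"]}'


def upsert(store, items):
    """Insert or overwrite items in store; return how many keys were new."""
    added = 0
    for item in items:
        key = record_key(item)
        if key not in store:
            added += 1
        # Later runs overwrite earlier copies of the same record.
        store[key] = item
    return added
```

Loading the same record from a second run overwrites the stored copy instead of creating a duplicate, so `upsert` returns `0` for it.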

#### Examples

Notes:

- The examples below match the enriched dataset structure currently returned when `enrich_data` is enabled.
- Long `body_html` strings are abbreviated inside the value for readability; otherwise the example records show the full field shape currently returned by the Actor.

Example: post (`type = "post"`)

```json
{
  "id": "193819329",
  "type": "post",
  "title": "Substack Finally Introduced Notes Scheduling. This Is How To Do It.",
  "description": "Plus, at least FIVE ways you can use it to help your Substack grow and experiment intentionally.",
  "published_at": "2026-04-11T12:02:55.328000+00:00",
  "updated_at": "2026-04-11T12:10:13.090000+00:00",
  "url": "https://unstackit.substack.com/p/schedule-substack-notes",
  "author_name": "Kristi Keller \ud83c\udde8\ud83c\udde6",
  "author_handle": "kristikeller",
  "publication_name": "Unstack Substack",
  "publication_url": "https://unstackit.substack.com",
  "main_post_id": "193819329",
  "main_post_title": "Substack Finally Introduced Notes Scheduling. This Is How To Do It.",
  "main_post_url": "https://unstackit.substack.com/p/schedule-substack-notes",
  "main_post_published_at": "2026-04-11T12:02:55.328000+00:00",
  "subtitle": "Plus, at least FIVE ways you can use it to experiment and help your Substack grow intentionally.",
  "truncated_body_text": "This question has come up multiple times from several clients since Substack introduced the Notes feature:",
  "image_url": "https://substack-post-media.s3.amazonaws.com/public/images/626d8a7f-9db9-4f6e-933a-6663a9f9fbe8_2240x1260.png",
  "body_text": "This question has come up multiple times from several clients since Substack introduced the Notes feature:",
  "body_html": "<p>This question has come up multiple times from several clients since Substack introduced the Notes feature:</p><blockquote><p><em><strong>\u201cCan we schedule our notes to publish at a later date?\u201d</strong></em></p></blockquote><p>...</p>",
  "word_count": 549,
  "reaction_count": 60,
  "comment_count": 19,
  "child_comment_count": 11,
  "restack_count": 8,
  "post_tags": [
    {
      "id": "871d1cd1-6fa0-4a94-bf79-d4657addba2d",
      "publication_id": 2328962,
      "name": "Substack Notes",
      "slug": "substack-notes",
      "hidden": false
    },
    {
      "id": "ed4d7b60-078b-404f-a2e9-21f49cf290f6",
      "publication_id": 2328962,
      "name": "Substack How-To",
      "slug": "substack-how-to",
      "hidden": false
    }
  ],
  "published_bylines": [
    {
      "id": 165322060,
      "name": "Kristi Keller \ud83c\udde8\ud83c\udde6",
      "handle": "kristikeller",
      "photo_url": "https://substackcdn.com/image/fetch/$s_!TdHE!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0bee1713-6b2a-45fa-bbb6-e9437756af86_316x326.png",
      "bio": "Certified independent liberation advisor & builder of highly engaged Substack communities.",
      "is_guest": false
    }
  ],
  "audio_items": [
    {
      "post_id": 193819329,
      "voice_id": "en-US-AlloyTurboMultilingualNeural",
      "audio_url": "https://substack-video.s3.amazonaws.com/video_upload/post/193819329/tts/12334368-98cb-45d6-a507-805e3a789351/en-US-AlloyTurboMultilingualNeural.mp3",
      "type": "tts",
      "status": "completed"
    }
  ],
  "comments": [
    {
      "id": 241695637,
      "type": "comment",
      "body_text": "I scheduled a note everyday this week about tulips \ud83c\udf37 and it was quick and easy ...",
      "published_at": "2026-04-11T12:34:50.936000+00:00",
      "author_name": "Cerina Triglavcanin",
      "author_handle": "cerinatrig",
      "photo_url": "https://substack-post-media.s3.amazonaws.com/public/images/c3d05e18-ea68-4208-981e-47638e58953b_405x405.png",
      "reaction_count": 3,
      "children_count": 2,
      "childrenSummary": "2 replies by Kristi Keller \ud83c\udde8\ud83c\udde6 and others"
    }
  ],
  "post": {
    "audience": "everyone",
    "canonical_url": "https://unstackit.substack.com/p/schedule-substack-notes",
    "id": 193819329,
    "post_date": "2026-04-11T12:02:55.328Z",
    "updated_at": "2026-04-11T12:10:13.090Z",
    "publication_id": 2328962,
    "search_engine_description": "FIVE ways you can use Notes scheduling to help your Substack grow and experiment intentionally.",
    "slug": "schedule-substack-notes",
    "subtitle": "Plus, at least FIVE ways you can use it to experiment and help your Substack grow intentionally.",
    "title": "Substack Finally Introduced Notes Scheduling. This Is How To Do It.",
    "type": "newsletter",
    "write_comment_permissions": "everyone",
    "is_published": true,
    "restacks": 8,
    "reactions": {
      "\u2764": 60
    }
  },
  "publication": {
    "id": 2328962,
    "name": "Unstack Substack",
    "subdomain": "unstackit",
    "logo_url": "https://substack-post-media.s3.amazonaws.com/public/images/9ec83ce1-c3b2-4b6f-87d9-41902e60a69f_500x500.png",
    "homepage_type": "magaziney",
    "community_enabled": true,
    "payments_state": "enabled"
  },
  "publication_settings": {
    "enable_new_publisher": true,
    "podcast_enabled": false
  },
  "page_rank": 0
}
```

Example: comment (`type = "comment"`)

```json
{
  "id": "c-247726222",
  "type": "comment",
  "title": "Movies that Made Me Love Movies: This one I got to take a friend to the drive-in and see it! I was so excited! ...",
  "description": "Movies that Made Me Love Movies: This one I got to take a friend to the drive-in and see it! I was so excited! ...",
  "published_at": "2026-04-23T00:49:31.944000+00:00",
  "url": "https://substack.com/@backyardmoviecritic/note/c-247726222",
  "author_name": "Backyard Movie Critic",
  "author_handle": "backyardmoviecritic",
  "publication_name": "Backyard Movie Critic",
  "publication_url": "https://backyardmoviecritic.substack.com",
  "image_url": "https://substack-post-media.s3.amazonaws.com/public/images/1edaf33e-19c0-4c26-ab04-0df6f0e34e3d_452x452.png",
  "body_text": "Movies that Made Me Love Movies: This one I got to take a friend to the drive-in and see it! I was so excited! ...",
  "body_json": {
    "type": "doc",
    "attrs": {
      "schemaVersion": "v1"
    },
    "content": [
      {
        "type": "paragraph",
        "content": [
          {
            "type": "text",
            "text": "Movies that Made Me Love Movies:"
          }
        ]
      }
    ]
  },
  "attachments": [
    {
      "id": "3c033bf5-9636-4857-b762-dc298a726531",
      "type": "image",
      "imageUrl": "https://substack-post-media.s3.amazonaws.com/public/images/8057c6bf-2643-42a2-a5bd-800cc068e831_700x1050.png",
      "imageWidth": 700,
      "imageHeight": 1050,
      "explicit": false
    }
  ],
  "reaction_count": 40,
  "comment_count": 6,
  "restack_count": 2,
  "context": {
    "timestamp": "2026-04-23T00:49:31.944Z",
    "type": "note",
    "source": "db-note"
  },
  "comment": {
    "id": 247726222,
    "body": "Movies that Made Me Love Movies: This one I got to take a friend to the drive-in and see it! I was so excited! ...",
    "body_json": {
      "type": "doc",
      "attrs": {
        "schemaVersion": "v1"
      },
      "content": [
        {
          "type": "paragraph",
          "content": [
            {
              "type": "text",
              "text": "Movies that Made Me Love Movies:"
            }
          ]
        }
      ]
    },
    "type": "feed",
    "date": "2026-04-23T00:49:31.944Z",
    "name": "Backyard Movie Critic",
    "photo_url": "https://substack-post-media.s3.amazonaws.com/public/images/b427d26c-e0cb-4908-b8b6-2033359c2565_399x399.jpeg",
    "bio": "Gen X film junkie: silent horror to streaming chaos. Grew up on VHS gore and HBO late nights.",
    "handle": "backyardmoviecritic",
    "reaction_count": 40,
    "reactions": {
      "\u2764": 40
    },
    "restacks": 2,
    "children_count": 6,
    "attachments": [
      {
        "id": "3c033bf5-9636-4857-b762-dc298a726531",
        "type": "image",
        "imageUrl": "https://substack-post-media.s3.amazonaws.com/public/images/8057c6bf-2643-42a2-a5bd-800cc068e831_700x1050.png",
        "imageWidth": 700,
        "imageHeight": 1050,
        "explicit": false
      }
    ],
    "user_primary_publication": {
      "id": 7618357,
      "subdomain": "backyardmoviecritic",
      "name": "Backyard Movie Critic",
      "logo_url": "https://substack-post-media.s3.amazonaws.com/public/images/1edaf33e-19c0-4c26-ab04-0df6f0e34e3d_452x452.png",
      "payments_state": "disabled"
    },
    "language": "en"
  },
  "publication": {
    "id": 7618357,
    "subdomain": "backyardmoviecritic",
    "name": "Backyard Movie Critic",
    "logo_url": "https://substack-post-media.s3.amazonaws.com/public/images/1edaf33e-19c0-4c26-ab04-0df6f0e34e3d_452x452.png",
    "payments_state": "disabled"
  },
  "page_rank": 1
}
```

Example: reply (`type = "reply"`)

```json
{
  "id": "c-205156724",
  "type": "reply",
  "title": "This is such a thoughtful comment, thank you.",
  "published_at": "2026-01-26T01:26:46.936000+00:00",
  "url": "https://substack.com/@vanessathe/note/c-205156724",
  "author_name": "Vanessa",
  "author_handle": "vanessathe",
  "main_post_id": "173236288",
  "main_post_title": "eight films that made me fall madly in love with films",
  "main_post_url": "https://vanessateh.substack.com/p/eight-films-that-made-me-fall-madly",
  "main_post_published_at": "2026-01-16T05:06:39.097000+00:00",
  "body_text": "This is such a thoughtful comment, thank you.",
  "context": {
    "timestamp": "2026-01-26T01:26:46.936Z",
    "type": "reply"
  },
  "comment": {
    "id": 205156724,
    "type": "reply",
    "date": "2026-01-26T01:26:46.936Z",
    "body": "This is such a thoughtful comment, thank you.",
    "name": "Vanessa",
    "handle": "vanessathe",
    "children_count": 0
  },
  "comment_count": 0,
  "page_rank": 1
}
```
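The `body_json` field in the comment example is a nested document of typed nodes. As an illustration, the sketch below walks such a tree and collects its plain text. It assumes only the node shapes visible in the example (`doc`, `paragraph`, `text`, with children under `content`); other node types may exist and are handled generically by recursing into `content`. `extract_text` is a hypothetical helper.

```python
def extract_text(node):
    """Recursively collect the text held in a body_json node."""
    if not isinstance(node, dict):
        return ""
    if node.get("type") == "text":
        return node.get("text", "")
    children = node.get("content") or []
    parts = [extract_text(child) for child in children]
    # Top-level blocks of the doc are joined by newlines, inline nodes directly.
    separator = "\n" if node.get("type") == "doc" else ""
    return separator.join(parts)
```

When `body_text` is present it is usually the simpler choice; a walker like this is mainly useful when only the structured form is available.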

### Field Reference

#### Record type: `post`

- **id** *(string, required)*: unique post identifier.
- **type** *(string, required)*: record type, always `post`.
- **url** *(string, required)*: public post URL.
- **title** *(string, required)*: post title.
- **description** *(string, optional)*: summary or teaser text.
- **subtitle** *(string, optional)*: secondary heading or subtitle.
- **published\_at** *(string, required)*: publication timestamp in ISO 8601 format.
- **author\_name / author\_handle** *(string, optional)*: primary author display name and handle.
- **publication\_name / publication\_url** *(string, optional)*: source publication name and homepage URL.
- **main\_post\_id / main\_post\_title / main\_post\_url / main\_post\_published\_at** *(string, optional)*: parent post references when applicable.
- **image\_url** *(string, optional)*: primary image or media preview URL.
- **truncated\_body\_text** *(string, optional)*: shortened plain-text excerpt from the post detail payload.
- **body\_text** *(string, optional)*: plain-text body content.
- **body\_html / body\_json** *(string or object, optional)*: HTML and structured rich-text body content.
- **word\_count** *(integer, optional)*: word count for the post body.
- **child\_comment\_count** *(integer, optional)*: nested reply count returned by post detail endpoints.
- **reaction\_count / comment\_count / restack\_count** *(integer, optional)*: visible engagement counters.
- **post\_tags / published\_bylines / audio\_items** *(array, optional)*: structured post tags, bylines, and audio metadata when returned by enriched post detail.
- **comments** *(array, optional)*: trimmed top-level comment preview objects included on enriched post records.
- **post / publication / publication\_settings** *(object, optional)*: cleaned public detail payloads preserved from the enriched response.
- **page\_rank** *(integer, optional)*: rank position within the collected result page.
- **attributes** *(object, optional)*: generic normalized attributes bucket used for non-Substack-like fields such as price, brand, SKU, currency, or location when available.

#### Record type: `comment`

- **id** *(string, required)*: unique comment or note identifier.
- **type** *(string, required)*: record type, always `comment`.
- **url** *(string, required)*: public URL of the note or comment.
- **title** *(string, optional)*: short title derived from the content.
- **description** *(string, optional)*: summary text or repeat of the body.
- **published\_at** *(string, required)*: publication timestamp in ISO 8601 format.
- **author\_name / author\_handle** *(string, optional)*: primary author display name and handle.
- **publication\_name / publication\_url** *(string, optional)*: related publication name and URL.
- **main\_post\_id / main\_post\_title / main\_post\_url / main\_post\_published\_at** *(string, optional)*: parent post references when the comment belongs to a post thread.
- **subtitle** *(string, optional)*: subtitle when the record is comment-like but backed by a richer content item.
- **image\_url** *(string, optional)*: preview image URL when present.
- **body\_text** *(string, optional)*: plain-text comment or note body.
- **body\_html / body\_json** *(string or object, optional)*: HTML or structured rich-text body content when the source exposes it.
- **attachments** *(array, optional)*: media attachments associated with the comment or note.
- **word\_count** *(integer, optional)*: word count when available.
- **reaction\_count / comment\_count / restack\_count** *(integer, optional)*: visible engagement counters.
- **context** *(object, optional)*: cleaned note or thread context metadata such as source type and timestamp.
- **comment** *(object, optional)*: cleaned enriched comment payload, including public author-facing fields, attachments, and visible counters.
- **parent\_comments** *(array, optional)*: trimmed parent-thread metadata when the detail endpoint includes thread ancestry.
- **can\_reply / is\_muted** *(boolean, optional)*: public interaction state returned by comment detail endpoints.
- **publication / author / post** *(object, optional)*: cleaned public nested entities when exposed by the source detail endpoint.
- **search\_score** *(number, optional)*: ranking score returned with search results.
- **page\_rank** *(integer, optional)*: rank position within the collected result page.
- **attributes** *(object, optional)*: generic normalized attributes bucket used for non-Substack-like fields when available.

#### Record type: `reply`

- **id** *(string, required)*: unique reply identifier.
- **type** *(string, required)*: record type, always `reply`.
- **url** *(string, required)*: public URL of the reply.
- **author\_name / author\_handle** *(string, optional)*: primary author display name and handle.
- **main\_post\_id / main\_post\_title / main\_post\_url / main\_post\_published\_at** *(string, optional)*: parent post references for the thread.
- **body\_text / body\_html / body\_json** *(string or object, optional)*: reply content in plain text, HTML, or structured rich-text form.
- **attachments** *(array, optional)*: media attachments associated with the reply.
- **context** *(object, optional)*: cleaned thread context metadata.
- **comment** *(object, optional)*: cleaned enriched reply payload.
- **publication / author / post** *(object, optional)*: cleaned nested entities when available from the detail endpoint.
- **reaction\_count / comment\_count / restack\_count** *(integer, optional)*: visible engagement counters when present.
- **page\_rank** *(integer, optional)*: rank position within the collected result page.

### Data Quality, Guarantees, And Handling

- **Structured records:** results are normalized into predictable JSON objects for downstream use.
- **Best-effort extraction:** fields may vary by region, session, public availability, or source-side interface changes.
- **Optional fields:** optional fields may be missing or null, so null-check them in downstream code.
- **Deduplication:** use `type + ":" + id` as the recommended deduplication key.
- **Freshness:** results reflect the publicly available data at run time.
- **Repeated runs:** use the recommended idempotency key when syncing data into warehouses, CRMs, or search indexes.
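One way to handle optional fields defensively is to read them with defaults when flattening a record into a row. This is a sketch under assumptions: `flatten_record` and the choice of columns are illustrative, and only the envelope fields (`type`, `id`, `url`) are treated as always present.

```python
def flatten_record(item):
    """Flatten a dataset item into a row, tolerating missing optional fields."""
    # The nested publication object is optional and may be absent or null.
    publication = item.get("publication") or {}
    return {
        "type": item["type"],
        "id": item["id"],
        "url": item["url"],
        "title": item.get("title") or "",
        "author_handle": item.get("author_handle"),
        # Missing counters become 0 here; keep None instead if the
        # difference between "absent" and "zero" matters downstream.
        "reaction_count": item.get("reaction_count") or 0,
        "comment_count": item.get("comment_count") or 0,
        "publication_name": publication.get("name") or item.get("publication_name"),
    }
```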

### Tips For Best Results

- Start with a small `limit` to validate the output shape before scaling up.
- Use `startUrls` for known high-value targets and `queries` for broader discovery.
- Choose `result_type` carefully because it affects query-based searches, not direct `startUrls`.
- Apply `publication_date`, `within_days`, or `language` only when you need tighter targeting.
- Turn on `enrich_data` when downstream systems need fuller records rather than lightweight search entries.
- Enable `get_replies` only for runs where note or comment thread depth matters.
- Use `type + ":" + id` as your stable deduplication key across repeated runs.
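To validate the output shape of a small first run, it can help to count records per `type` and list the fields each type actually carries. A minimal sketch, assuming `items` is a list of dataset records already loaded as dicts; `summarize` is a hypothetical helper.

```python
from collections import Counter


def summarize(items):
    """Return (records per type, sorted field names seen per type)."""
    counts = Counter(item.get("type", "unknown") for item in items)
    fields = {}
    for item in items:
        record_type = item.get("type", "unknown")
        fields.setdefault(record_type, set()).update(item.keys())
    return counts, {name: sorted(keys) for name, keys in fields.items()}
```

Comparing the field lists against the Field Reference shows quickly whether a given input produces the record types and attributes a downstream system expects.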

### How to Run on Apify

1. Open the Actor in Apify Console.
2. Configure the available input fields for the target scope.
3. Set `limit` if you want a capped run; it caps the number of results saved for each `startUrls` entry or search query.
4. Click **Start** and wait for the run to finish.
5. Review the dataset and download results in JSON, CSV, Excel, or other supported formats.

### Scheduling & Automation

#### Scheduling

**Automated Data Collection**

You can schedule recurring runs to keep your dataset current without manual execution. This is useful for monitoring keywords, publications, and recent public conversation over time.

- Navigate to **Schedules** in Apify Console
- Create a new schedule (daily, weekly, or custom cron)
- Configure input parameters
- Enable notifications for run completion
- Add webhooks for automated processing

#### Integration Options

- **Data warehouses:** load normalized post and comment records into analytical stores for trend analysis, historical reporting, and topic-level monitoring.
- **BI dashboards:** track publication activity, content volume, engagement signals, and keyword coverage over time.
- **Webhooks:** trigger ingestion, validation, alerting, or transformation workflows after each completed run.
- **API-driven applications:** consume dataset records directly in internal tools, search services, or customer-facing applications.
- **Google Sheets or Excel-based review:** export smaller runs for editorial review, QA, or lightweight operational workflows.
- **Enrichment pipelines:** append fresh public publication, author, and content attributes to existing CRM or analytics records.

### Export Formats And Downstream Use

Apify datasets can be exported or consumed in formats that fit both manual review and automated data delivery workflows.

- **JSON:** for APIs, applications, and data pipelines
- **CSV or Excel:** for spreadsheet workflows and manual review
- **API access:** for automated ingestion into internal systems
- **BI and warehouses:** for reporting, dashboards, and historical analysis
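For API access, dataset items can be fetched from the Apify REST API in a chosen format. The sketch below only builds the request URL; `dataset_items_url` is a hypothetical helper and `abc123` is a placeholder dataset ID. Non-public datasets also need an API token, which is left out here.

```python
from urllib.parse import urlencode

API_BASE = "https://api.apify.com/v2"


def dataset_items_url(dataset_id, fmt="json"):
    """Build the dataset items URL for a given export format."""
    # Formats used in this README: "json", "csv", and "xlsx" (Excel).
    query = urlencode({"format": fmt})
    return f"{API_BASE}/datasets/{dataset_id}/items?{query}"


# Example: CSV export URL for a placeholder dataset ID.
csv_url = dataset_items_url("abc123", "csv")
```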

### Performance

Estimated run times:

- **Small runs (< 1,000 outputs):** ~3-5 minutes
- **Medium runs (1,000-5,000 outputs):** ~5-15 minutes
- **Large runs (5,000+ outputs):** ~15-30 minutes

Execution time varies based on filters, result volume, and how much information is returned per record. Highly filtered runs can finish faster, while broad discovery or detail-rich records may take longer.

### Limitations

- Availability depends on what <https://substack.com> publicly exposes at run time.
- Some optional fields may be absent on sparse records or record types with limited visible metadata.
- Very broad searches may take longer or require a higher `limit` to capture the desired coverage.
- Public field availability and naming can change when the target platform changes how content is presented.
- Regional, account-level, language, or visibility differences can affect which records and attributes are available.
- Reply-thread depth depends on the selected inputs and whether `get_replies` is enabled.

### Troubleshooting

- **No results returned:** check your filters, keyword spelling, direct URLs, and whether the target has matching public records.
- **Fewer results than expected:** broaden filters, raise `limit`, or confirm that enough matching public records exist.
- **Some fields are empty:** optional fields depend on what each record publicly provides.
- **Run takes longer than expected:** reduce scope, lower `limit` for validation, or split broad collection into smaller runs.
- **Output changed:** compare the current dataset with the field reference and include a small sample if you need support.

### FAQ

#### What data does this Actor collect?

It collects public Substack records such as posts, publications, people, notes, comments, and optional reply-thread content, depending on your selected inputs and filters.

#### Can I use direct URLs and keyword queries in the same run?

Yes. You can provide `startUrls`, `queries`, or both if you want direct collection and discovery-oriented search in one workflow.

#### Can I filter by date or language?

Yes. The Actor supports `publication_date`, `within_days`, and `language`. `publication_date` and `language` apply to keyword-based post results, while `within_days` applies to keyword searches and supported direct URLs.

#### Why did I receive fewer results than my limit?

`limit` sets an upper bound, not a guarantee. The final count depends on how many matching public records are available for your inputs and filters.

#### Can I schedule recurring runs?

Yes. Apify schedules can run the Actor automatically on a daily, weekly, or custom cadence.

#### How do I avoid duplicates across runs?

Use `type + ":" + id` as the idempotency key when storing or syncing records downstream.
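A minimal sketch of that idempotency key in practice is shown below. It assumes every record carries `type` and `id` fields, as the key implies; `record_key`, `merge_new`, and the in-memory `store` dictionary are illustrative stand-ins for whatever database or cache you use.

```python
from typing import Any, Iterable


def record_key(item: dict[str, Any]) -> str:
    """Build the idempotency key from the record's type and id."""
    return f"{item['type']}:{item['id']}"


def merge_new(store: dict[str, dict], items: Iterable[dict[str, Any]]) -> list[dict]:
    """Add unseen records to the store and return only the newly added ones."""
    added = []
    for item in items:
        key = record_key(item)
        if key not in store:
            store[key] = item
            added.append(item)
    return added


store: dict[str, dict] = {}
merge_new(store, [{"type": "post", "id": 1}, {"type": "note", "id": 1}])
new_items = merge_new(store, [{"type": "post", "id": 1}, {"type": "post", "id": 2}])
print(new_items)  # only the record that was not stored by the first call
```

Including `type` in the key matters because, as in the example, a post and a note can share the same `id` value.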

#### Can I export the data to CSV, Excel, or JSON?

Yes. Apify datasets support JSON export for pipelines and CSV or Excel export for spreadsheet-based review.

#### Does this Actor collect private data?

No. It is intended for publicly available information from <https://substack.com>.

#### What should I include when reporting an issue?

Include the input you used with sensitive values redacted, the run ID, a short description of expected versus actual behavior, and an optional small output sample that shows the issue.

### Compliance & Ethics

#### Responsible Data Collection

This Actor collects publicly available **newsletter, publication, author, post, note, and comment** information from <https://substack.com> for legitimate business purposes, including:

- **Media and publishing** research and market analysis
- **Content monitoring and reporting**
- **Dataset enrichment and operational analytics**

This section is informational and not legal advice. Users are responsible for ensuring their collection and use of data complies with applicable laws, regulations, and platform terms.

#### Best Practices

- Use collected data in accordance with applicable laws, regulations, and the target site's terms
- Respect individual privacy and personal information
- Use data responsibly and avoid disruptive or excessive collection
- Do not use this Actor for spamming, harassment, or other harmful purposes
- Follow relevant data protection requirements where applicable (for example, GDPR or CCPA)

### Support

For help, use the Actor page or the repository Issues section. When reporting a problem, include the input used with sensitive values redacted, the run ID, expected versus actual behavior, and, if helpful, a small sample of the output that demonstrates the issue.

# Actor input Schema

## `startUrls` (type: `array`):

Paste the exact Substack pages you want to collect from. Supported page types include search result pages, publication homepages, author profiles, individual posts, note or comment pages, and custom-domain publication pages. Use this when you already know the specific Substack pages you want to track or export.

## `queries` (type: `array`):

Enter one or more search terms to discover matching Substack content. Good query examples include topics, creator names, publication names, industries, brands, or phrases your audience follows. Each query runs separately, which makes it easy to compare themes or monitor several subjects in one run.

## `result_type` (type: `string`):

Select the type of results to return for keyword searches, such as top results, recent items, posts only, publications, or people.

## `publication_date` (type: `string`):

Narrow keyword-based post results to a recent time window when you want fresher content instead of the full available history.

## `within_days` (type: `integer`):

Keep only posts published within the last N days. This applies the date cutoff to keyword searches and supported direct URLs, while older posts are ignored.

## `language` (type: `string`):

Restrict keyword-based post results to a specific language when you want content for a particular audience or market.

## `enrich_data` (type: `boolean`):

Enable this to expand supported search results with additional detail requests, which can produce richer output for posts, publications, or people. Turn it off if you only need the lighter search result data.

## `get_replies` (type: `boolean`):

Enable this to gather reply threads for note and comment detail pages, including paginated branches when available. Leave it off if you only want the main note or comment record.

## `max_replies` (type: `integer`):

Limit how many replies are saved for each individual post or thread. Use this to keep reply-heavy pages manageable without reducing results collected from other inputs.

## `limit` (type: `integer`):

Maximum listings to save per query or starting URL. Use a small number for quick sampling, testing, or spot checks, and a larger number when you need broader coverage for dashboards, research, or ongoing monitoring.

## Actor input object example

```json
{
  "startUrls": [
    "https://substack.com/search/movie?searching=top",
    "https://unstackit.substack.com",
    "https://substack.com/@vanessathe",
    "https://www.lennysnewsletter.com"
  ],
  "result_type": "top",
  "enrich_data": true,
  "get_replies": false,
  "limit": 100
}
```
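Before starting a run, it can be useful to check an input object locally. The sketch below is a hypothetical pre-flight check, not something the Actor provides: the enum values and minimums are copied from the input schema in the OpenAPI specification in this document, while the rule that at least one of `startUrls` or `queries` must be present is an assumption. The `language` enum is omitted for brevity.

```python
RESULT_TYPES = {"top", "recent", "posts", "publications", "people"}
PUBLICATION_DATES = {"last_day", "last_week", "last_month", "last_year"}
# Minimum allowed values for integer inputs, per the input schema.
INTEGER_MINIMUMS = {"within_days": 1, "max_replies": 0, "limit": 1}


def validate_input(run_input: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the input looks fine."""
    errors = []
    # Assumption: a run with neither input has nothing to collect.
    if not run_input.get("startUrls") and not run_input.get("queries"):
        errors.append("provide startUrls, queries, or both")
    if run_input.get("result_type", "top") not in RESULT_TYPES:
        errors.append(f"result_type must be one of {sorted(RESULT_TYPES)}")
    publication_date = run_input.get("publication_date")
    if publication_date is not None and publication_date not in PUBLICATION_DATES:
        errors.append(f"publication_date must be one of {sorted(PUBLICATION_DATES)}")
    for name, minimum in INTEGER_MINIMUMS.items():
        value = run_input.get(name)
        if value is None:
            continue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or value < minimum:
            errors.append(f"{name} must be an integer >= {minimum}")
    return errors


print(validate_input({"queries": ["ai"], "result_type": "newest", "limit": 0}))
```

Returning a list of problems instead of raising on the first one lets you report every issue in a single pass.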

# Actor output Schema

## `results` (type: `string`):

No description

# API

You can run this Actor programmatically using our API. Below are code examples in JavaScript, Python, and CLI, as well as the OpenAPI specification and MCP server setup.

## JavaScript example

```javascript
import { ApifyClient } from 'apify-client';

// Initialize the ApifyClient with your Apify API token
// Replace '<YOUR_API_TOKEN>' with your token
const client = new ApifyClient({
    token: '<YOUR_API_TOKEN>',
});

// Prepare Actor input
const input = {
    "startUrls": [
        "https://substack.com/search/movie?searching=top",
        "https://unstackit.substack.com",
        "https://substack.com/@vanessathe",
        "https://www.lennysnewsletter.com"
    ],
    "limit": 100
};

// Run the Actor and wait for it to finish
const run = await client.actor("fatihtahta/substack-scraper").call(input);

// Fetch and print Actor results from the run's dataset (if any)
console.log('Results from dataset');
console.log(`💾 Check your data here: https://console.apify.com/storage/datasets/${run.defaultDatasetId}`);
const { items } = await client.dataset(run.defaultDatasetId).listItems();
items.forEach((item) => {
    console.dir(item);
});

// 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/js/docs

```

## Python example

```python
from apify_client import ApifyClient

# Initialize the ApifyClient with your Apify API token
# Replace '<YOUR_API_TOKEN>' with your token.
client = ApifyClient("<YOUR_API_TOKEN>")

# Prepare the Actor input
run_input = {
    "startUrls": [
        "https://substack.com/search/movie?searching=top",
        "https://unstackit.substack.com",
        "https://substack.com/@vanessathe",
        "https://www.lennysnewsletter.com",
    ],
    "limit": 100,
}

# Run the Actor and wait for it to finish
run = client.actor("fatihtahta/substack-scraper").call(run_input=run_input)

# Fetch and print Actor results from the run's dataset (if there are any)
print("💾 Check your data here: https://console.apify.com/storage/datasets/" + run["defaultDatasetId"])
for item in client.dataset(run["defaultDatasetId"]).iterate_items():
    print(item)

# 📚 Want to learn more 📖? Go to → https://docs.apify.com/api/client/python/docs/quick-start

```

## CLI example

```bash
echo '{
  "startUrls": [
    "https://substack.com/search/movie?searching=top",
    "https://unstackit.substack.com",
    "https://substack.com/@vanessathe",
    "https://www.lennysnewsletter.com"
  ],
  "limit": 100
}' |
apify call fatihtahta/substack-scraper --silent --output-dataset

```

## MCP server setup

```json
{
    "mcpServers": {
        "apify": {
            "command": "npx",
            "args": [
                "mcp-remote",
                "https://mcp.apify.com/?tools=fatihtahta/substack-scraper",
                "--header",
                "Authorization: Bearer <YOUR_API_TOKEN>"
            ]
        }
    }
}

```

## OpenAPI specification

```json
{
    "openapi": "3.0.1",
    "info": {
        "title": "Substack Scraper | All-In-One",
        "description": "Get full articles, user profiles, and search results with All-in-One Substack Scraper. Extract rich data including titles, bios, subscriber counts, social links and engagement metrics. Ideal for market research, creator discovery, trend tracking, and audience analysis.",
        "version": "1.1",
        "x-build-id": "MQu14AaqJ3d3udBbm"
    },
    "servers": [
        {
            "url": "https://api.apify.com/v2"
        }
    ],
    "paths": {
        "/acts/fatihtahta~substack-scraper/run-sync-get-dataset-items": {
            "post": {
                "operationId": "run-sync-get-dataset-items-fatihtahta-substack-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for its completion, and returns Actor's dataset items in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        },
        "/acts/fatihtahta~substack-scraper/runs": {
            "post": {
                "operationId": "runs-sync-fatihtahta-substack-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor and returns information about the initiated run in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK",
                        "content": {
                            "application/json": {
                                "schema": {
                                    "$ref": "#/components/schemas/runsResponseSchema"
                                }
                            }
                        }
                    }
                }
            }
        },
        "/acts/fatihtahta~substack-scraper/run-sync": {
            "post": {
                "operationId": "run-sync-fatihtahta-substack-scraper",
                "x-openai-isConsequential": false,
                "summary": "Executes an Actor, waits for completion, and returns the OUTPUT from Key-value store in response.",
                "tags": [
                    "Run Actor"
                ],
                "requestBody": {
                    "required": true,
                    "content": {
                        "application/json": {
                            "schema": {
                                "$ref": "#/components/schemas/inputSchema"
                            }
                        }
                    }
                },
                "parameters": [
                    {
                        "name": "token",
                        "in": "query",
                        "required": true,
                        "schema": {
                            "type": "string"
                        },
                        "description": "Enter your Apify token here"
                    }
                ],
                "responses": {
                    "200": {
                        "description": "OK"
                    }
                }
            }
        }
    },
    "components": {
        "schemas": {
            "inputSchema": {
                "type": "object",
                "properties": {
                    "startUrls": {
                        "title": "Choose Starting Pages to Scrape",
                        "type": "array",
                        "description": "Paste the exact Substack pages you want to collect from. Supported page types include search result pages, publication homepages, author profiles, individual posts, note or comment pages, and custom-domain publication pages. Use this when you already know the specific Substack pages you want to track or export.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "queries": {
                        "title": "Search by Keyword",
                        "type": "array",
                        "description": "Enter one or more search terms to discover matching Substack content. Good query examples include topics, creator names, publication names, industries, brands, or phrases your audience follows. Each query runs separately, which makes it easy to compare themes or monitor several subjects in one run.",
                        "items": {
                            "type": "string"
                        }
                    },
                    "result_type": {
                        "title": "Choose Which Search Results to Collect",
                        "enum": [
                            "top",
                            "recent",
                            "posts",
                            "publications",
                            "people"
                        ],
                        "type": "string",
                        "description": "Select the type of results to return for keyword searches, such as top results, recent items, posts only, publications, or people.",
                        "default": "top"
                    },
                    "publication_date": {
                        "title": "Limit Results by Publication Date",
                        "enum": [
                            "last_day",
                            "last_week",
                            "last_month",
                            "last_year"
                        ],
                        "type": "string",
                        "description": "Narrow keyword-based post results to a recent time window when you want fresher content instead of the full available history."
                    },
                    "within_days": {
                        "title": "Keep Only Recent Posts",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Keep only posts published within the last N days. This applies the date cutoff to keyword searches and supported direct URLs, while older posts are ignored."
                    },
                    "language": {
                        "title": "Limit Post Results by Language",
                        "enum": [
                            "ar",
                            "zh",
                            "cs",
                            "nl",
                            "en",
                            "fr",
                            "de",
                            "el",
                            "hi",
                            "hu",
                            "id",
                            "it",
                            "ja",
                            "ko",
                            "la",
                            "no",
                            "pl",
                            "pt",
                            "ro",
                            "ru",
                            "es",
                            "sv",
                            "th",
                            "tr",
                            "vi"
                        ],
                        "type": "string",
                        "description": "Restrict keyword-based post results to a specific language when you want content for a particular audience or market."
                    },
                    "enrich_data": {
                        "title": "Fetch More Detailed Records",
                        "type": "boolean",
                        "description": "Enable this to expand supported search results with additional detail requests, which can produce richer output for posts, publications, or people. Turn it off if you only need the lighter search result data.",
                        "default": true
                    },
                    "get_replies": {
                        "title": "Collect Replies to Notes and Comments",
                        "type": "boolean",
                        "description": "Enable this to gather reply threads for note and comment detail pages, including paginated branches when available. Leave it off if you only want the main note or comment record.",
                        "default": false
                    },
                    "max_replies": {
                        "title": "Set the Maximum Replies to Collect",
                        "minimum": 0,
                        "type": "integer",
                        "description": "Limit how many replies are saved for each individual post or thread. Use this to keep reply-heavy pages manageable without reducing results collected from other inputs."
                    },
                    "limit": {
                        "title": "Set a Result Limit for Each Input",
                        "minimum": 1,
                        "type": "integer",
                        "description": "Maximum listings to save per query or starting URL. Use a small number for quick sampling, testing, or spot checks, and a larger number when you need broader coverage for dashboards, research, or ongoing monitoring."
                    }
                }
            },
            "runsResponseSchema": {
                "type": "object",
                "properties": {
                    "data": {
                        "type": "object",
                        "properties": {
                            "id": {
                                "type": "string"
                            },
                            "actId": {
                                "type": "string"
                            },
                            "userId": {
                                "type": "string"
                            },
                            "startedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "finishedAt": {
                                "type": "string",
                                "format": "date-time",
                                "example": "2025-01-08T00:00:00.000Z"
                            },
                            "status": {
                                "type": "string",
                                "example": "READY"
                            },
                            "meta": {
                                "type": "object",
                                "properties": {
                                    "origin": {
                                        "type": "string",
                                        "example": "API"
                                    },
                                    "userAgent": {
                                        "type": "string"
                                    }
                                }
                            },
                            "stats": {
                                "type": "object",
                                "properties": {
                                    "inputBodyLen": {
                                        "type": "integer",
                                        "example": 2000
                                    },
                                    "rebootCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "restartCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "resurrectCount": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "computeUnits": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "options": {
                                "type": "object",
                                "properties": {
                                    "build": {
                                        "type": "string",
                                        "example": "latest"
                                    },
                                    "timeoutSecs": {
                                        "type": "integer",
                                        "example": 300
                                    },
                                    "memoryMbytes": {
                                        "type": "integer",
                                        "example": 1024
                                    },
                                    "diskMbytes": {
                                        "type": "integer",
                                        "example": 2048
                                    }
                                }
                            },
                            "buildId": {
                                "type": "string"
                            },
                            "defaultKeyValueStoreId": {
                                "type": "string"
                            },
                            "defaultDatasetId": {
                                "type": "string"
                            },
                            "defaultRequestQueueId": {
                                "type": "string"
                            },
                            "buildNumber": {
                                "type": "string",
                                "example": "1.0.0"
                            },
                            "containerUrl": {
                                "type": "string"
                            },
                            "usage": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "integer",
                                        "example": 1
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            },
                            "usageTotalUsd": {
                                "type": "number",
                                "example": 0.00005
                            },
                            "usageUsd": {
                                "type": "object",
                                "properties": {
                                    "ACTOR_COMPUTE_UNITS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATASET_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "KEY_VALUE_STORE_WRITES": {
                                        "type": "number",
                                        "example": 0.00005
                                    },
                                    "KEY_VALUE_STORE_LISTS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_READS": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "REQUEST_QUEUE_WRITES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_INTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "DATA_TRANSFER_EXTERNAL_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_RESIDENTIAL_TRANSFER_GBYTES": {
                                        "type": "integer",
                                        "example": 0
                                    },
                                    "PROXY_SERPS": {
                                        "type": "integer",
                                        "example": 0
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
}
```
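If you prefer plain HTTP over the client libraries, the `run-sync-get-dataset-items` path from the specification above can be called directly. The sketch below uses only the Python standard library. The specification lists `token` as a query parameter; this example instead sends the token in an `Authorization: Bearer` header, which the Apify API also accepts. `build_sync_request` and `fetch_items` are hypothetical helper names, and the 300-second timeout is an arbitrary choice.

```python
import json
import urllib.request

API_BASE = "https://api.apify.com/v2"
ACTOR_PATH = "fatihtahta~substack-scraper"


def build_sync_request(run_input: dict, token: str) -> urllib.request.Request:
    """Prepare a POST request that runs the Actor and returns its dataset items."""
    url = f"{API_BASE}/acts/{ACTOR_PATH}/run-sync-get-dataset-items"
    return urllib.request.Request(
        url,
        data=json.dumps(run_input).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )


def fetch_items(run_input: dict, token: str, timeout: float = 300) -> list:
    """Send the request and decode the JSON array of dataset items."""
    request = build_sync_request(run_input, token)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)


# Building the request does not contact the API; call fetch_items(...) to run it.
request = build_sync_request({"queries": ["movie"], "limit": 10}, "<YOUR_API_TOKEN>")
print(request.get_method(), request.full_url)
```

The synchronous endpoint holds the connection open until the run finishes, so for long collections the asynchronous `/runs` path, followed by a separate dataset download, is likely the safer choice.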