Pricing

Pay per event

Archive.org Subtitle & Transcript Scraper — TXT, SRT & VTT

Download captions from any Archive.org film, TV, or audio item: clean transcript text, timestamped cues, normalized SRT & VTT, one row per language. Search 3M+ captioned items, monitor for new ones. No login or API key. About $8 per 1,000 transcripts at the event prices listed under "How much does it cost?".

Developer

Scrapers Delight

🎞️ Archive.org Subtitle & Transcript Scraper — TXT, SRT & VTT

Pull the subtitles/captions from any Internet Archive film, TV recording, or audio item — no login, no API key, no AI transcription. Archive.org hosts 3M+ captioned items (classic films, newsreels, lectures, TV news) and exposes them through public APIs; this actor downloads the caption files and parses them into clean transcript text, timestamped cues, and normalized SRT/VTT — one row per language. Point it at item URLs or an archive.org search query.

Because the captions already exist (uploaded subtitles or archive's own ASR), there's no speech-to-text compute — it's fast and cheap.

What does it do?

For each archive.org item you give it (by URL/identifier or discovered via search), it returns:

📝 Full transcript (clean plain text) — always included
⏲️ Timestamped cues — {index, start_ms, end_ms, start, end, text}
🎬 Normalized SRT / VTT — re-emitted with proper 3-digit millisecond stamps (archive's raw ASR files use non-standard 2-digit millis that break many players)
🌍 One row per caption file/language — grab English ASR plus every uploaded translation
🏷️ Item metadata — title, language, mediatype, collections, item URL
🔎 Search discovery — any advanced-search (Lucene) query, auto-scoped to captioned movies/audio, sorted by downloads
🚩 Honest flags — items with no captions, access-restricted items, and private/empty caption files are reported as such, never as silent zero-cue "successes"

No ASR, no API key — it reads the caption files archive.org already publishes.
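
As a rough illustration of what "normalized SRT / VTT" means for the cue fields listed above, the sketch below re-emits cues with 3-digit millisecond stamps. It assumes each cue is a dict carrying `start_ms`, `end_ms`, and `text` as in the cue shape above; the function names are hypothetical and this is not the actor's own code.

```python
def format_stamp(ms: int, sep: str) -> str:
    """Render milliseconds as hh:mm:ss<sep>mmm with a 3-digit millisecond field."""
    hours, rem = divmod(ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    seconds, millis = divmod(rem, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{sep}{millis:03d}"


def cues_to_srt(cues: list[dict]) -> str:
    """SRT: numbered blocks, comma before the milliseconds."""
    blocks = []
    for number, cue in enumerate(cues, start=1):
        start = format_stamp(cue["start_ms"], ",")
        end = format_stamp(cue["end_ms"], ",")
        blocks.append(f"{number}\n{start} --> {end}\n{cue['text']}\n")
    return "\n".join(blocks)


def cues_to_vtt(cues: list[dict]) -> str:
    """WebVTT: 'WEBVTT' header line, dot before the milliseconds."""
    lines = ["WEBVTT", ""]
    for cue in cues:
        start = format_stamp(cue["start_ms"], ".")
        end = format_stamp(cue["end_ms"], ".")
        lines += [f"{start} --> {end}", cue["text"], ""]
    return "\n".join(lines)
```

Working from integer milliseconds keeps the two output formats in step: only the separator and the header differ.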

What data does it extract?

One dataset record per caption file (per language):

🆔 identifier, 🏷️ item_title, 🔗 item_url, 🌍 language, 📦 mediatype, 🗂️ collection[], ⬇️ downloads
📄 caption_file_name, caption_format (SubRip / Web Video Text Tracks), 🌍 caption_lang_code, 🤖 is_autogenerated (.asr = archive's English ASR)
🔗 caption_url, 📏 caption_size_bytes
📝 transcript, ⏲️ segments[], 🎬 srt, vtt, 🔢 cue_count
🚩 restricted, note, ✨ is_new (monitor), 🕒 scraped_at

Example output

{
  "identifier": "Doctorin1946",
  "item_title": "Doctor in Industry (Part I)",
  "item_url": "https://archive.org/details/Doctorin1946",
  "mediatype": "movies",
  "caption_file_name": "Doctorin1946.asr.srt",
  "caption_format": "SubRip",
  "caption_lang_code": "en",
  "is_autogenerated": true,
  "caption_url": "https://archive.org/download/Doctorin1946/Doctorin1946.asr.srt",
  "caption_size_bytes": 14725,
  "cue_count": 217,
  "transcript": "When the thing with the name names …",
  "restricted": false,
  "scraped_at": "2026-06-12T00:00:00.000Z"
}

Who is it for?

🤖 AI / RAG dataset builders — millions of hours of public-domain era film and TV speech, already transcribed.
✍️ Documentary makers & editors — search inside classic films and newsreels, get ready-to-cut SRT/VTT.
🔎 Researchers & historians — full-text search across mid-century educational films, TV news, and lectures.
🌍 Localization & subtitle teams — pull every language track an item carries in one run.

How to use it (step by step)

1. Click Try for free.
2. Paste one or more item URLs (https://archive.org/details/{identifier}) or bare identifiers, or set a search query (e.g. collection:prelinger).
3. (Optional) Filter languages, toggle autogenerated (.asr) captions, add extra formats (srt, vtt, segments).
4. Click Start, then open the Dataset tab to view/export.
5. (Optional) Set monitorMode + a searchQuery + a Schedule to capture newly captioned items automatically.

Quick start

{
  "itemUrls": ["https://archive.org/details/his_girl_friday"],
  "transcriptFormats": ["txt", "srt"]
}

Search a whole collection

{
  "searchQuery": "collection:prelinger",
  "maxItems": 50,
  "transcriptFormats": ["txt", "segments"]
}
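
The same inputs can also be sent from code. The sketch below assumes the `apify-client` Python package and an Apify API token (the token authenticates with Apify; archive.org itself still needs no key). The actor ID is not stated on this page, so `actor_id` is a placeholder you must fill in, and `build_run_input` / `run_actor` are hypothetical helper names.

```python
ALLOWED_FORMATS = {"txt", "segments", "srt", "vtt"}


def build_run_input(item_urls=None, search_query=None, max_items=5, formats=("txt",)):
    """Assemble an input object using the field names from the Input table."""
    if not item_urls and not search_query:
        raise ValueError("provide item_urls or search_query")
    unknown = set(formats) - ALLOWED_FORMATS
    if unknown:
        raise ValueError(f"unsupported transcript formats: {sorted(unknown)}")
    run_input = {"transcriptFormats": list(formats), "maxItems": max_items}
    if item_urls:
        run_input["itemUrls"] = list(item_urls)
    if search_query:
        run_input["searchQuery"] = search_query
    return run_input


def run_actor(actor_id: str, token: str, run_input: dict) -> list[dict]:
    """Start a run, wait for it to finish, and return the dataset records."""
    # Third-party dependency, imported lazily: pip install apify-client
    from apify_client import ApifyClient

    client = ApifyClient(token)
    run = client.actor(actor_id).call(run_input=run_input)
    return list(client.dataset(run["defaultDatasetId"]).iterate_items())
```

For example, `run_actor("<username>/<actor-name>", token, build_run_input(search_query="collection:prelinger", max_items=50, formats=("txt", "segments")))` would mirror the collection search shown above.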

Input

Field	What it does
`itemUrls`	archive.org item URLs / identifiers
`searchQuery`	advanced-search (Lucene) query — auto-scoped to captioned movies/audio, restricted items excluded, sorted by downloads
`languages`	keep only these caption language codes (empty = all)
`includeAutoGenerated`	include archive's `.asr` English ASR captions (default on)
`transcriptFormats`	`txt` · `segments` · `srt` · `vtt`
`maxItems`	hard cap on items per run (default 5; 0 = unlimited)
`maxCaptionFilesPerItem`	cap caption files per item (default 5; 0 = all)
`monitorMode`, `alertOnNewItem`	recurring new-item watcher + alerts
`webhookUrl`, `slackWebhookUrl`, `emailRecipients`	alert channels
`proxyConfiguration`, `requestConcurrency`	proxy + parallelism

Output

Each caption file is one dataset record (fields above). Items with no captions, access-restricted items, and private/empty caption files are emitted as flagged rows (restricted, note) so you always know why a transcript is missing. Export to JSON, CSV, Excel, HTML, or RSS, or fetch via the Apify API.
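
Because flagged rows share the dataset with real transcripts, downstream code usually wants to separate them first. A minimal sketch, assuming a flagged row is one with `restricted: true`, a non-empty `note`, or no `transcript` text (the exact flagging rules are an assumption, and `split_records` is a hypothetical helper):

```python
def split_records(records: list[dict]) -> tuple[list[dict], list[dict]]:
    """Separate usable transcript rows from flagged ones."""
    usable, flagged = [], []
    for record in records:
        # Assumed rule: any of these three signals marks a row as flagged.
        is_flagged = (
            record.get("restricted")
            or record.get("note")
            or not record.get("transcript")
        )
        (flagged if is_flagged else usable).append(record)
    return usable, flagged
```

Keeping the flagged list instead of dropping it preserves the reason each transcript is missing.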

How much does it cost?

Pay-per-event — and with no transcription compute, it's cheap:

Event	What it covers	Price
`lot-scraped`	each record returned	$0.004 / record
`lot-detail-enriched`	each caption file downloaded + parsed	$0.004 / file
`monitor-run-completed`	each scheduled watch run	$0.05 / run
`new-lot-detected`	each new item found by the monitor	$0.02 / item
`alert-delivered`	each Slack/email/webhook push	$0.005 / alert

That's about $8 per 1,000 transcripts (fetch + parse). No charge for actor starts or empty runs.
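
The arithmetic behind that figure can be checked with a small estimator. The prices are copied from the table above; `estimate_cost` is a hypothetical helper, and it assumes each transcript triggers one `lot-scraped` and one `lot-detail-enriched` event.

```python
from decimal import Decimal

# Event prices in USD, as listed in the pricing table.
PRICES = {
    "lot-scraped": Decimal("0.004"),
    "lot-detail-enriched": Decimal("0.004"),
    "monitor-run-completed": Decimal("0.05"),
    "new-lot-detected": Decimal("0.02"),
    "alert-delivered": Decimal("0.005"),
}


def estimate_cost(event_counts: dict[str, int]) -> Decimal:
    """Sum price x count over the events a run is expected to emit."""
    total = Decimal("0")
    for event, count in event_counts.items():
        total += PRICES[event] * count
    return total


# 1,000 transcripts, one caption file each
thousand = estimate_cost({"lot-scraped": 1000, "lot-detail-enriched": 1000})
```

`Decimal` is used instead of floats so that sub-cent prices add up exactly.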

Monitor & alert setup

1. Set a searchQuery (e.g. collection:prelinger or subject:"television news").
2. Turn on monitorMode (and keep alertOnNewItem on).
3. Add a webhookUrl, slackWebhookUrl, and/or emailRecipients.
4. Create an Apify Schedule (e.g. daily). The first run baselines the seen items; every later run outputs and alerts only new items. State persists in a named key-value store (archive-transcript-monitor-state), so it survives between runs.
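
The baseline-then-diff behaviour can be sketched as below. This is an illustration of the idea, not the actor's code: state is kept in memory here, whereas the actor persists it in the named key-value store, and `monitor_step` is a hypothetical name. It reads "baselines" as meaning the first run records everything and reports nothing as new.

```python
from typing import Optional


def monitor_step(
    seen: Optional[set[str]], current_ids: list[str]
) -> tuple[list[str], set[str]]:
    """Return (identifiers to report as new, updated seen set)."""
    if seen is None:
        # First run: record everything as the baseline, report nothing.
        return [], set(current_ids)
    updated = set(seen)
    new_ids = []
    for identifier in current_ids:
        if identifier not in updated:
            new_ids.append(identifier)  # keep search order for the output
            updated.add(identifier)
    return new_ids, updated
```

Each scheduled run would load the stored set, call the step with the latest search results, emit and alert on the returned identifiers, and save the updated set back.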

How does it work without AI transcription?

Archive.org items carry caption files: uploader-provided .srt/.vtt subtitles and archive's own autogenerated English ASR (.asr.srt). This actor reads the item's public metadata, picks the caption files, downloads them, and runs a hardened parser built for the variants found in the wild: BOM + CRLF files, 2-digit millisecond ASR stamps, <i> formatting tags, VTT headers with trailing junk, and cues without indices. It does not run speech-to-text, so there's no GPU cost and a run takes only as long as downloading and parsing the files.
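
To make those variants concrete, here is a minimal parser sketch that tolerates each of them. It is not the actor's parser. One assumption is labeled in the code: a 2-digit fraction is read as a decimal fraction of a second (so `,50` becomes 500 ms), which may not match how archive.org's ASR files intend it.

```python
import re

# Optional hours, then mm:ss, then 1 to 3 fractional digits after "," or ".".
STAMP = re.compile(r"(?:(\d+):)?(\d{1,2}):(\d{2})[.,](\d{1,3})")
TIMING = re.compile(r"(\S+)\s+-->\s+(\S+)")
TAG = re.compile(r"<[^>]+>")


def stamp_to_ms(stamp: str) -> int:
    match = STAMP.fullmatch(stamp)
    if match is None:
        raise ValueError(f"unrecognized timestamp: {stamp!r}")
    hours, minutes, seconds, fraction = match.groups()
    # ASSUMPTION: short fractions are right-padded, so "50" means 500 ms.
    millis = int(fraction.ljust(3, "0"))
    total_seconds = (int(hours or 0) * 60 + int(minutes)) * 60 + int(seconds)
    return total_seconds * 1000 + millis


def parse_captions(raw: str) -> list[dict]:
    """Parse SRT or VTT text into cues with start_ms, end_ms, and text."""
    text = raw.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    text = "\n".join(line.strip() for line in text.split("\n"))
    cues = []
    for block in re.split(r"\n{2,}", text.strip()):
        lines = block.split("\n")
        # Locate the timing line; an index or header line before it is skipped.
        for pos, line in enumerate(lines):
            timing = TIMING.match(line)
            if timing:
                break
        else:
            continue  # e.g. a "WEBVTT ..." header block with no timing line
        body = " ".join(TAG.sub("", part) for part in lines[pos + 1:]).strip()
        if not body:
            continue
        cues.append({
            "index": len(cues) + 1,  # renumber, since source indices may be absent
            "start_ms": stamp_to_ms(timing.group(1)),
            "end_ms": stamp_to_ms(timing.group(2)),
            "text": body,
        })
    return cues
```

Renumbering cues on output sidesteps files whose indices are missing or out of order.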

Is it legal to scrape archive.org captions?

The Internet Archive is a non-profit library that publishes these items and APIs for public access, and much of the captioned material is public-domain era film. The output is published media content and item stats, not personal data. Scraping public data is generally legal, but you are responsible for your use — review archive.org's Terms of Use and each item's rights/license statement before redistributing content.

FAQ

Which items have captions? 3M+ movies/audio items carry .srt/.vtt files — classic films, Prelinger educational shorts, TV news, lectures. The search mode finds them for you (it filters to format:"SubRip" OR "Web Video Text Tracks" automatically).

Is there a Whisper/ASR step? No — it downloads the caption files archive.org already publishes (including archive's own ASR track), so it's fast and cheap.

Can I get subtitles for video editing? Yes. Add srt and/or vtt to transcriptFormats. The actor normalizes archive's non-standard 2-digit-millisecond stamps to 3-digit milliseconds (hh:mm:ss,mmm in SRT, hh:mm:ss.mmm in VTT), so the files load in editors and players that reject the raw stamps.

What about multiple languages? Each caption file becomes its own row with caption_lang_code parsed from the filename. Use languages to keep only the ones you want.

Why did an item return no transcript? Three honest cases, all flagged in the row: the item has no caption files (note), the item is access-restricted (its files are private and download as empty bodies — restricted: true), or a specific file is private/zero-byte. The actor never reports those as empty "successes".

Can I crawl a whole collection? Yes — searchQuery: "collection:{name}" + maxItems: 0. Archive's search window caps at 10,000 rows per query; slice bigger collections by date (publicdate:[2020-01-01 TO 2021-01-01]).
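
Slicing by date can be automated by generating one query per year, as sketched below. `yearly_slices` is a hypothetical helper; it only builds query strings in the `publicdate:[... TO ...]` form shown above, which you would then pass as `searchQuery` one run at a time.

```python
from datetime import date


def yearly_slices(base_query: str, first_year: int, last_year: int) -> list[str]:
    """Build one Lucene query per calendar year of publicdate."""
    queries = []
    for year in range(first_year, last_year + 1):
        start = date(year, 1, 1).isoformat()
        end = date(year + 1, 1, 1).isoformat()
        queries.append(f"({base_query}) AND publicdate:[{start} TO {end}]")
    return queries
```

Square brackets are inclusive at both ends in Lucene range syntax, so an item dated exactly on a boundary could appear in two slices; deduplicating on `identifier` after the runs covers that case. If a single year still exceeds 10,000 rows, the same idea applies with shorter ranges.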

How fresh is monitor mode? Every scheduled run re-queries your search and diffs against the named state store — you get only items it hasn't seen before, plus optional Slack/webhook/email alerts.

Does it need a proxy or login? No login or API key. Archive.org's endpoints are public; the default datacenter proxy rotation is plenty.

How do I export? JSON, CSV, Excel, HTML, or RSS from the Dataset tab, or via the Apify API.

What does a 1,000-film crawl cost? With one caption file each: 1,000 × ($0.004 + $0.004) = ~$8.

Feedback

Want full-text search inside transcripts, TV-news-specific fields, or bulk export to a single file? Open an issue on the actor.
