Archive.org Subtitle & Transcript Scraper — TXT, SRT & VTT
Pricing
Pay per event
Download captions from any Archive.org film, TV, or audio item: clean transcript text, timestamped cues, normalized SRT & VTT, one row per language. Search 3M+ captioned items, monitor for new ones. No login or API key. About $8 per 1,000 transcripts.
Developer: Scrapers Delight (maintained by Community)
🎞️ Archive.org Subtitle & Transcript Scraper — TXT, SRT & VTT
Pull the subtitles/captions from any Internet Archive film, TV recording, or audio item — no login, no API key, no AI transcription. Archive.org hosts 3M+ captioned items (classic films, newsreels, lectures, TV news) and exposes them through public APIs; this actor downloads the caption files and parses them into clean transcript text, timestamped cues, and normalized SRT/VTT — one row per language. Point it at item URLs or an archive.org search query.
Because the captions already exist (uploaded subtitles or archive's own ASR), there's no speech-to-text compute — it's fast and cheap.
What does it do?
For each archive.org item you give it (by URL/identifier or discovered via search), it returns:
- 📝 Full transcript (clean plain text) — always included
- ⏲️ Timestamped cues: {index, start_ms, end_ms, start, end, text}
- 🎬 Normalized SRT / VTT: re-emitted with proper 3-digit millisecond stamps (archive's raw ASR files use non-standard 2-digit millis that break many players)
- 🌍 One row per caption file/language — grab English ASR plus every uploaded translation
- 🏷️ Item metadata — title, language, mediatype, collections, item URL
- 🔎 Search discovery — any advanced-search (Lucene) query, auto-scoped to captioned movies/audio, sorted by downloads
- 🚩 Honest flags — items with no captions, access-restricted items, and private/empty caption files are reported as such, never as silent zero-cue "successes"
No ASR, no API key — it reads the caption files archive.org already publishes.
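As an illustration of how the cue fields listed above relate to the SRT and VTT outputs, here is a minimal Python sketch that re-emits cues with 3-digit millisecond stamps. The function names are hypothetical and this is not the actor's own code; it only assumes each cue carries start_ms, end_ms, and text.

```python
def format_timestamp(ms: int, separator: str = ",") -> str:
    """Render milliseconds as hh:mm:ss<sep>mmm (',' for SRT, '.' for VTT)."""
    hours, rest = divmod(ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1_000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}{separator}{millis:03d}"


def cues_to_srt(cues: list[dict]) -> str:
    """Emit SubRip text: numbered blocks separated by blank lines."""
    blocks = []
    for number, cue in enumerate(cues, start=1):
        start = format_timestamp(cue["start_ms"])
        end = format_timestamp(cue["end_ms"])
        blocks.append(f"{number}\n{start} --> {end}\n{cue['text']}")
    return "\n\n".join(blocks) + "\n"


def cues_to_vtt(cues: list[dict]) -> str:
    """Emit WebVTT text: a WEBVTT header, then unnumbered cue blocks."""
    blocks = ["WEBVTT"]
    for cue in cues:
        start = format_timestamp(cue["start_ms"], ".")
        end = format_timestamp(cue["end_ms"], ".")
        blocks.append(f"{start} --> {end}\n{cue['text']}")
    return "\n\n".join(blocks) + "\n"
```

If you request the segments format, a helper like this lets you rebuild subtitle files after editing cue text, without re-running the actor.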
What data does it extract?
One dataset record per caption file (per language):
- 🆔 identifier, 🏷️ item_title, 🔗 item_url, 🌍 language, 📦 mediatype, 🗂️ collection[], ⬇️ downloads
- 📄 caption_file_name, caption_format (SubRip / Web Video Text Tracks), 🌍 caption_lang_code, 🤖 is_autogenerated (.asr = archive's English ASR)
- 🔗 caption_url, 📏 caption_size_bytes
- 📝 transcript, ⏲️ segments[], 🎬 srt, vtt, 🔢 cue_count
- 🚩 restricted, note, ✨ is_new (monitor), 🕒 scraped_at
Example output
{"identifier": "Doctorin1946","item_title": "Doctor in Industry (Part I)","item_url": "https://archive.org/details/Doctorin1946","mediatype": "movies","caption_file_name": "Doctorin1946.asr.srt","caption_format": "SubRip","caption_lang_code": "en","is_autogenerated": true,"caption_url": "https://archive.org/download/Doctorin1946/Doctorin1946.asr.srt","caption_size_bytes": 14725,"cue_count": 217,"transcript": "When the thing with the name names …","restricted": false,"scraped_at": "2026-06-12T00:00:00.000Z"}
Who is it for?
- 🤖 AI / RAG dataset builders — millions of hours of public-domain era film and TV speech, already transcribed.
- ✍️ Documentary makers & editors — search inside classic films and newsreels, get ready-to-cut SRT/VTT.
- 🔎 Researchers & historians — full-text search across mid-century educational films, TV news, and lectures.
- 🌍 Localization & subtitle teams — pull every language track an item carries in one run.
How to use it (step by step)
- Click Try for free.
- Paste one or more item URLs (https://archive.org/details/{identifier}) or bare identifiers, or set a search query (e.g. collection:prelinger).
- (Optional) Filter languages, toggle autogenerated (.asr) captions, add extra formats (srt, vtt, segments).
- Click Start, then open the Dataset tab to view/export.
- (Optional) set monitorMode + a searchQuery + a Schedule to capture newly captioned items automatically.
Quick start
{"itemUrls": ["https://archive.org/details/his_girl_friday"],"transcriptFormats": ["txt", "srt"]}
Search a whole collection
{"searchQuery": "collection:prelinger","maxItems": 50,"transcriptFormats": ["txt", "segments"]}
Input
| Field | What it does |
|---|---|
| itemUrls | archive.org item URLs / identifiers |
| searchQuery | advanced-search (Lucene) query; auto-scoped to captioned movies/audio, restricted items excluded, sorted by downloads |
| languages | keep only these caption language codes (empty = all) |
| includeAutoGenerated | include archive's .asr English ASR captions (default on) |
| transcriptFormats | txt · segments · srt · vtt |
| maxItems | hard cap on items per run (default 5; 0 = unlimited) |
| maxCaptionFilesPerItem | cap caption files per item (default 5; 0 = all) |
| monitorMode, alertOnNewItem | recurring new-item watcher + alerts |
| webhookUrl, slackWebhookUrl, emailRecipients | alert channels |
| proxyConfiguration, requestConcurrency | proxy + parallelism |
Output
Each caption file is one dataset record (fields above). Items with no captions, access-restricted items, and private/empty caption files are emitted as flagged rows (restricted, note) so you always know why a transcript is missing. Export to JSON, CSV, Excel, HTML, or RSS, or fetch via the Apify API.
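Since flagged rows share the dataset with real transcripts, a consumer usually wants to separate the two. The sketch below shows one way to do that in Python, relying only on the restricted, note, and transcript fields described above. The function name is hypothetical, and the fallback reason strings are made up for illustration.

```python
def split_records(records: list[dict]) -> tuple[list[dict], list[tuple[str, str]]]:
    """Separate usable transcripts from flagged rows.

    Returns (usable_records, [(identifier, reason), ...]).
    """
    usable, flagged = [], []
    for record in records:
        if record.get("restricted"):
            # Access-restricted items download as empty bodies.
            flagged.append((record["identifier"], "access-restricted"))
        elif not record.get("transcript"):
            # Fall back to a generic reason when the row carries no note.
            flagged.append((record["identifier"], record.get("note") or "no transcript"))
        else:
            usable.append(record)
    return usable, flagged
```

Logging the flagged list after each run gives you a quick audit of which items produced nothing and why.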
How much does it cost?
Pay-per-event — and with no transcription compute, it's cheap:
| Event | What it covers | Price |
|---|---|---|
| lot-scraped | each record returned | $0.004 / record |
| lot-detail-enriched | each caption file downloaded + parsed | $0.004 / file |
| monitor-run-completed | each scheduled watch run | $0.05 / run |
| new-lot-detected | each new item found by the monitor | $0.02 / item |
| alert-delivered | each Slack/email/webhook push | $0.005 / alert |
That's about $8 per 1,000 transcripts (fetch + parse). No charge for actor starts or empty runs.
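To budget a run before starting it, the prices in the table above can be put into a small estimator. This is a sketch: the event names and prices are copied from the table, while the function itself is hypothetical and simply multiplies the counts you supply.

```python
# Prices in USD per event, copied from the pricing table.
PRICES_USD = {
    "lot-scraped": 0.004,
    "lot-detail-enriched": 0.004,
    "monitor-run-completed": 0.05,
    "new-lot-detected": 0.02,
    "alert-delivered": 0.005,
}


def estimate_cost(event_counts: dict[str, int]) -> float:
    """Sum price x count over the given events (raises KeyError on unknown names)."""
    total = sum(PRICES_USD[name] * count for name, count in event_counts.items())
    return round(total, 4)


# 1,000 films with one caption file each: 1,000 records + 1,000 parsed files.
crawl_cost = estimate_cost({"lot-scraped": 1000, "lot-detail-enriched": 1000})
```

Here crawl_cost is 8.0, matching the $8 per 1,000 transcripts figure.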
Monitor & alert setup
- Set a searchQuery (e.g. collection:prelinger or subject:"television news").
- Turn on monitorMode (and keep alertOnNewItem on).
- Add a webhookUrl, slackWebhookUrl, and/or emailRecipients.
- Create an Apify Schedule (e.g. daily). The first run baselines the seen items; every later run outputs and alerts only new items. State persists in a named key-value store (archive-transcript-monitor-state), so it survives between runs.
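The baseline-then-diff behavior described above can be sketched in a few lines of Python. This is an illustration under assumptions, not the actor's code: the state is a plain dict standing in for the key-value store, the seen and baselined keys are hypothetical, and the first run is assumed to report nothing as new.

```python
def diff_new_items(state: dict, current_ids: list[str]) -> list[str]:
    """Return identifiers not seen in earlier runs and update state in place."""
    seen = set(state.get("seen", []))
    first_run = not state.get("baselined", False)
    # On the first run everything is recorded but nothing counts as new.
    new_ids = [] if first_run else [i for i in current_ids if i not in seen]
    state["seen"] = sorted(seen | set(current_ids))
    state["baselined"] = True
    return new_ids
```

Calling it three times with ["a", "b"], then ["a", "b", "c"], then the same list again yields [], ["c"], and [], which is the pattern a daily schedule would produce when one new captioned item appears.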
How does it work without AI transcription?
Archive.org items carry caption files: uploader-provided .srt/.vtt subtitles and archive's own autogenerated English ASR (.asr.srt). This actor reads the item's public metadata, picks the caption files, downloads them, and runs a hardened parser that handles the variants found in the wild: BOM + CRLF files, 2-digit millisecond ASR stamps, <i> formatting tags, VTT headers with trailing junk, and cues without indices. It does not run speech-to-text, so there's no GPU cost and a run takes only as long as the downloads and parsing.
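To make the parsing step concrete, here is a minimal Python sketch of a tolerant SRT/VTT reader covering the quirks listed above. It is not the actor's parser. Two assumptions are worth flagging: short fractional stamps are read as a decimal fraction of a second (so ",50" becomes 500 ms), and timestamps always include hours (VTT's short mm:ss.mmm form is not handled).

```python
import re

# Matches "hh:mm:ss,f --> hh:mm:ss,f" with 1-3 fractional digits, ',' or '.'.
TIMING = re.compile(
    r"(\d{1,2}):(\d{2}):(\d{2})[,.](\d{1,3})\s*-->\s*"
    r"(\d{1,2}):(\d{2}):(\d{2})[,.](\d{1,3})"
)
TAG = re.compile(r"</?[a-zA-Z][^>]*>")  # e.g. <i> ... </i>


def to_ms(hours: str, minutes: str, seconds: str, fraction: str) -> int:
    # Assumption: the digits are a decimal fraction, so "50" means 500 ms.
    whole = (int(hours) * 3600 + int(minutes) * 60 + int(seconds)) * 1000
    return whole + int(fraction.ljust(3, "0"))


def parse_captions(raw: str) -> list[dict]:
    """Parse SRT or VTT text into cues, tolerating BOM, CRLF, and missing indices."""
    text = raw.lstrip("\ufeff").replace("\r\n", "\n").replace("\r", "\n")
    cues = []
    for block in re.split(r"\n\s*\n", text):
        lines = [line for line in block.split("\n") if line.strip()]
        for pos, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                break
        else:
            continue  # no timing line: a header (e.g. "WEBVTT ...") or junk
        body = " ".join(TAG.sub("", line).strip() for line in lines[pos + 1:]).strip()
        if not body:
            continue
        parts = match.groups()
        cues.append({
            "index": len(cues) + 1,  # renumber; source indices may be absent
            "start_ms": to_ms(*parts[:4]),
            "end_ms": to_ms(*parts[4:]),
            "text": body,
        })
    return cues
```

Locating the timing line by pattern, not by position, is what lets the same loop handle blocks with and without a leading cue index.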
Is it legal to scrape archive.org captions?
The Internet Archive is a non-profit library that publishes these items and APIs for public access, and much of the captioned material is public-domain era film. The output is published media content and item stats, not personal data. Scraping public data is generally legal, but you are responsible for your use — review archive.org's Terms of Use and each item's rights/license statement before redistributing content.
FAQ
Which items have captions?
3M+ movies/audio items carry .srt/.vtt files — classic films, Prelinger educational shorts, TV news, lectures. The search mode finds them for you (it filters to format:"SubRip" OR "Web Video Text Tracks" automatically).
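If you want to preview what a scoped query matches before running the actor, you can build the equivalent request against archive.org's public advancedsearch.php endpoint. The sketch below only constructs the URL. The exact scoping clause the actor appends is not documented here, so CAPTION_SCOPE is an assumption modeled on the filter described above, and build_search_url is a hypothetical helper.

```python
from urllib.parse import urlencode

# Assumed scope: captioned movies/audio, modeled on the filter described above.
CAPTION_SCOPE = (
    'mediatype:(movies OR audio) AND format:("SubRip" OR "Web Video Text Tracks")'
)


def build_search_url(query: str, rows: int = 50, page: int = 1) -> str:
    """Build an advanced-search URL for captioned items, sorted by downloads."""
    params = [
        ("q", f"({query}) AND {CAPTION_SCOPE}"),
        ("fl[]", "identifier"),
        ("fl[]", "title"),
        ("sort[]", "downloads desc"),
        ("rows", rows),
        ("page", page),
        ("output", "json"),
    ]
    return "https://archive.org/advancedsearch.php?" + urlencode(params)
```

Opening the result of build_search_url("collection:prelinger", rows=10) in a browser returns JSON whose response.numFound field shows how many items the query matches.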
Is there a Whisper/ASR step?
No. It downloads the caption files archive.org already publishes (including archive's own ASR track), so it's fast and cheap.
Can I get subtitles for video editing?
Yes — add srt and/or vtt to transcriptFormats. The actor normalizes archive's non-standard 2-digit-millisecond stamps to proper hh:mm:ss,mmm, so the files work in any editor/player.
What about multiple languages?
Each caption file becomes its own row with caption_lang_code parsed from the filename. Use languages to keep only the ones you want.
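The filename convention is not spelled out here, so the following Python sketch is a guess at how such parsing could work, not a description of the actor. It assumes translated tracks are named like name.<code>.srt with a 2- or 3-letter code, and it uses the documented rule that .asr marks archive's English ASR. The function name is hypothetical.

```python
import re
from typing import Optional


def caption_language(file_name: str) -> tuple[Optional[str], bool]:
    """Guess (language code, is_autogenerated) from a caption file name."""
    stem = re.sub(r"\.(srt|vtt)$", "", file_name, flags=re.IGNORECASE)
    if stem.lower().endswith(".asr"):
        return "en", True  # .asr = archive's English ASR track
    # Assumed pattern: a trailing ".xx" or ".xxx" language code.
    match = re.search(r"\.([A-Za-z]{2,3})$", stem)
    return (match.group(1).lower() if match else None), False
```

With this rule, "Doctorin1946.asr.srt" gives ("en", True) and a file with no code gives (None, False). A name whose last dotted part happens to be 2 or 3 letters would be misread as a language, so treat the result as a hint.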
Why did an item return no transcript?
Three honest cases, all flagged in the row: the item has no caption files (note), the item is access-restricted (its files are private and download as empty bodies — restricted: true), or a specific file is private/zero-byte. The actor never reports those as empty "successes".
Can I crawl a whole collection?
Yes — searchQuery: "collection:{name}" + maxItems: 0. Archive's search window caps at 10,000 rows per query; slice bigger collections by date (publicdate:[2020-01-01 TO 2021-01-01]).
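One way to produce the date slices mentioned above is to generate one query per year and run the actor once per query. The sketch below is a hypothetical helper that follows the same publicdate range form. Because Lucene's square-bracket ranges include both endpoints, adjacent slices share their boundary date, so deduplicate results by identifier.

```python
from datetime import date


def yearly_slices(query: str, first_year: int, last_year: int) -> list[str]:
    """Return one search query per calendar year, each bounded by publicdate."""
    slices = []
    for year in range(first_year, last_year + 1):
        start = date(year, 1, 1).isoformat()
        end = date(year + 1, 1, 1).isoformat()
        slices.append(f"({query}) AND publicdate:[{start} TO {end}]")
    return slices
```

If a single year still exceeds 10,000 rows, the same idea applies with monthly boundaries.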
How fresh is monitor mode?
Every scheduled run re-queries your search and diffs against the named state store, so you get only items it hasn't seen before, plus optional Slack/webhook/email alerts.
Does it need a proxy or login?
No login or API key. Archive.org's endpoints are public; the default datacenter proxy rotation is plenty.
How do I export?
JSON, CSV, Excel, HTML, or RSS from the Dataset tab, or via the Apify API.
What does a 1,000-film crawl cost?
With one caption file each: 1,000 × ($0.004 + $0.004) = about $8.
Feedback
Want full-text search inside transcripts, TV-news-specific fields, or bulk export to a single file? Open an issue on the actor.