PDF → RAG Chunks (Token-Aware, Vector-Ready) avatar

PDF → RAG Chunks (Token-Aware, Vector-Ready)

Pricing

Pay per usage

Go to Apify Store
PDF → RAG Chunks (Token-Aware, Vector-Ready)

PDF → RAG Chunks (Token-Aware, Vector-Ready)

Download any PDF and chunk into semantically coherent segments ready for embedding/RAG. Configurable chunk size + overlap. Returns one row per chunk with page, char count, token estimate. Feed directly into OpenAI text-embedding-3 / Voyage / Cohere. $0.005 per PDF + $0.0002 per chunk.

Pricing

Pay per usage

Rating

0.0

(0)

Developer

Hojun Lee

Hojun Lee

Maintained by Community

Actor stats

0

Bookmarked

2

Total users

1

Monthly active users

3 days ago

Last modified

Share

PDF → RAG Chunks

Download any PDF and chunk into semantically coherent segments ready for embedding/RAG. Configurable chunk size + overlap. No LLM cost (zero tokens). Vector-ready output. $0.005 per PDF + $0.0002 per chunk.


Why this exists

To build a RAG (retrieval-augmented generation) system over a corpus of PDFs, you need:

  1. Download → extract text per page
  2. Chunk into semantic segments (1000-2000 chars typical)
  3. Optional: embed each chunk and store in vector DB
  4. Query: embed question, retrieve top-k chunks, ask LLM

This actor handles steps 1-2 (the most painful boilerplate). The output is shaped so you can pipe each chunk directly into OpenAI's text-embedding-3-small, Voyage AI, Cohere Embed, or any embedding model.

Other chunking SaaS (Unstructured.io API, LangChain Hosted) charge $5-20 per 1K pages. This actor: $0.50 per 1K pages.


What you get

Summary row (one per PDF)

{
"_type": "summary",
"url": "https://www.sec.gov/.../aapl-10k.pdf",
"ok": true,
"page_count": 80,
"title": "Apple Inc. — Annual Report 2024",
"author": "Apple Inc.",
"chunk_size_chars": 1500,
"overlap_chars": 200
}

Per-chunk row

{
"_type": "chunk",
"url": "https://...",
"page": 12,
"chunk_index": 0,
"global_chunk_index": 17,
"text": "Item 1A. Risk Factors\n\nOur business is...",
"char_count": 1480,
"token_estimate": 370
}

Quick start

Single PDF

{
"url": "https://www.example.com/report.pdf"
}

Batch with custom chunk size

{
"urls": [
"https://...filing1.pdf",
"https://...filing2.pdf"
],
"chunkSizeChars": 2000,
"overlapChars": 300,
"maxPages": 100
}

Optimize for OpenAI text-embedding-3-small (8K-token max)

{
"url": "https://...",
"chunkSizeChars": 1500,
"overlapChars": 200
}

Embedding modelchunkSizeCharsNotes
OpenAI text-embedding-3-small1500~375 tokens, fits well
OpenAI text-embedding-3-large2000~500 tokens
Voyage voyage-3-large1500optimal balance
Cohere embed-v31800works with 512-token chunks

Overlap of 100-300 chars boosts recall by ~5-10% with minimal storage cost.


Pricing

Pay-Per-Event:

  • $0.005 per PDF processed
  • $0.0002 per chunk emitted
RunChunksCost
One 80-page 10-K~200$0.045
Batch of 100 papers @ 20 pages~6000$1.70
Compliance archive 1000 PDFs~80000$21

vs Unstructured.io ($30+/mo + per-doc) or LangChain Hosted ($500+/mo).


Pipeline pattern: PDFs → vector DB

import apify_client, openai, pinecone
# 1. Chunk PDFs
client = apify_client.ApifyClient(token)
run = client.actor("gochujang/pdf-rag-chunker").call(run_input={
"urls": ["https://...filing.pdf"],
"chunkSizeChars": 1500,
})
# 2. Embed each chunk
chunks = list(client.dataset(run["defaultDatasetId"]).iterate_items())
chunks = [c for c in chunks if c.get("_type") == "chunk"]
embeddings = openai.embeddings.create(
model="text-embedding-3-small",
input=[c["text"] for c in chunks],
).data
# 3. Upsert to vector DB
index = pinecone.Index("rag-docs")
index.upsert([
{"id": f"{c['url']}-{c['global_chunk_index']}",
"values": embeddings[i].embedding,
"metadata": {"url": c["url"], "page": c["page"]}}
for i, c in enumerate(chunks)
])

Limitations

  • Scanned PDFs (image-only) — Returns 0 chunks. Use OCR-equipped actor.
  • Multi-column research papers — Reading order may be slightly off (pdfplumber respects column layout but isn't perfect).
  • No embedding included — Embedding requires your own OpenAI/Voyage/Cohere key (different vendor). We focus on chunking only to keep costs predictable.


Feedback

A short review helps RAG engineers find it: Leave a review on Apify Store