RAG Video Analysis System: How I Built It

Why it exists

The problem isn't summarization. It's retrieval.

I was rewatching long videos just to find a specific point I half-remembered. A two-hour lecture compressed to five bullet points loses the thing you actually needed: the specific argument at 47:22, the counterargument at 1:12:40, the timestamp you'd return to. Compression is a blunt instrument.

The insight that shaped the build: what I actually needed wasn't a shorter version of the video, it was a searchable one. A system where I could ask "what did they say about retrieval augmentation?" and get back the exact segment, with a timestamp, grounded in the actual transcript. The summarization is a bonus output; the retrieval is the product.

Existing tools (Otter.ai, Notion AI, etc.) are built for meetings: short, structured, English-language. Long-form educational content (a 3-hour conference talk, a dense academic lecture) breaks their assumptions about length and density, and it ignores the fact that users of such content want to navigate it, not just read a digest.

System Architecture

Two parallel pipelines, one retrieval layer

The system runs two pipelines in parallel once a URL is submitted: summarization (streams immediately to the frontend) and indexing (runs in the background to build the retrieval layer for chat/flashcards/quiz).

Input: YouTube URL (user input).

Ingestion: yt-dlp + Transcript API, producing metadata, subtitles, and segments.

The next two pipelines run in parallel:

Summarization pipeline (Claude / OpenRouter): Type detect → Trim → Map-reduce → Stream SSE. Outputs: summary, sections, deep dive, mind map.

Indexing pipeline (Voyage AI + FastEmbed, background job): Chunk → Embed → Store. Dense 512-dim + BM25 sparse vectors → Qdrant.

Upstash Redis: transcript cache (24h TTL) and job state.

Chat / Flashcards / Quiz: tool-calling RAG, Qdrant hybrid search, timestamp-grounded output.

User → AI → Output pipeline for chat: User asks a question → LLM calls search_transcript tool (up to 3 iterations) → each call queries Qdrant with hybrid search (dense + sparse fusion) → top 4 chunks returned with timestamps → LLM generates a grounded answer citing the exact moments in the transcript.

Streaming: Summarization results stream to the frontend via Server-Sent Events (SSE) as tokens arrive. The user sees output within seconds of submitting, not after a full 2-minute wait.
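To make the streaming step concrete, here is a minimal sketch of how tokens could be framed as Server-Sent Events. The event names `token` and `done` and the JSON payload shape are assumptions for illustration, not the project's actual wire format.

```python
import json
from typing import Iterable, Iterator


def sse_event(data: dict, event: str = "message") -> str:
    """Serialize one SSE frame: field lines terminated by a blank line."""
    # json.dumps escapes newlines, so the payload always fits on one data line.
    payload = json.dumps(data)
    return f"event: {event}\ndata: {payload}\n\n"


def stream_summary(tokens: Iterable[str]) -> Iterator[str]:
    """Yield one frame per token, then a final 'done' event (hypothetical names)."""
    for token in tokens:
        yield sse_event({"text": token}, event="token")
    yield sse_event({}, event="done")


if __name__ == "__main__":
    for frame in stream_summary(["Hybrid ", "search ", "wins."]):
        print(frame, end="")
```

Because each frame is self-delimiting, the frontend can append text as soon as a frame arrives instead of waiting for the full response.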

Design Decisions

Five non-obvious calls

Hybrid search over pure vector search

Pure vector (dense) search is semantically rich but misses exact keyword matches: it will surface a passage about "retrieval" when you ask about "RAG", but it can fail when you search for a specific speaker's name. Pure BM25 (sparse) is exact but semantically blind. Hybrid search fuses both via Qdrant's reciprocal rank fusion, giving you semantic understanding and keyword precision at once.

Tradeoff: Requires maintaining two embedding models (Voyage dense + FastEmbed BM25) and a Qdrant collection with both vector types. More infrastructure, meaningfully better retrieval quality.

SSE streaming, not polling

A 2-minute wait for a complete summary is psychologically brutal even if the total time is the same. Server-Sent Events let the first tokens appear within 3–5 seconds. The user sees the summary building in real time and trusts the system is working. Polling would require a round-trip every N seconds with no guaranteed update.

Tradeoff: SSE is a persistent connection that holds a Vercel function alive longer, which matters on the serverless free tier. Demo mode mitigates this by using hardcoded responses for public traffic.

Background indexing, not blocking

The summarization and the vector indexing are independent operations. Running them sequentially would double the total wait. Running indexing in the background means the user gets their summary immediately; by the time they finish reading and want to chat, the index is usually ready.

Tradeoff: The chat feature has a brief unavailability window right after summarization. A status endpoint (/api/index/status) surfaces indexing progress; the frontend can show "building retrieval index..." rather than silently failing.
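A rough sketch of the non-blocking pattern, using asyncio with an in-memory dictionary standing in for the job state. The function names, the status strings, and the sleep calls are placeholders, not the real handler; the write-up keeps job state in Upstash Redis.

```python
import asyncio

# Hypothetical in-memory job state (the real system stores this in Redis).
index_status: dict[str, str] = {}


def get_index_status(video_id: str) -> str:
    """Stand-in for what a status endpoint would report."""
    return index_status.get(video_id, "pending")


async def build_index(video_id: str) -> None:
    """Stand-in for chunk -> embed -> store."""
    index_status[video_id] = "indexing"
    await asyncio.sleep(0.05)  # placeholder for embedding and upsert work
    index_status[video_id] = "ready"


async def stream_summary():
    """Stand-in for the summarization token stream."""
    for token in ("First ", "tokens ", "arrive ", "early."):
        await asyncio.sleep(0.001)
        yield token


async def handle_request(video_id: str) -> tuple[str, str]:
    # Start indexing without awaiting it, so the summary is not blocked.
    index_task = asyncio.create_task(build_index(video_id))
    summary = ""
    async for token in stream_summary():
        summary += token
    # This demo waits for the task so the script exits cleanly; a real
    # handler would return here and let the status check report progress.
    await index_task
    return summary, get_index_status(video_id)


if __name__ == "__main__":
    print(asyncio.run(handle_request("demo-video")))
```

The frontend can poll the status value and show the "building retrieval index..." message until it reads `ready`.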

Tool-calling for chat, not naive RAG

In naive RAG, you embed the question, retrieve k chunks, stuff them into the context, and ask the LLM to answer. The problem: one retrieval pass often misses the answer if the question and the relevant transcript segment use different vocabulary. Tool-calling lets the LLM iteratively search: it can run search_transcript up to 3 times with different queries before synthesizing a response.

Tradeoff: Up to 3× the retrieval cost and latency per chat message compared to single-pass RAG, since the model may use all three searches. The quality improvement for dense educational content is worth it.

Demo mode for public deployment

The full system consumes API credits on every request: Voyage embeddings, LLM calls, Qdrant queries. A public deployment with no auth would burn through credits with no control. Demo mode runs the full UI with a fixed demo video and hardcoded responses for chat/flashcards/quiz. The real pipeline is available for local setup or paid access.

Tradeoff: The public demo undersells what the system does: visitors see a curated example, not their own video. But it's the only way to show a sophisticated product without running up a bill on strangers' requests.

Operational Thinking

The numbers that actually govern the system

Latency

1–2 minutes

URL to complete summary. Streaming means visible output starts within ~5 seconds; the full summary takes 1–2 minutes depending on video length and LLM speed.

Caching strategy

24h Redis TTL

A video re-processed within 24 hours skips transcript fetching entirely. The Qdrant vector index persists indefinitely, so repeat chat sessions on the same video need no re-indexing and are available immediately.
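The cache-aside pattern behind the 24-hour TTL can be sketched as follows. This uses an in-memory class as a stand-in for Redis key expiry; the class, the `transcript:` key prefix, and the `fetch` callable are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Optional

TRANSCRIPT_TTL_SECONDS = 24 * 60 * 60  # 24h, as stated above


class TTLCache:
    """In-memory stand-in for a key-value store with per-key expiry."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._store: dict[str, tuple[float, str]] = {}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if self._clock() >= expires_at:
            del self._store[key]  # expired: treat as a miss
            return None
        return value

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self._ttl, value)


def get_transcript(video_id: str, cache: TTLCache, fetch: Callable[[str], str]) -> str:
    """Return the cached transcript, calling fetch only on a miss."""
    key = f"transcript:{video_id}"  # hypothetical key format
    cached = cache.get(key)
    if cached is not None:
        return cached
    transcript = fetch(video_id)
    cache.set(key, transcript)
    return transcript
```

Passing the clock in as a parameter keeps the expiry logic testable without waiting a day.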

Cost per video

<$0.01

Using Claude Haiku + Voyage 3-lite (512-dim). Map-reduce on a 90-min lecture: ~6 chunks × 2 map calls + 1 reduce call ≈ 13 LLM calls total at Haiku pricing.

Rate limiting

5 req / 15 min

Per-IP throttle on the summarize endpoint. Prevents runaway API spend from a single session. Configurable via environment variables.
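One way to implement a 5-requests-per-15-minutes throttle is a sliding window of timestamps per IP. The write-up does not say whether the real limiter uses a sliding or fixed window, so treat this as an assumed design; the class name is hypothetical, and on serverless the counters would need to live in a shared store rather than process memory.

```python
import time
from collections import defaultdict, deque
from typing import Callable


class SlidingWindowLimiter:
    """Allow at most max_requests per client within window_seconds."""

    def __init__(self, max_requests: int = 5, window_seconds: float = 15 * 60,
                 clock: Callable[[], float] = time.monotonic):
        self.max_requests = max_requests
        self.window_seconds = window_seconds
        self._clock = clock
        self._hits: dict[str, deque] = defaultdict(deque)

    def allow(self, ip: str) -> bool:
        now = self._clock()
        hits = self._hits[ip]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window_seconds:
            hits.popleft()
        if len(hits) >= self.max_requests:
            return False  # over the limit: reject without recording the hit
        hits.append(now)
        return True
```

In a deployment, the two constructor defaults would be read from environment variables, matching the configurability described above.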

Chunk sizing

512 tokens / 64 overlap

The target chunk size matches the Voyage embedding model's sweet spot. The 64-token overlap preserves context across chunk boundaries, which is critical for multi-sentence arguments that span chunks.
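The windowing itself is simple: each chunk starts 512 − 64 = 448 tokens after the previous one. A minimal sketch, assuming the transcript is already tokenized into a list (the real tokenizer, and how timestamps are attached to chunks, are not shown here):

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 64) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the transcript
    return chunks


if __name__ == "__main__":
    demo = [f"t{i}" for i in range(1000)]
    for chunk in chunk_tokens(demo):
        print(chunk[0], chunk[-1], len(chunk))
```

The early `break` avoids emitting a trailing chunk that would consist only of tokens already covered by the previous window.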

Multilingual

Auto-detected

yt-dlp pulls available subtitle tracks; the system uses whatever YouTube provides. Voyage 3-lite supports 100+ languages for embeddings. No special handling required for non-English videos.

AI System Thinking

Prompt orchestration and retrieval design

Video type detection: The first LLM call classifies the video (lecture, interview, tutorial, documentary) and adjusts the summarization prompt accordingly. A lecture prompt extracts concepts, key arguments, and conceptual relationships. An interview prompt extracts questions, answers, and positions stated.
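A sketch of how the detected type could select the prompt focus. The lecture and interview wording comes from the description above; the fallback text, the `unknown` label, and the prompt template are assumptions, since the write-up does not describe the tutorial or documentary prompts.

```python
VIDEO_TYPES = ("lecture", "interview", "tutorial", "documentary")

PROMPT_FOCUS = {
    "lecture": "concepts, key arguments, and conceptual relationships",
    "interview": "questions, answers, and positions stated",
}

# Hypothetical fallback for types whose prompts are not described here.
DEFAULT_FOCUS = "the main points in the order they appear"


def build_summary_prompt(video_type: str, transcript: str) -> str:
    """Normalize the classifier's label and build a type-specific prompt."""
    label = video_type.strip().lower()
    if label not in VIDEO_TYPES:
        label = "unknown"  # guard against unexpected classifier output
    focus = PROMPT_FOCUS.get(label, DEFAULT_FOCUS)
    return (
        f"This transcript is from a video of type '{label}'. "
        f"Summarize it, extracting {focus}.\n\n{transcript}"
    )
```

Normalizing the label before the lookup matters because the classification comes from an LLM call and may vary in casing or whitespace.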

Map-reduce for length: Videos with transcripts over 46,000 characters trigger map-reduce. The transcript is split into up to 16 chunks. Each chunk is summarized independently (the map phase, 2 concurrent LLM calls). The partial summaries are then synthesized into a final output (the reduce phase). This keeps any single LLM call well within context window limits.
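The map-reduce flow can be sketched with asyncio. The 46,000-character threshold and the 16-chunk cap come from the description above; `CHUNK_CHARS`, the `summarize` callable, and reading "2 concurrent LLM calls" as a cap on in-flight map calls are assumptions made for this example.

```python
import asyncio
import math
from typing import Awaitable, Callable, List

MAP_REDUCE_THRESHOLD = 46_000  # characters
MAX_CHUNKS = 16
MAP_CONCURRENCY = 2
CHUNK_CHARS = 15_000  # hypothetical target chunk size

Summarize = Callable[[str], Awaitable[str]]


def split_transcript(text: str) -> List[str]:
    """Split into roughly equal pieces, never more than MAX_CHUNKS."""
    n = min(MAX_CHUNKS, max(1, math.ceil(len(text) / CHUNK_CHARS)))
    size = max(1, math.ceil(len(text) / n))
    return [text[i:i + size] for i in range(0, len(text), size)]


async def summarize_video(text: str, summarize: Summarize) -> str:
    if len(text) <= MAP_REDUCE_THRESHOLD:
        return await summarize(text)  # short transcript: one call is enough

    semaphore = asyncio.Semaphore(MAP_CONCURRENCY)

    async def map_one(chunk: str) -> str:
        async with semaphore:  # at most MAP_CONCURRENCY map calls in flight
            return await summarize(chunk)

    # gather preserves input order, so partial summaries stay chronological.
    partials = await asyncio.gather(*(map_one(c) for c in split_transcript(text)))
    return await summarize("\n\n".join(partials))  # reduce phase
```

Capping concurrency with a semaphore keeps the map phase inside provider rate limits while still overlapping calls.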

Hybrid retrieval mechanics: On a chat query, Qdrant runs both dense cosine similarity (Voyage embeddings) and sparse BM25 scoring. The two result sets are fused via Reciprocal Rank Fusion (RRF). The top 4 chunks by fused score are passed to the LLM as grounded context, along with their timestamps.
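In the real system Qdrant performs the fusion server-side, but the idea fits in a few lines. Each chunk scores 1 / (k + rank) in every ranking it appears in, and the scores are summed. The constant k = 60 is a commonly used default; the write-up does not state which value Qdrant applies, and the chunk ids here are made up.

```python
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]],
                           k: int = 60, top_n: int = 4) -> List[str]:
    """Fuse several ranked id lists into one list of the top_n ids."""
    scores: Dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]


if __name__ == "__main__":
    dense = ["a", "b", "c"]   # hypothetical dense (cosine) ranking
    sparse = ["c", "a", "d"]  # hypothetical sparse (BM25) ranking
    print(reciprocal_rank_fusion([dense, sparse]))
```

Because RRF uses only ranks, it never has to reconcile cosine similarities with BM25 scores, which live on different scales.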

Tool-calling loop: The chat LLM has access to a single tool: search_transcript(query: string). It can call this up to 3 times with different queries before generating its final answer. This allows multi-hop retrieval (find the initial claim, then search for the evidence, then search for the counterargument) within a single user turn.
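The bounded loop can be sketched independently of any LLM SDK. Here `decide` is a hypothetical stand-in for the model: given the question, the evidence so far, and whether searching is still allowed, it returns either `{"query": ...}` to search again or `{"answer": ...}` to finish. The real system would express this through the provider's tool-calling API.

```python
from typing import Callable, Dict, List

MAX_SEARCHES = 3

Chunk = Dict[str, str]   # e.g. {"timestamp": "47:22", "text": "..."}
Action = Dict[str, str]  # {"query": "..."} or {"answer": "..."}


def answer_question(
    question: str,
    decide: Callable[[str, List[Chunk], bool], Action],
    search_transcript: Callable[[str], List[Chunk]],
) -> str:
    """Let the model search up to MAX_SEARCHES times, then require an answer."""
    evidence: List[Chunk] = []
    for _ in range(MAX_SEARCHES):
        action = decide(question, evidence, True)
        if "answer" in action:
            return action["answer"]  # the model chose to stop early
        evidence.extend(search_transcript(action["query"]))
    # Search budget spent: the model must answer from what it has gathered.
    return decide(question, evidence, False)["answer"]
```

The hard cap is what bounds cost and latency per message: the model may stop after one search, but it can never exceed three.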

Memory / context handling: there is no persistent memory, deliberately. Each chat turn is stateless: the conversation history travels in the context window of each request and is never persisted. This keeps the system simple and the retrieval grounded in the transcript, not in prior chat context that might drift. The transcript is the single source of truth.

