Case Study
RAG Video Analysis System
Turns any YouTube video into notes, a mind map, flashcards, quizzes, and a grounded chat interface. Under $0.01 per video in local setup.
Why it exists
The problem isn't summarization. It's retrieval.
I was rewatching long videos just to find a specific point I half-remembered. A two-hour lecture compressed to five bullet points loses the thing you actually needed: the specific argument at 47:22, the counterargument at 1:12:40, the timestamp you'd return to. Compression is a blunt instrument.
The insight that shaped the build: what I actually needed wasn't a shorter version of the video, it was a searchable one. A system where I could ask "what did they say about retrieval augmentation?" and get back the exact segment, with a timestamp, grounded in the actual transcript. The summarization is a bonus output; the retrieval is the product.
Existing tools (Otter.ai, Notion AI, etc.) are built for meetings: short, structured, English-language. Long-form educational content (a 3-hour conference talk, a dense academic lecture) breaks all their assumptions about length, density, and the fact that users want to navigate, not just read.
System Architecture
Two parallel pipelines, one retrieval layer
The system runs two pipelines in parallel once a URL is submitted: summarization (streams immediately to the frontend) and indexing (runs in the background to build the retrieval layer for chat/flashcards/quiz).
User → AI → Output pipeline for chat: User asks a question → LLM calls search_transcript tool (up to 3 iterations) → each call queries Qdrant with hybrid search (dense + sparse fusion) → top 4 chunks returned with timestamps → LLM generates a grounded answer citing the exact moments in the transcript.
Streaming: Summarization results stream to the frontend via Server-Sent Events (SSE) as tokens arrive. The user sees output within seconds of submitting, not after a full 2-minute wait.
Design Decisions
Five non-obvious calls
Hybrid search over pure vector search
Pure vector (dense) search is semantically rich but misses exact keyword matches: it will find "retrieval" when you ask about "RAG" but fail if you search for a speaker's specific name. Pure BM25 (sparse) is exact but semantically blind. Hybrid fuses both via Qdrant's reciprocal rank fusion, giving you semantic understanding and keyword precision simultaneously.
Tradeoff: Requires maintaining two embedding models (Voyage dense + FastEmbed BM25) and a Qdrant collection with both vector types. More infrastructure, meaningfully better retrieval quality.
SSE streaming, not polling
A 2-minute wait for a complete summary is psychologically brutal even if the total time is the same. Server-Sent Events let the first tokens appear within 3–5 seconds. The user sees the summary building in real time and trusts the system is working. Polling would require a round-trip every N seconds with no guaranteed update.
Tradeoff: SSE is a persistent connection that holds a Vercel function alive longer, which matters on the serverless free tier. Demo mode mitigates this by using hardcoded responses for public traffic.
Background indexing, not blocking
The summarization and the vector indexing are independent operations. Running them sequentially would double the total wait. Running indexing in the background means the user gets their summary immediately; by the time they finish reading and want to chat, the index is usually ready.
Tradeoff: The chat feature has a brief unavailability window right after summarization. A status endpoint (/api/index/status) surfaces indexing progress; the frontend can show "building retrieval index..." rather than silently failing.
Tool-calling for chat, not naive RAG
In naive RAG, you embed the question, retrieve k chunks, stuff them into the context, and ask the LLM to answer. The problem: one retrieval pass often misses the answer if the question and the relevant transcript segment use different vocabulary. Tool-calling lets the LLM iteratively search: it can run search_transcript up to 3 times with different queries before synthesizing a response.
Tradeoff: 3× the retrieval cost and latency per chat message compared to single-pass RAG. The quality improvement for dense educational content is worth it.
Demo mode for public deployment
The full system consumes API credits on every request: Voyage embeddings, LLM calls, Qdrant queries. A public deployment with no auth would burn through credits with no control. Demo mode runs the full UI with a fixed demo video and hardcoded responses for chat/flashcards/quiz. The real pipeline is available for local setup or paid access.
Tradeoff: The public demo undersells what the system does: visitors see a curated example, not their own video. But it's the only way to show a sophisticated product without running up a bill on strangers' requests.
Operational Thinking
The numbers that actually govern the system
AI System Thinking
Prompt orchestration and retrieval design
Video type detection: The first LLM call classifies the video (lecture, interview, tutorial, documentary) and adjusts the summarization prompt accordingly. A lecture prompt extracts concepts, key arguments, and conceptual relationships. An interview prompt extracts questions, answers, and positions stated.
Map-reduce for length: Videos with transcripts over 46,000 characters trigger map-reduce. The transcript is split into up to 16 chunks. Each chunk is summarized independently (the map phase, 2 concurrent LLM calls). The partial summaries are then synthesized into a final output (the reduce phase). This keeps any single LLM call well within context window limits.
Hybrid retrieval mechanics: On a chat query, Qdrant runs both dense cosine similarity (Voyage embeddings) and sparse BM25 scoring. The two result sets are fused via Reciprocal Rank Fusion (RRF). The top 4 chunks by fused score are passed to the LLM as grounded context, along with their timestamps.
Tool-calling loop: The chat LLM has access to a single tool: search_transcript(query: string). It can call this up to 3 times with different queries before generating its final answer. This allows multi-hop retrieval (find the initial claim, then search for the evidence, then search for the counterargument) within a single user turn.
Memory / context handling: there's none, deliberately. Each chat turn is stateless: the conversation history is passed in the context window, not persisted. This keeps the system simple and the retrieval grounded in the transcript, not in prior chat context that might drift. The transcript is the single source of truth.
Let's talk.
Open to full-time roles and consulting engagements.
Based in India · Open to relocate globally.