Case Study
Palzo
A voice-first research platform for India's PG medical doctors. Multilingual patient interviews, auto-transcription, and AI-powered thesis output, built because no existing tool understood Indian medical research.
Why it exists
My sister's thesis. And the gap behind it.
My sister is a PG doctor. Like every PG doctor in India, she has to complete a research thesis before she can get her degree, a non-negotiable National Medical Commission (NMC) requirement. The thesis means months of patient data collection across hospital wards before she writes a single word of the actual report.
The current workflow: paper questionnaires handed to patients (many of whom can't read English), manual transcription of responses, data entry into Excel, analysis by a statistician friend, writing in Word, formatting citations by hand. Each step is a bottleneck. Each bottleneck costs weeks.
The specific problem I kept returning to: the language barrier destroys data quality. An English-language questionnaire administered to a patient whose primary language is Kannada or Marathi produces distorted, incomplete data. Doctors know this. They work around it manually: reading questions aloud, translating on the fly, losing nuance in the process. There was no tool built for this reality.
Palzo is built for this reality. The patient receives a link on their phone, hears the question read aloud in their own language, and records a voice response. No forms. No English. No app download. The doctor never has to be in the room.
System Architecture
Two users, one pipeline, zero app installs
The system has two distinct user flows: the doctor's workflow (authenticated, dashboard-driven) and the patient's workflow (anonymous link, mobile-first, voice-only). They meet at the transcript layer.
Audio engine abstraction: All TTS and STT calls route through a single service layer (src/lib/audio/service.ts) with pluggable adapters. Google Cloud, Sarvam AI, and OpenWhisper are all supported; the active provider is set via environment variable. Swapping providers requires no product code changes.
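A minimal sketch of how such an adapter layer might look, assuming a shared interface and a registry keyed by provider name. The `AudioAdapter` interface, the `resolveAdapter` helper, and the stub adapters are illustrative assumptions, not the contents of the actual service.ts; real adapters would wrap each provider's SDK.

```typescript
// Hypothetical sketch: all names here are illustrative, not from the Palzo codebase.
type ProviderName = "google" | "sarvam" | "openwhisper";

interface Transcript {
  text: string;
  confidence: number;
}

interface AudioAdapter {
  readonly name: ProviderName;
  synthesize(text: string, languageCode: string): Promise<Uint8Array>;
  transcribe(audio: Uint8Array, languageCode: string): Promise<Transcript>;
}

// Stub factory standing in for real provider SDK calls.
function makeStubAdapter(name: ProviderName): AudioAdapter {
  return {
    name,
    async synthesize(text: string, _languageCode: string): Promise<Uint8Array> {
      return new Uint8Array(text.length); // placeholder audio bytes
    },
    async transcribe(audio: Uint8Array, _languageCode: string): Promise<Transcript> {
      return { text: `[${name}: ${audio.length} bytes]`, confidence: 0 };
    },
  };
}

const adapters: Record<ProviderName, AudioAdapter> = {
  google: makeStubAdapter("google"),
  sarvam: makeStubAdapter("sarvam"),
  openwhisper: makeStubAdapter("openwhisper"),
};

function isProviderName(value: string): value is ProviderName {
  return Object.prototype.hasOwnProperty.call(adapters, value);
}

// Picks the adapter named by the environment value; Google is the default.
function resolveAdapter(providerEnv: string | undefined): AudioAdapter {
  const requested = (providerEnv ?? "google").toLowerCase();
  if (!isProviderName(requested)) {
    throw new Error(`Unknown audio provider: ${requested}`);
  }
  return adapters[requested];
}
```

On the server this might be called as `resolveAdapter(process.env.AUDIO_PROVIDER)`, where `AUDIO_PROVIDER` is an assumed variable name. Product code would depend only on `AudioAdapter`, so changing the variable changes the provider.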
Multi-tenancy via Supabase RLS: Doctors can only see their own patients, responses, and transcripts. Isolation is enforced at the database level through Row Level Security policies rather than in the application layer alone, so a bug in application code cannot expose another doctor's data. For medical data, that database-level guarantee is a compliance requirement.
Design Decisions
Four calls that made this usable
Voice-first for patients, not forms
A patient in a hospital ward may not read English. They may not read at all. A digital form, even a well-translated one, introduces a literacy barrier that corrupts the data. Voice removes the barrier entirely. The patient hears the question in their language and speaks their answer naturally, so the response reflects what they meant rather than what they could manage to read and write.
Tradeoff: Voice data requires transcription, which introduces a processing step and an accuracy variable. Medical terminology in regional languages has lower STT accuracy than conversational speech. The doctor verification step exists precisely for this reason: the transcript is a draft, not a final record.
No app install for patients
A patient in a hospital ward who needs to install an app, create an account, and navigate a new interface will abandon the process before the first question. The patient experience is a browser link that works on any phone. That's it. The entire interview happens in the mobile browser: no install, no login, no friction.
Tradeoff: The browser's MediaRecorder API has cross-platform inconsistencies. iOS Safari doesn't support WebM Opus; a separate MP4 fallback was required. Browser audio quality is lower than a native app. For the data quality required at this stage of research, it's sufficient.
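One way the format fallback could be handled is to probe for supported containers before creating the recorder. The sketch below assumes a candidate list of WebM Opus first and MP4 second; `pickRecordingMimeType` is a hypothetical helper, and the support check is passed in so the logic can run outside a browser.

```typescript
// Hypothetical helper: candidate order is an assumption (WebM Opus preferred, MP4 as fallback).
const CANDIDATE_MIME_TYPES: string[] = ["audio/webm;codecs=opus", "audio/mp4"];

// Returns the first candidate the browser can record, or undefined if none match.
function pickRecordingMimeType(
  isSupported: (mimeType: string) => boolean
): string | undefined {
  return CANDIDATE_MIME_TYPES.find((mimeType) => isSupported(mimeType));
}

// In the browser, the check would come from the real static method:
//   const mimeType = pickRecordingMimeType((t) => MediaRecorder.isTypeSupported(t));
//   const recorder = mimeType
//     ? new MediaRecorder(stream, { mimeType })
//     : new MediaRecorder(stream); // let the browser choose its default
```

Feature detection of this kind avoids user-agent sniffing, so the same code path keeps working if a browser later adds or drops a format.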
Engine-agnostic audio layer from day one
Google Cloud TTS/STT is the default provider: reliable, well-documented, globally available. Sarvam AI is purpose-built for Indic languages and produces better results for Hindi, Tamil, and Telugu. OpenWhisper runs fully on-premise when data sovereignty matters. All three are live adapters; the active provider is a config switch, not a code change.
Tradeoff: Building the abstraction layer before it's needed adds upfront complexity. The bet is that provider-switching will happen within 6 months as the product scales and language accuracy becomes critical for research validity.
Doctor verification as a required step, not optional
Auto-transcription accuracy for medical terminology in Indian regional languages is imperfect. A Marathi-speaking patient describing symptoms will use medical terms that any STT model can mishear or omit. The doctor review step isn't a nice-to-have; it's data integrity infrastructure. No response moves into analysis until a doctor has verified the transcript.
Tradeoff: Adds time to the data collection process. A fully automated pipeline would be faster. But for medical research that will end up in a published thesis, data accuracy isn't negotiable.
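The verification gate can be pictured as a status on each response plus a filter in front of analysis. This is a sketch under assumed names (`InterviewResponse`, `verifyTranscript`, `selectForAnalysis`); the real schema is not described in this case study.

```typescript
// Hypothetical data model: field and status names are assumptions.
type ReviewStatus = "pending_review" | "verified";

interface InterviewResponse {
  id: string;
  transcript: string;
  status: ReviewStatus;
  verifiedBy: string | null;
}

// The doctor confirms or corrects the draft transcript; only then is it marked verified.
function verifyTranscript(
  response: InterviewResponse,
  doctorId: string,
  correctedTranscript: string
): InterviewResponse {
  return {
    ...response,
    transcript: correctedTranscript.trim(),
    status: "verified",
    verifiedBy: doctorId,
  };
}

// Analysis only ever sees responses a doctor has signed off on.
function selectForAnalysis(responses: InterviewResponse[]): InterviewResponse[] {
  return responses.filter((r) => r.status === "verified" && r.verifiedBy !== null);
}
```

Keeping the filter in one function means any later analysis module has a single entry point that cannot accidentally include unreviewed drafts.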
Operational Thinking
Designing for the hospital ward, not the office
AI System Thinking
Where the AI lives and where it doesn't
TTS pipeline: Doctor writes a question in the questionnaire interface → system routes it through the audio service layer → active TTS adapter (Google Cloud, Sarvam, or OpenWhisper) converts to MP3/WAV using the selected language voice → audio file served to the patient's browser for playback. Voice selection is language-aware and doctor-configurable. Sarvam is preferred for Indic languages; Google is the fallback.
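The language-aware preference described above might reduce to a small lookup. In this sketch the set of Indic locale tags and the `ttsProviderOrder` helper are assumptions for illustration; the case study does not list which locales are routed to Sarvam.

```typescript
// Hypothetical sketch of language-aware provider preference.
type TtsProvider = "google" | "sarvam";

// Assumed locale list (BCP-47 tags for Hindi, Tamil, Telugu, Kannada, Marathi).
const INDIC_LANGUAGE_CODES = new Set<string>(["hi-IN", "ta-IN", "te-IN", "kn-IN", "mr-IN"]);

// Returns providers in the order they should be tried for a given language.
function ttsProviderOrder(languageCode: string): TtsProvider[] {
  return INDIC_LANGUAGE_CODES.has(languageCode) ? ["sarvam", "google"] : ["google"];
}
```

For example, a Kannada question (`kn-IN`) would try Sarvam first and fall back to Google, while an English question would go straight to Google.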
STT pipeline: Patient records response → browser captures audio using MediaRecorder (WebM Opus on Android, MP4 on iOS) → audio uploaded to Supabase Storage → API route dispatches to the active STT adapter with a language hint and medical terminology hints → transcript + confidence score returned → stored against the response row → surfaced to the doctor for verification. Sarvam and OpenWhisper are available as drop-in replacements for Google Cloud STT.
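The steps above can be sketched as a single server-side function with its dependencies injected. Every name here (`PipelineDeps`, `handleRecordedAnswer`, the row fields) is hypothetical; in the real system the upload would go to Supabase Storage and the transcription to the active STT adapter.

```typescript
// Hypothetical orchestration sketch; storage and STT are injected so they can be swapped or faked.
interface SttResult {
  text: string;
  confidence: number;
}

interface ResponseRow {
  responseId: string;
  audioPath: string;
  transcript: string;
  confidence: number;
  status: "pending_review";
}

interface RecordedAnswer {
  responseId: string;
  audio: Uint8Array;
  mimeType: string;
  languageCode: string;
  terminologyHints: string[];
}

interface PipelineDeps {
  upload(responseId: string, audio: Uint8Array, mimeType: string): Promise<string>;
  transcribe(audioPath: string, languageCode: string, hints: string[]): Promise<SttResult>;
  save(row: ResponseRow): Promise<void>;
}

async function handleRecordedAnswer(
  deps: PipelineDeps,
  answer: RecordedAnswer
): Promise<ResponseRow> {
  // 1. Persist the raw audio first so it is never lost.
  const audioPath = await deps.upload(answer.responseId, answer.audio, answer.mimeType);
  // 2. Transcribe with a language hint and topic vocabulary.
  const result = await deps.transcribe(audioPath, answer.languageCode, answer.terminologyHints);
  // 3. Store the draft transcript; it stays pending until a doctor verifies it.
  const row: ResponseRow = {
    responseId: answer.responseId,
    audioPath,
    transcript: result.text,
    confidence: result.confidence,
    status: "pending_review",
  };
  await deps.save(row);
  return row;
}
```

Uploading before transcribing means the audio survives even if the STT call fails later in the function.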
Medical terminology hints: The STT request passes a contextual vocabulary list to the active provider, using terms specific to the thesis topic (e.g., cardiology terminology for a cardiac study). All three supported providers accept custom vocabulary hints. This boosts recognition accuracy for domain-specific language that generic models frequently mishear.
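Before the hint list reaches a provider it likely needs cleaning, since vocabulary entered by hand tends to contain blanks and duplicates. The helper below is a sketch: `buildTerminologyHints` and the default cap of 100 are assumptions, and actual per-request limits vary by provider.

```typescript
// Hypothetical helper: trims, de-duplicates (case-insensitively), and caps the hint list.
function buildTerminologyHints(terms: string[], maxHints: number = 100): string[] {
  const seen = new Set<string>();
  const hints: string[] = [];
  for (const raw of terms) {
    if (hints.length >= maxHints) break; // respect the assumed provider cap
    const term = raw.trim();
    const key = term.toLowerCase();
    if (term === "" || seen.has(key)) continue; // skip blanks and repeats
    seen.add(key);
    hints.push(term);
  }
  return hints;
}
```

For a cardiac study, a doctor-entered list such as `[" angina", "Angina", "", "stent"]` would reduce to two hints, keeping the first spelling of each term.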
Fallback mechanisms: If the active TTS provider is unavailable, the system degrades gracefully: the question is displayed as text and the patient reads (or asks the doctor to read). If STT fails, the audio file is stored and the transcript field is blank, prompting manual transcription by the doctor. The system degrades to a worse experience, not a broken one.
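Both fallbacks amount to catching a provider failure and returning a degraded but valid result. The sketch below uses hypothetical names (`deliverQuestion`, `transcribeOrBlank`) and takes the provider calls as parameters, so it makes no assumption about any particular SDK.

```typescript
// Hypothetical sketch of the two degradation paths.
type QuestionDelivery =
  | { mode: "audio"; audio: Uint8Array }
  | { mode: "text"; text: string };

// TTS failure: fall back to showing the question as text.
async function deliverQuestion(
  text: string,
  synthesize: (text: string) => Promise<Uint8Array>
): Promise<QuestionDelivery> {
  try {
    return { mode: "audio", audio: await synthesize(text) };
  } catch (err) {
    return { mode: "text", text };
  }
}

interface StoredAnswer {
  audioPath: string;
  transcript: string | null;
  needsManualTranscription: boolean;
}

// STT failure: keep the audio, leave the transcript blank, flag it for the doctor.
async function transcribeOrBlank(
  audioPath: string,
  transcribe: (audioPath: string) => Promise<string>
): Promise<StoredAnswer> {
  try {
    const transcript = await transcribe(audioPath);
    return { audioPath, transcript, needsManualTranscription: false };
  } catch (err) {
    return { audioPath, transcript: null, needsManualTranscription: true };
  }
}
```

Neither function rethrows, which is the point: the interview continues and the failure shows up as a flag in the data rather than an error on the patient's screen.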
Phase 2 AI roadmap: LLM-assisted analysis of verified transcripts (identifying themes, coding responses, running sentiment analysis), automated Vancouver citation formatting, and thesis chapter generation, covering the parts of the thesis process that come after data collection. These modules are planned but not yet built. The current product proves the data collection pipeline; Phase 2 is meant to prove the analysis pipeline.
Let's talk.
Open to full-time roles and consulting engagements.
Based in India · Open to relocate globally.