Palzo: How I Built It · Shatakshi Mishra

Why it exists

My sister's thesis. And the gap behind it.

My sister is a postgraduate (PG) doctor. Like every PG doctor in India, she has to complete a research thesis before she can get her degree, a non-negotiable National Medical Commission (NMC) requirement. The thesis means months of patient data collection across hospital wards before she writes a single word of the actual report.

The current workflow: paper questionnaires handed to patients (many of whom can't read English), manual transcription of responses, data entry into Excel, analysis by a statistician friend, writing in Word, formatting citations by hand. Each step is a bottleneck. Each bottleneck costs weeks.

The specific problem I kept returning to: the language barrier destroys data quality. An English-language questionnaire administered to a patient whose primary language is Kannada or Marathi produces distorted, incomplete data. Doctors know this. They work around it manually: reading questions aloud, translating on the fly, losing nuance in the process. There was no tool built for this reality.

Palzo is built for this reality. The patient receives a link on their phone, hears the question read aloud in their own language, and records a voice response. No forms. No English. No app download. The doctor never has to be in the room.

System Architecture

Two users, one pipeline, zero app installs

The system has two distinct user flows: the doctor's workflow (authenticated, dashboard-driven) and the patient's workflow (anonymous link, mobile-first, voice-only). They meet at the transcript layer.

Doctor flow

Doctor (authenticated dashboard) → Create Questionnaire (questions + language selection) → Enroll Patients (secure anonymous link token)

Patient flow

Patient (mobile browser, no login) → TTS Question Playback (Google, Sarvam, pluggable; 10 Indian languages) → Voice Recording (browser MediaRecorder, Supabase Storage) → STT Transcription (Google, Sarvam, OpenWhisper; pluggable)

Convergence

Transcript Layer (Supabase, RLS per doctor) → Doctor Review (edit, verify, approve) → Analysis + Report (stats, citations, thesis export)

Audio engine abstraction: All TTS and STT calls route through a single service layer (src/lib/audio/service.ts) with pluggable adapters. Google Cloud, Sarvam AI, and OpenWhisper are all supported; the active provider is set via environment variable. Swapping providers requires no product code changes.
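A minimal sketch of how a service layer like this might resolve its adapter. It is not the actual contents of service.ts: the STT_PROVIDER variable name, the SttAdapter interface, and the stub adapters are all assumptions made for illustration. The source only states that adapters are pluggable, that the provider comes from an environment variable, and that Google is the default.

```typescript
// Hypothetical shapes: the real interface in src/lib/audio/service.ts may differ.
type ProviderName = "google" | "sarvam" | "openwhisper";

interface TranscriptResult {
  text: string;
  confidence: number; // 0..1, as reported by the provider
}

interface SttAdapter {
  readonly name: ProviderName;
  transcribe(audio: Uint8Array, languageTag: string): Promise<TranscriptResult>;
}

// Placeholder adapter: a real one would call the provider's API here.
function stubAdapter(name: ProviderName): SttAdapter {
  return {
    name,
    transcribe(): Promise<TranscriptResult> {
      throw new Error(`${name} adapter is not wired up in this sketch`);
    },
  };
}

const ADAPTERS: Record<ProviderName, SttAdapter> = {
  google: stubAdapter("google"),
  sarvam: stubAdapter("sarvam"),
  openwhisper: stubAdapter("openwhisper"),
};

function isProviderName(value: string): value is ProviderName {
  return Object.prototype.hasOwnProperty.call(ADAPTERS, value);
}

// Resolve the active adapter from configuration. Product code calls this and
// never names a provider directly, so switching providers is a config change.
// STT_PROVIDER is an assumed variable name; "google" is the stated default.
function resolveSttAdapter(env: Record<string, string | undefined>): SttAdapter {
  const requested = (env["STT_PROVIDER"] ?? "google").trim().toLowerCase();
  if (!isProviderName(requested)) {
    throw new Error(`Unknown STT provider "${requested}"`);
  }
  return ADAPTERS[requested];
}
```

In a Node runtime the call would look like resolveSttAdapter(process.env). Failing loudly on an unknown provider name is a deliberate choice in this sketch: a typo in configuration should stop the deploy, not silently route patient audio to the default provider.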

Multi-tenancy via Supabase RLS: Doctors can only see their own patients, responses, and transcripts. The isolation is enforced at the database level via Row Level Security policies, not just in the application layer, which is required for medical data compliance.

Design Decisions

Four calls that made this usable

Voice-first for patients, not forms

A patient in a hospital ward may not read English. They may not read at all. A digital form, even a well-translated one, introduces a literacy barrier that corrupts the data. Voice removes that barrier. The patient hears the question in their language and speaks their answer naturally, which yields more complete, less distorted responses.

Tradeoff: Voice data requires transcription, which introduces a processing step and an accuracy variable. Medical terminology in regional languages has lower STT accuracy than conversational speech. The doctor verification step exists precisely for this reason: the transcript is a draft, not a final record.

No app install for patients

A patient in a hospital ward who needs to install an app, create an account, and navigate a new interface will abandon the process before the first question. The patient experience is a browser link that works on any phone. That's it. The entire interview happens in the mobile browser: no install, no login, no friction.

Tradeoff: The browser's MediaRecorder API has cross-platform inconsistencies. iOS Safari does not record WebM Opus, so a separate MP4 fallback was required. Browser-captured audio is also lower quality than what a native app can record. For the data quality required at this stage of research, it's sufficient.
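The format fallback can be sketched as a small preference-ordered probe. The candidate list and the function name are assumptions, not Palzo's actual code. The support check is injected as a parameter so the logic can run outside a browser; in the browser it would wrap the standard MediaRecorder.isTypeSupported call.

```typescript
// Candidate container/codec strings in preference order: WebM Opus first,
// MP4 as the fallback for browsers (such as iOS Safari) that cannot record WebM.
const CANDIDATE_MIME_TYPES = ["audio/webm;codecs=opus", "audio/webm", "audio/mp4"];

// `isSupported` is injected for testability. In a browser, pass
// (t) => MediaRecorder.isTypeSupported(t).
function pickRecordingMimeType(
  isSupported: (mimeType: string) => boolean
): string | null {
  for (const candidate of CANDIDATE_MIME_TYPES) {
    if (isSupported(candidate)) {
      return candidate;
    }
  }
  // Nothing matched: the caller can omit mimeType and accept the browser default.
  return null;
}
```

The chosen string would then be passed as the mimeType option when constructing the MediaRecorder, and stored alongside the upload so the STT step knows which container it is decoding.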

Engine-agnostic audio layer from day one

Google Cloud TTS/STT is the default provider: reliable, well-documented, globally available. Sarvam AI is purpose-built for Indic languages and produces better results for Hindi, Tamil, and Telugu. OpenWhisper runs fully on-premise when data sovereignty matters. All three are live adapters; the active provider is a config switch, not a code change.

Tradeoff: Building the abstraction layer before it's needed adds upfront complexity. The bet is that provider-switching will happen within 6 months as the product scales and language accuracy becomes critical for research validity.

Doctor verification as a required step, not optional

Auto-transcription accuracy for medical terminology in Indian regional languages is imperfect. A Marathi-speaking patient describing symptoms will use medical terms that STT models often mishear or omit. The doctor review step isn't a nice-to-have; it's data integrity infrastructure. No response moves into analysis until a doctor has verified the transcript.

Tradeoff: Adds time to the data collection process. A fully automated pipeline would be faster. But for medical research that will end up in a published thesis, data accuracy isn't negotiable.
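The verification gate can be expressed as a filter in front of the analysis step. This is a sketch under assumptions: the status values loosely follow the Edit, Verify, Approve steps in the architecture section, and the record shape and function names are hypothetical.

```typescript
// Hypothetical status values; "draft" is the raw STT output before review.
type TranscriptStatus = "draft" | "verified" | "approved";

interface ResponseRecord {
  id: string;
  transcript: string | null; // null while manual transcription is pending
  status: TranscriptStatus;
}

// A response is eligible for analysis only once a doctor has reviewed it
// and it actually contains transcript text.
function isReadyForAnalysis(r: ResponseRecord): boolean {
  return (
    r.status !== "draft" &&
    r.transcript !== null &&
    r.transcript.trim().length > 0
  );
}

function selectForAnalysis(responses: ResponseRecord[]): ResponseRecord[] {
  return responses.filter(isReadyForAnalysis);
}
```

Keeping the check in one function means every downstream consumer (stats, export) shares the same definition of "verified" and cannot drift toward including drafts.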

Operational Thinking

Designing for the hospital ward, not the office

Languages

10 Indian

Hindi, Tamil, Telugu, Bengali, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Urdu. Indian English variant also supported. Coverage for ~95% of India's population.

TTS cost

₹0.10–0.30 / 1K chars

Google Cloud TTS pricing (default). A 15-question questionnaire in Hindi averages ~800 characters. Cost per patient interview: under ₹0.25. The Sarvam adapter has a different cost profile.

STT cost

₹0.40–1.00 / 15 sec

Google Cloud STT pricing (default). A 2-minute response = 8 billing units. Full patient interview (15 questions × 2 min each = 120 billing units): ₹48–120 at the listed rate, with a working estimate of ₹60–80. OpenWhisper reduces this to near-zero at the cost of self-hosted compute.
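The per-interview figures can be reproduced with a few lines of arithmetic. This sketch works in paise (1 rupee = 100 paise) to keep the math in integers, and it assumes that a partial 15-second unit is billed as a full unit. The function names are illustrative; the rates are the ones quoted in the two cost cards.

```typescript
// Rates from the cost cards, converted to paise.
const TTS_PAISE_PER_1K_CHARS = { low: 10, high: 30 }; // Rs 0.10 to 0.30 per 1K chars
const STT_PAISE_PER_UNIT = { low: 40, high: 100 };    // Rs 0.40 to 1.00 per 15 sec
const STT_UNIT_SECONDS = 15;

interface PaiseRange {
  low: number;
  high: number;
}

// Cost of synthesising one questionnaire of the given length.
function ttsCostPaise(characters: number): PaiseRange {
  return {
    low: (characters * TTS_PAISE_PER_1K_CHARS.low) / 1000,
    high: (characters * TTS_PAISE_PER_1K_CHARS.high) / 1000,
  };
}

// Assumption: partial units round up to a whole billing unit.
function sttBillingUnits(seconds: number): number {
  return Math.ceil(seconds / STT_UNIT_SECONDS);
}

// Cost of transcribing a full interview, one answer per question.
function sttCostPaise(questions: number, secondsPerAnswer: number): PaiseRange {
  const units = questions * sttBillingUnits(secondsPerAnswer);
  return {
    low: units * STT_PAISE_PER_UNIT.low,
    high: units * STT_PAISE_PER_UNIT.high,
  };
}
```

For an 800-character questionnaire, ttsCostPaise(800) gives 8 to 24 paise. For 15 answers of 120 seconds each, sttCostPaise(15, 120) gives 4,800 to 12,000 paise. Transcription, not playback, dominates the per-patient cost by more than two orders of magnitude, which is where a self-hosted STT adapter would have the most effect.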

Patient requirement

Phone + browser

No technical literacy required. A patient who has never used a web app can complete an interview if they can operate a phone and speak into a microphone.

Data isolation

Per-doctor RLS

Supabase Row Level Security enforced at DB layer. A doctor querying their patients cannot see another doctor's data, even with direct API access using a valid auth token.

Audio storage

Supabase Storage

Private bucket. Audio files linked to response rows. Retention policy not yet defined. Medical data retention is a compliance question that will need legal input before full launch.

AI System Thinking

Where the AI lives and where it doesn't

TTS pipeline: Doctor writes a question in the questionnaire interface → system routes it through the audio service layer → active TTS adapter (Google Cloud or Sarvam) converts the text to MP3/WAV using the selected language voice → audio file served to the patient's browser for playback. Voice selection is language-aware and doctor-configurable. Sarvam is preferred for Indic languages; Google is the fallback.
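The provider preference in the last sentence can be sketched as an ordered lookup. Everything here is illustrative: the function names are hypothetical, the language tags are standard BCP 47 tags for the ten listed languages, and routing Indian English to Google alone is an assumption the source does not state.

```typescript
type TtsProvider = "sarvam" | "google";

// BCP 47 tags for the ten Indic languages listed in this write-up.
const INDIC_LANGUAGE_TAGS = [
  "hi-IN", "ta-IN", "te-IN", "bn-IN", "mr-IN",
  "gu-IN", "kn-IN", "ml-IN", "pa-IN", "ur-IN",
];

// Sarvam first for Indic languages with Google as fallback.
// Assumption: anything else (for example "en-IN") goes to Google only.
function ttsProviderOrder(languageTag: string): TtsProvider[] {
  return INDIC_LANGUAGE_TAGS.indexOf(languageTag) !== -1
    ? ["sarvam", "google"]
    : ["google"];
}

// Walk the preference order and return the first provider that is reachable.
// `isAvailable` is injected; a real check might be a health probe or a config flag.
function chooseTtsProvider(
  languageTag: string,
  isAvailable: (provider: TtsProvider) => boolean
): TtsProvider | null {
  for (const provider of ttsProviderOrder(languageTag)) {
    if (isAvailable(provider)) {
      return provider;
    }
  }
  return null; // no provider reachable: the caller shows the question as text
}
```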

STT pipeline: Patient records response → browser captures audio using MediaRecorder (WebM Opus on Android, MP4 on iOS) → audio uploaded to Supabase Storage → API route dispatches to the active STT adapter with a language hint and medical terminology hints → transcript + confidence score returned → stored against the response row → surfaced to the doctor for verification. Sarvam and OpenWhisper are available as drop-in replacements for Google Cloud STT.

Medical terminology hints: The STT request passes a contextual vocabulary list to the active provider, using terms specific to the thesis topic (e.g., cardiology terminology for a cardiac study). All three supported providers accept custom vocabulary hints. This boosts recognition accuracy for domain-specific language that generic models frequently mishear.
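A provider-agnostic request carrying those hints might look like the sketch below. The SttRequest shape and buildSttRequest are hypothetical names, and each adapter would still need to translate vocabularyHints into whatever its provider actually accepts. The cleanup step (trimming, dropping blanks, de-duplicating) is an assumption about what a doctor-supplied term list would need.

```typescript
// Hypothetical provider-agnostic request; adapters map it to provider-specific fields.
interface SttRequest {
  audioPath: string;         // storage path of the uploaded recording
  languageTag: string;       // language hint, for example "mr-IN"
  vocabularyHints: string[]; // thesis-specific terms to bias recognition toward
}

function buildSttRequest(
  audioPath: string,
  languageTag: string,
  topicTerms: string[]
): SttRequest {
  const seenKeys: string[] = [];
  const hints: string[] = [];
  for (const raw of topicTerms) {
    const term = raw.trim();
    const key = term.toLowerCase();
    // Skip blanks and case-insensitive duplicates, keeping first-seen order.
    if (term.length > 0 && seenKeys.indexOf(key) === -1) {
      seenKeys.push(key);
      hints.push(term);
    }
  }
  return { audioPath, languageTag, vocabularyHints: hints };
}
```

For a cardiac study the term list might include words such as "angina" or "echocardiogram"; those examples are illustrative and would come from the doctor, not from the system.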

Fallback mechanisms: If the active TTS provider is unavailable, the system degrades gracefully: the question is displayed as text and the patient reads (or asks the doctor to read). If STT fails, the audio file is stored and the transcript field is blank, prompting manual transcription by the doctor. The system degrades to a worse experience, not a broken one.
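The two degradation paths can be sketched as pure functions over the outcome of each provider call. The outcome types and function names are assumptions; the sketch presumes the provider call itself has already been wrapped so that a failure arrives as a value, not as an uncaught exception.

```typescript
// Hypothetical outcome types produced by wrapping the TTS and STT provider calls.
type TtsOutcome =
  | { kind: "ok"; audioUrl: string }
  | { kind: "failed"; reason: string };

type SttOutcome =
  | { kind: "ok"; text: string; confidence: number }
  | { kind: "failed"; reason: string };

interface QuestionPresentation {
  mode: "audio" | "text";
  questionText: string;
  audioUrl: string | null;
}

// TTS failure degrades to showing the question as text.
function presentQuestion(questionText: string, tts: TtsOutcome): QuestionPresentation {
  if (tts.kind === "ok") {
    return { mode: "audio", questionText, audioUrl: tts.audioUrl };
  }
  return { mode: "text", questionText, audioUrl: null };
}

interface StoredResponse {
  audioPath: string;
  transcript: string | null; // blank when STT failed
  confidence: number | null;
  needsManualTranscription: boolean;
}

// STT failure never discards the recording: the audio path is always kept,
// and the response is flagged so the doctor can transcribe it by hand.
function storeResponse(audioPath: string, stt: SttOutcome): StoredResponse {
  if (stt.kind === "ok") {
    return {
      audioPath,
      transcript: stt.text,
      confidence: stt.confidence,
      needsManualTranscription: false,
    };
  }
  return { audioPath, transcript: null, confidence: null, needsManualTranscription: true };
}
```

Modelling failure as data keeps the degraded path visible in the types: a caller cannot read a transcript without first handling the case where there is none.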

Phase 2 AI roadmap: LLM-powered statistical analysis of verified transcripts (identifying themes, coding responses, running sentiment analysis), automated Vancouver citation formatting, and thesis chapter generation, covering the parts of the thesis process that come after data collection. These modules are planned but not yet built. The current product proves the data collection pipeline; phase 2 proves the analysis pipeline.
