An end-to-end, real-time speech-to-speech system leveraging WebRTC and fine-tuned LLM/TTS architectures to train healthcare students in colloquial Puerto Rican Spanish.
Note: The live deployment utilizes our active, fine-tuned Coqui XTTS-v2 pipeline via a quantized inference setup. The companion Gemma LLM backbone is currently running on a vanilla baseline while our customized, dialect-tuned reasoning layers undergo strict sociolinguistic evaluation to prevent cultural misrepresentation.
The deployment is restricted to authorized users. Please email me for access credentials.
In New York, Caribbean Spanish, specifically the colloquial Puerto Rican dialect, is widely spoken. This vernacular differs significantly in phonology, lexicon, and prosody from the Standard Spanish traditionally taught in academic settings.
For nursing and pharmacy students, this linguistic mismatch creates severe communication barriers during clinical intakes, often leading to medical misinterpretations. Because mainstream Voice AI technologies focus almost exclusively on high-resource, standardized variants, a specialized pipeline was required to help students practice with these rich regional variations.
We built a low-resource dialect adaptation framework that nudges established base models toward colloquial variations. This avoids the unsustainable compute cost of training foundation models from scratch.
We used audio corpora from the University of Puerto Rico and built a data staging pipeline that performs normalization, diarization, and forced token-to-audio alignment to clean the transcripts. We then fine-tuned Coqui XTTS-v2 on this data, blending it with high-quality general corpora to preserve voice stability and eliminate voice synthesis hallucinations.
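The normalization and corpus-blending steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the function names, the bracketed-annotation convention (for example `[risa]`), the lowercasing step, and the 0.3 default mixing fraction are all assumptions made for the example. Diarization and forced alignment are left out because they depend on external tooling.

```python
import random
import re
import unicodedata


def normalize_transcript(text: str) -> str:
    """Clean one transcript line (hypothetical rules, chosen for illustration)."""
    text = unicodedata.normalize("NFC", text)   # unify accented-character encoding
    text = re.sub(r"\[[^\]]*\]", " ", text)     # drop bracketed tags such as [risa]
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip().lower()


def blend_corpora(dialect, general, general_fraction=0.3, seed=0):
    """Mix dialect samples with general-corpus samples.

    All dialect samples are kept; enough general samples are drawn so that
    they make up roughly `general_fraction` of the final list (capped by
    how many general samples exist). The result is shuffled.
    """
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")
    rng = random.Random(seed)  # fixed seed keeps the mix reproducible
    n_general = round(len(dialect) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general))
    mixed = list(dialect) + rng.sample(list(general), n_general)
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    print(normalize_transcript("  ¡Ay  bendito! [risa]  "))
    dialect = [f"d{i}" for i in range(8)]
    general = [f"g{i}" for i in range(10)]
    print(blend_corpora(dialect, general, general_fraction=0.2, seed=1))
```

Keeping every dialect sample and sizing the general portion around it reflects the low-resource setting: the scarce data is never discarded, and the general data acts only as a stabilizer.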
To generate colloquial responses from scarce text data, we reversed the usual translation direction. We took the limited colloquial data available and converted it to Standard Spanish using commercial LLMs. Training the Gemma backbone on this inverse mapping yielded vastly superior contextual generation.
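One way to read the inverse mapping is as back-translation: each colloquial line is paired with its Standard Spanish rendering, and the model learns to go from the standard form to the colloquial one. The sketch below builds such pairs under that assumption. The `to_standard` callable is a hypothetical stand-in for the commercial LLM call, and the `prompt`/`completion` record layout is an illustrative choice rather than the project's actual training format.

```python
import json
from typing import Callable, Iterable


def build_inverse_pairs(
    colloquial_lines: Iterable[str],
    to_standard: Callable[[str], str],
) -> list:
    """Pair each colloquial line with its Standard Spanish rendering.

    The standard text becomes the input and the original colloquial line
    becomes the target, so the model learns standard -> colloquial.
    """
    pairs = []
    for line in colloquial_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        standard = to_standard(line).strip()
        if not standard or standard == line:
            continue  # skip pairs that carry no dialect contrast
        pairs.append({"prompt": standard, "completion": line})
    return pairs


def to_jsonl(pairs) -> str:
    """Serialize pairs as JSON Lines, keeping accented characters readable."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)


if __name__ == "__main__":
    # Toy stand-in for the LLM: expands one colloquial contraction.
    def fake_to_standard(text: str) -> str:
        return text.replace("pa'", "para")

    demo = build_inverse_pairs(["Voy pa' la farmacia.", "  ", "Hola."], fake_to_standard)
    print(to_jsonl(demo))
```

Dropping pairs where the standard rendering equals the original is a design choice for this sketch; a real pipeline might keep them so the model also learns when no dialectal change applies.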
As the lead engineer on this project, I architected and deployed the entire system. My core contributions include:
Transitioning the generation layers to fully streaming chunk-by-chunk decoding to reduce Time-to-First-Phoneme (TTFP) under tight WebRTC window limitations.
Developing conversational data augmentation strategies to model natural turn-taking behavior, regional speech pacing, and authentic dialectal fillers to maximize dialog flow.
Co-designing a formal study to measure student fluency gains, tracking conversational confidence and error rates among nursing and pharmacy cohorts.
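To illustrate the streaming contribution listed above, the sketch below shows the general idea of chunk-by-chunk decoding: tokens from the language model are flushed to the speech synthesizer at clause boundaries instead of after the full reply, so audio can start sooner. It is an illustration under stated assumptions, not the deployed code. The boundary set, the `max_tokens` cap, and both function names are hypothetical, and the timing helper measures time to the first text chunk only as a proxy for TTFP.

```python
import time
from typing import Iterable, Iterator, Optional, Tuple

# Punctuation that ends a clause; an assumption chosen for this example.
BOUNDARIES = (".", ",", "?", "!", ";", ":")


def chunk_tokens(tokens: Iterable[str], max_tokens: int = 8) -> Iterator[str]:
    """Yield speakable text chunks as soon as a clause boundary is seen.

    A chunk is also flushed once it reaches `max_tokens`, so a long clause
    cannot stall synthesis indefinitely.
    """
    buffer = []
    for token in tokens:
        buffer.append(token)
        if token.endswith(BOUNDARIES) or len(buffer) >= max_tokens:
            yield " ".join(buffer)
            buffer = []
    if buffer:
        yield " ".join(buffer)  # flush whatever remains at end of stream


def time_to_first_chunk(chunks: Iterator[str]) -> Tuple[Optional[str], float]:
    """Return the first chunk and the seconds spent waiting for it."""
    start = time.perf_counter()
    first = next(chunks, None)
    return first, time.perf_counter() - start


if __name__ == "__main__":
    reply = "Buenos días, ¿cómo se siente hoy?".split()
    first, elapsed = time_to_first_chunk(chunk_tokens(reply))
    print(first, f"{elapsed:.6f}s")
```

In a real deployment the token iterable would be the model's streaming output and each yielded chunk would be handed to the synthesizer immediately; the trade-off is that smaller chunks lower latency but give the synthesizer less context for prosody.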