Caribbean Spanish S2S Pipeline | Atharva Vikas Jadhav

The Problem: The Clinical Dialect Gap

Caribbean Spanish, particularly the colloquial Puerto Rican dialect, is widely spoken in New York. This vernacular differs significantly in phonology, lexicon, and prosody from the Standard Spanish traditionally taught in academic settings.

For nursing and pharmacy students, this linguistic mismatch creates severe communication barriers during clinical intakes, often leading to medical misinterpretations. Because mainstream Voice AI technologies focus almost exclusively on high-resource, standardized variants, a specialized pipeline was required to help students practice with these rich regional variations.

The Solution: Real-Time S2S Architecture

We built a low-resource dialect adaptation framework that nudges established base models toward colloquial variations. This avoids the prohibitive compute cost of training foundation models from scratch.

Acoustic Target Alignment (TTS)

We used audio corpora from the University of Puerto Rico and built a data staging pipeline that performs normalization, speaker diarization, and forced alignment of transcript tokens to audio, so that each training clip is paired with a clean, time-aligned transcript. We then fine-tuned Coqui XTTS-v2 on this data, blending it with high-quality general corpora to preserve voice stability and reduce synthesis hallucinations.
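
The staging steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: it assumes a forced aligner has already produced word-level timestamps, and the AlignedWord schema, the function names, and the max_len / min_pause thresholds are all hypothetical choices made for the example.

```python
import re
import unicodedata
from dataclasses import dataclass


@dataclass
class AlignedWord:
    """One word with timestamps, as a forced aligner might emit (hypothetical schema)."""
    text: str
    start: float  # seconds from the start of the recording
    end: float


def normalize_transcript(text: str) -> str:
    """Lowercase, NFC-normalize, and strip punctuation, keeping accents and apostrophes."""
    text = unicodedata.normalize("NFC", text).lower()
    # \w is Unicode-aware for str patterns, so accented letters survive.
    # Apostrophes are kept because they mark colloquial elisions such as "pa'".
    text = re.sub(r"[^\w\s']", " ", text)
    return re.sub(r"\s+", " ", text).strip()


def chunk_by_pause(words, max_len=10.0, min_pause=0.3):
    """Group aligned words into clips, cutting at pauses or when a clip gets too long.

    max_len and min_pause are in seconds and are illustrative defaults only.
    Returns a list of (start, end, text) tuples.
    """
    chunks, current = [], []
    for w in words:
        if current:
            pause = w.start - current[-1].end
            too_long = (w.end - current[0].start) > max_len
            if pause >= min_pause or too_long:
                chunks.append(current)
                current = []
        current.append(w)
    if current:
        chunks.append(current)
    return [(c[0].start, c[-1].end, " ".join(x.text for x in c)) for c in chunks]


# Toy alignment: a 0.7 s pause before the last word forces a second clip.
words = [
    AlignedWord("pa'", 0.0, 0.3),
    AlignedWord("dónde", 0.35, 0.8),
    AlignedWord("vas", 0.85, 1.1),
    AlignedWord("ahorita", 1.8, 2.4),
]
clips = chunk_by_pause(words)
```

Cutting at pauses rather than at fixed intervals keeps each clip's audio and text boundaries in agreement, which matters when the dataset is too small to absorb misaligned pairs.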

Dialectal Inversion (LLM)

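To generate colloquial responses from scarce text data, we reversed the usual translation direction. Instead of asking a model to produce dialect text, we took the limited colloquial data available and converted it to Standard Spanish using commercial LLMs, an approach similar in spirit to back-translation. Each resulting pair maps a Standard Spanish sentence to its authentic colloquial original, and training the Gemma backbone on this inverse mapping yielded markedly better contextual generation.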
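
One way to picture the inverse mapping is as a pair-building step. The sketch below is an assumption about the data shape, not the project's actual code: to_standard stands in for whatever commercial LLM call was used (here it is a toy lookup), the "source"/"target" field names are hypothetical, and the example sentences are illustrative rather than drawn from the project corpus.

```python
import json
from typing import Callable, Iterable


def build_inverse_pairs(
    colloquial_lines: Iterable[str],
    to_standard: Callable[[str], str],
) -> list:
    """Pair each authentic colloquial line with a machine-standardized version.

    The model is later trained source -> target, i.e. Standard -> colloquial,
    so the target side is always human-produced dialect text.
    """
    pairs = []
    for line in colloquial_lines:
        colloquial = line.strip()
        if not colloquial:
            continue
        standard = to_standard(colloquial).strip()
        # Skip pairs where standardization changed nothing: no dialect signal to learn.
        if not standard or standard.lower() == colloquial.lower():
            continue
        pairs.append({"source": standard, "target": colloquial})
    return pairs


def to_jsonl(pairs) -> str:
    """Serialize pairs one JSON object per line, keeping accented characters readable."""
    return "\n".join(json.dumps(p, ensure_ascii=False) for p in pairs)


# Toy stand-in for the LLM standardization call (hypothetical, for illustration only).
_TOY_LOOKUP = {
    "¿Qué tú tienes?": "¿Qué tiene usted?",
    "Estoy esmayao.": "Tengo mucha hambre.",
}


def toy_to_standard(text: str) -> str:
    return _TOY_LOOKUP.get(text, text)


pairs = build_inverse_pairs(
    ["¿Qué tú tienes?", "", "Estoy esmayao.", "Me duele la cabeza."],
    toy_to_standard,
)
jsonl = to_jsonl(pairs)
```

The design point is that the noisy, machine-generated text sits on the input side, while the output the model learns to produce is always genuine colloquial speech.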

Technical Deep Dive & My Role

As the lead engineer on this project, I architected and deployed the entire system. My core contributions include:

End-to-End Data Staging: Developed modular Python pipelines for raw data ingestion, executing speaker diarization, chunking, and forced transcript alignment to maximize the value of highly constrained regional datasets.
Iterative Post-Training: Conducted multiple rounds of post-training across the Gemma and XTTS blocks. To ensure clinical relevance and phonetic validity, I co-designed human-in-the-loop evaluation protocols alongside linguistics PhD researchers.
LiveKit & WebRTC Infrastructure: Shifted the pipeline from a batch request pattern to a streaming architecture. Deployed LiveKit on private OpenStack cloud instances to stream the unified ASR → LLM → TTS pipeline directly through authenticated WebRTC sessions.
Cross-Functional Leadership: Acted as the primary technical bridge, partnering with clinical faculty to translate ambiguous educational simulator requirements into concrete, scalable backend API specifications.
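
The shift from a batch request pattern to a streaming one, mentioned in the infrastructure item above, can be illustrated with chained async generators. This is a conceptual sketch only: the three stages are stubs (no real ASR, LLM, or TTS is called), it does not use the LiveKit SDK, and every function name here is hypothetical.

```python
import asyncio
from typing import AsyncIterator, List

events: List[str] = []  # records the order in which frames enter and leave the pipeline


async def mic(frames: List[bytes]) -> AsyncIterator[bytes]:
    """Stub audio source: yields one frame at a time, like an incoming media track."""
    for frame in frames:
        events.append(f"in:{frame.decode('utf-8')}")
        await asyncio.sleep(0)  # yield control, as real I/O would
        yield frame


async def asr(audio: AsyncIterator[bytes]) -> AsyncIterator[str]:
    """Stub ASR: pretends each frame already contains one recognized word."""
    async for frame in audio:
        yield frame.decode("utf-8")


async def llm(text: AsyncIterator[str]) -> AsyncIterator[str]:
    """Stub LLM: upper-cases input as a placeholder for response generation."""
    async for piece in text:
        yield piece.upper()


async def tts(text: AsyncIterator[str]) -> AsyncIterator[bytes]:
    """Stub TTS: encodes text to bytes as a placeholder for synthesized audio."""
    async for piece in text:
        yield piece.encode("utf-8")


async def run_pipeline(frames: List[bytes]) -> List[bytes]:
    out = []
    # Each stage pulls from the previous one, so output for the first frame
    # is produced before the second frame has even been read.
    async for audio in tts(llm(asr(mic(frames)))):
        events.append(f"out:{audio.decode('utf-8')}")
        out.append(audio)
    return out


result = asyncio.run(run_pipeline([b"hola", b"doctor"]))
```

In a batch design, every "in" event would precede every "out" event. Here they interleave, which is the property that lets a user hear the start of a reply while later stages are still working.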

Future Roadmap & Next Steps

Latency Optimization

Transitioning the generation layers to fully streaming, chunk-by-chunk decoding to reduce Time-to-First-Phoneme (TTFP) within the tight latency budget of a live WebRTC session.
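
As a rough illustration of chunk-by-chunk decoding, the sketch below buffers streamed LLM tokens and releases a chunk to the synthesizer at each clause boundary, so that speech can begin before the full response exists. It is an assumption about one possible approach, not the planned implementation; chunk_for_tts, max_words, and the boundary set are hypothetical.

```python
from typing import Iterable, Iterator


def chunk_for_tts(
    tokens: Iterable[str],
    max_words: int = 8,
    boundaries: str = ".,;:!?",
) -> Iterator[str]:
    """Yield speakable chunks from a token stream.

    A chunk is flushed when a token ends in clause punctuation, or when
    max_words tokens have accumulated without any punctuation.
    """
    buffer = []
    for tok in tokens:
        if not tok:
            continue  # ignore empty tokens from the upstream decoder
        buffer.append(tok)
        if tok[-1] in boundaries or len(buffer) >= max_words:
            yield " ".join(buffer)
            buffer = []
    if buffer:
        yield " ".join(buffer)  # flush whatever remains at end of stream


# Whitespace-split words stand in for streamed LLM tokens.
stream = "Buenos días, dígame dónde le duele.".split()
chunks = list(chunk_for_tts(stream))
```

With this scheme, time to first audio depends on the length of the first clause rather than on the length of the whole reply. The trade-off is that very short chunks give the synthesizer less context for prosody, so max_words and the boundary set would need tuning by ear.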

Conversational Quality

Developing conversational data augmentation strategies to model natural turn-taking behavior, regional speech pacing, and authentic dialectal fillers, with the goal of improving dialogue flow.

Clinical Impact Study

Co-designing a formal study to measure student fluency gains, tracking conversational confidence and error rates among nursing and pharmacy cohorts.