
How we handle code-switching in real-time audio

A technical look at how MangoFinch detects and transcribes mid-sentence language switches in real-time multilingual meetings.

MangoFinch Team · 7 min read

Code-switching is when a speaker changes languages mid-conversation, sometimes mid-sentence. If you have been in a meeting where someone says "We need to finalize the Vertrag before Friday" or "Let's check the données from last quarter," you have seen it.

It happens constantly in multilingual workplaces. A 2024 study from the European Commission found that 67% of cross-border business meetings contain at least one language switch per minute. For most transcription tools, each switch is a small disaster.

We spent months building MangoFinch's code-switching pipeline. Here is what we learned, what breaks, and how we handle it at speed.

Why code-switching destroys traditional transcription

Most speech-to-text systems assume a single language per audio stream. You pick English, you get English. The acoustic model loads phoneme maps for one language and tries to force every sound through that filter.

When a Spanish speaker drops in an English technical term — "necesitamos revisar el deployment pipeline" — a Spanish-only model has two bad options. It can hallucinate a Spanish word that sounds vaguely like "deployment pipeline." Or it can output garbage characters. Either way, you lose the most important word in the sentence.

The problem is phonetic. Spanish has 5 vowel sounds. English has roughly 15, depending on dialect. When a model trained on 5 vowel positions encounters an English diphthong, it does not know what to do with the extra acoustic information. It is not a software bug. It is a fundamental mismatch between the sound and the model's learned representations.

We tested this early on with a recording from a fintech team in Frankfurt. The meeting was primarily German with frequent English financial terms. A single-language German model produced transcripts where "hedge fund" became "Hetz fand" and "quarterly earnings" turned into "Quartier Ernings." The German words around them were fine. Every English fragment was mangled.

The acoustic model challenge

Building a model that handles multiple languages simultaneously is harder than building one that handles each language well in isolation.

The core tension: language-specific models can specialize. They learn that certain sound combinations are common in their language and rare in others. A Japanese model knows that consonant clusters like "str" basically do not exist in Japanese, so it will not waste probability mass on them. That specialization makes it accurate.

A multilingual model cannot specialize as aggressively. It has to keep its options open. Every phoneme from every supported language needs some probability, which means the model is less confident about any individual prediction. In practice, this shows up as a 3-8% accuracy drop compared to the best single-language model for any given language.

There is also the segmentation problem. Before you can transcribe a code-switched utterance, you need to figure out where the switch happens. Is the switch at a word boundary? A morpheme boundary? Somewhere inside a word? In agglutinative languages like Turkish or Finnish, speakers sometimes attach suffixes from one language to stems from another.

How our speech engine's multi-language mode works

We built MangoFinch on a speech engine specifically designed for multi-language streaming.

Our transcription engine does not run one monolithic multilingual model. Instead, it runs parallel language-specific decoders against a shared acoustic encoder. The acoustic encoder converts raw audio into a language-agnostic feature representation — a phonetic fingerprint that captures what sounds were made without committing to which language they belong to.

Those features get passed to multiple language decoders simultaneously. Each decoder scores the audio against its own language model. A meta-classifier then picks the decoder with the highest confidence for each segment.
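In simplified form, the per-segment selection looks something like this. This is a sketch only: `DecoderResult` and `pick_transcript` are illustrative names, and the real meta-classifier weighs more features than a single confidence score.

```python
from dataclasses import dataclass

@dataclass
class DecoderResult:
    language: str      # decoder's language code
    text: str          # candidate transcript for this segment
    confidence: float  # decoder's own score, 0.0-1.0

def pick_transcript(results: list[DecoderResult]) -> DecoderResult:
    """Meta-classifier stand-in: choose the most confident decoder."""
    return max(results, key=lambda r: r.confidence)

segment = [
    DecoderResult("en", "we should review the contract", 0.92),
    DecoderResult("de", "wie schaut review der Kontrakt", 0.71),
]
best = pick_transcript(segment)
assert best.language == "en"
```

The important property is that every decoder sees every segment; nothing is routed away before scoring, so a mid-clause switch is always a candidate.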

The key insight is segment-level language detection rather than utterance-level. The engine does not try to guess the language for an entire sentence and then transcribe. It makes language decisions every few hundred milliseconds, which means it can catch switches that happen inside a single clause.

We configure the engine with a primary language and up to 4 secondary languages per session. The primary language gets a confidence bonus — roughly 15% — because in most meetings, 70-80% of speech stays in the dominant language. Without that bias, the system would over-trigger on language switches, flagging accent variations as switches when they are not.
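One way to implement that bias is a multiplier on the primary language's score, clamped to 1.0. The multiply-and-clamp form and the specific numbers here are our illustration, not the engine's actual formula.

```python
PRIMARY_BONUS = 1.15  # ~15% confidence multiplier for the session's primary language

def biased_scores(raw: dict[str, float], primary: str) -> dict[str, float]:
    """Apply the primary-language bias before picking a winner."""
    return {
        lang: min(1.0, score * (PRIMARY_BONUS if lang == primary else 1.0))
        for lang, score in raw.items()
    }

# A borderline segment: an accent-colored vowel scores slightly English,
# but the bias keeps it in the primary language instead of flagging a switch.
raw = {"es": 0.80, "en": 0.84}
adjusted = biased_scores(raw, primary="es")
assert max(adjusted, key=adjusted.get) == "es"  # 0.80 * 1.15 = 0.92 beats 0.84
```

A genuine switch still wins: if the English decoder scores well above the biased Spanish score, English takes the segment.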

Per-segment language detection in practice

When audio comes in, our pipeline processes it in chunks. Each chunk goes through three stages:

Stage 1: Voice activity detection. We identify which portions of the chunk contain speech versus silence or background noise. This filters out the keyboard clicks, chair squeaks, and HVAC hum that would confuse the language detector.
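A toy version of stage 1 is a frame-level energy gate. Production VAD is typically a small learned model, and the threshold below is made up, but the shape of the decision is the same: per-frame, speech or not.

```python
import math

ENERGY_THRESHOLD = 0.01  # illustrative; real VADs use learned models, not a fixed gate

def is_speech(frame: list[float]) -> bool:
    """Crude energy-based voice activity detection for one 20ms frame."""
    rms = math.sqrt(sum(s * s for s in frame) / len(frame))
    return rms > ENERGY_THRESHOLD

silence = [0.0] * 320  # 20ms of digital silence at 16kHz
tone = [0.1] * 320     # a loud, sustained signal
assert not is_speech(silence)
assert is_speech(tone)
```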

Stage 2: Language identification. For each speech segment, the engine's parallel decoders produce competing transcriptions with confidence scores. If the English decoder returns "we should review the contract" at 0.92 confidence, and the German decoder returns "wie schaut review der Kontrakt" at 0.71, the English transcript wins.

Stage 3: Boundary smoothing. Raw segment-level decisions are noisy. A speaker might produce a vowel sound that briefly scores higher in Portuguese than Spanish, even though they are speaking Spanish. We apply a 600ms smoothing window — if a detected "switch" lasts less than 600ms and is surrounded by the same language on both sides, we suppress it.

This smoothing is a tradeoff. It prevents false positives but means we sometimes miss very short insertions — a single borrowed word from another language might get absorbed into the surrounding language's transcript.
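The smoothing rule itself is simple. A sketch, assuming segments arrive as (language, duration) pairs; the real pipeline operates on streaming segment metadata rather than a finished list.

```python
SMOOTHING_MS = 600

def smooth_languages(segments: list[tuple[str, int]]) -> list[tuple[str, int]]:
    """Suppress a detected 'switch' shorter than the smoothing window when
    the same language appears on both sides of it.

    segments: (language, duration_ms) pairs in time order.
    """
    out = list(segments)
    for i in range(1, len(out) - 1):
        lang, dur = out[i]
        prev_lang, _ = out[i - 1]
        next_lang, _ = out[i + 1]
        if dur < SMOOTHING_MS and prev_lang == next_lang and lang != prev_lang:
            out[i] = (prev_lang, dur)  # absorb the blip into the surrounding language
    return out

# A 300ms Portuguese "blip" inside Spanish speech gets suppressed...
timeline = [("es", 2400), ("pt", 300), ("es", 1800)]
assert smooth_languages(timeline) == [("es", 2400), ("es", 300), ("es", 1800)]

# ...but a 900ms English insertion survives, because it clears the window.
mixed = [("es", 2400), ("en", 900), ("es", 1800)]
assert smooth_languages(mixed) == mixed
```

The second case is exactly the single-borrowed-word failure mode: a lone foreign word rarely lasts 600ms, so it gets absorbed.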

The 300ms latency budget

Real-time transcription means the words need to appear on screen while the speaker is still talking. Our target is 300ms end-to-end latency from audio capture to rendered text.

Here is where that budget goes: audio capture and transmission takes a few dozen milliseconds. Speech engine processing and language detection takes the largest share — a few hundred milliseconds. Our post-processing (smoothing, formatting, translation dispatch) and delivery to the client take a few dozen more. The total comes in well under 400ms in most cases.

The speech processing step is doing a lot of work — acoustic encoding, parallel decoding, language classification, and beam search all happen in that window. When a code-switch occurs, the system sometimes needs extra time to resolve the ambiguity, but users do not notice the added delay. Anything under 400ms feels "live" in a meeting context.

The translation step adds more time, but we handle that asynchronously. The original-language transcript appears first, and the translation follows 200-500ms later depending on segment length and target language.
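The transcript-first, translation-later pattern can be sketched with asyncio. The function names and the fixed 300ms sleep are stand-ins; the point is that translation is dispatched as a task and never blocks the transcript render.

```python
import asyncio

async def translate(text: str, target: str) -> str:
    """Stand-in for the real translation service (200-500ms in practice)."""
    await asyncio.sleep(0.3)
    return f"[{target}] {text}"

async def append_translation(text: str, target: str, out: list[str]) -> None:
    out.append(await translate(text, target))

async def deliver_segment(text: str, target: str, out: list[str]) -> asyncio.Task:
    out.append(text)  # original-language transcript renders immediately
    # translation is dispatched asynchronously so it never blocks the next segment
    return asyncio.create_task(append_translation(text, target, out))

async def main() -> list[str]:
    out: list[str] = []
    task = await deliver_segment("we should review the contract", "de", out)
    assert out == ["we should review the contract"]  # transcript already visible
    await task  # translation arrives afterward
    return out

lines = asyncio.run(main())
assert lines[1] == "[de] we should review the contract"
```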

Real examples from beta teams

Our beta program included 12 teams across 6 countries. Three examples stood out.

A consulting firm in Singapore ran meetings in English with frequent Mandarin side conversations. Their typical pattern: the main presentation in English, then two participants would discuss a point in Mandarin for 10-15 seconds, then switch back to English. These were clean switches — full sentences in one language, then full sentences in another. Our system handled these well because the segments were long enough for confident language detection.

A software team in Barcelona mixed Catalan, Spanish, and English constantly, sometimes within a single sentence. "Hem de fer push del branch antes de la daily" — roughly, "We need to push the branch before the daily standup." Three languages in one sentence, with technical English terms embedded in a Catalan/Spanish frame. The system correctly identified the English terms about 86% of the time but sometimes misattributed Catalan segments as Spanish, since the languages share significant phonetic overlap.

A logistics company in Dubai had meetings with Arabic, Hindi, and English. Arabic-to-English switches were relatively clean because the phonetic systems are so different — the language detector had strong signal. Hindi-to-English was harder because many Hindi speakers use English loanwords with Hindi phonology, blurring the acoustic boundary.

Accuracy stats

We measured code-switching accuracy across 340 hours of beta meeting recordings. The metric: did the system correctly identify the language of each transcribed segment and produce an accurate transcription in that language?

For segments longer than a few seconds in a single language: accuracy is strong. A few seconds of audio gives the system plenty of acoustic evidence.

For very short segments under 2 seconds: accuracy drops significantly. A two-second burst in a different language might only contain 4-6 words, which is not much signal.

For single borrowed words: accuracy is close to a coin flip. When someone drops a single foreign word into an otherwise monolingual sentence, the system usually cannot detect the switch fast enough.

We are not happy with the single-word detection rate. But removing the smoothing window to catch single words drops the accuracy on longer segments significantly, because false-positive language switches contaminate otherwise clean transcriptions. For meeting transcription, we believe the tradeoff favors stability.

What comes next

Three specific improvements are in our pipeline.

First, vocabulary-aware language detection. If we know a team frequently uses German financial terms in English meetings, we can pre-load those terms and reduce the confidence threshold for German on those specific words.
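A minimal sketch of that idea, assuming a hypothetical per-team custom vocabulary: known cross-language terms get a lower acceptance threshold for their language. The term list and threshold values are invented for illustration.

```python
# Hypothetical per-team vocabulary: German financial terms that show up
# in otherwise-English meetings.
CUSTOM_TERMS = {"de": {"vertrag", "umsatz", "bilanz"}}
BASE_THRESHOLD = 0.80
BOOSTED_THRESHOLD = 0.65  # illustrative values, not the engine's real thresholds

def threshold_for(candidate_word: str, language: str) -> float:
    """Lower the acceptance threshold for a language on known custom terms."""
    if candidate_word.lower() in CUSTOM_TERMS.get(language, set()):
        return BOOSTED_THRESHOLD
    return BASE_THRESHOLD

assert threshold_for("Vertrag", "de") == BOOSTED_THRESHOLD
assert threshold_for("meeting", "de") == BASE_THRESHOLD
```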

Second, speaker-linked language profiles. If speaker A has been speaking Mandarin for the last 45 seconds, the probability that their next segment is also Mandarin should be higher than the base rate.
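A sketch of a speaker-linked prior, assuming an exponentially decayed running distribution per speaker; the decay constant and the rescoring formula are illustrative choices, not the planned implementation.

```python
DECAY = 0.9  # how quickly the speaker's language history fades; illustrative

class SpeakerLanguagePrior:
    """Track a running per-speaker language distribution and use it to
    nudge the next segment's scores toward recent history."""

    def __init__(self, languages: list[str]):
        self.prior = {lang: 1.0 / len(languages) for lang in languages}

    def observe(self, language: str) -> None:
        # Exponential decay toward the most recently confirmed language.
        for lang in self.prior:
            target = 1.0 if lang == language else 0.0
            self.prior[lang] = DECAY * self.prior[lang] + (1 - DECAY) * target

    def rescored(self, raw: dict[str, float]) -> dict[str, float]:
        return {lang: raw[lang] * (0.5 + self.prior[lang]) for lang in raw}

p = SpeakerLanguagePrior(["zh", "en"])
for _ in range(10):  # speaker A has been confirmed in Mandarin for a while
    p.observe("zh")
scores = p.rescored({"zh": 0.70, "en": 0.72})
assert scores["zh"] > scores["en"]  # history tips a borderline segment to Mandarin
```

Without the history, the 0.72 English score would win this borderline segment outright.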

Third, post-meeting correction passes. Real-time has a 300ms budget. But after the meeting ends, we can run a second pass with a larger context window and more aggressive language detection.

Code-switching is the hardest problem in multilingual transcription. Those sub-2-second accuracy numbers make that clear. But for meetings where speakers switch languages in full phrases and sentences — which accounts for about 80% of real-world code-switching in business settings — the system works. And it works at a latency that feels live.

Try MangoFinch free

Real-time transcription and translation for multilingual teams. No credit card required.

Start a free meeting