Multilingual Meetings

Why meeting transcription still fails multilingual teams

Most transcription tools were built for English. Here is what actually goes wrong when your team speaks three languages in one meeting, and how we fixed it.

MangoFinch Team · 6 min read

We built MangoFinch because we kept losing conversations.

Not in the dramatic, existential sense. In the literal sense: someone on the call would switch from English to Portuguese mid-sentence, the transcription tool would choke, and 40 seconds of context would vanish. The notes would show "[inaudible]" or, worse, a confident English transcription of Portuguese words that meant something completely different.

This happened enough that we started looking into it. What we found was not surprising, but it was frustrating.

The single-language assumption

Most transcription tools pick a language at the start of a session and stick with it. Rev, Otter, Google Meet's built-in captions — they all ask you to set a language before the meeting starts. Some let you pick two. None of them handle what actually happens in a multilingual team meeting, which is that people drift between languages without announcing it.

A product manager in Mexico City might open in English for the New York team, switch to Spanish when explaining a UI string, then read a customer quote in Portuguese. That is one agenda item. Three languages. No pauses between them.

If your transcription engine committed to English at minute zero, the Spanish and Portuguese portions just came out as garbage.

Why "multi-language support" usually means "we support many languages, one at a time"

Look at the feature pages carefully. When a tool says it supports 30+ languages, it almost always means you can pick any one of those 30 before the meeting starts. The dropdown is impressive. The actual behavior is single-track.

There is a reason for this. Language detection in real-time audio is hard. The acoustic models for each language are different, the vocabulary sets are different, and the latency budget for live transcription is tiny — you have about 300 milliseconds before the delay becomes noticeable. Running multiple language models in parallel multiplies compute cost. Running a language-detection layer on top of that adds another 100-200ms.

So most services make the practical choice: pick one, run it fast, keep the bill low.

What we did differently

MangoFinch uses a speech engine built specifically for multilingual audio. It was trained to handle code-switching — the technical term for what happens when someone flips between languages inside a single utterance.

The engine listens to a rolling window of audio and makes per-segment language decisions. It does not commit to English for the whole call. It evaluates each phrase independently. When someone switches from English to Japanese mid-sentence, the transcription reflects that within about a second.
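The rolling-window idea can be sketched in a few lines. This is an illustrative toy, not our production engine: detect_language() here is a stub that votes over frame tags, standing in for a real acoustic language-ID model, and the window size is an assumed value.

```python
from collections import Counter, deque

WINDOW_FRAMES = 30  # ~3 s of audio at 10 frames/s (assumed, not our real figure)

def detect_language(frames):
    # Toy stand-in for an acoustic language-ID model:
    # vote on the most common language tag in the window.
    return Counter(frames).most_common(1)[0][0]

def transcribe_stream(frames, window_size=WINDOW_FRAMES):
    """Yield (frame, detected_language) pairs, re-deciding the language
    per rolling window instead of locking it in at minute zero."""
    window = deque(maxlen=window_size)
    for frame in frames:
        window.append(frame)
        yield frame, detect_language(window)

# A stream that drifts from English into Japanese mid-call:
stream = ["en"] * 40 + ["ja"] * 40
decisions = [lang for _, lang in transcribe_stream(stream)]
```

The point of the sketch is the control flow: because the decision is re-made every frame over a bounded window, a mid-sentence switch flips the detected language within a window's worth of audio rather than never.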

We then feed each segment to our translation layer, tagged with its detected source language. The translations render inline, below the original text, so every participant sees both the original and a version in their preferred language.
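The hand-off between detection and translation is easiest to see as a data shape. The names below (Segment, translate, render_inline) are illustrative, not MangoFinch's actual API, and the translation table is a toy lookup:

```python
from dataclasses import dataclass

@dataclass
class Segment:
    text: str   # original transcribed text
    lang: str   # detected source language tag

def translate(segment: Segment, target_lang: str) -> str:
    # Stand-in for the real translation layer: a tiny lookup table.
    table = {("こんにちは", "en"): "Hello"}
    return table.get((segment.text, target_lang), segment.text)

def render_inline(segments, viewer_lang):
    """Original line first, translation underneath — only when the
    segment's language differs from the viewer's preference."""
    lines = []
    for seg in segments:
        lines.append(seg.text)
        if seg.lang != viewer_lang:
            lines.append(f"  → {translate(seg, viewer_lang)}")
    return "\n".join(lines)

transcript = [Segment("Let's ship Friday.", "en"), Segment("こんにちは", "ja")]
rendered = render_inline(transcript, viewer_lang="en")
print(rendered)
```

Tagging each segment with its detected language is what lets the renderer skip translations the viewer does not need, instead of translating everything for everyone.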

The result is a transcript that looks like what actually happened in the meeting, not a single-language approximation of it.

The cost question

This is more expensive to run than single-language transcription. We are not going to pretend otherwise. The compute cost per minute is roughly 2.5x what a single-language stream costs, because the engine is doing more work per audio frame.

We absorbed that into the pricing because the alternative — delivering broken transcripts to multilingual teams — is not a product. It is a demo that works in English and apologizes in every other language.

What this changes for teams

The teams using MangoFinch in beta have between 3 and 8 languages represented in their regular meetings. The common pattern is a primary language (usually English) with frequent switches to local languages for precision, cultural context, or because someone is more comfortable expressing a complex idea in their first language.

Before MangoFinch, these teams had two options: ask everyone to stick to English (which slows down the native speakers and loses nuance), or accept that the transcript will be incomplete.

Now the transcript captures everything, in every language, with inline translations. The meeting notes are searchable across all languages. A Japanese team member can search for a concept discussed in Spanish and find it, because the English translation is indexed alongside the original.
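The cross-language search trick is simply to index each segment's English translation alongside its original text. A minimal inverted-index sketch, with an assumed whitespace tokenizer and field names that are illustrative rather than our production schema:

```python
def tokenize(text):
    # Assumed: naive lowercase whitespace tokenizer.
    return text.lower().split()

def build_index(segments):
    """segments: dicts with 'original' and 'translation_en' fields.
    Both fields are indexed, so a query in one language can match
    a segment that was spoken in another."""
    index = {}
    for i, seg in enumerate(segments):
        for field in ("original", "translation_en"):
            for tok in tokenize(seg[field]):
                index.setdefault(tok, set()).add(i)
    return index

segments = [
    {"original": "El despliegue falló anoche",
     "translation_en": "the deploy failed last night"},
    {"original": "We rolled it back",
     "translation_en": "we rolled it back"},
]
index = build_index(segments)
hits = index.get("deploy", set())
```

Searching "deploy" finds the Spanish segment because its English translation sits in the same index — no query-time translation required.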

Where it still breaks

We are not going to claim this is perfect. Language detection fails on very short utterances — a single word in a different language sometimes gets classified as the surrounding language. Proper nouns cause confusion, especially names that exist in multiple languages. And some language pairs are harder than others: Portuguese and Spanish segments occasionally get misclassified because the acoustic signatures overlap.

We track these failure rates. Current accuracy on language detection is strong for segments longer than a few seconds. For very short segments, accuracy drops noticeably. We are working on it.

Try it

If your team regularly speaks more than one language in meetings, we built this for you. The beta is open at mangofinch.com. No credit card, no sales call. Start a room, invite your team, and see what your meetings actually sound like when every language gets captured.
