Speaker diarization in multilingual audio: the unsolved problem

Figuring out who said what is hard enough in one language. When speakers switch between three languages in the same sentence, current diarization models fall apart. Here is why, and what we are doing about it.

MangoFinch Team · 7 min read

Speaker diarization is the task of figuring out who spoke when. You have an audio stream with multiple voices, and you need to label each segment: Speaker A said this from 0:03 to 0:07, Speaker B said that from 0:08 to 0:14, and so on.

In a monolingual meeting, this works reasonably well. Modern diarization systems achieve 8-12% Diarization Error Rate (DER) on English conference calls, meaning roughly 90% of the speech time is attributed to the right person. Not perfect, but useful.
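To make the metric concrete, here is a minimal frame-level sketch of DER: missed speech, false alarms, and speaker confusion, divided by total reference speech. It assumes the hypothesis labels have already been mapped onto the reference labels; real scoring tools also search for the best label mapping and usually forgive a small collar around segment boundaries. The function name and toy data are ours, for illustration only.

```python
from typing import Optional, Sequence


def diarization_error_rate(
    reference: Sequence[Optional[str]],
    hypothesis: Sequence[Optional[str]],
) -> float:
    """Frame-level DER. None marks a frame with no speech."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must cover the same frames")
    missed = false_alarm = confusion = speech = 0
    for ref, hyp in zip(reference, hypothesis):
        if ref is not None:
            speech += 1
            if hyp is None:
                missed += 1        # speech the system did not detect
            elif hyp != ref:
                confusion += 1     # speech credited to the wrong speaker
        elif hyp is not None:
            false_alarm += 1       # non-speech labeled as speech
    if speech == 0:
        raise ValueError("reference contains no speech")
    return (missed + false_alarm + confusion) / speech


# Toy example: 10 frames, 9 of them speech.
reference = ["A", "A", "A", "A", None, "B", "B", "B", "B", "B"]
hypothesis = ["A", "A", "A", "B", None, "B", "B", "B", None, "B"]
der = diarization_error_rate(reference, hypothesis)
print(f"DER = {der:.1%}")  # one confusion + one miss over 9 speech frames
```

Note that DER is measured over speech time, not over words or utterances, so a long misattributed monologue costs more than a short interjection.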

In a multilingual meeting, everything gets worse. Our internal benchmarks show DER climbing to 18-25% when the same meeting includes three or more languages. That is not a small degradation. It means one out of every four or five utterances is attributed to the wrong person.

This article is an honest assessment of where the field is, where MangoFinch is, and what we think it takes to fix it.

How diarization works in a single language

The standard pipeline has four stages.

**Voice Activity Detection (VAD).** The system identifies which portions of the audio contain speech versus silence, background noise, or music. This is the easiest stage — modern VAD is above 98% accurate in most conditions.

**Segmentation.** The speech portions are split into short segments, typically 1-3 seconds each. The goal is to create segments where only one person is speaking. Overlapping speech — two people talking at once — is the first major source of errors.

**Embedding extraction.** Each segment is run through a neural network that produces a fixed-size vector (an "embedding") representing the vocal characteristics of whoever is speaking. The most common approach uses x-vectors or d-vectors, which are trained to capture speaker identity regardless of what words are being said. Think of it as a voiceprint.

**Clustering.** The embeddings are grouped together. Segments with similar embeddings get assigned to the same speaker. Spectral clustering and agglomerative clustering are the two dominant methods. The system does not know who the speakers are by name — it just knows that segments 1, 4, 7, and 12 sound like the same person.

This pipeline works because, in a single language, a person's voice is relatively consistent. Their pitch range, speaking rate, formant frequencies, and vocal timbre stay stable enough that the embedding model can group their segments together.
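The clustering stage is the easiest one to show in a few lines. The sketch below is a toy agglomerative clustering with cosine distance and average linkage, written in plain Python. The 2-D vectors and the threshold are made up for illustration; real embeddings have hundreds of dimensions and production systems use tuned library implementations, not this loop.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine_distance(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def average_linkage(c1: List[int], c2: List[int], embeddings: List[Vector]) -> float:
    """Mean pairwise distance between the members of two clusters."""
    total = sum(cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerative_cluster(embeddings: List[Vector], threshold: float) -> List[List[int]]:
    """Merge the two closest clusters until none are closer than threshold."""
    clusters = [[i] for i in range(len(embeddings))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], embeddings)
                if best is None or d < best[0]:
                    best = (d, a, b)
        distance, a, b = best
        if distance > threshold:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return sorted(sorted(c) for c in clusters)


# Toy embeddings: segments 0, 2, 4 point one way, segments 1, 3 another.
segment_embeddings = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9], [1.0, 0.0]]
speakers = agglomerative_cluster(segment_embeddings, threshold=0.2)
print(speakers)
```

The output is a list of anonymous groups of segment indices. Nothing in it says who the speakers are, only which segments sound alike.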

Why multilingual audio breaks the pipeline

Here is the problem: when a speaker switches languages, their voice changes.

This is not a metaphor. Research from the University of Munich published in 2024 measured acoustic parameters across bilingual speakers and found measurable shifts in fundamental frequency (average pitch), speaking rate, and formant structure when the same person switched between German and English. The shifts were smaller than the differences between two different speakers in the same language, but large enough to confuse embedding models.

The issue compounds across three dimensions.

**Acoustic feature overlap.** The embedding model learns to distinguish speakers based on acoustic features. But some of those features are language-dependent. Japanese has a narrower pitch range than Italian. Mandarin uses tonal variation that English does not. When a speaker switches from English to Mandarin, their pitch patterns change in ways that the embedding model interprets as a different speaker.

We measured this directly. In a test with four speakers alternating between English and Japanese, the embedding model produced two distinct clusters for one of the speakers — one cluster for her English segments and another for her Japanese segments. The model thought she was two different people.

**Language detection competition.** The acoustic features that distinguish languages (tonal patterns, phoneme inventory, rhythm) overlap substantially with the features that distinguish speakers. A diarization model and a language identification model are, in a sense, competing for the same information in the audio signal.

When you run both simultaneously, which is what any multilingual transcription system needs to do, the two tasks interfere with each other. The language identification model says "this segment is Japanese based on the tonal patterns." The speaker model says "this segment sounds different from Speaker A's English segments, so it must be Speaker B." Both models are looking at the same tonal patterns and drawing different conclusions.

**The gap in training data.** Diarization models are overwhelmingly trained on English data. The NIST SRE (Speaker Recognition Evaluation) data, among the most widely used benchmarks, is primarily English. The AMI corpus, the CALLHOME corpus, and the VoxCeleb dataset are all heavily English, with some representation of other major languages.

Datasets containing natural code-switching between multiple languages in meeting contexts are almost nonexistent. You cannot train a model on data that does not exist.

Current approaches and their limitations

The research community has proposed several approaches to multilingual diarization. None of them fully work yet.

**Language-conditioned embeddings.** The idea is to train the speaker embedding model with language labels, so it learns to normalize for language-specific acoustic differences. A 2025 paper from INTERSPEECH demonstrated a 15% relative DER improvement on bilingual English-Mandarin data. The problem: the improvement only held for the two languages in the training data. Adding a third language (Japanese) degraded performance back to baseline.

**Multi-stage pipelines.** First detect the language, then run a language-specific diarization model. This avoids the interference problem by separating the two tasks. The downside is latency — you are now running language detection, then diarization, sequentially. In a real-time system, this adds 500-800ms of additional delay. It also requires maintaining separate diarization models for each supported language, which is expensive and complicated.

**End-to-end models.** Systems like Pyannote 3.0 attempt to do joint language identification and speaker diarization in a single model. The results are promising on benchmark data (16% DER on the DIHARD III multilingual subset) but the model was trained on European languages and performs significantly worse on CJK (Chinese, Japanese, Korean) languages.

None of these approaches handle the real-world case well: four speakers, three languages, with natural code-switching happening multiple times per minute.

Where MangoFinch is right now — an honest assessment

Our current diarization is basic. I want to be direct about this because I think the industry has a habit of overselling capabilities.

We use our speech engine's built-in diarization, which runs alongside the transcription model. It works by extracting speaker embeddings from the same audio frames used for speech recognition. This is efficient — one model, one pass — but it means the diarization does not get any special optimization for multilingual content.

Our measured DER across the beta:

- **Monolingual English meetings (2-4 speakers):** 10.3% DER. This is in line with industry benchmarks and is fine for most use cases.

- **Bilingual meetings (English + one other language):** 16.8% DER. Noticeable degradation, but the transcripts are still broadly usable. Most errors are at language transition points.

- **Trilingual or more (3+ languages, 3+ speakers):** 23.1% DER. This is where it gets rough. Nearly one in four segments is attributed to the wrong speaker. In a meeting where you need to know who committed to a deadline, this is a real problem.

The pattern of errors is consistent. When a speaker switches languages, there is a roughly 35% chance their next segment gets attributed to a different speaker. The model is confused by the acoustic shift, and it guesses wrong.
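For readers who want to track the same number on their own data, here is one way the switch error rate could be computed from labeled segments. This is a sketch, not our production metric code: the `Segment` fields are hypothetical, and it assumes predicted speaker labels have already been mapped onto the true ones.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Segment:
    true_speaker: str
    predicted_speaker: str
    language: str


def switch_error_rate(segments: Iterable[Segment]) -> float:
    """Share of language switches whose first segment is misattributed."""
    last_language: Dict[str, str] = {}
    switches = errors = 0
    for seg in segments:
        previous = last_language.get(seg.true_speaker)
        # A switch: this speaker's previous segment was in another language.
        if previous is not None and previous != seg.language:
            switches += 1
            if seg.predicted_speaker != seg.true_speaker:
                errors += 1
        last_language[seg.true_speaker] = seg.language
    if switches == 0:
        raise ValueError("no language switches found")
    return errors / switches


meeting = [
    Segment("A", "A", "en"),
    Segment("B", "B", "ja"),
    Segment("A", "B", "ja"),  # A switches en -> ja and is misattributed
    Segment("B", "B", "en"),  # B switches ja -> en, attributed correctly
    Segment("A", "A", "ja"),  # no switch
    Segment("A", "A", "en"),  # A switches ja -> en, attributed correctly
]
rate = switch_error_rate(meeting)
print(f"switch error rate = {rate:.0%}")
```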

We display speaker labels in the transcript, and users can manually correct them. But manual correction in a live meeting is not practical — by the time you fix the attribution, the conversation has moved on.

The compound error problem

Here is what makes multilingual meeting transcription genuinely difficult: errors multiply.

In a monolingual system, you have one error source — transcription accuracy. If the Word Error Rate (WER) is 8%, you know roughly how much of the transcript is wrong, and the errors are randomly distributed.

In a multilingual system with diarization, you have three error sources that compound:

1. **Language detection error.** If the system misidentifies a Japanese segment as Korean, the wrong transcription model runs on it, and the output is garbage. Our language detection accuracy is strong for segments over a few seconds, but the miss rate propagates downstream.

2. **Transcription error.** Even with correct language detection, speech-to-text is not perfect. WER varies by language — some languages perform significantly better than others on our current system.

3. **Diarization error.** The wrong speaker gets credited with the wrong words.

These are not independent. A language detection error causes a transcription error (wrong model applied), which makes the diarization harder (the embedding model is working with poorly transcribed audio that doesn't match any known speaker pattern).

The compound effect: multiply a few percentage points of language detection error, a moderate WER, and a significant DER, and the probability that a given segment is correctly identified by language, accurately transcribed, and attributed to the right speaker drops to roughly 70%. Multiplying treats the three errors as independent, which they are not, so read this as a rough estimate.

That means for a trilingual meeting, about 30% of segments have at least one thing wrong with them. Not all errors are equally severe: a minor transcription error with correct attribution is much less problematic than a perfect transcription attributed to the wrong person. But the compound effect is real, and we measure it.
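The arithmetic behind that estimate is a single multiplication. In the sketch below, only the diarization figure comes from the numbers reported earlier (23.1% DER on trilingual meetings). The language detection and transcription rates are hypothetical placeholders, and multiplying them assumes the three errors are independent, which is a simplification.

```python
# Illustrative per-segment success rates.
language_id_correct = 0.97        # hypothetical placeholder
transcription_correct = 0.94      # hypothetical placeholder, per segment
diarization_correct = 1 - 0.231   # from the trilingual DER reported above

# Probability that all three stages get a segment right,
# under the simplifying assumption that the errors are independent.
all_correct = language_id_correct * transcription_correct * diarization_correct
at_least_one_error = 1 - all_correct

print(f"all three correct:  {all_correct:.1%}")
print(f"at least one error: {at_least_one_error:.1%}")
```

With these placeholder values the product lands near 70%, which shows how a few points of upstream error on top of a 23% DER is enough to get there.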

Our roadmap for improvement

We have three specific initiatives in progress. I am sharing them because I think transparency about what is hard — and what we have not solved yet — is more useful than a marketing claim about "AI-powered speaker identification."

**1. Enrollment-based voice profiles (Q3 2026).** Before a meeting starts, each participant records a 15-second voice sample in each language they plan to use. This gives the diarization model a known reference point for each speaker-language combination. In our internal prototype, this reduces DER from 23% to 14% for trilingual meetings. The tradeoff: it requires setup before each meeting, and new participants who have not enrolled get worse diarization than the enrolled ones.
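A minimal sketch of the enrollment idea, assuming each participant has one reference embedding per language: an incoming segment is assigned to the speaker whose closest enrolled profile, in any language, is most similar. The names, the 3-D toy vectors, and the 0.7 threshold are all hypothetical, and this is not the prototype's actual matching logic.

```python
import math
from typing import Dict, Optional, Sequence, Tuple

Vector = Sequence[float]
ProfileKey = Tuple[str, str]  # (speaker name, language code)


def cosine_similarity(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def assign_speaker(
    embedding: Vector,
    profiles: Dict[ProfileKey, Vector],
    min_similarity: float = 0.7,
) -> Optional[str]:
    """Return the enrolled speaker with the best-matching profile, or None."""
    best_speaker, best_score = None, min_similarity
    for (speaker, _language), reference in profiles.items():
        score = cosine_similarity(embedding, reference)
        if score >= best_score:
            best_speaker, best_score = speaker, score
    return best_speaker


# Hypothetical enrolled profiles: Aiko enrolled in two languages, Ben in one.
enrolled = {
    ("Aiko", "en"): [0.9, 0.1, 0.2],
    ("Aiko", "ja"): [0.6, 0.5, 0.2],
    ("Ben", "en"): [0.1, 0.9, 0.3],
}

matched = assign_speaker([0.62, 0.48, 0.22], enrolled)  # close to Aiko's ja profile
unknown = assign_speaker([0.0, 0.1, 1.0], enrolled)     # resembles nobody enrolled
print(matched, unknown)
```

The `None` branch is the tradeoff in code form: a participant who never enrolled has no reference to match against and has to fall back to ordinary clustering.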

**2. Language-aware clustering (Q4 2026).** We are building a custom clustering step that takes language identity into account when grouping speaker embeddings. Instead of clustering purely on acoustic similarity, the algorithm will first group segments by detected language, then cluster within each language group, then merge clusters across languages using a cross-lingual speaker similarity model. Early experiments show a 20% relative DER improvement, but we need more multilingual training data before this is production-ready.
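The three steps (group by language, cluster within each language, merge across languages) can be sketched as follows. This is a toy illustration, not our implementation: the greedy clustering, the thresholds, and especially `toy_cross_lingual_similarity` are stand-ins. The real cross-lingual similarity model is the hard part, and here it is faked by pretending the last embedding dimension carries only language information.

```python
import math
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Segment = Tuple[str, Vector]  # (detected language, speaker embedding)


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def centroid(vectors: List[Vector]) -> Vector:
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def cluster_one_language(embeddings: List[Vector], threshold: float) -> List[List[int]]:
    """Greedy clustering: join the first cluster whose centroid is close enough."""
    clusters: List[List[int]] = []
    for i, embedding in enumerate(embeddings):
        for members in clusters:
            if cosine(embedding, centroid([embeddings[j] for j in members])) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters


def language_aware_clusters(
    segments: List[Segment],
    within_threshold: float,
    merge_threshold: float,
    cross_lingual_similarity: Callable[[Vector, Vector], float],
) -> List[List[int]]:
    # Step 1: group segment indices by detected language.
    by_language: Dict[str, List[int]] = defaultdict(list)
    for index, (language, _) in enumerate(segments):
        by_language[language].append(index)

    # Step 2: cluster within each language group.
    language_clusters: List[Tuple[str, List[int]]] = []
    for language, indices in by_language.items():
        local = cluster_one_language([segments[i][1] for i in indices], within_threshold)
        language_clusters += [(language, [indices[j] for j in members]) for members in local]

    # Step 3: merge across languages. Each speaker maps language -> segment indices.
    speakers: List[Dict[str, List[int]]] = []
    for language, members in language_clusters:
        center = centroid([segments[i][1] for i in members])
        best, best_score = None, merge_threshold
        for speaker in speakers:
            if language in speaker:  # this speaker already has a cluster in this language
                continue
            score = max(
                cross_lingual_similarity(center, centroid([segments[i][1] for i in other]))
                for other in speaker.values()
            )
            if score >= best_score:
                best, best_score = speaker, score
        if best is None:
            speakers.append({language: members})
        else:
            best[language] = members
    return [sorted(i for group in speaker.values() for i in group) for speaker in speakers]


def toy_cross_lingual_similarity(a: Vector, b: Vector) -> float:
    # Toy stand-in: ignore the last dimension, which we pretend encodes language.
    return cosine(a[:-1], b[:-1])


toy_segments: List[Segment] = [
    ("en", [1.0, 0.1, 0.0]), ("en", [0.1, 1.0, 0.0]), ("ja", [1.0, 0.2, 0.8]),
    ("ja", [0.2, 1.0, 0.8]), ("en", [0.9, 0.1, 0.0]), ("ja", [0.1, 0.9, 0.8]),
]
result = language_aware_clusters(toy_segments, 0.9, 0.9, toy_cross_lingual_similarity)
print(result)
```

The design choice worth noticing is in step 3: two clusters from the same language are never merged, because the within-language step has already decided they are different people.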

**3. Continuous speaker adaptation (2027).** The most ambitious item. During a meeting, the model builds an evolving profile of each speaker that updates as more data comes in. The first 30 seconds of speaker attribution might have high error; by minute 10, the model has heard enough from each speaker in each language to be significantly more accurate. Think of it as the model learning the participants during the meeting. This approach has shown strong results in academic papers but has not been productionized in a real-time system yet, as far as we know.
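One simple way to picture continuous adaptation is a running-mean profile per speaker that is updated with every segment assigned to it. The class below is a hypothetical sketch under that assumption, with made-up 2-D embeddings and threshold. It ignores language entirely and is far simpler than the approaches in the academic literature.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


class OnlineSpeakerProfiles:
    """Keeps one running-mean embedding per speaker seen so far."""

    def __init__(self, new_speaker_threshold: float = 0.8) -> None:
        self.threshold = new_speaker_threshold
        self.centroids: List[List[float]] = []
        self.counts: List[int] = []

    def assign_and_update(self, embedding: Sequence[float]) -> int:
        """Return a speaker id for the segment and fold it into that profile."""
        best_id, best_score = -1, self.threshold
        for speaker_id, profile in enumerate(self.centroids):
            score = cosine(embedding, profile)
            if score >= best_score:
                best_id, best_score = speaker_id, score
        if best_id == -1:
            # Nothing similar enough: start a new speaker profile.
            self.centroids.append(list(embedding))
            self.counts.append(1)
            return len(self.centroids) - 1
        # Incremental mean: the profile moves less as evidence accumulates.
        n = self.counts[best_id]
        self.centroids[best_id] = [
            (n * c + x) / (n + 1) for c, x in zip(self.centroids[best_id], embedding)
        ]
        self.counts[best_id] = n + 1
        return best_id


profiles = OnlineSpeakerProfiles()
stream = [[1.0, 0.1], [0.1, 1.0], [0.9, 0.2], [0.2, 0.9], [0.8, 0.3]]
labels = [profiles.assign_and_update(e) for e in stream]
print(labels, profiles.counts)
```

Early assignments rest on a single segment, which is why the first stretch of a meeting is the least reliable. An early mistake also contaminates the profile it lands in, and that is one of the reasons this is hard to do well in real time.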

What this means for MangoFinch users today

If you are using MangoFinch for multilingual meetings with three or more languages, the diarization will make mistakes. The transcript content, meaning the actual words spoken, will be mostly accurate, subject to the language detection and transcription errors described above. But the speaker labels will be wrong roughly 20-25% of the time in complex multilingual settings.

For meetings where attribution matters (decisions, action items, commitments), we recommend two practices:

- Use our manual speaker correction feature to fix misattributed segments during or after the meeting.

- Turn on the meeting summary feature, which uses a language model to infer speaker attribution from conversational context. This catches about 40% of diarization errors — when the language model can tell from context that "I will handle the deployment" was said by the engineer, not the product manager.

We are working on making this better. The honest truth is that multilingual speaker diarization is an unsolved problem in the research community, not just in our product. We are investing in solving it because our users need it, and because nobody else is focusing specifically on the intersection of multiple languages and multiple speakers in real-time audio.

If you are a researcher working on multilingual diarization and want to collaborate, reach out. We have real-world meeting data across 14 language combinations that does not exist in any public dataset. Science moves faster with real data.
