How we handle Japanese — the hardest language for real-time transcription

Three writing systems, verbs that come at the end of the sentence, 100+ readings for a single kanji, and zero spaces between words. Here is what makes Japanese transcription different from everything else we support.

MangoFinch Team · 8 min read

Every language has transcription challenges. Mandarin has tones. Arabic has right-to-left text and connected script. German has compound nouns that could be their own sentences.

Japanese has its own versions of some of those problems (pitch accent, long noun compounds) and several more that are uniquely its own. After eight months of tuning MangoFinch for Japanese-language meetings, I can say with confidence that it is the single hardest language we support for real-time transcription. Not by a small margin.

This is a detailed look at why, and what we do about it.

Three writing systems in one language

Japanese uses three scripts simultaneously: hiragana, katakana, and kanji. A single sentence routinely contains all three. The sentence "I ate sushi at the restaurant" in Japanese — "レストランで寿司を食べました" — contains katakana (レストラン, "restaurant," a loanword from English), kanji (寿司, "sushi," and 食, "eat"), and hiragana (で, を, べました, grammatical particles and verb conjugation).

For transcription, this means the speech-to-text model has to produce output in three different character sets, switching between them based on word origin and grammatical function. An English STT model has one alphabet. A Japanese STT model needs to navigate roughly 2,136 commonly used kanji (the joyo kanji set), 46 hiragana characters, 46 katakana characters, and the rules for when to use which.
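To make the three-script mix concrete, here is a minimal sketch that labels each character of the example sentence by Unicode block and groups consecutive characters of the same script. The function names are illustrative, and the ranges cover only the basic hiragana, katakana, and CJK Unified Ideographs blocks, so rare kanji in extension blocks would fall into "other".

```python
def script_of(ch: str) -> str:
    """Classify one character by Unicode block (basic ranges only)."""
    code = ord(ch)
    if 0x3040 <= code <= 0x309F:
        return "hiragana"
    if 0x30A0 <= code <= 0x30FF:
        return "katakana"
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def script_runs(text: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share a script."""
    runs: list[tuple[str, str]] = []
    for ch in text:
        script = script_of(ch)
        if runs and runs[-1][0] == script:
            # Same script as the previous character: extend the current run.
            runs[-1] = (script, runs[-1][1] + ch)
        else:
            runs.append((script, ch))
    return runs


runs = script_runs("レストランで寿司を食べました")
for script, chunk in runs:
    print(f"{script:9} {chunk}")
```

The sentence switches script five times in thirteen characters, which is the pattern the STT output has to reproduce.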

Getting the script wrong is a real error, not a cosmetic one. Writing a kanji word in hiragana is like writing "2" as "two" in an accounting spreadsheet — technically the same information, but it looks wrong and changes how the reader processes the text. Worse, some words are conventionally written in one script in some contexts and another in others. "Things" can be written as 物 (kanji) or もの (hiragana), and the choice carries subtle differences in formality and emphasis.

Our accuracy on script selection is strong in formal business Japanese. In casual conversation, it drops noticeably because speakers use more hiragana for words that would traditionally be written in kanji — a style preference that varies by generation, region, and individual habit.

The sentence-final verb problem

English is an SVO (Subject-Verb-Object) language. "I approved the budget." You know the action (approved) by the third word. A simultaneous interpreter or transcription system can start processing the meaning early.

Japanese is SOV. The equivalent sentence is "私は予算を承認しました", literally "I (topic) budget (object) approved (past tense)." The verb comes at the end, and so does negation: "I approved" (承認しました) and "I did not approve" (承認しませんでした) differ only in the verb ending, which is the last thing the speaker says.

For real-time transcription, this is a display problem more than an accuracy problem. The STT model can transcribe each word as it hears it. But for real-time translation — which is what our multilingual users need — you cannot start translating the sentence until you hear the verb. Otherwise you might translate "I... the budget..." and then have to revise when the verb turns out to be "rejected" instead of "approved."

We handle this with a buffering strategy. When the detected language is Japanese, the translation engine waits for end-of-utterance signals — a pause, a sentence-final particle, or a grammatical completion marker — before committing the translation. The raw Japanese transcription appears in real-time, character by character. The English translation appears after a delay that averages 2.8 seconds from the end of the Japanese utterance.
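The buffering idea can be sketched roughly as follows. This is an illustration under simplifying assumptions, not the production logic: the `TranslationBuffer` class, the `PAUSE_SECONDS` threshold, and the short list of sentence-final endings are all hypothetical, and a real system would need far more robust end-of-utterance detection (a naive suffix list like this one will misfire on endings that can be followed by more material).

```python
from dataclasses import dataclass, field
from typing import Optional

# Hypothetical end-of-utterance cues; a real system would use richer signals.
SENTENCE_FINAL = ("。", "ました", "でした", "です")
PAUSE_SECONDS = 0.7  # assumed threshold, chosen only for this example


@dataclass
class TranslationBuffer:
    pending: list[str] = field(default_factory=list)

    def push(self, fragment: str, silence_after: float) -> Optional[str]:
        """Add a transcribed fragment; return the utterance once it looks complete."""
        self.pending.append(fragment)
        text = "".join(self.pending)
        if silence_after >= PAUSE_SECONDS or text.endswith(SENTENCE_FINAL):
            self.pending.clear()
            return text  # complete enough to hand to the translator
        return None  # keep buffering; the verb may not have arrived yet


buf = TranslationBuffer()
fragments = [("私は", 0.1), ("予算を", 0.1), ("承認しました", 0.2)]
results = [buf.push(text, gap) for text, gap in fragments]
```

Nothing is released for translation until the third fragment arrives with the verb, which is the behaviour described above: the Japanese text can be shown immediately, while the translation waits.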

For comparison, our English-to-Spanish translation delay averages 1.4 seconds. Japanese-to-English takes twice as long because of the verb-final structure. We cannot cheat on this without sacrificing accuracy. A partial translation that says "I the budget approved" and then revises to "I rejected the budget" would be worse than waiting 1.4 extra seconds for the correct translation.

We tested a predictive approach in early 2026 — using a language model to guess the likely verb based on context before the speaker finishes. It produced the correct verb 72% of the time but the wrong verb 28% of the time. A 28% chance of translating "approved" when the speaker said "rejected" is obviously unacceptable in a business meeting. We shelved it.

Homophones: the 100-readings problem

Japanese has an extreme homophone problem. The sound "こうしょう" (koushou) can mean: negotiation (交渉), notarization (公証), ore deposit (鉱床), historical research (考証), oral tradition (口承), nominal value (公称), or at least eight other things depending on which kanji you write.

In spoken Japanese, context disambiguates. Humans figure out which "koushou" from the surrounding conversation. For an STT model, this means the acoustic signal alone is insufficient: the model needs a language model layer that evaluates which kanji spelling is most probable given the sentence context.

Our speech engine handles this with a contextual language model that runs alongside the acoustic model. For common business vocabulary, the disambiguation accuracy is strong. For less frequent words or domain-specific vocabulary, it drops. In a test with a pharmaceutical company meeting, accuracy on technical terminology homophones was noticeably lower than on common business terms.

We added custom vocabulary boosting specifically for this reason. Users can upload a glossary of terms used in their organization — product names, technical terms, project codenames — and the model biases toward those readings when a homophone is detected. A biotech company using MangoFinch can add "抗体" (koutai, antibody) to their vocabulary list, and the model will prefer that reading over "交替" (koutai, alternation) when it hears the same sound in context.

The vocabulary boost improved our homophone accuracy by roughly 10 percentage points in domain-specific testing. It is not automatic — someone has to set up the glossary — but it is the difference between a usable and unusable transcript for technical Japanese meetings.
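As a rough illustration of how glossary boosting can tip a homophone decision, the sketch below multiplies the score of any candidate that appears in the user's glossary. The function name, the scores, and the boost factor are all made up for the example; the actual scoring inside the speech engine is not described here.

```python
def pick_spelling(candidates: dict[str, float], glossary: set[str], boost: float = 2.0) -> str:
    """Return the candidate spelling with the highest score after glossary boosting."""
    def boosted(item: tuple[str, float]) -> float:
        word, score = item
        return score * boost if word in glossary else score

    return max(candidates.items(), key=boosted)[0]


# Made-up context scores for the sound "koutai".
koutai = {"交替": 0.55, "抗体": 0.45}

without_glossary = pick_spelling(koutai, glossary=set())
with_glossary = pick_spelling(koutai, glossary={"抗体"})
print(without_glossary, with_glossary)
```

Without a glossary the more common business word wins; with "抗体" in the glossary the boosted score (0.90) beats the unboosted one (0.55).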

No spaces between words

English has spaces between words. Japanese does not. The sentence "今日の会議は3時からです" has no visual word boundaries. A native reader knows that this breaks down as 今日/の/会議/は/3時/から/です ("Today's meeting is from 3 o'clock"), but the segmentation is inferred, not explicit.

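For transcription, the STT model has to perform word segmentation as part of the transcription process. Get the segmentation wrong and you change the meaning. The classic example is the kana string "きょうはいしゃ", which can be read as 今日は医者 ("today, a doctor") or 今日歯医者 ("today, dentist"). In speech the particle は is pronounced "wa", so this particular pair is easier to tell apart by ear than on the page, but the same kind of boundary ambiguity shows up in audio, and the correct segmentation depends on context the acoustic model may not have.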
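The ambiguity is easy to reproduce with a toy lexicon. The sketch below enumerates every way to split a kana string into known words; the four-entry lexicon is hypothetical and exists only to show that a single string can have more than one valid segmentation.

```python
def segmentations(text: str, lexicon: set[str]) -> list[list[str]]:
    """Return every way to split text into words from the lexicon."""
    if not text:
        return [[]]  # one way to segment the empty string: no words
    results = []
    for end in range(1, len(text) + 1):
        head = text[:end]
        if head in lexicon:
            for rest in segmentations(text[end:], lexicon):
                results.append([head] + rest)
    return results


lexicon = {"きょう", "は", "いしゃ", "はいしゃ"}
splits = segmentations("きょうはいしゃ", lexicon)
for words in splits:
    print(" / ".join(words))
```

Both splits are valid against the lexicon, so choosing between them requires context rather than more dictionary entries.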

This affects search. If a user searches for "会議" (meeting) in a transcript, the search engine needs to find it within unsegmented text. Standard text search assumes word boundaries. Japanese search requires morphological analysis — breaking the text into tokens first, then matching.

We use MeCab, an open-source morphological analyzer, as a preprocessing step for our search index. Every Japanese transcript gets tokenized before indexing. This adds about 200ms of processing per minute of transcript, which is fast enough that it runs as a background job after each meeting.
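The tokenize-then-index step looks roughly like this. To keep the example self-contained, a greedy longest-match tokenizer over a tiny hand-written lexicon stands in for MeCab; that stand-in, the `LEXICON` contents, and the function names are assumptions for illustration and are much cruder than a real morphological analyzer.

```python
from collections import defaultdict

# Toy lexicon standing in for a real morphological analyzer such as MeCab.
LEXICON = {"今日", "の", "会議", "は", "3時", "から", "です"}


def tokenize(text: str) -> list[str]:
    """Greedy longest-match segmentation; unknown characters become single tokens."""
    tokens, i = [], 0
    max_len = max(len(word) for word in LEXICON)
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in LEXICON or size == 1:
                tokens.append(text[i:i + size])
                i += size
                break
    return tokens


def build_index(transcripts: dict[str, str]) -> dict[str, set[str]]:
    """Map each token to the ids of the transcripts that contain it."""
    index = defaultdict(set)
    for doc_id, text in transcripts.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index


index = build_index({"m1": "今日の会議は3時からです", "m2": "今日は3時からです"})
print(sorted(index["会議"]))
```

A search for "会議" then becomes a dictionary lookup on tokens instead of a substring scan over unsegmented text.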

Keigo: when politeness changes the words

Japanese has a formalized system of honorific speech called keigo. It is not just adding "please" — it changes the verbs, the nouns, and the sentence structure entirely.

"To eat" in plain form is 食べる (taberu). In polite form: 食べます (tabemasu). In humble form (when you are eating and speaking to a superior): いただく (itadaku). In respectful form (when describing your superior eating): 召し上がる (meshiagaru). Same action, four forms, and the humble and respectful verbs share no sounds with the plain one.

A transcription model trained mostly on casual speech will struggle with formal keigo because the vocabulary is almost entirely different. Conversely, a model trained on formal speech will misrecognize casual contracted forms — like "食べちゃった" (tabechatta), the casual contraction of "食べてしまった" (tabete shimatta, "ended up eating").

Business meetings in Japanese companies almost always use keigo, but the level varies. A meeting between peers uses polite form. A presentation to senior management uses honorific form. An informal standup might drop to plain form. Within a single one-hour meeting, a speaker might shift between two or three keigo levels depending on who they are addressing.

Our accuracy numbers reflect this:

- **Formal keigo (presentations, client meetings):** best performance

- **Polite form (standard internal meetings):** slightly lower

- **Casual/plain form (standups, brainstorms):** noticeably lower accuracy

The casual form accuracy is lower because casual Japanese compresses and contracts heavily. "ではないでしょうか" (dewa nai deshou ka, "isn't it?") becomes "じゃないっしょ" (ja nai ssho) in casual speech. The contracted form is phonetically distant from the full form, and training data for casual business Japanese is scarce because most corporate recordings use polite form.

The keigo translation problem

Keigo does not translate into English. There is no English equivalent of the distinction between "I will do it" (plain), "I will do it" (polite), "I will humbly do it" (humble), and "you will graciously do it" (respectful). They all translate to "I will do it" or "you will do it."

This creates a real information loss in our Japanese-to-English translations. In a Japanese meeting, the shift from polite to humble form when addressing a client signals deference and relationship dynamics. In the English translation, this signal disappears.

We currently handle this with translator's notes — when keigo level shifts significantly within a conversation, we add a parenthetical note in the English translation: "(formal register)" or "(casual register)." This is clunky but better than nothing. Two of our Japanese beta testers specifically requested this feature because they were sharing translated transcripts with English-speaking colleagues who needed to understand the social dynamics of the conversation.
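The note-on-shift behaviour can be sketched in a few lines. This assumes each translated segment already carries a register label from some upstream step; the labels, the function name, and the sample sentences are hypothetical.

```python
def annotate_register(segments: list[tuple[str, str]]) -> list[str]:
    """Prefix a translated line with a register note whenever the level changes."""
    lines, previous = [], None
    for register, english in segments:
        if previous is not None and register != previous:
            # The register shifted relative to the previous segment: flag it.
            lines.append(f"({register} register) {english}")
        else:
            lines.append(english)
        previous = register
    return lines


segments = [
    ("formal", "Thank you for your time today."),
    ("formal", "We will send the proposal tomorrow."),
    ("casual", "OK, that went well."),
]
annotated = annotate_register(segments)
print("\n".join(annotated))
```

Only the third line gets a note, because it is the only point where the register changes.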

Code-switching in Japanese tech companies

This is where Japanese transcription meets our multilingual mission head-on.

Japanese tech companies have a specific speech pattern: English technical terms embedded in Japanese grammar. A developer might say "プルリクエストをレビューして、マージしておきます" ("I will review the pull request and merge it"). The words "pull request," "review," and "merge" are pronounced with Japanese phonology (puru rikuesuto, rebyuu, maaji) but are recognizably English-origin.

These loanwords are written in katakana, not in English letters. The transcription model needs to recognize English words spoken with Japanese pronunciation and render them in katakana, not in the Roman alphabet. Writing "pull requestをreviewして" (mixing scripts) is technically readable but non-standard. The correct output is "プルリクエストをレビューして."
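One simple way to normalize mixed-script output like the example above is a lookup table from English terms to their katakana spellings. The sketch below is illustrative only: the three-entry table and the function name are assumptions, and a naive replace like this would also rewrite genuine English phrases, so it would need to run only on spans already identified as loanwords.

```python
import re

# Tiny illustrative table; the values are the standard katakana spellings.
LOANWORDS = {
    "pull request": "プルリクエスト",
    "review": "レビュー",
    "merge": "マージ",
}

# Longest terms first so multi-word entries win over shorter overlaps.
_PATTERN = re.compile(
    "|".join(re.escape(term) for term in sorted(LOANWORDS, key=len, reverse=True)),
    re.IGNORECASE,
)


def to_katakana(text: str) -> str:
    """Replace known Latin-script loanwords with their katakana spelling."""
    return _PATTERN.sub(lambda match: LOANWORDS[match.group(0).lower()], text)


fixed = to_katakana("pull requestをreviewして、mergeしておきます")
print(fixed)
```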

Then there is actual code-switching, where a Japanese speaker drops into full English for a phrase. "The deadline is Friday" spoken in English within an otherwise Japanese conversation. The transcription model needs to recognize the language switch, transcribe the English in English, and resume Japanese transcription without attributing the English segment to a different speaker (see our diarization article for why this is hard).

In our testing with three Japanese tech companies, approximately 12% of words in Japanese meeting audio were English-origin terms rendered in katakana, and another 4% were full English phrases. The katakana rendering accuracy is 94%. The full English switch detection is 88% — meaning 12% of the time, an English phrase gets transcribed as garbled Japanese because the model does not detect the language switch quickly enough.

The numbers, honestly

Here is our current accuracy profile for Japanese, broken down by condition:

| Condition | Performance | Notes |
|-----------|-------------|-------|
| Formal Japanese, quiet room, single speaker | Best | Our best case |
| Polite Japanese, meeting room, 2-4 speakers | Good | Standard business meeting |
| Casual Japanese, any environment | Lower | Standups, brainstorms |
| Japanese with heavy English code-switching | Moderate | Tech company meetings |
| Japanese to English translation, formal | Strong | Formal register only |
| Japanese to English translation, casual | Weaker | Contractions cause issues |

Across all conditions, Japanese accuracy lags behind English and most European languages by a meaningful margin. Japanese is consistently our lowest performer, and the gap is directly attributable to the linguistic features described in this article.

What we are doing about it

Three things, all in progress.

**Custom Japanese acoustic model fine-tuning.** We are working with our speech engine provider on a fine-tuning pass specifically for business Japanese, using hundreds of hours of meeting audio from our beta users (anonymized and with consent). The fine-tuning targets casual speech contractions and keigo transitions, which are the two biggest sources of errors. Early results show a 3-point improvement in word error rate (WER). We expect this to ship in Q3 2026.

**Improved katakana rendering.** We are building a post-processing layer that specifically handles English-origin technical terms. When the model outputs ambiguous text that could be katakana or attempted English transcription, the layer checks against a database of 15,000 common English loanwords used in Japanese and selects the correct katakana rendering. This is in testing now.

**Contextual keigo detection.** A lightweight model that runs alongside the main transcription and detects keigo level shifts. This feeds into both the Japanese transcription (improving word selection) and the translation (enabling more accurate register notes). Prototype stage, targeting Q4 2026.

Japanese will probably always be our hardest language. The structural complexity is inherent — three writing systems and SOV structure are not problems we can engineer away. But the gap between our current Japanese accuracy and what we achieve in English is partly a training data problem and partly an engineering problem. We are working on both.

If you run Japanese-language meetings and want to test MangoFinch, we are especially interested in feedback from Japanese users. Every error report helps us tune the model for the specific patterns of Japanese business speech. Start a room at mangofinch.com — it takes about 30 seconds.
