How real-time translation actually works under the hood

A technical walkthrough of the full pipeline from speech to translated text: WebSockets, streaming speech-to-text, machine translation, and the latency budget that makes it work.

MangoFinch Team · 9 min read

When someone speaks Japanese in a meeting and another participant reads the English translation 1.4 seconds later, a lot happened in that gap. Eight distinct processing steps touched that audio, and each one had a millisecond budget it could not exceed.

I am going to walk through exactly what happens — every hop, every tradeoff, every place where we had to choose speed over perfection. This is the architecture that powers MangoFinch's real-time meeting translation, and I will share the actual latency numbers we measure in production.

The full pipeline

Here is the path a single spoken phrase takes from mouth to translated text on screen:

1. Audio capture — browser MediaStream API grabs raw PCM audio from the microphone

2. WebRTC transport — audio frames move to our media server via WebRTC

3. Speech engine WebSocket — streaming audio hits our speech-to-text engine

4. Language detection — the engine identifies the spoken language per utterance

5. Per-segment transcription — partial and final transcripts come back as JSON events

6. Translation engine — final transcript segments get translated to each participant's target language

7. Real-time data channels — translated text is pushed to every connected client

8. Browser rendering — the translation overlay updates without blocking the UI thread

Eight steps. In production, the whole thing typically completes in 1.2 to 1.8 seconds from the moment someone finishes a phrase to the moment every participant sees the translation.

The latency budget

We track latency at every boundary. Here is a representative breakdown from a recent production session:

Audio capture and encoding takes a few dozen milliseconds, depending on device. WebRTC transport to server adds a few dozen more, varying with geography. The speech engine takes a few hundred milliseconds — partial results come faster, finals are slower. The translation API adds under 200ms per segment. Data channel push to the client and DOM rendering add negligible time since the connections are already open.

That total looks lower than the 1.2-1.8 second claim. The difference is utterance endpoint detection — the speech engine needs to determine that someone has finished a phrase before it emits a final transcript. That detection adds several hundred milliseconds depending on speech patterns and language.

Those numbers are real. We log them. Every translated segment carries timestamps from capture through delivery, and we aggregate them nightly.
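To make that logging concrete, here is a minimal sketch of what a per-segment timing record and its per-hop breakdown could look like in TypeScript. The field names are illustrative, not our actual schema.

```typescript
// A sketch of per-segment latency accounting. Field names are
// illustrative; each field is a millisecond timestamp stamped at
// the corresponding pipeline boundary.
interface SegmentTimestamps {
  capturedAt: number;    // audio left the browser
  receivedAt: number;    // media server received the frames
  finalizedAt: number;   // speech engine emitted the final transcript
  translatedAt: number;  // translation API responded
  deliveredAt: number;   // data channel push reached the client
}

// Per-hop deltas, in milliseconds, ready for nightly aggregation.
function hopLatencies(t: SegmentTimestamps): Record<string, number> {
  return {
    transport: t.receivedAt - t.capturedAt,
    speech: t.finalizedAt - t.receivedAt,
    translation: t.translatedAt - t.finalizedAt,
    delivery: t.deliveredAt - t.translatedAt,
    total: t.deliveredAt - t.capturedAt,
  };
}
```

Aggregating these deltas per hop is what lets you see which boundary is eating the budget.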

Audio capture and the WebRTC hop

The pipeline starts in the browser. We use the standard MediaStream API to capture microphone input. The audio is 16-bit PCM, mono, at 16kHz — matching the format our speech engine expects avoids a resampling step that would add unnecessary latency.
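As a sketch, the capture constraints could look like this. Note that browsers treat sampleRate as a hint rather than a guarantee, and which processing flags we actually enable (echo cancellation, noise suppression) is simplified here.

```typescript
// Capture constraints matching the speech engine's expected input:
// mono, 16 kHz. Browsers may ignore sampleRate, so a server-side
// resample check can still be needed. The processing flags below are
// illustrative, not necessarily what ships in production.
const captureConstraints = {
  audio: {
    channelCount: 1,        // mono
    sampleRate: 16000,      // 16 kHz, the engine's expected rate
    echoCancellation: true,
    noiseSuppression: true,
  },
  video: false,
};

// In the browser, the stream would be opened like this, with raw PCM
// frames then pulled out via an AudioWorklet or similar:
// const stream = await navigator.mediaDevices.getUserMedia(captureConstraints);
```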

The audio does not go directly to the speech engine from the browser. It routes through our real-time media server first. This might seem like an unnecessary hop, but it solves two problems: it gives us server-side control over the audio stream, and it means the speech engine WebSocket connection lives on our server, not in the user's browser. That matters for reconnection handling.

The streaming speech engine

This is the core of the transcription pipeline.

We maintain a persistent WebSocket connection to our speech-to-text engine. Audio frames stream in continuously. The engine processes the audio incrementally and sends back two types of events:

Partial results arrive every 100-300ms. These are the engine's best guess at what is being said right now. We display these as preview text so participants can see that transcription is happening.

Final results arrive when endpoint detection determines a phrase is complete. These are the high-confidence transcripts that we send to translation.

We configure the engine with interim results enabled for live preview, a 1-second silence threshold to force-finalize utterances, voice activity detection events, and language set to "multi" for automatic per-utterance language detection.

That last setting is important. Each final result includes a detected language code, and we use that to determine whether translation is needed and what the source language is.
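A simplified handler for the two event types might look like this. The event shapes are illustrative; the real engine's JSON schema differs.

```typescript
// Illustrative event shapes for the two result types the engine emits.
type SpeechEvent =
  | { type: "partial"; text: string }
  | { type: "final"; text: string; language: string };

// The two downstream actions, injected so the routing stays testable.
interface Actions {
  showPreview(text: string): void;
  translate(text: string, sourceLang: string): void;
}

// Partials only update the live preview; finals replace the preview
// and go to translation, carrying the detected per-utterance language.
function handleSpeechEvent(ev: SpeechEvent, actions: Actions): void {
  if (ev.type === "partial") {
    actions.showPreview(ev.text);
  } else {
    actions.showPreview(ev.text);
    actions.translate(ev.text, ev.language);
  }
}
```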

Why we chose our translation approach

We optimized our translation layer for speed over feature richness. Many translation APIs offer advanced features like glossary support, batch translation, and custom models. We use a simpler, faster tier. Here is why.

Speed. Our translation layer consistently responds in under 200ms for our segment lengths (5-30 words). More advanced tiers average roughly twice that. That difference is significant when your total budget is under 2 seconds.

Cost. Translation APIs charge per character. Advanced features push costs higher. We translate a lot of text — a typical one-hour meeting with three active speakers generates 8,000-12,000 words.

Quality for short text. For short, conversational segments — which is exactly what meeting transcription produces — the basic and advanced translation tiers produce nearly identical output. We ran a 200-segment blind comparison across five languages. A bilingual reviewer scored each translation 1-5. The simpler tier scored within 0.2 points of the advanced tier. That difference did not justify the latency penalty.

The WebSocket architecture

Our server maintains three distinct WebSocket-based connections per active room.

Connection 1 is the media transport layer. This carries audio from each participant's browser to our server. Our real-time video infrastructure handles WebRTC complexity.

Connection 2 is the streaming speech-to-text connection. One WebSocket per active speaker, connecting our server to the speech engine. Audio frames from connection 1 get forwarded here.

Connection 3 is the data channel back to clients. Lightweight channels that push translated text back to every participant's browser. These ride on the existing media connection, so there is no additional overhead.

All three connections are persistent and bidirectional. We never open a connection per utterance or per translation request. This eliminates connection setup latency that would add 100-200ms per request with REST APIs.

We open one speech engine WebSocket per speaker rather than multiplexing. This lets language detection work independently per speaker and means if one speaker's connection has issues, it does not affect others.

Rendering without blocking the UI

The translation overlay is a separate DOM layer positioned over the video feeds. When new text arrives via data channel, we update only the text node content — no layout recalculation, no reflow.

We batch DOM updates using requestAnimationFrame. If multiple translations arrive within the same frame, they are applied in a single DOM write.

The entire render path adds 5-15ms. We measure this with the Performance API, and it rarely exceeds the 16ms budget for a 60fps frame.

Network jitter and reconnection

Our real-time video layer handles WebRTC jitter buffers and ICE restart automatically. If a few audio packets drop, the listener hears a brief glitch but the stream continues.

For the speech engine, if the WebSocket drops, we reconnect with exponential backoff (100ms, 200ms, 400ms, capped at 5 seconds). During the gap, audio frames buffer on our server — up to 10 seconds in a ring buffer. Once reconnected, we flush the buffer so no speech is lost.
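The backoff schedule and ring buffer can be sketched as follows. The capacity here is counted in frames rather than seconds, and the buffer type is simplified; treat it as an illustration, not the shipped code.

```typescript
// Backoff schedule from the text: 100ms, 200ms, 400ms, ... capped at 5s.
function reconnectDelay(attempt: number): number {
  return Math.min(100 * 2 ** attempt, 5000);
}

// Fixed-capacity ring buffer for audio frames during a reconnect gap.
// Once full, the oldest frame is dropped to make room (bounding the
// buffer at roughly 10 seconds of audio in production).
class FrameRingBuffer {
  private frames: Uint8Array[] = [];

  constructor(private capacity: number) {}

  push(frame: Uint8Array): void {
    if (this.frames.length === this.capacity) {
      this.frames.shift(); // drop the oldest frame
    }
    this.frames.push(frame);
  }

  // Drain everything to the freshly reconnected socket, oldest first.
  flush(send: (frame: Uint8Array) => void): void {
    for (const f of this.frames) send(f);
    this.frames = [];
  }
}
```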

For translation, requests are fire-and-forget with a 2-second timeout. If a translation fails, we display the original-language transcript with a pending indicator and retry once. The original text stays visible either way.
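A simplified version of the timeout-and-fallback logic. The translate function is a placeholder for the real API call, and the display logic is separated out so the fallback behavior is explicit.

```typescript
// Outcome of one translation attempt.
type Outcome = { ok: true; text: string } | { ok: false };

// Single attempt with a timeout (2 seconds in the text). The
// `translate` parameter stands in for the real translation API call.
async function translateWithTimeout(
  translate: (text: string) => Promise<string>,
  text: string,
  timeoutMs = 2000,
): Promise<Outcome> {
  const timeout = new Promise<Outcome>(resolve =>
    setTimeout(() => resolve({ ok: false }), timeoutMs));
  const attempt = translate(text)
    .then((t): Outcome => ({ ok: true, text: t }))
    .catch((): Outcome => ({ ok: false }));
  return Promise.race([attempt, timeout]);
}

// What the overlay shows for a segment: the translation on success,
// otherwise the original-language transcript with a pending flag.
// The original text stays visible either way.
interface DisplayState { text: string; pending: boolean }

function displayFor(original: string, first: Outcome, retry?: Outcome): DisplayState {
  if (first.ok) return { text: first.text, pending: false };
  if (retry && retry.ok) return { text: retry.text, pending: false };
  return { text: original, pending: true };
}
```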

The system never shows a blank screen or an error modal. In the worst case, participants still see and hear each other through WebRTC. Transcription and translation resume automatically when connectivity stabilizes.

Scaling

Each active room requires 2 + N WebSocket connections, where N is the number of active speakers. A 5-person room with 3 active speakers needs 5 connections.
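The connection math as a one-liner, matching the 2 + N formula above:

```typescript
// Connections per active room: the media transport and the data
// channel layer are shared, plus one speech-engine socket per
// active speaker.
function connectionsPerRoom(activeSpeakers: number): number {
  return 2 + activeSpeakers;
}
```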

A single server instance handles 50-80 concurrent rooms before event loop contention becomes the limiting factor. Our bottleneck is the speech engine connection pool — at high connection counts, we see increased tail latency.

Our scaling plan is horizontal: add server instances with room affinity behind a load balancer. Our real-time video infrastructure supports this natively. We have not needed it yet, but the architecture does not have shared state that would make horizontal scaling difficult.

What I would do differently

First, I would run an on-premise speech engine for high-volume accounts. The round-trip to the cloud adds a few dozen milliseconds that could be eliminated locally.

Second, I would implement speculative translation — translating partial results and replacing them when the final arrives. Participants would see a rough translation 300-500ms earlier. The tradeoff is visual instability.

Third, I would build a translation cache from day one. People repeat phrases constantly in meetings. Caching common segments would eliminate the Translate API call for roughly 15-20% of text. We are building this now.
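A minimal sketch of such a cache, keyed on source text and target language. This is an LRU built on Map insertion order, not our shipped implementation; the normalization and capacity are illustrative.

```typescript
// LRU translation cache keyed on (target language, normalized text).
// JavaScript Maps iterate in insertion order, so re-inserting on read
// keeps the first key as the least recently used entry.
class TranslationCache {
  private cache = new Map<string, string>();

  constructor(private maxEntries = 10000) {}

  private key(text: string, targetLang: string): string {
    return targetLang + "\u0000" + text.trim().toLowerCase();
  }

  get(text: string, targetLang: string): string | undefined {
    const k = this.key(text, targetLang);
    const hit = this.cache.get(k);
    if (hit !== undefined) {
      // Refresh recency by re-inserting at the end.
      this.cache.delete(k);
      this.cache.set(k, hit);
    }
    return hit;
  }

  set(text: string, targetLang: string, translation: string): void {
    const k = this.key(text, targetLang);
    if (this.cache.size >= this.maxEntries && !this.cache.has(k)) {
      // Evict the least recently used entry.
      const oldest = this.cache.keys().next().value;
      if (oldest !== undefined) this.cache.delete(oldest);
    }
    this.cache.set(k, translation);
  }
}
```

On a cache hit, the translation API call is skipped entirely, which is where the estimated 15-20% saving would come from.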

The 1.4-second reality

A Japanese speaker says a sentence, and 1.4 seconds later (median, measured across 10,000+ segments in production last month), every other participant reads an accurate translation.

That 1.4 seconds is split roughly between endpoint detection (the largest chunk), transport overhead, speech engine processing, translation, and delivery. Some segments are faster (short English phrases can land under a second). Some are slower (long compound Korean sentences can take over 2 seconds).

Is 1.4 seconds good enough? For meetings, yes. Conversational speech has natural pauses and turn-taking. The translation arrives well before the next speaker starts responding.

We measure, we log, we optimize the bottlenecks that show up in the data. Our job is to make sure everything around the speech engine is as thin as possible, so when the engine gets faster, users feel the difference immediately.

Try MangoFinch free

Real-time transcription and translation for multilingual teams. No credit card required.

Start a free meeting