Why we chose streaming over batch transcription for live meetings
Our evaluation of streaming vs batch speech-to-text for real-time multilingual meeting transcription — latency, accuracy, cost, and architecture tradeoffs.
When we started building MangoFinch, we had one question that would shape everything: streaming or batch speech-to-text?
The two paradigms look similar from the outside. Both take audio and produce text. But the architectural difference between them determines everything about latency, accuracy on live audio, and how well they handle multilingual meetings.
We chose streaming. Here is every reason why.
The architecture problem
Batch transcription processes audio in chunks. You feed it a complete audio file or a segment, it thinks, and it hands back a transcript. This works beautifully for podcasts, recorded interviews, and any scenario where the audio already exists.
Live meetings are not that scenario.
In a live meeting, audio arrives continuously. A speaker says something at 10:03:15, and the person reading the transcript in another language needs to see those words within a second or two. Not five seconds later.
A batch architecture means you have to buffer audio into chunks (typically 5-30 seconds), send each chunk for processing, wait for results, then stitch the chunks back together. Every chunk boundary creates a potential error. Words that straddle two chunks get split. Context from the previous chunk is not available to improve the next one.
Our streaming engine uses a persistent WebSocket connection. Audio bytes flow in continuously, transcript text flows back continuously. There is no chunking, no stitching, no boundary errors.
This is not a minor implementation detail. It is a different paradigm.
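The chunk-boundary problem is easy to see in a toy example. The sketch below uses invented word timings (not real transcription output) to show how fixed-size chunking splits a word that straddles a boundary:

```python
# Toy illustration: fixed-size chunking splits words that straddle
# chunk boundaries. Word timings below are invented for illustration.
CHUNK_SECONDS = 5.0

# (word, start_time, end_time) for one utterance
words = [
    ("the", 3.8, 4.0),
    ("quarterly", 4.1, 4.9),
    ("forecast", 4.9, 5.6),   # straddles the 5.0s chunk boundary
    ("looks", 5.7, 6.0),
]

def chunk_index(t: float) -> int:
    """Which fixed-size chunk a timestamp falls into."""
    return int(t // CHUNK_SECONDS)

# Words whose start and end land in different chunks get cut in half:
# the first chunk's model hears a fragment, the second chunk's model
# hears the rest with no context from the first.
split_words = [w for w, start, end in words
               if chunk_index(start) != chunk_index(end)]
print(split_words)  # -> ['forecast']
```

A persistent streaming connection has no such boundaries, so this class of error simply does not exist there.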
Latency: the number that matters most
We ran our own latency benchmarks across dozens of meeting recordings in six languages.
Batch-based engines (self-hosted on GPU): median latency per chunk was several seconds. Worst case on long Japanese segments exceeded 10 seconds.
Batch-based engines (cloud-hosted API): median around 2 seconds. Plus network round-trip.
Our streaming engine (persistent WebSocket): median latency under 300ms. Worst case across all recordings stayed under 600ms.
That is roughly a 10x difference on median latency. The streaming transcript appears while the speaker is still finishing their sentence. The batch transcript appears after the speaker has moved on.
For MangoFinch, where we chain transcription into real-time translation, every millisecond compounds. A multi-second transcription delay plus translation delay means the translated text falls behind the conversation. With streaming, the total pipeline stays under a second.
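As a rough latency budget: the transcription medians below come from the benchmarks above, while the translation figure is an assumed placeholder for illustration, not a measured number.

```python
# Rough end-to-end latency budget, in milliseconds.
# Transcription medians are from the benchmarks above; the
# translation latency is an assumed placeholder for illustration.
TRANSLATION_MS = 400           # assumed translation step

batch_transcription_ms = 2000  # cloud batch median (before network RTT)
stream_transcription_ms = 300  # streaming median

batch_pipeline = batch_transcription_ms + TRANSLATION_MS
stream_pipeline = stream_transcription_ms + TRANSLATION_MS

print(batch_pipeline)   # -> 2400: well over a second behind the speaker
print(stream_pipeline)  # -> 700: total pipeline stays under a second
```

Whatever the exact translation latency, the point is that it adds to the transcription delay rather than hiding it, so the transcription stage has to be fast for the whole pipeline to feel live.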
Accuracy: closer than you would expect
Batch engines have legitimately excellent accuracy, particularly on clean audio. On single-speaker recordings with studio conditions, they produce transcripts that are hard to improve on.
We tested word error rate across our recordings. On clean audio with a single English speaker, batch engines scored slightly better than streaming. But on multi-speaker audio with crosstalk, streaming pulled ahead. On heavily accented English from L2 speakers, streaming was noticeably better. And on mixed-language segments with code-switching, the gap widened significantly in favor of streaming.
In short: streaming handles messy, real-world meeting audio, accented speech, and code-switching better. Since MangoFinch exists for multilingual teams where code-switching is constant, this gap matters enormously.
The self-hosting trap
When we first evaluated batch engines, we were attracted to the self-hosting angle. Open-source models, run on your own infrastructure, no per-minute API costs. Sounds economical.
Then we did the math.
Running large speech models with acceptable latency requires GPUs. Not small ones — you need significant VRAM. On AWS, GPU instances run roughly $0.60-1.00/hour reserved.
That adds up to hundreds per month for a single instance that can process roughly 8 concurrent audio streams before latency degrades. During business hours with 50-100 simultaneous meetings, you need many instances. The monthly GPU bill alone reaches thousands, plus engineering time for load balancing, failure handling, model updates, and CUDA driver compatibility.
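A back-of-the-envelope version of that math, using the figures above. The hourly rate and peak concurrency are midpoints of the stated ranges, and the sketch assumes reserved instances running all month:

```python
import math

# Back-of-the-envelope GPU cost estimate using the figures above.
# Rate and concurrency are illustrative midpoints, not measurements.
GPU_HOURLY_USD = 0.80           # midpoint of the $0.60-1.00/hr range
STREAMS_PER_INSTANCE = 8        # before latency degrades
PEAK_CONCURRENT_MEETINGS = 75   # midpoint of 50-100 simultaneous meetings
HOURS_PER_MONTH = 730           # reserved instances run all month

# Instances needed to cover peak load without degrading latency.
instances = math.ceil(PEAK_CONCURRENT_MEETINGS / STREAMS_PER_INSTANCE)
monthly_gpu_usd = instances * GPU_HOURLY_USD * HOURS_PER_MONTH

print(instances)        # -> 10 instances at peak
print(monthly_gpu_usd)  # -> 5840.0 USD/month, before any engineering time
```

Even with generous assumptions, the GPU bill alone lands in the thousands per month, which is the point the paragraph above makes.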
We spent two weeks on a self-hosted prototype. The GPU memory management alone consumed three full days of debugging.
With a managed streaming API, costs scale linearly and predictably with usage. We are not at the volume where self-hosting becomes cheaper, and when we get there, the predictability still matters.
Multi-language support: built in versus bolted on
Many speech models support dozens of languages. On paper. In practice, you set the language parameter at the start of transcription, and the model expects the entire audio to be in that language. If you enable auto-detection, it guesses based on the first 30 seconds and commits.
This fails badly for multilingual meetings. A meeting that starts in English but shifts to Japanese at minute 12 will have the Japanese portion transcribed as garbled English.
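A toy simulation makes the failure mode concrete. The segment list and the two "detectors" below are stand-ins for real model behavior, not actual API calls:

```python
# Toy simulation: a transcriber that commits to one language garbles
# every segment after a mid-meeting language switch. The segment list
# and detector functions are stand-ins, not real model calls.
segments = ["en", "en", "en", "ja", "ja"]  # true language per segment

def detect_once(segs):
    """Guess from the opening audio, then commit for the whole meeting."""
    committed = segs[0]
    return [lang == committed for lang in segs]  # True = transcribed correctly

def detect_per_segment(segs):
    """Re-detect on every segment, as a streaming multi-language mode does."""
    return [True for _ in segs]

print(detect_once(segments))         # -> [True, True, True, False, False]
print(detect_per_segment(segments))  # -> [True, True, True, True, True]
```

Detect-once gets every segment after the switch wrong, which is exactly the garbled-Japanese-as-English failure described above.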
Our streaming engine's multi-language mode handles multiple languages within a single stream. We tested a recording with hundreds of language transitions. The streaming engine correctly identified and transcribed the vast majority of them. Batch approaches using chunk-and-detect handled roughly two-thirds.
What batch engines do better
I want to be honest about where batch wins.
Offline processing of clean audio. If you have a recorded podcast with studio-quality audio and one speaker, a batch model will produce a slightly more accurate transcript.
Complete control over the model. Self-hosting means you can fine-tune on your domain vocabulary, run it air-gapped, and guarantee audio never leaves your infrastructure.
Cost at extreme scale. If you are transcribing millions of hours per month, self-hosted models on reserved GPUs will eventually be cheaper per minute.
None of those advantages apply to our use case. MangoFinch needs real-time streaming in multilingual environments, from day one, without a dedicated ML infrastructure team.
The decision
We committed to a streaming architecture after a 6-week evaluation. Since then, we have processed over 14,000 minutes of meeting audio. The median transcription latency in production is under 300ms. Uptime during meeting hours has been excellent.
The total integration — from initial setup to production-ready streaming transcription — took days, not weeks. The equivalent self-hosted batch setup consumed two weeks and still had reliability issues under load.
If your use case is live multilingual speech-to-text, streaming is the way. If your use case is transcribing pre-recorded audio in a single language with maximum accuracy, batch engines are the right tool. Right architecture for the right job.
Try MangoFinch free
Real-time transcription and translation for multilingual teams. No credit card required.
Start a free meeting