Inside the Mimi Codec
Voice-to-voice models are fantastic. Reading about Kyutai's Moshi and the Mimi codec got me excited, and I wanted more intuition for how the codec actually works, so I put this demo together. It turned out neat, so I figured I'd share it.
Credit where it's due. Mimi was designed, trained, and released by Kyutai as part of Moshi. The technical claims below come from their paper and open-source code. Frisson Labs is not affiliated with Kyutai — I just built a browser UI around their open weights.
Here's what Mimi is doing under the hood. It takes a 24 kHz audio waveform and turns it into 32 parallel token streams, emitting one frame every 80 ms (a 12.5 Hz frame rate). The streams aren't interchangeable. The first one is trained (distilled from WavLM) to carry phonetic content: roughly, what is being said. The rest carry acoustic detail: timbre, texture, the sharp edges of consonants. The demo below lets you encode your own audio, switch streams on and off, and hear what each one is doing.
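A quick sanity check on those numbers (the 12.5 Hz frame rate and 2048-entry codebooks come from Kyutai's paper; the arithmetic below is just illustrative):

```ts
// Back-of-envelope rates implied by the numbers above.
const frameRateHz = 1 / 0.080;        // one frame per 80 ms -> 12.5 frames/s
const levels = 32;                    // token streams in this export
const bitsPerToken = Math.log2(2048); // 11 bits to index one codebook entry

console.log(frameRateHz * levels);                        // 400 tokens per second
console.log((frameRateHz * levels * bitsPerToken) / 1e3); // 4.4 kbps with all 32 levels
console.log((frameRateHz * 8 * bitsPerToken) / 1e3);      // 1.1 kbps at Moshi's 8 levels
```

That 1.1 kbps figure matches what the paper reports for Mimi as Moshi uses it.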
Load the Model
The Mimi encoder and decoder run locally via Transformers.js and ONNX Runtime. The weights download once, then everything runs in your browser.
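If you want to reproduce the pipeline outside this page, the load step looks roughly like this. A sketch, not a drop-in: `MimiModel` follows the Transformers.js API as I understand it, and the repo id is a stand-in for whichever Mimi ONNX export you actually point it at.

```ts
import { MimiModel, AutoFeatureExtractor } from '@huggingface/transformers';

// Stand-in repo id: substitute the Mimi ONNX export you actually use.
const repo = 'onnx-community/kyutai-mimi';
const model = await MimiModel.from_pretrained(repo); // downloads once, then cached
const featureExtractor = await AutoFeatureExtractor.from_pretrained(repo);
```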
Your Voice In
Record a few seconds of speech or upload a file. The audio is resampled to 24 kHz mono for the codec.
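Getting audio into that shape needs no extra library; the Web Audio API handles decoding, resampling, and downmixing in one pass (a sketch, with `toMono24k` just an illustrative name):

```ts
// Decode any recorded/uploaded audio to 24 kHz mono, the format Mimi expects.
async function toMono24k(file: Blob): Promise<Float32Array> {
  // decodeAudioData resamples to the context's sample rate for us.
  const ctx = new AudioContext({ sampleRate: 24000 });
  const buf = await ctx.decodeAudioData(await file.arrayBuffer());
  const mono = new Float32Array(buf.length);
  // Downmix by averaging channels.
  for (let ch = 0; ch < buf.numberOfChannels; ch++) {
    const data = buf.getChannelData(ch);
    for (let i = 0; i < buf.length; i++) mono[i] += data[i] / buf.numberOfChannels;
  }
  await ctx.close();
  return mono;
}
```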
Encode
This ONNX checkpoint compresses the waveform into a grid of 32 codebooks × T frames. Each cell is a token index (0–2047), one of 2048 learned entries per quantizer level. (Moshi itself uses only the first 8; this export happens to expose the full 32-level residual stack.)
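Continuing the loading sketch from above, encoding and inspecting the token grid looks roughly like this (assuming the JS port mirrors the Python `MimiModel.encode` API):

```ts
// `mono` is the 24 kHz Float32Array from the recorder or uploader.
const inputs = await featureExtractor(mono);
const { audio_codes } = await model.encode(inputs);

// Expected dims: [1, 32, T], i.e. 32 codebook rows by T frames (one per 80 ms),
// every entry an integer token index in 0..2047.
console.log(audio_codes.dims);
```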
Strip It Down
Toggle individual codebook levels on or off, then decode. In Mimi, level 0 is the stream most explicitly pushed toward semantic speech content; higher levels add progressively more acoustic detail. How few levels can you keep before intelligibility and voice quality start to fall apart?
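For the curious: the simplest version of this manipulation keeps a prefix of the stack, because an RVQ decoder dequantizes row i of the grid with codebook i and sums the results. Here's a sketch, assuming (as is true of the PyTorch MimiModel, and assumed for this export) that the decoder accepts fewer than 32 rows. Toggling arbitrary mid-stack levels instead means summing per-level embeddings inside the quantizer, which this sketch doesn't attempt.

```ts
import { Tensor } from '@huggingface/transformers';

// Keep only the first k RVQ levels of a [batch, 32, T] code grid.
function keepFirstLevels(codes: Tensor, k: number): Tensor {
  const [batch, levels, frames] = codes.dims;
  const src = codes.data as BigInt64Array; // int64 token indices (dtype may vary by export)
  const out = new BigInt64Array(batch * k * frames);
  for (let b = 0; b < batch; b++) {
    for (let q = 0; q < k; q++) {
      const row = (b * levels + q) * frames;
      out.set(src.subarray(row, row + frames), (b * k + q) * frames);
    }
  }
  return new Tensor('int64', out, [batch, k, frames]);
}

// Decode with just the semantic level plus one acoustic level
// (again assuming the JS port mirrors the Python decode API):
const { audio_values } = await model.decode({ audio_codes: keepFirstLevels(audio_codes, 2) });
```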
Presets
Codebook Levels
Click a level to start a range, then click another to select all between them. Or click individually to toggle.
What you just heard
If you played with the levels, you probably noticed the split. The first two levels are enough to roughly make out what's being said, but the voice is flat: no speaker identity, no mood, nothing you'd pick out of a lineup. Turn the next few levels back on and the voice comes back. By the time most of the stack is active, it sounds close to the original.
What's interesting isn't the compression. It's that the streams come apart the way they do. Level 0 leans toward content, the rest lean acoustic. Nobody hand-coded that split — it falls out of how the codec is trained. And once you have discrete tokens that are roughly disentangled like this, you can predict them one at a time the way an LLM predicts text. That's what Moshi is doing, and it's the reason realtime voice is tractable at all.
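To make that concrete, here's the shape of the idea as a toy loop. This is emphatically not Moshi's actual architecture (the paper describes an RQ-Transformer over the first 8 Mimi levels plus a text stream, with a delay on the acoustic tokens); the point is only that generation reduces to one small discrete prediction per 80 ms frame.

```ts
// Toy stand-in for a trained sequence model; random tokens just show the
// shape of the loop a real model would fill in.
function nextFrame(_history: number[][]): number[] {
  return Array.from({ length: 8 }, () => Math.floor(Math.random() * 2048));
}

const history: number[][] = []; // one 8-token column per 80 ms frame
const frames = 63;              // roughly 5 seconds at 12.5 frames/s
for (let t = 0; t < frames; t++) {
  history.push(nextFrame(history)); // one small discrete prediction per frame
}
// Stacking `history` into a [1, 8, T] grid and running Mimi's decoder would
// turn these tokens back into audio.
```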
Model (not ours): Kyutai's Mimi · Moshi paper · source code.
Interactive front-end (ours): built with Transformers.js running an ONNX export of the Mimi weights. Everything runs in your browser — no audio leaves your machine.