Grafting a Speech Head onto Gemma 4 E4B
For a Discord buddy, the tempting model shape is small, fast, and multimodal. It should hear the call, see the game, read the chat, and respond quickly enough that the moment is still alive. That is why Gemma 4 E4B is interesting: it is small enough to run locally, takes text, image, and audio as inputs, can analyze video as frames, and still behaves like a regular language model at the end.
The last part matters. Gemma 4 E4B can listen, but it does not speak natively. Its audio tower is an input encoder. The model turns audio into embeddings the text decoder can read, then predicts text tokens. To understand what a speech branch would have to add, first look at where the released model stops.
The useful mental model: image and audio do not become separate output streams. They get converted into the same hidden-space language as text, inserted into the prompt where special placeholder tokens sit, and then a 42-layer text decoder predicts the next text token.
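To make the placeholder idea concrete, here is a minimal, hypothetical sketch of the splice step, assuming PyTorch-style tensors. The function name, `placeholder_id`, and the shapes are illustrative, not Gemma's real identifiers.

```python
import torch

def splice_audio_embeddings(token_ids, text_embeds, audio_embeds, placeholder_id):
    """Replace audio-placeholder positions with audio-tower embeddings.

    token_ids:    (seq,)          prompt token IDs, including placeholder tokens
    text_embeds:  (seq, d_model)  embeddings looked up from the text embedding table
    audio_embeds: (n, d_model)    audio-tower output already projected to d_model
    """
    hidden = text_embeds.clone()
    positions = (token_ids == placeholder_id).nonzero(as_tuple=True)[0]
    assert positions.numel() == audio_embeds.shape[0], "one placeholder per audio embedding"
    hidden[positions] = audio_embeds  # image embeddings are spliced the same way
    return hidden  # the text decoder then runs over this mixed sequence as usual
```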
Pick an input path
Click a modality and follow what lights up. The chips are intentionally symbolic: they show where information flows without pretending we can inspect real hidden activations in a static blog post.
Why Gemma stops at text
Gemma 4 E4B's audio support is trained around speech recognition and speech-to-text translation. It can listen to an audio clip and write out text. It is not trained to produce Mimi-style audio tokens, and it does not include the codec decoder needed to turn those tokens into a waveform.
That is where models like Moshi point in a different direction. In the Mimi Codec post, the audio is represented as discrete token streams. Once speech lives as tokens, a model can generate speech directly in roughly the same way a language model generates text. Gemma 4 E4B is not built that way. It is closer to a multimodal text model that can hear than a native speech-to-speech model.
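For context on what "speech as tokens" means in practice, here is a rough Mimi round trip. The model id and method names are an assumption about the Hugging Face transformers Mimi interface, not the setup used later in this post.

```python
import torch
from transformers import AutoFeatureExtractor, MimiModel

# Assumed Hugging Face API: encode a waveform to discrete codebook IDs, then decode back.
mimi = MimiModel.from_pretrained("kyutai/mimi")
extractor = AutoFeatureExtractor.from_pretrained("kyutai/mimi")

waveform = torch.zeros(24000)  # one second of 24 kHz audio (silence as a stand-in)
inputs = extractor(raw_audio=waveform.numpy(),
                   sampling_rate=extractor.sampling_rate,
                   return_tensors="pt")

codes = mimi.encode(inputs["input_values"]).audio_codes  # (batch, n_codebooks, frames) integer IDs
audio = mimi.decode(codes).audio_values                  # back to a waveform tensor
```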
That leaves two paths. The practical one is to let Gemma write text, then send that text to a normal TTS engine. The architectural experiment is more direct: train a new audio head on Gemma's decoder hidden states, before those states are collapsed into text-token logits.
Making it talk
The smoke-test experiment is not ordinary text-to-speech bolted onto the end of a chat response. Text goes into Gemma, and a trainable audio head maps Gemma's hidden states to Kyutai Mimi codec tokens. A frozen Mimi decoder turns those tokens into a waveform. No text is passed to Mimi at inference; it sees only the predicted codec tokens.
The same head can also be trained on Gemma states produced from audio input. In that version, the spoken prompt lives inside the WAV, Gemma receives no text instruction, and the audio head still reads the final Gemma decoder states before predicting Mimi tokens.
The tap point is specific: a learned mix of the last six Gemma decoder layers, after the transformer stack and before the tied text-output head. Gemma stays frozen, Mimi stays frozen, and the only trained module is the Gemma-to-Mimi token head.
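A minimal sketch of what such a head could look like, assuming PyTorch. The hidden size, codebook counts, and module names are illustrative guesses rather than the repo's actual implementation, and the alignment of Gemma positions to Mimi frames is glossed over.

```python
import torch
import torch.nn as nn

class GemmaToMimiHead(nn.Module):
    """Illustrative Gemma-to-Mimi token head: the only trained module in the setup."""

    def __init__(self, d_model=2048, n_tapped_layers=6, n_codebooks=8, codebook_size=2048):
        super().__init__()
        # Learned softmax mix over the last six frozen Gemma decoder layers.
        self.layer_weights = nn.Parameter(torch.zeros(n_tapped_layers))
        self.proj = nn.Linear(d_model, d_model)
        # One parallel logit head per Mimi codebook.
        self.codebook_heads = nn.ModuleList(
            [nn.Linear(d_model, codebook_size) for _ in range(n_codebooks)]
        )

    def forward(self, tapped_states):
        # tapped_states: (n_tapped_layers, batch, seq, d_model), taken after the
        # transformer stack and before the tied text-output head.
        weights = torch.softmax(self.layer_weights, dim=0)
        mixed = torch.einsum("l,lbsd->bsd", weights, tapped_states)
        h = torch.tanh(self.proj(mixed))
        # (batch, seq, n_codebooks, codebook_size): logits over Mimi codebook IDs.
        return torch.stack([head(h) for head in self.codebook_heads], dim=2)
```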
The code required to replicate the experiment, along with a minimal reproduction dataset and generated samples, is available in the gemma4-audio GitHub repo.
Why this is different from TTS
Attaching a normal TTS engine to Gemma's text is useful and probably the easiest production path. This experiment does something narrower and more architectural: it asks whether Gemma's hidden states can be trained to predict speech-code tokens directly. The result is still rough, but the wiring is meaningfully different from a text-to-TTS pipeline.
The head reads Gemma's hidden states before the model turns them into vocabulary probabilities. The output branch is not downstream of decoded text.
The training target is Mimi audio tokens from teacher WAVs. The error is measured over predicted codebook IDs, not over text tokens (see the loss sketch below).
For the text samples, only text enters Gemma; for the audio sample, only audio enters Gemma. In both cases Mimi receives nothing but the predicted codec tokens and decodes them to waveform audio.
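Under those constraints, the objective is just a cross-entropy over codebook IDs. A minimal sketch, with illustrative shapes:

```python
import torch.nn.functional as F

def codec_token_loss(logits, teacher_codes):
    """Cross-entropy over Mimi codebook IDs; no text tokens are involved.

    logits:        (batch, frames, n_codebooks, codebook_size) from the audio head
    teacher_codes: (batch, frames, n_codebooks) Mimi IDs encoded from the teacher WAV
    """
    b, t, k, v = logits.shape
    return F.cross_entropy(logits.reshape(b * t * k, v), teacher_codes.reshape(b * t * k))
```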
Seven Gemma-to-Mimi samples
These clips are from the step-500 smoke checkpoint. Each input sentence is fed to Gemma; the audio head reads Gemma hidden states and predicts Mimi tokens; frozen Mimi decodes the result. This is an overfit architecture proof, not a polished TTS model. Whisper recovered at least two target words in five of the seven samples shown below.
For each clip, Gemma receives the prompt template Say this naturally as speech:\n{text}. The {text} field is the input sentence shown below. Gemma is not generating an open-ended chat answer first; this smoke test is rendering a provided sentence through the Gemma-to-Mimi head.
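Putting the pieces together, one clip is produced roughly like this. This is a hedged sketch assuming the head and Mimi interfaces above, with a hypothetical helper name (`render_clip`) and a simplified token-to-frame alignment.

```python
import torch

PROMPT = "Say this naturally as speech:\n{text}"

@torch.no_grad()
def render_clip(text, gemma, tokenizer, audio_head, mimi):
    ids = tokenizer(PROMPT.format(text=text), return_tensors="pt").input_ids
    out = gemma(ids, output_hidden_states=True)      # frozen Gemma forward pass
    taps = torch.stack(out.hidden_states[-6:])       # last six decoder layers
    logits = audio_head(taps)                        # (1, seq, n_codebooks, codebook_size)
    codes = logits.argmax(dim=-1).permute(0, 2, 1)   # (1, n_codebooks, frames)
    return mimi.decode(codes).audio_values           # Mimi sees only predicted codec tokens
```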
“the final layer generates audio codes for the demo.”
Whisper heard: The final area is getting a read of all the available for the development.
“the teacher model connects the final sample for the demo.”
Whisper heard: the teacher model to connect the instrument to the video.
“the training run maps short phrases from text prompts.”
Whisper heard: the training even when that short phrase is fast pause.
“the small adapter conditions clear speech through frozen layers.”
Whisper heard: The small adapter conditions clear the picture through frozen windows.
“the speech head generates the voice output from text prompts.”
Whisper heard: The speech head generates the voices now.
“the voice sample conditions the final sample for the demo.”
Whisper heard: of always simple conditions on and off.
“the teacher model matches the waveform from text prompts.”
Whisper heard: The feature model matches the waveform with the X-POPs.
Audio in, audio out
We also ran the stricter path where Gemma receives no text instruction at all. The input WAV itself carries the instruction and the sentence. Gemma processes that audio input, the same adapter reads Gemma's audio-conditioned hidden states, and frozen Mimi decodes the predicted codec tokens.
The checkpoint for this clip is step 850 from the audio-only continuation. This is still a narrow smoke test, but it shows that the branch can be driven by Gemma's audio-input path rather than only by text tokens.
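For reference, the audio-only inputs look roughly like this. The chat-template call is an assumption about a Hugging Face-style multimodal processor, and the WAV path is a placeholder, not the repo's exact code.

```python
def build_audio_only_inputs(processor, wav_path):
    """Build Gemma inputs whose user message contains only an audio object."""
    messages = [
        # No {"type": "text", ...} entry: the spoken instruction lives inside the WAV.
        {"role": "user", "content": [{"type": "audio", "path": wav_path}]},
    ]
    return processor.apply_chat_template(
        messages, add_generation_prompt=True, tokenize=True,
        return_dict=True, return_tensors="pt",
    )
```

The same audio head then reads the audio-conditioned hidden states from that forward pass.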
“Repeat the following sentence naturally as speech. The small adapter conditions clear speech through frozen layers.”
Text passed to Gemma: none. The Gemma message contains only the audio object.
Whisper heard: The small adapter conditions clear speeds through frozen mirrors.
What this experiment tests
The samples are rough, so this should be read as an architecture smoke test, not a finished voice model. The interesting research claim is the wiring: can Gemma's decoder state predict speech-code tokens directly, without handing decoded text to a separate TTS model?
The head does not read raw prompt embeddings. It reads the last Gemma decoder layers after Gemma has processed the prompt, before the normal text-output head scores vocabulary tokens.
Mimi is only the codec decoder here. The learned part is Gemma-owned: it turns Gemma hidden states into speech tokens.
A stronger version would need larger data, temporal modeling, and cleaner evaluation before it competes with real TTS.
What would make it real
The next version has to move beyond rendering provided phrases. A real speech-native assistant would need Gemma to generate an answer, expose the hidden states for those generated answer tokens, and train the audio head to speak that generated content reliably.
Run Gemma's normal autoregressive decoding loop first, then feed the hidden states for the generated answer into the audio head (sketched below).
Use a much broader text/audio set and hold out prompts that are not paraphrases of the training examples. The current samples are an overfit smoke test.
The fixed parallel codec head is enough for a wiring proof. A real system needs better duration control, streaming behavior, and stronger audio-token decoding.
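A hedged sketch of that generate-then-speak loop, assuming a Hugging Face-style generate API and the head interface sketched earlier; the helper name is hypothetical, and streaming, duration control, and batching are all ignored.

```python
import torch

@torch.no_grad()
def speak_generated_answer(prompt_ids, gemma, audio_head, mimi, max_new_tokens=128):
    out = gemma.generate(
        prompt_ids,
        max_new_tokens=max_new_tokens,
        output_hidden_states=True,
        return_dict_in_generate=True,
    )
    # out.hidden_states: one tuple per generated step, holding every decoder layer.
    # Keep the last six layers at the newest position of each step (the generated token).
    per_step = [
        torch.stack([layer[:, -1, :] for layer in step[-6:]])  # (6, batch, d_model)
        for step in out.hidden_states
    ]
    taps = torch.stack(per_step, dim=2)                         # (6, batch, new_tokens, d_model)
    codes = audio_head(taps).argmax(dim=-1).permute(0, 2, 1)    # (batch, n_codebooks, frames)
    return mimi.decode(codes).audio_values                      # speak the generated answer
```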
Why this matters for a game buddy
A Discord-native companion needs a fast brain that can fuse what people say, what is visible on screen, and what is happening in chat. Gemma 4 E4B is shaped for that kind of perception loop: image and audio can sit directly beside text in the prompt, and the decoder can answer from the combined context.
The production-simple version is still Gemma plus a low-latency voice layer: Gemma decides what to say and why it matters, while the voice layer decides how it sounds and how it handles interruption. The prototype above is the research branch. It asks whether Gemma's decoder state can become the source of the speech-token stream itself, which could eventually reduce service handoffs and keep more multimodal nuance available to the voice.
The model facts here come from Google's Gemma 4 E4B model card, the official E4B config, Google's audio understanding docs, our local MLX-VLM implementation inspection, and Kyutai's Moshi/Mimi paper for the speech-token setup. Video is shown as frame-based understanding, not a separate video-generation stream. The visual is a faithful schematic, not a tensor viewer: the chips and animation show flow and dimensionality, not real activation values. The Gemma-to-Mimi audio head described here is a smoke-test prototype, not a released Gemma capability or a production voice model.