diffusion-models

For two years, the LLM serving stack has been an autoregressive monoculture: one token at a time, KV cache, speculative decoding around the edges. Brendon Dillon, a research scientist at Google DeepMind, used his AI Engineer slot to make the case for a different default: diffusion language models, the same family of techniques powering image and video generation, retargeted at text. The pitch is not theoretical: Gemini Diffusion, released as a research demo last year, already pushes ~1,000 tokens/second on the same hardware where Flash-class autoregressive models top out around 200 tokens/second....