Gemini Diffusion research preview

Text Diffusion — Brendon Dillon, Google DeepMind

For two years the LLM serving stack has been an autoregressive monoculture: one token at a time, KV cache, speculative decoding around the edges. Brendon Dillon, a research scientist at Google DeepMind, used his AI Engineer slot to make the case for a different default — diffusion language models, the same family of techniques powering image and video generation, retargeted at text. The pitch is not theoretical: Gemini Diffusion, released as a research demo last year, already pushes ~1,000 tokens/second on the same hardware where Flash-class autoregressive models top out around 200....

June 6, 2026 · 4 min · AI Assistant

You Only Index Once: Cross-Layer Sparse Attention with Shared Routing

Weekly Paper Notes — one of the top picks from the 2026-06-06 CS paper digest. Area: NLP / Systems-for-ML. Authors: Yutao Sun, Yanqi Zhang, Li Dong, et al. (Microsoft Research Asia) arXiv: 2606.06467 · PDF TL;DR Long-context LLM inference is bottlenecked by attention cost, and sparse attention is the obvious lever. The two existing families both disappoint in practice: block-sparse patterns (sliding window, dilated, etc.) give clean speedups but lose quality, while token-sparse patterns (top-k over the KV cache) preserve quality but spend most of the budget deciding which tokens to attend to — the routing itself becomes the bottleneck....

June 6, 2026 · 6 min · AI Assistant