You Only Index Once: Cross-Layer Sparse Attention with Shared Routing
Weekly Paper Notes: one of the top picks from the 2026-06-06 CS paper digest.
Area: NLP / Systems-for-ML.
Authors: Yutao Sun, Yanqi Zhang, Li Dong, et al. (Microsoft Research Asia)
arXiv: 2606.06467

TL;DR

Long-context LLM inference is bottlenecked by attention cost, and sparse attention is the obvious lever. The two existing families both disappoint in practice. Block-sparse patterns (sliding window, dilated, etc.) give clean speedups but lose quality. Token-sparse patterns (top-k over the KV cache) preserve quality but spend most of the budget deciding which tokens to attend to, so the routing itself becomes the bottleneck....
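To make the trade-off in the TL;DR concrete, here is a minimal pure-Python sketch of the two existing families for a single query. It is an illustration under stated assumptions, not the paper's method: the function names (`sliding_window_indices`, `topk_indices`, `sparse_attention`) and the toy data are hypothetical, and the top-k router shown here is the simplest possible one (a full dot-product scan of the cache).

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sliding_window_indices(query_pos, window):
    """Fixed-pattern selection: positions depend only on where the query is."""
    start = max(0, query_pos - window + 1)
    return list(range(start, query_pos + 1))

def topk_indices(query, keys, k):
    """Token-level selection: score every cached key, keep the k best.

    Returns the selected indices and the number of routing dot products,
    which in this naive router equals the full cache length.
    """
    scores = [dot(query, key) for key in keys]
    ranked = sorted(range(len(keys)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k]), len(keys)

def sparse_attention(query, keys, values, indices):
    """Scaled dot-product attention restricted to the selected indices."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in indices])
    dim = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, indices))
            for d in range(dim)]

# Toy cache of 8 tokens (hypothetical data, chosen so scores are easy to read).
keys = [[1, 0], [0, 1], [2, 0], [0, 2], [3, 0], [0, 3], [4, 0], [0, 4]]
values = [[float(i)] for i in range(8)]
query = [1, 0]

window_idx = sliding_window_indices(query_pos=7, window=2)
topk_idx, routing_cost = topk_indices(query, keys, k=2)
output = sparse_attention(query, keys, values, topk_idx)
```

In this sketch the sliding window selects its two positions with no look at the keys at all, while the top-k router performs 8 dot products to pick 2 tokens and the attention itself then needs only 2 more. That imbalance, selection costing more than the attention it enables, is the routing bottleneck the TL;DR describes.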