weekly-papers-2026-05-30

Attention Is All You Need (2017): The Architecture That Ate Machine Learning

Weekly Paper Notes: Seminal Paper of the Week for May 24–30, 2026. After a multi-week streak of systems classics (Raft, MapReduce, Lamport, ARIES), this week rotates to AI / ML.
Authors: Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, Illia Polosukhin (Google Brain / Google Research / University of Toronto)
Venue: NeurIPS 2017
arXiv: 1706.03762 · PDF

Why this paper
Picking Attention Is All You Need as a Seminal Paper of the Week in 2026 feels almost too on-the-nose: the Transformer is the architectural substrate underneath every frontier LLM, every modern diffusion model, every state-of-the-art protein folding system, every reasoning model whose chain-of-thought you have ever read....

On Language Generation in the Limit with Bounded Memory

Weekly Paper Notes: one of the top picks from the May 24–30, 2026 CS paper digest.
Area: NLP / Theory
Authors: Jon Kleinberg, Anay Mehrotra, Amin Saberi (Cornell / Yale / Stanford)
arXiv: 2605.30324 · PDF

TL;DR
A line of theoretical work asks: given examples from an unknown target language that belongs to a known countable collection of languages, can a learner eventually output only new valid strings from that language? Prior results, including Kleinberg & Mullainathan’s 2024 paper that triggered the modern wave, assume the learner remembers the entire example history....

Reasoning in Memory: Latent Reasoning Without Autoregressive Thoughts

Weekly Paper Notes: one of the top picks from the May 24–30, 2026 CS paper digest.
Area: AI / ML
Authors: Lukas Aichberger, Sepp Hochreiter (JKU Linz / NXAI)
arXiv: 2605.30343 · PDF

TL;DR
Modern reasoning LLMs scale test-time compute by emitting long chains of thought, but every “thought token” is forced to round-trip through the autoregressive decoder, conflating internal computation with external communication. Reasoning in Memory (RiM) instead inserts blocks of fixed special tokens that act as scratch space for the model’s working memory....