Weekly Paper Notes — one of the top picks from the May 24–30, 2026 CS paper digest. Area: AI / ML.

Authors: Lukas Aichberger, Sepp Hochreiter (JKU Linz / NXAI)
arXiv: 2605.30343

TL;DR

Modern reasoning LLMs scale test-time compute by emitting long chains of thought, but every “thought token” must round-trip through the autoregressive decoder, which conflates internal computation with external communication. Reasoning in Memory (RiM) instead inserts blocks of fixed special tokens that act as scratch space for the model’s working memory. Because these tokens are not generated, the model processes a whole memory block in one forward pass, adding latent reasoning capacity without paying the per-token decoding cost. A two-stage curriculum teaches the model to actually use the slots: it first grounds the blocks with explicit step supervision, then drops that supervision and refines only the final answer. According to the authors, RiM matches or beats prior latent-reasoning methods across model families and sizes.

Why this matters

There is a growing consensus that the chain-of-thought pattern is doing two things at once: (1) giving the network extra serial compute steps it would not otherwise have, and (2) producing a human-readable explanation. Those two goals pull in opposite directions: explanations want to be short and clean, while extra compute wants to be wide and possibly noisy. Latent-reasoning approaches (Coconut, Quiet-STaR, looped transformers, etc.) try to give the model scratch space without paying the decoder’s bandwidth cost. RiM’s twist is to make the scratch space a fixed token sequence rather than a generated one, so the entire block is processed in a single parallel forward pass. That is a meaningful efficiency change: instead of O(N_thought) sequential autoregressive steps, each block costs O(1) sequential steps. The saving is in decoding latency rather than total arithmetic, since the block’s tokens still have to be computed and attended over.
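The difference in sequential cost can be illustrated with a toy counter. This is a minimal sketch, not the authors' implementation: `CountingModel`, the `<mem>` token, and both decode functions are hypothetical names, and the stub model only counts forward passes instead of computing anything. It assumes a single memory block appended to the prompt and a KV cache, so that each call to `forward` corresponds to one sequential decoder step.

```python
from typing import List

MEM = "<mem>"  # hypothetical fixed memory token; the real token is not named in these notes


class CountingModel:
    """Stub decoder that only counts sequential forward passes."""

    def __init__(self) -> None:
        self.passes = 0

    def forward(self, tokens: List[str]) -> str:
        # One call = one sequential step; all already-known tokens
        # are assumed to be processed in parallel within that step.
        self.passes += 1
        return "<tok>"  # placeholder for the next generated token


def decode_with_cot(model: CountingModel, prompt: List[str],
                    n_thought: int, n_answer: int) -> List[str]:
    """Generated reasoning: every thought and answer token needs its own pass."""
    seq = list(prompt)
    for _ in range(n_thought + n_answer):
        seq.append(model.forward(seq))
    return seq


def decode_with_memory_block(model: CountingModel, prompt: List[str],
                             block_size: int, n_answer: int) -> List[str]:
    """Fixed reasoning block: known in advance, so it is absorbed into the
    first pass together with the prompt; only answer tokens are generated."""
    seq = list(prompt) + [MEM] * block_size
    for _ in range(n_answer):
        seq.append(model.forward(seq))
    return seq


prompt = ["q"] * 10
cot_model, rim_model = CountingModel(), CountingModel()
decode_with_cot(cot_model, prompt, n_thought=64, n_answer=4)
decode_with_memory_block(rim_model, prompt, block_size=64, n_answer=4)
print(cot_model.passes, rim_model.passes)  # 68 4
```

Both runs place 64 reasoning positions in the context, but only the generated variant pays for them one sequential step at a time.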

The curriculum is the other half of the contribution. Without grounding, fixed memory tokens are unlikely to acquire interpretable structure; with the ground-then-refine schedule, the blocks first learn to predict explicit reasoning steps, then learn to drop that crutch while still helping the final answer.
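One way to picture the two stages is as a change in which positions contribute to the training loss. The sketch below is an assumption about how such a schedule could be expressed, not the paper's training code: the segment labels and the `loss_mask` helper are hypothetical, and the actual form of the step supervision (including whether the answer is also supervised in the first stage) is not described in these notes.

```python
from typing import List

# Hypothetical per-position segment labels for one training sequence.
PROMPT, MEMORY, ANSWER = "prompt", "memory", "answer"


def loss_mask(segments: List[str], stage: int) -> List[int]:
    """Return 1 for positions that contribute to the loss, 0 otherwise.

    Stage 1 (grounding): memory-block and answer positions are supervised
    (supervising the answer here is an assumption of this sketch).
    Stage 2 (refinement): only answer positions are supervised.
    """
    if stage == 1:
        supervised = {MEMORY, ANSWER}
    elif stage == 2:
        supervised = {ANSWER}
    else:
        raise ValueError("stage must be 1 or 2")
    return [1 if seg in supervised else 0 for seg in segments]


segments = [PROMPT] * 3 + [MEMORY] * 4 + [ANSWER] * 2
print(loss_mask(segments, stage=1))  # [0, 0, 0, 1, 1, 1, 1, 1, 1]
print(loss_mask(segments, stage=2))  # [0, 0, 0, 0, 0, 0, 0, 1, 1]
```

In the second stage the memory positions still sit in the context and receive gradients through attention; they simply no longer have targets of their own.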

Author signal

Sepp Hochreiter is the co-inventor of LSTM and a long-standing advocate that memory-augmented recurrent computation is the right abstraction for reasoning. RiM is squarely in that lineage — and lands at exactly the moment when the “reasoning model” wave has made latent-reasoning research commercially relevant.

Where it fits in the landscape

  • Coconut kept the autoregressive structure but reasoned in continuous embeddings.
  • Quiet-STaR generated a short “thought” after every token.
  • RiM drops the generation requirement entirely for the reasoning portion — fixed tokens, single forward pass per block, supervised first then weaned.

If the result holds up under independent evaluation, this becomes one of the cleaner formulations of “compute-efficient latent reasoning” — and a natural building block for the next generation of test-time-compute models.
