Pretraining Recurrent Networks without Recurrence

Weekly Paper Notes — one of the top picks from the 2026-06-06 CS paper digest. Area: AI / ML.

Authors: Akarsh Kumar, Phillip Isola (MIT) arXiv: 2606.06479 · PDF

TL;DR

This paper proposes Supervised Memory Training (SMT), a way to pretrain nonlinear RNNs without ever doing backpropagation through time (BPTT). The trick: replace recurrent credit assignment with a supervised problem over memory transitions. A Transformer-based “memory encoder” is first trained with a predictive-state objective — it learns a representation m_t that retains exactly the information about the past needed to predict the future. Those m_t labels are then used as supervision targets for the RNN, so the RNN only ever has to solve a one-step problem: given (m_t, x_{t+1}), predict m_{t+1}. The result is RNN pretraining that is time-parallel, has an O(1) gradient path between any two tokens, and — empirically — outperforms BPTT on language modeling and pixel-sequence modeling. If it scales, it removes one of the long-standing reasons we stopped pretraining recurrent architectures.

What problem is the paper actually attacking?

Pretraining nonlinear RNNs is hard for a structural reason: BPTT requires unrolling the sequence and chaining Jacobians across time, so (a) the forward/backward pass is sequential in T, killing parallelism on modern accelerators, and (b) the gradient signal decays or explodes geometrically, so long-range associations are nearly impossible to learn. The field’s response has been to drop nonlinear recurrence — that’s most of what state-space models, linear RNNs, and gated linear attention variants buy you: a linear recurrence with a closed-form prefix scan, which restores parallel training and a stable gradient path. The cost is expressive: you lose the freedom to put arbitrary nonlinear dynamics inside the recurrence.

The authors of SMT ask the obvious dual question: can we keep nonlinear recurrence and still get parallel training plus a stable gradient path? Their observation is that BPTT is overspecified for the actual learning problem. We don’t need gradients to flow across T time steps — we just need the RNN’s memory at every step to be a good summary of the past. If we can exogenously tell the RNN what its memory should be at every step, training reduces to a one-step regression and parallelizes trivially.

The mechanism: Supervised Memory Training

SMT has two stages.

Stage 1: train a memory encoder with a predictive-state objective. A Transformer (with full attention over the prefix) is trained so that its representation of the prefix at position t, call it m_t, is a sufficient statistic for predicting tokens x_{t+1}, x_{t+2}, …. Concretely, the encoder must compress the past into a fixed-size vector that can decode the future. This is a clean idea borrowed from predictive-state representations in classical RL: the future defines what the past summary needs to contain.

Stage 2: fit a nonlinear RNN to the memory transitions. The encoder is now frozen and used as an oracle. For every position t it produces the “right” memory m_t. The RNN is then trained as a one-step supervised model: given input (m_t, x_{t+1}), predict m_{t+1}. That’s it. There is no recurrence in the loss — each (m_t, x_{t+1}) → m_{t+1} example is independent, so all positions in all sequences can be trained in parallel. The RNN’s actual dynamics are still arbitrarily nonlinear; what’s been removed is the training-time sequentiality.

The “O(1) gradient path” claim follows from the structure: every supervised pair has exactly one nonlinear step between input and target, so the gradient never gets multiplied through T Jacobians. The Transformer encoder does the heavy lifting of credit assignment, but it does so via attention — which has its own O(1) path between any two tokens.

Why training/inference stays fast

The decoupling is the whole story. Training is now an embarrassingly parallel supervised problem; the per-example compute is small (one RNN step + one encoder forward to get the target), and the wall-clock scales as O(T/P) on P devices rather than O(T). The encoder is needed only during pretraining — at inference time you discard it and run the RNN recurrently in the usual O(1)-per-step fashion, with no attention cache.

There is a budget question the paper has to answer: the encoder is itself a Transformer, so you’ve added a Transformer’s training cost to the pipeline. The authors argue (and show experimentally) that the encoder need not be huge, since its job is constrained — produce a summary that decodes the next-few tokens, not generate text — and that once trained it can label memory for arbitrarily many downstream RNN training sequences. So the encoder cost amortizes across the RNN training run.

Results

The empirical claim is that SMT-trained nonlinear RNNs outperform BPTT-trained nonlinear RNNs of the same architecture on language modeling and pixel-sequence modeling. Critically, the comparison is RNN-to-RNN (not RNN-to-Transformer), so the result is about the training method, not the architecture. The gap widens with sequence length, which is the expected signature of an O(1) gradient path beating a O(T) one — BPTT’s long-range signal is what suffers first. They also report that SMT lets nonlinear RNNs capture long-range dependencies that BPTT-trained versions of the same network simply cannot.

The paper does not (yet) claim parity with frontier Transformers at scale. It is a method paper, and the headline is the training-procedure win, not a new SOTA.

Why this matters

If SMT generalizes, it weakens the main reason nonlinear recurrence fell out of fashion. The pitch for state-space models was always “you get nearly-RNN expressivity and parallel training” — SMT offers full RNN expressivity and parallel training, at the cost of an auxiliary encoder during pretraining. That’s a different point on the trade-off curve, and it reopens a design space that has been ignored for several years. The broader idea — use a stronger model to label the right intermediate targets for a weaker but cheaper model — is also a generalizable recipe; you can read SMT as a structured form of distillation where the teacher provides supervision at every hidden state rather than only at the output.

Read alongside

Mamba and S4 (Gu et al., 2021–2023) — the linear-recurrence competitors SMT is implicitly arguing against.
Predictive State Representations (Littman, Sutton, Singh, 2001) — the source of the “future defines the past summary” objective.
Real-Time Recurrent Learning (Williams & Zipser, 1989) — the other classical alternative to BPTT.
Distillation literature (Hinton et al., 2015) — the conceptual frame for “use a big model to supervise a small one.”

Links

📄 arXiv abstract · 📄 PDF

Part of the Weekly CS Paper Digest series. Summary written from a close read of the preprint abstract; the architectural and lineage commentary is the author’s synthesis.

TL;DR#

What problem is the paper actually attacking?#

The mechanism: Supervised Memory Training#

Why training/inference stays fast#

Results#

Why this matters#

Read alongside#

Links#