Transformer vs Post-Transformer: A Heavyweight Debate

Weekly Video Notes — a short article distilling one talk from the weekly digest. Source video and key frames are embedded throughout.

Pathway staged something unusual: a panel debate, framed as a literal boxing match, on whether the transformer is the final architecture of the AI era — or whether we are already living through the dawn of a post-transformer one. In the blue corner, defending the belt: Łukasz Kaiser, co-author of Attention Is All You Need and one of the minds behind GPT-4 and o-series reasoning models at OpenAI. In the purple corner, three challengers: Adrian Kosowski (CSO of Pathway, co-inventor of the Dragon Hatchling / BDH architecture), Matthias Lechner (CTO of Liquid AI, co-inventor of Liquid Neural Networks), and — switching corners — Llion Jones, another Attention Is All You Need author, now CTO and co-founder of Sakana AI.

The moderators, Susanna Stanisławska (Pathway, co-author of the BDH paper) and Dexter Hoddy (Human Layer), ran the contenders through seven rounds: opening statements, three-minute rebuttals, three timed “quick-punch” rounds on the nature of intelligence, scalability, and real-world deployments, audience Q&A, and closing remarks.

The four contenders take the stage.

Round 0 — Why this debate at all?

The framing matters. “Post-transformer” here is not the claim that attention is bad. As Kosowski put it: it is the claim that we now have at least two examples of intelligence in the universe — humans and transformers — and the interesting scientific question is whether there is a deeper theme of intelligence that both are instances of, which we could then implement directly.

Kaiser, defending the champion, made a parallel admission upfront. He worked with Geoff Hinton at a time when Hinton hated backpropagation; Kaiser himself spent years thinking transformers were a temporary hack. “But it works. It works, and that’s the very key thing about transformers.” He had recently spent a week re-implementing RNNs for comparison: on modern Nvidia hardware, a small GRU ran about 50× slower in wall-clock time than a much bigger transformer, even though it used fewer FLOPs. That asymmetry, he argued, is not a small detail; it is structural.

Round 1 — Opening statements

Kaiser opens the case for the transformer — “an insanely simple machine that predicts the next token.”

Kaiser (transformer). His framing: the transformer is best understood as an RNN with a very simple memory. Every new token writes a key, together with the value it might later need, into a growing store; a query is compared against the stored keys and gets back the values of the most similar ones, weighted by a softmax so that the lookup stays differentiable. “It just remembers everything.” Forgetting, chain-of-thought reasoning, mixture-of-experts capacity: all of those are additions on top. The base object is simple, beautiful, and works. Most importantly, “many other systems don’t quite do that.”
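To make the "RNN with a very simple memory" picture concrete, here is a minimal sketch of a single attention head viewed as an append-only key-value store. It is an illustration under simplifying assumptions (one head, no learned projections, plain Python lists), and the class name `KVMemory` is hypothetical rather than anything shown in the talk.

```python
import math


class KVMemory:
    """Toy single-head attention memory: an append-only store of (key, value) pairs."""

    def __init__(self):
        self.keys = []
        self.values = []

    def write(self, key, value):
        # Nothing is overwritten or forgotten: the store grows by one entry per token.
        self.keys.append(list(key))
        self.values.append(list(value))

    def read(self, query):
        # Scaled dot-product similarity between the query and every stored key.
        scale = math.sqrt(len(query))
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in self.keys]
        # Softmax turns the scores into weights: a soft, differentiable
        # version of "pick the most similar key".
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # The result is the weighted average of the stored values.
        dim = len(self.values[0])
        return [sum(w * v[i] for w, v in zip(weights, self.values)) for i in range(dim)]


memory = KVMemory()
memory.write([10.0, 0.0], [1.0, 0.0])   # token 1: key and the value it may later need
memory.write([0.0, 10.0], [0.0, 1.0])   # token 2
out = memory.read([10.0, 0.0])          # query closely matching the first key
```

With a query this close to the first key, almost all of the softmax weight lands on the first entry, so `out` is very nearly the first stored value. The cost of "remembering everything" is also visible: `keys` and `values` grow with every token written.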

Kosowski (post-transformer). Intelligence, he argued, is the ability to solve difficult problems — especially problems you have not seen before. Until recently, humans were the only known intelligent species. Today, millions of hard problems are being solved by two intelligent species: humans and transformers. The interesting question is not “is the transformer good?” — it clearly is — but “what is the theme behind intelligence?” The 1990s analogy: information indexing was a hard problem; the breakthrough was an equation (PageRank) plus an implementation (MapReduce), and it built the largest software company in the world. We have not yet had a PageRank moment for intelligence itself. Kosowski’s bet on what that moment looks like: latent reasoning in high-dimensional spaces, combining the advantages of state-space models with sequence processing — the family his team’s BDH architecture lives in.

Kosowski pulls the PageRank analogy — searching for the “leitmotif” of intelligence beyond the transformer.

Lechner (Liquid AI, “both/and”). Lechner refused the framing. Liquid AI designs models with hardware and use-cases in mind, drawing from any building block that works. Their language model with roughly GPT-3-level capability runs on a Raspberry Pi at about 40 tokens/second — neither a pure transformer nor a pure recurrent model. New attention variants from DeepSeek, new state-space models from elsewhere — Lechner takes them all. The world is dynamic on both axes (capability requirements and hardware), so the right strategy is to keep the architectural toolbox as wide as possible.

Jones (Sakana, post-transformer). Jones opened with a concession: if he were in Kaiser’s shoes at OpenAI, he might be on Kaiser’s side too. Frontier labs have an economic reason to double down on what already works. Startups — and academia — should be the ones looking for what’s next, just as OpenAI itself once did when it discovered transformers scaled better than anyone realised. His core intuition: transformers are an elegant brute force. A human does not need to read the entire internet. Our brains are a proof-of-concept that something more data-efficient exists. Reasoning, as currently bolted on via chain-of-thought, feels like a hack — “if the transformer were really as powerful as we think, shouldn’t it be able to learn to reason natively?”

Round 2 — Rebuttals

Jones rebutting — “we’re stuck in a very weird local minimum, and the transformer’s success may be what’s keeping us there.”

The rebuttals tightened the disagreement to two cruxes.

Crux 1: Is recurrence actually the limit, or is the hardware? Kaiser conceded the brain is more parallel and faster than current hardware. But, he countered, had transformers never happened, we might have built hardware that does loops well. The choice of dense matmul as the substrate is partly a historical contingency. Still, parallel hardware is inherently easier to build than fast sequential hardware — so the bias is real, not just cultural.

Kosowski sharpened the post-transformer position: “We are not the RNN club; we are the post-transformer club.” The claim is not that vanilla RNNs win; it’s that there exist architectures that exploit matmuls fully and propagate informative state — and that this combination unlocks latent reasoning without requiring the model to externalise every intermediate thought as language tokens.

Crux 2: Where does reasoning live? Kosowski distinguished cleanly between reasoning and learning. During reasoning you want long unrolls — chain-of-thought or recurrent latent state — that compute over many steps. During learning, propagating gradients back through all of those steps is a mess, and that is exactly the compromise the transformer was designed around. Post-transformers, he argued, can shift that compromise: keep trainability, but let the model think in latent thoughts instead of being forced to think in words.

Jones added a sharp point about misreading the original transformer paper. People see it as “they rearranged standard components and got lucky.” The actual breakthrough was hardware-shaped design: the transformer let us process tokens thousands of times faster and scale farther. That one-time optimisation, he argued, is no longer available — and the field is now mistaking the consequences of that hardware fit for the essence of intelligence.

Lechner pushed on the convergence: a transformer with a very compressed KV cache and an RNN with a very large state are, in the limit, almost the same object. The line between them is already blurry; the future is probably not a clean win for either extreme.
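One well-known special case makes this blurriness exact: attention without the softmax (often called linear attention) can be computed either from a cache of all past keys and values or from a fixed-size recurrent state, with identical results. The sketch below uses that swapped-in simplification, not softmax attention and not any panelist's architecture; the function names are hypothetical.

```python
def read_from_cache(keys, values, query):
    # Cache view: keep every (key, value) pair and weight each value by query . key.
    dim = len(values[0])
    out = [0.0] * dim
    for key, value in zip(keys, values):
        weight = sum(q * k for q, k in zip(query, key))
        for j in range(dim):
            out[j] += weight * value[j]
    return out


def update_state(state, key, value):
    # Recurrent view: fold the pair into a fixed-size matrix, state += outer(key, value).
    for i, k in enumerate(key):
        for j, v in enumerate(value):
            state[i][j] += k * v
    return state


def read_from_state(state, query):
    # Output is the row vector query^T multiplied by the state matrix.
    cols = len(state[0])
    return [sum(query[i] * state[i][j] for i in range(len(query))) for j in range(cols)]


keys = [[1.0, 0.0], [0.5, 0.5], [0.0, 2.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
query = [1.0, 1.0]

state = [[0.0, 0.0], [0.0, 0.0]]
for key, value in zip(keys, values):
    update_state(state, key, value)

cache_out = read_from_cache(keys, values, query)   # memory grows with sequence length
state_out = read_from_state(state, query)          # memory is a constant 2x2 matrix
```

Both reads return the same vector, but the cache grows with every token while the state stays the same size. With the softmax restored the equivalence is no longer exact, which is where the trade-off between a compressed cache and a large state begins.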

Round 3 — Quick punches

The nature of intelligence

Three definitions surfaced, and the panel agreed they were not equivalent.

Kaiser (engineer’s view): intelligence is what you observe the system do — the ability to act in the world and get desired outcomes. Don’t try to define it via internal substrate; you’ll never finish.
Kosowski (process view): intelligence is not a product but a process. Just as Toyota’s cars are products while the Toyota process is what makes them, intelligence is a computational process: algorithmic, dynamical, discoverable. PageRank is a process; the brain runs a few such processes; transformers embody one but don’t explain it.
Jones (compression view): intelligence is compression. The better you compress the internet, the more intelligent you are. This implicitly justifies pre-training perplexity as the most honest metric we currently have.

On language vs. reasoning: Kaiser pushed back on the idea that transformers are inherently linguistic. They are sequence models — proteins, images, audio all fit. Jones countered that there is a reasoning gap “not grounded in language” — citing Wittgenstein’s famous quip about the lion. Kaiser fired back with a concrete data point: an Erdős problem open for 60 years was solved by GPT-5.5 a week before the panel — solved in words. Maybe shorter latent chains are possible, but for now the verbose path works.

Scalability and the bitter lesson

Quick-punch round 2: the scaling-laws round. The “bitter lesson” returns.

Jones invoked Rich Sutton’s bitter lesson reluctantly: yes, scale wins; any post-transformer will have to demonstrate competitive — or better — scaling. Kaiser was firmer. Different architectures have different slopes on the scaling curve, and that is the real claim a post-transformer has to make: “Show me a curve that bends down steeper than the transformer’s, and I will concede.” He has not yet seen that curve.

Lechner added that in his work training models across two orders of magnitude of scale, every architecture shows clean scaling laws — the slope and constants differ, but the phenomenon is universal. Kosowski’s hopeful nuance: in some regimes (small-data sciences, enterprise data), the dimensions of scaling decouple — you can scale compute and architectural sophistication without scaling data, the way a chess prodigy does.
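The "slope and constants" language can be made concrete with a small fit. The sketch below assumes a pure power law, loss = a · compute^(−b), and fits it by least squares in log-log space. The numbers are synthetic and invented purely for illustration (they are not measurements of any model), and `fit_power_law` is a hypothetical helper.

```python
import math


def fit_power_law(compute, loss):
    """Least-squares fit of loss = a * compute**(-b) in log-log space; returns (a, b)."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(l) for l in loss]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope


# Synthetic, made-up curves: the challenger has a worse constant but a steeper slope.
compute = [1e18, 1e19, 1e20, 1e21]
baseline = [10.0 * c ** -0.05 for c in compute]
challenger = [25.0 * c ** -0.07 for c in compute]

a_base, b_base = fit_power_law(compute, baseline)
a_chal, b_chal = fit_power_law(compute, challenger)

# Compute budget at which the two fitted curves cross:
# a_base * C**(-b_base) == a_chal * C**(-b_chal)
crossover = (a_chal / a_base) ** (1.0 / (b_chal - b_base))
```

In this toy setup the challenger loses at the smallest budget and wins at the largest, with the crossover between 1e19 and 1e20. A steeper slope only pays off past the crossover point, which is why the constants matter as much as the exponent when judging an early result.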

Jones flagged the failure mode to watch for: an architecture that is more data-efficient at small scale but does not actually beat the transformer when scaled up. That would be sad but is possible.

Real-world deployments and benchmarks

Lechner pulled the conversation out of language. For protein sequences, genetic data, biomedical signals, recurrent architectures already show strong scaling — partly because of inherent structure, partly because the deployment constraints (latency, edge hardware) reward small recurrent state.

On benchmarks, Kaiser told a beautiful historical anecdote. When the original transformer paper was written, the standard machine-translation metric was BLEU: fiddly, gamed, with brittle scripts. Noam Shazeer suggested dropping BLEU and tracking perplexity instead, a measure of how much probability the model assigns to the actual next token (lower is better). Perplexity kept being informative long after BLEU saturated. Today, Kaiser claims, the way frontier labs actually evaluate models internally is perplexity on a private code base they trust the model has never seen. Jones agreed: “I would like to see people going back to trying to push perplexity.” It is a benchmark grounded in the compression view of intelligence, and very hard to game if your held-out set is genuinely held out.
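For readers who want the definition spelled out: perplexity is the exponential of the average negative log-probability the model assigned to the tokens that actually occurred, and its base-2 logarithm is the average number of bits needed per token, which is the link to the compression view. A minimal sketch, with made-up probabilities standing in for real model outputs:

```python
import math


def perplexity(token_probs):
    # token_probs: probability the model assigned to each token that actually came next.
    negative_log_likelihoods = [-math.log(p) for p in token_probs]
    return math.exp(sum(negative_log_likelihoods) / len(negative_log_likelihoods))


def bits_per_token(token_probs):
    # Average code length per token if the model were used as a compressor.
    return math.log2(perplexity(token_probs))


# Made-up examples: a model guessing uniformly among 4 options,
# and a model that is usually confident about the right token.
uniform = [0.25] * 8
confident = [0.9, 0.8, 0.95, 0.7]

uniform_ppl = perplexity(uniform)
uniform_bits = bits_per_token(uniform)
confident_ppl = perplexity(confident)
```

A model that spreads its probability evenly over four options has a perplexity of 4 and needs 2 bits per token; the more confident model scores lower on both. Because the score depends only on probabilities assigned to held-out text, the main way to game it is to leak that text into training.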

Audience Q&A — Hardware lock-in, dynamic weights, latent thinking

On hardware lock-in. Jones was emphatic: the transformer was a response to “wow, we have TPUs now, how do we use them fully?” That fit has trapped the field in a local minimum. The first post-transformer iteration will not beat the transformer at compute parity, and people need to be comfortable with that.

Kaiser added a remarkable historical detail in agreement: when transformers were first served on early TPUs, the chips had no hardware support for the exponential function, because they were built for RNNs. Softmax had to be offloaded to the CPU. Transformers were slow as hell. They had to clear a high bar to get the hardware companies to change course. Whatever comes next will have to clear a similarly high bar, perhaps 10× better, and that constraint is paradoxically useful, because it forces researchers to think big rather than tweak.

His optimistic spin: agents can now write CUDA kernels. So if your new architecture is a constant factor slower on Nvidia hardware, you can let a coding agent close most of that gap automatically. “Don’t be scared of 50× slower,” he said, joining Jones in encouraging risk-taking. “Find a curve that bends the right way.”

On dynamic weights and continual learning. An audience member pressed on continual learning: humans never freeze their weights. Kosowski’s answer: the right place to look is transformers’ in-context learning, not gradient updates. In restricted settings, in-context learning has been shown to be mathematically connected to gradient descent on a small ephemeral network inside the forward pass. The ideal future, he suggested, is something like in-context learning extended to time→∞: a session that never forgets and never stops updating its internal state.
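The simplest known instance of that connection is easy to check by hand. For linear regression, one gradient-descent step from zero weights gives the same prediction as a softmax-free (linear) attention read in which the context inputs act as keys, the context targets as values, and the new input as the query. The sketch below covers only this restricted setting; it is not a claim about full softmax transformers, and the function names and data are hypothetical.

```python
def gd_step_prediction(xs, ys, x_query, lr):
    # One gradient-descent step on 0.5 * sum((w . x_i - y_i)**2), starting from w = 0,
    # followed by a prediction for x_query with the updated weights.
    dim = len(x_query)
    w = [0.0] * dim
    grad = [0.0] * dim
    for x, y in zip(xs, ys):
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for j in range(dim):
            grad[j] += err * x[j]
    w = [wi - lr * g for wi, g in zip(w, grad)]
    return sum(wi * xi for wi, xi in zip(w, x_query))


def linear_attention_prediction(xs, ys, x_query, lr):
    # Keys = context inputs, values = context targets, query = new input; no softmax.
    return lr * sum(sum(q * k for q, k in zip(x_query, x)) * y for x, y in zip(xs, ys))


# Made-up in-context examples (x_i, y_i) and a query point.
xs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ys = [2.0, -1.0, 1.0]
x_query = [0.5, 2.0]
lr = 0.1

gd_pred = gd_step_prediction(xs, ys, x_query, lr)
attn_pred = linear_attention_prediction(xs, ys, x_query, lr)
```

Both routes give the same number because the updated weight vector is just the learning rate times the sum of y_i · x_i, and taking its dot product with the query reproduces the attention sum. No weights are stored anywhere: the "learning" lives entirely in how the context is read.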

Jones took this as a point for the post-transformer side: standard neural networks were designed with static weights, and now in 2026 the field is scrambling to bolt dynamic weights on top. He would rather see an architecture designed from the ground up around dynamic weights. Kaiser countered, half-conceding: after pre-training, transformer activations on the forward pass already implement something very like gradient descent, which was Kosowski’s point. As an engineer he would still prefer this to be explicit; but the brain has fast and slow neurons too, so maybe the implicit version is fine. What he genuinely wants is a benchmark, which does not currently exist, that isolates in-context learning quality from raw retrieval. Needle-in-a-haystack tests measure retrieval, not learning, and that gap is where post-transformer architectures could most easily demonstrate dominance with less compute.

On latent-space reasoning and safety. A question raised the existential-risk angle: if models reason in opaque latent space rather than in text, are we losing interpretability? Kaiser’s surprising answer: we already are. Each token in a chain of thought is a few bytes; above each token sit dozens of layers of activations encoding enormously more information. Today those activations happen to be faithful to the text, but there is no guarantee they remain so. “We should not be complacent.” Jones spun the same observation hopefully: a post-transformer that more honestly mirrors how the brain works might turn out to be more interpretable, not less.
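A back-of-the-envelope calculation shows the scale of the gap Kaiser describes. Every number below is a hypothetical round figure chosen for illustration, not the configuration of any real model, and raw storage size is only an upper bound on how much information the activations actually carry.

```python
import math

# Hypothetical round numbers, for illustration only.
vocab_size = 100_000        # distinct tokens the model can emit
hidden_size = 4096          # activation vector width per layer
num_layers = 32             # layers stacked above each token position
bits_per_activation = 16    # assuming 16-bit floating-point activations

# Upper bound on the information carried by one sampled token.
token_bits = math.log2(vocab_size)

# Raw storage of all activations above one token position (an upper bound, not a measurement).
activation_bits = hidden_size * num_layers * bits_per_activation

ratio = activation_bits / token_bits
```

Under these assumptions a visible token carries under 17 bits, while the activations above it occupy about two million bits of storage, a gap of roughly five orders of magnitude. The text of a chain of thought is therefore a very narrow window onto the computation that produced it.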

Key takeaways

The disagreement is sharper than it looks. Everyone agrees transformers will be used for years to come. The real question is whether the next big jump comes from a different architecture or from continued scaling of attention.
Hardware co-evolution is the central constraint. Transformers won partly because TPUs/GPUs reward dense matmul. Any successor must either fit that substrate or be compelling enough to bend the hardware roadmap — and the bar is roughly 10× better, not 2×.
Reasoning is the cleanest current wedge. All three challengers pointed at it: chain-of-thought is a hack; native latent reasoning that is also trainable would be a structural win.
Perplexity, not benchmarks. The most reliable internal metric at frontier labs is perplexity on private held-out data. Public benchmarks are too easily gamed; better in-context-learning benchmarks are missing and would matter most.
In-context learning ≈ implicit gradient descent. Kosowski and Kaiser converged on the view that the transformer’s forward pass already implements something like learning. Whether to make that explicit (dynamic weights) is the design question of the next architecture.
The “bitter lesson” has a sibling. Better scaling slopes — not just more compute — are what a successor must demonstrate. Constant-factor speed gaps are now closable by coding agents writing custom CUDA.
Sympathy across the corners. Kaiser admits he might be right “only until the day he is wrong, forever”; Jones admits that if he were at OpenAI he might defend the transformer too. Nobody on stage thought attention is the literal end of the story — they disagreed about how soon the successor arrives and who will find it. Kaiser’s parting wager: perhaps the transformer will find its own replacement.

Source

📺 Transformer vs Post-Transformer — Łukasz Kaiser, Adrian Kosowski, Matthias Lechner, Llion Jones; moderated by Susanna Stanisławska and Dexter Hoddy. Hosted by Pathway (~1h 20m).
