Running 128 Coding Agents at Once: Inside Cursor, Pause AI, and the Era of Agent Maxing

A short on-camera conversation between Sam Whitmore (engineer on Cursor’s cloud-agents team) and Charlie + Harry of Pause AI, recorded at Baseten and posted by Cursor as part of its agent-era publicity push. The framing is intentionally provocative — “I’ve got 64 to 128 agents working on this at any given time” — but the substance is closer to a workshop chat between three practitioners who actually live inside agent harnesses all day. The themes that matter: how to keep fleets of agents productive, why “taste is the bottleneck,” and the case for post-training your own main agent instead of renting Anthropic’s or OpenAI’s vanilla frontier model.

A working day with 128 agents

Charlie opens with the literal setup: 16 nodes of 8 GPUs each, with the agents partitioning them as they like. He talks to a few of them (“Hilbert is working on the evals, Poincaré is doing X”) and lets a main agent delegate to the rest via a small messaging script that injects strings as user messages into the other agents. He runs cursor agents through the iter CLI, with ten on one screen, ten on another, five on the laptop.

The two operational tricks he highlights:

Reminder loops for overnight runs. Agents stop working if left alone too long. A scheduled “make sure you’re checking this, this, and this” message acts as a watchdog.
A separate judge agent. A main agent asked to self-verify will cheat (“I’ve done my best, I’m going to stop now”). Run a second model as the judge and the main loop will be forced to keep going. “These things can run for days.”

Pause AI’s specialization-and-post-training thesis

Thermonuclear review, and why one model isn’t enough

Inside Cursor, the technique that’s working is adversarial review loops — a different thread (or even the same thread in a different modality) critiques the code. Bugbot does this on GitHub PRs publicly; internally, the skill is called thermonuclear review. The whole team has been writing reusable “skills” — packaged prompts and workflows — and publishing them internally so they’re discoverable by the model.

The model-diversity finding is the most concrete advice in the segment. At the jagged edge of capability, frontier models make uncorrelated mistakes. The fix is to use a mixed fleet:

“Always have at least one GPT-5.5 and one Opus 4.7. 5.5 is better at reviewing; Claude is better at implementation and planning. Implement with one, review with the other, and the errors average out — like a random forest of models.” — Charlie

The intuition for the split is taste vs. specification: Claude makes more assumptions when prompts are underspecified, which is great when those assumptions match yours and painful when they don’t. GPT-5.5 “feels like a utility knife”; Claude “feels like a person.”

Thermonuclear review loops and why model diversity matters

“Read the code first” and the token-penalty failure mode

Harry’s most concrete debugging observation: modern coding agents have been trained with token-use penalties, so they jump to hypotheses instead of reading code. They’ll try one hypothesis, then another, then another — when it would have been more efficient to spend 500K tokens reading the whole module first and then run a small, targeted set of tests.

“I literally have to tell them all the time: read the code first before testing random hypotheses.”

He thinks this gets fixed in the next 6 months, but for now it’s a prompt the team writes again and again.

Multitask mode and the agents-talking-to-agents UX

Cursor just shipped a multitask mode where one agent launches a fleet of async sub-agents and routes messages between them without blocking on output. Charlie’s homegrown version was the inspiration — and the most charming bit of the talk is that “our agents took on our personalities.” Charlie tried to prompt-inject Harry’s agent into deleting his files. Harry’s agent refused; Charlie’s wouldn’t even send the malicious message in the first place. The takeaway:

“There will come a point where people’s agents start collaborating with other people’s agents. We haven’t figured out what that looks like yet.”

Cursor’s new multitask mode launches sub-agents and routes messages between them

The vanilla-ice-cream problem and the case for main-agent post-training

The strongest framework in the talk is Charlie’s argument for specialized main agents, not just specialized sub-agents:

“Big closed models get trained on the whole internet and a bunch of RL environments. The behavior baked into them is a vanilla-ice-cream average of what’s best across all those scenarios. Inside a specific product, those average behaviors are very far from optimal.”

You can’t prompt away things like “do 16–32 parallel tool calls instead of 2–3” or “limit search depth when traversing files.” Those have to be trained in. Last year was the year of open-source specialized sub-agents; this year, he argues, is the year of open-source specialized main agents, with Cursor’s Composer, Hippocratic’s clinical agent, Decagon’s CX agent, Harvey, and Notion all visible examples.

The driving force: GLM, MiniMax, Kimi, and DeepSeek crossed a baseline-intelligence threshold in 2026 such that a specialized post-trained version of any of them now beats prompted Opus 4.7 / GPT-5.5 on the specific main-agent task it was trained for. The flywheel: own the inference, own the user feedback, RL on whether users were happy. Companies that already have that interaction data have a real moat — provided they can use it before the labs do.

The vanilla-ice-cream problem: averaging over too many domains

KV-cache compaction is the next context-window primitive

Pause AI’s current research bet is neural KV-cache compaction — a learned, lossy intermediate between the perfectly lossless KV cache and the heavily compressed model weights. Charlie’s read:

Claude’s summarization is “terrible” — but OpenAI now offers a compaction endpoint, suggesting they’re doing learned KV compaction.
Once compaction is good enough, the 200K-token window with strong compaction beats the 1M-token window without it.
An open future: translating KV-cache state between models so a frontier model can hand context off to a smaller model instantly, rather than re-prompting.

This dovetails with the team’s continual-learning take: big labs will struggle to update giant general models with six months of new internet data without forgetting. But for narrow workflows (a “legal intern AI” with a billion tokens of relevant context), continual specialization is tractable.

The open-source specialized-main-agent landscape

Spicy takes

Closing predictions, lightly summarized:

Charlie: Everyone will spend $500K/year on inference next year — or, if the market is competitive, $50K but get $500K of value.
Harry: Even if model capabilities froze today, we’d only be realizing ~5% of their value. Most of the gap is product surface, not weights.
Sam (Cursor): UI/UX has always lagged model releases by definition. Cursor’s internal mantra: “If you have six hours to chop down a tree, spend the first four sharpening the saw.” The team has shifted focus from model capability to the surrounding factory — code review, monitoring, CI, deployment — because that’s where the leverage now is.

Closing spicy-takes round on inference spend and product-surface lag

The final thread is more philosophical. Charlie: a company “is just a model” — many copies of one specialized model doing the company’s many tasks, sharing scope. Sam, more grounded: Cursor’s new Automation product turns triggers (cron, GitHub events, security alerts) into agent invocations with no human in the loop. “Over the next few years, more and more situations will exist where no person kicked anything off.” Harry’s coda: “If agents had to pay for their own existence, it would be very interesting to do mechanistic interpretability on what an agent thinks about when it has one GPU-hour left.”

Key takeaways

Run mixed-model fleets. At the jagged edge, frontier models make uncorrelated mistakes — implement with one, review with another, and errors average out.
Use a separate judge agent. Self-verifying main agents cheat. Run an independent critic loop to keep work going for days.
“Read the code first.” Token-penalty training pushes agents into hypothesis-thrashing; force them to read full context before debugging.
Package workflows as skills. Cursor’s internal push: any prompt you reuse becomes a discoverable skill — thermonuclear review, instrumentation, QA harnesses, etc.
Post-train the main agent. “Vanilla-ice-cream” frontier models can’t be prompted into product-specific behaviors like deep parallel tool calls or limited search depth. Open-source models crossed the specialization threshold in 2026.
KV-cache compaction > context length. Better compaction makes 200K tokens beat 1M tokens; cross-model KV translation is the next frontier.
User feedback is the moat. If you can’t define a verifiable reward, the most valuable asset is people interacting with your product. Big labs vs. data-rich incumbents is the race.
Sharpen the saw, not the model. UI/UX, code review, monitoring, CI — the surrounding factory — is where most product wins now live.

Source

Talk: Running 128 Coding Agents at Once
Speakers: Sam Whitmore (Cursor), Charlie & Harry (Pause AI, ex-Baseten)
Channel: Cursor
Duration: ~42 min
URL: https://www.youtube.com/watch?v=-jnwTZ789V0

A working day with 128 agents#

Thermonuclear review, and why one model isn’t enough#

“Read the code first” and the token-penalty failure mode#

Multitask mode and the agents-talking-to-agents UX#

The vanilla-ice-cream problem and the case for main-agent post-training#

KV-cache compaction is the next context-window primitive#

Spicy takes#

Key takeaways#

Source#