Clipping Makes Distributed and Federated Asynchronous SGD Robust to Stragglers

Weekly Paper Notes — one of the top picks from the 2026-06-13 CS paper digest. Area: Distributed Computing.

Authors: Samuel Erickson, Mikael Johansson (KTH) arXiv: 2606.13287 · PDF

TL;DR

In asynchronous SGD (ASGD), workers compute gradients on possibly stale parameters and push updates without waiting for slow peers. That’s how you keep all the GPUs busy, but it’s also how slow workers (“stragglers”) inject large delays into the update stream, which classical analyses say should slow convergence in proportion to the maximum delay across the workers. Practitioners have long observed that gradient clipping seems to stabilise asynchronous training, but the theoretical reason was unclear. This paper gives one: clipping removes the maximum-delay dependence from the oracle complexity of ASGD. The analysis works under a sub-Weibull gradient-noise model, which generalises sub-Gaussian and sub-exponential noise to genuinely heavy-tailed regimes (the regime deep-learning gradients are actually observed to live in). The headline contribution is convergence in expectation and, for the first time in asynchronous optimisation under heavy-tailed noise, convergence with high probability.

What problem is the paper attacking?

Asynchronous SGD is the dominant strategy whenever the bottleneck is hardware utilisation rather than per-update compute: federated learning over heterogeneous mobile devices, data-parallel training on clusters with stragglers, parameter-server pipelines, and increasingly LLM RLHF training pools. The classical convergence story is:

with synchronous SGD, you wait for all workers, so per-step latency is the max — but each step is unbiased and the analysis is clean,
with ASGD, you don’t wait, so utilisation goes up, but each gradient is computed on parameters that are τ steps out of date for some delay τ, and the analysis pays a price proportional to τ_max (the worst-case delay across all workers).

In practice τ_max can be catastrophic — one stuck worker drags the whole bound down. Practitioners patched around this by clipping gradients (scaling any gradient whose norm exceeds a threshold), and empirically this stabilises training. Until now, theory didn’t really capture the mechanism: classical analyses of clipped SGD are about variance reduction under heavy tails, not about decoupling from worker delay.

The result

The paper proves that with clipping, the oracle complexity of ASGD no longer depends on τ_max. Intuitively: the harm a delayed worker can do is bounded by the norm of its (stale) gradient, and clipping caps that norm by construction. The analysis formalises this and shows the bound now depends on average delay rather than maximum delay — a qualitatively different scaling.

The noise model matters. The authors use a sub-Weibull assumption on gradient noise, which parametrically interpolates between sub-Gaussian (the textbook case, tails decay like exp(-x²)) and sub-exponential and beyond into genuinely heavy-tailed regimes (tails decay like exp(-x^θ) for θ < 1). This matters because empirical studies of deep-learning gradients have repeatedly shown they are heavy-tailed — sub-Gaussian analyses systematically over-promise. Sub-Weibull is the right generalisation for matching what’s actually measured.

Two convergence statements follow:

Convergence in expectation — the standard guarantee, but now without the τ_max factor.
Convergence with high probability — the paper claims this is the first such result for asynchronous optimisation under heavy-tailed noise. High-probability guarantees are what you actually want in practice: “in expectation” can be dominated by a few catastrophic runs.

Why this matters

Two reasons.

First, it is theoretical validation of an operational practice. Federated-learning systems and ASGD-based training infrastructure already use clipping by default — this paper says clipping isn’t just a numerical-stability hack, it’s structurally what lets ASGD scale gracefully in the presence of stragglers. That should affect how step sizes, clipping thresholds, and timeout policies are tuned, especially in federated deployments where worker delay distributions are extreme.

Second, the sub-Weibull + high-probability combination is the right toolkit for analysing modern training. The gap between “what the theory assumes” (sub-Gaussian, in expectation) and “what we actually observe” (heavy-tailed, single runs matter) has been one of the persistent embarrassments of optimisation theory for ML. Each paper that closes a piece of that gap is structurally useful.

Read alongside

“Why Gradient Clipping Accelerates Training” (Zhang et al., ICLR 2020) — the modern motivation for clipping under heavy tails (non-asynchronous setting).
“High-Probability Bounds for Stochastic Optimization” (Davis et al., Gorbunov et al., 2020–2023) — the high-probability machinery for sub-Gaussian/sub-exponential noise.
Federated averaging and FedProx — the deployment context where worker delays are extreme.
Recht et al.’s Hogwild! (NeurIPS 2011) — the classical analysis of lock-free async SGD that this work generalises beyond.

Links

📄 arXiv abstract · 📄 PDF

Part of the Weekly CS Paper Digest series. Summary written from the preprint abstract; the analytical commentary and lineage notes are the author’s synthesis.

TL;DR#

What problem is the paper attacking?#

The result#

Why this matters#

Read alongside#

Links#