ARGUS: Production-Scale Tracing and Performance Diagnosis for 10,000+ GPU Clusters
Weekly Paper Notes — one of the top picks from the 2026-06-20 CS paper digest. Area: Distributed Computing. Authors: Jiasheng Zhou, Longbin Zeng, Clavis Chen, Ruiming Lu et al. arXiv: 2606.20374 · PDF TL;DR ARGUS is a tracing and performance-diagnosis system designed for always-on operation on production LLM training clusters with more than 10,000 GPUs. The central insight is that no single profiler can be cheap, deep, and continuous all at once — so ARGUS decomposes observation along the training call hierarchy into three independent collection channels: CPU call stacks, framework semantics, and GPU kernel execution....