Weekly Paper Notes — one of the top picks from the 2026-06-13 CS paper digest. Area: Operating Systems / Systems.

Authors: Zhuoping Yang, Yiyu Shi, Alex Jones · arXiv: 2606.06697

TL;DR

The GPU has quietly become a multi-tenant device. Applications no longer just dispatch compute kernels: they call into vendor libraries (cuFFT, cuBLAS, NCCL), interact with GPU-resident services, and reach storage and network adapters through GPUDirect paths. Yet the CUDA programming model still hands each process the full keys to the device: its own context, raw device pointers, runtime handles, module loader, and direct kernel launch. AgileOS argues this is the GPU equivalent of running every process in ring 0, and proposes an OS-style layer that interposes at the CUDA library boundary. A trusted runtime worker owns the real CUDA context; applications link against client-side shims that forward operations to the worker, which mediates them. A GPU memory manager separates user allocations from protected module/MMIO ranges, and a PTX-injected memory guard enforces the boundary inside untrusted kernels.

What problem is the paper attacking?

The world the original CUDA API was designed for (one application, one GPU, isolated compute) no longer exists. Today a single GPU is shared by inference servers, ML training jobs, vendor libraries with internal state, and, increasingly, services that live on the device (resident KV caches, persistent kernels, storage-direct readers). The host side has containers, cgroups, namespaces, MAC, and pointer validation in the kernel. The device side has none of that. A misbehaving or malicious kernel that gets a raw device pointer can scribble anywhere in the portion of the GPU’s virtual address space its process is authorised for, including library-internal state and MMIO-mapped doorbells.

The paper frames this as an OS protection gap: “service metadata, device queues, memory-mapped I/O regions, and library-internal state should not be directly exposed to untrusted application kernels.” The CUDA model exposes all of those by default. Existing protected services have to roll their own ad hoc isolation, which is both duplicative and unaudited.

How AgileOS interposes

AgileOS virtualises CUDA at the library boundary rather than at the driver. Applications link against client-side shims for the CUDA Runtime, CUDA Driver, and a curated set of libraries (cuFFT and PyTorch are called out). The shims do not own the device context; they forward each supported operation (allocation, memcpy, kernel launch, module load) to a trusted runtime worker that owns the real context. The worker mediates: it validates parameters, maps virtual handles to real ones via a virtualised CUDA object table, and decides which operations are allowed.
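The shim/worker split can be made concrete with a small model. The Python sketch below is only an illustration of the idea, not the paper's implementation: the class names (Worker, Shim, PolicyError), the method names, and the made-up base address are all hypothetical, and the "forwarding" is a direct method call where a real system would presumably use some form of IPC. It shows a per-client object table that maps virtual handles to real pointers, so a client never sees a real pointer and cannot use another client's handle.

```python
import itertools


class PolicyError(Exception):
    """Raised when the (hypothetical) worker refuses an operation."""


class Worker:
    """Stand-in for the trusted worker; it alone sees 'real' device pointers."""

    def __init__(self):
        self._table = {}  # (client_id, virtual handle) -> (real pointer, size)
        self._next_vhandle = itertools.count(1)
        self._next_real = 0x7F0000000000  # made-up base address, not a CUDA value

    def _resolve(self, client_id, vhandle):
        entry = self._table.get((client_id, vhandle))
        if entry is None:
            raise PolicyError(f"client {client_id!r} does not own handle {vhandle}")
        return entry

    def malloc(self, client_id, size):
        if size <= 0:
            raise PolicyError("allocation size must be positive")
        real = self._next_real
        self._next_real += size
        vhandle = next(self._next_vhandle)
        self._table[(client_id, vhandle)] = (real, size)
        return vhandle  # the client never learns `real`

    def memcpy_to_device(self, client_id, vhandle, offset, nbytes):
        _real, size = self._resolve(client_id, vhandle)
        if offset < 0 or nbytes < 0 or offset + nbytes > size:
            raise PolicyError("copy falls outside the allocation")
        return nbytes  # a real worker would issue the copy here

    def free(self, client_id, vhandle):
        self._resolve(client_id, vhandle)  # validates ownership first
        del self._table[(client_id, vhandle)]


class Shim:
    """Client-side stand-in: holds no device state, only forwards calls."""

    def __init__(self, worker, client_id):
        self._worker = worker
        self._client_id = client_id

    def malloc(self, size):
        return self._worker.malloc(self._client_id, size)

    def memcpy_to_device(self, vhandle, offset, nbytes):
        return self._worker.memcpy_to_device(self._client_id, vhandle, offset, nbytes)

    def free(self, vhandle):
        self._worker.free(self._client_id, vhandle)


worker = Worker()
alice, bob = Shim(worker, "alice"), Shim(worker, "bob")
h = alice.malloc(256)
alice.memcpy_to_device(h, 0, 256)   # allowed: in bounds, owned by alice
try:
    bob.memcpy_to_device(h, 0, 16)  # rejected: bob does not own handle h
except PolicyError as err:
    print("denied:", err)
```

Keying the table on the client as well as the handle is one simple way to make handles unforgeable across tenants; whether AgileOS scopes its object tables this way is not stated in the summary above.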

Two structural pieces make the isolation meaningful:

  1. A GPU memory manager that splits the device address space into user-allocation regions and protected regions (loaded modules, library MMIO ranges, AgileOS internal state). User allocations cannot land in the protected ranges.

  2. A PTX-level memory guard injected into untrusted kernels. Because a kernel could still compute a pointer at runtime that lands in a protected range, AgileOS rewrites the kernel’s PTX (NVIDIA’s intermediate representation, which is later compiled to SASS machine code) to insert pointer-validation checks on loads and stores. This is roughly the GPU analogue of an SFI (software fault isolation) pass, applied when the PTX is rewritten rather than at the source level.
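The two pieces above can be modelled together in a few lines. This is a host-side Python sketch under stated assumptions, not the paper's design: the Range and MemoryManager names, the bump allocator, the addresses, and the exact form of the check are all hypothetical. The guard method stands in for the predicate that an injected PTX check would evaluate before each load or store; the real guard runs on the device and its form is not described in the summary above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    start: int
    end: int  # exclusive

    def overlaps(self, start, end):
        # Two half-open intervals overlap if each starts before the other ends.
        return start < self.end and self.start < end


class MemoryManager:
    """Hypothetical manager: one user region plus a list of protected ranges."""

    def __init__(self, user_region, protected):
        self.user_region = user_region
        self.protected = list(protected)
        self._cursor = user_region.start  # simple bump allocator, never frees

    def alloc(self, size):
        if size <= 0:
            raise ValueError("allocation size must be positive")
        start, end = self._cursor, self._cursor + size
        if end > self.user_region.end:
            raise MemoryError("user region exhausted")
        # Defence in depth: refuse anything that touches a protected range.
        if any(r.overlaps(start, end) for r in self.protected):
            raise MemoryError("allocation would overlap a protected range")
        self._cursor = end
        return start

    def guard(self, addr, width):
        """Model of the check run before a `width`-byte load or store."""
        end = addr + width
        inside_user = self.user_region.start <= addr and end <= self.user_region.end
        touches_protected = any(r.overlaps(addr, end) for r in self.protected)
        return inside_user and not touches_protected


# Made-up layout: user memory at [0x1000, 0x9000), a protected page after it.
mm = MemoryManager(Range(0x1000, 0x9000), [Range(0x9000, 0xA000)])
ptr = mm.alloc(0x100)
print(mm.guard(ptr, 4))     # access inside a user allocation
print(mm.guard(0x9000, 4))  # pointer computed into the protected page
print(mm.guard(0x8FFE, 4))  # access straddling the region boundary
```

Checking the full access width rather than just the base address matters: an access that starts in user memory can still spill into a protected range, which is the straddling case in the last line.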

The prototype includes the client-side interceptors, worker-side CUDA handlers, virtualised CUDA object tables, the memory manager, trusted library adapters, and the PTX memory guard. It is explicitly initial-design / prototype-scope — the paper is staking out the design space, not benchmarking a finished system.

Why this is interesting now

GPU “OSes” have been talked about for a decade (GPUfs, PTask, GPUnet, Gullfoss, and more recently NVIDIA’s own work on CUDA Confidential Computing). What’s different about AgileOS is the placement: it sits at the library boundary, not inside the driver and not inside the OS kernel. That makes it deployable without vendor cooperation, since it is a userspace runtime plus a PTX rewrite pass, which is the same property that made gVisor and Firecracker successful in the host-OS world. Whether the performance overhead of the worker hop and the PTX guard is tolerable for serving workloads is the obvious next question; the paper does not yet have those numbers.

The broader point is that as GPUs absorb more host-OS responsibilities (storage I/O, networking, scheduling of co-resident services), the device needs a real protection model. AgileOS reads as an early prototype in what is going to be an expanding research area, and worth tracking precisely because its design choices — library-boundary virtualisation, virtualised handles, PTX-injected guards — are the same patterns the host-OS world settled on for similar reasons.

Read alongside

  • gVisor and Firecracker — host-OS analogues of library-boundary interposition for protection.
  • NVIDIA Confidential Computing on H100/B200 — the hardware-rooted alternative path.
  • PTask (Rossbach et al., SOSP 2011) — early OS abstractions for GPUs.
  • “Serverless GPUs” and persistent-kernel literature — workloads that motivate device-side protection.


Part of the Weekly CS Paper Digest series. Summary written from a close read of the preprint abstract; the architectural commentary and lineage notes are the author’s synthesis.