CUDA Agent

CUDA Agent: Revolutionizing High-Performance CUDA Kernel Generation with Large-Scale Agentic Reinforcement Learning

CUDA Agent targets CUDA kernel generation, a specialized and difficult part of GPU optimization. Developed by a team from ByteDance Seed, AIR Tsinghua, and SIA-Lab, the system uses large-scale agentic reinforcement learning (RL) to train a model that writes high-performance CUDA kernels. The core challenge is that optimizing GPU code traditionally requires deep hardware expertise and extensive manual effort. Existing approaches rely either on training-free refinement or on rigid execution-feedback loops, and neither develops intrinsic optimization capability in the model itself.

CUDA Agent addresses these challenges with three components:

Scalable Data Synthesis: A pipeline generates a large, high-quality set of training tasks. It starts by crawling seed problems from popular libraries such as torch and transformers, then expands them through LLM-based combinatorial synthesis, which composes up to five PyTorch operators sequentially into fused tasks. A filtering stage keeps only tasks that run in both eager and compiled modes and removes stochastic operators. Further checks include anti-hacking filters that reject constant or indistinguishable outputs, and workload controls that keep eager runtimes within a practical range (1 ms to 100 ms). The result is the CUDA-Agent-Ops-6K dataset: 6,000 training samples designed for scalable RL training, with broad task diversity and minimized contamination risk.
Skill-Augmented CUDA Development Environment: CUDA Agent works in a specialized environment for iteratively developing, verifying, and profiling CUDA kernels. The environment supports a ReAct-style workflow in which the agent uses coding tools and follows a defined CUDA skill specification (SKILL.md). The standard loop is: profile the native PyTorch code, implement CUDA kernels and bindings, compile them in a secure GPU sandbox, and iterate on the feedback. The agent is required not only to pass correctness checks but also to exceed a 5% speedup over torch.compile. The reward schedule uses milestone-based discrete rewards for correctness and speed gains. To prevent reward hacking, the environment enforces strict controls: protected verify/profile scripts, prohibitions on fallback calls, correctness checks on five inputs, synchronized warm-up profiling, and restrictions on web retrieval. These measures keep policy learning focused on genuine kernel quality rather than on exploiting loopholes.
Stable Long-Horizon RL Training: Training RL agents on long-horizon tasks such as CUDA kernel generation is notoriously unstable. CUDA Agent addresses this with a multi-stage training strategy. Training begins with a single-turn PPO warm-up that improves base CUDA generation before moving to full multi-turn agentic RL. The actor is initialized with Rejection Fine-Tuning (RFT) on sampled trajectories with positive outcomes, filtered to remove inefficient loops and invalid tool-call patterns, which reduces the risk of policy collapse. The critic is initialized with value pretraining so that advantage estimates are reliable from the start of training. This design keeps training stable in long-context settings, supporting up to 128k context, up to 150 turns during training, and up to 200 turns during evaluation, and it enables sustained reward growth.
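
The task-filtering rules described under Scalable Data Synthesis can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual CUDA Agent pipeline: the CandidateTask record, its field names, the tolerance, and the choice to reject any task that uses a stochastic operator are all hypothetical. Only the eager/compiled requirement, the constant-output check, and the 1 ms to 100 ms runtime window come from the description above.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class CandidateTask:
    """Hypothetical record for one synthesized task (field names are illustrative)."""
    name: str
    runs_in_eager: bool
    runs_in_compiled: bool
    uses_stochastic_op: bool
    eager_runtime_ms: float
    # Flattened reference outputs, one sequence per distinct random input.
    outputs_per_input: List[Sequence[float]]


def outputs_are_distinguishable(outputs: List[Sequence[float]], tol: float = 1e-6) -> bool:
    """Return True if at least one input yields an output that differs from the first.

    A task whose output is the same for every input could be "solved" by a
    kernel that ignores its input, so such tasks are rejected.
    """
    if len(outputs) < 2:
        return False  # cannot tell with fewer than two inputs
    first = outputs[0]
    for other in outputs[1:]:
        if any(abs(a - b) > tol for a, b in zip(first, other)):
            return True
    return False


def keep_task(task: CandidateTask, min_ms: float = 1.0, max_ms: float = 100.0) -> bool:
    """Apply the filters in order; a task must pass all of them to be kept."""
    if not (task.runs_in_eager and task.runs_in_compiled):
        return False
    if task.uses_stochastic_op:  # assumption: reject the whole task
        return False
    if not (min_ms <= task.eager_runtime_ms <= max_ms):
        return False
    return outputs_are_distinguishable(task.outputs_per_input)


if __name__ == "__main__":
    tasks = [
        CandidateTask("fused_a", True, True, False, 12.5, [[0.1, 0.2], [0.3, 0.9]]),
        CandidateTask("constant_out", True, True, False, 12.5, [[0.0, 0.0], [0.0, 0.0]]),
    ]
    print([t.name for t in tasks if keep_task(t)])
```

Ordering the cheap boolean checks before the output comparison means obviously unusable tasks are discarded without inspecting their outputs.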

Key Features and Technical Aspects:

Agentic Reinforcement Learning: At its core, CUDA Agent is an RL system where an agent learns to generate and optimize CUDA kernels. This approach allows for adaptive and emergent optimization strategies that go beyond predefined rules.
Large-Scale Data Synthesis: The creation of the CUDA-Agent-Ops-6K dataset is a critical component, providing the breadth and quality of examples necessary for effective RL training. The synthesis pipeline ensures diversity and relevance.
Skill-Augmented Environment: The development environment is tailored for CUDA optimization, incorporating tools for coding, compilation, debugging, and profiling, all within a controlled and verifiable framework.
ReAct-Style Workflow: The agent's interaction with the environment is guided by a ReAct (Reasoning and Acting) paradigm, enabling it to reason about tasks and take appropriate actions, such as generating code or invoking tools.
Robust Verification and Profiling: The system emphasizes reliable feedback mechanisms. Correctness checks and synchronized warm-up profiling ensure that performance gains are genuine and reproducible.
Long-Horizon Training Stability: Techniques like staged training, RFT, and value pretraining are employed to overcome the challenges of training RL agents over extended sequences of actions.
State-of-the-Art Performance: CUDA Agent has demonstrated exceptional results on the KernelBench benchmark, significantly outperforming existing methods, including torch.compile and even strong proprietary models.
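
To make "synchronized warm-up profiling" concrete, here is a minimal timing harness in plain Python. It is a sketch of the general technique, not CUDA Agent's protected profile script, which is not shown here. The function name time_kernel and its parameters are hypothetical. On a GPU, the synchronize argument would typically be torch.cuda.synchronize, so that asynchronously launched kernels finish before the clock is read; the default is a no-op so the example runs on any machine.

```python
import time
from statistics import median
from typing import Callable, List


def time_kernel(
    run: Callable[[], None],
    synchronize: Callable[[], None] = lambda: None,
    warmup: int = 3,
    iters: int = 10,
) -> float:
    """Return the median wall-clock time of `run` in milliseconds.

    Warm-up iterations are executed but not timed, so one-time costs
    (compilation, caching, allocator growth) do not distort the result.
    """
    for _ in range(warmup):
        run()
    synchronize()  # drain any queued asynchronous work before timing starts

    samples: List[float] = []
    for _ in range(iters):
        start = time.perf_counter()
        run()
        synchronize()  # wait for the work to finish so the sample covers it
        samples.append((time.perf_counter() - start) * 1e3)
    return median(samples)


if __name__ == "__main__":
    # CPU stand-in workload; replace with a kernel launch on a real GPU.
    elapsed_ms = time_kernel(lambda: sum(i * i for i in range(10_000)))
    print(f"median: {elapsed_ms:.3f} ms")
```

Without the synchronize call inside the timed loop, a measurement of asynchronous GPU work would capture only launch overhead, which is one way a kernel could appear faster than it really is.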

Performance Metrics and Use Cases:

CUDA Agent's effectiveness is measured on the KernelBench benchmark:

Overall Performance: Achieves a 98.8% pass rate, a 96.8% faster rate against torch.compile (the share of tasks on which the generated kernel beats torch.compile), and a 2.11x geomean speedup.
Level-3 Performance: On Level-3, the most challenging split, reaches a 90% faster rate against torch.compile, significantly outperforming proprietary baselines.
Comparison with Baselines: Outperforms leading proprietary models like Claude Opus 4.5 and Gemini 3 Pro in compile-relative performance, demonstrating a clear optimization gap.
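
The three headline metrics can be computed from per-task timings as in the sketch below. The exact conventions used for the reported numbers are not stated here, so the following are assumptions: a failed task counts as "not faster", the faster rate is taken over all tasks, and the geometric mean is taken over passing tasks only. The TaskResult record and the summarize function are hypothetical names.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class TaskResult:
    """Hypothetical per-task record; times are in milliseconds."""
    passed: bool                       # generated kernel passed correctness checks
    baseline_ms: float                 # runtime of the baseline, e.g. torch.compile
    kernel_ms: Optional[float] = None  # runtime of the generated kernel, if it ran


def summarize(results: List[TaskResult]) -> Dict[str, float]:
    """Compute pass rate, faster rate, and geometric-mean speedup."""
    total = len(results)
    if total == 0:
        raise ValueError("no results to summarize")

    # Speedup > 1 means the generated kernel is faster than the baseline.
    speedups = [
        r.baseline_ms / r.kernel_ms
        for r in results
        if r.passed and r.kernel_ms is not None and r.kernel_ms > 0
    ]

    pass_rate = sum(r.passed for r in results) / total
    faster_rate = sum(s > 1.0 for s in speedups) / total  # assumption: over all tasks
    if speedups:
        # Geometric mean via the mean of logs, which avoids overflow in a long product.
        geomean = math.exp(sum(math.log(s) for s in speedups) / len(speedups))
    else:
        geomean = float("nan")

    return {"pass_rate": pass_rate, "faster_rate": faster_rate, "geomean_speedup": geomean}


if __name__ == "__main__":
    demo = [
        TaskResult(True, baseline_ms=2.0, kernel_ms=1.0),
        TaskResult(True, baseline_ms=1.0, kernel_ms=2.0),
        TaskResult(False, baseline_ms=1.0),
    ]
    print(summarize(demo))
```

The geometric mean is the usual choice for averaging speedup ratios because a 2x gain and a 2x slowdown cancel out, whereas an arithmetic mean would report a net gain.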

Target Users:

The primary target users for CUDA Agent include:

Deep Learning Researchers: Those working on optimizing deep learning models and frameworks for GPU acceleration.
Performance Engineers: Professionals focused on maximizing the efficiency of GPU computations.
CUDA Developers: Engineers and programmers involved in writing and optimizing CUDA code.
Machine Learning Engineers: Practitioners seeking to improve the inference and training speed of their models.

Unique Selling Points:

Automated High-Performance Kernel Generation: Automates a traditionally manual and complex process, saving significant development time and expertise.
Agentic RL Approach: Utilizes cutting-edge AI techniques to discover novel and effective optimization strategies.
Comprehensive Dataset: Provides a valuable, high-quality dataset (CUDA-Agent-Ops-6K) for further research in RL-based code generation.
Demonstrated Superior Performance: Achieves state-of-the-art results on industry-standard benchmarks, proving its effectiveness.
Scalability: Designed for large-scale application, capable of handling complex optimization tasks.

In summary, CUDA Agent combines scalable task synthesis, a skill-augmented development environment, and staged RL training to automate CUDA kernel generation. Its training methodology and KernelBench results make it a strong option for anyone working on high-performance computing and deep learning optimization.
