AI & ML Efficiency Breakthrough

An autonomous agent loop that optimizes GPU kernels to outperform human-expert and compiler-generated baselines.

March 24, 2026

Original Paper

AutoKernel: Autonomous GPU Kernel Optimization via Iterative Agent-Driven Search

Jaber Jaber, Osama Jaber

arXiv · 2603.21331

The Takeaway

AutoKernel automates one of the most labor-intensive parts of ML systems engineering: writing high-performance Triton or CUDA C++ kernels. Its kernels are reported to beat 'torch.compile' (max-autotune) by up to 3x, which could put expert-level kernel optimization within reach of anyone working with an arbitrary PyTorch model.
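To make the "up to 3x" figure concrete, here is a minimal sketch of the Amdahl's law ranking that the abstract quoted below mentions. It is not AutoKernel's code: the names `ProfiledOp`, `amdahl_speedup`, and `rank_by_impact` are hypothetical, the runtime fractions are invented for illustration, and the uniform 3x per-kernel speedup is an assumption borrowed from the headline number as a best case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProfiledOp:
    """One profiled operator (hypothetical structure, not AutoKernel's)."""
    name: str
    time_fraction: float  # share of total model runtime, in [0, 1]


def amdahl_speedup(time_fraction: float, kernel_speedup: float) -> float:
    """End-to-end speedup when one op is made `kernel_speedup` times faster.

    Amdahl's law: S = 1 / ((1 - f) + f / s), where f is the op's share of
    total runtime and s is the speedup of that op alone.
    """
    if not 0.0 <= time_fraction <= 1.0:
        raise ValueError("time_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    return 1.0 / ((1.0 - time_fraction) + time_fraction / kernel_speedup)


def rank_by_impact(ops, kernel_speedup: float = 3.0):
    """Sort ops so the one with the largest end-to-end payoff comes first."""
    return sorted(
        ops,
        key=lambda op: amdahl_speedup(op.time_fraction, kernel_speedup),
        reverse=True,
    )


# Invented profile: fractions are illustrative, not measurements.
profile = [
    ProfiledOp("rmsnorm", 0.05),
    ProfiledOp("matmul", 0.60),
    ProfiledOp("softmax", 0.15),
]

for op in rank_by_impact(profile):
    gain = amdahl_speedup(op.time_fraction, 3.0)
    print(f"{op.name}: {gain:.2f}x end-to-end if this kernel alone runs 3x faster")
```

Under these assumptions, a 3x faster kernel for an operator that takes 5% of runtime improves the whole model by only about 3%, which is why ranking bottlenecks before optimizing them matters.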

From the abstract

Writing high-performance GPU kernels is among the most labor-intensive tasks in machine learning systems engineering. We present AutoKernel, an open-source framework that applies an autonomous agent loop to GPU kernel optimization for arbitrary PyTorch models. Given a model, AutoKernel profiles it to identify computational bottlenecks, ranks them by Amdahl's law impact, and iteratively refines Triton or CUDA C++ kernel implementations through hundreds of experiments without human intervention. A