R1Sim applies the 'Reasoning-RL' paradigm (popularized by DeepSeek-R1) to traffic simulation, achieving superior safety and diversity in multi-agent behaviors.
March 27, 2026
Original Paper
Learning Rollout from Sampling: An R1-Style Tokenized Traffic Simulation Model
arXiv · 2603.24989
The Takeaway
Instead of relying on simple imitation learning, R1Sim uses motion-token entropy and Group Relative Policy Optimization (GRPO) to actively explore high-uncertainty behaviors. Compared with standard supervised fine-tuning, this approach yields more realistic and safer traffic simulations for evaluating autonomous vehicles.
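The core mechanic of GRPO is that it replaces a learned value critic with group-relative advantages: several rollouts are sampled for the same scene, and each rollout's reward is normalized against its group's statistics. A minimal sketch of that normalization step is below; the rewards shown are illustrative placeholders, not the paper's actual reward function.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group Relative Policy Optimization advantage estimate:
    normalize each rollout's reward by the group mean and std,
    so no separate value critic is needed."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# For one traffic scene, sample a group of candidate rollouts and
# score them (placeholder rewards, e.g. safety/realism scores).
group_rewards = [0.9, 0.2, 0.5, 0.7]
adv = grpo_advantages(group_rewards)
# Rollouts better than the group mean get positive advantage and are
# reinforced; worse-than-average rollouts get negative advantage.
```

In the paper's setting, each group member would be a sampled sequence of motion tokens for the agents in a scene, with exploration concentrated where token entropy is high.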
From the abstract
Learning diverse and high-fidelity traffic simulations from human driving demonstrations is crucial for autonomous driving evaluation. The recent next-token prediction (NTP) paradigm, widely adopted in large language models (LLMs), has been applied to traffic simulation and achieves iterative improvements via supervised fine-tuning (SFT). However, such methods limit active exploration of potentially valuable motion tokens, particularly in suboptimal regions. Entropy patterns provide a promising …