# WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving

Mingwang Xu¹\*, Jiahao Cui¹\*, Feipeng Cai²\*, Hanlin Shang¹\*, Zhihao Zhu¹, Shan Luan¹, Yifang Xu¹, Neng Zhang², Yaoyi Li², Jia Cai², Siyu Zhu¹

¹ Fudan University   ² Yinwang Intelligent Technology Co., Ltd.


## 📰 News

- **`2026/02/01`**: 🎉🎉🎉 Release the pretrained models on [Huggingface](https://huggingface.co/fudan-generative-ai/WAM-Diff).
- **`2025/12/06`**: 🎉🎉🎉 Paper submitted to [arXiv](https://arxiv.org/pdf/2512.11872).

## 📅️ Roadmap

| Status | Milestone | ETA |
| :----: | :----------------------------------------------------------------------------------------------------: | :--------: |
| ✅ | **[Release the inference source code](https://github.com/fudan-generative-vision/WAM-Diff)** | 2025.12.21 |
| ✅ | **[Release the SFT and inference code](https://github.com/fudan-generative-vision/WAM-Diff)** | 2025.12.21 |
| ✅ | **[Release pretrained models on Huggingface](https://huggingface.co/fudan-generative-ai/WAM-Diff)** | 2026.02.01 |
| 🚀 | **[Release NAVSIM evaluation code](https://huggingface.co/fudan-generative-ai/WAM-Diff)** | TBD |
| 🚀 | **[Release the RL code](https://github.com/fudan-generative-vision/WAM-Diff)** | TBD |

## 🔧️ Framework

![framework](assets/main_arch.png)

## 🏆 Qualitative Results on NAVSIM

### NAVSIM-v1 benchmark results

*(NAVSIM-v1 benchmark results figure)*

### NAVSIM-v2 benchmark results

*(NAVSIM-v2 benchmark results figure)*
## Quick Inference Demo

The WAM-Diff model will be available on the Hugging Face Hub soon. To quickly test the model, follow these steps:

1. **Clone the repository**

   ```bash
   git clone https://github.com/fudan-generative-vision/WAM-Diff
   cd WAM-Diff
   ```

2. **Initialize the environment**

   If you prefer conda, run the environment setup script to install the necessary dependencies:

   ```bash
   bash init_env.sh
   ```

   Or use uv to create the environment:

   ```bash
   uv venv && uv sync
   ```

3. **Prepare the models**

   Download the pretrained [WAM-Diff](https://huggingface.co/fudan-generative-ai/WAM-Diff) model from Hugging Face to the `./model/WAM-Diff` directory:

   ```
   https://huggingface.co/fudan-generative-ai/WAM-Diff
   ```

   Download the pretrained SigLIP2 model from Hugging Face to the `./model/siglip2-so400m-patch14-384` directory:

   ```
   https://huggingface.co/google/siglip2-so400m-patch14-384
   ```

   A hedged download sketch using `huggingface_hub` is provided at the end of this README.

4. **Run the demo script**

   Execute the demo script to test WAM-Diff on an example image:

   ```bash
   bash inf.sh
   ```

## Training

To fine-tune WAM-Diff, please follow these steps:

1. **Set up the environment**

   Follow the same environment setup steps as in the Quick Inference Demo section.

2. **Prepare the data**

   Prepare your training dataset in JSON format, for example:

   ```json
   [
     {
       "image": ["path/to/image1.png"],
       "conversations": [
         {
           "from": "human",
           "value": "Here is front views of a driving vehicle:\n\nThe navigation information is: straight\nThe current position is (0.00,0.00)\nCurrent velocity is: (13.48,-0.29) and current accelerate is: (0.19,0.05)\nPredict the optimal driving action for the next 4 seconds with 8 new waypoints."
         },
         {
           "from": "gpt",
           "value": "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09"
         }
       ]
     },
     ...
   ]
   ```

   A small helper sketch for assembling samples in this format appears at the end of this README.

3. **Run the training script**

   Execute the training script with the following command:

   ```bash
   cd train
   bash ./scripts/llada_v_finetune.sh
   ```

## 📝 Citation

If you find our work useful for your research, please consider citing the paper:

```
@article{xu2025wam,
  title={WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving},
  author={Xu, Mingwang and Cui, Jiahao and Cai, Feipeng and Shang, Hanlin and Zhu, Zhihao and Luan, Shan and Xu, Yifang and Zhang, Neng and Li, Yaoyi and Cai, Jia and others},
  journal={arXiv preprint arXiv:2512.11872},
  year={2025}
}
```

## 🤗 Acknowledgements

We gratefully acknowledge the contributors to the [LLaDA-V](https://github.com/ML-GSAI/LLaDA-V) repository, whose commitment to open source has provided us with an excellent codebase and pretrained models.
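## 🧰 Helper Sketches

The model-preparation step in the Quick Inference Demo can also be scripted. The following is a minimal sketch, assuming `huggingface_hub` is installed and that the repository IDs and `./model/...` layout match the ones listed above; it is not part of the official toolchain.

```python
# download_models.py -- hedged helper sketch, not part of the official WAM-Diff scripts.
# Assumes `pip install huggingface_hub` and the ./model/... layout expected by inf.sh.
from huggingface_hub import snapshot_download

# Pretrained WAM-Diff checkpoint -> ./model/WAM-Diff
snapshot_download(
    repo_id="fudan-generative-ai/WAM-Diff",
    local_dir="./model/WAM-Diff",
)

# SigLIP2 vision encoder -> ./model/siglip2-so400m-patch14-384
snapshot_download(
    repo_id="google/siglip2-so400m-patch14-384",
    local_dir="./model/siglip2-so400m-patch14-384",
)
```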
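For the data-preparation step in the Training section, the sketch below shows one way to assemble samples in the JSON layout shown above. The prompt template and waypoint serialization are copied verbatim from the example; `build_sample` and its arguments are hypothetical helpers, so adapt them to your own data pipeline.

```python
# make_dataset.py -- hedged sketch of the SFT JSON layout shown in the Training section.
# `build_sample` and its arguments are hypothetical, not part of the WAM-Diff codebase.
import json


def build_sample(image_path, navigation, velocity, acceleration, waypoints):
    """Assemble one training sample in the conversation format shown above."""
    # The prompt text (including its wording) is copied verbatim from the README example.
    prompt = (
        "Here is front views of a driving vehicle:\n\n"
        f"The navigation information is: {navigation}\n"
        "The current position is (0.00,0.00)\n"
        f"Current velocity is: ({velocity[0]:.2f},{velocity[1]:.2f}) "
        f"and current accelerate is: ({acceleration[0]:.2f},{acceleration[1]:.2f})\n"
        "Predict the optimal driving action for the next 4 seconds with 8 new waypoints."
    )
    # 8 future (x, y) waypoints serialized as a flat comma-separated string.
    answer = ",".join(f"{coord:.2f}" for point in waypoints for coord in point)
    return {
        "image": [image_path],
        "conversations": [
            {"from": "human", "value": prompt},
            {"from": "gpt", "value": answer},
        ],
    }


if __name__ == "__main__":
    sample = build_sample(
        image_path="path/to/image1.png",
        navigation="straight",
        velocity=(13.48, -0.29),
        acceleration=(0.19, 0.05),
        waypoints=[(6.60, -0.01), (13.12, -0.03), (19.58, -0.04), (25.95, -0.03),
                   (32.27, -0.03), (38.56, -0.05), (44.88, -0.06), (51.16, -0.09)],
    )
    with open("train_data.json", "w") as f:
        json.dump([sample], f, indent=2)
```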