# WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving
1Fudan University 2Yinwang Intelligent Technology Co., Ltd
## 📰 News
- **`2026/02/01`**: 🎉🎉🎉 Released the pretrained models on [Huggingface](https://huggingface.co/fudan-generative-ai/WAM-Diff).
- **`2025/12/06`**: 🎉🎉🎉 Submitted the paper to [arXiv](https://arxiv.org/pdf/2512.11872).
## 📅️ Roadmap
| Status | Milestone | ETA |
| :----: | :----------------------------------------------------------------------------------------------------: | :--------: |
| ✅ | **[Release the inference source code](https://github.com/fudan-generative-vision/WAM-Diff)** | 2025.12.21 |
| ✅ | **[Release the SFT and inference code](https://github.com/fudan-generative-vision/WAM-Diff)** | 2025.12.21 |
| ✅ | **[Release pretrained models on Huggingface](https://huggingface.co/fudan-generative-ai/WAM-Diff)** | 2026.02.01 |
| 🚀 | **[Release NAVSIM evaluation code](https://huggingface.co/fudan-generative-ai/WAM-Diff)** | TBD |
| 🚀 | **[Release the RL code](https://github.com/fudan-generative-vision/WAM-Diff)** | TBD |
## 🔧️ Framework

## 🏆 Qualitative Results on NAVSIM
### NAVSIM-v1 benchmark results
### NAVSIM-v2 benchmark results
## Quick Inference Demo
The WAM-Diff pretrained model is available on the Hugging Face Hub. To quickly test the model, follow these steps:
1. **Clone the repository**
```bash
git clone https://github.com/fudan-generative-vision/WAM-Diff
cd WAM-Diff
```
2. **Initialize the environment**
If you prefer conda, run the environment setup script to install the necessary dependencies:
```bash
bash init_env.sh
```
Alternatively, you can use uv to create the environment:
```bash
uv venv && uv sync
```
3. **Prepare the Model**
Download the pretrained [WAM-Diff](https://huggingface.co/fudan-generative-ai/WAM-Diff) model from Hugging Face to the `./model/WAM-Diff` directory:
```
https://huggingface.co/fudan-generative-ai/WAM-Diff
```
Download the pretrained [SigLIP 2](https://huggingface.co/google/siglip2-so400m-patch14-384) model from Hugging Face to the `./model/siglip2-so400m-patch14-384` directory:
```
https://huggingface.co/google/siglip2-so400m-patch14-384
```
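Alternatively, here is a minimal scripted download sketch using `huggingface_hub` (assuming the package is installed in your environment); it simply mirrors the two repositories above into `./model/`:
```python
from huggingface_hub import snapshot_download

# Fetch the WAM-Diff checkpoint and the SigLIP 2 vision encoder into ./model/
snapshot_download(repo_id="fudan-generative-ai/WAM-Diff",
                  local_dir="./model/WAM-Diff")
snapshot_download(repo_id="google/siglip2-so400m-patch14-384",
                  local_dir="./model/siglip2-so400m-patch14-384")
```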
4. **Run the demo script**
Execute the demo script to test WAM-Diff on an example image:
```bash
bash inf.sh
```
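If the demo prints the predicted trajectory as a flat comma-separated `x,y` waypoint string (the same format used for the training targets in the Training section below), a minimal parsing sketch could look like the following; `parse_waypoints` is a hypothetical helper, not part of the repository:
```python
# Parse a flat "x1,y1,x2,y2,..." waypoint string into (x, y) pairs.
def parse_waypoints(text: str) -> list[tuple[float, float]]:
    values = [float(v) for v in text.strip().split(",")]
    return list(zip(values[0::2], values[1::2]))

waypoints = parse_waypoints(
    "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,"
    "32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09"
)
print(waypoints)  # 8 (x, y) waypoints covering the next 4 seconds
```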
## Training
To fine-tune WAM-Diff, please follow these steps:
1. **Set Up the Environment**
Follow the same environment setup steps as in the Quick Inference Demo section.
2. **Prepare the Data**
Prepare your training dataset as a JSON file in the following format:
```json
[
  {
    "image": ["path/to/image1.png"],
    "conversations": [
      {
        "from": "human",
        "value": "Here is front views of a driving vehicle:\n\nThe navigation information is: straight\nThe current position is (0.00,0.00)\nCurrent velocity is: (13.48,-0.29) and current accelerate is: (0.19,0.05)\nPredict the optimal driving action for the next 4 seconds with 8 new waypoints."
      },
      {
        "from": "gpt",
        "value": "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09"
      }
    ]
  },
  ...
]
```
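For reference, here is a minimal sketch that assembles one record in this format programmatically; the image path, ego state, and waypoint values are placeholders, and the output filename `train_data.json` is only an example:
```python
import json

# One training sample: a front-view image plus a human prompt and the
# ground-truth answer of 8 future (x, y) waypoints over the next 4 seconds.
record = {
    "image": ["path/to/image1.png"],
    "conversations": [
        {
            "from": "human",
            "value": (
                "Here is front views of a driving vehicle:\n\n"
                "The navigation information is: straight\n"
                "The current position is (0.00,0.00)\n"
                "Current velocity is: (13.48,-0.29) and current accelerate is: (0.19,0.05)\n"
                "Predict the optimal driving action for the next 4 seconds with 8 new waypoints."
            ),
        },
        {
            "from": "gpt",
            # Flat comma-separated x,y pairs: x1,y1,x2,y2,...,x8,y8
            "value": "6.60,-0.01,13.12,-0.03,19.58,-0.04,25.95,-0.03,"
                     "32.27,-0.03,38.56,-0.05,44.88,-0.06,51.16,-0.09",
        },
    ],
}

with open("train_data.json", "w") as f:
    json.dump([record], f, indent=2)
```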
3. **Run the Training Script**
Execute the training script with the following command:
```bash
cd train
bash ./scripts/llada_v_finetune.sh
```
## 📝 Citation
If you find our work useful for your research, please consider citing the paper:
```bibtex
@article{xu2025wam,
  title={WAM-Diff: A Masked Diffusion VLA Framework with MoE and Online Reinforcement Learning for Autonomous Driving},
  author={Xu, Mingwang and Cui, Jiahao and Cai, Feipeng and Shang, Hanlin and Zhu, Zhihao and Luan, Shan and Xu, Yifang and Zhang, Neng and Li, Yaoyi and Cai, Jia and others},
  journal={arXiv preprint arXiv:2512.11872},
  year={2025}
}
```
## 🤗 Acknowledgements
We gratefully acknowledge the contributors to the [LLaDA-V](https://github.com/ML-GSAI/LLaDA-V) repository, whose commitment to open source has provided us with an excellent codebase and pretrained models.