QuantLRM-R1-Qwen3-8B-3-bit
3-bit quantized DeepSeek-R1-0528-Qwen3-8B produced with QuantLRM (Quantization of Large Reasoning Models via Fine-Tuning Signals), a state-of-the-art quantization method for large reasoning models guided by fine-tuning signals.
Model Details
This is the pseudo-quantized model: the weights are quantized to 3 bits and then dequantized back to full precision, which makes the checkpoint directly loadable in vLLM, the recommended way to run inference. To obtain the real quantized version, please refer to our GitHub repo; an existing CUDA kernel is used to support inference of 4-bit real-quantized models.
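For intuition, pseudo-quantization rounds each weight to the 3-bit grid and immediately maps it back to full precision, so the checkpoint carries the quantization error but remains an ordinary full-precision model. A minimal sketch of group-wise asymmetric pseudo-quantization, assuming group size 128 as used below (the helper name and details are illustrative, not the repo's API):

import torch

def pseudo_quantize(w: torch.Tensor, n_bit: int = 3, group_size: int = 128) -> torch.Tensor:
    # Quantize then dequantize group-wise; assumes w.numel() % group_size == 0.
    shape = w.shape
    w = w.reshape(-1, group_size)                   # one scale/zero point per group
    w_min = w.amin(dim=1, keepdim=True)
    w_max = w.amax(dim=1, keepdim=True)
    q_max = 2 ** n_bit - 1                          # 3 bits -> integer codes 0..7
    scale = (w_max - w_min).clamp(min=1e-5) / q_max
    zero = (-w_min / scale).round()
    q = (w / scale + zero).round().clamp(0, q_max)  # integer codes
    return ((q - zero) * scale).reshape(shape)      # back to full precision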
Model Description
- Developed by: Nan Zhang ([email protected])
- Model type: 3-bit pseudo-quantized version of DeepSeek-R1-0528-Qwen3-8B
- Base model: deepseek-ai/DeepSeek-R1-0528-Qwen3-8B
Model Sources
- Repository: https://github.com/psunlpgroup/QuantLRM
- Paper: https://arxiv.org/abs/2602.02581
Uses
This model is designed to be served with vLLM, which provides optimized inference. Please use the tokenizer of deepseek-ai/DeepSeek-R1-0528-Qwen3-8B.
Sample Usage
To use this model, you can follow the steps below from the QuantLRM GitHub repository.
First, compute input channel importance scores:
python compare_weight_matrix.py
python quadratic_mapping.py # supports processing weight updates on GPU
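QuantLRM derives importance from fine-tuning signals, i.e., from how the weights moved between a base model and its reasoning-fine-tuned counterpart. As a rough, illustrative sketch only (the actual logic lives in the two scripts above), a per-input-channel score could be read off the weight-update magnitude:

import torch

def channel_importance(w_base: torch.Tensor, w_ft: torch.Tensor) -> torch.Tensor:
    # w_* have shape [out_features, in_features]; returns one score per input channel.
    # Illustrative stand-in for compare_weight_matrix.py / quadratic_mapping.py.
    return (w_ft - w_base).abs().mean(dim=0)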
Then, run the quantization pipeline to search for the optimal scales:
python -m awq.entry --model_path /PATH/TO/LRM \
--w_bit 3 --q_group_size 128 --run_awq --dump_awq QuantLRM_cache/R1-Qwen3-8B-w3-g128.pt
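The entry point builds on the AWQ codebase, and --run_awq performs a grid search over per-channel scaling exponents, keeping the scales that minimize the layer's output reconstruction error after pseudo-quantization. A simplified sketch of that search (reusing the pseudo_quantize helper sketched earlier; x is a batch of calibration activations):

import torch

def search_scale(w: torch.Tensor, x: torch.Tensor, importance: torch.Tensor, n_grid: int = 20) -> torch.Tensor:
    # Try scales s = importance**alpha for alpha in [0, 1) and keep the best.
    best_err, best_s = float("inf"), None
    for i in range(n_grid):
        alpha = i / n_grid
        s = importance.clamp(min=1e-4) ** alpha
        s = s / (s.max() * s.min()).sqrt()              # keep scales centered around 1
        w_q = pseudo_quantize(w * s) / s                # scale up, quantize, scale back
        err = ((x @ w.T) - (x @ w_q.T)).pow(2).mean()   # output reconstruction error
        if err < best_err:
            best_err, best_s = err, s
    return best_s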
For inference with the pseudo-quantized model using vLLM:
python -m awq.entry --model_path /PATH/TO/LRM \
--w_bit 3 --q_group_size 128 \
--load_awq QuantLRM_cache/R1-Qwen3-8B-w3-g128.pt \
--q_backend fake --dump_fake models/R1-Qwen3-8B-w3-g128
CUDA_VISIBLE_DEVICES=0 python inference_vllm.py
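inference_vllm.py ships with the repo; a minimal sketch of equivalent vLLM usage, pointing at the dumped checkpoint and the DeepSeek tokenizer (prompt and sampling settings are illustrative):

from vllm import LLM, SamplingParams

llm = LLM(
    model="models/R1-Qwen3-8B-w3-g128",
    tokenizer="deepseek-ai/DeepSeek-R1-0528-Qwen3-8B",
)
params = SamplingParams(temperature=0.6, max_tokens=4096)
outputs = llm.generate(["Prove that the square root of 2 is irrational."], params)
print(outputs[0].outputs[0].text)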
Calibration Data
We use the default calibration set of QuantLRM (mit-han-lab/pile-val-backup) to obtain this model.
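For reproduction, the calibration set can be pulled straight from the Hugging Face Hub (a sketch; the quantization entry point loads it internally):

from datasets import load_dataset

calib = load_dataset("mit-han-lab/pile-val-backup", split="validation")
texts = [ex["text"] for ex in calib.select(range(128))]  # small calibration sample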
Results
This model achieves a 1.65% improvement over the best 3-bit quantization baseline on DeepSeek-R1-0528-Qwen3-8B, based on the average score across various reasoning benchmarks (Table 2 of the QuantLRM paper).
Citation
BibTeX:
@misc{zhang2026quantlrmquantizationlargereasoning,
title={QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals},
author={Nan Zhang and Eugene Kwek and Yusen Zhang and Muyu Pan and Suhang Wang and Prasenjit Mitra and Rui Zhang},
year={2026},
eprint={2602.02581},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2602.02581},
}
APA:
Zhang, N., Kwek, E., Zhang, Y., Pan, M., Wang, S., Mitra, P., & Zhang, R. (2026). QuantLRM: Quantization of Large Reasoning Models via Fine-Tuning Signals. arXiv preprint arXiv:2602.02581.
Model Card Author
Nan Zhang
Model Card Contact
Nan Zhang ([email protected])