Quantization
This model was quantized from nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16 using NVIDIA Model Optimizer: weights are quantized to NVFP4 and the KV cache to FP8.
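For context, the sketch below shows the general shape of post-training quantization with Model Optimizer's mtq.quantize. It is a minimal illustration, not this repository's actual recipe: the config name NVFP4_DEFAULT_CFG, the calibration texts, and the loading details are assumptions; consult the modelopt documentation for the exact API of your installed version.

# Hedged sketch of NVFP4 post-training quantization with nvidia-modelopt.
# NVFP4_DEFAULT_CFG and the calibration data are assumptions, not the
# recipe actually used for this checkpoint.
import torch
import modelopt.torch.quantization as mtq
from transformers import AutoModelForCausalLM, AutoTokenizer

base = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
# Device placement / dtype handling for a 30B model omitted for brevity.
model = AutoModelForCausalLM.from_pretrained(base)
tok = AutoTokenizer.from_pretrained(base)

# Tiny placeholder calibration set; a real run uses representative data.
calib_batches = [tok(t, return_tensors="pt") for t in
                 ["The quick brown fox jumps over the lazy dog."]]

def forward_loop(model):
    # Run calibration data through the model so Model Optimizer can
    # collect the activation statistics it needs to choose scales.
    with torch.no_grad():
        for batch in calib_batches:
            model(**batch)

# Quantize weights to NVFP4. FP8 KV-cache quantization is configured
# separately (in the quantization config or at serve time).
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)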
Serving with vLLM
Use vLLM version 0.12.0 or later and set the following environment variables before serving the model.
export VLLM_USE_FLASHINFER_MOE_FP4=1
export VLLM_FLASHINFER_MOE_BACKEND="throughput"
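For a quick sanity check, the same configuration can be exercised from vLLM's offline Python API; the prompt below is illustrative, and the environment variables must be set before vllm is imported.

# Minimal offline-inference sketch with vLLM's Python API (>= 0.12.0).
# Setting the variables in-process mirrors the exports above.
import os
os.environ["VLLM_USE_FLASHINFER_MOE_FP4"] = "1"
os.environ["VLLM_FLASHINFER_MOE_BACKEND"] = "throughput"

from vllm import LLM, SamplingParams

llm = LLM(model="rahtml/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4")
out = llm.generate(["Explain NVFP4 quantization in one sentence."],
                   SamplingParams(max_tokens=64))
print(out[0].outputs[0].text)

For an OpenAI-compatible server, running vllm serve rahtml/NVIDIA-Nemotron-3-Nano-30B-A3B-NVFP4 in a shell where the two variables above are exported works equivalently.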