NuExtract-2.0-8B-FP8-Dynamic

Quantization

Quantized with llm-compressor v0.9.0.1.

We used the original qwen2.5-vl example compression script and adapted it to an FP8-Dynamic compression recipe.
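
For reference, here is a minimal sketch of what the adapted script can look like, assuming the llm-compressor oneshot flow with the FP8_DYNAMIC scheme (which needs no calibration data). The base checkpoint name, model class, and the ignore patterns for the vision tower follow the upstream qwen2.5-vl example and are assumptions here, not the exact script that produced this repo:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "numind/NuExtract-2.0-8B"  # assumed source checkpoint

# Load the full-precision model and processor (the serve command below uses
# --trust-remote-code, so mirror that here).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype="auto", trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

# FP8-Dynamic: weights are quantized offline, activations per-token at runtime,
# so no calibration dataset is needed.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="FP8_DYNAMIC",
    ignore=["re:.*lm_head", "re:visual.*"],  # keep lm_head and the vision tower in BF16
)
oneshot(model=model, recipe=recipe)

SAVE_DIR = "NuExtract-2.0-8B-FP8-Dynamic"
model.save_pretrained(SAVE_DIR, save_compressed=True)
processor.save_pretrained(SAVE_DIR)
```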

vLLM inference

```bash
docker run --rm --name 'NuExtract-2.0-8B' \
  -e HF_TOKEN \
  -v '/srv/cache:/root/.cache' \
  -p 8000:8000 \
  -e LD_LIBRARY_PATH='/lib/x86_64-linux-gnu:/usr/local/cuda/lib64' \
  'vllm/vllm-openai:v0.15.1-cu130' \
  'ig1/NuExtract-2.0-8B-FP8-Dynamic' \
  --served-model-name 'NuExtract-2.0-8B' \
  --trust-remote-code \
  --limit-mm-per-prompt '{"image": 6, "video": 0}' \
  --chat-template-content-format 'openai' \
  --max-model-len 'auto' \
  --kv-cache-memory-bytes '7G'
```
  • -e LD_LIBRARY_PATH='/lib/x86_64-linux-gnu:/usr/local/cuda/lib64' is only needed if your host has a recent driver version (with native CUDA 13.0 or 13.1). See #32373 for more info.
  • Adapt /srv/cache to your liking; it will contain all the cache data you want to keep for faster startups:
    • directories like huggingface, torch, vllm, flashinfer, etc.
  • To avoid eating up all the GPU VRAM, --kv-cache-memory-bytes '7G' is set (it leaves enough KV cache for roughly 1.02x the maximum model length; see the back-of-the-envelope check after this list).
    • Feel free to adjust it (or remove the flag and switch back to --gpu-memory-utilization 0.9) to grow or shrink the KV cache to your liking.
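
As a rough sanity check of the 1.02x figure, here is a back-of-the-envelope KV cache calculation, assuming the Qwen2.5-VL-7B backbone's attention geometry (28 decoder layers, 4 GQA key/value heads, head dim 128), a BF16 KV cache (FP8-Dynamic does not quantize the KV cache), treating '7G' as 7 GiB, and a 128k default context:

```python
# Rough KV-cache sizing check (assumed Qwen2.5-VL-7B backbone values).
num_layers = 28       # decoder layers
num_kv_heads = 4      # GQA key/value heads
head_dim = 128        # per-head dimension
kv_dtype_bytes = 2    # BF16 KV cache

bytes_per_token = 2 * num_layers * num_kv_heads * head_dim * kv_dtype_bytes  # K and V
kv_budget = 7 * 1024**3            # --kv-cache-memory-bytes '7G', read as 7 GiB

tokens = kv_budget // bytes_per_token
print(bytes_per_token)             # 57344 bytes, about 56 KiB per token
print(tokens)                      # about 131072 tokens fit in the budget
print(round(tokens / 128_000, 2))  # about 1.02x a 128k max model length
```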

Check the original project README for how to use the OpenAI-style chat template in requests.
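
As a hypothetical sketch, a request against the server started above using the OpenAI Python client could look like the following. The image URL and extraction template are placeholders, and passing the template through vLLM's chat_template_kwargs is an assumption; the upstream NuExtract-2.0 README is authoritative on how the template must be supplied:

```python
from openai import OpenAI

# Point the client at the vLLM OpenAI-compatible server started above.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

# Placeholder extraction template; see the upstream NuExtract-2.0 README
# for the exact template format and how it is injected into the prompt.
template = '{"store_name": "verbatim-string", "total": "number"}'

response = client.chat.completions.create(
    model="NuExtract-2.0-8B",  # matches --served-model-name
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/receipt.jpg"}},
        ],
    }],
    temperature=0.0,
    # Assumption: the chat template accepts the extraction template as a kwarg.
    extra_body={"chat_template_kwargs": {"template": template}},
)
print(response.choices[0].message.content)
```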
