---
license: apache-2.0
language:
- zh
- en
tags:
- dream-coder
- diffusion
- dlm
---

# Dream-Coder GGUF Q8_0 Quantization Guide

This guide walks through GGUF Q8_0 quantization of the Dream-Coder v0-Instruct-7B model.

## Quick Start

### 1. Environment Setup

```bash
# 1. Clone and build llama.cpp (CMake build; binaries are placed in build/bin/)
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build
cmake --build build --config Release -j$(nproc)

# 2. Install Python dependencies (quote the spec so the shell does not
#    treat '>=' as a redirection)
pip install "transformers>=4.46.2" torch safetensors numpy
```
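
Before converting anything, a quick sanity check like the following may save time. It assumes the CMake layout from step 1 and that you are still inside the llama.cpp directory; `llama-quantize`, `llama-diffusion-cli`, and `convert_hf_to_gguf.py` all ship with llama.cpp:

```bash
# Confirm the binaries were built and the conversion script runs
test -x build/bin/llama-quantize && echo "llama-quantize OK"
test -x build/bin/llama-diffusion-cli && echo "llama-diffusion-cli OK"
python convert_hf_to_gguf.py --help >/dev/null && echo "converter OK"
```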

### 2. Execute Quantization

#### Method 1: Use the provided script

```bash
# Set the llama.cpp path
export LLAMA_CPP_PATH=/path/to/llama.cpp

# Run the quantization script
./quantize_example.sh
```

#### Method 2: Manual execution

```bash
python quantize_dream_q8_0.py \
    --model_path /path/to/Dream-Coder-v0-Instruct-7B \
    --llama_cpp_path /path/to/llama.cpp \
    --output_dir ./gguf_output \
    --keep_f16
```
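
For reference, the script's core pipeline boils down to two standard llama.cpp steps: an HF→GGUF conversion to F16, followed by Q8_0 quantization. A minimal sketch, assuming the stock `convert_hf_to_gguf.py` and `llama-quantize` tools and the `LLAMA_CPP_PATH` variable from Method 1 (the actual script additionally applies the Dream-specific config handling described below):

```bash
# Step 1: convert the HF checkpoint to an F16 GGUF
python $LLAMA_CPP_PATH/convert_hf_to_gguf.py /path/to/Dream-Coder-v0-Instruct-7B \
    --outfile ./gguf_output/dream-coder-7b-f16.gguf \
    --outtype f16

# Step 2: quantize F16 -> Q8_0
$LLAMA_CPP_PATH/build/bin/llama-quantize \
    ./gguf_output/dream-coder-7b-f16.gguf \
    ./gguf_output/dream-coder-7b-q8_0.gguf \
    Q8_0
```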

### 3. Parameter Description

- `--model_path`: Dream-Coder model path (default: current directory)
- `--llama_cpp_path`: llama.cpp project path (required)
- `--output_dir`: Output directory (default: `./gguf_output`)
- `--keep_f16`: Keep the F16 intermediate file

## Architecture Adaptation

### Dream-Coder Special Configuration Handling

The quantization script handles the following Dream-Coder-specific configuration details:

1. **Architecture Mapping**: DreamModel → LlamaForCausalLM (for compatibility)
2. **Special Token IDs**:
   - `mask_token_id`: 151666 (the critical diffusion token)
   - `bos_token_id`: 151665
   - `eos_token_id`: 151643
   - `pad_token_id`: 151643
3. **Model Parameters**:
   - Vocabulary size: 152,064
   - Hidden dimension: 3,584
   - Attention heads: 28 (4 key-value heads)
   - Layers: 28
   - Context length: 32,768
4. **Diffusion Features**:
   - Preserve `mask_token_id` metadata
   - RoPE theta: 1,000,000.0
   - Activation function: SiLU
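
After conversion, you can check that these values actually survived in the GGUF metadata. One way to do this, assuming the `gguf` Python package's `gguf-dump` utility (installed alongside llama.cpp's gguf-py, or via pip):

```bash
# Inspect GGUF metadata for the token IDs and model parameters listed above
pip install gguf
gguf-dump gguf_output/dream-coder-7b-q8_0.gguf \
  | grep -iE 'token|embedding_length|block_count|context_length|rope'
```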

## Output Description

### File Structure
```
gguf_output/
├── dream-coder-7b-f16.gguf    # F16 intermediate file (kept only with --keep_f16)
└── dream-coder-7b-q8_0.gguf   # Final Q8_0 quantized file
```

### Performance Expectations

| Metric | Original (BF16) | Q8_0 |
|--------|-----------------|------|
| Memory Usage | ~14 GB | ~6.7 GB |
| Inference Speed | 1.0x | 1.2-1.5x |
| Precision Loss | 0% | <0.1% |

## Usage

### llama.cpp Command Line

Since Dream-Coder is a diffusion-based model, you need to use the dedicated `llama-diffusion-cli` tool:

```bash
# Basic usage
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    -c 2048 \
    --diffusion-steps 128

# Advanced parameters
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "Write a binary search function" \
    -n 256 \
    -c 2048 \
    --temp 0.1 \
    --top-p 0.95 \
    --repeat-penalty 1.1 \
    --diffusion-steps 128 \
    --diffusion-algorithm 4 \
    --diffusion-alg-temp 0.0 \
    -t 8

# Visualize the generation process
./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def fibonacci(n):" \
    -n 256 \
    --diffusion-steps 64 \
    --diffusion-visual
```

#### Diffusion Parameter Description

- `--diffusion-steps N`: Number of diffusion denoising steps (default: 128; see the sweep sketch after this list)
- `--diffusion-algorithm N`: Algorithm selection:
  - 0 = ORIGIN (original algorithm)
  - 1 = ENTROPY_BASED (entropy-based)
  - 2 = MARGIN_BASED (margin-based)
  - 3 = RANDOM (random)
  - 4 = LOW_CONFIDENCE (low confidence, default)
- `--diffusion-alg-temp F`: Algorithm temperature (default: 0.0)
- `--diffusion-visual`: Enable visualization mode and show generation progress
- `--diffusion-eps F`: Time-step epsilon value
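
The steps/quality trade-off (see note 6 under Important Notes) is easy to measure empirically. A minimal sweep sketch, assuming the binary and model paths used above:

```bash
# Time the same prompt at several diffusion step counts to find a good
# quality/latency trade-off for your hardware
for STEPS in 32 64 128 256; do
  echo "--- diffusion-steps=$STEPS ---"
  time ./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 128 \
    --diffusion-steps $STEPS \
    --temp 0.1
done
```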

### Python (llama-cpp-python)

```bash
pip install llama-cpp-python
```

```python
from llama_cpp import Llama

# Load the model
llm = Llama(
    model_path="gguf_output/dream-coder-7b-q8_0.gguf",
    n_ctx=2048,
    n_threads=8,
    n_gpu_layers=0  # CPU inference; set >0 to enable GPU acceleration
)

# Generate code
output = llm(
    "def fibonacci(n):",
    max_tokens=512,
    temperature=0.1,
    top_p=0.95,
    repeat_penalty=1.1
)

print(output['choices'][0]['text'])
```

Note that `llama-cpp-python` performs standard autoregressive decoding and does not expose the diffusion sampler, so results may differ from `llama-diffusion-cli` (see Important Notes below).

### With GPU Acceleration

If llama.cpp is built with CUDA support:

```bash
# Build the CUDA version (GGML_CUDA replaces the older LLAMA_CUBLAS flag)
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

# Use GPU acceleration (offload part of the layers)
./build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def quicksort(arr):" \
    -n 512 \
    --diffusion-steps 128 \
    -ngl 20  # Number of GPU layers
```

## Troubleshooting

### Common Issues

1. **Conversion Failure**:
   - Ensure llama.cpp is compiled correctly
   - Check Python dependency versions
   - Verify model file integrity

2. **Quantization Failure**:
   - Check disk space (~20 GB of temporary space is needed; see the preflight check after this list)
   - Ensure sufficient memory (32 GB+ recommended)

3. **Inference Errors**:
   - Verify GGUF file integrity
   - Check context length settings
   - Try reducing `n_gpu_layers`
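
A quick preflight for issue 2 above, using standard GNU coreutils (the 20 GB threshold mirrors the note in that item):

```bash
# Check free disk space and available memory before quantizing
df -h .
free -h

# Rough guard: warn if the current filesystem has less than 20 GB free
AVAIL_KB=$(df --output=avail -k . | tail -n 1 | tr -d ' ')
if [ "$AVAIL_KB" -lt $((20 * 1024 * 1024)) ]; then
  echo "WARNING: less than 20 GB of free disk space"
fi
```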

### Model Validation

```bash
# File integrity check
ls -lh gguf_output/dream-coder-7b-q8_0.gguf

# Simple inference test (llama-diffusion-cli takes its prompt via -p)
./llama.cpp/build/bin/llama-diffusion-cli -m gguf_output/dream-coder-7b-q8_0.gguf -p "def hello():" -n 20 --diffusion-steps 64
```
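
To eyeball quantization quality, the F16 intermediate (if kept with `--keep_f16`) and the Q8_0 file can be run on the same prompt and compared side by side. A minimal sketch:

```bash
# Compare F16 vs Q8_0 output on an identical prompt and settings
for VARIANT in f16 q8_0; do
  echo "=== $VARIANT ==="
  ./llama.cpp/build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-$VARIANT.gguf \
    -p "def is_prime(n):" \
    -n 64 \
    --diffusion-steps 64 \
    --temp 0.1
done
```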

## Performance Optimization

### CPU Optimization
- Use the `-t` parameter to set the thread count (see the tuning sketch after this section)
- Enable AVX2/AVX512 compile options
- Adjust the batch size (`-b` parameter)

### GPU Optimization
- Use a CUDA/OpenCL build
- Adjust the GPU layer count (`-ngl`)
- Monitor GPU memory usage

### Memory Optimization
- Memory mapping is enabled by default; pass `--no-mmap` to disable it
- Use `--mlock` to pin the model in RAM and avoid swapping
- Set an appropriate context length
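
As an illustration of the CPU bullets above, assuming a CMake build (`GGML_NATIVE` is a llama.cpp build option that enables the host CPU's instruction set, including AVX2/AVX512 where supported):

```bash
# Build for the host CPU's native instruction set
cmake -B build -DGGML_NATIVE=ON
cmake --build build --config Release -j$(nproc)

# Then sweep thread counts to find the sweet spot for your machine
for T in 4 8 16; do
  echo "--- threads=$T ---"
  time ./build/bin/llama-diffusion-cli \
    -m gguf_output/dream-coder-7b-q8_0.gguf \
    -p "def hello():" -n 64 --diffusion-steps 64 -t $T
done
```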

## Important Notes

1. **Diffusion Generation**: Dream-Coder generates via diffusion, unlike traditional autoregressive models
2. **Dedicated Tool**: You must use `llama-diffusion-cli` rather than the regular `llama-cli` tool
3. **Special Tokens**: Keep `mask_token_id` (151666) handled correctly
4. **Context Length**: Up to 32K tokens is supported, but 2K-4K is recommended for best performance
5. **Generation Parameters**: A low temperature (0.1-0.3) and moderate top_p (0.9-0.95) are recommended
6. **Diffusion Steps**: 64-128 steps are recommended; more steps may improve quality at the cost of inference time

## Technical Support

If you run into issues, check:
1. llama.cpp version and build status
2. Python dependency version compatibility
3. Model file integrity
4. System resources (memory/disk)

For more information, see:
- [llama.cpp GitHub](https://github.com/ggerganov/llama.cpp)
- [GGUF Format Documentation](https://github.com/ggerganov/ggml/blob/master/docs/gguf.md)