Significant generation degradation and repetition loops when enabling KV-cache for Qwen3-VL

Hi everyone,

I am encountering a serious issue with Qwen3-VL, specifically its KV-cache mechanism. I noticed a significant discrepancy between the generation results when the KV-cache is enabled versus when it is disabled.

When using KV-cache, the generated output quality degrades noticeably, often resulting in severe repetition loops and incoherent sentences. In contrast, the output is normal and high-quality when KV-cache is disabled.

Here is a comparison of the results:

1. With KV-cache (Degraded):

['This is a close-up photograph of a gray tabby a cat with a Scottish Fold cat with a cat, a gray cat with a gray cat, with large, its face, its face, with large, its eyes, its eyes, its eyes, its eyes, and, and, and, and a large, and a ls eyes, its eyes, its eyes, and, and, and, and a large, and a large, and a large, and a large, and a large, and a large, and a, and a, and a, and a, and a, and a, and a, and a, and a, and a, and a, and a, and a, and a, and a']

2. Without KV-cache (Normal):

["This is a close-up, eye-level photograph of a grey cat with a very expressive face. The cat has large, round, golden-yellow eyes that are wide open, giving it a startled or sh Fold cat. The background is blurred, which makes the clear curious look. Its pink tongue is sticking out, and it appears to be licking its nose. The cat's fur is a mix of grey and white, with a distinct pattern of stripes and spots. Its ears are folded forward, a characteristic of a Scottish Fold cat. The background is blurred, which makes the cat the clear focus of the image. The overall mood of the image is playful and endearing.\n\n"]

Code Snippet: Here is a sample of the code I am using:

import torch
from transformers import Qwen3VLForConditionalGeneration as QwenVLForConditionalGeneration

from transformers import AutoProcessor

model_path = "Qwen/Qwen3-VL-2B-Instruct"

def greedy_generate_with_kv_cache(
    inputs: dict,
    max_new_tokens: int = 128,
    device: str = "cuda",
):
    dtype = torch.bfloat16 if device == "cuda" else torch.float32

    model = QwenVLForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=dtype,
    ).to(device)
    model.eval()

    input_ids = inputs["input_ids"]
    attention_mask = inputs.get("attention_mask", torch.ones_like(input_ids))
    static_inputs = {k: v for k, v in inputs.items() if k not in ("input_ids", "attention_mask")}

    eos_token_id = None
    if hasattr(model, "config") and getattr(model.config, "eos_token_id", None) is not None:
        eos_token_id = model.config.eos_token_id

    past_key_values = None
    generated = []

    next_input_ids = input_ids
    for step in range(max_new_tokens):
        with torch.inference_mode():
            if step == 0:
                outputs = model(
                    input_ids=next_input_ids,
                    use_cache=True,
                    past_key_values=None,
                    **static_inputs,
                )
            else:
                outputs = model(
                    input_ids=next_input_ids,
                    use_cache=True,
                    past_key_values=past_key_values,
                )

        logits = outputs.logits[:, -1, :]
        next_token = torch.argmax(logits, dim=-1, keepdim=True)  # greedy
        generated.append(next_token)

        past_key_values = outputs.past_key_values

        attention_mask = torch.cat([attention_mask, torch.ones_like(next_token)], dim=-1)

        next_input_ids = next_token

        if eos_token_id is not None and torch.all(next_token == eos_token_id):
            break

    return torch.cat(generated, dim=-1)

def greedy_generate(
    inputs: dict,
    max_new_tokens: int = 128,
    device: str = "cuda",
):
    dtype = torch.bfloat16 if device == "cuda" else torch.float32

    model = QwenVLForConditionalGeneration.from_pretrained(
        model_path,
        torch_dtype=dtype,
    ).to(device)
    model.eval()

    input_ids = inputs["input_ids"]
    attention_mask = inputs.get("attention_mask", torch.ones_like(input_ids))
    static_inputs = {k: v for k, v in inputs.items() if k not in ("input_ids", "attention_mask")}

    eos_token_id = None
    if hasattr(model, "config") and getattr(model.config, "eos_token_id", None) is not None:
        eos_token_id = model.config.eos_token_id

    generated = []

    next_input_ids = input_ids
    for step in range(max_new_tokens):
        with torch.inference_mode():
            outputs = model(
                input_ids=next_input_ids,
                **static_inputs,
            )

        logits = outputs.logits[:, -1, :]
        next_token = torch.argmax(logits, dim=-1, keepdim=True)  # greedy
        generated.append(next_token)

        attention_mask = torch.cat([attention_mask, torch.ones_like(next_token)], dim=-1)

        next_input_ids = torch.cat([next_input_ids, next_token], dim=-1)

        if eos_token_id is not None and torch.all(next_token == eos_token_id):
            break

    return torch.cat(generated, dim=-1)


def main():
    device = "cuda" if torch.cuda.is_available() else "cpu"

    processor = AutoProcessor.from_pretrained(model_path)

    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": "./cat.png"},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
    )
    inputs = {k: v.to(device) for k, v in inputs.items()}

    tokens_with_cache = greedy_generate_with_kv_cache(inputs, max_new_tokens=128, device=device)
    tokens_wo_cache = greedy_generate(inputs, max_new_tokens=128, device=device)

    text_with_cache = processor.batch_decode(tokens_with_cache, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    text_wo_cache = processor.batch_decode(tokens_wo_cache, skip_special_tokens=True, clean_up_tokenization_spaces=False)
    print(text_with_cache)
    print(text_wo_cache)

if __name__ == "__main__":
    main()

Environment:

  • transformers version: 4.51.3

  • torch version: 2.8.0+cu128

  • Model: Qwen/Qwen3-VL-2B-Instruct

I am unsure if this is caused by an incorrect implementation of the KV-cache on my part or if there is a potential bug in the model/library. Has anyone else experienced this issue?

Any insights would be appreciated. Thanks!


Transformers version mismatch?


This is almost certainly your loop, not “KV-cache makes Qwen3-VL worse”.

For greedy decoding, cached and uncached runs should match token-for-token when the model sees the same effective context. When the cache-enabled run collapses into repetition, it usually means the model is decoding with the wrong positions or the wrong attention mask, so the KV cache is read as if the current token sat at position 0 or 1 forever.

Qwen3-VL is especially sensitive because its multimodal positional scheme relies on “prefill once, then text-only decode with precomputed RoPE deltas”. (Hugging Face Forums)


1) Your environment is out of spec for Qwen3-VL

Qwen’s official Qwen3-VL repo states plainly: Qwen3-VL requires transformers >= 4.57.0. (GitHub)

Your reported transformers==4.51.3 is below that. Even if things “import”, you are very likely running into partially-supported caching APIs or missing model-specific generation glue.

What to do:

  • Upgrade Transformers to a supported release (4.57.1+ is the safe baseline people use in issues, and the Qwen team’s requirement is 4.57.0+). (GitHub)
  • If you want maximum correctness for brand-new model code, note that the HF Qwen3-VL docs page you are likely following tracks the "main" branch and assumes an install-from-source setup. (Hugging Face)

This matters because caching behavior and required kwargs (like cache_position) changed materially across Transformers versions.
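If you want the script itself to fail fast on this, a small version guard at the top works (packaging is already a dependency of transformers, so no extra install):

import transformers
from packaging import version

# Refuse to run on a transformers build older than Qwen3-VL's stated minimum.
assert version.parse(transformers.__version__) >= version.parse("4.57.0"), (
    f"Qwen3-VL needs transformers>=4.57.0, found {transformers.__version__}"
)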


2) The biggest functional bug: you grow attention_mask but you never pass it in cached steps

In your cached loop, after step 0 you call:

outputs = model(input_ids=next_input_ids, use_cache=True, past_key_values=past_key_values)

No attention_mask.

So at step > 0, the model sees:

  • input_ids of shape (B, 1) (one token)
  • and either a default mask of length 1, or mask handling that is inconsistent with the "past + present" layout

But the cache contains the whole prefix. The model needs a mask shaped like (B, past_len + 1) to compute attention correctly.

Hugging Face’s caching docs explicitly warn that in a custom cached loop, your attention mask must match past_kv_length + new_tokens_length. This is normally handled by generate(), but not in your manual loop. (Hugging Face)

You do concatenate attention_mask, but it has no effect because it is never fed back into the model.

Result: attention scores and positional handling become inconsistent. Repetition loops are a common visible symptom.
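As a sketch of the contract, reusing the variable names from your loop (cache_position, covered in the next section, is still missing here), every cached decode step should look roughly like this:

# One text-only decode step: the input is a single token, but the mask must
# cover everything already in the cache plus that token, i.e. (B, past_len + 1).
attention_mask = torch.cat([attention_mask, torch.ones_like(next_token)], dim=-1)
outputs = model(
    input_ids=next_token,              # (B, 1)
    attention_mask=attention_mask,     # (B, past_len + 1)
    past_key_values=past_key_values,
    use_cache=True,
)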


3) The second functional bug: you never pass cache_position

In a manual cache loop, you must also track and pass cache_position. Per the HF docs, cache_position tells the model where the new token sits in the absolute sequence, independent of padding. (Hugging Face)

Why it is extra important for Qwen-VL models:

  • Qwen3-VL (and Qwen2-VL / Qwen2.5-VL) use the pattern “multimodal prefill computes RoPE indices once, decode reuses precomputed RoPE deltas”. (Hugging Face Forums)
  • In decode mode, if cache_position is missing or wrong, the model can apply the wrong “delta” for rotary positions. That is exactly the kind of error that yields incoherence and looping.

You can see the same mechanism clearly in a Qwen2.5-VL bug report: decode uses cache_position[0] + self.rope_deltas and falls back to 0 when cache_position is None. (GitHub)
Qwen3-VL follows the same conceptual split (prefill vs decode with precomputed deltas). (Hugging Face Forums)

So in your cached steps, you are effectively telling the model “this token is always at position 0” (or letting it infer something equally wrong). That is a direct route to degeneration.
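The bookkeeping itself is two lines, the same pattern used in the corrected loop further down (input_ids here is the prefill prompt from your snippet):

# Prefill: the prompt occupies absolute positions 0 .. prompt_len - 1.
cache_position = torch.arange(input_ids.shape[1], device=input_ids.device)

# Each decode step: the single new token sits one position past everything cached.
cache_position = cache_position[-1:] + 1   # shape (1,), value == current past_len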


4) Qwen3-VL-specific detail: images are prefill-only once decoding starts

Even if you did pass pixels again, Qwen3-VL-style models treat images/videos as “prefill-only”; once cache_position is non-zero, the generation path is supposed to be text-only. (Hugging Face Forums)

This is why the correct fix is not “keep passing pixel_values every step”.
The correct fix is: do a correct prefill once (with vision inputs), then do correct decode steps (with cache_position and attention_mask updated).
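One way to keep that split explicit is a tiny helper that separates the processor output once, up front (split_prefill_inputs is just an illustrative name, not an HF API):

def split_prefill_inputs(inputs: dict):
    """Separate per-step text tensors from prefill-only vision tensors."""
    text_keys = ("input_ids", "attention_mask")
    text = {k: inputs[k] for k in text_keys if k in inputs}
    # pixel_values, image_grid_thw, video tensors, etc. are consumed by the
    # multimodal prefill only and must not be passed in later decode steps.
    vision = {k: v for k, v in inputs.items() if k not in text_keys and k != "token_type_ids"}
    return text, vision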


5) Smaller but real issues in your snippet

A) token_type_ids

HF’s Qwen3-VL docs explicitly pop token_type_ids in the example. You should do the same to match the intended usage and avoid silent incompatibilities. (Hugging Face)

B) EOS handling is probably wrong

Qwen3-VL configs can use multiple EOS token ids (a list). An HF thread shows a Qwen3-VL run printing eos_token_id: [151645, 151643]. (Hugging Face Forums)
Your torch.all(next_token == eos_token_id) assumes a scalar id. That can prevent stopping, making loops look worse.
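A small helper makes the stopping check robust to either form (normalize_eos_ids is an illustrative name, not an HF API):

def normalize_eos_ids(model) -> set:
    # Qwen3-VL ships a list of EOS ids; scalar-only handling may never stop.
    eos = getattr(model.generation_config, "eos_token_id", None)
    if eos is None:
        eos = getattr(model.config, "eos_token_id", None)
    if eos is None:
        return set()
    return {eos} if isinstance(eos, int) else set(eos)

With eos_ids = normalize_eos_ids(model) computed once before the loop, the stop check for batch size 1 becomes: if int(next_token.item()) in eos_ids: break.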


6) What to do: three levels of fixes

Level 1 (recommended): use generate() for correctness, then optimize

This removes 90 percent of KV-cache footguns because generate() manages mask growth and cache positions for you. HF’s Qwen3-VL docs show exactly this pattern. (Hugging Face)

If you want greedy decoding (a minimal sketch follows this list):

  • set do_sample=False
  • keep use_cache=True (default for most generation configs)
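A minimal sketch, assuming model, processor, and inputs are built exactly as in your main() (chat template applied, tensors moved to the device):

inputs.pop("token_type_ids", None)   # match the HF Qwen3-VL doc example

output_ids = model.generate(
    **inputs,
    max_new_tokens=128,
    do_sample=False,   # greedy
    use_cache=True,
)
# Drop the prompt tokens so only the newly generated text is decoded.
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True))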

Level 2: keep your manual greedy loop, but implement cache correctly

Hugging Face provides a reference loop with:

  • DynamicCache
  • attention mask concatenation
  • cache_position = cache_position[-1:] + 1 each step (Hugging Face)

Below is a corrected version that matches the HF cache contract and works with multimodal prefill → text-only decode.

# deps:
#   transformers >= 4.57.0 (Qwen3-VL requirement)
#   torch >= 2.1 (you have 2.8)

import torch
from transformers import AutoProcessor, Qwen3VLForConditionalGeneration, DynamicCache

MODEL_ID = "Qwen/Qwen3-VL-2B-Instruct"

@torch.inference_mode()
def greedy_generate_kv_fixed(model, inputs, max_new_tokens=128):
    # Qwen3-VL docs show this can exist and should be removed if present
    inputs.pop("token_type_ids", None)  # see HF Qwen3-VL doc example

    input_ids = inputs["input_ids"]
    attention_mask = inputs.get("attention_mask", torch.ones_like(input_ids))

    # keep vision inputs for prefill only
    static_inputs = {k: v for k, v in inputs.items() if k not in ("input_ids", "attention_mask")}

    # EOS can be list
    eos_ids = getattr(model.generation_config, "eos_token_id", None) or getattr(model.config, "eos_token_id", None)
    if isinstance(eos_ids, int):
        eos_ids = [eos_ids]

    cache = DynamicCache(config=model.config)

    # cache_position for the prefill is [0..prompt_len-1]
    cache_position = torch.arange(input_ids.shape[1], device=input_ids.device, dtype=torch.long)

    # 1) Prefill: full prompt + vision tensors
    outputs = model(
        input_ids=input_ids,
        attention_mask=attention_mask,
        past_key_values=cache,
        cache_position=cache_position,
        use_cache=True,
        **static_inputs,
    )

    generated = []

    # 2) Decode: token-by-token, text-only, but with correct mask and cache_position
    for _ in range(max_new_tokens):
        next_token = outputs.logits[:, -1:, :].argmax(dim=-1)
        generated.append(next_token)

        if eos_ids is not None and next_token.numel() == 1 and int(next_token.item()) in eos_ids:
            break

        # grow attention mask to (past_len + 1)
        attention_mask = torch.cat(
            [attention_mask, attention_mask.new_ones((attention_mask.shape[0], 1))],
            dim=-1,
        )
        # advance cache_position by 1
        cache_position = cache_position[-1:] + 1

        outputs = model(
            input_ids=next_token,
            attention_mask=attention_mask,
            past_key_values=outputs.past_key_values,
            cache_position=cache_position,
            use_cache=True,
        )

    return torch.cat(generated, dim=-1)

Why this fixes your exact failure mode:

  • It passes the full-length attention mask every decode step. Required by HF cache docs. (Hugging Face)
  • It advances and passes cache_position every step. Required by HF cache docs. (Hugging Face)
  • It respects the Qwen3-VL prefill/decode split: vision tensors only in prefill, consistent with how these models are designed. (Hugging Face Forums)

Level 3: debug like a systems engineer (find the first divergence)

If you want to be certain:

  1. Run 5 decode steps in both methods.
  2. Compare logits for the same step.
  3. The first step where logits differ tells you what state you failed to carry.

In your current code, the divergence should happen immediately at step 1 because the cached call has a 1-token attention mask and no cache_position.
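A minimal sketch of that comparison, assuming you modify both loops to also collect the per-step logits into lists (logits_cached and logits_nocache below, one (B, vocab) tensor per step):

for step, (a, b) in enumerate(zip(logits_cached, logits_nocache)):
    max_diff = (a.float() - b.float()).abs().max().item()
    same_token = torch.equal(a.argmax(dim=-1), b.argmax(dim=-1))
    print(f"step {step}: max |logit diff| = {max_diff:.4f}, same greedy token = {same_token}")
    if not same_token:
        print("first divergence -> inspect attention_mask / cache_position at this step")
        break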


7) Is there also a model-side repetition issue?

There is an open Qwen3-VL repo issue reporting “infinite repetition” in some cases. (GitHub)
But your “cache-on breaks, cache-off fine” pattern is much more consistent with incorrect cache bookkeeping than with an inherent model repetition tendency.

Treat it as two separate phenomena:

  • Phenomenon A (your bug): cached loop missing mask and positions → deterministic corruption.
  • Phenomenon B (model behavior): some prompts can cause repetition even in normal generation; mitigated by sampling or penalties.

Fix A first. Then evaluate whether B remains.


8) Minimal checklist for your case

  • Upgrade Transformers to a Qwen3-VL-supported version. (GitHub)

  • In cached decode steps, always pass the full-length attention_mask and the current cache_position. (Hugging Face)

  • Pop token_type_ids like the official Qwen3-VL docs do. (Hugging Face)

  • Handle eos_token_id as a list. (Hugging Face Forums)

  • Prefer generate() unless you have a strong reason to reimplement it. (Hugging Face)


Summary

  • Your cached loop is missing the two required pieces of state: full attention_mask and cache_position. HF caching docs call both mandatory for correct custom loops. (Hugging Face)
  • Qwen3-VL decode depends on precomputed RoPE deltas, so wrong or missing cache_position is especially damaging. (Hugging Face Forums)
  • Your Transformers version is below Qwen3-VL’s stated requirement. Upgrade first. (GitHub)