Multimodal Prefix Caching with Qwen3-VL

Hello everyone,

I am experimenting with prefix caching in Qwen3-VL for multimodal prompts. Specifically, I have a prefix prompt that includes demo images, which I want to cache and later reuse with a separate multimodal query prompt.

Here’s a simple example setup (see code snippet below):

  • The prefix contains a cached image.

  • The query prompt contains another image and a text prompt.

However, the model fails to process the image from the query prompt correctly. I suspect this is related to how position embeddings are handled when caching is used. While prefix caching works fine for text-only prompts, it seems that for image inputs, Qwen3-VL’s multimodal RoPE is not being applied correctly. As a result, the model cannot distinguish between the two images.

I’ve noticed similar behavior in other MLLMs, such as Gemma-3, where the model responds by stating that the two images are identical.

Has anyone encountered this issue or found a workaround for properly using prefix caching with multimodal prompts?

model_id = "Qwen/Qwen3-VL-8B-Instruct"
    model = Qwen3VLForConditionalGeneration.from_pretrained(
        model_id,
        dtype=torch.bfloat16,
        attn_implementation="flash_attention_2",
        device_map="auto"
    )
    processor = AutoProcessor.from_pretrained(model_id)

    messages = [{"role": "user", "content": [{"type": "image", "image": img1}]}]


    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=False,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device)
    
    with torch.no_grad():
        prefix_output = model(**inputs, use_cache=True)

    prompt_cache = prefix_output.past_key_values
    
    # Now create the full prompt with both image and text
    prompt = "Compare the images."
    messages = [{"role": "user", "content": [
        {"type": "image", "image": img2},
        {"type": "text", "text": prompt}
    ]}]
    
    new_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device)

    # Manually set cache_position
    cache_length = prompt_cache[0][0].shape[2]
    new_seq_length = new_inputs["input_ids"].shape[1]
    cache_position = torch.arange(cache_length, cache_length + new_seq_length, device=model.device)

    # Copy the cache so the original prefix cache is not mutated during generation
    past_key_values = copy.deepcopy(prompt_cache)

    outputs = model.generate(
        **new_inputs,
        past_key_values=past_key_values,
        use_cache=True,
        cache_position=cache_position,
        max_new_tokens=512,
    )
    
    print(processor.batch_decode(outputs, skip_special_tokens=True)[0])

Seems like a current limitation. Or rather, just a bug…?


This is a known limitation/bug pattern in current Qwen-VL and similar MLLMs (including Gemma-3). What you are attempting – “cached multimodal prefix (with image A) + new multimodal query (with image B) in a later call using past_key_values” – is essentially not supported by the shipped Hugging Face implementations. The failure you see is not a simple bug in your code; it is how the models’ caching and multimodal RoPE are wired.

Below is a detailed breakdown of why, plus realistic workarounds and what would be required to “really” fix it.


1. What you are trying to do (rephrased)

Your experiment:

  1. First call (prefix):

    messages = [{"role": "user", "content": [{"type": "image", "image": img1}]}]
    
    inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=False,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device)
    
    with torch.no_grad():
        prefix_output = model(**inputs, use_cache=True)
    
    prompt_cache = prefix_output.past_key_values
    

    This produces a KV cache that encodes a multimodal prompt with image 1.

  2. Second call (query):

    prompt = "Compare the images."
    messages = [{"role": "user", "content": [
        {"type": "image", "image": img2},
        {"type": "text", "text": prompt}
    ]}]
    
    new_inputs = processor.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt"
    ).to(model.device)
    
    cache_length = prompt_cache[0][0].shape[2]
    new_seq_length = new_inputs["input_ids"].shape[1]
    cache_position = torch.arange(cache_length,
                                  cache_length + new_seq_length,
                                  device=model.device)
    
    outputs = model.generate(
        **new_inputs,
        past_key_values=prompt_cache,
        use_cache=True,
        cache_position=cache_position,
        max_new_tokens=512,
    )
    

Intended semantics:

  • Treat the first call as a cached demo prefix containing img1.
  • Treat the second call as “demo prefix (from cache) + new image img2 + text prompt”, without recomputing the demo.

Observed behavior:

  • The model largely ignores img2 or treats img1 and img2 as identical.
  • This matches what you see in Gemma-3 (second image effectively ignored).

This behavior is exactly what one would expect given how Qwen-VL-style models handle multimodal RoPE and caching.


2. How Qwen-VL-style models actually handle multimodal + caching

2.1 Prefill vs decode phases

Internally, Qwen2.5-VL and Qwen3-VL distinguish two phases:

  1. Prefill phase (first call of a generation):

    • KV cache is empty or cache_position starts at 0.
    • The model ingests the entire input sequence: all text, all <image> / <video> tokens, and corresponding pixel_values.
    • The multimodal RoPE logic computes spatial-temporal indices for all vision tokens based on image_grid_thw / video_grid_thw and the attention mask, and stores them in something like rope_deltas.
      In Qwen2.5-VL, this is explicit: “calculate RoPE index once per generation in the pre-fill stage only”, and the code calls get_rope_index(...) only in that branch. (Hugging Face)
  2. Decode phase (subsequent steps with non-empty cache):

    • New tokens are appended on top of the cached KVs.
    • RoPE indices are not recomputed from images; instead, the model reuses the stored rope_deltas and advances a 1-D position using cache_position. (Hugging Face)
    • The implementation assumes no new images appear in this phase. It is effectively “text-only decode”.

This “multimodal prefill + text decode” design shows up throughout the Qwen-VL code and documentation. The Qwen3-VL docs describe a multimodal encoder that converts images/videos into a patch grid and merges them into the token sequence, with position handling done in a prefill step before text-only decoding. (Hugging Face)
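
In code terms, the split looks roughly like this. This is a heavily simplified paraphrase of the Qwen2.5-VL forward pass, not the verbatim upstream source; make_1d_positions is a stand-in name for the actual delta arithmetic:

# Heavily simplified paraphrase of the position-id branch (not verbatim):
if (cache_position is not None and cache_position[0] == 0) or self.rope_deltas is None:
    # Prefill: compute 3-D (temporal / height / width) positions for the whole
    # sequence from the image/video grids and remember the per-sequence offset.
    position_ids, rope_deltas = self.get_rope_index(
        input_ids, image_grid_thw, video_grid_thw, attention_mask
    )
    self.rope_deltas = rope_deltas
else:
    # Decode: assume a text-only continuation; advance a plain 1-D position
    # from cache_position plus the stored rope_deltas. The image grids are not
    # consulted here, so a new image cannot receive correct spatial positions.
    # (make_1d_positions stands in for the real delta arithmetic.)
    position_ids = make_1d_positions(cache_position, self.rope_deltas)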

2.2 prepare_inputs_for_generation and dropping pixel_values

On the Hugging Face side, generation is driven by prepare_inputs_for_generation. For Qwen2-VL and Qwen2.5-VL, there is a critical snippet (paraphrased from the upstream and from a user patch):

  • If cache_position[0] != 0, then set pixel_values = None and pixel_values_videos = None. (Hugging Face)

This gate means:

  • As soon as decoding starts (non-zero cache_position), the model is forbidden from seeing images/videos in generate(). They are forcibly removed from the inputs, exactly to enforce the “one multimodal prefill, then text only” pattern.
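
Spelled out as code, the gate is roughly the following (a paraphrased sketch of the Qwen2-VL / Qwen2.5-VL override, not the verbatim upstream method):

# Paraphrased sketch, not verbatim upstream code:
def prepare_inputs_for_generation(self, input_ids, cache_position=None, **kwargs):
    model_inputs = super().prepare_inputs_for_generation(
        input_ids, cache_position=cache_position, **kwargs
    )
    if cache_position[0] != 0:
        # Everything after the start of prefill is treated as text-only decode:
        # image/video features never reach the forward pass.
        model_inputs["pixel_values"] = None
        model_inputs["pixel_values_videos"] = None
    return model_inputs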

Qwen3-VL reuses the same conceptual pattern (prefill vs decode with RoPE computed once). The exact gating around pixel values may differ in details, but the key assumptions are the same:

  • Images and videos are supposed to be processed in prefill.
  • Decode with non-empty KV cache is assumed to be text only.

3. What this means for your two-image prefix-cache setup

Now apply the above to your code.

3.1 First call (prefix with img1)

  • past_key_values is None, so the model is in prefill mode.

  • All of the following happen:

    • The vision encoder converts img1 into a set of patch embeddings.
    • Those embeddings are injected at the <image> placeholder in the token sequence.
    • mRoPE / visual RoPE indices for the entire sequence (text + vision) are computed from scratch.
    • The model runs the transformer over that sequence and returns a KV cache that encodes “prefix with image 1”.

This is the only point where the vision pathway and its RoPE indices are properly set up for that image.

3.2 Second call (new image img2 + text, with past_key_values)

In the second call:

  • You pass past_key_values=prompt_cache and a non-zero cache_position.

  • generate() interprets this as decode mode:

    • KV cache is non-empty.
    • cache_position starts at cache_length.

In decode mode, for Qwen-VL-style models:

  • Multimodal RoPE is not recomputed from image_grid_thw / video_grid_thw; it assumes images were already fully handled in prefill. (Hugging Face)
  • In Qwen2-VL / Qwen2.5-VL, prepare_inputs_for_generation explicitly clears pixel_values when cache_position[0] != 0. (Hugging Face)
  • Qwen3-VL follows the same conceptual split and uses precomputed RoPE deltas in decode. (Hugging Face)

So, depending on the exact version you run, one of two things happens:

  1. Strict behavior (like Qwen2-VL / Gemma-3)

    • prepare_inputs_for_generation sees cache_position[0] > 0 and drops pixel_values.

    • The second image is not passed into the model at all.

    • The model sees only:

      • the cached prefix (with img1) in KVs, and
      • the new text tokens “Compare the images.”.
  2. Looser behavior but still prefill-only RoPE

    • Even if pixel_values for img2 are not dropped, the RoPE code does not recompute spatial indices for a new image in decode.
    • The new visual tokens get inconsistent or meaningless positions with respect to the stored rope_deltas.
    • Practically, the model cannot treat img2 as a clean second image segment; its attention to that vision content is broken.

In either case, the model is not actually reasoning over a “fresh” img2 alongside img1, so:

  • When asked “Compare the images”, it either:

    • only sees one image and answers as if they are identical or trivial, or
    • returns something noisy/incoherent.

Exactly analogous behavior is documented for Gemma-3:

  • In the Gemma-3 issue, prepare_inputs_for_generation is shown to forward pixel_values only when cache_position[0] == 0, which makes prefix caching with a later image turn impossible. (GitHub)

So your Qwen3-VL and Gemma-3 experiments are hitting the same architectural assumption: images are prefill-only.
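
To see which tensors are at stake, you can inspect what the processor produced for the second call (using the variables from the snippet at the top of the thread). In the strict case, these image tensors exist but never make it past prepare_inputs_for_generation:

# The second call's processor output does contain the image tensors;
# in the strict case above they are discarded once decoding starts.
print({k: tuple(v.shape) for k, v in new_inputs.items() if hasattr(v, "shape")})
# Expect keys such as: input_ids, attention_mask, pixel_values, image_grid_thw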


4. Evidence that this is a known limitation

Several public issues match your situation very closely:

  1. Prefix caching with Qwen3-VL & multimodal demos
    A recent Reddit post describes a setup where each query has a fixed multimodal demo block (images + text) and a variable query, using Qwen3-VL with HF backend and manual prefix caching. It reports that behavior diverges once the prefix gets long but confirms that multimodal prefix caching is fragile and not officially supported. (Reddit)

  2. Qwen2-VL / Qwen2.5-VL: images dropped when caching
    A Hugging Face user patch explicitly shows that the upstream modeling_qwen2_vl.py sets pixel_values and pixel_values_videos to None whenever cache_position[0] != 0. (Hugging Face)

  3. vLLM bug: --enable-prefix-caching + Qwen2-VL + second image
    In vLLM issue #8296, the reporter states that enabling prefix caching for Qwen2-VL causes errors when the service processes a second image and notes that this parameter is currently incompatible with multimodal large models. (GitHub)

  4. Gemma-3: identical pattern
    The Gemma-3 issue demonstrates that prefix caching with a later image turn is impossible because prepare_inputs_for_generation only forwards pixel_values at cache_position[0] == 0. After caching a text prefix, any subsequent turn that includes image tokens is processed without image features. (GitHub)

Taken together, these sources show that:

  • Multimodal prefix caching is not just “unfinished”; the current implementations explicitly gate against it in the generation path.

5. What does work today (recommended patterns)

Given the above, the following usage patterns are realistic and robust.

5.1 Put all images in a single prefill (no cross-call image caching)

For true visual comparisons, the only “clean” pattern right now is:

  • Put all images that need to be jointly reasoned about into a single prompt.

  • Call generate() once, so that:

    • The model sees img1 and img2 together in a single prefill.
    • mRoPE is computed once over the combined layout.
  • Do not try to reuse KV cache from a past call that already had images.

Conceptually:

messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": img1},
        {"type": "image", "image": img2},
        {"type": "text", "text": "Compare these two images."},
    ],
}]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, use_cache=True)

This:

  • Allows the model to actually see and compare two images.
  • Avoids the decode-phase “no pixel_values” restriction entirely, because images are only used in the prefill.

If deployment speed is the main concern:

  • Use engine-level prefix caching that operates on the full tokenized prompt before it enters the model (e.g. vLLM’s internal prompt cache), once there is stable support for Qwen3-VL. But that must be a cache over the whole sequence including both images, not a KV reuse across different images.

5.2 Cache only pure text; always re-encode images

Another safe compromise:

  • Restrict the KV cache to a text-only prefix:

    • System description.
    • Task instructions.
    • Possibly textual few-shot examples.
  • For every query:

    • Use the text prefix again (can be cached as KVs or via engine-side prompt cache).
    • Add all images for that query (e.g. both demo and query images) fresh in the same prefill.

This does not avoid the cost of the vision encoder for demo images, but it keeps caching semantics safe:

  • Only the text is reused in past_key_values.
  • Images and multimodal RoPE are recomputed per query as intended.
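
A minimal sketch of this pattern is below. TEXT_PREFIX and build_query are illustrative names, not library APIs; the point is that the reusable part is pure text at the very start of every request, while the images always sit in the fresh part of the prefill:

# Keep an identical text-only instruction block at the start of every request,
# so engine-level prefix caching (or a KV cache over just that text) can reuse
# it, and always place the per-query images after it in the same prefill.
TEXT_PREFIX = (
    "You are a visual inspector. Compare the query image against the task "
    "instructions below and answer concisely.\n"
)

def build_query(image, question):
    return [
        {"role": "system", "content": [{"type": "text", "text": TEXT_PREFIX}]},
        {"role": "user", "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": question},
        ]},
    ]

inputs = processor.apply_chat_template(
    build_query(img2, "Does this image meet the criteria?"),
    tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)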

5.3 Summarize demo images to text once, then cache the text

If the main motivation is that the demo images are constant across queries, a pragmatic workaround is:

  1. Run a one-time, “offline” multimodal call:

    • Prompt Qwen3-VL with the demo images and ask for very detailed descriptions:

      • “Describe this reference image in exhaustive detail, including structure, color, layout, etc. Label it REF1, REF2, …”.
  2. Take those descriptions and turn them into a text-only prefix (see the sketch after this list):

    REF1: [long detailed description]
    REF2: [long detailed description]
    ...
    
  3. Use that text-only prefix as the cached prompt for later comparisons:

    • Each query then passes:

      • The cached text describing REF1, REF2, …
      • The new image img2.
      • A text question: “Compare the new image to REF1” (or similar).
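
A sketch of steps 1 and 2, reusing model, processor, and img1 from the snippets above (describe_image and the prompt wording are illustrative, not part of any API):

# One-time, offline pass: turn each demo image into a labelled text description.
def describe_image(image, label):
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": (
            "Describe this reference image in exhaustive detail, including "
            f"structure, colors, layout, and any text. Label it {label}."
        )},
    ]}]
    inputs = processor.apply_chat_template(
        messages, tokenize=True, add_generation_prompt=True,
        return_dict=True, return_tensors="pt",
    ).to(model.device)
    out = model.generate(**inputs, max_new_tokens=1024)
    # Strip the prompt tokens and keep only the generated description.
    return processor.batch_decode(
        out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )[0]

# Build the reusable text-only prefix (extend the list with more demo images).
text_prefix = "\n\n".join(
    f"{label}: {describe_image(img, label)}" for label, img in [("REF1", img1)]
)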

Advantages:

  • The KV cache is purely textual, which works reliably with current past_key_values semantics.
  • The cost of encoding demo images is paid only once when generating the descriptions.
  • Comparisons often work surprisingly well if the descriptions are detailed.

Trade-off:

  • The model no longer has raw pixel access to both images simultaneously; it compares an image to a textual representation of another image. For many tasks this is acceptable; for fine-grained visual metrics it may be weaker.

5.4 If vLLM is used: avoid --enable-prefix-caching for multimodal

If using vLLM with Qwen-VL:

  • The vLLM issue #8296 explicitly reports that enabling --enable-prefix-caching with Qwen2-VL and multiple images causes errors and is not compatible with multimodal models as currently implemented. (GitHub)

  • Until there is an explicit fix in vLLM for Qwen3-VL multimodal caching, the safe assumption is:

    • Do not rely on KV-level prefix caching when images differ across requests.

Engine-side prompt caching may still work if it treats both text and image content as part of the cache key and rebuilds KVs per distinct multimodal prefix.


6. What would a “real” multimodal prefix cache require?

Conceptually, to truly support:

“Cached multimodal prefix with image A, later reuse cache with new image B and text”

the model implementation would need:

  1. Non-destructive handling of pixel_values in decode

    • prepare_inputs_for_generation would have to not drop pixel_values just because cache_position > 0.
    • Instead, it would need to check whether the current slice of input_ids contains image tokens and only forward pixel_values in those steps. This is exactly what the Gemma-3 issue proposes as a fix; a rough sketch of the idea appears after this list. (GitHub)
  2. Recomputable multimodal RoPE state

    • The mRoPE logic would have to allow inserting new image segments after existing cached content:

      • Compute new spatial indices for new images.
      • Combine them with existing rope_deltas in a consistent way.
    • Current Qwen-VL code is explicitly written to compute RoPE indices once in prefill and reuse them, which is not sufficient for new images in decode. (Hugging Face)

  3. A stable cache format that captures chunking / DCA metadata

    • Qwen3-VL uses Dual Chunk Attention to extend context beyond 4k tokens; a Reddit report shows that even pure text prefix caching diverges beyond 4096 tokens, likely because chunking metadata is not preserved in past_key_values. (Reddit)
    • True multimodal prefix caching would have to take chunking and image blocks into account.
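
For item 1, a rough sketch of the kind of change involved is shown below, adapted from the idea in the Gemma-3 issue. It is untested pseudocode rather than a fix: attribute and keyword names (e.g. config.image_token_id) may differ between transformers versions, and it does nothing about items 2 and 3, so RoPE for the new image would still be wrong.

# Untested sketch of item 1 only: keep image features for decode steps whose
# input slice still contains image placeholder tokens, instead of dropping
# them unconditionally once cache_position > 0.
class PatchedQwen3VL(Qwen3VLForConditionalGeneration):
    def prepare_inputs_for_generation(self, *args, **kwargs):
        model_inputs = super().prepare_inputs_for_generation(*args, **kwargs)
        input_ids = model_inputs.get("input_ids")
        image_token_id = getattr(self.config, "image_token_id", None)  # name may vary
        if (
            input_ids is not None
            and image_token_id is not None
            and (input_ids == image_token_id).any()
        ):
            # Re-attach the vision inputs that the stock override cleared.
            model_inputs["pixel_values"] = kwargs.get("pixel_values")
            model_inputs["image_grid_thw"] = kwargs.get("image_grid_thw")
        return model_inputs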

Implementing all this is non-trivial, and there is no official, production-quality patch yet for Qwen3-VL in the Hugging Face ecosystem.


7. Practical recommendation for your use case

Given the current state of Qwen3-VL and Gemma-3:

  • The attempt in your snippet – reusing a multimodal KV cache and then adding a second image with a non-zero cache_position – is not supported by the model’s generation code and will remain unreliable.
  • The failure mode (“the two images seem identical” or the second image is ignored) is expected.

For robust behavior now:

  1. Use single-prefill multimodal prompts for any case where both images must be reasoned about together.

  2. Restrict KV prefix caching (via past_key_values) to text-only prefixes.

  3. If demo images must be reused:

    • Either re-encode them every time in a single prefill, or
    • Convert them once into very detailed text descriptions and cache those as text.

Summary

  • Qwen3-VL (like Qwen2-VL / Qwen2.5-VL and Gemma-3) is designed for “one multimodal prefill, then text-only decode”.

  • In the Hugging Face implementations, prepare_inputs_for_generation and the RoPE logic:

    • Compute multimodal RoPE only in prefill. (Hugging Face)
    • Drop or mishandle pixel_values once cache_position > 0, making later images effectively invisible. (Hugging Face)
  • As a result, the pattern “cached prefix with image 1 + new call with image 2 + text, using past_key_values” is not supported; the second image cannot be processed correctly.

  • Similar problems are reported publicly:

    • Reddit: Qwen3-VL prefix caching behaves inconsistently, especially with long multimodal prefixes. (Reddit)
    • vLLM: --enable-prefix-caching is currently incompatible with Qwen-VL multimodal models. (GitHub)
    • Gemma-3: prepare_inputs_for_generation explicitly ignores images when cache_position > 0. (GitHub)
  • Recommended workarounds:

    • Put all images into a single prefill; avoid cross-call multimodal KV reuse.
    • Cache only textual prefixes.
    • Optionally, summarize fixed demo images into detailed text once and reuse that text as the cached prefix.

Thank you, John, for the extremely detailed response and for summarizing the problem so clearly; it’s very helpful! I hope to find a safe way to implement this functionality soon. I attempted a simple local fix on my side (similar to the fix proposed in the Gemma-3 issue you mentioned above), but recomputing the RoPE indices led to downstream errors that I haven’t yet been able to debug.
