20 Questions StarPO - Qwen3-4B

This model is a Qwen3-4B checkpoint fine-tuned with reinforcement learning for the 20 Questions task, using StarPO, a variant of GRPO (Group Relative Policy Optimization) adapted to multi-turn settings. It was released as part of the paper "Intrinsic Credit Assignment for Long Horizon Interaction".
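StarPO's full objective is described in the paper. Its core, inherited from GRPO, is a group-relative advantage: several games are rolled out for the same secret word, and each trajectory's reward is standardized against its group. A minimal sketch, assuming one scalar terminal reward per game (the paper's exact formulation may differ):

import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # rewards: shape (G,), one terminal reward per rollout of the same secret word.
    # In a multi-turn setting, the resulting scalar is typically broadcast to
    # every action token of its trajectory.
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: a group of 4 games of the same word where only the first succeeded.
adv = group_relative_advantages(torch.tensor([1.0, 0.0, 0.0, 0.0]))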

Overview

The model plays the Questioner in a game of 20 Questions: it asks up to 20 yes-or-no questions to deduce a secret word (a common English noun). It was trained with RL starting from the SFT checkpoint and serves as a baseline in the paper.

Training

  • Base model: Qwen3-4B-SFT (SFT on Qwen3-4B)
  • Method: StarPO (multi-turn GRPO)
  • Training data: 1,000 words from the COCA+ RL training set (no overlap with test set)
  • Test set: 433 held-out words from the Gemini test split
  • Judge/Oracle: Qwen3-14B (with chain-of-thought reasoning; see the sketch after this list)
  • Framework: VERL (Volcano Engine Reinforcement Learning)
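The oracle is not part of this checkpoint, so reproducing training or evaluation requires wiring one up yourself. Below is a hypothetical sketch of prompting a Qwen3-14B judge to answer yes/no about the secret word; the paper's actual oracle prompt is not reproduced here, so the wording is an assumption:

# Hypothetical oracle prompt -- NOT the exact wording used in the paper.
ORACLE_SYSTEM = (
    "You are the Oracle in a game of 20 Questions. The secret word is: {secret}. "
    "Answer the player's question truthfully with exactly 'Yes' or 'No'."
)

def oracle_messages(secret: str, question: str) -> list[dict]:
    # Qwen3's thinking mode can supply the chain-of-thought before the final Yes/No.
    return [
        {"role": "system", "content": ORACLE_SYSTEM.format(secret=secret)},
        {"role": "user", "content": question},
    ]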

Intended Use

This model is intended for:

  • Playing 20 Questions as a questioner agent
  • Research on multi-turn interactive language agents and RL for LLMs
  • Comparison baseline for credit assignment methods in multi-step RL

How to Use

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Klingspor/StarPO-4B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Load in the checkpoint's native dtype (BF16) and place it on available devices.
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto", device_map="auto")

system_prompt = """You are the Questioner in a game of 20 Questions, and your goal is to determine the secret word.
The secret is randomly drawn from the most frequent nouns of the English language.

Ask clear, concise, and strategic yes/no questions that will help you narrow down the possibilities.
Consider previous answers to inform your subsequent questions, and keep track of the information you gather.
Focus on deductive reasoning, start with a broad question and refine your queries as you progress."""

user_prompt = """Ask a question to gain additional information about the secret or guess what the secret is.

Instructions:
1. Ask a question that can be answered with "Yes" or "No" to help you deduce the secret word.
2. Your answer must be a single question. Do not provide any additional commentary or reasoning.

Ask your question: """

messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_prompt},
]

# Build the chat-formatted prompt and generate the first question.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=128)
# Decode only the newly generated tokens, not the echoed prompt.
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
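The snippet above yields only the first question. To play a full game, append the model's question and the oracle's answer to messages and generate again. A minimal sketch: how answers were formatted during training is not documented here, so repeating the instruction block each turn is an assumption, and the interactive input() oracle is a placeholder for a judge model or a human:

def generate_reply(messages):
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=128)
    return tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True).strip()

for turn in range(20):
    question = generate_reply(messages)
    print(f"Q{turn + 1}: {question}")
    answer = input("Oracle (Yes/No): ")  # placeholder oracle
    messages.append({"role": "assistant", "content": question})
    messages.append({"role": "user", "content": answer + "\n\n" + user_prompt})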
