---
license: mit
---
# Audio-Reasoner

We implemented inference scaling on **Audio-Reasoner**, a large audio language model, enabling **deep thinking** and **structured chain-of-thought (CoT) reasoning** for multimodal understanding and reasoning. To achieve this, we constructed **CoTA**, a high-quality dataset of **1.2M reasoning-rich samples** built with structured CoT techniques. Audio-Reasoner achieves state-of-the-art results on the **MMAU-mini (+25.42%)** and **AIR-Bench-Chat (+14.57%)** benchmarks.
<p align="center">
Audio-Reasoner-7B <a href="https://huggingface.co/zhifeixie/Audio-Reasoner/tree/main">🤗</a> | CoTA Dataset <a href="https://huggingface.co">🤗</a> (coming soon)<br>
Paper <a href="https://arxiv.org/abs/2503.02318">📑</a> | WeChat <a href="https://github.com/xzf-thu/Audio-Reasoner/blob/main/assets/wechat.jpg">💭</a> | Code <a href="https://github.com/xzf-thu/Audio-Reasoner">⚙️</a>
<br>
<a href="#demo">Demo</a> • <a href="#install">Install</a> • <a href="#quick-start">Quick Start</a> • <a href="#faq">FAQ</a> • <a href="#contact">Contact us</a><br>
<br>
If you like our work, please give us a star ⭐!
</p>

## Main Results
## News and Updates
- **2025.03.05:** ✅ **The Audio-Reasoner-7B checkpoint is released on Hugging Face <a href="https://huggingface.co/zhifeixie/Audio-Reasoner/tree/main">🤗</a>!**
- **2025.03.05:** ✅ **The Audio-Reasoner paper is available on arXiv <a href="https://arxiv.org/abs/2503.02318">📑</a>.**
- **2025.03.04:** ✅ **Demos, inference code, and evaluation results have been released.**
- **2025.03.04:** ✅ **Created this repo.**
## Roadmap
- **2025.03:** 🔜 **Upload the CoTA dataset to Hugging Face 🤗.**
- **2025.04:** 🔜 **Open-source the data synthesis pipeline and training code.**
## Demo
<p align="center" width="80%">
<video controls src="https://github.com/user-attachments/assets/d50f75e7-288b-454b-92a3-c6f058be231b" title="Audio-Reasoner demo" width="100%"></video>
</p>

## Features
✅ Audio-Reasoner enables **deep reasoning and inference scaling** on audio-based tasks, built on Qwen2-Audio-Instruct with structured CoT training.

✅ CoTA offers **1.2M** high-quality captions and QA pairs across domains for structured reasoning and enhanced pretraining.

✅ The pretrained model and dataset cover diverse audio types, including sound, music, and speech, and achieve state-of-the-art results across multiple benchmarks. Refer to our <a href="https://arxiv.org/abs/2503.02318">paper</a> for details.
## Install

**Clone and install**

- Clone the repo
```sh
git clone https://github.com/xzf-thu/Audio-Reasoner.git
cd Audio-Reasoner
```
- Install the required packages (a version sanity check is sketched below)
```sh
conda create -n Audio-Reasoner python=3.10
conda activate Audio-Reasoner

pip install -r requirements.txt
pip install transformers==4.49.1
```
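Because the FAQ below notes that the `transformers` version strongly affects model behavior, it is worth verifying the pin before running inference. A minimal sanity check, not part of the repo's scripts:

```python
# Confirm the pinned transformers version is the one actually imported.
import transformers

assert transformers.__version__ == "4.49.1", transformers.__version__
print("transformers", transformers.__version__, "OK")
```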
## Quick Start

**Chat using ms-swift**
```python
from swift.llm import InferEngine, InferRequest, PtEngine, RequestConfig
from swift.plugin import InferStats

system = 'You are an audio deep-thinking model. Upon receiving a question, please respond in two parts: <THINK> and <RESPONSE>. The <THINK> section should be further divided into four parts: <PLANNING>, <CAPTION>, <REASONING>, and <SUMMARY>.'
infer_backend = 'pt'
model = 'qwen2_audio'
last_model_checkpoint = ""  # Please replace with the path to the checkpoint
engine = PtEngine(last_model_checkpoint, max_batch_size=64, model_type=model)


def infer_stream(engine: 'InferEngine', infer_request: 'InferRequest'):
    request_config = RequestConfig(max_tokens=2048, temperature=0, stream=True)
    metric = InferStats()
    gen = engine.infer([infer_request], request_config, metrics=[metric])
    query = infer_request.messages[-1]['content']  # the user turn, not the system prompt
    output = ''
    print(f'query: {query}\nresponse: ', end='')
    for resp_list in gen:
        if resp_list[0] is None:
            continue
        print(resp_list[0].choices[0].delta.content, end='', flush=True)
        output += resp_list[0].choices[0].delta.content
    print()
    print(f'metric: {metric.compute()}')
    return output


def get_message(audiopath, prompt):
    return [
        {'role': 'system', 'content': system},
        {
            'role': 'user',
            'content': [
                {'type': 'audio', 'audio': audiopath},
                {'type': 'text', 'text': prompt},
            ],
        },
    ]


def audioreasoner_gen(audiopath, prompt):
    return infer_stream(engine, InferRequest(messages=get_message(audiopath, prompt)))


def main():
    # Please replace with your test audio and a question about it
    audiopath = 'assets/test.wav'
    prompt = 'Which of the following best describes the rhythmic feel and time signature of the song?'
    audioreasoner_gen(audiopath, prompt)


if __name__ == '__main__':
    main()
```
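The system prompt above asks the model to wrap its output in `<THINK>` (containing `<PLANNING>`, `<CAPTION>`, `<REASONING>`, and `<SUMMARY>`) plus a final `<RESPONSE>`. If you only want the final answer, a small parser along the following lines can split the sections. This is an illustrative sketch, not part of the released inference code, and it assumes the model emits the tags named in the system prompt:

```python
import re

# Section tags named in the system prompt; <THINK> is only a wrapper.
COT_TAGS = ('PLANNING', 'CAPTION', 'REASONING', 'SUMMARY', 'RESPONSE')


def split_cot(output: str) -> dict:
    """Split a structured CoT response into its named sections."""
    # Split on the opening tags; the capture group keeps each tag name,
    # so tags and their bodies stay paired in the resulting list.
    pieces = re.split(r'<({})>'.format('|'.join(COT_TAGS)), output)
    sections = {}
    for tag, body in zip(pieces[1::2], pieces[2::2]):
        # Drop trailing closing tags (</SUMMARY>, </THINK>, ...) if present.
        sections[tag.lower()] = re.sub(r'(?:</\w+>\s*)+$', '', body).strip()
    return sections


# Example usage with audioreasoner_gen() from the script above:
# sections = split_cot(audioreasoner_gen(audiopath, prompt))
# print(sections.get('response'))
```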
**Local test**

```sh
conda activate Audio-Reasoner
cd Audio-Reasoner
# Test-run the preset audio samples and questions
python inference.py
```
## FAQ

**1. What kinds of audio can Audio-Reasoner understand, and what kind of thinking does it perform?**
Audio-Reasoner can understand various types of audio, including sound, music, and speech. It conducts in-depth thinking in four parts: **planning, caption, reasoning, and summary**.
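In practice this means every answer arrives in a fixed scaffold. An illustrative skeleton of the response format, based on the system prompt in Quick Start (exact nesting and closing tags may vary):

```
<THINK>
  <PLANNING> how the model will approach the question </PLANNING>
  <CAPTION> what it hears in the audio </CAPTION>
  <REASONING> step-by-step analysis </REASONING>
  <SUMMARY> distilled conclusion </SUMMARY>
</THINK>
<RESPONSE> final answer shown to the user </RESPONSE>
```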
**2. Why is transformers installed after ms-swift in the environment configuration?**
The transformers version has a significant impact on model performance; in our tests, `transformers==4.49.1` is one of the suitable versions. Installing ms-swift first provides a stable base environment, and pinning transformers afterwards avoids version conflicts that could degrade the model's performance.
## Contact

If you have any questions, please feel free to contact us via `zhifei001@e.ntu.edu.sg`.
## Citation
Please cite our paper if you find our model and dataset useful. Thanks!
```bibtex
@misc{xie2025audioreasonerimprovingreasoningcapability,
  title={Audio-Reasoner: Improving Reasoning Capability in Large Audio Language Models},
  author={Zhifei Xie and Mingbao Lin and Zihang Liu and Pengcheng Wu and Shuicheng Yan and Chunyan Miao},
  year={2025},
  eprint={2503.02318},
  archivePrefix={arXiv},
  primaryClass={cs.SD},
  url={https://arxiv.org/abs/2503.02318},
}
```