audio
SoundCTM: Uniting Score-based and Consistency Models for Text-to-Sound
Generation
Paper
• 2405.18503
• Published
• 9
DITTO-2: Distilled Diffusion Inference-Time T-Optimization for Music
Generation
Paper
• 2405.20289
• Published
• 11
LiveSpeech: Low-Latency Zero-shot Text-to-Speech via Autoregressive
Modeling of Audio Discrete Codes
Paper
• 2406.02897
• Published
• 16
Audio Mamba: Bidirectional State Space Model for Audio Representation
Learning
Paper
• 2406.03344
• Published
• 22
GAMA: A Large Audio-Language Model with Advanced Audio Understanding and
Complex Reasoning Abilities
Paper
• 2406.11768
• Published
• 24
Towards Robust Speech Representation Learning for Thousands of Languages
Paper
• 2407.00837
• Published
• 11
FoleyCrafter: Bring Silent Videos to Life with Lifelike and Synchronized
Sounds
Paper
• 2407.01494
• Published
• 15
PicoAudio: Enabling Precise Timestamp and Frequency Controllability of
Audio Events in Text-to-audio Generation
Paper
• 2407.02869
• Published
• 21
FunAudioLLM: Voice Understanding and Generation Foundation Models for
Natural Interaction Between Humans and LLMs
Paper
• 2407.04051
• Published
• 40
Video-to-Audio Generation with Hidden Alignment
Paper
• 2407.07464
• Published
• 17
Masked Generative Video-to-Audio Transformers with Enhanced
Synchronicity
Paper
• 2407.10387
• Published
• 8
Qwen2-Audio Technical Report
Paper
• 2407.10759
• Published
• 64
Audio Conditioning for Music Generation via Discrete Bottleneck Features
Paper
• 2407.12563
• Published
• 7
Paper
• 2407.14358
• Published
• 26
Efficient Audio Captioning with Encoder-Level Knowledge Distillation
Paper
• 2407.14329
• Published
• 5
MusiConGen: Rhythm and Chord Control for Transformer-Based Text-to-Music
Generation
Paper
• 2407.15060
• Published
• 9
Towards Achieving Human Parity on End-to-end Simultaneous Speech
Translation via LLM Agent
Paper
• 2407.21646
• Published
• 18
Open-Vocabulary Audio-Visual Semantic Segmentation
Paper
• 2407.21721
• Published
• 9
MuChoMusic: Evaluating Music Understanding in Multimodal Audio-Language
Models
Paper
• 2408.01337
• Published
• 11
AVESFormer: Efficient Transformer Design for Real-Time Audio-Visual
Segmentation
Paper
• 2408.01708
• Published
• 4
Facing the Music: Tackling Singing Voice Separation in Cinematic Audio
Source Separation
Paper
• 2408.03588
• Published
• 8
MulliVC: Multi-lingual Voice Conversion With Cycle Consistency
Paper
• 2408.04708
• Published
• 8
PeriodWave: Multi-Period Flow Matching for High-Fidelity Waveform
Generation
Paper
• 2408.07547
• Published
• 9
Accelerating High-Fidelity Waveform Generation via Adversarial Flow
Matching Optimization
Paper
• 2408.08019
• Published
• 11
WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio
Language Modeling
Paper
• 2408.16532
• Published
• 50
The VoxCeleb Speaker Recognition Challenge: A Retrospective
Paper
• 2408.14886
• Published
• 11
Paper
• 2409.00587
• Published
• 33
Density Adaptive Attention-based Speech Network: Enhancing Feature
Understanding for Mental Health Disorders
Paper
• 2409.00391
• Published
• 5
FastVoiceGrad: One-step Diffusion-Based Voice Conversion with
Adversarial Conditional Diffusion Distillation
Paper
• 2409.02245
• Published
• 10
LLaMA-Omni: Seamless Speech Interaction with Large Language Models
Paper
• 2409.06666
• Published
• 60
SongCreator: Lyrics-based Universal Song Generation
Paper
• 2409.06029
• Published
• 22
Draw an Audio: Leveraging Multi-Instruction for Video-to-Audio Synthesis
Paper
• 2409.06135
• Published
• 16
Seed-Music: A Unified Framework for High Quality and Controlled Music
Generation
Paper
• 2409.09214
• Published
• 53
EzAudio: Enhancing Text-to-Audio Generation with Efficient Diffusion
Transformer
Paper
• 2409.10819
• Published
• 18
PDMX: A Large-Scale Public Domain MusicXML Dataset for Symbolic Music
Processing
Paper
• 2409.10831
• Published
• 6
Takin: A Cohort of Superior Quality Zero-shot Speech Generation Models
Paper
• 2409.12139
• Published
• 12
SoloAudio: Target Sound Extraction with Language-oriented Audio
Diffusion Transformer
Paper
• 2409.08425
• Published
• 10
CLAIR-A: Leveraging Large Language Models to Judge Audio Captions
Paper
• 2409.12962
• Published
• 2
MuCodec: Ultra Low-Bitrate Music Codec
Paper
• 2409.13216
• Published
• 22
Temporally Aligned Audio for Video with Autoregression
Paper
• 2409.13689
• Published
• 9
Distilling an End-to-End Voice Assistant Without Instruction Training
Data
Paper
• 2410.02678
• Published
• 23
Roadmap towards Superhuman Speech Understanding using Large Language
Models
Paper
• 2410.13268
• Published
• 33
MuVi: Video-to-Music Generation with Semantic Alignment and Rhythmic
Synchronization
Paper
• 2410.12957
• Published
• 8
Ichigo: Mixed-Modal Early-Fusion Realtime Voice Assistant
Paper
• 2410.15316
• Published
• 12
Continuous Speech Synthesis using per-token Latent Diffusion
Paper
• 2410.16048
• Published
• 29
MaskGCT: Zero-Shot Text-to-Speech with Masked Generative Codec
Transformer
Paper
• 2409.00750
• Published
• 5
Acoustic Volume Rendering for Neural Impulse Response Fields
Paper
• 2411.06307
• Published
• 6
PerceiverS: A Multi-Scale Perceiver with Effective Segmentation for
Long-Term Expressive Symbolic Music Generation
Paper
• 2411.08307
• Published
• 7
Video-Guided Foley Sound Generation with Multimodal Controls
Paper
• 2411.17698
• Published
• 10
Multimodal Music Generation with Explicit Bridges and Retrieval
Augmentation
Paper
• 2412.09428
• Published
• 7
Whisper-GPT: A Hybrid Representation Audio Large Language Model
Paper
• 2412.11449
• Published
• 4
RLDG: Robotic Generalist Policy Distillation via Reinforcement Learning
Paper
• 2412.09858
• Published
• 2
Taming Multimodal Joint Training for High-Quality Video-to-Audio
Synthesis
Paper
• 2412.15322
• Published
• 20
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation
System?
Paper
• 2412.18495
• Published
• 9
TangoFlux: Super Fast and Faithful Text to Audio Generation with Flow
Matching and Clap-Ranked Preference Optimization
Paper
• 2412.21037
• Published
• 24
HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial
Network for High-Fidelity Speech Super-Resolution
Paper
• 2501.10045
• Published
• 10
Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based
Speech Synthesis
Paper
• 2502.04128
• Published
• 27
Soundwave: Less is More for Speech-Text Alignment in LLMs
Paper
• 2502.12900
• Published
• 86
SongGen: A Single Stage Auto-regressive Transformer for Text-to-Song
Generation
Paper
• 2502.13128
• Published
• 41
Mind the Gap! Static and Interactive Evaluations of Large Audio Models
Paper
• 2502.15919
• Published
• 4
LLMVoX: Autoregressive Streaming Text-to-Speech Model for Any LLM
Paper
• 2503.04724
• Published
• 72
Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding
and Expert Reasoning Abilities
Paper
• 2503.03983
• Published
• 27
YuE: Scaling Open Foundation Models for Long-Form Music Generation
Paper
• 2503.08638
• Published
• 72
Quantization for OpenAI's Whisper Models: A Comparative Analysis
Paper
• 2503.09905
• Published
• 7
From TOWER to SPIRE: Adding the Speech Modality to a Text-Only LLM
Paper
• 2503.10620
• Published
• 7
MegaTTS 3: Sparse Alignment Enhanced Latent Diffusion Transformer for
Zero-Shot Speech Synthesis
Paper
• 2502.18924
• Published
• 16
Kimi-Audio Technical Report
Paper
• 2504.18425
• Published
• 20
Voila: Voice-Language Foundation Models for Real-Time Autonomous
Interaction and Voice Role-Play
Paper
• 2505.02707
• Published
• 85
LLaMA-Omni2: LLM-based Real-time Spoken Chatbot with Autoregressive
Streaming Speech Synthesis
Paper
• 2505.02625
• Published
• 23
VITA-Audio: Fast Interleaved Cross-Modal Token Generation for Efficient
Large Speech-Language Model
Paper
• 2505.03739
• Published
• 9
MiniMax-Speech: Intrinsic Zero-Shot Text-to-Speech with a Learnable
Speaker Encoder
Paper
• 2505.07916
• Published
• 134
Fast Text-to-Audio Generation with Adversarial Post-Training
Paper
• 2505.08175
• Published
• 25
Efficient Speech Language Modeling via Energy Distance in Continuous
Latent Space
Paper
• 2505.13181
• Published
• 9
Learning to Highlight Audio by Watching Movies
Paper
• 2505.12154
• Published
• 3
From Tens of Hours to Tens of Thousands: Scaling Back-Translation for
Speech Recognition
Paper
• 2505.16972
• Published
• 9
OWSM v4: Improving Open Whisper-Style Speech Models via Data Scaling and
Cleaning
Paper
• 2506.00338
• Published
• 10
FusionAudio-1.2M: Towards Fine-grained Audio Captioning with Multimodal
Contextual Fusion
Paper
• 2506.01111
• Published
• 31
Auto-Regressive vs Flow-Matching: a Comparative Study of Modeling
Paradigms for Text-to-Music Generation
Paper
• 2506.08570
• Published
• 33
Discrete Audio Tokens: More Than a Survey!
Paper
• 2506.10274
• Published
• 32
EmoNet-Voice: A Fine-Grained, Expert-Verified Benchmark for Speech
Emotion Detection
Paper
• 2506.09827
• Published
• 21
SonicVerse: Multi-Task Learning for Music Feature-Informed Captioning
Paper
• 2506.15154
• Published
• 9
CultureMERT: Continual Pre-Training for Cross-Cultural Music
Representation Learning
Paper
• 2506.17818
• Published
• 3
USAD: Universal Speech and Audio Representation via Distillation
Paper
• 2506.18843
• Published
• 12
Re-Bottleneck: Latent Re-Structuring for Neural Audio Autoencoders
Paper
• 2507.07867
• Published
• 2
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large
Audio Language Models
Paper
• 2507.08128
• Published
• 13
Paper
• 2507.13264
• Published
• 32
IndexTTS: An Industrial-Level Controllable and Efficient Zero-Shot
Text-To-Speech System
Paper
• 2502.05512
• Published
• 7
OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder
Paper
• 2507.14129
• Published
• 11
STITCH: Simultaneous Thinking and Talking with Chunked Reasoning for
Spoken Language Models
Paper
• 2507.15375
• Published
• 30
Step-Audio 2 Technical Report
Paper
• 2507.16632
• Published
• 74
DMOSpeech 2: Reinforcement Learning for Duration Prediction in
Metric-Optimized Speech Synthesis
Paper
• 2507.14988
• Published
• 8
SonicMaster: Towards Controllable All-in-One Music Restoration and
Mastering
Paper
• 2508.03448
• Published
• 6
Marco-Voice Technical Report
Paper
• 2508.02038
• Published
• 16
NVSpeech: An Integrated and Scalable Pipeline for Human-Like Speech
Modeling with Paralinguistic Vocalizations
Paper
• 2508.04195
• Published
• 1
Representing Speech Through Autoregressive Prediction of Cochlear Tokens
Paper
• 2508.11598
• Published
• 17
Evaluating Podcast Recommendations with Profile-Aware LLM-as-a-Judge
Paper
• 2508.08777
• Published
• 15
Advances in Speech Separation: Techniques, Challenges, and Future Trends
Paper
• 2508.10830
• Published
• 16
LLaSO: A Foundational Framework for Reproducible Research in Large
Language and Speech Model
Paper
• 2508.15418
• Published
• 8
TaDiCodec: Text-aware Diffusion Speech Tokenizer for Speech Language
Modeling
Paper
• 2508.16790
• Published
• 10
VibeVoice Technical Report
Paper
• 2508.19205
• Published
• 143
AudioStory: Generating Long-Form Narrative Audio with Large Language
Models
Paper
• 2508.20088
• Published
• 21
AHELM: A Holistic Evaluation of Audio-Language Models
Paper
• 2508.21376
• Published
• 9
EchoX: Towards Mitigating Acoustic-Semantic Gap via Echo Training for
Speech-to-Speech LLMs
Paper
• 2509.09174
• Published
• 61
VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
Paper
• 2509.09716
• Published
• 12
WhisTLE: Deeply Supervised, Text-Only Domain Adaptation for Pretrained
Speech Recognition Transformers
Paper
• 2509.10452
• Published
• 2
Cross-Attention is Half Explanation in Speech-to-Text Models
Paper
• 2509.18010
• Published
• 6
Efficient Audio-Visual Speech Separation with Discrete Lip Semantics and
Multi-Scale Global-Local Attention
Paper
• 2509.23610
• Published
• 14
UniMoE-Audio: Unified Speech and Music Generation with Dynamic-Capacity
MoE
Paper
• 2510.13344
• Published
• 63
Step-Audio-EditX Technical Report
Paper
• 2511.03601
• Published
• 29
Step-Audio-R1 Technical Report
Paper
• 2511.15848
• Published
• 58
SAM Audio: Segment Anything in Audio
Paper
• 2512.18099
• Published
• 24
Qwen3-TTS Technical Report
Paper
• 2601.15621
• Published
• 69
Qwen3-ASR Technical Report
Paper
• 2601.21337
• Published
• 36
VIBEVOICE-ASR Technical Report
Paper
• 2601.18184
• Published
• 20
MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models
Paper
• 2602.10934
• Published
• 49