WavFlow — Audio Generation in Waveform Space

Overview

Generation in Raw Waveform Space

A simpler, more scalable alternative to latent-space audio generation.

Modern audio generation predominantly relies on latent-space compression, which adds pipeline complexity and can lose information. WavFlow challenges this paradigm by generating high-fidelity audio directly in raw waveform space. Using waveform patchifying and amplitude lifting, WavFlow enables stable flow matching via direct x-prediction, bypassing intermediate representations entirely. With competitive results on VGGSound-Test and MovieGen-Audio-Bench, WavFlow demonstrates that latent compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal generation.
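As a rough illustration of two of the ideas named above, the sketch below patchifies a waveform into fixed-length tokens and shows how an x-prediction (a prediction of the clean signal) can be converted into a flow-matching velocity. It is not WavFlow's implementation: the patch size of 320 samples, the linear noise-to-data path with t=0 as pure noise, and all function names are assumptions made for illustration. Amplitude lifting is left out because this page does not define it.

```python
import numpy as np

def patchify(wave, patch_size):
    """Split a 1-D waveform into non-overlapping patches (one row per patch)."""
    n = len(wave) // patch_size            # any trailing partial patch is dropped
    return wave[: n * patch_size].reshape(n, patch_size)

def unpatchify(patches):
    """Inverse of patchify: concatenate the patches back into a 1-D waveform."""
    return patches.reshape(-1)

def interpolate(x, noise, t):
    """Assumed linear path: t=0 is pure noise, t=1 is the clean signal."""
    return (1.0 - t) * noise + t * x

def velocity_from_x_pred(x_pred, x_t, t, eps=1e-5):
    """Convert a clean-signal prediction into a velocity on the linear path.

    On this path dx_t/dt = x - noise, which equals (x - x_t) / (1 - t).
    """
    return (x_pred - x_t) / max(1.0 - t, eps)

rng = np.random.default_rng(0)
sr = 16000
wave = np.sin(2 * np.pi * 440 * np.arange(sr) / sr)   # 1 s test tone at 16 kHz
patches = patchify(wave, 320)                          # hypothetical patch size
noise = rng.standard_normal(patches.shape)

t = 0.3
x_t = interpolate(patches, noise, t)
# Stand-in for a network: a perfect x-prediction returns the clean patches.
v = velocity_from_x_pred(patches, x_t, t)
# One Euler step of length (1 - t) along v lands on the clean patches.
x_end = x_t + (1.0 - t) * v
```

In a real model the perfect prediction would be replaced by a network output conditioned on x_t, t, and the video/text inputs; the conversion to a velocity is what lets a standard ODE sampler be used with x-prediction.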

Quantitative Results

State-of-the-Art Performance

Evaluated on VGGSound-Test and MovieGen-Audio-Bench against leading open-source and proprietary baselines.

Evaluation on VGGSound-Test

VT2A · Reference-based

Comparison with state-of-the-art methods on VGGSound-Test for video-text-to-audio (VT2A) generation. Lower is better for FD, KL and DeSync; higher is better for IS, IB and CLAP.
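For readers unfamiliar with the FD columns, the sketch below computes a Fréchet distance between two sets of embeddings, assuming FD here denotes the usual Fréchet distance between Gaussians fitted to embeddings (PANNs or PaSST in the table). The function name and the random stand-in embeddings are hypothetical; a real evaluation would use embeddings extracted from generated and reference audio.

```python
import numpy as np

def frechet_distance(emb_a, emb_b):
    """Fréchet distance between Gaussians fitted to two (n_samples, dim) arrays.

    FD = ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 Tr((C_a C_b)^{1/2})
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((C_a C_b)^{1/2}) is the sum of the square roots of the eigenvalues
    # of C_a @ C_b, which are real and non-negative for PSD covariances.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)

rng = np.random.default_rng(0)
ref = rng.standard_normal((200, 4))   # stand-in for reference-audio embeddings
gen = ref + 3.0                       # same spread, mean shifted by 3 in each of 4 dims

fd_same = frechet_distance(ref, ref)
fd_shift = frechet_distance(ref, gen)
```

With identical covariances the trace terms cancel, so the shifted case reduces to the squared distance between the means (4 dimensions times 3 squared).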


Method	FD_PANNs↓	FD_PaSST↓	KL_PANNs↓	IS_PANNs↑	IB↑	DeSync↓	CLAP↑	Params
Frieren^†	11.45	106.10	2.73	12.25	0.23	0.85	0.11	159M
V2A-Mapper^†	8.40	84.57	2.69	12.47	0.23	1.23	0.11	229M
HunyuanVideo-Foley^∗	10.53	97.85	2.02	14.99	0.32	0.54	0.23	—
MMAudio-L-44.1kHz^†	4.72	60.60	1.65	17.40	0.33	0.44	0.22	1.03B
WavFlow-M-16kHz	6.37	62.64	1.68	17.24	0.30	0.47	0.21	624M
WavFlow-L-16kHz	5.86	59.98	1.66	17.40	0.31	0.44	0.22	1.03B
WavFlow-L-44.1kHz	5.25	55.82	1.73	15.05	0.31	0.46	0.19	1.03B

All methods are evaluated on the same VGGSound test split from the MMAudio benchmark, utilizing original videos and native class labels as captions to ensure a fair comparison. Due to the difference in semantic granularity (sparse labels vs. dense captions), we exclude direct comparisons with models relying on LLM-refined captions. ^† results taken from the MMAudio paper. ^∗ reproduced using their open-source checkpoints on the same test set.

Evaluation on MovieGen-Audio-Bench

Reference-free

Reference-free evaluation on MovieGen-Audio-Bench (no ground-truth audio).


Method	Params	Training Data	IS↑	CLAP↑	IB-score↑	DeSync↓
WavFlow (Ours)	1.03B	~11.1K h	8.95	0.28	0.24	0.77
MMAudio	1.03B	~8.2K h	8.40	0.28	0.27	0.77
MovieGen	13B	~1,000K h	8.89	0.29	0.36	1.00

Citation

BibTeX


@misc{zhou2026wavflowaudiogenerationwaveform,
      title={WavFlow: Audio Generation in Waveform Space}, 
      author={Feiyan Zhou and Luyuan Wang and Shoufa Chen and Zhe Wang and Zhiheng Liu and Yuren Cong and Xiaohui Zhang and Fanny Yang and Belinda Zeng},
      year={2026},
      eprint={2605.18749},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2605.18749}, 
}
