WavFlow: Audio Generation in Waveform Space

High-fidelity audio synthesized directly in raw waveform space: no VAE, no latent compression.

1 Meta AI, 2 Northeastern University

*Corresponding author

Overview

Generation in Raw Waveform Space

A simpler, more scalable alternative to latent-space audio generation.

Modern audio generation predominantly relies on latent-space compression, which introduces complexity and potential information loss. WavFlow challenges this paradigm by generating high-fidelity audio directly in raw waveform space. By utilizing waveform patchifying and amplitude lifting, WavFlow enables stable flow matching via direct x-prediction, bypassing intermediate representations entirely. Achieving highly competitive results across multiple benchmarks, WavFlow demonstrates that latent compression is not a prerequisite for high-quality synthesis, offering a simpler and more scalable alternative for multimodal generation.

[Figure: WavFlow overview]

Method

Architecture

Waveform patchifying + amplitude lifting enable stable flow matching with direct x-prediction in waveform space.
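
As a rough illustration of the parts of this pipeline that are standard enough to sketch, the Python below splits a waveform into non-overlapping patches and shows how a clean-sample (x) prediction maps to a velocity on a linear flow-matching path. Everything in it is an assumption made for illustration and is not taken from the WavFlow paper: the function names, the patch size, the zero-padding of the final patch, and the convention that t = 0 is noise and t = 1 is data. Amplitude lifting is left out because this page does not define it.

```python
def patchify(wave, patch_size):
    """Split a 1-D waveform into non-overlapping patches, zero-padding the tail."""
    pad = (-len(wave)) % patch_size
    padded = list(wave) + [0.0] * pad
    return [padded[i:i + patch_size] for i in range(0, len(padded), patch_size)]


def unpatchify(patches, length):
    """Flatten patches back into a waveform and drop the padding."""
    flat = [sample for patch in patches for sample in patch]
    return flat[:length]


def interpolate(x, noise, t):
    """Point on the linear path x_t = t * x + (1 - t) * noise (assumed convention)."""
    return [t * xi + (1.0 - t) * ni for xi, ni in zip(x, noise)]


def velocity_from_x_prediction(x_pred, x_t, t, eps=1e-5):
    """On the linear path, dx_t/dt = x - noise = (x - x_t) / (1 - t)."""
    denom = max(1.0 - t, eps)  # guard against division by zero near t = 1
    return [(xp - xt) / denom for xp, xt in zip(x_pred, x_t)]


def euler_sample(predict_x, noise, num_steps):
    """Integrate from noise (t = 0) toward data (t = 1) with fixed Euler steps.

    predict_x(x_t, t) stands in for a trained network that returns an
    estimate of the clean waveform; it is a hypothetical placeholder.
    """
    x_t = list(noise)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        v = velocity_from_x_prediction(predict_x(x_t, t), x_t, t)
        x_t = [xt + dt * vi for xt, vi in zip(x_t, v)]
    return x_t
```

In a real model the patches would presumably serve as the token sequence fed to the network, and `predict_x` would be the trained network. With a perfect predictor, the Euler loop lands on the clean waveform at the final step, which is a convenient sanity check for the x-to-velocity conversion.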

[Figure: WavFlow architecture]

Demos

Hear It Raw

Six categories of everyday Foley sounds synthesized end-to-end by WavFlow: pure raw-waveform output, no VAE.

Forest
Rain
Thunder
Fire
Frog
Cat
Duck
Wolf
Dizi (Chinese Flute)
Drum
Guitar
Piano
Bicycle
Jeep
Ship
Train
Skateboard
Diving
Skiing
Swimming
Stirring Water
Drilling
Pouring Cereal
Water Flow

MovieGen Benchmark

Comparison with Baselines

Side-by-side comparisons of WavFlow (Ours), MMAudio, and MovieGen Audio on scenes from the MovieGen benchmark.

Quantitative Results

State-of-the-Art Performance

Evaluated on VGGSound-Test and MovieGen-Audio-Bench against leading open-source and proprietary baselines.

Evaluation on VGGSound-Test

VT2A · Reference-based

Comparison with state-of-the-art methods on VGGSound-Test for video-text-to-audio (VT2A) generation. Lower is better for FD, KL and DeSync; higher is better for IS, IB and CLAP.

Method              FD_PANNs↓  FD_PaSST↓  KL_PANNs↓  IS_PANNs↑  IB↑   DeSync↓  CLAP↑  Params
Frieren             11.45      106.10     2.73       12.25      0.23  0.85     0.11   159M
V2A-Mapper          8.40       84.57      2.69       12.47      0.23  1.23     0.11   229M
HunyuanVideo-Foley  10.53      97.85      2.02       14.99      0.32  0.54     0.23   -
MMAudio-L-44.1kHz   4.72       60.60      1.65       17.40      0.33  0.44     0.22   1.03B
WavFlow-M-16kHz     6.37       62.64      1.68       17.24      0.30  0.47     0.21   624M
WavFlow-L-16kHz     5.86       59.98      1.66       17.40      0.31  0.44     0.22   1.03B
WavFlow-L-44.1kHz   5.25       55.82      1.73       15.05      0.31  0.46     0.19   1.03B

All methods are evaluated on the same VGGSound test split from the MMAudio benchmark, using the original videos and native class labels as captions to ensure a fair comparison. Because sparse labels and dense captions differ in semantic granularity, we exclude direct comparisons with models that rely on LLM-refined captions. Baseline results are either taken from the MMAudio paper or reproduced using the baselines' open-source checkpoints on the same test set.
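
For readers less familiar with the classifier-based metrics in the VGGSound-Test results, the sketch below shows the usual definitions of KL divergence and Inception Score on toy class-probability vectors. It is a minimal illustration rather than the evaluation code behind the reported numbers. In the actual benchmark the probabilities would come from a pretrained audio classifier such as PANNs, and the function names and `eps` smoothing here are assumptions. The Fréchet distance (FD) is omitted because it needs a matrix square root that is awkward to write without a numerical library.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists of probabilities."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def inception_score(class_probs):
    """exp of the mean KL between each sample's class posterior and the marginal.

    class_probs is a list with one probability vector per generated sample.
    Higher values mean confident per-sample predictions that are diverse
    across samples.
    """
    n = len(class_probs)
    k = len(class_probs[0])
    marginal = [sum(p[j] for p in class_probs) / n for j in range(k)]
    mean_kl = sum(kl_divergence(p, marginal) for p in class_probs) / n
    return math.exp(mean_kl)
```

Under these definitions, a set of samples that all receive the same posterior scores 1.0, while samples spread confidently over k distinct classes score close to k. This is why higher is better for IS and lower is better for KL.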

Evaluation on MovieGen-Audio-Bench

Reference-free

Reference-free evaluation on MovieGen-Audio-Bench (no ground-truth audio).

Method          Params  Training Data  IS↑   CLAP↑  IB-score↑  DeSync↓
WavFlow (Ours)  1.03B   ~11.1K h       8.95  0.28   0.24       0.77
MMAudio         1.03B   ~8.2K h        8.40  0.28   0.27       0.77
MovieGen        13B     ~1,000K h      8.89  0.29   0.36       1.00

Citation

BibTeX

wavflow.bib
@misc{zhou2026wavflowaudiogenerationwaveform,
      title={WavFlow: Audio Generation in Waveform Space}, 
      author={Feiyan Zhou and Luyuan Wang and Shoufa Chen and Zhe Wang and Zhiheng Liu and Yuren Cong and Xiaohui Zhang and Fanny Yang and Belinda Zeng},
      year={2026},
      eprint={2605.18749},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2605.18749}, 
}