Please note: unmute the audio and listen with headphones for the best experience.
We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. The dataset includes high-quality, densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthesis and impulse response generation, which previously relied on synthetic data. In our evaluation, we thoroughly assessed existing audio and audio-visual models against multiple criteria and proposed settings to enhance their performance on real-world data. We also conducted experiments to investigate the impact of incorporating visual data (i.e., images and depth) into neural acoustic field models. Additionally, we demonstrated the effectiveness of a simple sim2real approach, where a model is pre-trained with simulated data and fine-tuned with sparse real-world data, resulting in significant improvements in few-shot learning. RAF is the first dataset to provide densely captured room acoustic data, making it an ideal resource for researchers working on audio and audio-visual neural acoustic field modeling techniques.
Our data collection pipeline involves separate (a) audio and (b) visual capture stages to ensure clean signals, with the camera and RIR data aligned to the same coordinate system, as sketched below.
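To make the alignment step concrete, here is a minimal sketch of how tracked emitter/listener poses from the audio capture could be re-expressed in the camera coordinate system via a single rigid transform. The transform T_audio_to_cam and the placeholder pose values are illustrative assumptions, not the dataset's actual calibration.

import numpy as np

# Hypothetical 4x4 rigid transform mapping the audio tracking frame into the
# camera (visual capture) coordinate frame. In practice this would be estimated
# from shared reference points; the identity here is only a placeholder.
T_audio_to_cam = np.eye(4)

def to_camera_frame(pose_audio: np.ndarray) -> np.ndarray:
    """Re-express a 6DoF pose (4x4 homogeneous matrix) recorded in the audio
    tracking coordinate system in the camera coordinate system."""
    return T_audio_to_cam @ pose_audio

# Example: a listener pose recorded by the audio-side tracker (placeholder values).
listener_pose_audio = np.eye(4)
listener_pose_audio[:3, 3] = [1.2, 0.0, 1.5]  # position in metres
listener_pose_cam = to_camera_frame(listener_pose_audio)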
We capture audio data using our novel "Earful Tower" microphone tower, equipped with 36 omnidirectional microphones, together with a customized height-adjustable loudspeaker. Impulse responses and real speech recordings are densely captured by moving to various walkable positions in the room.
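For readers unfamiliar with impulse response measurement, the sketch below shows one common way an RIR can be estimated from a recording of a known excitation signal, via regularized frequency-domain deconvolution. The exponential sine sweep, its parameters, and the toy "room" filter are illustrative assumptions; they are not the exact signals or processing used for RAF.

import numpy as np

def estimate_rir(recorded: np.ndarray, sweep: np.ndarray, n_fft: int, eps: float = 1e-8) -> np.ndarray:
    """Estimate a room impulse response by frequency-domain deconvolution of the
    recorded microphone signal with the known excitation sweep."""
    R = np.fft.rfft(recorded, n=n_fft)
    S = np.fft.rfft(sweep, n=n_fft)
    # Regularized spectral division to avoid blow-up where the sweep has little energy.
    H = R * np.conj(S) / (np.abs(S) ** 2 + eps)
    return np.fft.irfft(H, n=n_fft)

# Toy example with a synthetic exponential sine sweep (placeholder parameters).
fs = 48_000
T = 2.0                                           # sweep duration in seconds
t = np.arange(int(fs * T)) / fs
f0, f1 = 20.0, 20_000.0
sweep = np.sin(2 * np.pi * f0 * T / np.log(f1 / f0)
               * (np.exp(t / T * np.log(f1 / f0)) - 1))
recorded = np.convolve(sweep, np.array([1.0, 0.0, 0.5]))  # toy "room": direct path + one reflection
rir = estimate_rir(recorded, sweep, n_fft=2 * len(recorded))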
To achieve high-fidelity visual reconstruction and synthesize appearance from any viewpoint, we adopt the VR-NeRF method, employing the "Eyeful Tower" multi-camera rig. The rig is moved across the floor area for dense capture of the static scenes.
We also propose a sim2real approach that significantly improves few-shot RIR synthesis by pre-training neural fields on dense synthetic data and fine-tuning them on sparse real-world samples.
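A minimal sketch of this pre-train-then-fine-tune recipe is shown below, assuming a toy position-to-RIR regression model and hypothetical simulated_loader / real_loader data loaders. It is meant only to illustrate the two-stage training schedule, not our actual architecture or losses.

import torch
import torch.nn as nn

class AcousticField(nn.Module):
    """Toy neural acoustic field: maps emitter and listener positions to a
    fixed-length impulse-response vector (stand-in for a full decoder)."""
    def __init__(self, rir_len: int = 4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, rir_len),
        )

    def forward(self, emitter_xyz: torch.Tensor, listener_xyz: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([emitter_xyz, listener_xyz], dim=-1))

def fit(model: nn.Module, loader, steps: int, lr: float) -> None:
    """Run a fixed number of optimization steps over (emitter, listener, rir) batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    it = iter(loader)
    for _ in range(steps):
        try:
            emitter, listener, rir = next(it)
        except StopIteration:
            it = iter(loader)
            emitter, listener, rir = next(it)
        opt.zero_grad()
        loss = loss_fn(model(emitter, listener), rir)
        loss.backward()
        opt.step()

# Sim2real recipe: pre-train on dense simulated RIRs, then fine-tune on a handful
# of real measurements with a smaller learning rate (loaders are hypothetical).
model = AcousticField()
# fit(model, simulated_loader, steps=20_000, lr=1e-3)  # dense synthetic data
# fit(model, real_loader, steps=2_000, lr=1e-4)        # sparse real-world samples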
@inproceedings{chen2024RAF,
  author    = {Chen, Ziyang and
               Gebru, Israel D. and
               Richardt, Christian and
               Kumar, Anurag and
               Laney, William and
               Owens, Andrew and
               Richard, Alexander},
  title     = {Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark},
  booktitle = {The IEEE / CVF Computer Vision and Pattern Recognition Conference (CVPR)},
  year      = {2024},
}