Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark

CVPR 2024 (Highlight)

1 University of Michigan    2 Codec Avatars Lab, Pittsburgh PA, Meta
3 Reality Labs Research, Meta
(* Work done during an internship at Meta)

The Real Acoustic Fields dataset captures high-quality, densely sampled room impulse responses paired with multi-view images, recorded with our "Earful Tower" microphone rig and "Eyeful Tower" camera rig.

Demo Videos

Please note: unmute the audio and listen with headphones for the best experience.

Abstract

We present a new dataset called Real Acoustic Fields (RAF) that captures real acoustic room data from multiple modalities. The dataset includes high-quality and densely captured room impulse response data paired with multi-view images, and precise 6DoF pose tracking data for sound emitters and listeners in the rooms. We used this dataset to evaluate existing methods for novel-view acoustic synthesis and impulse response generation which previously relied on synthetic data. In our evaluation, we thoroughly assessed existing audio and audio-visual models against multiple criteria and proposed settings to enhance their performance on real-world data. We also conducted experiments to investigate the impact of incorporating visual data (i.e., images and depth) into neural acoustic field models. Additionally, we demonstrated the effectiveness of a simple sim2real approach, where a model is pre-trained with simulated data and fine-tuned with sparse real-world data, resulting in significant improvements in the few-shot setting. RAF is the first dataset to provide densely captured room acoustic data, making it an ideal resource for researchers working on audio and audio-visual neural acoustic field modeling techniques.

Data Capture

Our data collection pipeline uses separate (a) audio and (b) visual capture stages to ensure clean signals, with camera and RIR data aligned to a common coordinate system.

Audio Capturing

We capture audio data using our novel "Earful Tower" rig, equipped with 36 omnidirectional microphones, together with a customized height-adjustable loudspeaker. Impulse responses and real speech recordings are densely captured by moving the rig and loudspeaker to various walkable positions in the room.
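For readers unfamiliar with RIR measurement, below is a minimal sketch of one standard way to recover an impulse response from a recorded excitation: exponential-sine-sweep deconvolution (Farina's method). RAF's exact excitation signal and processing pipeline are not detailed on this page, so the sweep parameters and the helper functions exponential_sweep and estimate_rir are illustrative assumptions, not the dataset's actual tooling.

import numpy as np
from scipy.signal import fftconvolve

def exponential_sweep(f0, f1, duration, fs):
    """Exponential sine sweep from f0 to f1 Hz."""
    t = np.arange(int(duration * fs)) / fs
    rate = np.log(f1 / f0)
    return np.sin(2 * np.pi * f0 * duration / rate * (np.exp(t * rate / duration) - 1))

def estimate_rir(recording, f0, f1, duration, fs, rir_len):
    """Deconvolve a recorded sweep into a room impulse response.

    The inverse filter is the time-reversed sweep weighted by a
    -6 dB/octave envelope that flattens the sweep's pink spectrum.
    """
    sweep = exponential_sweep(f0, f1, duration, fs)
    t = np.arange(len(sweep)) / fs
    inverse = sweep[::-1] * np.exp(-t * np.log(f1 / f0) / duration)
    full = fftconvolve(recording, inverse, mode="full")
    onset = len(sweep) - 1  # direct sound lands here if the recording starts with the sweep
    return full[onset:onset + rir_len]

# Hypothetical usage: mic_signal is one microphone's recording of the played sweep.
# rir = estimate_rir(mic_signal, f0=20.0, f1=20000.0, duration=10.0, fs=48000, rir_len=48000)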


Visual Capturing

To achieve high-fidelity visual reconstruction and synthesize appearance from any viewpoint, we adopt the VR-NeRF method with its "Eyeful Tower" multi-camera rig. The rig is moved across the walkable floor area to densely capture each static scene.

Benchmark

We benchmark existing acoustic field models, including NAF, INRAS, NACF, and AV-NeRF. We also introduce two improved baselines, NAF++ and INRAS++.
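As a concrete example of one criterion such a benchmark can assess, the sketch below compares the reverberation time (T60) of a predicted RIR against a measured reference using Schroeder backward integration. T60 error is a common metric in this literature, but the paper's exact evaluation protocol is not reproduced here; estimate_t60 and t60_error_pct are illustrative names, and the sketch assumes RIRs whose decay reaches -35 dB.

import numpy as np

def schroeder_decay_db(rir):
    """Backward-integrated energy decay curve (EDC) in dB."""
    energy = np.cumsum(rir[::-1] ** 2)[::-1]
    return 10.0 * np.log10(energy / energy[0] + 1e-12)

def estimate_t60(rir, fs):
    """T60 via the T30 method: fit the -5 to -35 dB span of the EDC
    and extrapolate that slope to a full 60 dB decay."""
    edc = schroeder_decay_db(rir)
    i5 = np.argmax(edc <= -5.0)
    i35 = np.argmax(edc <= -35.0)  # assumes the decay actually reaches -35 dB
    slope = (edc[i35] - edc[i5]) * fs / (i35 - i5)  # dB per second (negative)
    return -60.0 / slope

def t60_error_pct(pred_rir, ref_rir, fs):
    """Relative T60 error in percent, predicted vs. reference RIR."""
    t_pred, t_ref = estimate_t60(pred_rir, fs), estimate_t60(ref_rir, fs)
    return 100.0 * abs(t_pred - t_ref) / t_ref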


Sim2real for Few-Shot RIR Synthesis

We also propose a sim2real approach that significantly improves few-shot RIR synthesis by pre-training neural acoustic fields on dense synthetic data and fine-tuning them on sparse real-world samples.
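A minimal PyTorch sketch of this recipe, assuming a toy coordinate-to-RIR model: train the same network twice, first on a dense simulated loader, then on a sparse real one with a lower learning rate. AcousticField, sim_loader, and real_loader are hypothetical placeholders, not the paper's architecture or data pipeline.

import torch
import torch.nn as nn

class AcousticField(nn.Module):
    """Toy stand-in for a neural acoustic field: maps emitter and
    listener 3D positions to a fixed-length RIR waveform."""
    def __init__(self, rir_len=4096):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(6, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, rir_len),
        )

    def forward(self, emitter, listener):
        return self.net(torch.cat([emitter, listener], dim=-1))

def train(model, loader, steps, lr):
    """One training stage; reused for pre-training and fine-tuning."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    batches = iter(loader)
    for _ in range(steps):
        try:
            emitter, listener, rir = next(batches)
        except StopIteration:
            batches = iter(loader)
            emitter, listener, rir = next(batches)
        loss = nn.functional.mse_loss(model(emitter, listener), rir)
        opt.zero_grad()
        loss.backward()
        opt.step()

# sim_loader / real_loader: hypothetical DataLoaders yielding (emitter, listener, rir).
model = AcousticField()
train(model, sim_loader, steps=100_000, lr=1e-3)  # pre-train on dense simulated RIRs
train(model, real_loader, steps=2_000, lr=1e-4)   # fine-tune on sparse real RIRs

The lower fine-tuning learning rate is the usual way to adapt to the sparse real samples without overwriting the prior learned from simulation.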

BibTeX

@inproceedings{chen2024RAF,
      author    = { Chen, Ziyang and 
                    Gebru, Israel D. and 
                    Richardt, Christian and 
                    Kumar, Anurag and
                    Laney, William and
                    Owens, Andrew and 
                    Richard, Alexander},
      title     = {Real Acoustic Fields: An Audio-Visual Room Acoustics Dataset and Benchmark},
      booktitle = {IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
      year      = {2024},
    }