Filtered noise shaping for time domain room impulse response estimation from reverberant speech


Christian J. Steinmetz1*   Vamsi Krishna Ithapu2   Paul Calamia2

1Centre for Digital Music, Queen Mary University of London, London, UK
2Facebook Reality Labs Research, Redmond, Washington, USA

Abstract


Deep learning approaches have emerged that aim to transform an audio signal so that it sounds as if it were recorded in the same room as a reference recording, with applications in both audio post-production and augmented reality. In this work, we propose FiNS, a Filtered Noise Shaping network that directly estimates the time domain room impulse response (RIR) from reverberant speech. Our domain-inspired architecture features a time domain encoder and a filtered noise shaping decoder that models the RIR as a summation of decaying filtered noise signals, along with direct sound and early reflection components. Previous methods for acoustic matching either utilize large models to transform audio to match the target room or predict parameters for algorithmic reverberators. Instead, blind estimation of the RIR enables efficient and realistic transformation with a single convolution. An evaluation demonstrates that our model not only synthesizes RIRs that match parameters of the target room, such as the reverberation time (T60) and direct-to-reverberant ratio (DRR), but also more accurately reproduces perceptual characteristics of the target room, as shown in a listening test when compared to deep learning baselines.


* Work done during an internship at Facebook Reality Labs Research.

FiNS: a filtered noise shaping RIR synthesis network architecture featuring a time domain encoder and a masking decoder, along with a learnable FIR filterbank.
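To make the filtered noise shaping idea concrete, the sketch below synthesizes a late-reverberation tail as a sum of decaying filtered noise signals, one per frequency band. This is a minimal illustration under assumed settings only: the band edges, decay times, and the helper name synthesize_late_reverb are inventions for this example, whereas FiNS learns its FIR filterbank and per-band decay envelopes (as well as the direct sound and early reflection components) from the reverberant speech itself.

    import numpy as np
    from scipy.signal import firwin, lfilter

    def synthesize_late_reverb(t60s, band_edges, fs=16000, length=32000, num_taps=511):
        # Sum of decaying filtered noise signals, one per frequency band.
        t = np.arange(length) / fs
        rir = np.zeros(length)
        for (lo, hi), t60 in zip(band_edges, t60s):
            noise = np.random.randn(length)                            # white noise source
            taps = firwin(num_taps, [lo, hi], pass_zero=False, fs=fs)  # band-pass FIR filter
            band = lfilter(taps, 1.0, noise)                           # filtered noise
            envelope = 10.0 ** (-3.0 * t / t60)                        # reaches -60 dB at t = T60
            rir += envelope * band                                     # decaying band contribution
        return rir / np.max(np.abs(rir))

    # Example: three bands with shorter decays at higher frequencies (2 s tail at 16 kHz).
    rir_tail = synthesize_late_reverb(
        t60s=[0.8, 0.6, 0.4],
        band_edges=[(125, 500), (500, 2000), (2000, 7900)],
    )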

Results


Here we include examples from the listening test using speech utterances from VCTK. For each utterance, the first row contains reverberant speech produced by convolving the RIR from each method with the original clean speech; the Reference corresponds to the original reverberant speech generated with the measured RIR of the target room. The second row contains the raw RIR produced by each method for comparison.
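As context for how these examples are generated, applying an estimated RIR to dry speech is the single convolution mentioned in the abstract. A minimal sketch with SciPy follows; the file names are hypothetical placeholders for this illustration.

    import numpy as np
    import soundfile as sf
    from scipy.signal import fftconvolve

    # File names below are hypothetical placeholders.
    speech, fs = sf.read("clean_speech.wav")        # dry VCTK utterance
    rir, _ = sf.read("estimated_rir.wav")           # RIR predicted by the model
    reverberant = fftconvolve(speech, rir)          # acoustic matching via one convolution
    reverberant /= np.max(np.abs(reverberant))      # peak-normalize to avoid clipping
    sf.write("reverberant_speech.wav", reverberant, fs)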


[Audio examples: for each of four VCTK utterances (F1, F2, M1, M2), Speech and RIR recordings are provided for Clean, Reference, Anchor, Wave-U-Net, FiNS (D), and FiNS.]

Speech recordings are reproduced from the CSTR VCTK Corpus (version 0.92).
This work is licensed under a Creative Commons Attribution 4.0 International License.


Citation


                
@inproceedings{steinmetz2021fins,
  title={Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech},
  author={Steinmetz, Christian J. and Ithapu, Vamsi Krishna and Calamia, Paul},
  booktitle={IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year={2021}
}