Christian J. Steinmetz1* Vamsi Krishna Ithapu2 Paul Calamia2
1Centre for Digital Music, Queen Mary University of London, London, UK
2Facebook Reality Labs Research, Redmond, Washington, USA
Deep learning approaches have emerged that aim to transform an audio signal so that it sounds as if it were recorded in the same room as a reference recording, with applications in both audio post-production and augmented reality. In this work, we propose FiNS, a Filtered Noise Shaping network that directly estimates the time domain room impulse response (RIR) from reverberant speech. Our domain-inspired architecture features a time domain encoder and a filtered noise shaping decoder that models the RIR as a summation of decaying filtered noise signals, along with direct sound and early reflection components. Previous methods for acoustic matching either utilize large models to transform audio to match the target room or predict parameters for algorithmic reverberators. Instead, blind estimation of the RIR enables efficient and realistic transformation with a single convolution. An evaluation demonstrates that our model not only synthesizes RIRs that match parameters of the target room, such as the T60 and DRR, but also more accurately reproduces perceptual characteristics of the target room, as shown in a listening test against deep learning baselines.
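The core idea of the decoder, modeling the RIR tail as a sum of decaying filtered noise signals, can be sketched outside the network. The snippet below is a minimal illustrative implementation, not the paper's learned decoder: the band edges, number of bands, and per-band T60 values are assumed for demonstration, whereas FiNS predicts the filters and decay envelopes from reverberant speech.

```python
import numpy as np
from scipy.signal import firwin, lfilter

def filtered_noise_rir(sr=48000, dur=1.0, n_bands=8, seed=0):
    """Sketch: synthesize an RIR tail as a sum of decaying filtered noise bands.

    All band edges and decay times below are hand-picked assumptions,
    standing in for the quantities a trained FiNS decoder would predict.
    """
    rng = np.random.default_rng(seed)
    n = int(sr * dur)
    t = np.arange(n) / sr
    rir = np.zeros(n)
    for b in range(n_bands):
        # Octave-like band edges (assumed, starting at 62.5 Hz).
        lo = 62.5 * 2 ** b
        hi = min(lo * 2, sr / 2 - 1)
        # Assumed per-band decay: shorter T60 at higher frequencies.
        t60 = 0.8 * 0.85 ** b
        # Bandpass white noise with a linear-phase FIR filter.
        taps = firwin(101, [lo, hi], fs=sr, pass_zero=False)
        band = lfilter(taps, 1.0, rng.standard_normal(n))
        # Exponential envelope reaching -60 dB at t = t60.
        env = 10.0 ** (-3.0 * t / t60)
        rir += env * band
    return rir / np.max(np.abs(rir))
```

A full RIR would additionally place a direct-sound impulse and sparse early reflections ahead of this noise-shaped tail, as the architecture describes.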
FiNS: Filtered noise shaping RIR synthesis network architecture featuring a time domain encoder and a masking decoder along with a learnable FIR filterbank.
Here we include examples from the listening test using speech utterances from VCTK. The first row contains reverberant speech examples produced by convolving an RIR from each method with the original clean speech. The Reference corresponds to the original reverberant speech signal generated with the measured RIR of the target room. The second row for each utterance includes the raw RIR produced from each method for comparison.
| Utterance | Clean | Reference | Anchor | Wave-U-Net | FiNS (D) | FiNS |
|---|---|---|---|---|---|---|
| F1 VCTK Speech | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| RIR | | (audio) | (audio) | (audio) | (audio) | (audio) |
| F2 VCTK Speech | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| RIR | | (audio) | (audio) | (audio) | (audio) | (audio) |
| M1 VCTK Speech | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| RIR | | (audio) | (audio) | (audio) | (audio) | (audio) |
| M2 VCTK Speech | (audio) | (audio) | (audio) | (audio) | (audio) | (audio) |
| RIR | | (audio) | (audio) | (audio) | (audio) | (audio) |
Speech recordings are reproduced from CSTR VCTK Corpus (version 0.92).
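Each reverberant example above is produced by convolving an estimated RIR with the clean utterance, the "single convolution" transformation noted in the abstract. A minimal sketch, where the peak normalization is an illustrative choice rather than the paper's procedure:

```python
import numpy as np
from scipy.signal import fftconvolve

def acoustic_match(dry, rir):
    """Apply an estimated RIR to a dry signal with one (FFT-based) convolution."""
    wet = fftconvolve(dry, rir)
    # Peak-normalize to avoid clipping (illustrative choice).
    return wet / max(1e-8, float(np.max(np.abs(wet))))
```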
This work is licensed under a Creative Commons Attribution 4.0 International License.
@inproceedings{steinmetz2021fins,
title={Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech},
author={Steinmetz, Christian J. and Ithapu, Vamsi Krishna and Calamia, Paul},
booktitle={IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
year={2021}
}