Filtered noise shaping for time domain room impulse response estimation from reverberant speech


Christian J. Steinmetz1*   Vamsi Krishna Ithapu2   Paul Calamia2

1Centre for Digital Music, Queen Mary University of London, London, UK
2Facebook Reality Labs Research, Redmond, Washington, USA

Abstract


Deep learning approaches have emerged that aim to transform an audio signal so that it sounds as if it were recorded in the same room as a reference recording, with applications in both audio post-production and augmented reality. In this work, we propose FiNS, a Filtered Noise Shaping network that directly estimates the time domain room impulse response (RIR) from reverberant speech. Our domain-inspired architecture features a time domain encoder and a filtered noise shaping decoder that models the RIR as a summation of decaying filtered noise signals, along with direct sound and early reflection components. Previous methods for acoustic matching either utilize large models to transform audio to match the target room or predict parameters for algorithmic reverberators. Instead, blind estimation of the RIR enables efficient and realistic transformation with a single convolution. An evaluation demonstrates that our model not only synthesizes RIRs that match parameters of the target room, such as the reverberation time (T60) and direct-to-reverberant ratio (DRR), but also more accurately reproduces perceptual characteristics of the target room, as shown in a listening test when compared to deep learning baselines.


* Work done during an internship at Facebook Reality Labs Research.

FiNS: a filtered noise shaping RIR synthesis network architecture featuring a time domain encoder and a masking decoder, along with a learnable FIR filterbank.
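To make the filtered noise shaping idea concrete, the sketch below synthesizes a late-reverberation tail as a sum of decaying filtered noise signals, one per frequency band. This is a minimal illustration under assumed settings only: the band edges, decay times, and the helper name synthesize_late_reverb are inventions for this example, whereas FiNS learns its FIR filterbank and per-band decay envelopes (as well as the direct sound and early reflection components) from the reverberant speech itself.

    import numpy as np
    from scipy.signal import firwin, lfilter

    def synthesize_late_reverb(t60s, band_edges, fs=16000, length=32000, num_taps=511):
        # Sum of decaying filtered noise signals, one per frequency band.
        t = np.arange(length) / fs
        rir = np.zeros(length)
        for (lo, hi), t60 in zip(band_edges, t60s):
            noise = np.random.randn(length)                            # white noise source
            taps = firwin(num_taps, [lo, hi], pass_zero=False, fs=fs)  # band-pass FIR filter
            band = lfilter(taps, 1.0, noise)                           # filtered noise
            envelope = 10.0 ** (-3.0 * t / t60)                        # reaches -60 dB at t = T60
            rir += envelope * band                                     # decaying band contribution
        return rir / np.max(np.abs(rir))

    # Example: three bands with shorter decays at higher frequencies (2 s tail at 16 kHz).
    rir_tail = synthesize_late_reverb(
        t60s=[0.8, 0.6, 0.4],
        band_edges=[(125, 500), (500, 2000), (2000, 7900)],
    )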

Results


Here we include examples from the listening test using speech utterances from VCTK. For each utterance, the first row contains reverberant speech produced by convolving the RIR from each method with the original clean speech; the Reference corresponds to the original reverberant speech generated with the measured RIR of the target room. The second row contains the raw RIR produced by each method for comparison.
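As context for how these examples are generated, applying an estimated RIR to dry speech is the single convolution mentioned in the abstract. A minimal sketch with SciPy follows; the file names are hypothetical placeholders for this illustration.

    import numpy as np
    import soundfile as sf
    from scipy.signal import fftconvolve

    # File names below are hypothetical placeholders.
    speech, fs = sf.read("clean_speech.wav")        # dry VCTK utterance
    rir, _ = sf.read("estimated_rir.wav")           # RIR predicted by the model
    reverberant = fftconvolve(speech, rir)          # acoustic matching via one convolution
    reverberant /= np.max(np.abs(reverberant))      # peak-normalize to avoid clipping
    sf.write("reverberant_speech.wav", reverberant, fs)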


[Audio examples: for each of four VCTK utterances (F1, F2, M1, M2), Speech and RIR recordings are provided for Clean, Reference, Anchor, Wave-U-Net, FiNS (D), and FiNS.]

Speech recordings are reproduced from the CSTR VCTK Corpus (version 0.92).
This work is licensed under a Creative Commons Attribution 4.0 International License.


Citation


                
@inproceedings{steinmetz2021fins,
  title={Filtered Noise Shaping for Time Domain Room Impulse Response Estimation From Reverberant Speech},
  author={Steinmetz, Christian J. and Ithapu, Vamsi Krishna and Calamia, Paul},
  booktitle={IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)},
  year={2021}
}