Real Time Speech Enhancement in the Waveform Domain

We present here audio samples for the causal Demucs model trained on the DNS challenge dataset, as presented in the paper Real Time Speech Enhancement in the Waveform Domain. We used the causal Demucs with H=64 and Revecho augmentation with partial dereverberation (10% of the reverb kept), and we added back 1% of the dry signal.
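To make the last setting concrete, here is a minimal sketch of what adding back 1% of the dry signal could look like. It assumes that "dry" refers to the unprocessed noisy input and that the mix is a simple linear blend. The exact formula is not given on this page, and the function name mix_dry is hypothetical.

```python
def mix_dry(noisy, enhanced, dry=0.01):
    """Blend a fraction `dry` of the unprocessed input back into the output.

    Assumption: a linear dry/wet mix, where dry=0 keeps only the enhanced
    signal and dry=1 keeps only the noisy input.
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced must have the same length")
    return [dry * n + (1.0 - dry) * e for n, e in zip(noisy, enhanced)]


if __name__ == "__main__":
    noisy = [1.0, 0.0, -0.5]
    enhanced = [0.0, 1.0, -0.5]
    print(mix_dry(noisy, enhanced, dry=0.01))
```

Keeping a small amount of the input in the output is a common way to make an enhanced signal sound less processed. Whether that is the motivation here is not stated.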

We used a specific causal implementation for evaluation, which feeds the model audio frames of 40 ms with a stride of 16 ms. The model outputs a prediction for the left-most 16 ms of each input frame. On a quad-core Intel i7-8565U CPU (2.0 GHz, up to the AVX2 instruction set), each frame takes about 16 ms to evaluate, which allows real-time speech enhancement on a laptop. The model weighs 135 MB, with future work planned on quantization.
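The framing described above can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes a 16 kHz sample rate, which this page does not state. The names frame_schedule and stream_enhance are hypothetical, and model stands for any callable that maps a frame to a prediction of the same length. Any internal state that a real causal model carries between frames is left out.

```python
def frame_schedule(num_samples, sample_rate=16_000, frame_ms=40, stride_ms=16):
    """List (start, end, out_end) sample indices for each streaming frame.

    The model sees signal[start:end]; only signal[start:out_end], the
    left-most `stride_ms` of the frame, is emitted.
    """
    frame = sample_rate * frame_ms // 1000    # 640 samples at 16 kHz
    stride = sample_rate * stride_ms // 1000  # 256 samples at 16 kHz
    schedule = []
    start = 0
    while start + frame <= num_samples:
        schedule.append((start, start + frame, start + stride))
        start += stride
    return schedule


def stream_enhance(signal, model, sample_rate=16_000, frame_ms=40, stride_ms=16):
    """Run `model` frame by frame and keep the left-most stride of each output."""
    out = []
    for start, end, out_end in frame_schedule(
        len(signal), sample_rate, frame_ms, stride_ms
    ):
        prediction = model(signal[start:end])
        out.extend(prediction[: out_end - start])
    return out


if __name__ == "__main__":
    one_second = [float(i) for i in range(16_000)]
    identity = lambda frame: frame  # stand-in for the enhancement model
    enhanced = stream_enhance(one_second, identity)
    print(len(frame_schedule(len(one_second))), len(enhanced))
```

Under these assumptions each frame holds 640 samples and advances by 256. The model therefore sees 24 ms of audio beyond the 16 ms chunk it emits. Since each 16 ms hop takes about 16 ms to evaluate, the real-time factor is roughly 1.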

Real life samples

The following samples are taken from the authors' daily lives, in real noisy conditions. We feature languages other than English to test how well the model adapts.

Noisy (French):