Real Time Speech Enhancement in the Waveform Domain

We present here audio samples for the causal Demucs model trained on the DNS challenge dataset, as presented in the paper Real Time Speech Enhancement in the Waveform Domain. We used the causal Demucs with H=64, trained with Revecho augmentation and partial dereverberation (10% of the reverb kept), and with 1% of the dry signal added back.
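As an illustration of what "adding back 1% of the dry signal" can mean, here is a minimal sketch that blends a small fraction of the unprocessed input into the enhanced output. The function name `mix_dry` is hypothetical, and scaling the enhanced signal by `1 - dry` is an assumption; the page only states the 1% figure.

```python
def mix_dry(noisy, enhanced, dry=0.01):
    """Blend a fraction `dry` of the unprocessed (dry) input into the enhanced output.

    Assumption: the enhanced signal is scaled by (1 - dry) so the overall
    level stays roughly unchanged. The source only states that 1% is added back.
    """
    if len(noisy) != len(enhanced):
        raise ValueError("noisy and enhanced must have the same length")
    return [dry * n + (1.0 - dry) * e for n, e in zip(noisy, enhanced)]
```

With `dry=0.01` the output is dominated by the enhanced signal, and a faint trace of the original recording remains underneath it.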

We used a specific causal implementation for evaluation, which feeds the model audio frames of 40 ms with a stride of 16 ms. The model outputs a prediction for the left-most 16 ms of each input frame. On a quad-core Intel i7-8565U CPU (2.0 GHz, up to the AVX2 instruction set), evaluating one frame takes just about 16 ms, which allows real-time speech enhancement on a laptop. The model weighs 135 MB; quantization is planned as future work.
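The framing scheme described above can be sketched as follows. This is not the authors' implementation: the 16 kHz sample rate is an assumption (the page does not state it), `model` is a hypothetical callable that returns one output sample per input sample, and any state the real model carries between frames is ignored.

```python
def stream_frames(num_samples, sample_rate=16_000, frame_ms=40, stride_ms=16):
    """Yield (start, end, out_end) sample indices for every full frame.

    `start:end` is the frame fed to the model; `start:out_end` is the
    left-most stride of that frame, for which the prediction is kept.
    """
    frame = sample_rate * frame_ms // 1000    # 640 samples at the assumed 16 kHz
    stride = sample_rate * stride_ms // 1000  # 256 samples at the assumed 16 kHz
    start = 0
    while start + frame <= num_samples:
        yield start, start + frame, start + stride
        start += stride


def enhance_streaming(samples, model, sample_rate=16_000):
    """Run `model` frame by frame and keep the first stride of each output."""
    out = []
    for start, end, out_end in stream_frames(len(samples), sample_rate):
        frame_out = model(samples[start:end])       # hypothetical model call
        out.extend(frame_out[: out_end - start])    # keep the left-most 16 ms
    return out
```

Under these assumptions each prediction sees 40 - 16 = 24 ms of audio beyond the 16 ms it outputs, and a processing time of about 16 ms per 16 ms stride means the model just keeps pace with incoming audio.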

Real life samples

The following samples are taken from the authors' daily life, in real noisy conditions. We feature languages other than English to test how well the model adapts.
Noisy (French): Enhanced:
Noisy (Hebrew): Enhanced:
Noisy (Hebrew): Enhanced:
Noisy (French): Enhanced:

Samples from the blind test set of the DNS dataset

The following samples are taken from the blind test set of the DNS challenge dataset.

Category "reverb"

The following samples are artificial mixtures with reverb added, from the DNS blind test set.

Noisy: Enhanced:
Noisy: Enhanced:

Category "real rec"

The following samples are real recordings from the DNS blind test set.

Noisy: Enhanced:
Noisy: Enhanced:
Noisy: Enhanced:

Real-time demo

Watch the video in the following link for a real-time demo presentation: Demo.