Speech Enhancement Baseline Models

Baseline Models

This page describes the baseline speech enhancement models, their implementations, and their performance results.

Model Architectures

SE-Conformer (Best Performance)

A speech enhancement model combining convolutional neural networks with the Conformer architecture.

Model Parameters

  • Input dimension: 512
  • FFN dimension: 64
  • Attention heads: 4
  • Convolution kernel: 15
  • Conformer depth: 4
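
The parameters above can be collected in a small configuration object to check that they are mutually consistent. This is only a sketch: the class name, field names, and helper methods are hypothetical and are not taken from the baseline code. It assumes the input dimension is split evenly across attention heads and that the convolution module uses symmetric padding to preserve sequence length.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SEConformerConfig:
    """Hypothetical container for the SE-Conformer hyperparameters listed above."""
    input_dim: int = 512    # feature dimension fed to the Conformer blocks
    ffn_dim: int = 64       # hidden size of each feed-forward module
    num_heads: int = 4      # attention heads
    conv_kernel: int = 15   # kernel size of the convolution module
    depth: int = 4          # number of stacked Conformer blocks

    def head_dim(self) -> int:
        # Assumption: the model dimension is divided evenly among the heads.
        if self.input_dim % self.num_heads != 0:
            raise ValueError("input_dim must be divisible by num_heads")
        return self.input_dim // self.num_heads

    def same_padding(self) -> int:
        # Assumption: an odd kernel with symmetric padding keeps the length fixed.
        if self.conv_kernel % 2 == 0:
            raise ValueError("conv_kernel must be odd for symmetric padding")
        return (self.conv_kernel - 1) // 2


cfg = SEConformerConfig()
print(cfg.head_dim(), cfg.same_padding())  # 128 7
```

With the listed values, each head would work on 128 dimensions and the convolution would need 7 samples of padding per side.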

Key Features

  • Multi-head self-attention
  • Convolution modules
  • Feed-forward networks
  • Layer normalization

Demucs

Multi-layer convolutional encoder-decoder architecture with U-Net-style skip connections.

Model Parameters

  • Kernel size: 8
  • Hidden channels: 64
  • Stride: 2
  • Depth: 5
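
To see what these values imply, the sketch below tracks how the time axis shrinks through the encoder. It assumes unpadded strided 1-D convolutions and a 16 kHz sample rate, neither of which is stated on this page; the function name is hypothetical.

```python
def encoder_lengths(num_samples, kernel_size=8, stride=2, depth=5):
    """Time-axis length after each encoder layer, assuming unpadded strided convolutions."""
    lengths = []
    length = num_samples
    for _ in range(depth):
        if length < kernel_size:
            raise ValueError("input too short for this encoder depth")
        # Standard output-length formula for a convolution without padding.
        length = (length - kernel_size) // stride + 1
        lengths.append(length)
    return lengths


# Assumption: 2-second segment at 16 kHz -> 32000 samples.
print(encoder_lengths(32000))  # [15997, 7995, 3994, 1994, 994]
```

Under these assumptions, each layer roughly halves the sequence, so the bottleneck sees about 1/32 of the input length.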

Architecture

  • Convolutional encoder-decoder
  • Bi-directional LSTM
  • GLU activation
  • Skip connections
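
The GLU (gated linear unit) activation in the list above halves the channel count: one half of the channels carries values, and the other half passes through a sigmoid to act as a gate. A minimal sketch for a single time step follows; the function name and the list-based representation are illustrative only.

```python
import math


def glu(channels):
    """Gated linear unit over one time step: first half values, second half gates."""
    if len(channels) % 2 != 0:
        raise ValueError("GLU needs an even number of channels")
    half = len(channels) // 2
    values, gates = channels[:half], channels[half:]
    # sigmoid(g) written as 0.5 * (1 + tanh(g / 2)) to avoid overflow for large |g|.
    return [v * 0.5 * (1.0 + math.tanh(g / 2.0)) for v, g in zip(values, gates)]


print(glu([1.0, 2.0, 0.0, 0.0]))  # [0.5, 1.0]
```

Because the output has half as many channels as the input, a layer that feeds a GLU typically produces twice the channels it needs.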

TSTNN

Transformer-based model for end-to-end speech enhancement in the time domain.

Components

  • Encoder with dilated dense blocks
  • Two-stage transformer module
  • Masking module
  • Decoder with sub-pixel convolution
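
Sub-pixel convolution upsamples by letting a convolution produce extra channels and then rearranging those channels along the time axis. The sketch below shows only the rearrangement step, in a 1-D form. The channel ordering is one common convention, and the function name is hypothetical; the baseline's exact layout may differ.

```python
def subpixel_shuffle_1d(features, upscale):
    """Rearrange [C * r][T] features into [C][T * r] by interleaving channel groups."""
    if len(features) % upscale != 0:
        raise ValueError("channel count must be divisible by the upscale factor")
    out_channels = len(features) // upscale
    steps = len(features[0])
    output = []
    for c in range(out_channels):
        row = []
        for t in range(steps):
            # The r channels of group c supply the r consecutive output samples.
            for i in range(upscale):
                row.append(features[c * upscale + i][t])
        output.append(row)
    return output


print(subpixel_shuffle_1d([[1, 2], [3, 4]], upscale=2))  # [[1, 3, 2, 4]]
```

Here two channels of length 2 become one channel of length 4, doubling the temporal resolution without a transposed convolution.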

Processing

  • Local-global feature processing
  • Multi-head attention
  • GRU layers
  • Group normalization
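
One way to picture local-global processing is a dual-path layout: the sequence is cut into overlapping chunks, a local stage attends within each chunk, and a global stage attends across chunks at the same position. The sketch below shows only the data arrangement, not the attention itself. The chunk length, hop, zero-padding, and function names are all assumptions made for illustration.

```python
def split_into_chunks(frames, chunk_len, hop):
    """Split a sequence into overlapping chunks, zero-padding the final chunk."""
    chunks = []
    start = 0
    while True:
        chunk = list(frames[start:start + chunk_len])
        chunk += [0.0] * (chunk_len - len(chunk))
        chunks.append(chunk)
        if start + chunk_len >= len(frames):
            break
        start += hop
    return chunks


def global_view(chunks):
    """Transpose so each row holds the same within-chunk position across all chunks."""
    return [list(column) for column in zip(*chunks)]


frames = [float(n) for n in range(8)]
local = split_into_chunks(frames, chunk_len=4, hop=2)  # rows: one chunk each
print(local)               # [[0.0, 1.0, 2.0, 3.0], [2.0, 3.0, 4.0, 5.0], [4.0, 5.0, 6.0, 7.0]]
print(global_view(local))  # [[0.0, 2.0, 4.0], [1.0, 3.0, 5.0], [2.0, 4.0, 6.0], [3.0, 5.0, 7.0]]
```

A local stage would process each row of the first output, and a global stage each row of the second.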

Performance Results

Evaluation Metrics

Model          PESQ   STOI   CER (%)   WER (%)
Throat Mic     1.22   0.70   84.4      92.2
TSTNN          2.00   0.89   25.7      54.3
Demucs         1.86   0.89   17.1      47.8
SE-Conformer   2.08   0.90   12.5      43.3
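
CER and WER are conventionally computed as the edit distance between a reference transcript and a recognizer's output, divided by the reference length. The sketch below illustrates that standard definition only. It is not the evaluation script behind the table, and whitespace tokenization and counting spaces as characters are assumptions.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two token sequences (row-by-row dynamic programming)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_token in enumerate(reference, start=1):
        current = [i]
        for j, hyp_token in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,                               # deletion
                current[j - 1] + 1,                            # insertion
                previous[j - 1] + (ref_token != hyp_token),    # substitution or match
            ))
        previous = current
    return previous[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, using whitespace tokenization."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent; spaces count as characters here."""
    if not reference:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


print(round(wer("the cat sat", "the cat sit"), 2))  # 33.33
print(round(cer("the cat sat", "the cat sit"), 2))  # 9.09
```

Lower is better for both error rates, while higher is better for PESQ and STOI.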

Implementation Details

Training Configuration

  • Epochs: 200
  • Optimizer: Adam (lr=3e-4)
  • Batch size: 16
  • Segment length: 2 seconds
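
Training on fixed 2-second segments usually means cropping each input/target pair at the same random offset so that the two signals stay time-aligned. The sketch below shows one way to do this. The 16 kHz sample rate, zero-padding of short clips, and function name are assumptions, not details confirmed by this page.

```python
import random


def crop_pair(noisy, clean, sample_rate=16000, seconds=2.0, rng=random):
    """Crop an aligned (input, target) pair to the same fixed-length segment."""
    if len(noisy) != len(clean):
        raise ValueError("paired signals must have equal length")
    segment_len = int(sample_rate * seconds)
    if len(noisy) <= segment_len:
        # Assumption: clips shorter than one segment are zero-padded at the end.
        pad = [0.0] * (segment_len - len(noisy))
        return list(noisy) + pad, list(clean) + pad
    # One shared offset keeps the two signals sample-aligned.
    start = rng.randrange(len(noisy) - segment_len + 1)
    return noisy[start:start + segment_len], clean[start:start + segment_len]


noisy = [0.1] * 48000   # placeholder 3-second input signal
clean = [0.2] * 48000   # placeholder 3-second target signal
seg_noisy, seg_clean = crop_pair(noisy, clean, rng=random.Random(0))
print(len(seg_noisy), len(seg_clean))  # 32000 32000
```

Passing a seeded `random.Random` instance makes the crops reproducible.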

Loss Functions

  • L1 loss on waveform
  • Multi-resolution STFT loss
  • Time-frequency domain loss
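
To make the first two loss terms concrete, the sketch below combines a waveform L1 loss with a multi-resolution STFT loss, taken here as spectral convergence plus log-magnitude L1 averaged over several FFT sizes. It uses a naive pure-Python DFT on tiny frames so that it runs without dependencies. The resolutions, the equal weighting, and all function names are assumptions; a real training loop would use a tensor library's STFT with much larger frames.

```python
import cmath
import math


def stft_magnitude(signal, n_fft, hop):
    """Magnitude spectrogram from a naive Hann-windowed DFT (no padding)."""
    window = [0.5 - 0.5 * math.cos(2 * math.pi * n / n_fft) for n in range(n_fft)]
    # DFT basis for the non-negative frequency bins.
    basis = [[cmath.exp(-2j * math.pi * k * n / n_fft) for n in range(n_fft)]
             for k in range(n_fft // 2 + 1)]
    frames = []
    for start in range(0, len(signal) - n_fft + 1, hop):
        segment = [signal[start + n] * window[n] for n in range(n_fft)]
        frames.append([abs(sum(s * b for s, b in zip(segment, row))) for row in basis])
    return frames


def l1_waveform_loss(estimate, target):
    """Mean absolute difference between two equal-length waveforms."""
    return sum(abs(e - t) for e, t in zip(estimate, target)) / len(target)


def stft_loss(estimate, target, n_fft, hop, eps=1e-8):
    """Spectral convergence plus log-magnitude L1 at one resolution."""
    est = [v for frame in stft_magnitude(estimate, n_fft, hop) for v in frame]
    ref = [v for frame in stft_magnitude(target, n_fft, hop) for v in frame]
    convergence = (math.sqrt(sum((r - e) ** 2 for e, r in zip(est, ref)))
                   / (math.sqrt(sum(r * r for r in ref)) + eps))
    log_mag = sum(abs(math.log(r + eps) - math.log(e + eps))
                  for e, r in zip(est, ref)) / len(ref)
    return convergence + log_mag


def enhancement_loss(estimate, target, resolutions=((16, 4), (32, 8), (64, 16))):
    """Waveform L1 plus the STFT loss averaged over (n_fft, hop) pairs (assumed weights)."""
    multi_res = sum(stft_loss(estimate, target, n, h) for n, h in resolutions) / len(resolutions)
    return l1_waveform_loss(estimate, target) + multi_res


# Signals must be at least as long as the largest n_fft.
target = [math.sin(2 * math.pi * 5 * n / 128) for n in range(128)]
estimate = [t + 0.1 * (-1) ** n for n, t in enumerate(target)]
print(enhancement_loss(target, target))        # 0.0
print(enhancement_loss(estimate, target) > 0)  # True
```

Using several resolutions trades off time and frequency detail, so errors that are hidden at one frame size can still be penalized at another.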

Code and Resources