Speech Enhancement
Baseline Models
Detailed information about baseline models, their implementations, and performance results.
Model Architectures
SE-Conformer (Best Performance)
A speech enhancement model combining convolutional neural networks with the Conformer architecture; a minimal sketch of a single Conformer block follows the lists below.
Model Parameters
- Input dimension: 512
- FFN dimension: 64
- Attention heads: 4
- Convolution kernel: 15
- Conformer depth: 4
Key Features
- Multi-head self-attention
- Convolution modules
- Feed-forward networks
- Layer normalization
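For orientation, here is a minimal PyTorch sketch of a single Conformer block using the parameters above (input dim 512, FFN dim 64, 4 heads, kernel 15, depth 4). It is a generic illustration of the block structure, not the authors' implementation; module and variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvModule(nn.Module):
    """Conformer convolution module: pointwise conv + GLU, depthwise conv, BatchNorm, pointwise conv."""

    def __init__(self, dim, kernel_size=15):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, 1)
        self.dw = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pw2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                         # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)           # -> (batch, dim, time) for Conv1d
        y = F.glu(self.pw1(y), dim=1)
        y = F.silu(self.bn(self.dw(y)))
        return self.pw2(y).transpose(1, 2)         # back to (batch, time, dim)


class ConformerBlock(nn.Module):
    """Half-step FFN -> multi-head self-attention -> conv module -> half-step FFN -> LayerNorm."""

    def __init__(self, dim=512, ffn_dim=64, heads=4, kernel_size=15):
        super().__init__()

        def make_ffn():
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ffn_dim),
                                 nn.SiLU(), nn.Linear(ffn_dim, dim))

        self.ffn1, self.ffn2 = make_ffn(), make_ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim, kernel_size)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (batch, time, dim)
        x = x + 0.5 * self.ffn1(x)                 # half-step feed-forward (residual)
        q = self.attn_norm(x)
        x = x + self.attn(q, q, q, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ffn2(x)
        return self.out_norm(x)


# Stack blocks to the listed depth of 4; shape is preserved end to end.
encoder = nn.Sequential(*[ConformerBlock() for _ in range(4)])
print(encoder(torch.randn(2, 100, 512)).shape)     # torch.Size([2, 100, 512])
```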
Demucs
A multi-layer convolutional encoder-decoder architecture with U-Net-style skip connections; a structural sketch follows the lists below.
Model Parameters
- Kernel size: 8
- Hidden channels: 64
- Stride: 2
- Depth: 5
Architecture
- Convolutional encoder-decoder
- Bi-directional LSTM
- GLU activation
- Skip connections
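A rough PyTorch sketch of this shape, using the parameters above (kernel 8, stride 2, 64 hidden channels, depth 5), is shown below. The per-layer channel doubling, the single-channel input, and the crop-based length matching are our assumptions for illustration, not details taken from this page.

```python
import torch
import torch.nn as nn


class DemucsLike(nn.Module):
    """Convolutional encoder, BiLSTM bottleneck, convolutional decoder with U-Net skips."""

    def __init__(self, hidden=64, depth=5, kernel=8, stride=2):
        super().__init__()
        self.encoder, self.decoder = nn.ModuleList(), nn.ModuleList()
        in_ch, ch = 1, hidden
        for i in range(depth):
            # Encoder layer: strided conv + ReLU, then 1x1 conv + GLU.
            self.encoder.append(nn.Sequential(
                nn.Conv1d(in_ch, ch, kernel, stride), nn.ReLU(),
                nn.Conv1d(ch, 2 * ch, 1), nn.GLU(dim=1)))
            # Matching decoder layer, inserted at the front so the list runs deep -> shallow.
            self.decoder.insert(0, nn.Sequential(
                nn.Conv1d(ch, 2 * ch, 1), nn.GLU(dim=1),
                nn.ConvTranspose1d(ch, in_ch, kernel, stride),
                nn.ReLU() if i > 0 else nn.Identity()))   # no ReLU on the final output
            in_ch, ch = ch, 2 * ch
        self.lstm = nn.LSTM(in_ch, in_ch // 2, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, x):                           # x: (batch, 1, samples)
        length = x.shape[-1]
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        y, _ = self.lstm(x.transpose(1, 2))         # BiLSTM over time at the bottleneck
        x = y.transpose(1, 2)
        for dec in self.decoder:
            skip = skips.pop()
            x = x + skip[..., :x.shape[-1]]         # U-Net skip, cropped to the current length
            x = dec(x)
        return x[..., :length]                      # output length is at most the input length


model = DemucsLike()
out = model(torch.randn(1, 1, 32000))               # roughly 2 s of 16 kHz audio
```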
TSTNN
A transformer-based model for end-to-end speech enhancement in the time domain; a sketch of the two-stage (local/global) processing follows the lists below.
Components
- Encoder with dilated dense blocks
- Two-stage transformer module
- Masking module
- Decoder with sub-pixel convolution
Processing
- Local-global feature processing
- Multi-head attention
- GRU layers
- Group normalization
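The sketch below illustrates only the local/global two-stage idea on a segmented representation of shape (batch, channels, frames, frame length): one transformer layer attends within each frame, a second attends across frames, each followed by group normalization and a residual connection. It uses stock `nn.TransformerEncoderLayer` rather than TSTNN's improved transformer with GRU layers, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn


class TwoStageTransformer(nn.Module):
    """Local (intra-frame) then global (inter-frame) attention over segmented features."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.local = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                                batch_first=True)
        self.glob = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                               batch_first=True)
        self.local_norm = nn.GroupNorm(1, dim)
        self.glob_norm = nn.GroupNorm(1, dim)

    def forward(self, x):                           # x: (batch, channels, frames, frame_len)
        b, c, f, t = x.shape
        # Stage 1: attend over positions inside each frame.
        loc = self.local(x.permute(0, 2, 3, 1).reshape(b * f, t, c))
        loc = loc.reshape(b, f, t, c).permute(0, 3, 1, 2)
        x = x + self.local_norm(loc.reshape(b, c, f * t)).reshape(b, c, f, t)
        # Stage 2: attend across frames at each intra-frame position.
        glo = self.glob(x.permute(0, 3, 2, 1).reshape(b * t, f, c))
        glo = glo.reshape(b, t, f, c).permute(0, 3, 2, 1)
        x = x + self.glob_norm(glo.reshape(b, c, f * t)).reshape(b, c, f, t)
        return x


stage = TwoStageTransformer()
print(stage(torch.randn(2, 64, 50, 32)).shape)      # shape is preserved: (2, 64, 50, 32)
```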
Performance Results
Evaluation Metrics
| Model | PESQ | STOI | CER (%) | WER (%) |
|---|---|---|---|---|
| Throat Mic | 1.22 | 0.70 | 84.4 | 92.2 |
| TSTNN | 2.00 | 0.89 | 25.7 | 54.3 |
| Demucs | 1.86 | 0.89 | 17.1 | 47.8 |
| SE-Conformer | 2.08 | 0.90 | 12.5 | 43.3 |
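PESQ and STOI are signal-level quality and intelligibility scores (higher is better), while CER and WER are character and word error rates from a speech recognizer run on the audio (lower is better). One common way to compute the signal metrics, assuming the `pesq` and `pystoi` packages and 16 kHz recordings (the exact tooling used for these results is not specified here), is sketched below.

```python
import soundfile as sf
from pesq import pesq
from pystoi import stoi

fs = 16000                                       # assumed sampling rate
ref, _ = sf.read("clean.wav")                    # reference recording (placeholder path)
deg, _ = sf.read("enhanced.wav")                 # enhanced output (placeholder path)
n = min(len(ref), len(deg))                      # align lengths before scoring

print("PESQ:", pesq(fs, ref[:n], deg[:n], "wb"))             # wide-band PESQ
print("STOI:", stoi(ref[:n], deg[:n], fs, extended=False))   # 0-1 intelligibility score
```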
Implementation Details
Training Configuration
- Epochs: 200
- Optimizer: Adam (lr=3e-4)
- Batch size: 16
- Segment length: 2 seconds
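A minimal training-loop skeleton matching this configuration (Adam at lr 3e-4, batch size 16, 2-second segments, 200 epochs) is shown below; the model and data are stand-ins, and a 16 kHz sampling rate is assumed.

```python
import torch
import torch.nn as nn

model = nn.Conv1d(1, 1, 9, padding=4)             # placeholder enhancement model
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

# Toy corpus of 2-second (noisy, clean) waveform pairs at an assumed 16 kHz.
noisy = torch.randn(64, 1, 2 * 16000)
clean = torch.randn(64, 1, 2 * 16000)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(noisy, clean), batch_size=16, shuffle=True)

for epoch in range(200):
    for x, y in loader:
        loss = nn.functional.l1_loss(model(x), y)  # waveform L1 term (see losses below)
        opt.zero_grad()
        loss.backward()
        opt.step()
```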
Loss Functions
- L1 loss on waveform
- Multi-resolution STFT loss
- Time-frequency domain loss
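As a sketch of how the waveform L1 and multi-resolution STFT terms can be combined, assuming common FFT sizes and equal weighting (the exact resolutions and weights used here are not stated):

```python
import torch
import torch.nn.functional as F


def stft_mag(x, n_fft, hop, win):
    """Magnitude spectrogram of a (batch, samples) waveform."""
    window = torch.hann_window(win, device=x.device)
    return torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True).abs()


def multi_res_stft_loss(est, ref,
                        resolutions=((512, 128, 512), (1024, 256, 1024), (2048, 512, 2048))):
    """Spectral-convergence + log-magnitude L1 terms, averaged over several resolutions."""
    total = 0.0
    for n_fft, hop, win in resolutions:
        e, r = stft_mag(est, n_fft, hop, win), stft_mag(ref, n_fft, hop, win)
        sc = torch.norm(r - e) / torch.norm(r)            # spectral convergence
        mag = F.l1_loss(torch.log(e + 1e-7), torch.log(r + 1e-7))
        total = total + sc + mag
    return total / len(resolutions)


def enhancement_loss(est, ref):
    """Waveform L1 plus the multi-resolution STFT loss; est/ref are (batch, samples)."""
    return F.l1_loss(est, ref) + multi_res_stft_loss(est, ref)
```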
Code and Resources
Access our implementation and experiment with the baseline models: