Speech Enhancement
Baseline Models
Detailed information about baseline models, their implementations, and performance results.
Model Architectures
SE-Conformer (Best Performance)
A speech enhancement model combining convolutional neural networks with the Conformer architecture; a minimal sketch of a single Conformer block follows the lists below.
Model Parameters
- Input dimension: 512
- FFN dimension: 64
- Attention heads: 4
- Convolution kernel: 15
- Conformer depth: 4
Key Features
- Multi-head self-attention
- Convolution modules
- Feed-forward networks
- Layer normalization
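For orientation, here is a minimal PyTorch sketch of a single Conformer block using the parameters above (input dim 512, FFN dim 64, 4 heads, kernel 15, depth 4). It is a generic illustration of the block structure, not the authors' implementation; module and variable names are ours.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ConvModule(nn.Module):
    """Conformer convolution module: pointwise conv + GLU, depthwise conv, BatchNorm, pointwise conv."""

    def __init__(self, dim, kernel_size=15):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.pw1 = nn.Conv1d(dim, 2 * dim, 1)
        self.dw = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        self.bn = nn.BatchNorm1d(dim)
        self.pw2 = nn.Conv1d(dim, dim, 1)

    def forward(self, x):                         # x: (batch, time, dim)
        y = self.norm(x).transpose(1, 2)           # -> (batch, dim, time) for Conv1d
        y = F.glu(self.pw1(y), dim=1)
        y = F.silu(self.bn(self.dw(y)))
        return self.pw2(y).transpose(1, 2)         # back to (batch, time, dim)


class ConformerBlock(nn.Module):
    """Half-step FFN -> multi-head self-attention -> conv module -> half-step FFN -> LayerNorm."""

    def __init__(self, dim=512, ffn_dim=64, heads=4, kernel_size=15):
        super().__init__()

        def make_ffn():
            return nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, ffn_dim),
                                 nn.SiLU(), nn.Linear(ffn_dim, dim))

        self.ffn1, self.ffn2 = make_ffn(), make_ffn()
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conv = ConvModule(dim, kernel_size)
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (batch, time, dim)
        x = x + 0.5 * self.ffn1(x)                 # half-step feed-forward (residual)
        q = self.attn_norm(x)
        x = x + self.attn(q, q, q, need_weights=False)[0]
        x = x + self.conv(x)
        x = x + 0.5 * self.ffn2(x)
        return self.out_norm(x)


# Stack blocks to the listed depth of 4; shape is preserved end to end.
encoder = nn.Sequential(*[ConformerBlock() for _ in range(4)])
print(encoder(torch.randn(2, 100, 512)).shape)     # torch.Size([2, 100, 512])
```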
Demucs
A multi-layer convolutional encoder-decoder architecture with U-Net-style skip connections; a structural sketch follows the lists below.
Model Parameters
- Kernel size: 8
- Hidden channels: 64
- Stride: 2
- Depth: 5
Architecture
- Convolutional encoder-decoder
- Bi-directional LSTM
- GLU activation
- Skip connections
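A rough PyTorch sketch of this shape, using the parameters above (kernel 8, stride 2, 64 hidden channels, depth 5), is shown below. The per-layer channel doubling, the single-channel input, and the crop-based length matching are our assumptions for illustration, not details taken from this page.

```python
import torch
import torch.nn as nn


class DemucsLike(nn.Module):
    """Convolutional encoder, BiLSTM bottleneck, convolutional decoder with U-Net skips."""

    def __init__(self, hidden=64, depth=5, kernel=8, stride=2):
        super().__init__()
        self.encoder, self.decoder = nn.ModuleList(), nn.ModuleList()
        in_ch, ch = 1, hidden
        for i in range(depth):
            # Encoder layer: strided conv + ReLU, then 1x1 conv + GLU.
            self.encoder.append(nn.Sequential(
                nn.Conv1d(in_ch, ch, kernel, stride), nn.ReLU(),
                nn.Conv1d(ch, 2 * ch, 1), nn.GLU(dim=1)))
            # Matching decoder layer, inserted at the front so the list runs deep -> shallow.
            self.decoder.insert(0, nn.Sequential(
                nn.Conv1d(ch, 2 * ch, 1), nn.GLU(dim=1),
                nn.ConvTranspose1d(ch, in_ch, kernel, stride),
                nn.ReLU() if i > 0 else nn.Identity()))   # no ReLU on the final output
            in_ch, ch = ch, 2 * ch
        self.lstm = nn.LSTM(in_ch, in_ch // 2, num_layers=2,
                            bidirectional=True, batch_first=True)

    def forward(self, x):                           # x: (batch, 1, samples)
        length = x.shape[-1]
        skips = []
        for enc in self.encoder:
            x = enc(x)
            skips.append(x)
        y, _ = self.lstm(x.transpose(1, 2))         # BiLSTM over time at the bottleneck
        x = y.transpose(1, 2)
        for dec in self.decoder:
            skip = skips.pop()
            x = x + skip[..., :x.shape[-1]]         # U-Net skip, cropped to the current length
            x = dec(x)
        return x[..., :length]                      # output length is at most the input length


model = DemucsLike()
out = model(torch.randn(1, 1, 32000))               # roughly 2 s of 16 kHz audio
```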
TSTNN
A transformer-based model for end-to-end speech enhancement in the time domain; a sketch of the two-stage (local/global) processing follows the lists below.
Components
- Encoder with dilated dense blocks
- Two-stage transformer module
- Masking module
- Decoder with sub-pixel convolution
Processing
- Local-global feature processing
- Multi-head attention
- GRU layers
- Group normalization
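The sketch below illustrates only the local/global two-stage idea on a segmented representation of shape (batch, channels, frames, frame length): one transformer layer attends within each frame, a second attends across frames, each followed by group normalization and a residual connection. It uses stock `nn.TransformerEncoderLayer` rather than TSTNN's improved transformer with GRU layers, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn


class TwoStageTransformer(nn.Module):
    """Local (intra-frame) then global (inter-frame) attention over segmented features."""

    def __init__(self, dim=64, heads=4):
        super().__init__()
        self.local = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                                batch_first=True)
        self.glob = nn.TransformerEncoderLayer(dim, heads, dim_feedforward=4 * dim,
                                               batch_first=True)
        self.local_norm = nn.GroupNorm(1, dim)
        self.glob_norm = nn.GroupNorm(1, dim)

    def forward(self, x):                           # x: (batch, channels, frames, frame_len)
        b, c, f, t = x.shape
        # Stage 1: attend over positions inside each frame.
        loc = self.local(x.permute(0, 2, 3, 1).reshape(b * f, t, c))
        loc = loc.reshape(b, f, t, c).permute(0, 3, 1, 2)
        x = x + self.local_norm(loc.reshape(b, c, f * t)).reshape(b, c, f, t)
        # Stage 2: attend across frames at each intra-frame position.
        glo = self.glob(x.permute(0, 3, 2, 1).reshape(b * t, f, c))
        glo = glo.reshape(b, t, f, c).permute(0, 3, 2, 1)
        x = x + self.glob_norm(glo.reshape(b, c, f * t)).reshape(b, c, f, t)
        return x


stage = TwoStageTransformer()
print(stage(torch.randn(2, 64, 50, 32)).shape)      # shape is preserved: (2, 64, 50, 32)
```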
Performance Results
Evaluation Metrics
| Model | PESQ | STOI | CER (%) | WER (%) |
|---|---|---|---|---|
| Throat Mic | 1.22 | 0.70 | 84.4 | 92.2 |
| TSTNN | 2.00 | 0.89 | 25.7 | 54.3 |
| Demucs | 1.86 | 0.89 | 17.1 | 47.8 |
| SE-Conformer | 2.08 | 0.90 | 12.5 | 43.3 |
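PESQ and STOI are signal-level quality and intelligibility scores (higher is better), while CER and WER are character and word error rates from a speech recognizer run on the audio (lower is better). One common way to compute the signal metrics, assuming the `pesq` and `pystoi` packages and 16 kHz recordings (the exact tooling used for these results is not specified here), is sketched below.

```python
import soundfile as sf
from pesq import pesq
from pystoi import stoi

fs = 16000                                       # assumed sampling rate
ref, _ = sf.read("clean.wav")                    # reference recording (placeholder path)
deg, _ = sf.read("enhanced.wav")                 # enhanced output (placeholder path)
n = min(len(ref), len(deg))                      # align lengths before scoring

print("PESQ:", pesq(fs, ref[:n], deg[:n], "wb"))             # wide-band PESQ
print("STOI:", stoi(ref[:n], deg[:n], fs, extended=False))   # 0-1 intelligibility score
```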
Implementation Details
Training Configuration
- Epochs: 200
- Optimizer: Adam (lr=3e-4)
- Batch size: 16
- Segment length: 2 seconds
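A minimal training-loop skeleton matching this configuration (Adam at lr 3e-4, batch size 16, 2-second segments, 200 epochs) is shown below; the model and data are stand-ins, and a 16 kHz sampling rate is assumed.

```python
import torch
import torch.nn as nn

model = nn.Conv1d(1, 1, 9, padding=4)             # placeholder enhancement model
opt = torch.optim.Adam(model.parameters(), lr=3e-4)

# Toy corpus of 2-second (noisy, clean) waveform pairs at an assumed 16 kHz.
noisy = torch.randn(64, 1, 2 * 16000)
clean = torch.randn(64, 1, 2 * 16000)
loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(noisy, clean), batch_size=16, shuffle=True)

for epoch in range(200):
    for x, y in loader:
        loss = nn.functional.l1_loss(model(x), y)  # waveform L1 term (see losses below)
        opt.zero_grad()
        loss.backward()
        opt.step()
```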
Loss Functions
- L1 loss on waveform
- Multi-resolution STFT loss
- Time-frequency domain loss
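As a sketch of how the waveform L1 and multi-resolution STFT terms can be combined, assuming common FFT sizes and equal weighting (the exact resolutions and weights used here are not stated):

```python
import torch
import torch.nn.functional as F


def stft_mag(x, n_fft, hop, win):
    """Magnitude spectrogram of a (batch, samples) waveform."""
    window = torch.hann_window(win, device=x.device)
    return torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True).abs()


def multi_res_stft_loss(est, ref,
                        resolutions=((512, 128, 512), (1024, 256, 1024), (2048, 512, 2048))):
    """Spectral-convergence + log-magnitude L1 terms, averaged over several resolutions."""
    total = 0.0
    for n_fft, hop, win in resolutions:
        e, r = stft_mag(est, n_fft, hop, win), stft_mag(ref, n_fft, hop, win)
        sc = torch.norm(r - e) / torch.norm(r)            # spectral convergence
        mag = F.l1_loss(torch.log(e + 1e-7), torch.log(r + 1e-7))
        total = total + sc + mag
    return total / len(resolutions)


def enhancement_loss(est, ref):
    """Waveform L1 plus the multi-resolution STFT loss; est/ref are (batch, samples)."""
    return F.l1_loss(est, ref) + multi_res_stft_loss(est, ref)
```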
Code and Resources
Access our implementation and experiment with the baseline models: