Speech Enhancement
Task Overview
This task converts low-quality throat microphone recordings into high-quality speech that resembles an acoustic microphone recording. The challenge lies in reconstructing the high-frequency components that are missing from the throat microphone signal and in correcting the distortion introduced by filtering through skin and tissue.
For more detailed information, please refer to the TAPS paper on arXiv.
Baseline Models
Three baseline models were evaluated on this task:
1. SE-Conformer
Combines CNN and Conformer blocks for time-frequency modeling. Achieved the best score on all four evaluation metrics among the baseline models. Paper
2. Demucs
U-Net-style convolutional encoder-decoder with skip connections and BiLSTM modules for temporal modeling of audio features. Paper
3. TSTNN
Transformer-based model that operates in the time domain and uses a two-stage attention mechanism to enhance throat microphone speech. Paper
Evaluation Metrics
We evaluated the performance of throat microphone speech enhancement models using the following metrics:
Speech Quality
- PESQ (Perceptual Evaluation of Speech Quality): Measures the quality of enhanced speech
- STOI (Short-Time Objective Intelligibility): Evaluates speech intelligibility
Speech Content
- CER (Character Error Rate): Measures the accuracy of speech content at character level
- WER (Word Error Rate): Measures the accuracy of speech content at word level
The speech content metrics (CER and WER) quantify how much of the spoken content is recovered. They were computed on transcripts produced by the Whisper large-v3-turbo automatic speech recognition (ASR) model, fine-tuned on the Korean Zeroth dataset.
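As an illustration of how the two content metrics are defined, the sketch below computes WER and CER from a reference transcript and an ASR hypothesis using Levenshtein edit distance. It is a minimal sketch, not the scoring pipeline used for the reported results: the function names are hypothetical, and the text normalization (splitting words on whitespace, ignoring spaces when counting characters) is an assumption, since the normalization applied in the evaluation is not stated here.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = reference.replace(" ", "")
    hyp_chars = hypothesis.replace(" ", "")
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)
```

Both rates are fractions of the reference length, so multiplying by 100 gives percentages comparable to a CER (%) or WER (%) column. Because insertions are counted, either rate can exceed 100%.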
Model Performance
The table below shows the performance comparison of baseline models against the unprocessed throat microphone recordings. Higher PESQ and STOI values indicate better quality, while lower CER and WER values indicate better content preservation.
| Model | PESQ | STOI | CER (%) | WER (%) |
|---|---|---|---|---|
| Throat Mic. | 1.22 | 0.70 | 84.4 | 92.2 |
| TSTNN | 1.90 | 0.88 | 32.0 | 60.3 |
| Demucs | 1.79 | 0.88 | 28.6 | 57.4 |
| SE-Conformer | 1.97 | 0.89 | 24.3 | 53.0 |
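To make the comparison in the table above more concrete, the following sketch computes each model's relative change against the unprocessed throat microphone row and picks the best model per metric. The numbers are copied from the table; the helper names (`relative_change`, `best_model`) are hypothetical and exist only for this illustration.

```python
# Scores copied from the table above.
results = {
    "Throat Mic.":  {"PESQ": 1.22, "STOI": 0.70, "CER": 84.4, "WER": 92.2},
    "TSTNN":        {"PESQ": 1.90, "STOI": 0.88, "CER": 32.0, "WER": 60.3},
    "Demucs":       {"PESQ": 1.79, "STOI": 0.88, "CER": 28.6, "WER": 57.4},
    "SE-Conformer": {"PESQ": 1.97, "STOI": 0.89, "CER": 24.3, "WER": 53.0},
}

# PESQ and STOI are better when higher; CER and WER are better when lower.
HIGHER_IS_BETTER = {"PESQ": True, "STOI": True, "CER": False, "WER": False}


def relative_change(model, metric, baseline="Throat Mic."):
    """Percentage change of `model` relative to `baseline` on `metric`."""
    base = results[baseline][metric]
    return (results[model][metric] - base) / base * 100


def best_model(metric):
    """Name of the row with the best score on `metric`."""
    pick = max if HIGHER_IS_BETTER[metric] else min
    return pick(results, key=lambda name: results[name][metric])


if __name__ == "__main__":
    for metric in HIGHER_IS_BETTER:
        name = best_model(metric)
        print(f"{metric}: best = {name}, "
              f"change vs. throat mic = {relative_change(name, metric):+.1f}%")
```

Under these numbers, SE-Conformer reduces CER by roughly 71% and WER by roughly 43% relative to the raw throat microphone signal, while raising PESQ by about 61%.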
Audio Samples
Below are audio samples that compare the throat microphone speech enhancement models. Each sample includes the original throat microphone (TM) recording, the reference acoustic microphone (AM) recording, and the enhanced outputs of the three models (TSTNN, Demucs, SE-Conformer).