Task Overview

This task focuses on converting low-quality throat microphone recordings into high-quality speech that resembles an acoustic microphone recording. The challenge lies in reconstructing the missing high-frequency components and correcting the distortion introduced when speech is filtered through skin and tissue.

For more detailed information, please refer to the TAPS paper on arXiv.

Baseline Models

Three baseline models were evaluated on this task:

1. SE-Conformer

Combines CNN and Conformer blocks for time-frequency modeling. Achieved the best score on all four reported metrics among the baseline models. Paper

2. Demucs

U-Net-style convolutional encoder-decoder with skip connections and BiLSTM modules for temporal modeling of audio features. Paper

3. TSTNN

Transformer-based model with time-domain processing and a two-stage attention mechanism for enhancing throat microphone speech. Paper

Evaluation Metrics

We evaluated the performance of throat microphone speech enhancement models using the following metrics:

Speech Quality

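  • PESQ (Perceptual Evaluation of Speech Quality): Estimates the perceived quality of the enhanced speech relative to a clean reference recording
  • STOI (Short-Time Objective Intelligibility): Estimates the intelligibility of the enhanced speech relative to a clean reference recording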

Speech Content

  • CER (Character Error Rate): Measures the accuracy of speech content at character level
  • WER (Word Error Rate): Measures the accuracy of speech content at word level

To quantify how much speech content is recovered, the speech content metrics (CER and WER) were computed from transcripts produced by a Whisper large-v3-turbo automatic speech recognition (ASR) model fine-tuned on the Korean Zeroth dataset.
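
As a rough illustration of how CER and WER are defined, the sketch below computes both from a reference transcript and an ASR hypothesis using Levenshtein edit distance. It is not the evaluation code behind the results reported here. The function names (`edit_distance`, `wer`, `cer`) and the choice to ignore spaces when computing CER are assumptions made for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (row-by-row dynamic programming)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    ref = "the cat sat"
    hyp = "the cat sit"
    print(f"WER: {wer(ref, hyp):.3f}")
    print(f"CER: {cer(ref, hyp):.3f}")
```

Both rates are normalized by the length of the reference, so they can exceed 1.0 (100%) when the hypothesis contains many insertions.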

Model Performance

The table below shows the performance comparison of baseline models against the unprocessed throat microphone recordings. Higher PESQ and STOI values indicate better quality, while lower CER and WER values indicate better content preservation.

Model          PESQ   STOI   CER (%)   WER (%)
Throat Mic.    1.22   0.70   84.4      92.2
TSTNN          1.90   0.88   32.0      60.3
Demucs         1.79   0.88   28.6      57.4
SE-Conformer   1.97   0.89   24.3      53.0
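
To make the size of the gains easier to read, the relative error reduction with respect to the unprocessed throat microphone input can be derived from the table. The sketch below is illustrative only: the `relative_error_reduction` and `summarize` helpers and the dictionary layout are assumptions of this example, and the CER and WER values are copied from the table above.

```python
# CER and WER (%) copied from the performance table.
RESULTS = {
    "Throat Mic.": {"CER": 84.4, "WER": 92.2},
    "TSTNN": {"CER": 32.0, "WER": 60.3},
    "Demucs": {"CER": 28.6, "WER": 57.4},
    "SE-Conformer": {"CER": 24.3, "WER": 53.0},
}


def relative_error_reduction(baseline, enhanced):
    """Fraction of the baseline error rate removed by enhancement."""
    if baseline <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline - enhanced) / baseline


def summarize(results, baseline_name="Throat Mic."):
    """Relative CER/WER reduction of every model against the baseline row."""
    base = results[baseline_name]
    summary = {}
    for model, scores in results.items():
        if model == baseline_name:
            continue
        summary[model] = {
            metric: relative_error_reduction(base[metric], scores[metric])
            for metric in ("CER", "WER")
        }
    return summary


if __name__ == "__main__":
    for model, reduction in summarize(RESULTS).items():
        print(f"{model}: CER -{reduction['CER']:.1%}, WER -{reduction['WER']:.1%}")
```

For SE-Conformer this works out to a relative CER reduction of roughly 71% and a relative WER reduction of roughly 43% compared with the unprocessed throat microphone recordings.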

Audio Samples

The audio samples below compare the throat microphone speech enhancement models. Each sample includes the original throat microphone (TM) recording, the acoustic microphone (AM) reference recording, and the outputs of the three models (TSTNN, Demucs, SE-Conformer).

Sample 1 (p62/u00, Female)

Throat Mic.

Acoustic Mic.

TSTNN

Demucs

SE-Conformer

Sample 2 (p63/u00, Female)

Throat Mic.

Acoustic Mic.

TSTNN

Demucs

SE-Conformer

Sample 3 (p65/u00, Male)

Throat Mic.

Acoustic Mic.

TSTNN

Demucs

SE-Conformer

Sample 4 (p70/u00, Male)

Throat Mic.

Acoustic Mic.

TSTNN

Demucs

SE-Conformer