Post Processing

This page describes the signal processing techniques and data preparation procedures used to build the TAPS dataset.

Data Mismatch Correction

One of the key challenges in creating the TAPS dataset was addressing the timing differences between throat microphone and acoustic microphone signals. These mismatches occur due to several factors:

  • Variations in speakers' larynx and oral structures
  • Differences in phoneme production locations
  • Distance between the acoustic microphone and the speaker's lips
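
As a rough illustration of how a timing offset between two recordings of the same utterance can be measured, the sketch below estimates a single global lag from the peak of the cross-correlation and trims both signals to a common span. This is a generic technique shown under stated assumptions, not necessarily the correction method used for TAPS; the function names `estimate_lag` and `align` are hypothetical, and a single global lag cannot capture the speaker-dependent or phoneme-dependent variations listed above.

```python
import numpy as np


def estimate_lag(reference: np.ndarray, delayed: np.ndarray) -> int:
    """Return the lag, in samples, by which `delayed` trails `reference`.

    A positive value means `delayed` starts later than `reference`.
    """
    # Full cross-correlation: output index 0 corresponds to a lag of
    # -(len(reference) - 1), so the peak index is shifted by that amount.
    corr = np.correlate(delayed, reference, mode="full")
    return int(np.argmax(corr)) - (len(reference) - 1)


def align(reference: np.ndarray, delayed: np.ndarray):
    """Shift the later signal so both start together, then crop to equal length."""
    lag = estimate_lag(reference, delayed)
    if lag > 0:
        delayed = delayed[lag:]        # drop the extra leading samples
    elif lag < 0:
        reference = reference[-lag:]   # `reference` is the one that starts late
    n = min(len(reference), len(delayed))
    return reference[:n], delayed[:n]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    throat = rng.standard_normal(1000)                       # stand-in signal
    acoustic = np.concatenate([np.zeros(25), throat])[:1000]  # 25-sample delay
    print(estimate_lag(throat, acoustic))
```

For full-length recordings, `np.correlate` becomes slow because it evaluates the correlation directly; an FFT-based correlation such as `scipy.signal.correlate` with `method="fft"` is a common substitute.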

Technical Details

The "Technical Validation" section of our paper [Link to paper] describes the data mismatch analysis and correction methods in detail, including:

  • Mismatch variation analysis
  • Impact of microphone distance
  • Speaker-dependent variations
  • Phoneme-dependent timing differences

Background Noise Reduction

To ensure high-quality reference signals, we applied noise reduction to the acoustic microphone recordings using the Demucs speech enhancement model.

Process Details

  • Used the pretrained causal version of Demucs
  • Applied to acoustic microphone signals only
  • Preserved original signal characteristics
  • Minimal impact on speech content

Results

  • Reduced background noise
  • Enhanced signal clarity
  • Improved reference quality
  • Maintained natural speech characteristics

Note

The noise reduction was applied only to minimize minor background noise in the acoustic microphone recordings. The throat microphone signals were preserved in their original form to maintain the authenticity of the dataset.

Complete Processing Pipeline

  1. High-pass Filtering

    Applied a 5th-order Butterworth high-pass filter with a 50 Hz cut-off frequency to reduce gravitational acceleration effects

  2. Timing Alignment

    Corrected timing differences between throat and acoustic microphone signals

  3. Noise Reduction

    Applied Demucs-based noise reduction to acoustic microphone recordings

  4. Trimming

    Removed silent segments at the beginning and end of recordings

  5. Quality Verification

    Manually reviewed evaluation set utterances for accuracy
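
Steps 1 and 4 of the pipeline can be sketched in a few lines with NumPy and SciPy. The filter order (5) and cut-off (50 Hz) come from the description above; everything else is an assumption made for illustration: the function names `highpass` and `trim_silence`, the single-pass (causal) filtering, and the RMS threshold and frame length used to decide what counts as silence. The actual TAPS implementation may differ on all of these points.

```python
import numpy as np
from scipy.signal import butter, sosfilt


def highpass(signal: np.ndarray, sample_rate: float,
             cutoff_hz: float = 50.0, order: int = 5) -> np.ndarray:
    """Apply a Butterworth high-pass filter (5th order, 50 Hz by default)."""
    # Second-order sections are numerically safer than (b, a) coefficients.
    sos = butter(order, cutoff_hz, btype="highpass", fs=sample_rate, output="sos")
    # Assumption: one causal pass. A zero-phase alternative is sosfiltfilt,
    # which filters forward and backward and so changes the effective response.
    return sosfilt(sos, signal)


def trim_silence(signal: np.ndarray, threshold: float = 0.01,
                 frame_len: int = 400) -> np.ndarray:
    """Drop leading and trailing frames whose RMS is below `threshold`.

    `threshold` and `frame_len` are illustrative values, not taken from TAPS.
    Samples after the last complete frame are discarded.
    """
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    active = np.flatnonzero(rms >= threshold)
    if active.size == 0:
        return signal[:0]              # the whole recording is below threshold
    start = active[0] * frame_len
    end = (active[-1] + 1) * frame_len
    return signal[start:end]
```

When two channels are recorded together, the trim boundaries would normally be computed once and applied to both so that the pair stays aligned; the sketch handles one signal at a time for brevity.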