TAPS Dataset - Post Processing

Data Mismatch Correction

One of the key challenges in creating the TAPS dataset was addressing the timing differences between throat microphone and acoustic microphone signals. These mismatches occur due to several factors:

• Variations in speakers' larynx and oral structures
• Differences in phoneme production locations
• Distance between the acoustic microphone and speaker's lips
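
As an illustration of how a timing offset between two channels can be measured, the sketch below estimates the lag by cross-correlation and then shifts one signal to compensate. This is a common general-purpose technique and is not necessarily the correction method used for TAPS. The function names `estimate_lag` and `shift_signal` and the parameter `max_lag` are hypothetical.

```python
import numpy as np


def estimate_lag(reference, delayed, max_lag):
    """Estimate how many samples `delayed` trails `reference`.

    Assumption: a single constant offset, found as the peak of the
    cross-correlation within +/- max_lag samples (max_lag < signal length).
    A positive result means `delayed` is late relative to `reference`.
    """
    reference = np.asarray(reference, dtype=float)
    delayed = np.asarray(delayed, dtype=float)
    n = min(len(reference), len(delayed))
    reference, delayed = reference[:n], delayed[:n]

    lags = np.arange(-max_lag, max_lag + 1)
    scores = []
    for lag in lags:
        if lag >= 0:
            # Compare reference[t] with delayed[t + lag]
            score = np.dot(reference[: n - lag], delayed[lag:])
        else:
            # Compare reference[t - lag] with delayed[t]
            score = np.dot(reference[-lag:], delayed[: n + lag])
        scores.append(score)
    return int(lags[int(np.argmax(scores))])


def shift_signal(signal, lag):
    """Advance `signal` by `lag` samples, zero-padding the vacated end."""
    signal = np.asarray(signal, dtype=float)
    out = np.zeros_like(signal)
    if lag > 0:
        out[:-lag] = signal[lag:]
    elif lag < 0:
        out[-lag:] = signal[:lag]
    else:
        out[:] = signal
    return out
```

A typical use would be `lag = estimate_lag(throat, acoustic, max_lag)` followed by `shift_signal(acoustic, lag)`. Because the mismatch described above depends on the speaker and the phoneme, a single global lag is only a first approximation.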

Technical Details

For detailed information about data mismatch analysis and correction methods, please refer to the "Technical Validation" section of our paper [Link to paper]. That section covers:

• Mismatch variation analysis
• Impact of microphone distance
• Speaker-dependent variations
• Phoneme-dependent timing differences

Background Noise Reduction

To ensure high-quality reference signals, we applied noise reduction to the acoustic microphone recordings using the Demucs speech enhancement model.

Process Details

• Used the pretrained causal version of Demucs
• Applied to acoustic microphone signals only
• Preserved original signal characteristics
• Minimal impact on speech content

Results

• Reduced background noise
• Enhanced signal clarity
• Improved reference quality
• Maintained natural speech characteristics

Note

The noise reduction was applied only to minimize minor background noise in the acoustic microphone recordings. The throat microphone signals were preserved in their original form to maintain the authenticity of the dataset.

Complete Processing Pipeline

1. High-pass Filtering
   Applied 5th-order Butterworth high-pass filter with 50 Hz cut-off frequency to reduce gravitational acceleration effects
2. Timing Alignment
   Corrected timing differences between throat and acoustic microphone signals
3. Noise Reduction
   Applied Demucs-based noise reduction to acoustic microphone recordings
4. Trimming
   Removed silent segments at the beginning and end of recordings
5. Quality Verification
   Manual review of evaluation set utterances for accuracy
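
The filtering and trimming steps above can be sketched with NumPy and SciPy as follows. The filter order and cut-off frequency come from the pipeline description. Everything else is an assumption made for illustration: the single-pass (causal) filtering, the energy-based silence detection, and the hypothetical names `highpass_filter`, `trim_silence`, `frame_length`, and `threshold_db`.

```python
import numpy as np
from scipy.signal import butter, sosfilt


def highpass_filter(signal, sample_rate, order=5, cutoff_hz=50.0):
    """Apply a Butterworth high-pass filter (5th order, 50 Hz by default).

    Assumption: one forward pass, so the magnitude response stays that of
    a 5th-order filter. Whether zero-phase filtering was used is not stated.
    """
    sos = butter(order, cutoff_hz, btype="highpass", fs=sample_rate, output="sos")
    return sosfilt(sos, np.asarray(signal, dtype=float))


def trim_silence(signal, frame_length=400, threshold_db=-40.0):
    """Drop leading and trailing frames whose RMS is far below the loudest frame.

    Assumption: silence is detected with a simple relative energy threshold.
    Samples after the last full frame are discarded.
    """
    signal = np.asarray(signal, dtype=float)
    n_frames = len(signal) // frame_length
    if n_frames == 0:
        return signal  # too short to analyse, return unchanged

    frames = signal[: n_frames * frame_length].reshape(n_frames, frame_length)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    peak = rms.max()
    if peak == 0:
        return signal[:0]  # entirely silent

    # Frames at or above the threshold (relative to the loudest frame) are active
    active = np.flatnonzero(rms >= peak * 10 ** (threshold_db / 20))
    start = active[0] * frame_length
    end = (active[-1] + 1) * frame_length
    return signal[start:end]
```

If both channels of a recording must stay aligned, the trim boundaries would need to be computed once and applied to both signals rather than trimming each channel independently.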