Raw signals are almost never what your model needs. Whether you're working with audio, accelerometer data, EEG, radar, or vibration sensors, the path from raw waveform to useful model input runs through signal processing. Here's the toolkit I use.
Why Raw Waveforms Are Difficult
A one-second audio clip at 44.1 kHz is a 44,100-dimensional vector. Most of those dimensions are correlated, redundant, or irrelevant to your task. Feeding raw waveforms to standard classifiers is possible but sample-inefficient, because the model has to spend data learning structure that a transform gives you up front. DSP gives you compact, interpretable, task-relevant representations.
The Spectrogram: Your First Tool
The Short-Time Fourier Transform (STFT) slides a window over your signal and computes the frequency content at each position. Taking the magnitude (or squared magnitude) of the result gives a 2D time-frequency representation called a spectrogram. This is the go-to representation for audio ML tasks.
import librosa
import numpy as np
y, sr = librosa.load('audio.wav', sr=22050)
# Mel spectrogram — frequency axis perceptually scaled
mel_spec = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=128)
log_mel = librosa.power_to_db(mel_spec, ref=np.max)
print(log_mel.shape) # (128, time_frames)
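To make the "sliding window" idea concrete, here is a minimal NumPy-only sketch of a magnitude STFT. The function name `stft_magnitude` and the `n_fft`/`hop` values are my own choices for illustration. Unlike librosa's default, this version does not pad the signal, so frame counts will differ slightly.

```python
import numpy as np

def stft_magnitude(x, n_fft=512, hop=128):
    """Magnitude STFT: Hann-windowed frames, one real FFT per frame."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    # Cut the signal into overlapping frames and taper each one
    frames = np.stack([x[i * hop : i * hop + n_fft] * window
                       for i in range(n_frames)])
    # rfft keeps only the non-negative frequencies of a real signal
    return np.abs(np.fft.rfft(frames, axis=1)).T  # (n_fft // 2 + 1, n_frames)

# Synthetic test signal: one second of a 1 kHz tone sampled at 8 kHz
sr = 8000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 1000 * t)

spec = stft_magnitude(tone)
freqs = np.fft.rfftfreq(512, d=1 / sr)
peak_hz = freqs[spec.mean(axis=1).argmax()]  # frequency bin with the most energy
```

The window length sets the trade-off: a longer `n_fft` gives finer frequency bins but blurs events in time.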
MFCCs: Compact Audio Fingerprints
Mel-Frequency Cepstral Coefficients (MFCCs) apply a discrete cosine transform (DCT) to the log-mel spectrogram and keep only the first few coefficients, typically 13 to 40 numbers per frame. MFCCs dominated speech recognition for decades and remain useful for tasks where you need low-dimensional audio features.
mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
# Common practice: use mean and std across time as feature vector
mfcc_features = np.concatenate([mfccs.mean(axis=1), mfccs.std(axis=1)])
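The DCT step is small enough to write out by hand. This is a sketch, assuming you already have a log-mel matrix of shape `(n_mels, frames)`. The helper name `mfcc_from_log_mel` and the random stand-in data are mine, and the orthonormal DCT-II is one common convention, not the only one.

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_log_mel(log_mel, n_mfcc=13):
    """DCT-II along the mel axis, keeping the first n_mfcc coefficients."""
    return dct(log_mel, type=2, axis=0, norm="ortho")[:n_mfcc]

# Stand-in for a real (n_mels, frames) log-mel spectrogram
rng = np.random.default_rng(0)
fake_log_mel = rng.normal(size=(128, 50))

coeffs = mfcc_from_log_mel(fake_log_mel)  # shape (13, 50)
```

Truncating the DCT keeps the smooth spectral envelope and discards fine detail across mel bands, which is where the compression comes from.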
Wavelet Transforms for Multi-Scale Analysis
A plain FFT gives you global frequency content with no time localization, and the STFT fixes one window length for every frequency. For non-stationary signals, where the frequency content changes over time, wavelets are often the better fit. The Continuous Wavelet Transform (CWT) provides time-frequency resolution that adapts to scale: high temporal resolution at high frequencies, high frequency resolution at low frequencies.
Wavelets are to non-stationary signals what the FFT is to stationary ones — the right tool for the job when your signal's character changes over time.
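Here is a rough NumPy-only sketch of a CWT with a complex Morlet wavelet, applied to a signal that switches from 20 Hz to 80 Hz halfway through. The function `morlet_cwt`, the `w0=6.0` setting, and the frequency grid are illustrative assumptions. A dedicated wavelet library will be faster and handle edge effects better. Note that the wavelet gets shorter as frequency goes up, which is the adaptive resolution described above.

```python
import numpy as np

def morlet_cwt(x, fs, freqs, w0=6.0):
    """Naive CWT: correlate x with a complex Morlet wavelet per frequency.

    Assumes len(x) exceeds the longest wavelet (the lowest frequency).
    """
    x = np.asarray(x, dtype=float)
    out = np.empty((len(freqs), len(x)), dtype=complex)
    for i, f in enumerate(freqs):
        s = w0 / (2 * np.pi * f)         # scale (seconds) with center frequency f
        half = int(np.ceil(4 * s * fs))  # truncate the envelope at +/- 4 scale widths
        t = np.arange(-half, half + 1) / fs
        wavelet = (np.pi ** -0.25 / np.sqrt(s)
                   * np.exp(1j * w0 * t / s - 0.5 * (t / s) ** 2))
        # Correlation with the wavelet = convolution with its reversed conjugate
        out[i] = np.convolve(x, np.conj(wavelet)[::-1], mode="same") / fs
    return out

# Non-stationary test signal: 20 Hz for the first second, 80 Hz for the second
fs = 1000
t = np.arange(2 * fs) / fs
x = np.where(t < 1.0, np.sin(2 * np.pi * 20 * t), np.sin(2 * np.pi * 80 * t))

freqs = np.array([10, 20, 40, 80, 160])
power = np.abs(morlet_cwt(x, fs, freqs)) ** 2  # (n_freqs, n_samples) scalogram

early = freqs[power[:, 500].argmax()]   # dominant frequency at t = 0.5 s
late = freqs[power[:, 1500].argmax()]   # dominant frequency at t = 1.5 s
```

A global FFT of `x` would show both peaks but not when each one occurs. The scalogram keeps that timing.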
Sensor Data: IMU and Vibration
For accelerometer and gyroscope data, useful features include RMS energy, zero-crossing rate, dominant frequency bands, spectral entropy, and kurtosis. Kurtosis is particularly useful for detecting impulsive events like mechanical faults — it's sensitive to spike patterns that energy-based metrics miss.
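import numpy as np
from scipy import stats
def sensor_features(segment, fs=1.0):
    # fs: sampling rate in Hz (the default of 1.0 yields cycles per sample)
    segment = np.asarray(segment, dtype=float)
    # Remove the mean so the DC bin cannot win the peak search
    spectrum = np.abs(np.fft.rfft(segment - segment.mean()))
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    return {
        'rms': np.sqrt(np.mean(segment**2)),
        'kurtosis': stats.kurtosis(segment),  # excess kurtosis (0 for a Gaussian)
        'zero_crossing_rate': np.mean(np.diff(np.sign(segment)) != 0),
        'peak_frequency': freqs[np.argmax(spectrum)],  # in Hz, not a bin index
    }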
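Spectral entropy and band energy are mentioned above but not computed, so here is one way they might look. Both helper names and the normalization to the range 0 to 1 are my own choices. The sketch assumes the segment is not constant; otherwise the power sum is zero and the division fails.

```python
import numpy as np

def spectral_entropy(segment, normalize=True):
    """Shannon entropy of the power spectrum treated as a probability distribution."""
    segment = np.asarray(segment, dtype=float)
    power = np.abs(np.fft.rfft(segment - segment.mean())) ** 2
    p = power / power.sum()
    p = p[p > 0]                       # skip empty bins to avoid log(0)
    h = -np.sum(p * np.log2(p))
    return h / np.log2(len(power)) if normalize else h

def band_energy_ratio(segment, fs, f_lo, f_hi):
    """Fraction of total spectral power in the band [f_lo, f_hi) in Hz."""
    segment = np.asarray(segment, dtype=float)
    power = np.abs(np.fft.rfft(segment - segment.mean())) ** 2
    freqs = np.fft.rfftfreq(len(segment), d=1.0 / fs)
    mask = (freqs >= f_lo) & (freqs < f_hi)
    return power[mask].sum() / power.sum()

# A pure tone concentrates power in one bin; white noise spreads it out
fs = 1000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 50 * t)
noise = np.random.default_rng(0).normal(size=fs)

tone_entropy = spectral_entropy(tone)      # close to 0
noise_entropy = spectral_entropy(noise)    # close to 1
tone_in_band = band_energy_ratio(tone, fs, 40, 60)
```

Low entropy points to a few dominant frequencies, such as a healthy rotating machine. Rising entropy can indicate broadband noise creeping in.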
End-to-End vs. Handcrafted Features
Modern deep learning can learn features from raw signals end-to-end — especially if you have large datasets and compute. But handcrafted DSP features still win when you have limited data, need interpretability, or want to constrain the hypothesis space with domain knowledge. In practice, the best systems often combine both: a handcrafted frontend feeding into a learned backend.
Key Takeaways
- Use mel spectrograms as your default audio representation
- Use MFCCs when you need low-dimensional, fast-to-compute audio features
- Reach for wavelets when your signal is non-stationary
- For vibration/sensor data, kurtosis and spectral entropy are underused gold
- Combine handcrafted and learned features for data-scarce problems