One of the first problems presented to students of deep learning is to classify handwritten digits in the MNIST dataset. This was recently ported to the web thanks to deeplearn.js. The web version has distinct educational advantages over the relatively dry TensorFlow tutorial. You can immediately get a feeling for the model, and start building intuition for what works and what doesn’t. Let’s preserve this interactivity, but change domains to audio. This post sets the scene for the auditory equivalent of MNIST. Rather than recognize handwritten digits, we will focus on recognizing spoken commands. We’ll do this by converting sounds like this:
Into images like this, called log-mel spectrograms, and in the next post, feed these images into the same types of models that do handwriting recognition so well:
The audio feature extraction technique I discuss here is generic enough to work for all sorts of audio, not just human speech. The rest of the post explains how. If you don’t care and just want to see the code, or play with some live demos, be my guest!
Neural networks are having quite a resurgence, and for good reason. Computers are beating humans at many challenging tasks, from identifying faces and images, to playing Go. The basic principles of neural nets are relatively simple, but the details can get quite complex. Luckily, non-AI experts can get a feeling for what can be done, because a lot of the output is quite engaging. Unfortunately, these demos are mostly visual in nature: either examples of computer vision, or systems that generate images or video as their main output. And few of these examples are interactive.
Pre-processing audio sounds hard, do we have to?
Raw audio is a pressure wave sampled tens of thousands of times per second and stored as an array of numbers. It’s quite a bit of data, but there are neural networks that can ingest it directly. Wavenet does speech to text and text to speech using raw audio sequences, without any explicit feature extraction. Unfortunately, it’s slow: running speech recognition on a 2 s example took 30 s on my laptop. Doing this in real time, in a web browser, isn’t quite feasible yet.
Convolutional Neural Networks (CNNs) are a big reason why there has been so much interesting work done in computer vision recently. These networks are designed to work on matrices representing 2D images, so a natural idea is to take our raw audio and generate an image from it. Generating these images from audio is sometimes called a frontend in speech recognition papers. Just to hammer the point home, here’s a diagram explaining why we need to do this step:
The standard way of generating images from audio is by looking at the audio chunk-by-chunk, and analyzing it in the frequency domain, and then applying various techniques to massage that data into a form that is well suited to machine learning. This is a common technique in sound and speech processing, and there are great implementations in Python. TensorFlow even has a custom op for extracting spectrograms from audio.
On the web, these tools are lacking. The Web Audio API can almost do this using the AnalyserNode, as I’ve shown in the past, but there is an important limitation in the context of data processing: the AnalyserNode (implemented internally as RealtimeAnalyser) is designed only for real-time analysis. You can set up an OfflineAudioContext and run your audio through the analyser, but you will get unreliable results.
Audio feature extraction
I found an audio feature extraction tutorial, which I followed closely when implementing this feature extractor in TypeScript. What follows can be a useful companion to that tutorial.
Let’s begin with an audio example (a man saying the word “left”):
Here’s that raw waveform plotted as pressure as a function of time:
We could take the FFT over the whole signal, but the signal changes a lot over time. In our example above, the “left” utterance only takes about 200 ms, and most of the signal is silence. Instead, we break the raw audio signal into overlapping buffers, spaced a hop length apart. Having our buffers overlap ensures that we don’t miss any interesting details happening at the buffer boundaries. There is an art to picking the right buffer and hop lengths:
- Pick too small a buffer, and you end up with an overly detailed image, and risk your neural net training on some irrelevant minutia, missing the forest for the trees.
- Pick too large a buffer, and you end up with an image too coarse to be useful.
In the illustration below, you can see five full buffers that overlap one another by 50%. For illustration purposes only, the buffer and hop durations are large (400 ms and 200 ms, respectively). In practice, we tend to use much shorter buffers (e.g. 20-40 ms), and often even shorter hop lengths to capture minute changes in the audio signal.
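As a quick sketch, here’s what that buffering step might look like in TypeScript. The function name and signature are mine for illustration, not the feature extractor’s actual API:

```typescript
// Sketch of splitting raw audio into overlapping buffers.
// `bufferLength` and `hopLength` are in samples.
function frame(
  signal: Float32Array,
  bufferLength: number,
  hopLength: number
): Float32Array[] {
  const frames: Float32Array[] = [];
  // Advance by one hop at a time; neighboring frames overlap
  // whenever hopLength < bufferLength.
  for (let start = 0; start + bufferLength <= signal.length; start += hopLength) {
    frames.push(signal.subarray(start, start + bufferLength));
  }
  return frames;
}

// At a 16 kHz sample rate, a 25 ms buffer with a 10 ms hop:
const sampleRate = 16000;
const bufferLength = Math.round(0.025 * sampleRate); // 400 samples
const hopLength = Math.round(0.010 * sampleRate); // 160 samples
```

Note that `subarray` gives views into the original signal rather than copies, so framing itself is cheap.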
Then, we consider each buffer in the frequency domain, using a Fast Fourier Transform (FFT) algorithm. The FFT gives us complex values, from which we can extract magnitudes, or energies. For example, here are the FFT energies of one of the buffers, approximately the second one in the above image, where the speaker begins saying the “le” syllable of “left”:
Now imagine we do this for every buffer we generated in the previous step, take each FFT array and, instead of showing energy as a function of frequency, stack the arrays vertically so that the y-axis represents frequency and color represents energy. We end up with a spectrogram:
We could feed this image into our neural network, but you’ll agree that it looks pretty sparse. So much space is wasted, and there’s not much signal there for a neural network to train on.
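Before moving on, the per-buffer frequency analysis can be sketched as follows. This uses a naive DFT purely for clarity; the real frontend uses an O(n log n) FFT library, and the function name is illustrative:

```typescript
// Naive DFT of a real-valued buffer, returning the magnitude (energy)
// of each frequency bin up to the Nyquist frequency. O(n^2), so this
// is only meant to show where the energies come from.
function dftEnergies(buffer: Float32Array): Float32Array {
  const n = buffer.length;
  // For real input, bins above n/2 mirror the ones below, so keep n/2 + 1.
  const energies = new Float32Array(n / 2 + 1);
  for (let k = 0; k < energies.length; k++) {
    let re = 0;
    let im = 0;
    for (let t = 0; t < n; t++) {
      const angle = (-2 * Math.PI * k * t) / n;
      re += buffer[t] * Math.cos(angle);
      im += buffer[t] * Math.sin(angle);
    }
    // Magnitude of the complex value for bin k.
    energies[k] = Math.sqrt(re * re + im * im);
  }
  return energies;
}
```

Bin k corresponds to frequency k · sampleRate / n, so a pure tone shows up as a single sharp peak in the output.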
Let’s jump back to the FFT plot and zoom into our area of interest. The frequencies in this plot are bunched up below 5 kHz, since the speaker isn’t producing particularly high-frequency sound. Human audition tends to be logarithmic, so we can view the same range on a log plot:
Let’s generate new spectrograms as we did in the earlier step, but rather than using a linear plot of energies, we can use a log plot of FFT energies:
Looks a bit better, but there is still room for improvement. Humans are much better at discerning small changes in pitch at low frequencies than at high frequencies. The Mel scale relates the perceived pitch of a pure tone to its actual measured frequency. To go from frequencies to Mels, we create a triangular filter bank:
Each colorful triangle above is a window that we can apply to the frequency representation of the sound. Applying each window to the FFT energies we generated earlier will give us the Mel spectrum, in this case an array of 20 values:
Plotting this as a spectrogram, we get our feature, the log-mel spectrogram:
The 1 s images above were generated using audio feature extraction software written in TypeScript, which I’ve released publicly. Here’s a demo that lets you run the feature extractor on your own audio, and the code is on GitHub.
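To make the filter-bank step concrete, here is a sketch of building the triangular filters and applying them to a buffer’s FFT energies. The Hz-to-Mel formula is one common (HTK-style) formulation, and all names here are mine, not the library’s API:

```typescript
// One common Hz <-> Mel formulation.
const hzToMel = (hz: number): number => 1127 * Math.log(1 + hz / 700);
const melToHz = (mel: number): number => 700 * (Math.exp(mel / 1127) - 1);

// Apply `melCount` triangular filters to an array of FFT energies.
// FFT bin i corresponds to frequency i * sampleRate / fftSize.
function applyMelFilterbank(
  energies: Float32Array,
  melCount: number,
  sampleRate: number,
  fftSize: number
): Float32Array {
  const melMax = hzToMel(sampleRate / 2);
  // melCount + 2 points, evenly spaced in Mels: filter m is a triangle
  // with corners at points[m], points[m + 1], points[m + 2].
  const points: number[] = [];
  for (let i = 0; i < melCount + 2; i++) {
    points.push(melToHz((melMax * i) / (melCount + 1)));
  }
  const mel = new Float32Array(melCount);
  for (let m = 0; m < melCount; m++) {
    const [left, center, right] = [points[m], points[m + 1], points[m + 2]];
    for (let bin = 0; bin < energies.length; bin++) {
      const f = (bin * sampleRate) / fftSize;
      let weight = 0;
      if (f > left && f <= center) weight = (f - left) / (center - left);
      else if (f > center && f < right) weight = (right - f) / (right - center);
      mel[m] += weight * energies[bin];
    }
  }
  return mel;
}
```

Taking an element-wise log of the result (with a small offset to avoid log 0) then gives one column of the log-mel spectrogram.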
Handling real-time audio input
By default, the feature extractor frontend takes a fixed buffer of audio as input. But to make an interactive audio demo, we need to process a continuous stream of audio data, generating new images as new audio comes in. Luckily, we don’t need to recompute the whole log-mel spectrogram each time, just the new parts of the image. We can add the new spectrogram columns on the right and remove the old ones on the left, resulting in an image that scrolls from right to left. The StreamingFeatureExtractor class implements this important optimization.
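The idea can be sketched like this. This is a simplified stand-in, not the actual StreamingFeatureExtractor API, and `extractColumn` stands in for the FFT and filter-bank steps described earlier:

```typescript
// Keep a fixed-width spectrogram; as each new audio buffer arrives,
// compute only its column and drop the oldest one.
class StreamingSpectrogram {
  private columns: Float32Array[] = [];

  constructor(private width: number) {}

  addBuffer(
    buffer: Float32Array,
    extractColumn: (b: Float32Array) => Float32Array
  ): void {
    this.columns.push(extractColumn(buffer));
    // Once the image is full, discard the leftmost (oldest) column.
    if (this.columns.length > this.width) this.columns.shift();
  }

  getImage(): Float32Array[] {
    return this.columns;
  }
}
```

Each incoming buffer therefore costs one column’s worth of work, rather than recomputing the full image.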
But there is one caveat: it currently relies on ScriptProcessorNode, which is notorious for dropping samples. I’ve tried to mitigate this as much as possible by using a large input buffer size, but the real solution will be to use AudioWorklets when they become available.
An implementation note: this comparison of JS FFT libraries suggests that the Emscripten-compiled KissFFT is the fastest (though still 2-5x slower than native), so that’s the one I used.
The images resulting from the three implementations are similar, which is a good sanity check, but they are not identical. I haven’t found the time yet, but it would be very worthwhile to have a consistent cross platform audio feature extractor, so that models trained in Python/C++ could run directly on the web, and vice versa.
I should also mention that although log-mel features are commonly used by serious audio researchers, this is still an active area of research. Another audio feature extraction technique, Per-Channel Energy Normalization (PCEN), appears to perform better in at least some cases, such as processing far-field audio. I haven’t had time to delve into the details yet, but understanding it and porting it to the web also seems like a worthy task.
Ok, so to recap: we’ve generated log-mel spectrogram images from streaming audio, ready to feed into a neural network. Oh yeah, the actual machine learning part? That’s coming in the next post, next week.
This is a post by Boris Smus, originally from Boris’ website, posted to XRDS with permission of the author.