Web-based voice command recognition

Last time we converted audio buffers into images. This time we’ll take these images and train a neural network using deeplearn.js. The result is a browser-based demo that lets you speak a command (“yes” or “no”), and see the output of the classifier in real-time, like this:

Curious to play with it, or to see whether it recognizes “yay” or “nay” in addition to “yes” and “no”? Try it out live. You will quickly see that the performance is far from perfect. But that’s ok with me: this example is intended to be a reasonable starting point for doing all sorts of audio recognition on the web. Now, let’s dive into how this works.

Quick start: training and testing a command recognizer

Here’s how you can train your own yes/no classifier:

  1. Go to the model training page. It will take a bit of time to download the training data from the server.
  2. Click the train button, and you’ll see a graph showing training progress. Once you are ready (this will take a while, perhaps 500 iterations or 2 minutes, depending on your hardware), stop training, and press the save weights (file) button. This will download a JSON file.


  3. Then go to the inference demo page, press the load weights (file) button and select the downloaded JSON file to load the trained model.
  4. Flip the switch, grant access to the microphone and try saying “yes” or “no”. You’ll see microphone and confidence levels indicated at the bottom of the page.


The above is a mechanistic account of how the training example works. If you are interested in learning about the gory (and interesting) details, read on.

Data pre-processing and loading

Training a neural net requires a lot of training data. In practice, millions of examples may be required, but the dataset we’ll be using is small by modern standards, with just 65,000 labeled examples. Each example is a separate wav file, with the label in the filename.

Loading each training wav as a separate request turned out to be quite slow. The overhead from each request is small, but compounded over a few thousand requests, it really starts to be felt. An easy optimization to load data more quickly is to put all examples with the same label into one long audio file. Decoding audio files is pretty fast, and so is splitting them into one-second-long buffers. A further optimization is to use a compressed audio format, such as mp3. scripts/preprocess.py will do this concatenation for you, producing this mesmerising result.
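In the browser, slicing a decoded concatenated file back into one-second examples is straightforward. Here is a minimal sketch using the Web Audio API; the file URL is hypothetical, and the real loader also handles labels, resampling, and progress reporting:

```ts
// Sketch: fetch one concatenated per-label file, decode it, and slice the
// result into one-second training examples. The URL naming is an assumption.
async function loadExamplesForLabel(url: string): Promise<Float32Array[]> {
  const response = await fetch(url);
  const encoded = await response.arrayBuffer();

  // Note: decodeAudioData resamples to the AudioContext's own rate, so one
  // second of audio is decoded.sampleRate samples, not necessarily 16,000.
  const ctx = new AudioContext();
  const decoded = await ctx.decodeAudioData(encoded);
  const samples = decoded.getChannelData(0);  // mono

  // Every example in the concatenated file is exactly one second long.
  const samplesPerExample = decoded.sampleRate;
  const examples: Float32Array[] = [];
  for (let i = 0; i + samplesPerExample <= samples.length; i += samplesPerExample) {
    examples.push(samples.slice(i, i + samplesPerExample));
  }
  return examples;
}

// e.g. const yesExamples = await loadExamplesForLabel('data/yes.mp3');
```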

After we “rehydrate” our raw audio examples, we process the buffers of raw data into features. We do this using the audio feature extractor I mentioned in the last post, which takes in raw audio and produces a log-mel spectrogram. This is relatively slow, and accounts for most of the time spent loading the dataset.

Model training considerations

For the yes/no recognizer, we have only two commands that we care about: “yes” and “no”. But we also want to detect the lack of any such utterances, as well as silence. We include a set of random utterances as the “other” category (none of which are yes or no); these are also generated by the preprocessing script.

Since we’re dealing with real microphones, we never expect to hear pure silence. Instead, “silence” is some level of ambient noise compounded by crappy microphone quality. Luckily, the training data also includes background noise which we mix with our training examples at various volumes. We also generate a set of silence examples, which includes only the background audio. Once we’ve prepared our samples, we generate our final spectrograms as our input.
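As a rough sketch of the mixing step (the gain range and offsets here are illustrative, not the values the demo uses, and it assumes the background noise clip is at least as long as the utterance):

```ts
// Sketch: mix a one-second utterance with background noise at a random
// volume, and synthesize "silence" examples from noise alone.
function mixWithNoise(utterance: Float32Array, noise: Float32Array,
                      maxNoiseGain = 0.1): Float32Array {
  const noiseGain = Math.random() * maxNoiseGain;
  const noiseOffset = Math.floor(Math.random() * (noise.length - utterance.length));
  const out = new Float32Array(utterance.length);
  for (let i = 0; i < utterance.length; i++) {
    out[i] = utterance[i] + noiseGain * noise[noiseOffset + i];
  }
  return out;
}

// A "silence" example is just background noise at a low volume.
function makeSilenceExample(noise: Float32Array, length = 16000): Float32Array {
  const zeros = new Float32Array(length);  // silent "utterance"
  return mixWithNoise(zeros, noise);
}
```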

To generate these final spectrograms, we need to decide on a buffer length and a hop length. A reasonable buffer length is 1024 samples, with a hop length of 512 samples. Since we are dealing with a sample rate of 16,000 Hz, this works out to a window duration of about 64 ms, sampled every 32 ms.
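To make the arithmetic concrete:

```ts
const sampleRate = 16000;   // Hz
const bufferLength = 1024;  // samples per analysis window
const hopLength = 512;      // samples between window starts

const windowMs = 1000 * bufferLength / sampleRate;  // 64 ms
const hopMs = 1000 * hopLength / sampleRate;        // 32 ms
// A one-second clip yields roughly (16000 - 1024) / 512 + 1 ≈ 30 windows.
```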

Once we have labeled spectrograms, we need to convert the inputs and labels into deeplearn arrays. The label strings “yes”, “no”, “other”, and “silence” will be one-hot encoded as Array1Ds of four integers, meaning that “yes” corresponds to [1, 0, 0, 0], and “no” to [0, 1, 0, 0]. Spectrograms from the feature extractor need to be converted into an Array3D, which can be fed as input to the model.
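Here is a minimal sketch of the label encoding. The conversion into deeplearn.js arrays is omitted, since that part of the API has changed over time; at the time it amounted to calls like Array1D.new and Array3D.new on plain arrays:

```ts
// The four classes, in the order used for one-hot encoding.
const LABELS = ['yes', 'no', 'other', 'silence'];

function oneHot(label: string): number[] {
  const index = LABELS.indexOf(label);
  if (index === -1) {
    throw new Error(`Unknown label: ${label}`);
  }
  const encoding = new Array<number>(LABELS.length).fill(0);
  encoding[index] = 1;
  return encoding;
}

// oneHot('yes') => [1, 0, 0, 0]
// oneHot('no')  => [0, 1, 0, 0]
//
// Each spectrogram becomes a 3D tensor of shape [numFrames, numMelBins, 1];
// with the buffer/hop settings above and, say, 20 mel bins, that is
// roughly [30, 20, 1].
```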

The model we are training consists of two convolution layers and one fully connected layer. I took this architecture directly from the MNIST example of deeplearn.js, and it hasn’t been customized for dealing with spectrograms at all. As a result, performance is a far cry from state-of-the-art speech recognition. To see even more mis-classifications, try out MNIST for audio, which recognizes spoken digits (e.g. “zero” through “nine”). I am confident that we could do better by following this paper. A real-world speech recognizer might not use convolution at all, instead opting for an LSTM, which is better suited to processing time-series data.

Lastly, we want to tell the machine learning framework how to train the model. In ML parlance, we need to set the hyperparameters, which includes setting the learning rate (how much to follow the gradient at each step) and the batch size (how many examples to ingest at a time). A rough sketch of this setup is below, and then we’re off to the races:
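With the deeplearn.js Graph/Session API of that era, the training setup looked roughly like the sketch below. The exact exports and signatures have since changed, the hyperparameter values are purely illustrative, and buildModel is a hypothetical stand-in for the two-convolution-plus-dense graph described above:

```ts
import {CostReduction, FeedEntry, Graph, NDArrayMathGPU, Session,
        SGDOptimizer, Tensor} from 'deeplearn';

// Hypothetical helper: builds the two-conv + fully connected graph and
// returns the cost node plus the feed entries (spectrogram batches and
// one-hot labels).
declare function buildModel(g: Graph): {costTensor: Tensor, feedEntries: FeedEntry[]};

// Hyperparameters. These particular values are illustrative.
const LEARNING_RATE = 0.01;  // how far to follow the gradient at each step
const BATCH_SIZE = 64;       // how many examples to ingest at a time
const NUM_STEPS = 500;       // roughly what the training page suggests

const graph = new Graph();
const {costTensor, feedEntries} = buildModel(graph);

const math = new NDArrayMathGPU();
const session = new Session(graph, math);
const optimizer = new SGDOptimizer(LEARNING_RATE);

for (let step = 0; step < NUM_STEPS; step++) {
  // One step: feed a batch, compute the mean cost, and update the weights.
  const cost = session.train(
      costTensor, feedEntries, BATCH_SIZE, optimizer, CostReduction.MEAN);
  if (step % 10 === 0) {
    console.log(`step ${step}: cost ${cost.get()}`);
  }
}
```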

Training graph.

During training, the gradient descent algorithm tries to minimize cost, which you can see in blue. We also plot accuracy in orange, which is occasionally calculated by running inference on a test set. We use a random subset of the test set because inference takes time, and we’d like to train as quickly as possible.
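Estimating accuracy on a random subset is simple. Here is a sketch, where predict is a hypothetical stand-in for running inference on one spectrogram:

```ts
// Sketch: estimate accuracy on a small random subset of the test set.
// `predict` is assumed to return the four class scores for one spectrogram.
function estimateAccuracy(
    testSpecs: Float32Array[], testLabels: number[],
    predict: (spec: Float32Array) => number[],
    subsetSize = 100): number {
  let correct = 0;
  for (let i = 0; i < subsetSize; i++) {
    const j = Math.floor(Math.random() * testSpecs.length);
    const scores = predict(testSpecs[j]);
    const predicted = scores.indexOf(Math.max(...scores));  // argmax
    if (predicted === testLabels[j]) {
      correct++;
    }
  }
  return correct / subsetSize;
}
```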

Once we are happy with the test accuracy, we can save the model weights and use them to infer results.

Saving and loading model weights

A model is defined by its architecture and the weights of its weight-bearing nodes. Weights are the values that are learned during the process of model training, and not all nodes have weights. ReLUs and flatten nodes don’t. But convolution and fully connected nodes have both weights and biases. These weights are tensors of arbitrary shapes. To save and load models, we need to be able to save both graphs and their weights.

Saving & loading models is important for a few reasons:

  1. Model training takes time, so you might want to train a bit, save weights, take a break, and then resume from where you left off. This is called checkpointing.
  2. For inference, it’s useful to have a self-contained model that you can just load and run.

At the time of writing, deeplearn.js didn’t have facilities to serialize models and model weights. For this example, I’ve implemented a way to load and save weights, assuming that the model architecture itself is hard-coded. The GraphSaverLoader class can save & load from a local store (IndexedDB), or from a file. Ultimately, we will need a non-hacky way of saving and loading models and their corresponding weights, and I’m excited for the near future of improved ML developer ergonomics.
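The underlying idea is simple: each weight is just a named, shaped Float32Array, so it can round-trip through JSON. Here is a simplified sketch; the names and structure are assumptions, not the actual GraphSaverLoader format:

```ts
// Sketch: serialize named, shaped weight tensors to JSON and back.
interface SerializedWeights {
  [name: string]: {shape: number[], values: number[]};
}

type WeightMap = Map<string, {shape: number[], values: Float32Array}>;

function saveWeights(weights: WeightMap): string {
  const out: SerializedWeights = {};
  weights.forEach((w, name) => {
    out[name] = {shape: w.shape, values: Array.from(w.values)};
  });
  return JSON.stringify(out);
}

function loadWeights(json: string): WeightMap {
  const parsed: SerializedWeights = JSON.parse(json);
  const weights: WeightMap = new Map();
  for (const name of Object.keys(parsed)) {
    weights.set(name, {
      shape: parsed[name].shape,
      values: new Float32Array(parsed[name].values),
    });
  }
  return weights;
}
```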

Wrapping up

Many thanks to Nikhil and Daniel for their hard work on deeplearn.js, and willingness to answer my barrages of stupid little questions. Also, thanks to Pete, who is responsible for creating and releasing the dataset I used in this post. And thank you, dear reader, for reading this far.

I’m stoked to see how this kind of browser-based audio recognition tech can be applied to exciting, educational ML projects like Teachable Machine. How cool would it be if you could make a self-improving system that trains on every additional spoken utterance? The ability to train these kinds of models in the browser allows us to entertain such possibilities in a privacy-preserving way, without sending anything to any server.

So there you have it! This has been an explanation of voice command recognition on the web. We covered feature extraction in the previous post, and this time, dug a little bit into model training and real-time inference entirely in the browser.

If you build on this example, please drop me a note on twitter.

This is a post by Boris Smus, originally from Boris’ website, posted to XRDS with permission of the author.

Audio features for web-based ML

One of the first problems presented to students of deep learning is to classify handwritten digits in the MNIST dataset. This was recently ported to the web thanks to deeplearn.js. The web version has distinct educational advantages over the relatively dry TensorFlow tutorial. You can immediately get a feeling for the model, and start building intuition for what works and what doesn’t. Let’s preserve this interactivity, but change domains to audio. This post sets the scene for the auditory equivalent of MNIST. Rather than recognize handwritten digits, we will focus on recognizing spoken commands. We’ll do this by converting sounds like this:

Into images like this, called log-mel spectrograms, and in the next post, feed these images into the same types of models that do handwriting recognition so well:


The audio feature extraction technique I discuss here is generic enough to work for all sorts of audio, not just human speech. The rest of the post explains how. If you don’t care and just want to see the code, or play with some live demos, be my guest!


Neural networks are having quite a resurgence, and for good reason. Computers are beating humans at many challenging tasks, from identifying faces and images to playing Go. The basic principles of neural nets are relatively simple, but the details can get quite complex. Luckily, non-AI experts can get a feeling for what can be done, because a lot of the output is quite engaging. Unfortunately, these demos are mostly visual in nature: they are either examples of computer vision, or they generate images or video as their main output. And few of these examples are interactive.

Pre-processing audio sounds hard, do we have to?

Raw audio is a pressure wave sampled tens of thousands of times per second and stored as an array of numbers. It’s quite a bit of data, but there are neural networks that can ingest it directly. WaveNet does speech-to-text and text-to-speech using raw audio sequences, without any explicit feature extraction. Unfortunately, it’s slow: running speech recognition on a 2s example took 30s on my laptop. Doing this in real time, in a web browser, isn’t quite feasible yet.

Convolutional Neural Networks (CNNs) are a big reason why there has been so much interesting work done in computer vision recently. These networks are designed to work on matrices representing 2D images, so a natural idea is to take our raw audio and generate an image from it. Generating these images from audio is sometimes called a frontend in speech recognition papers. Just to hammer the point home, here’s a diagram explaining why we need to do this step:


The standard way of generating images from audio is to look at the audio chunk by chunk, analyze it in the frequency domain, and then apply various techniques to massage that data into a form that is well suited to machine learning. This is a common technique in sound and speech processing, and there are great implementations in Python. TensorFlow even has a custom op for extracting spectrograms from audio.

On the web, these tools are lacking. The Web Audio API can almost do this, using the AnalyserNode, as I’ve shown in the past, but there is an important limitation in the context of data processing: AnalyserNode (née RealtimeAnalyser) is only for real-time analysis. You can set up an OfflineAudioContext and run your audio through the analyser, but you will get unreliable results.

The alternative is to do this without the Web Audio API, and there are many signal processing JavaScript libraries that might help. None of them is quite adequate, whether due to incompleteness or abandonment. So here’s an illustrated take on extracting Mel features from raw audio.

Audio feature extraction

I found an audio feature extraction tutorial, which I followed closely when implementing this feature extractor in TypeScript. What follows can be a useful companion to that tutorial.

Let’s begin with an audio example (a man saying the word “left”):


Here’s that raw waveform plotted as pressure as a function of time:


We could take the FFT over the whole signal, but the signal changes a lot over time. In our example above, the “left” utterance only takes about 200 ms, and most of the signal is silence. Instead, we break up the raw audio signal into overlapping buffers, spaced a hop length apart (sketched in code after the list below). Having our buffers overlap ensures that we don’t miss any interesting details happening at the buffer boundaries. There is an art to picking the right buffer and hop lengths:

  • Pick too small a buffer, and you end up with an overly detailed image, and risk your neural net training on some irrelevant minutia, missing the forest for the trees.
  • Pick too large a buffer, and you end up with an image too coarse to be useful.
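Here is a minimal sketch of the framing step. Trailing samples that don’t fill a whole buffer are simply dropped here; a real extractor might zero-pad them instead:

```ts
// Sketch: split raw audio into overlapping buffers spaced hopLength apart.
function frame(audio: Float32Array, bufferLength: number,
               hopLength: number): Float32Array[] {
  const buffers: Float32Array[] = [];
  for (let start = 0; start + bufferLength <= audio.length; start += hopLength) {
    buffers.push(audio.slice(start, start + bufferLength));
  }
  return buffers;
}

// With a 400 ms buffer and 200 ms hop at 16 kHz, as in the illustration:
// frame(audio, 0.4 * 16000, 0.2 * 16000);
```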

In the illustration below, you can see five full buffers that overlap one another by 50%. For illustration purposes only, the buffer and hop durations are large (400 ms and 200 ms, respectively). In practice, we tend to use much shorter buffers (e.g. 20-40 ms), and often even shorter hop lengths, to capture minute changes in the audio signal.


Then, we consider each buffer in the frequency domain. We can do this using a Fast Fourier Transform (FFT) algorithm. This algorithm gives us complex values from which we can extract magnitudes, or energies. For example, here are the FFT energies of one of the buffers, approximately the second one in the above image, where the speaker begins saying the “le” syllable of “left”:


Now imagine we do this for every buffer we generated in the previous step, take each FFT array and, instead of showing energy as a function of frequency, stack the arrays vertically so that the y-axis represents frequency and color represents energy. We end up with a spectrogram:


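In code, the per-buffer step might look like the sketch below, assuming an fft function that returns interleaved real/imaginary values, as a thin KissFFT wrapper might; that function is an assumption, not a Web API:

```ts
// `fft` is assumed to return interleaved [re0, im0, re1, im1, ...] values
// for the given buffer.
declare function fft(buffer: Float32Array): Float32Array;

// Energies (squared magnitudes) of one buffer. Only the first N/2 + 1 bins
// are unique for real-valued input.
function fftEnergies(buffer: Float32Array): Float32Array {
  const complex = fft(buffer);
  const binCount = buffer.length / 2 + 1;
  const energies = new Float32Array(binCount);
  for (let i = 0; i < binCount; i++) {
    const re = complex[2 * i];
    const im = complex[2 * i + 1];
    energies[i] = re * re + im * im;
  }
  return energies;
}

// One column of the spectrogram per buffer.
function spectrogram(buffers: Float32Array[]): Float32Array[] {
  return buffers.map(fftEnergies);
}
```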
We could feed this image into our neural network, but you’ll agree that it looks pretty sparse. We have wasted so much space, and there’s not much signal there for a neural network to train on.

Let’s jump back to the FFT plot and zoom into our area of interest. The frequencies in this plot are bunched up below 5 kHz, since the speaker isn’t producing particularly high-frequency sound. Human audition tends to be logarithmic, so we can view the same range on a log plot:


Let’s generate new spectrograms as we did in an earlier step, but rather than using a linear plot of energies, let’s use a log plot of FFT energies:


Looks a bit better, but there is room for improvement. Humans are much better at discerning small changes in pitch at low frequencies than at high frequencies. The Mel scale relates the perceived pitch of a pure tone to its actual measured frequency. To go from frequencies to Mels, we create a triangular filter bank:


Each colorful triangle above is a window that we can apply to the frequency representation of the sound. Applying each window to the FFT energies we generated earlier will give us the Mel spectrum, in this case an array of 20 values:


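A simplified sketch of building and applying such a filter bank, using the standard HTK-style Mel formulas; the bin counts and edge frequencies here are illustrative:

```ts
function hzToMel(hz: number): number {
  return 2595 * Math.log10(1 + hz / 700);
}

function melToHz(mel: number): number {
  return 700 * (Math.pow(10, mel / 2595) - 1);
}

// Build `melCount` triangular filters, each covering fftSize/2 + 1 FFT bins.
function createMelFilterbank(fftSize: number, melCount = 20,
                             lowHz = 20, highHz = 8000,
                             sampleRate = 16000): Float32Array[] {
  const binCount = fftSize / 2 + 1;
  const hzPerBin = sampleRate / fftSize;

  // Mel-spaced boundary points, converted back to FFT bin indices.
  const lowMel = hzToMel(lowHz);
  const highMel = hzToMel(highHz);
  const binPoints: number[] = [];
  for (let i = 0; i < melCount + 2; i++) {
    const mel = lowMel + (i / (melCount + 1)) * (highMel - lowMel);
    binPoints.push(Math.min(binCount - 1, Math.round(melToHz(mel) / hzPerBin)));
  }

  const filters: Float32Array[] = [];
  for (let i = 1; i <= melCount; i++) {
    const filter = new Float32Array(binCount);
    const [left, center, right] = [binPoints[i - 1], binPoints[i], binPoints[i + 1]];
    // Rising edge of the triangle.
    for (let j = left; j < center; j++) {
      filter[j] = (j - left) / (center - left);
    }
    // Falling edge of the triangle.
    for (let j = center; j < right; j++) {
      filter[j] = (right - j) / (right - center);
    }
    filters.push(filter);
  }
  return filters;
}

// Each filter produces one Mel energy; taking logs gives the log-mel spectrum.
function logMelSpectrum(energies: Float32Array, filters: Float32Array[]): Float32Array {
  return Float32Array.from(filters.map(filter => {
    let sum = 0;
    for (let i = 0; i < energies.length; i++) {
      sum += filter[i] * energies[i];
    }
    return Math.log(sum + 1e-6);  // small offset avoids log(0)
  }));
}
```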
Plotting this as a spectrogram, we get our feature, the log-mel spectrogram:


The one-second images above were generated using audio feature extraction software written in TypeScript, which I’ve released publicly. Here’s a demo that lets you run the feature extractor on your own audio, and the code is on GitHub.

Handling real-time audio input

By default the feature extractor frontend takes a fixed buffer of audio as input. But to make an interactive audio demo, we need to process a continuous stream of audio data. So we will need to generate new images as new audio comes in. Luckily we don’t need to recompute the whole log-mel spectrogram every time, just the new parts of the image. We can then add the new parts of spectrogram on the right, and remove the old parts, resulting in a movie that feeds from the right to the left. The StreamingFeatureExtractor class implements this important optimization.
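Conceptually, the rolling update looks like the sketch below; the real StreamingFeatureExtractor does more bookkeeping than this:

```ts
// Sketch: a fixed-width rolling window of spectrogram columns. New columns
// are appended on the right and the oldest ones fall off the left.
class RollingSpectrogram {
  private columns: Float32Array[] = [];

  constructor(private width: number) {}

  addColumns(newColumns: Float32Array[]) {
    this.columns.push(...newColumns);
    while (this.columns.length > this.width) {
      this.columns.shift();  // drop the oldest column
    }
  }

  get image(): Float32Array[] {
    return this.columns;
  }
}
```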

But there is one caveat: it currently relies on ScriptProcessorNode, which is notorious for dropping samples. I’ve tried to mitigate this as much as possible by using a large input buffer size, but the real solution will be to use AudioWorklets when they are available.
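For reference, here is a sketch of the microphone capture path with a large ScriptProcessorNode buffer. These are all standard Web Audio calls, though the exact wiring in the demo may differ:

```ts
// Sketch: capture microphone audio with a ScriptProcessorNode using a large
// buffer (16384 is the maximum allowed size) to reduce dropped samples, and
// hand each chunk to the streaming extractor via the onAudio callback.
async function startCapture(onAudio: (samples: Float32Array) => void) {
  const ctx = new AudioContext();
  const stream = await navigator.mediaDevices.getUserMedia({audio: true});
  const source = ctx.createMediaStreamSource(stream);

  const processor = ctx.createScriptProcessor(16384, 1, 1);
  processor.onaudioprocess = (event) => {
    // Copy, since the underlying buffer is reused between callbacks.
    onAudio(new Float32Array(event.inputBuffer.getChannelData(0)));
  };

  source.connect(processor);
  processor.connect(ctx.destination);  // needed for onaudioprocess to fire in some browsers
}
```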

Wrapping up

An implementation note: here is a comparison of JS FFT libraries, which suggests that the Emscripten-compiled KissFFT is the fastest (though still 2-5x slower than native); it’s the one I used.

Here is a sanity check comparing the output of my web-based feature extractor to that of other libraries, most notably librosa and the one used for AudioSet:


The images resulting from the three implementations are similar, which is a good sanity check, but they are not identical. I haven’t found the time yet, but it would be very worthwhile to have a consistent cross platform audio feature extractor, so that models trained in Python/C++ could run directly on the web, and vice versa.

I should also mention that although log-mel features are commonly used by serious audio researchers, this is an active area of research. Another audio feature extraction technique, called Per-Channel Energy Normalization (PCEN), appears to perform better, at least in some cases such as processing far-field audio. I haven’t had time to delve into the details yet, but understanding it and porting it to the web also seems like a worthy task.

Major thanks to Dick Lyon for pointing out a few bugs in my feature extraction code. Pick up his “Human and Machine Hearing” if you’re ready to delve deeper into sound understanding.

Ok, so to recap, we’ve generated log-mel spectrogram images from streaming audio that are ready to feed into a neural network. Oh yeah, the actual machine learning part? That’s the next post coming up next week.

This is a post by Boris Smus, originally from Boris’ website, posted to XRDS with permission of the author.

Natural Language Understanding: Let’s Play Dumb

What is the meaning of the word understanding? This was a question posed during a particularly enlightening lecture given by Dr. Anupam Basu, a professor with the Department of Computer Science and Engineering at IIT Kharagpur, India.

Understanding something probably relates to being able to answer questions based on it, maybe form an image or a flow chart in your head. If you can make another human being comprehend the concept with the least amount of effort, well that means you do truly understand what you are talking about. But what about a computer? How does it understand?

Notes from IIT Kharagpur ACM Summer School on ML and NLP


Entrance to library and academic area.

[This entry has been edited for clarity. An example given discussing the similarity of words in French and English was incorrect. The following sentence has been removed: “The next question addressed by Bhattacharya was the ambiguity that may arise in languages with similar origins, for example in French ‘magazine’ actually means shop while in English, well it is a magazine.”]

Today is June 14th, so I am 14 days into summer school; 7 more days left, and we are all already feeling saddened by the idea of leaving Kharagpur soon. In India, an IIT is a dream for 90% of the 12th graders who join IIT coaching classes. The competition is high so not everyone gets in. I’m one of those who didn’t get in. So when I saw there was an ACM Summer School opportunity at the largest and oldest IIT in India, obviously I grabbed it. By sheer luck, I was selected to actually attend the school. Over the course of 21 days, we have been tasked to learn about machine learning and natural language processing.

Convolutional Neural Networks (CNNs): An Illustrated Explanation

Artificial Neural Networks (ANNs) are used every day for tackling a broad spectrum of prediction and classification problems, and for scaling up applications which would otherwise require intractable amounts of data. ML has been witnessing a “Neural Revolution” [1] since the mid-2000s, as ANNs have found application in tools and technologies such as search engines, automatic translation, and video classification. Though ANNs are structurally diverse, Convolutional Neural Networks (CNNs) stand out for their ubiquity of use, expanding the ANN domain of applicability from feature vectors to variable-length inputs.
