Web-based voice command recognition

Last time we converted audio buffers into images. This time we’ll take these images and train a neural network using deeplearn.js. The result is a browser-based demo that lets you speak a command (“yes” or “no”), and see the output of the classifier in real-time, like this:

Curious to play with it, and see whether it recognizes “yay” or “nay” in addition to “yes” and “no”? Try it out live. You will quickly see that the performance is far from perfect. But that’s ok with me: this example is intended to be a reasonable starting point for doing all sorts of audio recognition on the web. Now, let’s dive into how this works.

Quick start: training and testing a command recognizer

Here’s how you can train your own yes/no classifier:

  1. Go to the model training page. It will take a bit of time to download the training data from the server.
  2. Click the train button, and you’ll see a graph showing training progress. Once you are ready (this will take a while, perhaps 500 iterations or 2 minutes, depending on your hardware), stop training, and press the save weights (file) button. This will download a JSON file.


  3. Then go to the inference demo page, press the load weights (file) button and select the downloaded JSON file to load the trained model.
  4. Flip the switch, grant access to the microphone and try saying “yes” or “no”. You’ll see microphone and confidence levels indicated at the bottom of the page.


The above is a mechanistic account of how the training example works. If you are interested in learning about the gory (and interesting) details, read on.

Data pre-processing and loading

Training a neural net requires a lot of training data. In practice, millions of examples may be required, but the dataset we’ll be using is small by modern standards, with just 65,000 labeled examples. Each example is a separate wav file, with the label in the filename.

Loading each training wav as a separate request turned out to be quite slow. The overhead from each request is small, but compounded over a few thousand requests, it really starts to be felt. An easy optimization to load data more quickly is to put all examples with the same label into one long audio file. Decoding audio files is pretty fast, and so is splitting them into one-second-long buffers. A further optimization is to use a compressed audio format, such as mp3. scripts/preprocess.py will do this concatenation for you, producing this mesmerising result.
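The splitting step can be sketched as follows. This is a minimal illustration, not the actual preprocessing code: the function name and the 16 kHz sample-rate constant are my assumptions, and in a real page the sample data would come from `AudioBuffer.getChannelData()` after `decodeAudioData()`.

```typescript
// Split one long decoded audio buffer into fixed-length one-second examples.
const SAMPLE_RATE = 16000; // assumed sample rate of the dataset
const EXAMPLE_LENGTH = SAMPLE_RATE * 1; // one second of samples

function splitIntoExamples(samples: Float32Array): Float32Array[] {
  const examples: Float32Array[] = [];
  for (let i = 0; i + EXAMPLE_LENGTH <= samples.length; i += EXAMPLE_LENGTH) {
    // subarray() returns a view, not a copy, so splitting is cheap.
    examples.push(samples.subarray(i, i + EXAMPLE_LENGTH));
  }
  return examples;
}
```

Any trailing partial second is dropped, which is fine here since every example in the concatenated file is exactly one second long.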

After we “rehydrate” our raw audio examples, we process buffers of raw data into features. We do this using the Audio Feature extractor I mentioned in the last post, which takes in raw audio, and produces a log-mel spectrogram. This is relatively slow, and accounts for most of the time spent loading the dataset.

Model training considerations

For the yes/no recognizer, we have only two commands that we care about: “yes”, and “no”. But we also want to detect the lack of any such utterances, as well as silence. We include a set of random utterances as the “other” category (none of which are yes or no). This set of examples is also generated by the preprocessing script.

Since we’re dealing with real microphones, we never expect to hear pure silence. Instead, “silence” is some level of ambient noise compounded by crappy microphone quality. Luckily, the training data also includes background noise which we mix with our training examples at various volumes. We also generate a set of silence examples, which includes only the background audio. Once we’ve prepared our samples, we generate our final spectrograms as our input.
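Mixing noise into an example is simple sample-wise addition. Here is a hedged sketch of that step; the helper name and signature are illustrative, not from the actual codebase, and both arrays are assumed to be the same length:

```typescript
// Mix a background-noise clip into a speech example at a given volume.
function mixWithNoise(
  speech: Float32Array,
  noise: Float32Array,
  noiseVolume: number // e.g. 0.1 for quiet background, 1.0 for loud
): Float32Array {
  const out = new Float32Array(speech.length);
  for (let i = 0; i < speech.length; i++) {
    // Clamp to [-1, 1] to stay within the valid sample range.
    out[i] = Math.max(-1, Math.min(1, speech[i] + noiseVolume * noise[i]));
  }
  return out;
}
```

Passing an all-zero “speech” array through the same function is one way to produce the pure background “silence” examples mentioned above.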

To generate these final spectrograms, we need to decide on buffer and hop length. A reasonable buffer length is 1024, and a hop length of 512. Since we are dealing with a sample rate of 16,000 Hz, this works out to a window duration of 64 ms, sampled every 32 ms.
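The window and hop durations follow directly from the sample rate:

```typescript
// Window and hop durations implied by the chosen buffer parameters.
const sampleRate = 16000; // Hz
const bufferLength = 1024; // samples per analysis window
const hopLength = 512; // samples between successive window starts

const windowMs = (bufferLength / sampleRate) * 1000; // 64 ms
const hopMs = (hopLength / sampleRate) * 1000; // 32 ms
```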

Once we have labeled spectrograms, we need to convert inputs and labels into deeplearn arrays. Label strings “yes”, “no”, “other”, and “silence” will be one-hot encoded as Array1Ds of four integers, meaning that “yes” corresponds to [1, 0, 0, 0], and “no” to [0, 1, 0, 0]. Spectrograms from the feature extractor need to be converted into an Array3D, which can be fed as input for the model.
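The one-hot step looks roughly like this. In the real code the result would be wrapped in a deeplearn.js Array1D; here we just build the plain array, and the `LABELS` ordering is taken from the paragraph above:

```typescript
// One-hot encode a label against a fixed label vocabulary.
const LABELS = ['yes', 'no', 'other', 'silence'];

function oneHot(label: string): number[] {
  const index = LABELS.indexOf(label);
  if (index === -1) throw new Error(`Unknown label: ${label}`);
  // Produce a vector with a single 1 at the label's position.
  return LABELS.map((_, i) => (i === index ? 1 : 0));
}
```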

The model we are training consists of two convolution layers, and one fully connected layer. I took this architecture directly from the MNIST example of deeplearn.js, and it hasn’t been customized for dealing with spectrograms at all. As a result, performance is a far cry from state of the art speech recognition. To see even more mis-classifications, try out MNIST for audio, which recognizes spoken digits (e.g. “zero” through “nine”). I am confident that we could do better by following this paper. A real-world speech recognizer might not use convolution at all, instead opting for an LSTM, which is better suited to process time-series data.

Lastly, we want to tell the machine learning framework how to train the model. In ML parlance, we need to set the hyperparameters, which includes setting the learning rate (how much to follow the gradient at each step) and batch size (how many examples to ingest at a time). And we’re off to the races:


During training, the gradient descent algorithm tries to minimize cost, which you can see in blue. We also plot accuracy in orange, which is occasionally calculated by running inference on a test set. We use a random subset of the test set because inference takes time, and we’d like to train as quickly as possible.
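To make the two hyperparameters concrete, here is a toy illustration of a single gradient-descent step. This is not the deeplearn.js API, just a sketch of what the learning rate and batch size control: we average the per-example gradients over a batch, then move the weight against that average, scaled by the learning rate.

```typescript
// One gradient-descent step on a single scalar weight.
function sgdStep(
  weight: number,
  gradients: number[], // one gradient per example in the batch
  learningRate: number
): number {
  const avgGradient =
    gradients.reduce((sum, g) => sum + g, 0) / gradients.length;
  // Step against the gradient, scaled by the learning rate.
  return weight - learningRate * avgGradient;
}
```

A larger batch gives a smoother (but more expensive) gradient estimate; a larger learning rate takes bigger, riskier steps.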

Once we are happy with the test accuracy, we can save the model weights and use them to infer results.

Saving and loading model weights

A model is defined by its architecture and the weights of its weight-bearing nodes. Weights are the values that are learned during the process of model training, and not all nodes have weights. ReLUs and flatten nodes don’t. But convolution and fully connected nodes have both weights and biases. These weights are tensors of arbitrary shapes. To save and load models, we need to be able to save both graphs and their weights.

Saving & loading models is important for a few reasons:

  1. Model training takes time, so you might want to train a bit, save weights, take a break, and then resume from where you left off. This is called checkpointing.
  2. For inference, it’s useful to have a self-contained model that you can just load and run.

At the time of writing, deeplearn.js didn’t have facilities to serialize models and model weights. For this example, I’ve implemented a way to load and save weights, assuming that the model architecture itself is hard-coded. The GraphSaverLoader class can save & load from a local store (IndexedDB), or from a file. Ultimately, we will need a non-hacky way of saving and loading models and their corresponding weights, and I’m excited for the near future of improved ML developer ergonomics.
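The idea behind the save format can be sketched as a simple JSON round-trip: dump each named weight tensor’s shape and values, and restore them in the same structure. This mirrors the spirit of GraphSaverLoader, not its actual interface; the types and function names below are my own.

```typescript
// Serialize and deserialize a map of named weight tensors as JSON.
interface SavedWeights {
  [name: string]: { shape: number[]; values: number[] };
}

type TensorMap = Map<string, { shape: number[]; values: Float32Array }>;

function saveWeights(tensors: TensorMap): string {
  const out: SavedWeights = {};
  tensors.forEach((t, name) => {
    // Float32Array is not JSON-serializable directly, so copy to a plain array.
    out[name] = { shape: t.shape, values: Array.from(t.values) };
  });
  return JSON.stringify(out);
}

function loadWeights(json: string): TensorMap {
  const parsed: SavedWeights = JSON.parse(json);
  const tensors: TensorMap = new Map();
  for (const name of Object.keys(parsed)) {
    tensors.set(name, {
      shape: parsed[name].shape,
      values: new Float32Array(parsed[name].values),
    });
  }
  return tensors;
}
```

Because the architecture is hard-coded, the loader only needs to match weights to nodes by name and shape, which keeps the format trivially simple.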

Wrapping up

Many thanks to Nikhil and Daniel for their hard work on deeplearn.js, and willingness to answer my barrages of stupid little questions. Also, to Pete, who is responsible for creating and releasing the dataset I used in this post. And thank you dear reader, for reading this far.

I’m stoked to see how this kind of browser-based audio recognition tech can be applied to exciting, educational ML projects like Teachable Machine. How cool would it be if you could make a self-improving system, which trains on every additional spoken utterance? The ability to train these kinds of models in the browser allows us to entertain such possibilities in a privacy-preserving way, without sending anything to any server.

So there you have it! This has been an explanation of voice command recognition on the web. We covered feature extraction in the previous post, and this time, dug a little bit into model training and real-time inference entirely in the browser.

If you build on this example, please drop me a note on twitter.

This is a post by Boris Smus, originally from Boris’ website, posted to XRDS with permission of the author.

Audio features for web-based ML

One of the first problems presented to students of deep learning is to classify handwritten digits in the MNIST dataset. This was recently ported to the web thanks to deeplearn.js. The web version has distinct educational advantages over the relatively dry TensorFlow tutorial. You can immediately get a feeling for the model, and start building intuition for what works and what doesn’t. Let’s preserve this interactivity, but change domains to audio. This post sets the scene for the auditory equivalent of MNIST. Rather than recognize handwritten digits, we will focus on recognizing spoken commands. We’ll do this by converting sounds like this:

Into images like this, called log-mel spectrograms, and in the next post, feed these images into the same types of models that do handwriting recognition so well:


The audio feature extraction technique I discuss here is generic enough to work for all sorts of audio, not just human speech. The rest of the post explains how. If you don’t care and just want to see the code, or play with some live demos, be my guest!


Neural networks are having quite a resurgence, and for good reason. Computers are beating humans at many challenging tasks, from identifying faces and images, to playing Go. The basic principles of neural nets are relatively simple, but the details can get quite complex. Luckily, non-AI experts can get a feeling for what can be done because a lot of the output is quite engaging. Unfortunately, these demos are mostly visual in nature: either examples of computer vision, or systems that generate images or video as their main output. And few of these examples are interactive.

Pre-processing audio sounds hard, do we have to?

Raw audio is a pressure wave sampled tens of thousands of times per second and stored as an array of numbers. It’s quite a bit of data, but there are neural networks that can ingest it directly. WaveNet does speech-to-text and text-to-speech using raw audio sequences, without any explicit feature extraction. Unfortunately, it’s slow: running speech recognition on a 2 s example took 30 s on my laptop. Doing this in real-time, in a web browser, isn’t quite ready yet.

Convolutional Neural Networks (CNNs) are a big reason why there has been so much interesting work done in computer vision recently. These networks are designed to work on matrices representing 2D images, so a natural idea is to take our raw audio and generate an image from it. Generating these images from audio is sometimes called a frontend in speech recognition papers. Just to hammer the point home, here’s a diagram explaining why we need to do this step:


The standard way of generating images from audio is by looking at the audio chunk-by-chunk, and analyzing it in the frequency domain, and then applying various techniques to massage that data into a form that is well suited to machine learning. This is a common technique in sound and speech processing, and there are great implementations in Python. TensorFlow even has a custom op for extracting spectrograms from audio.

On the web, these tools are lacking. The Web Audio API can almost do this, using the AnalyserNode, as I’ve shown in the past, but there is an important limitation in the context of data processing: AnalyserNode (née RealtimeAnalyser) is only for real-time analysis. You can set up an OfflineAudioContext and run your audio through the analyser, but you will get unreliable results.

The alternative is to do this without the Web Audio API, and there are many signal processing JavaScript libraries that might help. None of them are quite adequate, for reasons of incompleteness or abandonment. But here’s an illustrated take on extracting Mel features from raw audio.

Audio feature extraction

I found an audio feature extraction tutorial, which I followed closely when implementing this feature extractor in TypeScript. What follows can be a useful companion to that tutorial.

Let’s begin with an audio example (a man saying the word “left”):


Here’s that raw waveform plotted as pressure as a function of time:


We could take the FFT over the whole signal, but it changes a lot over time. In our example above, the “left” utterance only takes about 200 ms, and most of the signal is silence. Instead, we break up the raw audio signal into overlapping buffers, spaced a hop length apart. Having our buffers overlap ensures that we don’t miss out on any interesting details happening at the buffer boundaries. There is an art to picking the right buffer and hop lengths:

  • Pick too small a buffer, and you end up with an overly detailed image, and risk your neural net training on some irrelevant minutia, missing the forest for the trees.
  • Pick too large a buffer, and you end up with an image too coarse to be useful.

In the illustration below, you can see five full buffers that overlap one another by 50%. For illustration purposes only, the buffer and hop durations are large (400 ms and 200 ms, respectively). In practice, we tend to use much shorter buffers (e.g. 20-40 ms), and often even shorter hop lengths to capture minute changes in the audio signal.
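The framing step described above can be sketched in a few lines; this is an illustrative helper, not the actual extractor code. Incomplete buffers at the end of the signal are dropped:

```typescript
// Split a signal into overlapping buffers spaced a hop length apart.
function frameSignal(
  signal: Float32Array,
  bufferLength: number, // samples per buffer
  hopLength: number // samples between successive buffer starts
): Float32Array[] {
  const frames: Float32Array[] = [];
  for (
    let start = 0;
    start + bufferLength <= signal.length;
    start += hopLength
  ) {
    // subarray() is a view into the signal, so no samples are copied.
    frames.push(signal.subarray(start, start + bufferLength));
  }
  return frames;
}
```

With a hop length of half the buffer length, consecutive frames overlap by 50%, matching the illustration.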


Then, we consider each buffer in the frequency domain. We can do this using a Fast Fourier Transform (FFT) algorithm. This algorithm gives us complex values from which we can extract magnitudes or energies. For example, here are the FFT energies of one of the buffers, approximately the second one in the above image, where the speaker begins saying the “le” syllable of “left”:
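Extracting those energies from the FFT output is a small amount of arithmetic. Assuming the FFT library returns interleaved complex values `[re0, im0, re1, im1, …]` (a common convention, e.g. in KissFFT bindings, though the exact layout depends on the library):

```typescript
// Compute per-bin energies (squared magnitudes) from interleaved
// complex FFT output.
function fftEnergies(complex: Float32Array): Float32Array {
  const energies = new Float32Array(complex.length / 2);
  for (let i = 0; i < energies.length; i++) {
    const re = complex[2 * i];
    const im = complex[2 * i + 1];
    // |re + im*j|^2 = re^2 + im^2
    energies[i] = re * re + im * im;
  }
  return energies;
}
```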


Now imagine we do this for every buffer we generated in the previous step, take each FFT array and, instead of showing energy as a function of frequency, stack the arrays vertically so that the y-axis represents frequency and color represents energy. We end up with a spectrogram:


We could feed this image into our neural network, but you’ll agree that it looks pretty sparse. We have wasted so much space, and there’s not much signal there for a neural network to train on.

Let’s jump back to the FFT plot and zoom our image into our area of interest. The frequencies in this plot are bunched up below 5 kHz, since the speaker isn’t producing particularly high-frequency sound. Human audition tends to be logarithmic, so we can view the same range on a log-plot:


Let’s generate new spectrograms as we did in an earlier step, but rather than using a linear plot of energies, use a log-plot of FFT energies:


Looks a bit better, but there is room for improvement. Humans are much better at discerning small changes in pitch at low frequencies than at high frequencies. The Mel scale relates pitch of a pure tone to its actual measured frequency. To go from frequencies to Mels, we create a triangular filter bank:
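The frequency-to-Mel conversion underlying the filter bank has a standard closed form. These are the common HTK-style formulas (librosa uses them in its `htk` mode); other Mel variants exist, so treat this as one reasonable choice rather than the only one:

```typescript
// Convert a frequency in Hz to the Mel scale (HTK formula).
function hzToMel(hz: number): number {
  return 2595 * Math.log10(1 + hz / 700);
}

// The inverse mapping, from Mels back to Hz.
function melToHz(mel: number): number {
  return 700 * (Math.pow(10, mel / 2595) - 1);
}
```

To place the triangular filters, one typically spaces their center points evenly in Mel space between a low and high frequency, then maps those centers back to Hz with `melToHz`.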


Each colorful triangle above is a window that we can apply to the frequency representation of the sound. Applying each window to the FFT energies we generated earlier will give us the Mel spectrum, in this case an array of 20 values:
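Applying the filter bank is just a dot product of each triangular window with the FFT energies, followed by a log to get the log-mel values. In this sketch, `filterbank` is assumed to hold one weight array per Mel band, each the same length as the energy array; the small epsilon is a common trick to avoid taking log of zero:

```typescript
// Apply a triangular Mel filter bank to FFT energies and take the log.
function applyFilterbank(
  energies: Float32Array,
  filterbank: Float32Array[] // one triangular window per Mel band
): Float32Array {
  const logMel = new Float32Array(filterbank.length);
  for (let band = 0; band < filterbank.length; band++) {
    let sum = 0;
    for (let i = 0; i < energies.length; i++) {
      sum += filterbank[band][i] * energies[i];
    }
    // Small offset avoids log(0) for empty bands.
    logMel[band] = Math.log(sum + 1e-6);
  }
  return logMel;
}
```

With 20 triangles, one buffer’s worth of FFT energies collapses into a 20-value array, one column of the final log-mel spectrogram.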


Plotting this as a spectrogram, we get our feature, the log-mel spectrogram:


The one-second images above are generated using audio feature extraction software written in TypeScript, which I’ve released publicly. Here’s a demo that lets you run the feature extractor on your own audio, and the code is on GitHub.

Handling real-time audio input

By default the feature extractor frontend takes a fixed buffer of audio as input. But to make an interactive audio demo, we need to process a continuous stream of audio data. So we will need to generate new images as new audio comes in. Luckily we don’t need to recompute the whole log-mel spectrogram every time, just the new parts of the image. We can then add the new parts of spectrogram on the right, and remove the old parts, resulting in a movie that feeds from the right to the left. The StreamingFeatureExtractor class implements this important optimization.
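The core of that optimization is a fixed-width sliding window over spectrogram columns. The class below is a hedged illustration of the idea, not the actual StreamingFeatureExtractor interface:

```typescript
// Maintain a fixed-width spectrogram: as each new column of features
// arrives, drop the oldest column and append the new one on the right.
class SpectrogramBuffer {
  private columns: Float32Array[] = [];

  constructor(private width: number) {}

  addColumn(column: Float32Array): void {
    this.columns.push(column);
    if (this.columns.length > this.width) {
      // Discard the oldest column to keep the image a constant width.
      this.columns.shift();
    }
  }

  get(): Float32Array[] {
    return this.columns;
  }
}
```

Only the newly arrived audio is ever run through the feature extractor; all older columns are reused as-is.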

But there is one caveat: it currently relies on ScriptProcessorNode, which is notorious for dropping samples. I’ve tried to mitigate this as much as possible by using a large input buffer size, but the real solution will be to use AudioWorklets when they are available.

Wrapping up

An implementation note: here is a comparison of JS FFT libraries which suggests the Emscripten-compiled KissFFT is the fastest (but still 2-5x slower than native), and the one I used.

Here is a sanity check comparing the output of my web-based feature extractor to that of other libraries, most notably librosa and the feature extractor from AudioSet:


The images resulting from the three implementations are similar, which is a good sanity check, but they are not identical. I haven’t found the time yet, but it would be very worthwhile to have a consistent cross platform audio feature extractor, so that models trained in Python/C++ could run directly on the web, and vice versa.

I should also mention that although log-mel features are commonly used by serious audio researchers, this is an active area of research. Another audio feature extraction technique called Per-Channel Energy Normalization (PCEN) appears to perform better at least in some cases, like processing far field audio. I haven’t had time to delve into the details yet, but understanding it and porting it to the web also seems like a worthy task.

Major thanks to Dick Lyon for pointing out a few bugs in my feature extraction code. Pick up his “Human and Machine Hearing” if you’re ready to delve deeper into sound understanding.

Ok, so to recap: we’ve generated log-mel spectrogram images from streaming audio, ready to feed into a neural network. Oh yeah, the actual machine learning part? That’s coming up in next week’s post.

This is a post by Boris Smus, originally from Boris’ website, posted to XRDS with permission of the author.

How to publish about your research results for academic and non-academic audiences

As a graduate student, one of our goals is to produce research that will be useful to the world, that will be known and used by other people. This usefulness can come in many forms; for example, our work can serve to inspire future research, which will take the topic one step further, or it can be used by people in industry as part of their work. But for any of this to happen, the methods, results, and takeaways of our research need to be communicated to the world. Of course, most research programs require the student to write a thesis or dissertation, but the reality is that very few people will read it besides the evaluation committee. A thesis or dissertation might eventually also be read by other graduate students who are working on the same topic and want to know the existing literature in detail. But other than that, most people would prefer to read a summarized version of the research instead of the whole thesis or dissertation.

Therefore, graduate researchers should also try to publish their results in other formats, so they become more accessible to the general public. Some graduate programs even include publication requirements as part of the students’ obligations, particularly when there is public funding involved. But even when it is not a requirement, publishing one’s research results is not only one of the best ways to ensure that it can be found and used by other people; it is also a rich experience for the researcher, from the writing itself, to the publication process, to the resulting networking with other people who read and mention your work.

There are many different ways, formats, and venues that can be used to publish original research. In general, we can split them into academic publications – whose primary audience is mainly formed by other researchers – and non-academic – which are more directed to the industry and the general public.

Continue reading

A world full of emojis

In 2010, a new trend emerged in electronic messages and web pages: emojis. There is an interesting journey behind these cute little images, and it is definitely worth understanding how and why they were initially created.

Emojis (less commonly known as pictographs) are images encoded as text, and exist in various genres: facial expressions 😜, common objects 📱, food 🍕, places ⛰️, activities ⛷, animals 🐵 and most of what you can think of 👾. The word comes from the Japanese (e ≅ picture) + (moji ≅ written character). 2,823 emojis exist in total (as of today), and it is estimated that about 6 billion emojis are sent every single day.

Emojis as a means to express yourself

Let’s have a quick overview of the importance of emojis and why we should blog about them. Emojis were created as a new means of communication, allowing people to express their emotions and feelings. Since written text can be vague, messy, and imprecise, and people might lose its sub-meaning, emojis can help people express themselves by giving meaning and emotion to the text (as they say, a picture is worth a thousand words). Emojis can go as far as giving people an identity on the web. For example:
– The hijab emoji, created by Rayouf Alhumedhi, the teen who wanted an emoji of herself, helped members of the Muslim community be represented in the digital world.
– The sauna emoji, proposed by the Ministry for Foreign Affairs, promotes Finland’s cultural heritage, making Finland the first nation in the world to have a national emoji.

The history of emojis

The first set of 176 emojis was created by Shigetaka Kurita in 1999 for the Japanese cell phone carrier NTT DoCoMo. It is important to note that the first emojis were sent before picture messaging was available; each was a basic 12×12 low-resolution pixel grid. Kurita created the first emojis based on observed human expressions and other objects in the city, to facilitate electronic communication and to create a distinguishing feature for NTT. Each symbol was specified as a unique 2-byte sequence, which corresponded to the unique code of the emoji. However, since a standard did not exist between the Japanese carriers, the same codes were interpreted as different characters on different phones. This created some problems: an emoji sent from one carrier could be rendered as a completely different character on another. When Google decided to launch Gmail in Japan, it noticed that emojis were already very popular, and hence wanted to include them in its emails. Apple also joined Google’s quest; together they approached the Unicode Consortium and requested the regulation of emojis universally.

Creating a new emoji

Anyone can submit a proposal for an emoji: you, me, literally anyone. Proposals for emojis originate from two sources: 1. the committee itself, which realizes there is a need for a specific emoji, and 2. the public, who would like to use a specific emoji. The public can suggest an emoji through a written report addressed to the Unicode Consortium, as explained here.

First, a draft of the proposal is presented to the emoji sub-committee, which gives the applicants feedback for improvement. If the sub-committee agrees on the proposal, it forwards it to the technical committee of the Unicode Consortium for the final decision.

The Unicode Consortium then holds a meeting with the voting members, who vote on the proposals’ acceptance. The committee that decides on the universal lexicon has 12 full voting members that are interested in text-processing standards. Nine of them are American multinational tech companies: Oracle, IBM, Microsoft, Adobe, Apple, Google, Facebook, Symantec, and Yahoo; the other three are the German software company SAP, the Chinese telecom company Huawei, and the government of Oman.

If the proposal is approved, a representative glyph (a black-and-white text representation) is released to the vendors, who turn the glyph into a colorful emoji representation on their devices according to their own design guidelines, such as colors, simplicity, and dimensionality.

In order for the proposal to be approved, the committee has to be convinced that there is a need for this specific emoji. Thus, the proposal should contain research that demonstrates the need for the emoji, such as the number of hashtags on Instagram or the popularity of similar icons. The proposed icon should also be distinctive from other similar emojis. While it should not be too specific, it should also not be too broad or vague to be applicable. Lastly, emojis cannot be associated with logos or any specific brand. In short, each emoji should be distinctive, aesthetically pleasing, and globally representative to be approved.

Fun facts about emojis

  • The Emoji Movie 🍿 was released in 2017, which, despite its low rating, emphasizes the cultural importance of emojis nowadays.
  • World Emoji Day was created by Emojipedia founder Jeremy Burge in 2014 and was set to the 17th of July 📅 to celebrate emojis, because everyone loves emojis.
  • Oxford Dictionaries announced that the 2015 Word of the Year was not a word at all, but an emoji — more specifically the “face with tears of joy” emoji 😂, which is one of the most frequently used emojis of all time.
  • People have been really creative and used emojis not only for texting and social media, but also to represent song lyrics 🎶, press releases 📰, movie subtitles 🎬 and even to write the emoji version of the Bible 📖 (yes, that’s exactly what I meant, you can now order the “Bible Emoji” online). (check video)

Last summer, when I wanted to post a picture on Instagram of me snorkeling, I realized that no existing emoji relates to snorkeling or scuba diving. That’s how I became interested in how emojis are created and by whom. I got really excited when I read on the Unicode Consortium’s website that anyone can submit an emoji, so I thought it would be interesting to go through the experience of creating an actual one. Below I have attached the designs of my scuba diving icons, and I am currently working on the proposal for the consortium. I hope that my emojis will be approved by the committee, and if they are not, I am happy that I tried graphic design for the first time and went through the experience. In case my scuba diving emoji proposal gets approved, it will be available for everyone by 2019. Until then, there is still a long way to go …

Scuba-diving Girl Emoji

Scuba-diving Boy Emoji


Thanks for reading my first blog guys, hope you enjoyed it and let me know if you decide to create your own emoji. I would love to hear your thoughts and feedback!

PHP for Backend Web Applications

Web Applications

It is reasonable to consider any website whose functionality is entirely carried out by the client machine to be a webpage. Alternatively, any website that requires communication with the server after requesting a new page to display could be considered a web application. PHP is one programming language that can be used on a web server to support web application functionality.
Continue reading