How AI (Deep Learning) learned to distinguish between me saying “Hello World” and my mom saying “Hello World”.
There is a feature in Google Home where you can train Google Assistant on your voice, and it will trigger only if it recognises that the instruction is coming from you. This is super cool but sometimes a little tricky, because if you are not around, Google Assistant won’t respond to anyone else. LOL.
This post is my attempt to solve the above mystery and see how deep learning can help us here.
Problem Statement: In this post, I have ten audio clips of me saying “hello world” and ten audio clips of my mom saying “hello world”. This gives us two human voices saying the same words, and our task is to distinguish between them.
Before diving into the problem, here is some basic audio engineering 101.
Audio Engineering 101
How can one visualise audio?
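The original post showed this as an image; here is a minimal sketch of how such an amplitude-vs-time plot could be produced with librosa, assuming a hypothetical clip named sumit-1.mp3:

```python
# A minimal sketch: load one clip and plot amplitude vs. time.
# "sumit-1.mp3" is a hypothetical filename, not from the original post.
import librosa
import librosa.display
import matplotlib.pyplot as plt

signal, sr = librosa.load("sumit-1.mp3")      # signal: amplitude samples, sr: sample rate

plt.figure(figsize=(10, 3))
librosa.display.waveshow(signal, sr=sr)       # waveplot() in older librosa versions
plt.xlabel("Time (s)")
plt.ylabel("Amplitude")
plt.show()
```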
Question: Can we give this image to a neural network? We can, but it is a continuous function and a neural network needs fixed-size vectors. Let’s see what we can do.
The left image is our starting point, representing amplitude vs. time. The right image shows amplitude vs. frequency. A careful reader will sense that the right-side image is not continuous but discrete (which is favourable for neural networks), and that transformation is achieved using the Fast Fourier Transform. But what is the Fast Fourier Transform, dude? To keep it simple and short, the FFT is a function which transforms input from one domain to another; in our example it transformed the signal from the time domain to the frequency domain. The reason we use it here is that it can represent a continuous function as a discrete output.
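For the curious, a minimal sketch of that time-to-frequency transformation using NumPy’s FFT (the clip name is hypothetical):

```python
# A sketch of moving from the time domain to the frequency domain with an FFT.
import numpy as np
import librosa

signal, sr = librosa.load("sumit-1.mp3")           # hypothetical clip

spectrum = np.fft.rfft(signal)                     # FFT of the real-valued signal
magnitude = np.abs(spectrum)                       # amplitude per frequency bin
frequency = np.fft.rfftfreq(len(signal), d=1/sr)   # frequency axis in Hz

# `magnitude` vs `frequency` is the discrete right-hand-side picture described above.
```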
All good? Let’s move forward. Can we have a better image representation? Yes.
The above image is formed using Mel-frequency cepstral coefficients (MFCC). But wait, what is this MFCC, dude? To keep it simple and short, it is a sort of spectrogram. But what is a spectrogram?
A spectrogram is a visual representation of the spectrum of frequencies of a signal as it varies with time.
Got it? Awesome! We will use these MFCC-based images as input to our neural network. For simplicity, I haven’t explained a lot of the underlying electronics engineering concepts, but if you want to learn more about them, drop a comment below. I will explain 😁.
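If you want to try it yourself, here is a minimal sketch of extracting MFCCs with librosa; the clip name and n_mfcc=40 are assumptions, not values from the original post:

```python
# A sketch of computing MFCCs for one clip.
import librosa

signal, sr = librosa.load("sumit-1.mp3")                 # hypothetical clip
mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=40)  # shape: (40, number_of_frames)
print(mfcc.shape)
```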
Deep Learning
Import the required packages. Librosa is a library that has functions for the Fast Fourier Transform, MFCC extraction, etc.
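The original post showed the imports as an image; roughly, the import cell could look like the sketch below, assuming a TensorFlow/Keras stack (an assumption on my part):

```python
import numpy as np
import librosa

from sklearn.model_selection import train_test_split
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Conv2D, MaxPooling2D, BatchNormalization, Dropout, Flatten, Dense,
)
from tensorflow.keras.callbacks import EarlyStopping
```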
Let’s see our training data: 10 clips of me saying “hello world” and 10 clips of my mom saying “hello world”. A careful reader might ask: Sumit, don’t you think that’s too little? I know it is. But there are three reasons why I am fine with it: 1) When Google Assistant is trained on your voice for the first time, it makes you say “ok google” only 3 times. 2) We can apply multiple transforms and get different styles of data from the same samples, further increasing the size of the training data. 3) If I ask my mom to say “hello world” a thousand times, she will beat the hell out of me.
One major challenge with audio data is that clips are not of a fixed length, so neural networks won’t accept them as-is. Hence we went with MFCC, and we pad or truncate each clip’s coefficients to a fixed length, as the image below shows.
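Here is a minimal sketch of that padding/truncation step; MAX_FRAMES = 100 and n_mfcc = 40 are assumed values, not from the original post:

```python
# A sketch: force every clip's MFCC matrix to a fixed width by
# zero-padding short clips and truncating long ones.
import numpy as np
import librosa

MAX_FRAMES = 100  # assumed value

def fixed_length_mfcc(path, n_mfcc=40, max_frames=MAX_FRAMES):
    signal, sr = librosa.load(path)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=n_mfcc)
    if mfcc.shape[1] < max_frames:                    # pad short clips with zeros
        pad = max_frames - mfcc.shape[1]
        mfcc = np.pad(mfcc, ((0, 0), (0, pad)))
    else:                                             # truncate long clips
        mfcc = mfcc[:, :max_frames]
    return mfcc                                       # always (n_mfcc, max_frames)
```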
Label 0 is my voice, label 1 is my mom’s voice.
Prepare our X and Y.
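A sketch of that preparation, reusing the fixed_length_mfcc helper from above; the training filenames (sumit-1.mp3 … mom-10.mp3) are hypothetical:

```python
# A sketch of building X (MFCC matrices) and y (labels).
# Label 0 = my voice, label 1 = my mom's voice.
import numpy as np

X, y = [], []
for i in range(1, 11):
    X.append(fixed_length_mfcc(f"sumit-{i}.mp3"))   # my clips (hypothetical names)
    y.append(0)
    X.append(fixed_length_mfcc(f"mom-{i}.mp3"))     # my mom's clips (hypothetical names)
    y.append(1)

X = np.array(X)   # shape: (20, 40, 100)
y = np.array(y)   # shape: (20,)
```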
Split X into training and testing data, and reshape X into image dimensions so that we can feed it to a Convolutional Neural Network (CNN).
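A minimal sketch of that step; test_size=0.2 and random_state=42 are assumptions:

```python
from sklearn.model_selection import train_test_split

# Add a channel axis so each sample looks like a single-channel image.
X = X.reshape(X.shape[0], X.shape[1], X.shape[2], 1)   # (20, 40, 100, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42,
)
```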
Building the Model.
You can see I have used a Batch Normalisation layer to speed up training and a Dropout layer to avoid overfitting. I also used early stopping to keep the model from overfitting. You can see the convolutional layer as well. The image below defines the heart of our deep learning model.
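A small CNN consistent with that description might look like the sketch below; the exact filter counts, dropout rates, and layer order are assumptions, since the original architecture was shown only as an image:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import (
    Conv2D, MaxPooling2D, BatchNormalization, Dropout, Flatten, Dense,
)

model = Sequential([
    Conv2D(32, (3, 3), activation="relu", input_shape=(40, 100, 1)),
    BatchNormalization(),            # speeds up / stabilises training
    MaxPooling2D((2, 2)),
    Dropout(0.3),                    # fights overfitting
    Flatten(),
    Dense(64, activation="relu"),
    Dropout(0.3),
    Dense(2, activation="softmax"),  # two classes: me (0) and my mom (1)
])

model.compile(
    optimizer="adam",
    loss="sparse_categorical_crossentropy",  # y holds integer labels 0/1
    metrics=["accuracy"],
)
```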
Model Training.
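A sketch of the training call, consistent with the 1000 epochs and early stopping described below; the patience value is an assumption:

```python
from tensorflow.keras.callbacks import EarlyStopping

early_stop = EarlyStopping(
    monitor="val_loss", patience=5, restore_best_weights=True,
)

history = model.fit(
    X_train, y_train,
    validation_data=(X_test, y_test),
    epochs=1000,                 # early stopping will halt training much sooner
    callbacks=[early_stop],
)
```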
accuracy is the metric for the training data and val_accuracy for the testing data, so val_accuracy < accuracy means the model understood the training data well and is still working on understanding the testing data. A careful reader can clearly see the epochs are set to 1000, but the model stops after the 12th run because running beyond this could lead to overfitting (the early stopping mechanism got triggered here). The accuracy achieved is 75%, which is fine with respect to the amount of data we collected.
Let’s see how the model predicts on completely new data. In the image below, you can see that for test-sumit.mp3 (me saying hello world) it gave 0 as output, and for Surekha-1.mp3 (my mom saying hello world) it gave 1 as output. And if we recollect, 0 was for my voice and 1 was for my mom’s voice. Yay! 🎉 The model has started to recognise us. 😊
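For reference, a sketch of that prediction step, reusing the helper and model from above (the file names come from the post):

```python
import numpy as np

# Run the two brand-new clips through the same MFCC pipeline and the trained model.
for path in ["test-sumit.mp3", "Surekha-1.mp3"]:
    mfcc = fixed_length_mfcc(path).reshape(1, 40, 100, 1)
    label = np.argmax(model.predict(mfcc))
    print(path, "->", label)   # 0 = me, 1 = my mom
```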
Just because I labelled the test file sumit-test and it predicted 0, should we trust that it is working? NO. Hahaha. Let me explain.
One of the major challenges in the AI field is explaining how a model reached its conclusion. Normally, people treat machine learning models as black boxes, so let’s see what our model saw in the new test samples we gave it. The image below shows the MFCC-powered images for me saying hello world (top) and my mom saying hello world (bottom). One can clearly see different pixel patterns in the two images, and that is what the model picked out. I hope this explains how the model predicted.
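A sketch of drawing those two MFCC images for comparison, using librosa’s specshow and the helper from earlier:

```python
import matplotlib.pyplot as plt
import librosa.display

# Plot the two test clips' MFCC matrices one above the other.
fig, axes = plt.subplots(2, 1, figsize=(8, 6))
for ax, path in zip(axes, ["test-sumit.mp3", "Surekha-1.mp3"]):
    librosa.display.specshow(fixed_length_mfcc(path), x_axis="time", ax=ax)
    ax.set_title(path)
plt.tight_layout()
plt.show()
```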
I hope you enjoyed reading this article. One can improve this model with more data, and I think the more we speak with Google Assistant, the more it learns, even though it takes only 3 samples while training in the beginning.
One interesting idea you can extend this to is automating the timestamping of podcasts. Let me explain: a podcast is an audio file of 40 or N minutes with two or more speakers. Since everyone is busy and low on time, not everyone wants to listen to the entire podcast. The author can stamp the podcast, saying 3:00 for cooking tips and 5:00 for travel blogging tips, so the end user can pick those timestamps and listen only to the parts he/she is interested in. But for that stamping, the author has to listen to the audio again and mark those timestamps. Deep learning can help here: if the author says “moving to next topic” whenever he moves to a new topic, a model can recognise him/her saying “moving to next topic” and give the timestamp of it. This way the author just has to go through those timestamps and take notes instead of running through the entire audio file. (If you think the idea is legit and want to work on it, do reach out to me ✌️)
~ Happy Learning.