
Wakeword Detection w/ deepSea

Credit: AITS Cainvas Community

Photo by Admin on Crowd4Test

Prerequisite: Dataset

Dataset Format

The dataset must be uploaded in advance.

The dataset folder, titled "WakeWordDataset" (stored in the top_dir variable), has two subdirectories:

  1. WakeWordDataset/hotword/
  2. WakeWordDataset/background/

Each subdirectory contains .wav files: "background" holds speech that is not the wake word, while "hotword" holds recordings of the wake word spoken at 2-second intervals. (A quick layout check is sketched after the note below.)

Note

  • The quality of the resulting model depends on the quality of the dataset. With a small dataset, more false positives are expected (the model predicts background noise as the wake word).
  • The dataset can be improved further by having a variety of speakers record in different environments with different levels of background noise.
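
Once the archive has been unzipped (see the next cell), a minimal sketch like the following can confirm the expected layout; top_dir matches the variable used later in the notebook:

import os

top_dir = "WakeWordDataset"
for sub in ("hotword", "background"):
    sub_dir = os.path.join(top_dir, sub)
    wavs = [f for f in os.listdir(sub_dir) if f.endswith(".wav")]
    print(sub_dir, ":", len(wavs), ".wav file(s)")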
In [1]:
!wget -N "https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/WakeWordDataset.zip"
!unzip -o WakeWordDataset.zip
!rm WakeWordDataset.zip
--2020-09-10 11:04:20--  https://cainvas-static.s3.amazonaws.com/media/user_data/cainvas-admin/WakeWordDataset.zip
Resolving cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)... 52.219.62.72
Connecting to cainvas-static.s3.amazonaws.com (cainvas-static.s3.amazonaws.com)|52.219.62.72|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 41730667 (40M) [application/zip]
Saving to: ‘WakeWordDataset.zip’

WakeWordDataset.zip 100%[===================>]  39.80M  94.7MB/s    in 0.4s    

2020-09-10 11:04:20 (94.7 MB/s) - ‘WakeWordDataset.zip’ saved [41730667/41730667]

Archive:  WakeWordDataset.zip
   creating: WakeWordDataset/background/
  inflating: WakeWordDataset/background/D_bg.wav  
  inflating: WakeWordDataset/background/M_bg.wav  
  inflating: WakeWordDataset/background/R_bg.wav  
   creating: WakeWordDataset/hotword/
  inflating: WakeWordDataset/hotword/D_word.wav  
  inflating: WakeWordDataset/hotword/M_word.wav  
  inflating: WakeWordDataset/hotword/R_word.wav  
  inflating: WakeWordDataset/TestLongClip.wav  
  inflating: WakeWordDataset/TestLongClip3.wav  
In [2]:
import numpy as np
import tensorflow as tf
import os
from librosa.core import load as librosa_load
import IPython.display as ipd
from matplotlib import pyplot as plt
In [3]:
max_length = 2 #length (in seconds) of input
desired_sr = 8000 #sampling rate to use
desired_samples = max_length*desired_sr #total number of samples in input
In [4]:
#Processing the data

#Function to normalize input values
def normalize_sample(input_val):
  diff = np.max(input_val) - np.min(input_val)
  if (diff != 0):
    input_val /= diff
  return input_val

#Dataset storing audio samples for wake word and background
cainvas_dataset = np.empty((0, desired_samples))
noncainvas_dataset = np.empty((0, desired_samples))

top_dir = "WakeWordDataset"

background_dir = os.path.join(top_dir, "background")
word_dir = os.path.join(top_dir, "hotword")

for ds_dir in ([background_dir, word_dir]) :
    for file in os.listdir(ds_dir):
        file_path = os.path.join(ds_dir, file)

        print("adding ", file, "to audio dataset")
        X, sr = librosa_load(file_path, sr=desired_sr)
        X = normalize_sample(X)
        #pad to a multiple of desired_samples, then split into 2-second windows
        X = np.pad(X, (0, (-X.shape[0]) % desired_samples))
        X_sub = np.array(np.split(X, len(X)//desired_samples))

        if ( ds_dir == background_dir ):
            noncainvas_dataset = np.append(noncainvas_dataset, X_sub, axis=0)
        else:
            cainvas_dataset = np.append(cainvas_dataset, X_sub, axis=0)
adding  R_bg.wav to audio dataset
adding  M_bg.wav to audio dataset
adding  D_bg.wav to audio dataset
adding  D_word.wav to audio dataset
adding  R_word.wav to audio dataset
adding  M_word.wav to audio dataset
In [5]:
#Concatenating dataset into matrix of inputs and labels
total_len = cainvas_dataset.shape[0] + noncainvas_dataset.shape[0]
inputs = np.append(cainvas_dataset, noncainvas_dataset, axis=0)
labels = np.array([1. if i < cainvas_dataset.shape[0] else 0. for i in range(total_len)])
print(total_len)
371
In [6]:
#Adding random noise ("background") and silence as extra negative examples
background = np.random.random((50, desired_samples))
silence = np.zeros((50,desired_samples))

inputs = np.append(inputs, background, axis=0)
inputs = np.append(inputs, silence, axis=0)

labels = np.append(labels, np.zeros(len(background) + len(silence)), axis=0)

total_len = len(labels)
print(total_len)
471
In [7]:
#Shuffling inputs and labels
shuffle_permutation = np.arange(total_len)
np.random.shuffle(shuffle_permutation)

inputs = inputs[shuffle_permutation]
labels = labels[shuffle_permutation]

#Splitting into train and test dataset
train_split = 0.9
cutoff = int(train_split*total_len)

inputs_train = inputs[:cutoff]
inputs_test = inputs[cutoff:]
labels_train = labels[:cutoff]
labels_test = labels[cutoff:]
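
As an optional check (a small sketch, not one of the original cells), the class balance of the split can be printed to make sure both classes appear in the train and test sets:

print("train positives:", int(labels_train.sum()), "of", len(labels_train))
print("test positives:", int(labels_test.sum()), "of", len(labels_test))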
In [8]:
#Selecting random index from test dataset
ind = int(np.random.uniform()*len(inputs_test))

#Displaying sample spectrogram and audio from test dataset
X = inputs_test[ind]
y = labels_test[ind]
print("Label is", "cainvas" if y==1 else "background")

spectrogram_out = tf.abs(tf.signal.stft(X, 200, 100, fft_length=128)).numpy()
spectrogram_out = np.swapaxes(spectrogram_out, 0, 1)

plt.imshow(spectrogram_out, cmap='hot', interpolation='nearest')
plt.show()

ipd.Audio(X, rate=desired_sr)
Label is background
Out[8]:

Building and Training the Model
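
The spectrogram is computed inside the model with tf.signal.stft. For a 2-second clip at 8 kHz (16,000 samples), frame_length=200 and frame_step=25 give 1 + (16000 - 200)/25 = 633 frames, and fft_length=256 gives 256/2 + 1 = 129 frequency bins, which is where the (633, 129) shape used below comes from. A minimal sketch (not one of the original cells) to confirm this:

import numpy as np
import tensorflow as tf

dummy = tf.constant(np.zeros((1, 16000), dtype=np.float32)) #2 s at 8 kHz
spec = tf.abs(tf.signal.stft(dummy, 200, 25, fft_length=256))
print(spec.shape) #expected: (1, 633, 129)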

In [9]:
model = tf.keras.Sequential()

def spectrogramOp(X):
  spectrogram_out = tf.abs(tf.signal.stft(X, 200, 25, fft_length=256))
  return spectrogram_out

lambda1 = tf.keras.layers.Lambda(spectrogramOp, name="lambda_spectrogram")
lambda15 = tf.keras.layers.Lambda(lambda x: tf.transpose(x, perm=(0,2,1)), input_shape=(633, 129), name="switch_hw")
lambda2 = tf.keras.layers.Lambda(lambda x: tf.reshape(x, (-1, 129, 633, 1)), name="add_channels")
conv2d1 = tf.keras.layers.Conv2D(4, (8, 129), strides=2, activation='relu', name="conv1", input_shape=(129, 633, 1))
conv2d2 = tf.keras.layers.Conv2D(8, (4, 4), strides=2, activation='relu', name="conv2")
conv2d3 = tf.keras.layers.Conv2D(8, (8, 8), strides=2, activation='relu', name="conv3")
flatten1 = tf.keras.layers.Flatten()
dense1 = tf.keras.layers.Dense(1)
activation1 = tf.keras.layers.Activation('sigmoid')

model.add(lambda1)
model.add(lambda15)
model.add(lambda2)
model.add(conv2d1)
model.add(conv2d2)
model.add(conv2d3)
model.add(flatten1)
model.add(dense1)
model.add(activation1)

model.compile(optimizer='adam', loss=tf.keras.losses.binary_crossentropy, metrics=['accuracy'])
model.fit(inputs_train, labels_train, batch_size=32, epochs=10, 
          validation_data=(inputs_test, labels_test))
Epoch 1/10
14/14 [==============================] - 49s 4s/step - loss: 0.4573 - accuracy: 0.7825 - val_loss: 0.5110 - val_accuracy: 0.8750
Epoch 2/10
14/14 [==============================] - 49s 4s/step - loss: 0.2703 - accuracy: 0.9456 - val_loss: 0.5722 - val_accuracy: 0.9167
Epoch 3/10
14/14 [==============================] - 49s 4s/step - loss: 0.2239 - accuracy: 0.9622 - val_loss: 0.3829 - val_accuracy: 0.9167
Epoch 4/10
14/14 [==============================] - 49s 4s/step - loss: 0.1942 - accuracy: 0.9669 - val_loss: 0.2428 - val_accuracy: 0.9583
Epoch 5/10
14/14 [==============================] - 49s 4s/step - loss: 0.1510 - accuracy: 0.9645 - val_loss: 0.2596 - val_accuracy: 0.9375
Epoch 6/10
14/14 [==============================] - 49s 4s/step - loss: 0.1003 - accuracy: 0.9645 - val_loss: 0.2884 - val_accuracy: 0.9375
Epoch 7/10
14/14 [==============================] - 49s 4s/step - loss: 0.0641 - accuracy: 0.9669 - val_loss: 0.5879 - val_accuracy: 0.8125
Epoch 8/10
14/14 [==============================] - 50s 4s/step - loss: 0.0619 - accuracy: 0.9740 - val_loss: 0.1655 - val_accuracy: 0.9167
Epoch 9/10
14/14 [==============================] - 49s 3s/step - loss: 0.0399 - accuracy: 0.9882 - val_loss: 0.2045 - val_accuracy: 0.9583
Epoch 10/10
14/14 [==============================] - 51s 4s/step - loss: 0.0280 - accuracy: 0.9905 - val_loss: 0.2689 - val_accuracy: 0.9583
Out[9]:
<tensorflow.python.keras.callbacks.History at 0x7f08ee300cf8>
In [10]:
#Viewing a summary of the model
model.summary()
Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
lambda_spectrogram (Lambda)  (None, 633, 129)          0         
_________________________________________________________________
switch_hw (Lambda)           (None, 129, 633)          0         
_________________________________________________________________
add_channels (Lambda)        (None, 129, 633, 1)       0         
_________________________________________________________________
conv1 (Conv2D)               (None, 61, 253, 4)        4132      
_________________________________________________________________
conv2 (Conv2D)               (None, 29, 125, 8)        520       
_________________________________________________________________
conv3 (Conv2D)               (None, 11, 59, 8)         4104      
_________________________________________________________________
flatten (Flatten)            (None, 5192)              0         
_________________________________________________________________
dense (Dense)                (None, 1)                 5193      
_________________________________________________________________
activation (Activation)      (None, 1)                 0         
=================================================================
Total params: 13,949
Trainable params: 13,949
Non-trainable params: 0
_________________________________________________________________
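
The shapes and parameter counts above follow from the usual valid-convolution arithmetic; a small sketch (not one of the original cells) for conv1:

#conv1: input (129, 633, 1), kernel (8, 129), stride 2, 4 filters, 'valid' padding
out_h = (129 - 8)//2 + 1   #61
out_w = (633 - 129)//2 + 1 #253
params = (8*129*1)*4 + 4   #4132 (weights + biases)
print(out_h, out_w, params)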
In [11]:
#Evaluating final model's performance on test dataset
score, acc = model.evaluate(inputs_test, labels_test, batch_size=32)
print('Test score:', score)
print('Test accuracy:', acc)
2/2 [==============================] - 0s 231ms/step - loss: 0.2689 - accuracy: 0.9583
Test score: 0.2688850164413452
Test accuracy: 0.9583333134651184

Testing Trained Model

Testing on random sample from test dataset

In [12]:
ind = int(np.random.uniform()*len(inputs_test))
X = inputs_test[ind]

ipd.Audio(X, rate=desired_sr)

y = labels_test[ind]
output = model.predict(np.array([inputs_test[ind]]))[0][0]

print("True label:", y)
print("Prediction:", output)
print("Label is", "cainvas" if y==1 else "background")

spectrogram_out = tf.abs(tf.signal.stft(X, 200, 100, fft_length=128)).numpy()
spectrogram_out = np.swapaxes(spectrogram_out, 0, 1)

plt.imshow(spectrogram_out, cmap='hot', interpolation='nearest')
plt.show()

ipd.Audio(X, rate=desired_sr)
/opt/tljh/user/lib/python3.7/site-packages/IPython/lib/display.py:172: RuntimeWarning: invalid value encountered in true_divide
  scaled = data / normalization_factor * 32767
True label: 0.0
Prediction: 0.02355808
Label is background
Out[12]:

Testing on background and silence

In [13]:
background = np.random.random((desired_samples))
silence = np.zeros((desired_samples))
background_out, silence_out = model.predict(np.array([background, silence]))
print("Background prediction", background_out)
print("Silence prediction", silence_out)
Background prediction [0.]
Silence prediction [0.02355808]

Testing on longer clip not in test dataset

In [14]:
path = "./WakeWordDataset/TestLongClip3.wav"

#loading the sample in
X, sr = librosa_load(path, sr=desired_sr, mono=True)
X = X.astype(np.float64)

assert sr == desired_sr #ensure sample rate is same as desired
assert len(X.shape) == 1 #ensure X is a mono signal

spectrogram_out = tf.abs(tf.signal.stft(X, 200, 100, fft_length=128)).numpy()
spectrogram_out = np.swapaxes(spectrogram_out, 0, 1)
plt.imshow(spectrogram_out, cmap='hot', interpolation='nearest')
plt.show()

win_len = 10000 #window length: 10000 samples = 1.25 s at 8 kHz
stride_len = 1000 #step between windows: 1000 samples = 0.125 s; each window is zero-padded to the model's 2 s input
times = []
predictions = []
for n in range(0, len(X) - win_len, stride_len):
  X_wind = X[n:n + win_len]
  X_wind = np.pad(X_wind, ((0, desired_samples - len(X_wind))))

  test_pred = model.predict(np.array([X_wind]))

  times.append((n + win_len / 2) / float(desired_sr))
  predictions.append(test_pred.flatten()[0])

memory_stride = int(0.2 * float(desired_sr) / stride_len) #shift predictions by .2 seconds
memory_len = int(0.5 * float(desired_sr) / stride_len) #have each memory window at .5 seconds
time_diff = np.diff(times)

activating_times = []

#slide memory window through predictions
for n in range(0, len(predictions)-memory_len, memory_stride):
  prediction_window = predictions[n:n+memory_len]
  window_time = times[n+memory_len-1] - times[n]
  #for the current memory window, find the Riemann (trapezoidal) sum of the predictions
  area = 0.
  for i in range(0, memory_len-1):
    area += time_diff[n+i]*(prediction_window[i] + prediction_window[i+1])/2.
  #mark an activation if the integrated prediction exceeds 30% of the window duration
  if area > window_time*0.30:
    activating_times.append(times[n])

#Visualizing with matplotlib
fig, ax1 = plt.subplots()
white_color = "#fff"
red_color = "#f00"
ax1.set_xlabel("time (s)", color=white_color)
ax1.set_ylabel("wake word detection", color=white_color)

#optionally restrict the plotted time range; with both left at 0 the full clip is plotted
start_time = 0
end_time = 0

start_index = times.index(start_time) if start_time in times else 0
end_index = times.index(end_time) if end_time in times else len(times)

ax1.plot(times[start_index:end_index], 
  predictions[start_index:end_index], color=red_color)

ax1.tick_params(axis='x', labelcolor=white_color)
ax1.tick_params(axis='y', labelcolor=white_color)

for t in activating_times:
    ax1.axvline(t, color="blue", alpha=0.2)

ipd.Audio(path)
Out[14]:

Compile it for MCUs

Extracting submodel

In [15]:
#Extracting subgraph excluding stft operator
inputs = tf.keras.Input(shape=(633, 129))
lambda15_out = lambda15(inputs)
lambda2_out = lambda2(lambda15_out)
conv2d1_out = conv2d1(lambda2_out)
conv2d2_out = conv2d2(conv2d1_out)
conv2d3_out = conv2d3(conv2d2_out)
flatten1_out = flatten1(conv2d3_out)
dense1_out = dense1(flatten1_out)
activation1_out = activation1(dense1_out)

submodel = tf.keras.Model(inputs=inputs, outputs=activation1_out, name="submodel")
submodel.summary()
submodel.save("subModel")
Model: "submodel"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 633, 129)]        0         
_________________________________________________________________
switch_hw (Lambda)           (None, 129, 633)          0         
_________________________________________________________________
add_channels (Lambda)        (None, 129, 633, 1)       0         
_________________________________________________________________
conv1 (Conv2D)               (None, 61, 253, 4)        4132      
_________________________________________________________________
conv2 (Conv2D)               (None, 29, 125, 8)        520       
_________________________________________________________________
conv3 (Conv2D)               (None, 11, 59, 8)         4104      
_________________________________________________________________
flatten (Flatten)            (None, 5192)              0         
_________________________________________________________________
dense (Dense)                (None, 1)                 5193      
_________________________________________________________________
activation (Activation)      (None, 1)                 0         
=================================================================
Total params: 13,949
Trainable params: 13,949
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:From /opt/tljh/user/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py:111: Model.state_updates (from tensorflow.python.keras.engine.training) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
WARNING:tensorflow:From /opt/tljh/user/lib/python3.7/site-packages/tensorflow/python/training/tracking/tracking.py:111: Layer.updates (from tensorflow.python.keras.engine.base_layer) is deprecated and will be removed in a future version.
Instructions for updating:
This property should not be used in TensorFlow 2.0, as updates are applied automatically.
INFO:tensorflow:Assets written to: subModel/assets
In [16]:
#Testing whether submodel works
tf_model = tf.saved_model.load("./subModel")
infer = tf_model.signatures["serving_default"]
input_val = np.full((1, 633, 129), 3e-3)
output = infer(tf.constant(input_val, dtype=tf.float32))
tf_out = output[list(output.keys())[0]].numpy().flatten()[0]
print("tf out", tf_out)
tf out 0.0968889
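
As a sanity check (a sketch, assuming the submodel object from the extraction cell is still in memory), the saved-model output can be compared against the in-memory Keras submodel:

keras_out = submodel.predict(input_val).flatten()[0]
print("keras out", keras_out) #should match the tf out value above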

Compile submodel

In [ ]:
!deepCC --format tensorflow ./subModel
reading [tensorflow model] from './subModel'
Saved 'subModel.onnx'
reading onnx model from file  subModel.onnx
Model info:
  ir_vesion :  4 
  doc       : 
WARN (ONNX): terminal (input/output) input_1_0's shape is less than 1.
             changing it to 1.
WARN (ONNX): terminal (input/output) Identity_0's shape is less than 1.
             changing it to 1.
running DNNC graph sanity check ... passed.
Writing C++ file  subModel_deepC/subModel.cpp
INFO (ONNX): model files are ready in dir subModel_deepC
g++ -std=c++11 -O3 -I. -I/opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/include -isystem /opt/tljh/user/lib/python3.7/site-packages/deepC-0.13-py3.7-linux-x86_64.egg/deepC/packages/eigen-eigen-323c052e1731 subModel_deepC/subModel.cpp -o subModel_deepC/subModel.exe