secretsauceai edited this page Aug 14, 2021 · 10 revisions

Wake Word Project

The wake word project is split into 3 categories: data collection, model optimization and data generation, and engine improvements.

Workflow

The typical workflow to produce a production level wake word model is:

  1. Use the data collection tool
    1. wake word
      1. (wake word) variations
    2. not wake word
      1. background noise
      2. paragraph of text
      3. syllables (individual syllables of the wake word)
      4. random recordings (TV and conversation)
  2. Then use wakeword-data-prep
    1. test-training split finds the optimal split of the data, the script automatically:
      1. random 5 shuffles of test-training split (80/20 for less sparse data, 50/50 for sparse sub-categories)
      2. trains models
      3. selects best model
    2. The script continues with the best selected model and automatically generates sub-directories populated with sound files for every category with:
      1. gaussian noise mixed in (2 levels of gaussian noise means 2 new files for every file)
      2. background noise mixed in (each file has 5 new random background sound files generated, besides the gaussian generated files)
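
As a sanity check on dataset size, the augmentation counts above (2 Gaussian noise levels plus 5 background mixes per file) imply a fixed growth factor. A minimal sketch (the function name is hypothetical, not part of the actual tooling):

```python
def augmented_count(n_originals, gauss_levels=2, background_mixes=5):
    """Total files after augmentation: each original keeps its clean copy
    and gains one file per Gaussian noise level plus one per background mix."""
    return n_originals * (1 + gauss_levels + background_mixes)

# 100 clean recordings become 800 files after data prep
print(augmented_count(100))  # 800
```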

Wake word theory

Typically, a wake word model is a binary acoustic model. Precise uses a GRU-RNN, which is a good choice for this type of data: the input is a time series, and RNNs handle time series well. An LSTM is much more computationally expensive to run than a GRU, and studies have shown (1, 2) that a GRU can perform on par with an LSTM while being computationally cheaper.

The absolute best performance could be achieved with a CRNN; however, this adds a convolutional layer, making it more computationally expensive than a GRU alone. The data requirements of a CNN also tend to be higher. Therefore, the hypothesis is:

due to the constraints of sparsity and computation, a single layered GRU is the most optimal approach.
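
To illustrate why a GRU is cheaper than an LSTM, here is a minimal scalar GRU step in plain Python: three weight sets (update gate, reset gate, candidate state) rather than an LSTM's four. The weights and the input sequence are toy values for illustration, not anything taken from Precise:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gru_step(x, h, w):
    """One GRU step for a scalar input and scalar state.
    w holds 9 weights: (W, U, b) for the update gate z, the reset gate r,
    and the candidate state -- one gate fewer than an LSTM, which is where
    the computational saving comes from."""
    z = sigmoid(w["Wz"] * x + w["Uz"] * h + w["bz"])   # update gate
    r = sigmoid(w["Wr"] * x + w["Ur"] * h + w["br"])   # reset gate
    h_cand = math.tanh(w["Wh"] * x + w["Uh"] * (r * h) + w["bh"])
    return (1.0 - z) * h + z * h_cand                  # blend old and new state

weights = {k: 0.5 for k in ("Wz", "Uz", "bz", "Wr", "Ur", "br", "Wh", "Uh", "bh")}
h = 0.0
for x in (0.1, 0.4, 0.9):   # toy "audio feature" sequence
    h = gru_step(x, h, weights)
print(round(h, 4))
```

A real wake word model applies this step per audio frame over vector-valued features, but the gate structure is the same.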

Collection method

  • discuss partial wake word collection

Generation of data

Because the data is so sparse, additional data is required to keep the data set balanced; in addition, noisy background data must be generated for the wake word to perform in noisy environments. Experimentally, it was found that without the generated data the model had NO tolerance for background noise and could not perform successful wake ups. Although some (left uncited deliberately) have claimed that introducing random Gaussian noise into samples creates totally new data, observation does not bear this out. If a clean file fails in testing, its Gaussian counterpart will also always fail. No matter the noise, the model is using the same feature set to classify both samples; therefore, it cannot learn new features from generated data.

Two noise levels were selected experimentally. It was found that if the noise levels were too high, the model would always learn to identify the noise itself, even though examples of this noise were present with the same balance in the not-wake background noise data for both training and testing.
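
A minimal sketch of the two-level Gaussian augmentation, assuming normalized float samples; the 0.01/0.05 noise levels and the function name are illustrative, not the actual values used by wakeword-data-prep (which operates on WAV files):

```python
import random

def add_gaussian_noise(samples, noise_std, seed=None):
    """Return a copy of `samples` with zero-mean Gaussian noise mixed in.
    `noise_std` is the noise standard deviation relative to full scale."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, noise_std) for s in samples]

clean = [0.0, 0.2, -0.1, 0.5]                    # toy normalized audio samples
low = add_gaussian_noise(clean, 0.01, seed=1)    # level 1: subtle hiss
high = add_gaussian_noise(clean, 0.05, seed=2)   # level 2: noticeably noisy
print(len(low), len(high))
```

Note that, per the observation above, these copies help the model tolerate noise but do not add genuinely new features.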

For the production model that included a full data collection for 2 individuals (male and female), 22 TTS samples were also used. These were added directly into the training set, with no test set. For such a small number of samples with individual voices, it isn't really possible to satisfy a test-train split. It is important to filter out any TTS samples that sound too robotic or mispronounce the wake word. The current hypothesis on recommended minimum TTS requirements (in addition to at least one complete data collection on an individual) is 32 samples.

Modeling method

  • initial training and model selection
  • incremental training

Testing method

  • mics
  • false wake ups
  • production environment

Results

Interestingly, the room acoustics and microphone properties played much less of a role on these models than hypothesized. Only when the models have been trained incrementally on a high number of samples (~25k or more), will the microphone properties start playing a role.

Before optimizing the data (data-prep), the model was often unbalanced due to the incremental training. However, in experiments where the data optimization method was used, performance ran into the limits of the data set itself (more data from Common Voice is required to keep probing the upper limits of balancing the data).

Recommended reading

Neural Network-based Small-Footprint Flexible Keyword Spotting

Wake Word Detection Using Recurrent Neural Networks

Data collection tool

The data collection tool is currently in a complete functional prototype phase. It solves the tricky data collection problem of wakewords.

Simply put, there are very specific minimum data requirements of the two data classes to classify wake words (wake-word and not-wake-word). These requirements are not generally known and best practices aren't usually followed.

The basic use case is: "What do I need to record to create a successful wakeword model using, at minimum, one user's voice?"

The purpose of the prototype is to experimentally determine the following parameters required to successfully create a production model (success: the user triggers the wakeword 5 times in a row, and there are no false triggers for at least 1 hour of testing with half an hour of TV and half an hour of user conversation):

  • Number of wakeword recordings
  • Number of variants (deeper or higher pitched voice, faster, further away from mic, closer, etc.)
  • Number of background ambient noise recordings
  • Number/Length of speaking or conversational not-wake recordings (when the user is talking and the model is trained to their voice, the likelihood of wake up is high, therefore users need to record themselves talking in general)
  • Recordings of individual syllables of the wake word (the classifier can get stuck on identifying one sound in the whole wake word, ie for 'hey jarvis' it could falsely trigger on 'hey', 'jar' or 'vis')
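
One plausible directory layout for these data categories (the actual layout produced by the recorder may differ):

```
wakeword-data/
├── wake-word/            # clean recordings of the wake word
│   └── variations/       # pitch/speed/distance variants
└── not-wake-word/
    ├── background/       # ambient room noise
    ├── paragraph/        # user reading a paragraph of text
    ├── syllables/        # individual syllables of the wake word
    └── random/           # TV and conversation recordings
```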

The next phase of the wakeword data collection tool is to create actual software that could be run via website or as a skill for a user to easily collect wakeword data.

This is a good reference for wake word training.

Model optimization and data generation tool

wakeword-data-prep automatically splits the training and test set, including sub-classes. It finds the optimal split by training several models and selecting the best model. Then it generates both Gaussian and background noise files.

It is still a work in progress. Currently to run it, it should be placed in the same directory as Precise (and you should activate your venv before running it). Although it uses Precise to test the distribution of the data (train/testing), you could still use the output with another wakeword system (an optimal distribution is an optimal distribution for any wakeword system).

How does it work?

  • randomly splits basic audio files into training and test sets
    • 'wake-word' (80/20)
    • (wake-word) 'variations' (50/50)
    • (not-wake-word) 'background' (80/20)
    • (not-wake-word) 'paragraph' (50/50)
    • (not-wake-word) 'syllables' (50/50)
  • trains 5 'base' models using random shuffles of above splits (uses 450 epochs, the epochs run really fast!)
  • selects 'best' model from the 5
    • removes the model with the greatest difference between accuracy and validation accuracy (sometimes a random shuffle can lead to a strangely unbalanced data split, characterized by either the accuracy or the validation accuracy scoring much higher or lower than the other; a quick fix is simply to always remove the model with the greatest difference between the two)
    • Selects the model with the highest accuracy
    • deletes the unused data split directories for the models
  • Adds 2 levels of Gaussian noise to all categories and adds those audio files to their own 'gauss' sub directory for each category... This is important for two reasons:
    1. Allows the model to have basic functionality in noisy environments
    2. Generates more data to keep the classes more balanced
  • Adds pre-recorded background sounds (from open source noise data sets) to all categories (except the Gaussian ones, of course!) and adds those files to sub directories, for the same reasons as Gauss.
  • This directory and base model can be now used for deeper training (ie. precise-train-incremental) to make the optimal model. Perhaps this step will be included in a future version, so it does everything for you. We will see. But until then, check out the how to on training a model from the incremental part to continue making your bullet proof wake word model.
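
The best-model selection rule above can be sketched as follows; the accuracy numbers are made up, and the exact tie-breaking in wakeword-data-prep may differ:

```python
def select_best(results):
    """Pick the best of several shuffle-trained models.
    `results` maps model name -> (accuracy, validation_accuracy).
    First drop the model with the largest train/validation gap (a sign of
    an unlucky shuffle), then keep the highest-accuracy survivor."""
    worst = max(results, key=lambda m: abs(results[m][0] - results[m][1]))
    rest = {m: v for m, v in results.items() if m != worst}
    return max(rest, key=lambda m: rest[m][0])

runs = {
    "shuffle-1": (0.94, 0.93),
    "shuffle-2": (0.99, 0.81),   # big train/validation gap: likely a bad split
    "shuffle-3": (0.95, 0.94),
    "shuffle-4": (0.92, 0.91),
    "shuffle-5": (0.93, 0.93),
}
print(select_best(runs))  # shuffle-3
```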

Engine Improvements

Currently the only open source wakeword system is Mycroft's Precise. It uses Tensorflow 1.13 with Keras to generate a GRU-RNN based on the data collected. It has a lot of undocumented features, such as introducing background noises into wake and non-wake audio files, and even has an unofficial branch for Tensorflow 2.3.1. To improve the current state, the TF 2.3.1 branch was forked, some minor bug fixes were implemented, and it has been run and tested. The goals:

  • implement a Tensorflow lite model, which is lighter than the current models (26kb) to speed up the model and reduce the resources required to run the model (done)
    • the model currently exported to tflite is 21kb (without optimization methods)
  • implement TFLite post training quantization to further compress the model
  • test uncompressed vs various compressed models to determine speed vs accuracy optimization
  • deploy forked repo using TF 2.3.1 on raspi4 aarch64 (ARM64) (done)
    • deploy on Mycroft (done)
  • create lighter TFLite runner (binary?) for raspi4 aarch64 (arm64)
  • deploy TFLite runner on Mycroft (done, it is tested and works, but not with a binary nor is it slimmed down)
  • benchmark CPU usage of TFLite runner vs whole deployed repo (TF 2.3.1) vs original precise (~25-30% base CPU usage) (TF2 uncompressed easy deployment: ~13-20% CPU)
  • document the deployment and usage of each component and improve existing documentation

Note on Precise TF1 and TF2: Interestingly, the model can be trained using the TF1 version then still be converted into TF2. Why would you want to do this? Although it has been reported that the latest version of TF has no performance issues anymore in training, training with TF2 is still much slower than TF1. Therefore, it is recommended to train the model in TF1 and export it to TF2 (or TFLite).

How to: Train a Model With Precise Manually

  • Make sure you first meet the data requirements and follow best practices (and here) in regards to data collection.
    • check to make sure all datasets are using the correct audio file format (the wakeword-recorder-py uses this format already):
      • wave file format
      • channels: 1
      • sample frequency: 16000
  • Optimally split your data (use the wakeword-data-prep script for this) otherwise do it manually:
    • Perform a random 80/20 split on the base wake-word root directory (not variations!) and the background noise (not-wake-word/background) directory
    • 50/50 split for all other categories (ie variations, paragraph) based on their variation 'pairs' (further documentation to come, but its already built into the wakeword-data-prep script)
    • create several models from randomly shuffling the data (ie 5 random shuffles)
    • select best model
    • add background noise to samples
  • Gaussian (run notebook or wakeword-data-prep script)
    • precise-add-noise (dataset source folder) (background sounds folder) (output folder)
      • not currently implemented in the script, works with precise as a command
      • download the following data sets for noise (best practice: add them as sub directories to the random directory):
  • (if you don't use the script and want to train manually) Find optimal number of epochs: precise-train -e 600 jarvis_rebooted.net jarvis_rebooted/
    • For the first training, only train on base wake-word and not-wake-word (no noise, random sound files, etc.)
    • Once you are sure you are hitting ~93%-95% move on
  • Test the model: precise-test jarvis_rebooted.net jarvis_rebooted/
    • What is failing? Why?
  • Use the model: precise-listen jarvis_rebooted.net
    • Does it work? Does it work for just part of the wakeword (ie 'hey')?
    • Remember this is the weakest model you will build; it will trigger on almost any input!
  • Incremental training: precise-train-incremental jarvis_rebooted.net jarvis_rebooted/ -r jarvis_rebooted/random/
    • First incrementally train on the random conversational and TV recordings (ie from wake-word-recorder.py)
    • Incremental training on random sounds also!
    • once done, run your model for roughly 300 epochs with the normal training method
    • Run a test and determine where it fails and why
    • Use the model: does it work? Does it wake up for parts of the wakeword only?
    • Ideally this final model will have few false wake ups but will detect the wake word every time, even in a noisy environment.
  • Convert model: precise-convert jarvis_rebooted.net
  • Deploy model
  • Test model in production
    • Say the wake word 5 times
    • Let the model run for 2h (at least 1h random conversation + 1h TV)
  • If model passes: congratulations
  • If model fails: back to the steps all over again!
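
The audio-format requirement from the first step (mono, 16 kHz WAV) can be checked with the Python standard library; the file name here is hypothetical:

```python
import struct
import wave

def check_format(path):
    """Return True if `path` is a mono 16 kHz WAV file, as Precise expects."""
    with wave.open(path, "rb") as w:
        return w.getnchannels() == 1 and w.getframerate() == 16000

# Write a tiny silent clip in the expected format, then verify it.
with wave.open("sample.wav", "wb") as w:
    w.setnchannels(1)        # mono
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)    # 16 kHz
    w.writeframes(struct.pack("<160h", *([0] * 160)))  # 10 ms of silence

print(check_format("sample.wav"))  # True
```

Running this check over a whole dataset directory before training can save a failed run later.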

How to: Install Tensorflow 2.3.1 on a Raspberry Pi 4 with aarch64 (arm64)

To run the whole precise repo, you have to install tensorflow 2.3.1. For most platforms, this is easy. But there are specific steps for a raspi4 aarch64.

  • get a fresh start (remember, the 64-bit OS is still under development)
$ sudo apt-get update
$ sudo apt-get upgrade
  • install pip and pip3 (but I am sure you already have this!)
$ sudo apt-get install python-pip python3-pip
  • remove old versions, if not placed in a virtual environment (let pip search for them)
$ sudo pip uninstall tensorflow
$ sudo pip3 uninstall tensorflow
  • install the dependencies (if not already onboard)
$ sudo apt-get install gfortran
$ sudo apt-get install libhdf5-dev libc-ares-dev libeigen3-dev
$ sudo apt-get install libatlas-base-dev libopenblas-dev libblas-dev
$ sudo apt-get install liblapack-dev
  • If you are doing this for a specific python env (which you should!) then drop the sudo -H in the rest of the instructions
  • upgrade setuptools 47.1.1 -> 50.3.0
$ sudo -H pip3 install --upgrade setuptools
$ sudo -H pip3 install pybind11
$ sudo -H pip3 install Cython==0.29.21
  • install h5py with Cython version 0.29.21 (± 6 min @1950 MHz)
$ sudo -H pip3 install h5py==2.10.0
  • install gdown to download from Google drive
$ pip3 install gdown
  • download the wheel (seriously, last time I checked you have to get it from some dude's google drive... When will they release this package officially? Also if you don't trust this part, you will have to make the wheel yourself..)
$ gdown https://drive.google.com/uc?id=1jbkp2rSZZ3YY-AM1vuHyB9hI05zrZGHg
  • install TensorFlow (± 63 min @1950 MHz)
$ sudo -H pip3 install tensorflow-2.3.1-cp37-cp37m-linux_aarch64.whl

How to: Quick and Dirty Tensorflow Lite Precise Model Deployed in Mycroft

This briefly describes how to get a TFlite model for Precise (TF2.3.1) running directly in Mycroft. (it is strongly recommended you follow the instructions for installing TF2.3.1 first!)

  • backup the TF 1.13 Precise engine in: ~/.mycroft/precise/precise-engine, i.e. precise-engine-old
  • copy the Tensorflow 2.3.1 Tensorflow lite (TFLite) Precise repo to:
    • ~/.mycroft/precise/ and rename the directory to precise-engine
      • Note: This isn't the slimmest version of the runner; it needs to be cut down in the future, perhaps even turned into a binary!
  • add the .tflite model to the ~/.mycroft/precise directory
    • Question: does the params file also need to be included? (It doesn't appear to make a difference with .tflite; only the model file seems to be needed.)
  • in ~/.mycroft/ backup the current config file: mycroft.conf, i.e. mycroft.conf.bak
  • edit the mycroft.conf to include the path for the tflite model: "local_model_file": "~/.mycroft/precise/*.tflite"
  • run Mycroft in debug mode and test it a bunch of times. The sensitivity and trigger-level might need to be fine tuned for this model.
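
A plausible mycroft.conf fragment, assuming a hotword entry named "hey mycroft"; the model path and the sensitivity/trigger_level values are illustrative and will need tuning:

```json
{
  "hotwords": {
    "hey mycroft": {
      "module": "precise",
      "local_model_file": "~/.mycroft/precise/hey-mycroft.tflite",
      "sensitivity": 0.5,
      "trigger_level": 3
    }
  }
}
```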