Getting started

This section is a quick walk-through to help you get started with TrulyNatural concepts and command-line tools.

Prerequisites

Install the TrulyNatural SDK using the provided installer executable.
Open a terminal with a command-line prompt.
Add ~/Sensory/TrulyNaturalSDK/7.6.1/bin to your shell PATH variable, and change your working directory to ~/Sensory/TrulyNaturalSDK/7.6.1.
Configure your shell environment:
```
PATH=${PATH}:~/Sensory/TrulyNaturalSDK/7.6.1/bin
cd ~/Sensory/TrulyNaturalSDK/7.6.1
```
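Before continuing, it can help to confirm that the tools used in this walk-through are reachable from your shell. The following is a minimal sketch, not part of the SDK: missing_tools is a helper name chosen here, and the tool list is simply the set of commands that appear later on this page.

```shell
# Report which of the named tools are not found on PATH.
# Prints one missing tool name per line; prints nothing if all are found.
missing_tools() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
  done
}

# Tools used in this walk-through.
missing=$(missing_tools snsr-eval snsr-edit spot-enroll live-enroll)
if [ -n "$missing" ]; then
  printf 'Not on PATH:\n%s\n' "$missing"
else
  echo "All tools found."
fi
```

If any tool is reported as missing, re-check the PATH setting above.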

Wake words

Let's start by running a simple wake word on live audio input.

The snsr-eval utility can run all recognition and VAD model types. Use the -t (task) flag to specify the path to the wake word model.

Start snsr-eval and say "voice genie" a number of times:
```
% snsr-eval -t model/spot-voicegenie-enUS-6.5.1-m.snsr
  1275   1920 voicegenie
  5070   5730 voicegenie
  10395  10980 voicegenie
  ^C
```
The output shows the start and end times of each detection, in milliseconds since the start of the recording, followed by the phrase the keyword spotter detected. snsr-eval runs until you interrupt it with ^C.
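Because each result line is plain whitespace-separated text, it is easy to post-process. As a rough sketch, assuming result lines of the form "start end phrase" as in the run above, the hypothetical helper spot_durations prints how long each detected phrase lasted:

```shell
# Print "<phrase> <duration_ms>" for each detection line on stdin.
# A detection line is assumed to look like "<start> <end> <phrase>";
# anything else (for example "^C") is ignored.
spot_durations() {
  awk '$1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ { print $3, $2 - $1 }'
}

# Sample lines copied from the run above.
spot_durations <<'EOF'
  1275   1920 voicegenie
  5070   5730 voicegenie
  10395  10980 voicegenie
EOF
```

For the sample lines this reports durations of 645, 660 and 585 ms.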

The spot-voicegenie-enUS-6.5.1-m.snsr model file includes everything needed to run this wake word, including reasonable default settings.

For wake words, there is one configuration option that you might want to adjust: operating-point, which controls the recognition sensitivity. You can do this on the command-line with the -s option:
```
% snsr-eval -t model/spot-voicegenie-enUS-6.5.1-m.snsr -s operating-point=21
  1080   1605 voicegenie
  2610   3180 voicegenie
  5460   5970 voicegenie
^C
```
Increase snsr-eval output verbosity with one or more -v flags:
```
% snsr-eval -vt model/spot-voicegenie-enUS-6.5.1-m.snsr
Using live audio from default capture device. ^C to stop.
1410   1995 (0.999) voicegenie
^C

% snsr-eval -vvt model/spot-voicegenie-enUS-6.5.1-m.snsr
Using live audio from default capture device. ^C to stop.
Using operating point 8.
Available operating points: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21.
Available vocabulary:
1: "voicegenie"
phrase:
  495   1020 (0.9951) voicegenie
words:
  495   1020 (0.9951) voicegenie

^C
```
The value in parentheses is the recognition score.
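If you capture verbose output, the score can be used to filter results afterwards. The sketch below assumes the "start end (score) phrase" layout shown in the verbose runs above; filter_by_score is a helper name chosen for this example.

```shell
# Keep only detections whose score is at least MIN.
# Usage: filter_by_score <MIN> < captured-output
# Assumes lines of the form "<start> <end> (<score>) <phrase>".
filter_by_score() {
  awk -v min="$1" '
    $3 ~ /^\([0-9.]+\)$/ {
      score = substr($3, 2, length($3) - 2)   # strip the parentheses
      if (score + 0 >= min + 0) print
    }'
}

# Sample lines copied from the verbose runs above.
filter_by_score 0.997 <<'EOF'
1410   1995 (0.999) voicegenie
  495   1020 (0.9951) voicegenie
EOF
```

With a threshold of 0.997 only the first sample line is kept.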

You can also reduce the output verbosity with one or more -l flags. snsr-eval -ll reports only the recognition result, which makes it suitable for scripted batch testing.

Tip

Run snsr-eval, or any of the other command-line tools, without arguments to see a brief usage summary and a list of available options.

To recognize pre-recorded audio, specify an audio file in either RIFF WAV format or as a binary file containing audio samples and no header. Most models require 16-bit LPCM encoding sampled at 16 kHz.
```
% snsr-eval -t model/spot-voicegenie-enUS-6.5.1-m.snsr \
    data/audio/voice-genie-set-cruise-control.wav
  2310   2910 voicegenie
```
snsr-eval ends once it has processed the entire file.

If you specify multiple files, snsr-eval concatenates them in order and evaluates the model on the result. Recognition timestamps reflect this concatenation.
```
% snsr-eval -t model/spot-voicegenie-enUS-6.5.1-m.snsr \
    data/audio/voice-genie-set-cruise-control.wav \
    data/audio/voice-genie-set-cruise-control.wav
  2310   2910 voicegenie
  9075   9645 voicegenie
```
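If you know the duration of each input file, a concatenated timestamp can be mapped back to the file it came from. This is a sketch of that arithmetic only: locate_ms is a hypothetical helper, and the 7000 ms durations in the example call are placeholders, not measured lengths of the sample recordings.

```shell
# Map a timestamp in concatenated audio back to a file index and local offset.
# Usage: locate_ms <timestamp_ms> <duration_ms_file1> <duration_ms_file2> ...
# Prints "<file_index> <offset_ms>", with file_index starting at 1.
locate_ms() {
  ts=$1; shift
  index=1
  for dur in "$@"; do
    if [ "$ts" -lt "$dur" ]; then
      printf '%s %s\n' "$index" "$ts"
      return 0
    fi
    ts=$((ts - dur))
    index=$((index + 1))
  done
  return 1   # timestamp lies beyond the end of the last file
}

# Placeholder durations: two files of 7000 ms each.
locate_ms 9075 7000 7000
```

With these placeholder durations, 9075 ms falls 2075 ms into the second file.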

snsr-eval also runs command sets. These are keyword spotters with more than one active word or phrase, optimized for a low false reject rate.

Let's inspect the sample music control model vocabulary:

```
% snsr-eval -vvt model/spot-music-enUS-1.2.0-m.snsr /dev/null
Using operating point 17.
Available operating points: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20.
Available vocabulary:
  1: "play_music"
  2: "previous_song"
  3: "stop_music"
  4: "next_song"
  5: "pause_music"
```

Command sets work just like wake words:

```
% snsr-eval -t model/spot-music-enUS-1.2.0-m.snsr \
    data/audio/voice-genie-music.wav
  5160   5865 play_music
  8055   8790 next_song
 14820  15705 stop_music
```

Running an adapting wake word uses the same recipe as regular wake words:

```
% snsr-eval -vt model/ca-voicegenie-enUS-1.1.0.snsr
Using live audio from default capture device. ^C to stop.
      [^adapt-started] on worker thread
  8295   8790 (1 sv) voice_genie
      [^adapt-started] on worker thread
 15645  16245 (1 sv) voice_genie
 16515 [^adapted]
 16515 [^new-user] user1/voice_genie
      [^adapt-started] on worker thread
 20445  21060 (0.9678 sv) user1/voice_genie
 21225 [^adapted]
^C
```

Note how the result changes once the model has adapted to your speech. If you have a second speaker say "voice genie" a couple of times you should see user2/voice_genie for their utterances.

The model adaptations do not persist and will be lost when the model is reloaded. We can address that by specifying a cache-file to hold these. Say "voice genie" a number of times, then stop snsr-eval with ^C:

```
% snsr-eval -vt model/ca-voicegenie-enUS-1.1.0.snsr \
    -s cache-file=voice-genie.cache
Using live audio from default capture device. ^C to stop.
      [^adapt-started] on worker thread
  1485   2070 (1 sv) voice_genie
      [^adapt-started] on worker thread
  4740   5355 (1 sv) voice_genie
  5460 [^adapted]
  5460 [^new-user] user1/voice_genie
      [^adapt-started] on worker thread
  9285   9855 (0.9951 sv) user1/voice_genie
 10140 [^adapted]
^C
```

snsr-eval loads voice-genie.cache when restarting. Note that the very first result already includes the user1/voice_genie result:

```
% snsr-eval -vt model/ca-voicegenie-enUS-1.1.0.snsr \
    -s cache-file=voice-genie.cache
Using live audio from default capture device. ^C to stop.
      [^adapt-started] on worker thread
  1185   1815 (0.9971 sv) user1/voice_genie
  2085 [^adapted]
^C
```
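When reviewing a longer session, it can be useful to tally how many detections were attributed to each enrolled speaker. The sketch below assumes the verbose layout shown in the adapting wake word runs above, where result lines carry an "sv" score and end in either a bare phrase or "user/phrase". count_by_user and the "(none)" label are choices made for this example.

```shell
# Count detections per speaker label in "-v" output from an adapting wake word.
# Result lines are assumed to end in "<user>/<phrase>" once a speaker is
# enrolled and in a bare "<phrase>" before that.
count_by_user() {
  awk '$4 == "sv)" {
         n = split($NF, part, "/")
         user = (n == 2) ? part[1] : "(none)"
         count[user]++
       }
       END { for (u in count) print u, count[u] }' | LC_ALL=C sort
}

# Sample lines copied from the first adapting run above.
count_by_user <<'EOF'
      [^adapt-started] on worker thread
  8295   8790 (1 sv) voice_genie
15645  16245 (1 sv) voice_genie
16515 [^new-user] user1/voice_genie
20445  21060 (0.9678 sv) user1/voice_genie
EOF
```

For the sample lines this reports two detections without a speaker label and one for user1.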

Wake word enrollment

EFT models adapt a specific wake word phrase to a speaker's voice. UDT models create new wake words for arbitrary phrases from a handful of speaker-specific recordings.

Let's start with UDT. The spot-enroll utility does EFT and UDT enrollment from recordings. Most models require four recordings of the same phrase for optimal performance. UDT models are less likely to trigger on the same phrase said by another speaker.

Enroll three phrases into a single model

```
% spot-enroll -vt model/udt-universal-3.67.1.0.snsr \
    -o udt-kws.snsr \
    +armadillo-1 data/enrollments/armadillo-1-{0,1,2,3}.wav \
    +armadillo-6 data/enrollments/armadillo-6-{0,1,2,3}.wav \
    +jackalope-1 data/enrollments/jackalope-1-{0,1,2,3}.wav \
    +jackalope-4 data/enrollments/jackalope-4-{0,1,2,3}.wav \
    +terminator-2 data/enrollments/terminator-2-{0,1,2,3}.wav \
    +terminator-6 data/enrollments/terminator-6-{0,1,2,3}.wav
Adapting: 100% complete.
Enrolled model saved to "udt-kws.snsr"
```

We've just created a new udt-kws.snsr wake word model that spots three different phrases with two unique speakers per phrase. We can run the model with snsr-eval. We'll use different recordings than the ones used for enrollment:

```
% snsr-eval -t udt-kws.snsr \
    data/enrollments/armadillo-1-0-c.wav \
    data/enrollments/armadillo-6-0.wav \
    data/enrollments/jackalope-1-0-c.wav \
    data/enrollments/jackalope-4-0.wav \
    data/enrollments/terminator-2-0.wav \
    data/enrollments/terminator-6-0.wav
   330    945 armadillo-1
  4485   5265 armadillo-6
  6795   7320 jackalope-1
 10365  10890 jackalope-4
 13245  13950 terminator-2
 15510  16215 terminator-6
```

The model identifies the phrases and the speakers correctly.
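A check like this can be automated by comparing each detected label with the label implied by the file name. The following is a rough sketch under two assumptions: the sample files follow a phrase-speaker-take naming pattern (inferred from the paths above), and each file yields exactly one detection, in file order. expected_label and check_labels are hypothetical helpers.

```shell
# Derive the expected label from an enrollment file name, for example
# "data/enrollments/armadillo-1-0-c.wav" -> "armadillo-1".
expected_label() {
  basename "$1" .wav | sed -E 's/^([a-z]+-[0-9]+)-.*$/\1/'
}

# Compare detections on stdin (one per line, in file order) with the files
# given as arguments. Prints "ok" or "MISMATCH" per file and returns
# non-zero if any file did not match.
check_labels() {
  status=0
  for file in "$@"; do
    read -r _start _end label || label=""
    want=$(expected_label "$file")
    if [ "$label" = "$want" ]; then
      echo "ok $want"
    else
      echo "MISMATCH $want (got: ${label:-nothing})"
      status=1
    fi
  done
  return "$status"
}

# Example (requires the SDK and the enrolled model):
#   files="data/enrollments/armadillo-1-0-c.wav data/enrollments/armadillo-6-0.wav"
#   snsr-eval -t udt-kws.snsr $files | check_labels $files
```

A recording that triggers zero or several detections would shift the pairing, so treat a mismatch as a prompt to inspect the raw output.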

EFT models use the same tools and API for enrollment as UDT. We'll use live-enroll in this example. This is the interactive version of spot-enroll, the main difference being that it prompts to repeat recordings that aren't usable instead of reporting an enrollment error.

Run the example below, then say "hello blue genie" when prompted.

```
% live-enroll -vt model/eft-hbg-enUS-23.0.0.9.snsr \
    -o eft-hbg.snsr +user-1

Say the enrollment phrase (1/4) for "user-1"
Recording:   4.29 s
Preliminary enrollment checks passed.

Say the enrollment phrase (2/4) for "user-1"
Recording:   3.23 s
Preliminary enrollment checks passed.

Say the enrollment phrase (3/4) for "user-1" with context,
  for example: "<phrase> will it rain tomorrow?"
Recording:   3.15 s
Preliminary enrollment checks passed.

Say the enrollment phrase (4/4) for "user-1" with context,
  for example: "<phrase> will it rain tomorrow?"
Recording:   3.90 s
Preliminary enrollment checks passed.
Adapting: 100% complete.
Enrolled model saved to "eft-hbg.snsr"
```

As before, test with snsr-eval:

```
% snsr-eval -t eft-hbg.snsr
Using live audio from default capture device. ^C to stop.
   675   1500 (0.8921 sv) user-1/HBG
  3480   4500 (0.8245 sv) user-1/HBG
  7020   8160 (0.877 sv) user-1/HBG
^C
```

The value in parentheses is the sv-score.

Templates

Templates are models that add new behaviors to wake word, LVCSR, and STT models via composition. Templates have slots that we fill with models of the required type.

Let's say we would like to reduce false accepts of the music command set used above by requiring commands in it to be prefixed with a low false-accept wake word like "voice genie". We can do this with the tpl-spot-sequential template:
```
% snsr-eval -t model/tpl-spot-sequential-1.5.0.snsr \
    -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
    -f 1 model/spot-music-enUS-1.2.0-m.snsr \
    data/audio/voice-genie-music.wav
  5175   5865 play_music
```
snsr-eval's -f slot filename option loads the named file into the specified slot. The tpl-spot-sequential documentation lists the slots the template supports, and the types of models it expects for these.

As expected, the new recognizer spots only the command directly following "voice genie".

snsr-eval supports on-the-fly model composition, but what if we have code that already works with spot-music-enUS-1.2.0-m.snsr that we don't want to modify? Enter snsr-edit, which supports composition and setting changes and can save the result as a new, self-contained model:
```
% snsr-edit -vvt model/tpl-spot-sequential-1.5.0.snsr \
    -o vg-music.snsr \
    -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
    -f 1 model/spot-music-enUS-1.2.0-m.snsr
Loading "model/tpl-spot-sequential-1.5.0.snsr" as the template model.
Loading "model/spot-voicegenie-enUS-6.5.1-m.snsr" into setting "0".
Loading "model/spot-music-enUS-1.2.0-m.snsr" into setting "1".
Output written to "vg-music.snsr".

% snsr-eval -t vg-music.snsr data/audio/voice-genie-music.wav
  5175   5865 play_music
```
tpl-spot-sequential has a loop setting that changes its behavior. Let's give that a try:
```
% snsr-eval -t vg-music.snsr -s loop=1 data/audio/voice-genie-music.wav
  5175   5865 play_music
  8055   8790 next_song
```
This recognizes the first two music commands but not "stop music", because the gap between "next song" and "stop music" is longer than the listen-window, which is five seconds:
```
% snsr-edit -t model/spot-music-enUS-1.2.0-m.snsr -q listen-window
listen-window = 5
```
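The gap arithmetic can be checked directly from the timestamps. The sketch below uses the "play music" and "next song" times from the loop=1 run and the "stop music" time from the earlier command-set run. It assumes the window is measured from the end of one detection to the start of the next; the template may measure it differently. window_gaps is a helper name chosen here.

```shell
# For each detection after the first, print the gap (ms) since the previous
# detection ended and whether that gap fits inside the listen window.
# Usage: window_gaps <window_ms> < detections
window_gaps() {
  awk -v window_ms="$1" '
    NR > 1 {
      gap = $1 - prev_end
      print $3, gap, (gap <= window_ms ? "inside" : "outside")
    }
    { prev_end = $2 }'
}

window_gaps 5000 <<'EOF'
  5175   5865 play_music
  8055   8790 next_song
 14820  15705 stop_music
EOF
```

Under that assumption "next song" starts 2190 ms after "play music" ends, inside the five-second window, while "stop music" starts 6030 ms after "next song" ends, outside it.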

We can run two keyword spotters simultaneously with tpl-spot-concurrent:

```
% snsr-eval -t model/tpl-spot-concurrent-1.5.0.snsr \
    -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
    -f 1 model/spot-music-enUS-1.2.0-m.snsr \
    data/audio/voice-genie-music.wav
  4485   5085 voicegenie
  5160   5865 play_music
  8055   8790 next_song
 14820  15705 stop_music
```

Speech To Text stt

TrulyNatural STT includes support for modern transformer-based end-to-end recognizers suitable for transcription tasks.

STT models use the same tools and API as wake words. Let's run a sample audio file through stt-enUS-automotive-medium-2.3.15-pnc.snsr with snsr-eval:

```
% snsr-eval -t model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    data/audio/voice-genie-set-cruise-control.wav
P   2040   2960 God Boice jeni
P   2240   3560 Voice. Genie said the creek
P   2240   3960 Voice. Genie set the cruise control
P   2240   4200 Voice. Genie set the cruise control to
P   2240   4720 Voice. Genie set the cruise control to further
P   2240   5200 Voice. Genie set the cruise control to fifty five. Mrs
P   2240   5600 Voice Genie set the cruise control to fifty five miles back
P   2240   5800 Voice Genie set the cruise control to fifty five miles per hour
P   2240   5840 Voice Genie set the cruise control to fifty five miles per hour
NLU intent: set_cruise_control (0.9969) = voice genie set the cruise control to 55 miles per hour
NLU entity:   number (0.9931) = 55
NLU entity:   speed_unit (0.9942) = miles per hour
  2240   5840 Voice Genie set the cruise control to fifty five miles per hour.
```

Partial or interim hypotheses are shown prefixed with P. These provide useful feedback for live transcription tasks, but are less interesting when recognizing from file. You can suppress them by setting partial-result-interval = 0:

```
% snsr-eval -t model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    -s partial-result-interval=0 \
    data/audio/voice-genie-set-cruise-control.wav
NLU intent: set_cruise_control (0.9969) = voice genie set the cruise control to 55 miles per hour
NLU entity:   number (0.9931) = 55
NLU entity:   speed_unit (0.9942) = miles per hour
  2240   5840 Voice Genie set the cruise control to fifty five miles per hour.
```
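If you already have output that was captured with partial results enabled, the partial lines can also be dropped after the fact. A minimal sketch, assuming partial hypotheses are the lines that start with "P" followed by whitespace (final_only is a name chosen here):

```shell
# Drop partial hypotheses from captured snsr-eval output.
final_only() {
  grep -v '^P[[:space:]]'
}

# Sample lines copied from the run with partial results enabled.
final_only <<'EOF'
P   2240   5800 Voice Genie set the cruise control to fifty five miles per hour
NLU intent: set_cruise_control (0.9969) = voice genie set the cruise control to 55 miles per hour
  2240   5840 Voice Genie set the cruise control to fifty five miles per hour.
EOF
```

The NLU lines and the final hypothesis pass through unchanged.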

STT and LVCSR models produce a final recognition hypothesis only at end-of-file, or when a VAD signals that speech has ended. For convenience, snsr-eval has an -a option that adds the tpl-vad-lvcsr template if you're using live audio and specify an STT model that does not include a VAD.

```
% snsr-eval -lt model/stt-enUS-automotive-medium-2.3.15-pnc.snsr
ERROR: With live audio LVCSR and STT models require a VAD. You can add one with the -a flag.

% snsr-eval -alt model/stt-enUS-automotive-medium-2.3.15-pnc.snsr
NLU intent: no_command (0.9995) = the quick brown fox jumped over the lazy dog's back
The quick Brown Fox jumped over the lazy dog's back.
^C
```

Create a new model that includes a VAD:

```
% snsr-edit -vvo vad-stt.snsr \
    -t model/tpl-vad-lvcsr-3.17.0.snsr \
    -f 0 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr
Loading "model/tpl-vad-lvcsr-3.17.0.snsr" as the template model.
Loading "model/stt-enUS-automotive-medium-2.3.15-pnc.snsr" into setting "0".
Output written to "vad-stt.snsr".

% snsr-eval -lt vad-stt.snsr
NLU intent: no_command (0.9995) = the quick brown fox jumped over the lazy dog's back
The quick Brown Fox jumped over the lazy dog's back.
```

Creating an STT model that's gated by a wake word is also easy with tpl-opt-spot-vad-lvcsr:

```
% snsr-edit -vvo vg-vad-stt.snsr \
    -t model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr \
    -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
    -f 1 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    -s include-wake-word-audio=1
Loading "model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr" as the template model.
Loading "model/spot-voicegenie-enUS-6.5.1-m.snsr" into setting "0".
Loading "model/stt-enUS-automotive-medium-2.3.15-pnc.snsr" into setting "1".
Output written to "vg-vad-stt.snsr".
```

include-wake-word-audio = 1 includes the wake word in the audio seen by the STT recognizer, but configures the STT to elide this from the recognition hypothesis. This improves recognition accuracy if there's no pause between the wake word and the STT command.

Test with snsr-eval

```
% snsr-eval -lt vg-vad-stt.snsr \
    data/audio/voice-genie-set-cruise-control.wav
NLU intent: set_cruise_control (0.9968) = set the cruise control to 55 miles per hour
NLU entity:   number (0.9937) = 55
NLU entity:   speed_unit (0.9936) = miles per hour
Set the cruise control to fifty five miles per hour.
```
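For scripted use, the NLU lines can be turned into tab-separated fields. This is a sketch that relies only on the line layout visible in the output above ("NLU intent:" or "NLU entity:", a name, a score in parentheses, then "=" and the value). nlu_to_tsv is a hypothetical helper, not an SDK tool.

```shell
# Convert "NLU intent:" and "NLU entity:" lines to tab-separated fields:
#   <kind> <name> <score> <value>
# All other lines are ignored.
nlu_to_tsv() {
  awk '$1 == "NLU" && ($2 == "intent:" || $2 == "entity:") {
         kind  = substr($2, 1, length($2) - 1)   # drop the trailing colon
         name  = $3
         score = substr($4, 2, length($4) - 2)   # strip the parentheses
         value = $0
         sub(/^[^=]*= */, "", value)             # keep the text after "="
         printf "%s\t%s\t%s\t%s\n", kind, name, score, value
       }'
}

# Sample lines copied from the run above.
nlu_to_tsv <<'EOF'
NLU intent: set_cruise_control (0.9968) = set the cruise control to 55 miles per hour
NLU entity:   number (0.9937) = 55
NLU entity:   speed_unit (0.9936) = miles per hour
Set the cruise control to fifty five miles per hour.
EOF
```

The tab-separated form is convenient to feed into tools such as cut or a spreadsheet import.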

Test with snsr-eval and live audio

```
% snsr-eval -t vg-vad-stt.snsr
P   7180   7340 Said
P   7220   7860 Said the radio
P   7220   8220 Set the radio tonight
P   7260   8580 Set the radio to ninety one
P   7260   9060 Set the radio to ninety one point. Three
P   7260   9260 Set the radio to ninety one point. Five
P   7260   9620 Set the radio to ninety one point. Five f
NLU intent: set_radio (0.9674) = set the radio to 91.5 FM
NLU entity:   radio_station (0.9688) = 91.5 FM
  7260   9860 Set the radio to ninety one point. Five F. M.
^C
```

LVCSR tnl

Use grammar-based recognition for command-and-control tasks with small to medium-sized vocabularies on devices that aren't powerful enough for STT. VoiceHub provides a convenient interface for creating these models.

Let's build a model that recognizes the example audio files in data/enrollments/. We'll use a grammar file from data/grammars/:

```
% snsr-edit -vvo commands.snsr \
    -t model/lvcsr-build-enUS-2.7.3.snsr \
    -f grammar-stream data/grammars/enrollments-nlu-slot.txt \
    -s partial-result-interval=0
Loading "model/lvcsr-build-enUS-2.7.3.snsr" as the template model.
Loading "data/grammars/enrollments-nlu-slot.txt" into setting "grammar-stream".
Output written to "commands.snsr".

% snsr-eval -t commands.snsr data/enrollments/armadillo-1-0-c.wav
NLU intent: calculate (0) =  18 percent of 643
NLU entity:   percent (0) = 18
NLU entity:   number (0) = 643
  375   3195 armadillo 18 percent of 643
```

We set grammar-stream to the contents of data/grammars/enrollments-nlu-slot.txt to define which sentences the recognizer will accept, and partial-result-interval = 0 to suppress interim results.

data/grammars/enrollments-nlu-slot.txt

```
# LVCSR grammar specification for test utterances in data/enrollments/
# Includes lightweight NLU slot markup.
#
# In a tpl-spot-vad-lvcsr pipeline the prefix would be consumed by the spotter.
prefix = armadillo | jackalope | terminator;

# Numbers used in the intent rule below
number = 18 | 643 | 20 | 6;

# Places
place = target | winco | susan's house | gas;

# Dates
date = friday | tomorrow | next week;

# List of known utterances in the *-c.wav files.
intent =
{calculate {percent $number} percent of {number}} |
{call call the nearest {place}} |
{navigate how far away is {place}} |
{avcontrol {action play} {type more} songs by this artist} |
{avcontrol {action record} a {type video}} |
{startTimer start a timer for {number} minutes {unit :minutes}} |
{navigate i'm running low on {place}} |
{calendar {action cancel} all my {type meetings} on {date}} |
{navigate directions to {place}} |
{messaging do i have any new texts {action :query} {type :texts}} |
{calendar {type open} my calendar to {date}} |
{alarm {type set} an alarm for {number} am {date}};

# Match the prefix and zero or one of the sentences.
# <s> and </s> are sentence start and end markers that
# match silence and small amounts of extraneous speech.
g = <s> $prefix $intent? </s>;
```

Let's combine the wake word we previously enrolled with this LVCSR model and tpl-opt-spot-vad-lvcsr:

```
% snsr-edit -vvo ww-commands.snsr \
    -t model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr \
    -f 0 udt-kws.snsr \
    -f 1 commands.snsr \
    -s include-wake-word-audio=1
Loading "model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr" as the template model.
Loading "udt-kws.snsr" into setting "0".
Loading "commands.snsr" into setting "1".
Output written to "ww-commands.snsr".
```

And then run it over all the enrollment recordings:

```
% snsr-eval -t ww-commands.snsr data/enrollments/*-c.wav
NLU intent: calculate (0) =  18 percent of 643
NLU entity:   percent (0) = 18
NLU entity:   number (0) = 643
   375   3195 armadillo 18 percent of 643
NLU intent: call (0) = call the nearest target
NLU entity:   place (0) = target
  4695   6360 armadillo call the nearest target
NLU intent: navigate (0) = how far away is winco
NLU entity:   place (0) = winco
  7680   9495 armadillo how far away is winco
NLU intent: avcontrol (0) =  record a video
NLU entity:   action (0) = record
NLU entity:   type (0) = video
 14535  16095 armadillo record a video
NLU intent: startTimer (0) = start a timer for 20 minutes minutes
NLU entity:   number (0) = 20
NLU entity:   unit (0) = minutes
 17640  19905 armadillo start a timer for 20 minutes minutes
NLU intent: navigate (0) = i'm running low on gas
NLU entity:   place (0) = gas
 21060  22935 jackalope i'm running low on gas
NLU intent: calendar (0) =  cancel all my meetings on friday
NLU entity:   action (0) = cancel
NLU entity:   type (0) = meetings
NLU entity:   date (0) = friday
 24315  26655 jackalope cancel all my meetings on friday
NLU intent: navigate (0) = directions to susan's house
NLU entity:   place (0) = susan's house
 27915  30195 jackalope directions to susan's house
NLU intent: messaging (0) = do i have any new texts query texts
NLU entity:   action (0) = query
NLU entity:   type (0) = texts
 31500  33525 jackalope do i have any new texts query texts
NLU intent: calendar (0) =  open my calendar to next week
NLU entity:   type (0) = open
NLU entity:   date (0) = next week
 34695  36975 jackalope open my calendar to next week
NLU intent: alarm (0) =  set an alarm for 6 am tomorrow
NLU entity:   type (0) = set
NLU entity:   number (0) = 6
NLU entity:   date (0) = tomorrow
 38160  40665 jackalope set an alarm for 6 am tomorrow
```