Getting started
This section is a quick walk-through to help you get started with TrulyNatural concepts and command-line tools.
Prerequisites
- Install the TrulyNatural SDK using the provided installer executable.
- Open a terminal with a command-line prompt.
- Add ~/Sensory/TrulyNaturalSDK/7.6.1/bin to your shell PATH variable, and change your working directory to ~/Sensory/TrulyNaturalSDK/7.6.1.
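The last prerequisite can be scripted. The following is a minimal sketch for a POSIX shell, assuming the SDK was installed to the default location named above. SDK_ROOT is a variable name chosen here for illustration; the SDK does not define it.

```shell
# Assumption: the SDK lives in the default location used in this guide.
SDK_ROOT="$HOME/Sensory/TrulyNaturalSDK/7.6.1"

# Put the SDK command-line tools first on the search path.
PATH="$SDK_ROOT/bin:$PATH"
export PATH

# Change to the SDK directory only if it exists, so the snippet is harmless
# on a machine where the SDK has not been installed yet.
if [ -d "$SDK_ROOT" ]; then
    cd "$SDK_ROOT"
fi
```

The relative paths used in the rest of this walk-through (model/..., data/...) assume that the working directory is the SDK directory.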
Wake words
- Let's start by running a simple wake word on live audio input.

  The snsr-eval utility can run all recognition and VAD model types. Use the -t (task) flag to specify the path to the wake word model.

  Start snsr-eval and say "voice genie" a number of times:

      % snsr-eval -t model/spot-voicegenie-enUS-6.5.1-m.snsr
      1275 1920 voicegenie
      5070 5730 voicegenie
      10395 10980 voicegenie
      ^C

  The output shows the start and end times in ms since the start of the recording, and the phrase the keyword spotter detected. snsr-eval runs until you interrupt it with ^C.

  The spot-voicegenie-enUS-6.5.1-m.snsr model file includes everything needed to run this wake word, including reasonable default settings.

  For wake words, there is one configuration option that you might want to adjust: operating-point, which controls the recognition sensitivity. You can do this on the command line with the -s option:

      % snsr-eval -t model/spot-voicegenie-enUS-6.5.1-m.snsr -s operating-point=21
      1080 1605 voicegenie
      2610 3180 voicegenie
      5460 5970 voicegenie
      ^C

  Increase snsr-eval output verbosity with one or more -v flags:

      % snsr-eval -vt model/spot-voicegenie-enUS-6.5.1-m.snsr
      Using live audio from default capture device. ^C to stop.
      1410 1995 (0.999) voicegenie
      ^C
      % snsr-eval -vvt model/spot-voicegenie-enUS-6.5.1-m.snsr
      Using live audio from default capture device. ^C to stop.
      Using operating point 8.
      Available operating points: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21.
      Available vocabulary:
      1: "voicegenie"
      phrase: 495 1020 (0.9951) voicegenie
      words: 495 1020 (0.9951) voicegenie
      ^C

  The value in parentheses is the recognition score.

  You can also reduce the output verbosity with one or more -l flags. snsr-eval -ll reports only the recognition result, which makes it suitable for scripted batch testing.

  Tip: Run snsr-eval, or any of the command-line tools, without arguments to see a brief usage summary and a list of available options.
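Because the default output is one detection per line (start time, end time, phrase), it is easy to post-process in a script. The sketch below is not part of the SDK: detection_durations is a hypothetical helper that reads snsr-eval output on standard input and prints the phrase and its duration in ms. It assumes detection lines have exactly the three fields shown in the first example and skips everything else.

```shell
# Hypothetical helper: read snsr-eval default output on stdin and print
# "<phrase> <duration in ms>" for every detection line.
# Assumption: detection lines look like "<start> <end> <phrase>";
# all other lines (for example ^C) are ignored.
detection_durations() {
    awk 'NF == 3 && $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
        printf "%s %d\n", $3, $2 - $1
    }'
}

# Sample input copied from the first live run in this section.
detection_durations <<'EOF'
1275 1920 voicegenie
5070 5730 voicegenie
10395 10980 voicegenie
^C
EOF
# Prints:
#   voicegenie 645
#   voicegenie 660
#   voicegenie 585
```

With the SDK installed, the same filter can be attached to a real run by piping the output of snsr-eval into detection_durations.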
- To recognize pre-recorded audio, specify an audio file in either RIFF WAV format or as a binary file containing audio samples and no header. Most models require 16-bit LPCM encoding sampled at 16 kHz.

      % snsr-eval -t model/spot-voicegenie-enUS-6.5.1-m.snsr \
          data/audio/voice-genie-set-cruise-control.wav
      2310 2910 voicegenie

  snsr-eval ends once it has processed the entire file.

  If you specify multiple files, snsr-eval concatenates the files in order and evaluates the model on the result. Recognition timestamps reflect this concatenation.

- snsr-eval also runs command sets. These are keyword spotters with more than one active word or phrase, optimized for a low false reject rate.

  Let's inspect the sample music control model vocabulary:

      % snsr-eval -vvt model/spot-music-enUS-1.2.0-m.snsr /dev/null
      Using operating point 17.
      Available operating points: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20.
      Available vocabulary:
      1: "play_music"
      2: "previous_song"
      3: "stop_music"
      4: "next_song"
      5: "pause_music"

  Command sets work just like wake words.
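When several recordings are evaluated in one run, for wake words and command sets alike, the reported timestamps refer to the concatenated audio. The sketch below shows one way to map a detection back to the file it came from. It is an illustration under stated assumptions: locate_detections is a hypothetical helper, the per-file durations in ms must be supplied by you (snsr-eval does not report them in the output shown here), and the concatenated timeline is assumed to be the plain sum of those durations.

```shell
# Hypothetical helper: map timestamps from a multi-file run back to the
# individual files.
# Arguments: the duration of each input file in ms, in command-line order.
# Stdin:     snsr-eval default output ("<start> <end> <phrase>").
# Output:    "<file index> <start in file> <end in file> <phrase>".
locate_detections() {
    awk -v durations="$*" '
        BEGIN { n = split(durations, d, " ") }
        $1 ~ /^[0-9]+$/ && $2 ~ /^[0-9]+$/ {
            offset = 0
            for (i = 1; i <= n; i++) {
                if ($1 < offset + d[i]) break
                offset += d[i]
            }
            if (i > n) next
            printf "%d %d %d %s\n", i, $1 - offset, $2 - offset, $3
        }'
}

# Made-up example: two files of 4000 ms and 3000 ms.
printf '2310 2910 voicegenie\n5000 5600 voicegenie\n' | locate_detections 4000 3000
# Prints:
#   1 2310 2910 voicegenie
#   2 1000 1600 voicegenie
```

Detections that start beyond the total of the stated durations are dropped rather than guessed at.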
- Running an adapting wake word uses the same recipe as regular wake words:

      % snsr-eval -vt model/ca-voicegenie-enUS-1.1.0.snsr
      Using live audio from default capture device. ^C to stop.
      [^adapt-started] on worker thread
      8295 8790 (1 sv) voice_genie
      [^adapt-started] on worker thread
      15645 16245 (1 sv) voice_genie
      16515 [^adapted]
      16515 [^new-user] user1/voice_genie
      [^adapt-started] on worker thread
      20445 21060 (0.9678 sv) user1/voice_genie
      21225 [^adapted]
      ^C

  Note how the result changes once the model has adapted to your speech. If you have a second speaker say "voice genie" a couple of times, you should see user2/voice_genie for their utterances.

  The model adaptations do not persist and will be lost when the model is reloaded. We can address that by specifying a cache-file to hold these. Say "voice genie" a number of times, then stop snsr-eval with ^C:

      % snsr-eval -vt model/ca-voicegenie-enUS-1.1.0.snsr \
          -s cache-file=voice-genie.cache
      Using live audio from default capture device. ^C to stop.
      [^adapt-started] on worker thread
      1485 2070 (1 sv) voice_genie
      [^adapt-started] on worker thread
      4740 5355 (1 sv) voice_genie
      5460 [^adapted]
      5460 [^new-user] user1/voice_genie
      [^adapt-started] on worker thread
      9285 9855 (0.9951 sv) user1/voice_genie
      10140 [^adapted]
      ^C

  snsr-eval loads voice-genie.cache when restarting. Note that the very first result already includes the user1/voice_genie result.
Wake word enrollment
EFT models adapt a specific wake word phrase to a speaker's voice. UDT models create new wake words for arbitrary phrases from a handful of speaker-specific recordings.
- Let's start with UDT. The spot-enroll utility does EFT and UDT enrollment from recordings. Most models require four recordings of the same phrase for optimal performance. UDT models are less likely to trigger on the same phrase said by another speaker.

  Enroll three phrases into a single model:

      % spot-enroll -vt model/udt-universal-3.67.1.0.snsr \
          -o udt-kws.snsr \
          +armadillo-1 data/enrollments/armadillo-1-{0,1,2,3}.wav \
          +armadillo-6 data/enrollments/armadillo-6-{0,1,2,3}.wav \
          +jackalope-1 data/enrollments/jackalope-1-{0,1,2,3}.wav \
          +jackalope-4 data/enrollments/jackalope-4-{0,1,2,3}.wav \
          +terminator-2 data/enrollments/terminator-2-{0,1,2,3}.wav \
          +terminator-6 data/enrollments/terminator-6-{0,1,2,3}.wav
      Adapting: 100% complete.
      Enrolled model saved to "udt-kws.snsr"

  We've just created a new udt-kws.snsr wake word model that spots three different phrases with two unique speakers per phrase. We can run the model with snsr-eval. We'll use different recordings than the ones used for enrollment:

      % snsr-eval -t udt-kws.snsr \
          data/enrollments/armadillo-1-0-c.wav \
          data/enrollments/armadillo-6-0.wav \
          data/enrollments/jackalope-1-0-c.wav \
          data/enrollments/jackalope-4-0.wav \
          data/enrollments/terminator-2-0.wav \
          data/enrollments/terminator-6-0.wav
      330 945 armadillo-1
      4485 5265 armadillo-6
      6795 7320 jackalope-1
      10365 10890 jackalope-4
      13245 13950 terminator-2
      15510 16215 terminator-6

  The model identifies the phrases and the speakers correctly.
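The {0,1,2,3} patterns in the enrollment command rely on brace expansion, which not every shell supports. As a portable alternative, the sketch below builds the same "+tag file file file file" argument group explicitly. enroll_args is a hypothetical helper, and it assumes the recordings follow the naming scheme used above (tag-0.wav through tag-3.wav in one directory).

```shell
# Hypothetical helper: print the "+tag file..." arguments for one
# enrollment tag.
# Assumption: recordings are named <dir>/<tag>-0.wav ... <dir>/<tag>-3.wav.
enroll_args() {
    dir=$1
    tag=$2
    printf '+%s' "$tag"
    for i in 0 1 2 3; do
        printf ' %s/%s-%s.wav' "$dir" "$tag" "$i"
    done
    printf '\n'
}

enroll_args data/enrollments armadillo-1
# Prints (on one line):
#   +armadillo-1 data/enrollments/armadillo-1-0.wav data/enrollments/armadillo-1-1.wav data/enrollments/armadillo-1-2.wav data/enrollments/armadillo-1-3.wav
```

The output can be spliced into a spot-enroll command line with an unquoted command substitution, which works as long as the paths contain no whitespace.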
- EFT models use the same tools and API for enrollment as UDT. We'll use live-enroll in this example. This is the interactive version of spot-enroll, the main difference being that it prompts to repeat recordings that aren't usable instead of reporting an enrollment error.

  Run the example below, then say "hello blue genie" when prompted.

      % live-enroll -vt model/eft-hbg-enUS-23.0.0.9.snsr \
          -o eft-hbg.snsr +user-1
      Say the enrollment phrase (1/4) for "user-1"
      Recording: 4.29 s
      Preliminary enrollment checks passed.
      Say the enrollment phrase (2/4) for "user-1"
      Recording: 3.23 s
      Preliminary enrollment checks passed.
      Say the enrollment phrase (3/4) for "user-1" with context, for example: "<phrase> will it rain tomorrow?"
      Recording: 3.15 s
      Preliminary enrollment checks passed.
      Say the enrollment phrase (4/4) for "user-1" with context, for example: "<phrase> will it rain tomorrow?"
      Recording: 3.90 s
      Preliminary enrollment checks passed.
      Adapting: 100% complete.
      Enrolled model saved to "eft-hbg.snsr"

  As before, test with snsr-eval:

      % snsr-eval -t eft-hbg.snsr
      Using live audio from default capture device. ^C to stop.
      675 1500 (0.8921 sv) user-1/HBG
      3480 4500 (0.8245 sv) user-1/HBG
      7020 8160 (0.877 sv) user-1/HBG
      ^C

  The value in parentheses is the sv-score.
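To get a feel for how consistently an enrolled model scores a speaker, the sv-scores from a test run can be averaged per label. The sketch below is an illustration only: sv_summary is a hypothetical helper, and it assumes result lines have the form "start end (score sv) label" shown in the output above.

```shell
# Hypothetical helper: read snsr-eval output on stdin and print
# "<label> <number of results> <mean sv-score>" for each label.
# Assumption: result lines look like "<start> <end> (<score> sv) <label>".
sv_summary() {
    awk '$4 == "sv)" && $3 ~ /^\(/ {
        score = substr($3, 2) + 0
        count[$5]++
        total[$5] += score
    }
    END {
        for (label in count)
            printf "%s %d %.4f\n", label, count[label], total[label] / count[label]
    }'
}

# Sample input copied from the run above.
sv_summary <<'EOF'
675 1500 (0.8921 sv) user-1/HBG
3480 4500 (0.8245 sv) user-1/HBG
7020 8160 (0.877 sv) user-1/HBG
^C
EOF
# Prints:
#   user-1/HBG 3 0.8645
```

When the input contains several labels, the order of the output lines is not defined; pipe the result through sort if a stable order matters.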
Templates
Templates are models that add new behaviors to wake word, LVCSR, and STT models via composition. Templates have slots that we fill with models of the required type.
- Let's say we would like to reduce false accepts of the music command set used above by requiring commands in it to be prefixed with a low false-accept wake word like "voice genie". We can do this with the tpl-spot-sequential template:

      % snsr-eval -t model/tpl-spot-sequential-1.5.0.snsr \
          -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
          -f 1 model/spot-music-enUS-1.2.0-m.snsr \
          data/audio/voice-genie-music.wav
      5175 5865 play_music

  snsr-eval's -f slot filename option loads the named file into the specified slot. The tpl-spot-sequential documentation lists the slots the template supports, and the types of models it expects for these.

  As expected, the new recognizer spots only the command directly following "voice genie".

  snsr-eval supports on-the-fly model composition, but what if we have code that already works with spot-music-enUS-1.2.0-m.snsr that we don't want to modify? Enter snsr-edit, which supports composition and setting changes and can save the result as a new, self-contained model:

      % snsr-edit -vvt model/tpl-spot-sequential-1.5.0.snsr \
          -o vg-music.snsr \
          -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
          -f 1 model/spot-music-enUS-1.2.0-m.snsr
      Loading "model/tpl-spot-sequential-1.5.0.snsr" as the template model.
      Loading "model/spot-voicegenie-enUS-6.5.1-m.snsr" into setting "0".
      Loading "model/spot-music-enUS-1.2.0-m.snsr" into setting "1".
      Output written to "vg-music.snsr".
      % snsr-eval -t vg-music.snsr data/audio/voice-genie-music.wav
      5175 5865 play_music

  tpl-spot-sequential has a loop setting that changes behavior. Let's give that a try:

      % snsr-eval -t vg-music.snsr -s loop=1 data/audio/voice-genie-music.wav
      5175 5865 play_music
      8055 8790 next_song

  This recognizes the first two music commands, but not "stop music", as the gap between "next song" and "stop music" is more than the listen-window, which is five seconds.
- We can run two keyword spotters simultaneously with tpl-spot-concurrent.
Speech To Text stt
TrulyNatural STT includes support for modern transformer-based end-to-end recognizers suitable for transcription tasks.
- STT models use the same tools and API as wake words. Let's run a sample audio file through stt-enUS-automotive-medium-2.3.15-pnc.snsr with snsr-eval:

      % snsr-eval -t model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
          data/audio/voice-genie-set-cruise-control.wav
      P 2040 2960 God Boice jeni
      P 2240 3560 Voice. Genie said the creek
      P 2240 3960 Voice. Genie set the cruise control
      P 2240 4200 Voice. Genie set the cruise control to
      P 2240 4720 Voice. Genie set the cruise control to further
      P 2240 5200 Voice. Genie set the cruise control to fifty five. Mrs
      P 2240 5600 Voice Genie set the cruise control to fifty five miles back
      P 2240 5800 Voice Genie set the cruise control to fifty five miles per hour
      P 2240 5840 Voice Genie set the cruise control to fifty five miles per hour
      NLU intent: set_cruise_control (0.9969) = voice genie set the cruise control to 55 miles per hour
      NLU entity: number (0.9931) = 55
      NLU entity: speed_unit (0.9942) = miles per hour
      2240 5840 Voice Genie set the cruise control to fifty five miles per hour.

  Partial or interim hypotheses are shown prefixed with P. These provide useful feedback for live transcription tasks, but are less interesting when recognizing from file. You can suppress them by setting partial-result-interval = 0:

      % snsr-eval -t model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
          -s partial-result-interval=0 \
          data/audio/voice-genie-set-cruise-control.wav
      NLU intent: set_cruise_control (0.9969) = voice genie set the cruise control to 55 miles per hour
      NLU entity: number (0.9931) = 55
      NLU entity: speed_unit (0.9942) = miles per hour
      2240 5840 Voice Genie set the cruise control to fifty five miles per hour.

- STT and LVCSR models only produce a final recognition hypothesis at end-of-file, or when a VAD signals that speech has ended. For convenience, snsr-eval has an -a option that adds the tpl-vad-lvcsr template if you're using live audio and specify an STT model that does not include a VAD.

      % snsr-eval -lt model/stt-enUS-automotive-medium-2.3.15-pnc.snsr
      ERROR: With live audio LVCSR and STT models require a VAD. You can add one with the -a flag.
      % snsr-eval -alt model/stt-enUS-automotive-medium-2.3.15-pnc.snsr
      NLU intent: no_command (0.9995) = the quick brown fox jumped over the lazy dog's back
      The quick Brown Fox jumped over the lazy dog's back.
      ^C

  Create a new model that includes a VAD:

      % snsr-edit -vvo vad-stt.snsr \
          -t model/tpl-vad-lvcsr-3.17.0.snsr \
          -f 0 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr
      Loading "model/tpl-vad-lvcsr-3.17.0.snsr" as the template model.
      Loading "model/stt-enUS-automotive-medium-2.3.15-pnc.snsr" into setting "0".
      Output written to "vad-stt.snsr".
      % snsr-eval -lt vad-stt.snsr
      NLU intent: no_command (0.9995) = the quick brown fox jumped over the lazy dog's back
      The quick Brown Fox jumped over the lazy dog's back.

- Creating an STT model that's gated by a wake word is also easy with tpl-opt-spot-vad-lvcsr:

      % snsr-edit -vvo vg-vad-stt.snsr \
          -t model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr \
          -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
          -f 1 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
          -s include-wake-word-audio=1
      Loading "model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr" as the template model.
      Loading "model/spot-voicegenie-enUS-6.5.1-m.snsr" into setting "0".
      Loading "model/stt-enUS-automotive-medium-2.3.15-pnc.snsr" into setting "1".
      Output written to "vg-vad-stt.snsr".

  include-wake-word-audio = 1 includes the wake word in the audio seen by the STT recognizer, but configures the STT to elide this from the recognition hypothesis. This improves recognition accuracy if there's no pause between the wake word and the STT command.

  Test with snsr-eval:

      % snsr-eval -lt vg-vad-stt.snsr \
          data/audio/voice-genie-set-cruise-control.wav
      NLU intent: set_cruise_control (0.9968) = set the cruise control to 55 miles per hour
      NLU entity: number (0.9937) = 55
      NLU entity: speed_unit (0.9936) = miles per hour
      Set the cruise control to fifty five miles per hour.

  Test with snsr-eval and live audio:

      % snsr-eval -t vg-vad-stt.snsr
      P 7180 7340 Said
      P 7220 7860 Said the radio
      P 7220 8220 Set the radio tonight
      P 7260 8580 Set the radio to ninety one
      P 7260 9060 Set the radio to ninety one point. Three
      P 7260 9260 Set the radio to ninety one point. Five
      P 7260 9620 Set the radio to ninety one point. Five f
      NLU intent: set_radio (0.9674) = set the radio to 91.5 FM
      NLU entity: radio_station (0.9688) = 91.5 FM
      7260 9860 Set the radio to ninety one point. Five F. M.
      ^C
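The NLU lines in the STT output follow a regular pattern (kind, name, score in parentheses, then the value after an equals sign), so a script can turn them into fields. The sketch below is not an SDK tool: nlu_fields is a hypothetical helper that prints one tab-separated record per NLU line and ignores transcripts and partial hypotheses. It assumes the line format shown in the examples above.

```shell
# Hypothetical helper: read snsr-eval output on stdin and print one
# tab-separated record per NLU line: kind, name, score, value.
# Assumption: NLU lines look like
#   "NLU intent: <name> (<score>) = <value>" or
#   "NLU entity: <name> (<score>) = <value>".
nlu_fields() {
    awk '$1 == "NLU" && ($2 == "intent:" || $2 == "entity:") {
        kind = $2
        sub(/:$/, "", kind)
        score = $4
        gsub(/[()]/, "", score)
        value = $0
        sub(/^[^=]*= /, "", value)
        printf "%s\t%s\t%s\t%s\n", kind, $3, score, value
    }'
}

# Sample input copied from the file-based test above.
nlu_fields <<'EOF'
NLU intent: set_cruise_control (0.9968) = set the cruise control to 55 miles per hour
NLU entity: number (0.9937) = 55
NLU entity: speed_unit (0.9936) = miles per hour
Set the cruise control to fifty five miles per hour.
EOF
```

For the sample input this prints three records: the intent with its normalized text, followed by the number and speed_unit entities. An application built on the SDK would more likely read these values through the API; parsing the text output is only a convenience for quick experiments.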
LVCSR tnl
Use grammar-based recognition for command and control tasks with small- to medium-sized vocabularies on devices that aren't powerful enough for STT. VoiceHub provides a convenient interface for creating these models.
- Let's build a model that recognizes the example audio files in data/enrollments/. We'll use a grammar file from data/grammars/:

      % snsr-edit -vvo commands.snsr \
          -t model/lvcsr-build-enUS-2.7.3.snsr \
          -f grammar-stream data/grammars/enrollments-nlu-slot.txt \
          -s partial-result-interval=0
      Loading "model/lvcsr-build-enUS-2.7.3.snsr" as the template model.
      Loading "data/grammars/enrollments-nlu-slot.txt" into setting "grammar-stream".
      Output written to "commands.snsr".
      % snsr-eval -t commands.snsr data/enrollments/armadillo-1-0-c.wav
      NLU intent: calculate (0) = 18 percent of 643
      NLU entity: percent (0) = 18
      NLU entity: number (0) = 643
      375 3195 armadillo 18 percent of 643

  We set grammar-stream to the contents of data/grammars/enrollments-nlu-slot.txt to define which sentences the recognizer will accept, and partial-result-interval = 0 to suppress interim results.

  data/grammars/enrollments-nlu-slot.txt:

      # LVCSR grammar specification for test utterances in data/enrollments/
      # Includes lightweight NLU slot markup.
      #
      # In a tpl-spot-vad-lvcsr pipeline the prefix would be consumed by the spotter.
      prefix = armadillo | jackalope | terminator;

      # Numbers used in the intent rule below
      number = 18 | 643 | 20 | 6;

      # Places
      place = target | winco | susan's house | gas;

      # Dates
      date = friday | tomorrow | next week;

      # List of known utterances in the *-c.wav files.
      intent =
        {calculate {percent $number} percent of {number}} |
        {call call the nearest {place}} |
        {navigate how far away is {place}} |
        {avcontrol {action play} {type more} songs by this artist} |
        {avcontrol {action record} a {type video}} |
        {startTimer start a timer for {number} minutes {unit :minutes}} |
        {navigate i'm running low on {place}} |
        {calendar {action cancel} all my {type meetings} on {date}} |
        {navigate directions to {place}} |
        {messaging do i have any new texts {action :query} {type :texts}} |
        {calendar {type open} my calendar to {date}} |
        {alarm {type set} an alarm for {number} am {date}};

      # Match the prefix and zero or one of the sentences.
      # <s> and </s> are sentence start and end markers that
      # match silence and small amounts of extraneous speech.
      g = <s> $prefix $intent? </s>;

- Let's combine the wake word we previously enrolled with this LVCSR model and tpl-opt-spot-vad-lvcsr:

      % snsr-edit -vvo ww-commands.snsr \
          -t model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr \
          -f 0 udt-kws.snsr \
          -f 1 commands.snsr \
          -s include-wake-word-audio=1
      Loading "model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr" as the template model.
      Loading "udt-kws.snsr" into setting "0".
      Loading "commands.snsr" into setting "1".
      Output written to "ww-commands.snsr".

  And then run it over all the enrollment recordings:

      % snsr-eval -t ww-commands.snsr data/enrollments/*-c.wav
      NLU intent: calculate (0) = 18 percent of 643
      NLU entity: percent (0) = 18
      NLU entity: number (0) = 643
      375 3195 armadillo 18 percent of 643
      NLU intent: call (0) = call the nearest target
      NLU entity: place (0) = target
      4695 6360 armadillo call the nearest target
      NLU intent: navigate (0) = how far away is winco
      NLU entity: place (0) = winco
      7680 9495 armadillo how far away is winco
      NLU intent: avcontrol (0) = record a video
      NLU entity: action (0) = record
      NLU entity: type (0) = video
      14535 16095 armadillo record a video
      NLU intent: startTimer (0) = start a timer for 20 minutes minutes
      NLU entity: number (0) = 20
      NLU entity: unit (0) = minutes
      17640 19905 armadillo start a timer for 20 minutes minutes
      NLU intent: navigate (0) = i'm running low on gas
      NLU entity: place (0) = gas
      21060 22935 jackalope i'm running low on gas
      NLU intent: calendar (0) = cancel all my meetings on friday
      NLU entity: action (0) = cancel
      NLU entity: type (0) = meetings
      NLU entity: date (0) = friday
      24315 26655 jackalope cancel all my meetings on friday
      NLU intent: navigate (0) = directions to susan's house
      NLU entity: place (0) = susan's house
      27915 30195 jackalope directions to susan's house
      NLU intent: messaging (0) = do i have any new texts query texts
      NLU entity: action (0) = query
      NLU entity: type (0) = texts
      31500 33525 jackalope do i have any new texts query texts
      NLU intent: calendar (0) = open my calendar to next week
      NLU entity: type (0) = open
      NLU entity: date (0) = next week
      34695 36975 jackalope open my calendar to next week
      NLU intent: alarm (0) = set an alarm for 6 am tomorrow
      NLU entity: type (0) = set
      NLU entity: number (0) = 6
      NLU entity: date (0) = tomorrow
      38160 40665 jackalope set an alarm for 6 am tomorrow
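For a batch run like the one above, a per-intent tally is a quick sanity check that every grammar rule was exercised. The sketch below is an illustration only: intent_counts is a hypothetical helper that counts "NLU intent:" lines by intent name, assuming the output format shown above.

```shell
# Hypothetical helper: read snsr-eval output on stdin and print
# "<intent> <count>" for every intent seen, sorted by name.
# Assumption: intents are reported as "NLU intent: <name> (<score>) = ...".
intent_counts() {
    awk '$1 == "NLU" && $2 == "intent:" { seen[$3]++ }
         END { for (name in seen) printf "%s %d\n", name, seen[name] }' |
    LC_ALL=C sort
}

# Sample input: three results copied from the batch run above.
intent_counts <<'EOF'
NLU intent: navigate (0) = how far away is winco
NLU entity: place (0) = winco
7680 9495 armadillo how far away is winco
NLU intent: navigate (0) = i'm running low on gas
NLU entity: place (0) = gas
21060 22935 jackalope i'm running low on gas
NLU intent: alarm (0) = set an alarm for 6 am tomorrow
38160 40665 jackalope set an alarm for 6 am tomorrow
EOF
# Prints:
#   alarm 1
#   navigate 2
```

Sorting with LC_ALL=C keeps the order stable across locales, which matters for names with mixed case such as startTimer.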