tpl-opt-spot-vad-lvcsr¶
This template optionally runs the wake word spotter in slot 0 until it detects the wake word, then segments the audio that follows with a VAD and sends the segmented audio to the LVCSR or STT recognizer in slot 1.
The slot setting controls whether tpl-opt-spot-vad-lvcsr waits for the wake word:
- With slot == 0 it waits for the wake word before starting the VAD. In this mode the behavior is that of tpl-spot-vad-lvcsr.
- With slot == 1 it starts the VAD immediately and the behavior is that of tpl-vad-lvcsr.
You can change slot at runtime. Use this to gate only the first of a series of commands with a wake word.
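For example, a host application could flip the gate with a single setting call. This is a minimal sketch, assuming the snsrSetInt call and the string-valued setting keys of the SNSR C API used by the SDK samples; the helper name is illustrative.

```c
#include <snsr.h>

/* Illustrative helper: choose whether the next interaction is gated
 * by the wake word. slot is an ordinary runtime setting, so this can
 * also be called from inside an event handler. */
static void requireWakeWord(SnsrSession s, int required) {
  /* slot == 0 waits for the wake word; slot == 1 starts the VAD at once. */
  snsrSetInt(s, "slot", required ? 0 : 1);
}
```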
tpl-spot-vad-lvcsr has task-type == phrasespot.
Expected task types:
- Slot 0: phrasespot
- Slot 1: lvcsr
Template file: tpl-opt-spot-vad-lvcsr-1.25.0.snsr. Related templates: tpl-spot-vad-lvcsr, tpl-vad-lvcsr.
Operation¶
flowchart TD
start((start))
slotCheck0{slot == 0?}
start --> slotCheck0
slotCheck0 -->|yes| startWW
slotCheck0 -->|no| fetch0
subgraph slot0["**slot 0** (phrasespot)"]
startWW((start))
fetchWW[/samples from ->audio-pcm/]
audioWW(^sample-count)
processWW[process]
result(0.^result)
stopWW((stop))
startWW --> fetchWW
fetchWW --> audioWW
audioWW --> processWW
processWW --> fetchWW
processWW -->|recognize| result
result --> stopWW
end
subgraph slot1["**slot 1** (lvcsr)"]
startSTT((start))
startSTTfinal((start))
stopSTT((stop))
stopSTTpartial((stop))
processSTT[process]
partialSTT(^result-partial)
intentSTT(^nlu-intent)
slotSTT(^nlu-slot)
resultSTT(^result)
nluSTT{NLU<br>match?}
slmSTT{SLM<br>included?}
generateSTT[generate]
slmstartSTT(^slm-start)
slmresultpartialSTT(^slm-result-partial)
slmresultSTT(^slm-result)
startSTT --> processSTT
processSTT ---->|hypothesis| partialSTT
partialSTT --> stopSTTpartial
startSTTfinal --> nluSTT
nluSTT -->|yes| intentSTT
nluSTT -->|no| resultSTT
intentSTT --> slotSTT
slotSTT --> resultSTT
slotSTT -->|more| intentSTT
resultSTT --> slmSTT
slmSTT -->|yes| slmstartSTT
slmSTT -->|no| stopSTT
slmstartSTT -->|OK| generateSTT
slmstartSTT -->|STOP| stopSTT
generateSTT -->|response| slmresultpartialSTT
slmresultpartialSTT --> generateSTT
generateSTT -->|done| slmresultSTT
slmresultSTT --> stopSTT
end
listenBegin(^listen-begin)
listenEnd(^listen-end)
stopWW --> listenBegin
listenBegin --> fetch0
fetch0[/samples from ->audio-pcm/]
fetch1[/samples from ->audio-pcm/]
audio0(^sample-count)
audio1(^sample-count)
silence(^silence)
begin(^begin)
END(^end)
limit(^limit)
process0[VAD process]
process1[VAD process]
final@{ shape: f-circ }
slotCheck1{slot == 0?}
fetch0 --> audio0
audio0 --> process0
process0 --> fetch0
process0 -->|speech start| begin
process0 -->|timeout| silence
silence ~~~ final
silence --> slotCheck1
begin --> fetch1
fetch1 --> audio1
audio1 --> process1
process1 --> startSTT
stopSTTpartial --> fetch1
process1 -->|speech end| END
process1 -->|speech limit| limit
END --> final
limit --> final
final --> startSTTfinal
stopSTT --> slotCheck1
slotCheck1 -->|no| fetch0
slotCheck1 -->|yes| listenEnd
listenEnd --> startWW

1. Read audio data from ->audio-pcm.
2. Invoke ^sample-count.
3. If processing does not detect a wake word, continue at step 1.
4. Invoke 0.^result for the wake word.
5. Invoke ^listen-begin and start VAD processing.
6. Read audio data from ->audio-pcm.
7. Invoke ^sample-count.
8. If VAD processing does not detect the start of speech within the leading-silence timeout, invoke ^silence and continue at step 15.
9. If processing detects the start of speech, invoke ^begin, else continue at step 6.
10. Read audio data from ->audio-pcm.
11. Invoke ^sample-count.
12. If VAD processing detects an endpoint, invoke either ^limit or ^end and continue at step 14.
13. Process the VAD-segmented audio in the LVCSR or STT recognizer.
    - Invoke ^result-partial with an interim recognition result hypothesis.
    - Continue at step 10.
14. Produce a final LVCSR or STT recognition hypothesis.
    - Invoke ^nlu-intent and ^nlu-slot for each NLU intent found.
    - Invoke ^result with the final recognition hypothesis.
    - If there is no SLM, continue at step 15.
    - Invoke ^slm-start; if the callback returns STOP, continue at step 15.
    - Generate the SLM result, invoking ^slm-result-partial on each generated token.
    - Invoke ^slm-result with the complete SLM result.
15. Invoke ^listen-end and start listening for the wake word again at step 1.
Register callback handlers with setHandler only for those events you're interested in.
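A minimal C host program might look like the sketch below. It assumes the SnsrSession C API used in the SDK samples (snsrNew, snsrLoad, snsrSetHandler, snsrSetStream, snsrRun), that the combined model opt-vg-stt.snsr from the Examples section has been built, and that ->audio-pcm accepts raw 16-bit PCM from a byte stream; the audio.raw file name and the "text" result key are illustrative, and error handling is omitted. For live microphone capture, see the live-spot.c sample.

```c
#include <stdio.h>
#include <snsr.h>

/* Final recognition result from the LVCSR or STT task in slot 1. */
static SnsrRC onResult(SnsrSession s, const char *key, void *privateData) {
  const char *text = NULL;
  if (snsrGetString(s, "text", &text) == SNSR_RC_OK && text)
    printf("Final result: %s\n", text);
  return SNSR_RC_OK;
}

int main(void) {
  SnsrSession s = NULL;
  snsrNew(&s);
  /* Load the combined model built with snsr-edit (see Examples below). */
  snsrLoad(s, snsrStreamFromFileName("opt-vg-stt.snsr", "r"));
  /* Register handlers only for the events this application needs. */
  snsrSetHandler(s, "^result", snsrCallback(onResult, NULL, NULL));
  /* Feed audio into the ->audio-pcm channel; here an assumed raw
   * 16-bit PCM file at the model's sample rate. */
  snsrSetStream(s, "->audio-pcm", snsrStreamFromFileName("audio.raw", "r"));
  snsrRun(s);      /* Blocks while the pipeline runs; handlers fire on events. */
  snsrRelease(s);
  return 0;
}
```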
Settings¶
- Events: ^begin, ^end, ^limit, ^listen-begin, ^listen-end, ^nlu-intent, ^nlu-slot, ^result, ^result-partial, ^sample-count, ^silence, ^slm-result, ^slm-result-partial, ^slm-start
- Iterators: operating-point-iterator, vocab-iterator
- Streams: audio-stream, audio-stream-first, audio-stream-last
- Audio input: ->audio-pcm, audio-stream-from, audio-stream-to
- Settings: audio-stream-size, backoff, custom-vocab, delay, duration-ms, hold-over, include-leading-silence, include-wake-word-audio, leading-silence, low-fr-operating-point, max-recording, operating-point, partial-result-interval, samples-per-second, slot, stt-profile, sv-threshold
- Sample code: live-spot.c, snsr-eval.c, PhraseSpot.java
Notes¶
Use this template for command-and-control applications where commands are initiated with a wake word in some contexts and not in others.
We recommend that you set slot=1 in the ^result handler and slot=0 in the ^silence handler, as sketched below. With this configuration the recognizer requires a wake word only for the first in a series of interactions. After that it reverts to requiring a wake word only if the user says nothing for at least leading-silence ms.
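A sketch of that configuration, assuming the same SNSR C event-handler signature as in the Operation example above; the function names are illustrative.

```c
#include <snsr.h>

/* A command was recognized: the next interaction may skip the wake word. */
static SnsrRC onResult(SnsrSession s, const char *key, void *privateData) {
  snsrSetInt(s, "slot", 1);
  return SNSR_RC_OK;
}

/* Nothing was said for leading-silence ms: require the wake word again. */
static SnsrRC onSilence(SnsrSession s, const char *key, void *privateData) {
  snsrSetInt(s, "slot", 0);
  return SNSR_RC_OK;
}

/* Register both handlers on a session that already has a model loaded. */
static void gateFirstInteractionOnly(SnsrSession s) {
  snsrSetHandler(s, "^result", snsrCallback(onResult, NULL, NULL));
  snsrSetHandler(s, "^silence", snsrCallback(onSilence, NULL, NULL));
}
```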
VAD settings backoff, hold-over, leading-silence, max-recording, and trailing-silence apply to both slot 0 and slot 1, but include-leading-silence applies only to slot 0.
Set include-wake-word-audio=1 to include the wake word audio in the samples passed to the LVCSR or STT recognizer. STT hypotheses do not include the wake word text unless Sensory specifically configured the model to do so.
The ^result-partial and ^result events are for the LVCSR or STT recognizer in slot 1. If you need direct access to the wake word result, prefix the event with the slot path: 0.^result. Use the slot prefix to read values in the 0.^result event handler too; for example, call getString with the key 0.text to read the wake word transcription, as in the sketch below.
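A minimal sketch of such a handler, again assuming the SNSR C API from the SDK samples; the handler name is illustrative.

```c
#include <stdio.h>
#include <snsr.h>

/* Wake word handler, registered for the slot-prefixed 0.^result event.
 * Values are read with slot-prefixed keys such as 0.text. */
static SnsrRC onWakeWord(SnsrSession s, const char *key, void *privateData) {
  const char *text = NULL;
  if (snsrGetString(s, "0.text", &text) == SNSR_RC_OK && text)
    printf("Wake word: %s\n", text);
  return SNSR_RC_OK;
}

/* Registration:
 * snsrSetHandler(s, "0.^result", snsrCallback(onWakeWord, NULL, NULL)); */
```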
Examples¶
% cd ~/Sensory/TrulyNaturalSDK/7.6.1
% bin/snsr-edit -o opt-vg-stt.snsr \
    -t model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr \
    -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
    -f 1 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    -s include-wake-word-audio=1
# Say "Voice genie, open the sunroof."
% snsr-eval -vt opt-vg-stt.snsr
Using live audio from default capture device. ^C to stop.
P 33010 33490 (0.3201) Open the sun
P 33050 33890 (0.7712) Open the sunroof
32010 34185 [^end] VAD speech region.
NLU intent: open_window (0.9956) = open the sunroof
NLU entity: roof (0.9595) = sunroof
33050 33890 (0.5731) Open the sunroof.
^C
# Select the VAD-only path with slot=1
# Say "Close all the windows"
% snsr-eval -vt opt-vg-stt.snsr -s slot=1
Using live audio from default capture device. ^C to stop.
P 2150 2670 (0.257) Clothes. All
P 2190 3150 (0.7631) Close. All the wind
P 2190 3430 (0.9899) Close all the windows
1950 3855 [^end] VAD speech region.
NLU intent: close_window (0.9977) = close all the windows
2190 3470 (0.9244) Close all the windows.
^C