tpl-opt-spot-vad-lvcsr¶
This template optionally runs the wake word spotter in slot 0 until it detects the wake word, then segments the audio that follows with a VAD and sends the segmented audio to the LVCSR or STT recognizer in slot 1.
The slot setting controls whether tpl-opt-spot-vad-lvcsr waits for the wake word:
- With slot == 0 it waits for the wake word before starting the VAD. In this mode the behavior is that of tpl-spot-vad-lvcsr.
- With slot == 1 it starts the VAD immediately and the behavior is that of tpl-vad-lvcsr.
You can change slot at runtime. Use this to gate only the first of a series of commands with a wake word.
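For example, a host application could flip the gate with a single setting call. This is a minimal sketch, assuming the snsrSetInt call and the string-valued setting keys of the SNSR C API used by the SDK samples; the helper name is illustrative.

```c
#include <snsr.h>

/* Illustrative helper: choose whether the next interaction is gated
 * by the wake word. slot is an ordinary runtime setting, so this can
 * also be called from inside an event handler. */
static void requireWakeWord(SnsrSession s, int required) {
  /* slot == 0 waits for the wake word; slot == 1 starts the VAD at once. */
  snsrSetInt(s, "slot", required ? 0 : 1);
}
```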
tpl-spot-vad-lvcsr has task-type == phrasespot.
Expected task types:
- Slot 0: phrasespot
- Slot 1: lvcsr
Template file: tpl-opt-spot-vad-lvcsr-1.25.0.snsr. Related templates: tpl-spot-vad-lvcsr, tpl-vad-lvcsr.
Operation¶
flowchart TD
start((start))
slotCheck0{slot == 0?}
start --> slotCheck0
slotCheck0 -->|yes| startWW
slotCheck0 -->|no| fetch0
subgraph slot0["**slot 0** (phrasespot)"]
startWW((start))
fetchWW[/samples from ->audio-pcm/]
audioWW(^sample-count)
processWW[process]
result(0.^result)
stopWW((stop))
startWW --> fetchWW
fetchWW --> audioWW
audioWW --> processWW
processWW --> fetchWW
processWW -->|recognize| result
result --> stopWW
end
subgraph slot1["**slot 1** (lvcsr)"]
startSTT((start))
startSTTfinal((start))
stopSTT((stop))
stopSTTpartial((stop))
processSTT[process]
partialSTT(^result-partial)
intentSTT(^nlu-intent)
slotSTT(^nlu-slot)
resultSTT(^result)
nluSTT{NLU<br>match?}
slmSTT{SLM<br>included?}
generateSTT[generate]
slmstartSTT(^slm-start)
slmresultpartialSTT(^slm-result-partial)
slmresultSTT(^slm-result)
startSTT --> processSTT
processSTT ---->|hypothesis| partialSTT
partialSTT --> stopSTTpartial
startSTTfinal --> nluSTT
nluSTT -->|yes| intentSTT
nluSTT -->|no| resultSTT
intentSTT --> slotSTT
slotSTT --> resultSTT
slotSTT -->|more| intentSTT
resultSTT --> slmSTT
slmSTT -->|yes| slmstartSTT
slmSTT -->|no| stopSTT
slmstartSTT -->|OK| generateSTT
slmstartSTT -->|STOP| stopSTT
generateSTT -->|response| slmresultpartialSTT
slmresultpartialSTT --> generateSTT
generateSTT -->|done| slmresultSTT
slmresultSTT --> stopSTT
end
listenBegin(^listen-begin)
listenEnd(^listen-end)
stopWW --> listenBegin
listenBegin --> fetch0
fetch0[/samples from ->audio-pcm/]
fetch1[/samples from ->audio-pcm/]
audio0(^sample-count)
audio1(^sample-count)
silence(^silence)
begin(^begin)
END(^end)
limit(^limit)
process0[VAD process]
process1[VAD process]
final@{ shape: f-circ }
slotCheck1{slot == 0?}
fetch0 --> audio0
audio0 --> process0
process0 --> fetch0
process0 -->|speech start| begin
process0 -->|timeout| silence
silence ~~~ final
silence --> slotCheck1
begin --> fetch1
fetch1 --> audio1
audio1 --> process1
process1 --> startSTT
stopSTTpartial --> fetch1
process1 -->|speech end| END
process1 -->|speech limit| limit
END --> final
limit --> final
final --> startSTTfinal
stopSTT --> slotCheck1
slotCheck1 -->|no| fetch0
slotCheck1 -->|yes| listenEnd
listenEnd --> startWW

1. Read audio data from ->audio-pcm.
2. Invoke ^sample-count.
3. If processing does not detect a wake word, continue at step 1.
4. Invoke 0.^result for the wake word.
5. Invoke ^listen-begin and start VAD processing.
6. Read audio data from ->audio-pcm.
7. Invoke ^sample-count.
8. If VAD processing does not detect the start of speech within the leading-silence timeout, invoke ^silence and continue at step 15.
9. If processing detects the start of speech, invoke ^begin, else continue at step 6.
10. Read audio data from ->audio-pcm.
11. Invoke ^sample-count.
12. If VAD processing detects an endpoint, invoke either ^limit or ^end and continue at step 14.
13. Process the VAD-segmented audio in the LVCSR or STT recognizer.
    - Invoke ^result-partial with an interim recognition result hypothesis.
    - Continue at step 10.
14. Produce a final LVCSR or STT recognition hypothesis.
    - Invoke ^nlu-intent and ^nlu-slot for each NLU intent found.
    - Invoke ^result with the final recognition hypothesis.
    - If there is no SLM, continue at step 15.
    - Invoke ^slm-start; if the callback returns STOP, continue at step 15.
    - Generate the SLM result, invoking ^slm-result-partial on each generated token.
    - Invoke ^slm-result with the complete SLM result.
15. Invoke ^listen-end and start listening for the wake word again at step 1.
Register callback handlers with setHandler only for those events you're interested in.
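A minimal C host program might look like the sketch below. It assumes the SnsrSession C API used in the SDK samples (snsrNew, snsrLoad, snsrSetHandler, snsrSetStream, snsrRun), that the combined model opt-vg-stt.snsr from the Examples section has been built, and that ->audio-pcm accepts raw 16-bit PCM from a byte stream; the audio.raw file name and the "text" result key are illustrative, and error handling is omitted. For live microphone capture, see the live-spot.c sample.

```c
#include <stdio.h>
#include <snsr.h>

/* Final recognition result from the LVCSR or STT task in slot 1. */
static SnsrRC onResult(SnsrSession s, const char *key, void *privateData) {
  const char *text = NULL;
  if (snsrGetString(s, "text", &text) == SNSR_RC_OK && text)
    printf("Final result: %s\n", text);
  return SNSR_RC_OK;
}

int main(void) {
  SnsrSession s = NULL;
  snsrNew(&s);
  /* Load the combined model built with snsr-edit (see Examples below). */
  snsrLoad(s, snsrStreamFromFileName("opt-vg-stt.snsr", "r"));
  /* Register handlers only for the events this application needs. */
  snsrSetHandler(s, "^result", snsrCallback(onResult, NULL, NULL));
  /* Feed audio into the ->audio-pcm channel; here an assumed raw
   * 16-bit PCM file at the model's sample rate. */
  snsrSetStream(s, "->audio-pcm", snsrStreamFromFileName("audio.raw", "r"));
  snsrRun(s);      /* Blocks while the pipeline runs; handlers fire on events. */
  snsrRelease(s);
  return 0;
}
```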
Settings¶
- Events: ^begin, ^end, ^limit, ^listen-begin, ^listen-end, ^nlu-intent, ^nlu-slot, ^result, ^result-partial, ^sample-count, ^silence, ^slm-result, ^slm-result-partial, ^slm-start
- Iterators: operating-point-iterator, vocab-iterator
- Streams: audio-stream, audio-stream-first, audio-stream-last
- Audio input: ->audio-pcm, audio-stream-from, audio-stream-to
- Settings: audio-stream-size, backoff, custom-vocab, delay, duration-ms, hold-over, include-leading-silence, include-wake-word-audio, leading-silence, low-fr-operating-point, max-recording, operating-point, partial-result-interval, samples-per-second, slot, stt-profile, sv-threshold
- Sample code: live-spot.c, snsr-eval.c, PhraseSpot.java
Notes¶
Use this template for command-and-control applications where commands are initiated with a wake word in some contexts and not in others.
We recommend that you set slot=1 in the ^result handler and slot=0 in the ^silence handler, as sketched below. With this configuration the recognizer requires a wake word only for the first in a series of interactions. After that it reverts to requiring a wake word only if the user says nothing for at least leading-silence ms.
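A sketch of that configuration, assuming the same SNSR C event-handler signature as in the Operation example above; the function names are illustrative.

```c
#include <snsr.h>

/* A command was recognized: the next interaction may skip the wake word. */
static SnsrRC onResult(SnsrSession s, const char *key, void *privateData) {
  snsrSetInt(s, "slot", 1);
  return SNSR_RC_OK;
}

/* Nothing was said for leading-silence ms: require the wake word again. */
static SnsrRC onSilence(SnsrSession s, const char *key, void *privateData) {
  snsrSetInt(s, "slot", 0);
  return SNSR_RC_OK;
}

/* Register both handlers on a session that already has a model loaded. */
static void gateFirstInteractionOnly(SnsrSession s) {
  snsrSetHandler(s, "^result", snsrCallback(onResult, NULL, NULL));
  snsrSetHandler(s, "^silence", snsrCallback(onSilence, NULL, NULL));
}
```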
VAD settings backoff, hold-over, leading-silence, max-recording, and trailing-silence apply to both slot 0 and slot 1, but include-leading-silence applies only to slot 0.
Set include-wake-word-audio=1 to include the wake word audio in the samples passed to the LVCSR or STT recognizer. STT hypotheses do not include the wake word text unless Sensory specifically configured the model to do so.
The ^result-partial and ^result events are for the LVCSR or STT recognizer in slot 1. If you need direct access to the wake word result, prefix the event with the slot path: 0.^result. Use the slot prefix to read values in the 0.^result event handler too; for example, call getString with the key 0.text to read the wake word transcription, as in the sketch below.
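A minimal sketch of such a handler, again assuming the SNSR C API from the SDK samples; the handler name is illustrative.

```c
#include <stdio.h>
#include <snsr.h>

/* Wake word handler, registered for the slot-prefixed 0.^result event.
 * Values are read with slot-prefixed keys such as 0.text. */
static SnsrRC onWakeWord(SnsrSession s, const char *key, void *privateData) {
  const char *text = NULL;
  if (snsrGetString(s, "0.text", &text) == SNSR_RC_OK && text)
    printf("Wake word: %s\n", text);
  return SNSR_RC_OK;
}

/* Registration:
 * snsrSetHandler(s, "0.^result", snsrCallback(onWakeWord, NULL, NULL)); */
```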
Examples¶
% cd ~/Sensory/TrulyNaturalSDK/7.6.1
% bin/snsr-edit -o opt-vg-stt.snsr \
    -t model/tpl-opt-spot-vad-lvcsr-1.25.0.snsr \
    -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
    -f 1 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
    -s include-wake-word-audio=1
# Say "Voice genie, open the sunroof."
% snsr-eval -vt opt-vg-stt.snsr
Using live audio from default capture device. ^C to stop.
P 33010 33490 (0.3201) Open the sun
P 33050 33890 (0.7712) Open the sunroof
32010 34185 [^end] VAD speech region.
NLU intent: open_window (0.9956) = open the sunroof
NLU entity: roof (0.9595) = sunroof
33050 33890 (0.5731) Open the sunroof.
^C
# Select the VAD-only path with slot=1
# Say "Close all the windows"
% snsr-eval -vt opt-vg-stt.snsr -s slot=1
Using live audio from default capture device. ^C to stop.
P 2150 2670 (0.257) Clothes. All
P 2190 3150 (0.7631) Close. All the wind
P 2190 3430 (0.9899) Close all the windows
1950 3855 [^end] VAD speech region.
NLU intent: close_window (0.9977) = close all the windows
2190 3470 (0.9244) Close all the windows.
^C