tpl-spot-vad-lvcsr¶
This template runs the wake word spotter in slot 0 until it detects the wake word, segments the audio that follows with a VAD, and sends the segmented audio to the LVCSR or STT recognizer in slot 1.
This behavior is also available in the tpl-opt-spot-vad-lvcsr template, which adds an option to skip the wake word.
tpl-spot-vad-lvcsr has task-type==phrasespot.
Expected task types:
- Slot 0: phrasespot
- Slot 1: lvcsr
Model file: tpl-spot-vad-lvcsr-3.20.0.snsr. See also: tpl-opt-spot-vad-lvcsr.
Operation¶
```mermaid
flowchart TD
start((start))
start --> startWW
subgraph slot0["**slot 0** (phrasespot)"]
startWW((start))
fetchWW[/samples from ->audio-pcm/]
audioWW(^sample-count)
processWW[process]
result(0.^result)
stopWW((stop))
startWW --> fetchWW
fetchWW --> audioWW
audioWW --> processWW
processWW --> fetchWW
processWW -->|recognize| result
result --> stopWW
end
subgraph slot1["**slot 1** (lvcsr)"]
startSTT((start))
startSTTfinal((start))
stopSTT((stop))
stopSTTpartial((stop))
processSTT[process]
partialSTT(^result-partial)
intentSTT(^nlu-intent)
slotSTT(^nlu-slot)
resultSTT(^result)
nluSTT{NLU<br>match?}
slmSTT{SLM<br>included?}
generateSTT[generate]
slmstartSTT(^slm-start)
slmresultpartialSTT(^slm-result-partial)
slmresultSTT(^slm-result)
startSTT --> processSTT
processSTT ---->|hypothesis| partialSTT
partialSTT --> stopSTTpartial
startSTTfinal --> nluSTT
nluSTT -->|yes| intentSTT
nluSTT -->|no| resultSTT
intentSTT --> slotSTT
slotSTT --> resultSTT
slotSTT -->|more| intentSTT
resultSTT --> slmSTT
slmSTT -->|yes| slmstartSTT
slmSTT -->|no| stopSTT
slmstartSTT -->|OK| generateSTT
slmstartSTT -->|STOP| stopSTT
generateSTT -->|response| slmresultpartialSTT
slmresultpartialSTT --> generateSTT
generateSTT -->|done| slmresultSTT
slmresultSTT --> stopSTT
end
listenBegin(^listen-begin)
listenEnd(^listen-end)
stopWW --> listenBegin
listenBegin --> fetch0
fetch0[/samples from ->audio-pcm/]
fetch1[/samples from ->audio-pcm/]
audio0(^sample-count)
audio1(^sample-count)
silence(^silence)
begin(^begin)
END(^end)
limit(^limit)
process0[VAD process]
process1[VAD process]
final@{ shape: f-circ }
fetch0 --> audio0
audio0 --> process0
process0 --> fetch0
process0 -->|speech start| begin
process0 -->|timeout| silence
silence ~~~ final
silence --> listenEnd
begin --> fetch1
fetch1 --> audio1
audio1 --> process1
process1 --> startSTT
stopSTTpartial --> fetch1
process1 -->|speech end| END
process1 -->|speech limit| limit
END --> final
limit --> final
final --> startSTTfinal
stopSTT --> listenEnd
listenEnd --> startWW
```

1. Read audio data from ->audio-pcm.
2. Invoke ^sample-count.
3. If processing does not detect a wake word, continue at step 1.
4. Invoke 0.^result for the wake word.
5. Invoke ^listen-begin and start VAD processing.
6. Read audio data from ->audio-pcm.
7. Invoke ^sample-count.
8. If VAD processing does not detect the start of speech within the leading-silence timeout, invoke ^silence and continue at step 15.
9. Invoke ^begin if processing detects the start of speech, else continue at step 6.
10. Read audio data from ->audio-pcm.
11. Invoke ^sample-count.
12. If VAD processing detects an endpoint, invoke either ^limit or ^end and continue at step 14.
13. Process VAD-segmented audio in the LVCSR or STT recognizer:
    - Invoke ^result-partial with an interim recognition result hypothesis.
    - Continue at step 10.
14. Produce a final LVCSR or STT recognition hypothesis:
    - Invoke ^nlu-intent and ^nlu-slot for each NLU intent found.
    - Invoke ^result with the final recognition hypothesis.
    - If there is no SLM, continue at step 15.
    - Invoke ^slm-start; if the callback returns STOP, continue at step 15.
    - Generate the SLM result, invoking ^slm-result-partial on each generated token.
    - Invoke ^slm-result with the complete SLM result.
15. Invoke ^listen-end and start listening for the wake word again at step 1.
Register callback handlers with setHandler only for those events you're interested in.
Settings¶
^begin, ^end, ^limit, ^listen-begin, ^listen-end, ^nlu-intent, ^nlu-slot, ^result, ^result-partial, ^sample-count, ^silence, ^slm-result, ^slm-result-partial, ^slm-start
operating-point-iterator, vocab-iterator
audio-stream, audio-stream-first, audio-stream-last
->audio-pcm, audio-stream-from, audio-stream-to
audio-stream-size, backoff, custom-vocab, delay, duration-ms, hold-over, include-leading-silence, include-wake-word-audio, leading-silence, low-fr-operating-point, max-recording, operating-point, partial-result-interval, samples-per-second, stt-profile, sv-threshold
live-spot.c, snsr-eval.c, PhraseSpot.java
Notes¶
Use this template for command-and-control applications where commands are initiated with a wake word.
The ^result-partial and ^result events are for the LVCSR or STT recognizer in slot 1. If you need direct access to the wake word result, prefix the event with the slot path: 0.^result. Use the slot prefix to read values in the 0.^result event handler too; for example, call getString with the key 0.text to read the wake word transcription.
Set include-wake-word-audio=1 to include the wake word audio in the samples passed to the LVCSR or STT recognizer. STT hypotheses do not include the wake word text unless Sensory specifically configured the model to do so.
Examples¶
% cd ~/Sensory/TrulyNaturalSDK/7.6.1
% bin/snsr-edit -o vg-stt.snsr \
  -t model/tpl-spot-vad-lvcsr-3.20.0.snsr \
  -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr \
  -f 1 model/stt-enUS-automotive-medium-2.3.15-pnc.snsr \
  -s include-wake-word-audio=1
# Say "Voice genie, open the sunroof."
% snsr-eval -vt vg-stt.snsr
Using live audio from default capture device. ^C to stop.
P 2770 3250 (0.4166) Open the sun
P 2810 3650 (0.7161) Open the sunroof
1815 3990 [^end] VAD speech region.
NLU intent: open_window (0.9956) = open the sunroof
NLU entity: roof (0.9595) = sunroof
2810 3690 (0.4394) Open the sunroof.
^C