tpl-spot-vad
This template runs the wake word recognizer in slot 0 until it detects a phrase, then uses a VAD to detect the start point and endpoint of speech in the audio stream that follows the wake word.
tpl-spot-vad has task-type==phrasespot-vad.
Expected task types:
- Slot 0: phrasespot
Operation
flowchart TD
start((start))
start --> startWW
subgraph slot0[**slot 0** (phrasespot)]
startWW((start))
fetchWW[/samples from ->audio-pcm/]
audioWW(^sample-count)
processWW[process]
result(^result)
stopWW((stop))
startWW --> fetchWW
fetchWW --> audioWW
audioWW --> processWW
processWW --> fetchWW
processWW -->|recognize| result
result --> stopWW
end
listenBegin(^listen-begin)
listenEnd(^listen-end)
stopWW --> listenBegin
listenBegin --> fetch0
fetch0[/samples from ->audio-pcm/]
fetch1[/samples from ->audio-pcm/]
audio0(^sample-count)
audio1(^sample-count)
silence(^silence)
begin(^begin)
END(^end)
limit(^limit)
process0[process]
process1[process]
out[\samples to <-audio-pcm\]
final@{ shape: f-circ }
fetch0 --> audio0
audio0 --> process0
process0 --> fetch0
process0 -->|speech start| begin
process0 -->|timeout| silence
silence --> final
begin --> fetch1
fetch1 --> audio1
audio1 --> out
out --> process1
process1 --> fetch1
process1 -->|speech end| END
process1 -->|speech limit| limit
END --> final
limit --> final
final --> listenEnd
listenEnd --> startWW
1. Read audio data from ->audio-pcm.
2. Invoke ^sample-count.
3. If processing detects a vocabulary phrase, skip to step 5.
4. Continue processing until STREAM_END occurs on ->audio-pcm, or one of the event handlers returns a code other than OK.
5. Invoke ^result.
6. Invoke ^listen-begin.
7. Read audio data from ->audio-pcm.
8. Invoke ^sample-count.
9. If speech detected within leading-silence ms, continue at step 12.
10. If no speech detected within leading-silence ms, invoke ^silence and skip to step 19.
11. Continue processing at step 7 until STREAM_END.
12. Invoke ^begin.
13. Read audio data from ->audio-pcm.
14. Invoke ^sample-count.
15. If pass-through == 1, write speech samples to <-audio-pcm.
16. If end detected within max-recording ms, invoke ^end and skip to step 19.
17. If end not detected within max-recording ms, invoke ^limit and skip to step 19.
18. Continue processing at step 13 until STREAM_END.
19. Invoke ^listen-end.
20. Restart at step 1.
Register callback handlers with setHandler only for those events you're interested in.
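The event order described above can be illustrated with a small, self-contained Python model. This is a sketch only: it does not use the SDK, and `run_spot_vad`, the frame labels, the 10 ms frame size, and the timer values are all hypothetical names and numbers chosen for illustration. It also assumes that the leading-silence timer starts at ^listen-begin, that the max-recording timer starts at ^begin, and that a single silent frame counts as the end of speech, which a real VAD would not do.

```python
def run_spot_vad(frames, handlers, frame_ms=10,
                 leading_silence_ms=30, max_recording_ms=50,
                 pass_through=True):
    """Toy model of the tpl-spot-vad event flow. This is NOT the SDK API.

    frames:   one label per audio frame: "ww" (wake word recognized on
              this frame), "speech", or "silence".
    handlers: dict mapping an event name to a callable(event, time_ms).
    Returns the frame end times (ms) that would be written to <-audio-pcm.
    """
    def emit(event, t):
        # Events without a registered handler are silently skipped.
        handler = handlers.get(event)
        if handler is not None:
            handler(event, t)

    def end_listening(event, t):
        # Steps 10/16/17 followed by step 19, then back to the wake word.
        emit(event, t)
        emit("^listen-end", t)
        return "spot"

    written = []
    state, mark, t = "spot", 0, 0
    for label in frames:
        t += frame_ms
        emit("^sample-count", t)
        if state == "spot":                      # steps 1-6
            if label == "ww":
                emit("^result", t)
                emit("^listen-begin", t)
                state, mark = "wait-speech", t
        elif state == "wait-speech":             # steps 7-12
            if label == "speech":
                emit("^begin", t)
                state, mark = "in-speech", t
                if pass_through:
                    written.append(t)
            elif t - mark >= leading_silence_ms:
                state = end_listening("^silence", t)
        else:                                    # steps 13-18
            if pass_through:
                written.append(t)
            if label == "silence":
                state = end_listening("^end", t)
            elif t - mark >= max_recording_ms:
                state = end_listening("^limit", t)
    return written


# Register handlers only for the events of interest (no ^sample-count).
log = []
record = lambda event, t: log.append((event, t))
handlers = {name: record for name in
            ("^result", "^listen-begin", "^begin", "^end", "^listen-end")}
frames = ["silence", "ww", "silence", "speech", "speech", "silence", "silence"]
written = run_spot_vad(frames, handlers)
for event, t in log:
    print(t, event)
print("written frames:", written)
```

Because ^sample-count has no handler in this run, it never reaches the log, which mirrors the advice to register handlers only for the events you need.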
Settings
^begin, ^end, ^limit, ^listen-begin, ^listen-end, ^result, ^sample-count, ^silence
operating-point-iterator, vocab-iterator
audio-stream, audio-stream-first, audio-stream-last
->audio-pcm, <-audio-pcm, audio-stream-from, audio-stream-to, dsp-acmodel-stream, dsp-header-stream, dsp-search-stream, skip-to-ms, skip-to-sample
audio-stream-size, backoff, delay, dsp-target, duration-ms, hold-over, include-leading-silence, include-wake-word-audio, leading-silence, listen-window, low-fr-operating-point, max-recording, operating-point, pass-through, samples-per-second, sv-threshold
live-segment.c, snsr-eval.c, segmentSpottedAudio.java
Notes
Use this template to send wake-word-gated audio to cloud engines.
Set include-wake-word-audio = 1 to include the wake word audio in the VAD audio output stream.
This template writes the VAD-segmented audio to <-audio-pcm. If your application does not use this output stream, set pass-through = 0.
Examples
% cd ~/Sensory/TrulyNaturalSDK/7.6.1
% bin/snsr-edit -o vg-vad.snsr\
-t model/tpl-spot-vad-3.10.0.snsr\
-f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr\
-s include-wake-word-audio=1
# Say "Voice genie, what's the capital of Oregon?"
% bin/snsr-eval -o vad-audio.wav -vvt vg-vad.snsr
Using live audio from default capture device. ^C to stop.
Using operating point 8.
Available operating points: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21.
Available vocabulary:
1: "voicegenie"
phrase:
1950 2550 (1) voicegenie
words:
1950 2550 (1) voicegenie
2730 [^listen-begin]
2730 [^begin]
1650 4200 [^end] VAD speech region.
4980 [^listen-end]
^C
Review vad-audio.wav and note that the recording starts backoff ms before the beginning of "voice genie" and continues until hold-over ms after the end of the utterance.
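As a rough cross-check, the timestamps in the log above can be turned into an implied backoff and an expected recording length. The Python sketch below assumes that the two numbers on the ^end line are the start and end of the VAD speech region in milliseconds, and that the audio is sampled at 16000 Hz. Both are assumptions, not values confirmed by the log, so substitute your model's samples-per-second setting.

```python
# Timestamps (ms) copied from the snsr-eval log above.
WAKE_WORD_BEGIN_MS = 1950
REGION_BEGIN_MS = 1650    # first number on the ^end line (assumed: region start)
REGION_END_MS = 4200      # second number on the ^end line (assumed: region end)
SAMPLES_PER_SECOND = 16000  # assumption; check the samples-per-second setting


def ms_to_samples(ms, rate=SAMPLES_PER_SECOND):
    """Convert a duration in milliseconds to a whole number of samples."""
    return ms * rate // 1000


# With include-wake-word-audio=1 the region starts before the wake word.
implied_backoff_ms = WAKE_WORD_BEGIN_MS - REGION_BEGIN_MS
region_ms = REGION_END_MS - REGION_BEGIN_MS
region_samples = ms_to_samples(region_ms)

print("implied backoff (ms):", implied_backoff_ms)
print("region length (ms):", region_ms)
print("region length (samples):", region_samples)
```

If vad-audio.wav is noticeably shorter or longer than the computed region length, one of the assumptions above probably does not hold for your configuration.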