tpl-spot-vad

This template runs the wake word spotter in slot 0 until it detects the wake word, then uses a VAD to find the start point and endpoint of speech in the audio stream that follows the wake word.

tpl-spot-vad has task-type==phrasespot-vad.

Expected task types:

Slot 0: phrasespot

tpl-spot-vad-3.10.0.snsr

Operation

flowchart TD
  start((start))
  start --> startWW

  subgraph slot0[**slot 0** &lpar;phrasespot&rpar;]
    startWW((start))
    fetchWW[/samples from ->audio-pcm/]
    audioWW(^sample-count)
    processWW[process]
    result(^result)
    stopWW((stop))
    startWW --> fetchWW
    fetchWW --> audioWW
    audioWW --> processWW
    processWW --> fetchWW
    processWW -->|recognize| result
    result --> stopWW
  end

  listenBegin(^listen-begin)
  listenEnd(^listen-end)

  stopWW --> listenBegin
  listenBegin --> fetch0

  fetch0[/samples from ->audio-pcm/]
  fetch1[/samples from ->audio-pcm/]
  audio0(^sample-count)
  audio1(^sample-count)

  silence(^silence)
  begin(^begin)
  END(^end)
  limit(^limit)

  process0[process]
  process1[process]
  out[\samples to <-audio-pcm\]

  final@{ shape: f-circ }

  fetch0 --> audio0
  audio0 --> process0
  process0 --> fetch0
  process0 -->|speech start| begin
  process0 -->|timeout| silence
  silence --> final

  begin --> fetch1
  fetch1 --> audio1
  audio1 --> out
  out --> process1
  process1 --> fetch1
  process1 -->|speech end| END
  process1 -->|speech limit| limit
  END --> final
  limit --> final

  final --> listenEnd
  listenEnd --> startWW

1. Read audio data from ->audio-pcm.
2. Invoke ^sample-count.
3. If processing detects a vocabulary phrase, skip to step 5.
4. Continue processing until STREAM_END occurs on ->audio-pcm, or one of the event handlers returns a code other than OK.
5. Invoke ^result.
6. Invoke ^listen-begin.
7. Read audio data from ->audio-pcm.
8. Invoke ^sample-count.
9. If speech is detected within leading-silence ms, continue at step 12.
10. If no speech is detected within leading-silence ms, invoke ^silence and skip to step 19.
11. Continue processing at step 7 until STREAM_END.
12. Invoke ^begin.
13. Read audio data from ->audio-pcm.
14. Invoke ^sample-count.
15. If pass-through == 1, write speech samples to <-audio-pcm.
16. If the end of speech is detected within max-recording ms, invoke ^end and skip to step 19.
17. If the end of speech is not detected within max-recording ms, invoke ^limit and skip to step 19.
18. Continue processing at step 13 until STREAM_END.
19. Invoke ^listen-end.
20. Restart at step 1.
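
The event order in the steps above can be sketched as a small state machine. This is a simplified model in plain Python, not SDK code: the frame labels ("wake", "speech", and anything else as non-speech), the `simulate` function, and the default durations are all hypothetical. A single non-speech frame stands in for the VAD's endpoint decision, and backoff, hold-over, and STREAM_END handling are not modeled.

```python
from enum import Enum, auto


class Phase(Enum):
    SPOT = auto()     # steps 1-4: wake word spotting in slot 0
    WAIT = auto()     # steps 7-11: waiting for speech to start
    RECORD = auto()   # steps 13-18: inside the speech region


def simulate(frames, frame_ms=10, leading_silence_ms=30,
             max_recording_ms=50, pass_through=True):
    """Return (events, written) for a list of hypothetical frame labels.

    "wake" means the spotter recognizes the phrase on this frame,
    "speech" means the VAD sees speech, anything else is non-speech.
    """
    events, written = [], []
    phase, elapsed = Phase.SPOT, 0
    for label in frames:                      # steps 1, 7, 13: read audio
        events.append("^sample-count")        # steps 2, 8, 14
        if phase is Phase.SPOT:
            if label == "wake":               # step 3
                events += ["^result", "^listen-begin"]   # steps 5, 6
                phase, elapsed = Phase.WAIT, 0
        elif phase is Phase.WAIT:
            if label == "speech":             # step 9
                events.append("^begin")       # step 12
                phase, elapsed = Phase.RECORD, 0
            else:
                elapsed += frame_ms
                if elapsed >= leading_silence_ms:        # step 10
                    events += ["^silence", "^listen-end"]
                    phase = Phase.SPOT                   # step 20
        else:  # Phase.RECORD
            if pass_through:                  # step 15
                written.append(label)
            elapsed += frame_ms
            if label != "speech":             # step 16: endpoint found
                events += ["^end", "^listen-end"]
                phase = Phase.SPOT
            elif elapsed >= max_recording_ms:            # step 17
                events += ["^limit", "^listen-end"]
                phase = Phase.SPOT
    return events, written
```

For the frame sequence wake, quiet, speech, speech, quiet, this model reports ^result, ^listen-begin, ^begin, ^end, ^listen-end (with a ^sample-count before each frame is processed), which matches the happy path through the flowchart.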

Register callback handlers with setHandler only for those events you're interested in.
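
The idea of registering only the handlers you need can be illustrated with a toy dispatcher. This is a sketch in plain Python, not the SDK's setHandler API: the `EventDispatcher` class, its method names, the "STOP" return code, and the `begin_ms`/`end_ms` fields are hypothetical. It assumes that an event with no registered handler is simply skipped and treated as OK.

```python
class EventDispatcher:
    """Hypothetical stand-in for a session that fires named events."""

    def __init__(self):
        self._handlers = {}

    def set_handler(self, event, callback):
        # Register a callback for one event name, for example "^end".
        self._handlers[event] = callback

    def fire(self, event, **info):
        # Events without a registered handler are skipped and count as OK.
        handler = self._handlers.get(event)
        return handler(**info) if handler else "OK"


segments = []
session = EventDispatcher()
# Only ^end and ^limit matter to this application; the rest stay unregistered.
session.set_handler("^end", lambda **info: segments.append(info) or "OK")
session.set_handler("^limit", lambda **info: "STOP")

session.fire("^sample-count", count=160)          # no handler, skipped
session.fire("^end", begin_ms=1650, end_ms=4200)  # recorded in segments
```

Returning a code other than OK from a handler is what ends processing in step 4 of the operation list; the "STOP" value here only marks where such a code would be returned.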

Settings

^begin, ^end, ^limit, ^listen-begin, ^listen-end, ^result, ^sample-count, ^silence

operating-point-iterator, vocab-iterator

audio-stream, audio-stream-first, audio-stream-last

->audio-pcm, <-audio-pcm, audio-stream-from, audio-stream-to, dsp-acmodel-stream, dsp-header-stream, dsp-search-stream, skip-to-ms, skip-to-sample

audio-stream-size, backoff, delay, dsp-target, duration-ms, hold-over, include-leading-silence, include-wake-word-audio, leading-silence, listen-window, low-fr-operating-point, max-recording, operating-point, pass-through, samples-per-second, sv-threshold

phrasespot-vad

live-segment.c, snsr-eval.c, segmentSpottedAudio.java

Notes

Use this template to send wake-word-gated audio to cloud engines.

Set include-wake-word-audio = 1 to include the wake word audio in the VAD audio output stream.

This template writes the VAD-segmented audio to <-audio-pcm. If your application does not read this output stream, set pass-through = 0.

Examples

% cd ~/Sensory/TrulyNaturalSDK/7.6.1

% bin/snsr-edit -o vg-vad.snsr\
    -t model/tpl-spot-vad-3.10.0.snsr\
    -f 0 model/spot-voicegenie-enUS-6.5.1-m.snsr\
    -s include-wake-word-audio=1

# Say "Voice genie, what's the capital of Oregon?"
% bin/snsr-eval -o vad-audio.wav -vvt vg-vad.snsr
Using live audio from default capture device. ^C to stop.
Using operating point 8.
Available operating points: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21.
Available vocabulary:
  1: "voicegenie"
phrase:
  1950   2550 (1) voicegenie
words:
  1950   2550 (1) voicegenie

  2730 [^listen-begin]
  2730 [^begin]
  1650   4200 [^end] VAD speech region.
  4980 [^listen-end]
^C

Review vad-audio.wav and note that the recording starts backoff ms before the beginning of "voice genie" and continues until hold-over ms after the end of the utterance.
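
As a quick sanity check, the timestamps in the log above can be turned into expected durations. The arithmetic below is a sketch: it assumes the VAD region start equals the wake word start minus backoff, that the reported region bounds the recorded audio, and a 16000 Hz sample rate (check samples-per-second for the actual value).

```python
# Timestamps in ms, copied from the snsr-eval log above.
wake_word_begin, wake_word_end = 1950, 2550
vad_begin, vad_end = 1650, 4200

# Audio kept before the wake word. This equals backoff only if the region
# start is exactly "wake word start minus backoff" (assumption).
lead_in_ms = wake_word_begin - vad_begin
region_ms = vad_end - vad_begin

# Assumed sample rate; the real value comes from samples-per-second.
SAMPLES_PER_SECOND = 16000
region_samples = region_ms * SAMPLES_PER_SECOND // 1000

print(lead_in_ms, region_ms, region_samples)  # prints "300 2550 40800"
```

Under these assumptions the log implies about 300 ms of lead-in audio and a 2550 ms speech region, which gives a rough length to compare against when reviewing vad-audio.wav.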