Speech To Text stt
These models transcribe audio using transformer-based recognizers.
STT models have task-type==lvcsr and, by convention, filenames that match stt-*.snsr.
STT models included in this distribution.
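The naming convention above can be checked with a simple glob match. The following is a minimal sketch, not part of the SDK: the helper name and the example filenames are hypothetical, and the check looks only at the filename. It does not inspect the model's task-type, which remains the authoritative indicator.

```python
from fnmatch import fnmatchcase


def is_stt_model_name(filename: str) -> bool:
    """Return True if filename follows the stt-*.snsr naming convention.

    This is a filename check only; it does not open the model or read
    its task-type.
    """
    return fnmatchcase(filename, "stt-*.snsr")


# Hypothetical filenames, for illustration only.
candidates = ["stt-example.snsr", "tpl-vad-lvcsr.snsr", "stt-notes.txt"]
stt_models = [name for name in candidates if is_stt_model_name(name)]
```

Because the pattern is a convention rather than a guarantee, a match is best treated as a hint about which files to try loading.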
Operation
flowchart TD
start((start))
fetch[/samples from ->audio-pcm/]
audio(^sample-count)
process[process]
partial(^result-partial)
intent(^nlu-intent)
slot(^nlu-slot)
result(^result)
nlu{NLU<br>match?}
slm{SLM<br>included?}
generate[generate]
slmstart(^slm-start)
slmresultpartial(^slm-result-partial)
slmresult(^slm-result)
start --> fetch
fetch --> audio
audio --> process
process --> fetch
process -->|hypothesis| partial
partial --> fetch
process -->|VAD endpoint<br>or STREAM_END| nlu
nlu -->|yes| intent
nlu -->|no| result
intent --> slot
slot --> result
slot -->|more| intent
result --> slm
slm -->|yes| slmstart
slm -->|no| fetch
slmstart -->|OK| generate
slmstart -->|STOP| fetch
generate -->|response| slmresultpartial
slmresultpartial --> generate
generate -->|done| slmresult
slmresult --> fetch

1. Read audio data from ->audio-pcm.
2. Invoke ^sample-count.
3. Invoke ^result-partial with interim recognition hypotheses every partial-result-interval ms.
4. Continue processing until STREAM_END occurs on ->audio-pcm, one of the event handlers returns a code other than OK, or an external VAD detects a speech endpoint.
5. If NLU is configured, invoke ^nlu-intent and ^nlu-slot for each top-level result that matches.
6. Invoke ^result with the final recognition hypothesis.
7. If an SLM is not available, resume processing at step 1.
8. Invoke ^slm-start. If the handler returns STOP, resume processing at step 1.
9. Invoke ^slm-result-partial as the model generates text.
10. Invoke ^slm-result when text generation is complete.
11. Resume processing at step 1.
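The order in which the event handlers fire can be illustrated with a small simulation. This is a sketch under stated assumptions and does not call the SDK: `run_utterance`, the `OK` and `STOP` string constants, and all payloads are hypothetical stand-ins. It replays a single utterance that ends when the audio chunks run out (standing in for STREAM_END on ->audio-pcm), fires ^result-partial every few chunks rather than every partial-result-interval ms, and assumes that a handler returning anything other than OK simply ends the run.

```python
from typing import Callable, Dict, List, Sequence, Tuple

OK, STOP = "OK", "STOP"  # hypothetical stand-ins for handler return codes

Handler = Callable[[str], str]


def run_utterance(
    chunks: Sequence[int],
    handlers: Dict[str, Handler],
    nlu: Sequence[Tuple[str, str]] = (),
    slm_tokens: Sequence[str] = (),
    partial_every: int = 2,
) -> List[str]:
    """Return the names of the events fired, in order, for one utterance."""
    fired: List[str] = []

    def invoke(event: str, payload: str = "") -> str:
        # Record the event, then call the user handler if one is registered.
        fired.append(event)
        handler = handlers.get(event)
        return handler(payload) if handler else OK

    total = 0
    for index, count in enumerate(chunks, start=1):
        # Read a chunk of audio, then report the running sample count.
        total += count
        if invoke("^sample-count", str(total)) != OK:
            return fired  # assumption: a non-OK code ends the run
        # Interim hypothesis; chunk-based here, time-based in the real flow.
        if index % partial_every == 0 and invoke("^result-partial") != OK:
            return fired

    # Chunks exhausted: treated as STREAM_END. Report NLU matches, if any.
    for intent, slot in nlu:
        invoke("^nlu-intent", intent)
        invoke("^nlu-slot", slot)
    invoke("^result")  # final recognition hypothesis

    if not slm_tokens:  # no SLM included: nothing more to do
        return fired
    if invoke("^slm-start") == STOP:  # handler declined text generation
        return fired
    for token in slm_tokens:
        invoke("^slm-result-partial", token)
    invoke("^slm-result")
    return fired


# Usage: the ^slm-start handler returns STOP, so no SLM text is generated.
events = run_utterance(
    chunks=[160, 160, 160, 160],
    handlers={"^slm-start": lambda _payload: STOP},
    nlu=[("set_timer", "duration")],
    slm_tokens=["Sure", "."],
)
```

In this run `events` ends with ^result followed by ^slm-start, and neither ^slm-result-partial nor ^slm-result appears, mirroring the STOP branch in the flowchart.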
Note
STT recognizers do not produce a final recognition hypothesis until they run out of audio samples to process or an external VAD detects a speech endpoint.
With live audio you should use these models with a VAD template such as tpl-vad-lvcsr, tpl-opt-spot-vad-lvcsr, or tpl-spot-vad-lvcsr.
Settings
^nlu-intent, ^nlu-slot, ^result, ^result-partial, ^sample-count, ^slm-result, ^slm-result-partial, ^slm-start
none
audio-stream, audio-stream-first, audio-stream-last
->audio-pcm, audio-stream-from, audio-stream-to
audio-stream-size, custom-vocab, partial-result-interval, samples-per-second, stt-profile
live-spot.c, snsr-eval.c, PhraseSpot.java, segmentSpottedAudio.java