
Speech To Text (stt)

These models perform audio transcription with transformer networks.

STT models have task-type == lvcsr and, by convention, filenames that match stt-*.snsr.

STT models included in this distribution.
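As a minimal sketch, a session might load one of these models and verify its task type before running it. The model filename below is a placeholder following the stt-*.snsr convention, and the literal "task-type" and "lvcsr" strings stand in for the corresponding snsr.h macros.

    #include <snsr.h>
    #include <stdio.h>

    int main(void) {
      SnsrSession s = NULL;
      SnsrRC rc = snsrNew(&s);
      /* Load an STT model; the filename is a placeholder. */
      if (rc == SNSR_RC_OK)
        rc = snsrLoad(s, snsrStreamFromFileName("stt-enUS.snsr", "r"));
      /* Fail early if this is not a speech-to-text (lvcsr) task. */
      if (rc == SNSR_RC_OK)
        rc = snsrRequire(s, "task-type", "lvcsr");
      if (rc != SNSR_RC_OK) {
        fprintf(stderr, "%s\n", snsrErrorDetail(s));
        snsrRelease(s);
        return 1;
      }
      snsrRelease(s);
      return 0;
    }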

Operation

flowchart TD
    start((start))
    fetch[/samples from ->audio-pcm/]
    audio(^sample-count)
    process[process]
    partial(^result-partial)
    intent(^nlu-intent)
    slot(^nlu-slot)
    result(^result)
    nlu{NLU<br>match?}

    slm{SLM<br>included?}
    generate[generate]
    slmstart(^slm-start)
    slmresultpartial(^slm-result-partial)
    slmresult(^slm-result)

    start --> fetch
    fetch --> audio
    audio --> process
    process --> fetch
    process -->|hypothesis| partial
    partial --> fetch
    process -->|VAD endpoint<br>or STREAM_END| nlu
    nlu -->|yes| intent
    nlu -->|no| result
    intent --> slot
    slot --> result
    slot -->|more| intent

    result --> slm
    slm -->|yes| slmstart
    slm -->|no| fetch
    slmstart -->|OK| generate
    slmstart -->|STOP| fetch
    generate -->|response| slmresultpartial
    slmresultpartial --> generate
    generate -->|done| slmresult
    slmresult --> fetch
  1. Read audio data from ->audio-pcm.
  2. Invoke ^sample-count.
  3. Invoke ^result-partial with interim recognition hypotheses every partial-result-interval ms.
  4. Continue processing until STREAM_END occurs on ->audio-pcm, one of the event handlers returns a code other than OK, or an external VAD detects a speech endpoint.
  5. If NLU is configured, invoke ^nlu-intent and ^nlu-slot for each top-level result that matches.
  6. Invoke ^result with the final recognition hypothesis.
  7. If an SLM is not available, resume processing at step 1.
  8. Invoke ^slm-start. If the handler returns STOP, resume processing at step 1.
  9. Invoke ^slm-result-partial as the model generates text.
  10. Invoke ^slm-result when text generation is complete.
  11. Resume processing at step 1.
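The handler registration this flow implies might look like the following C sketch. The event keys and the ->audio-pcm stream name are taken from this page (snsr.h also defines macros for them); the model filename, the raw 16-bit PCM input file, and the use of SNSR_RES_TEXT as the result-text key for this task are assumptions.

    #include <snsr.h>
    #include <stdio.h>

    /* Step 3: interim recognition hypotheses. */
    static SnsrRC onPartial(SnsrSession s, const char *key, void *data) {
      const char *text = NULL;
      SnsrRC rc = snsrGetString(s, SNSR_RES_TEXT, &text);
      if (rc == SNSR_RC_OK) printf("partial: %s\n", text);
      return rc;  /* Any code other than SNSR_RC_OK ends processing (step 4). */
    }

    /* Step 6: final recognition hypothesis. */
    static SnsrRC onResult(SnsrSession s, const char *key, void *data) {
      const char *text = NULL;
      SnsrRC rc = snsrGetString(s, SNSR_RES_TEXT, &text);
      if (rc == SNSR_RC_OK) printf("final: %s\n", text);
      return rc;
    }

    /* Step 8: return SNSR_RC_STOP here to skip SLM generation. */
    static SnsrRC onSlmStart(SnsrSession s, const char *key, void *data) {
      return SNSR_RC_OK;
    }

    int main(void) {
      SnsrSession s = NULL;
      snsrNew(&s);
      snsrLoad(s, snsrStreamFromFileName("stt-enUS.snsr", "r"));
      snsrSetHandler(s, "^result-partial", snsrCallback(onPartial, NULL, NULL));
      snsrSetHandler(s, "^result", snsrCallback(onResult, NULL, NULL));
      snsrSetHandler(s, "^slm-start", snsrCallback(onSlmStart, NULL, NULL));
      /* Step 1: ->audio-pcm supplies the samples. A raw PCM file stands in
       * for live audio here; STREAM_END at end-of-file triggers step 6. */
      snsrSetStream(s, "->audio-pcm",
                    snsrStreamFromFileName("utterance.pcm", "r"));
      if (snsrRun(s) != SNSR_RC_OK)
        fprintf(stderr, "%s\n", snsrErrorDetail(s));
      snsrRelease(s);
      return 0;
    }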

Note

STT recognizers do not produce a final recognition hypothesis until they run out of audio samples to process, or an external VAD detects a speech endpoint.

With live audio, you should combine these models with a VAD template such as tpl-vad-lvcsr, tpl-opt-spot-vad-lvcsr, or tpl-spot-vad-lvcsr.
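For example, given a session s as in the sketch above, the combination might look like this. Loading the template and the model into the same session with consecutive snsrLoad calls is an assumption; consult the template documentation for the actual combination mechanism.

    /* Assumption: a VAD template and an STT model are combined by loading
     * both into one session; see the tpl-vad-lvcsr documentation for the
     * actual mechanism. */
    snsrLoad(s, snsrStreamFromFileName("tpl-vad-lvcsr.snsr", "r"));
    snsrLoad(s, snsrStreamFromFileName("stt-enUS.snsr", "r"));
    /* The VAD supplies the speech endpoint, so ^result fires at the end
     * of each utterance instead of waiting for STREAM_END. */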

Settings

Events: ^nlu-intent, ^nlu-slot, ^result, ^result-partial, ^sample-count, ^slm-result, ^slm-result-partial, ^slm-start

Required: none

Read-only: audio-stream, audio-stream-first, audio-stream-last

Read-write: ->audio-pcm, audio-stream-from, audio-stream-to

Optional: audio-stream-size, custom-vocab, partial-result-interval, samples-per-second, stt-profile

task-type: lvcsr

Sample code: live-spot.c, snsr-eval.c, PhraseSpot.java, segmentSpottedAudio.java
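As a sketch, a few of the optional settings above could be adjusted on a session s before snsrRun(). The setting names are the ones listed on this page; the values, and the "accurate" profile name, are illustrative only.

    /* Illustrative values; setting names come from the table above. */
    snsrSetInt(s, "partial-result-interval", 250);  /* ^result-partial every 250 ms */
    snsrSetInt(s, "samples-per-second", 16000);     /* input audio sample rate */
    snsrSetString(s, "stt-profile", "accurate");    /* hypothetical profile name */

    /* After ^result, the recognized audio is available on the read-only
     * audio-stream, delimited by audio-stream-from and audio-stream-to. */
    SnsrStream audio = NULL;
    snsrGetStream(s, "audio-stream", &audio);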