LVCSR tnl

These recognizers use a phonetic acoustic model and an FST vocabulary decoder. They are suitable for small to medium vocabulary tasks, but not for unconstrained audio transcription.

These models have task-type==lvcsr and filenames that by convention match lvcsr-*.snsr.

You can create LVCSR recognizers with VoiceHub or by specifying a grammar with a build-capable model [1].

LVCSR recognizers include support for decoding with statistical language models, but Sensory does not distribute the tools we use to create these [2]. Language models can provide improved accuracy for constrained target domains. For transcription-type tasks we recommend that you use an STT model instead.

Our FST decoder supports hybrid models that contain both grammar-based and language model components. Models that include -background- in the filename (such as lvcsr-background-enUS-1.2.3.snsr) use this feature to match out-of-grammar utterances to a ~background class that contains a small language model.

LVCSR models included in this distribution.

Operation

flowchart TD
    start((start))
    fetch[/samples from ->audio-pcm/]
    audio(^sample-count)
    process
    partial(^result-partial)
    intent(^nlu-intent)
    slot(^nlu-slot)
    result(^result)
    nlu{NLU<br>match?}
    start --> fetch
    fetch --> audio
    audio --> process
    process --> fetch
    process -->|hypothesis| partial
    partial --> fetch
    process -->|VAD endpoint<br>or STREAM_END| nlu
    nlu -->|yes| intent
    nlu -->|no| result
    intent --> slot
    slot --> result
    slot -->|more| intent
    result --> fetch
  1. Read audio data from ->audio-pcm.
  2. Invoke ^sample-count.
  3. Invoke ^result-partial with interim recognition hypotheses every partial-result-interval ms.
  4. Continue processing until STREAM_END occurs on ->audio-pcm, one of the event handlers returns a code other than OK, or an external VAD detects a speech endpoint.
  5. If NLU is configured, invoke ^nlu-intent and ^nlu-slot for each top-level result that matches.
  6. Invoke ^result with the final recognition hypothesis.
  7. Resume processing from step 1.

Note

LVCSR recognizers do not produce a final recognition hypothesis until they run out of audio samples to process, or an external VAD detects a speech endpoint.

With live audio you should use these with a VAD template such as tpl-vad-lvcsr, tpl-opt-spot-vad-lvcsr, or tpl-spot-vad-lvcsr.

Settings

^nlu-intent, ^nlu-slot, ^result, ^result-partial, ^sample-count

none

audio-stream, audio-stream-first, audio-stream-last

->audio-pcm, audio-stream-from, audio-stream-to, grammar-stream, phrases-stream

audio-stream-size, complete-only, partial-result-interval, samples-per-second, search.frame-nota, show-silence

lvcsr

live-spot.c, snsr-eval.c, PhraseSpot.java, segmentSpottedAudio.java

Notes

Sensory optimizes hybrid models with a background component only to detect speech that is not in the specified grammar. These models report an nlu-intent-name of background when they detect out-of-grammar utterances. You should not use the out-of-grammar recognition text result as this will have a high word error rate. Consider using STT for transcription tasks instead.

phrases-stream provides a convenient way to specify a recognition vocabulary from an exhaustive list of alternative utterances.

Grammar-based recognition 6.7.0

Sensory's LVCSR models use grammars to constrain the possible utterances they can recognize. Focussing on a limited set of words and structures defined in these grammars improves recognition speed and accuracy at the expense of recognizing arbitrary input.

You can create a custom recognizer by specifying a fixed grammar during development if the recognition vocabulary is entirely known, or at runtime if it is not. You can also use a hybrid approach: build the invariant parts during development, and delay adding variable parts (such as a list of favorite TV channels) until runtime.

Creating a recognizer

Let's create a grammar-based recognizer using the command-line tools. We'll use data/grammars/enrollments.txt which contains a sample grammar specification for the enrollment recordings in data/enrollments/.

We can create a custom recognizer using this grammar with snsr-edit by specifying an LVCSR model that supports building and grammar-stream.

data/grammars/enrollments.txt
# LVCSR grammar specification for test utterances in data/enrollments/
#
# In a tpl-spot-vad-lvcsr pipeline the prefix would be consumed by the spotter.
prefix = armadillo | jackalope | terminator;

# List of known utterances in the *-c.wav files.
sentence =
 18 percent of 643 |
 call the nearest target |
 how far away is winco |
 play more songs by this artist |
 record a video |
 start a timer for 20 minutes |
 i'm running low on gas |
 cancel all my meetings on friday |
 directions to susan's house |
 do i have any new texts |
 open my calendar to next week |
 set an alarm for 6 am tomorrow;

# Match the prefix and zero or one of the sentences.
# <s> and </s> are sentence start and end markers that
# match silence and small amounts of extraneous speech.
g = <s> $prefix $sentence? </s>;
% cd ~/Sensory/TrulyNaturalSDK/7.6.1

% bin/snsr-edit -vv -t model/lvcsr-build-enUS-2.7.3.snsr \
    -f grammar-stream data/grammars/enrollments.txt \
    -o lvcsr-enrollments.snsr
Loading "model/lvcsr-build-enUS-2.7.3.snsr" as the template model.
Loading "data/grammars/enrollments.txt" into setting "grammar-stream".
Saved edited model to "lvcsr-enrollments.snsr".

Run the new model with snsr-eval:

% bin/snsr-eval -t lvcsr-enrollments.snsr \
    -s partial-result-interval=0 \
    data/enrollments/armadillo-1-3-c.wav
   165   2760 armadillo play more songs by this artist

Setting partial-result-interval=0 shows only the final recognition hypothesis.

For small grammars such as this the build time is negligible. We could use snsr-eval to build and run the recognizer in a single operation:

% bin/snsr-eval -t model/lvcsr-build-enUS-2.7.3.snsr \
    -f grammar-stream data/grammars/enrollments.txt \
    -s partial-result-interval=0 \
    data/enrollments/armadillo-1-3-c.wav
   165   2760 armadillo play more songs by this artist

Classes

A symbol that starts with the tilde ~ sigil specifies a recognition class. Class recognizers have their own grammar specifications, separate from the top-level grammar. The behavior of a class-based recognizer is similar to that specified by a rule. Classes, however, can be updated without recompiling the rest of the grammar, and all references to a class use the same recognizer. This can reduce the recognizer size and improve build speed.

This example uses a modified enrollment grammar which references two toy classes: ~number and ~place:

enrollments-class.txt
# LVCSR grammar specification for test utterances in data/enrollments/
# This references two class sub-recognizers: ~number and ~place
#
# In a tpl-spot-vad-lvcsr pipeline the prefix would be consumed by the spotter.
prefix = armadillo | jackalope | terminator;

# List of known utterances in the *-c.wav files.
sentence =
 ~number percent of ~number |
 call the nearest ~place |
 how far away is ~place |
 play more songs by this artist |
 record a video |
 start a timer for ~number minutes |
 i'm running low on gas |
 cancel all my meetings on friday |
 directions to ~place |
 do i have any new texts |
 open my calendar to next week |
 set an alarm for ~number am tomorrow;

# Match the prefix and zero or one of the sentences.
# <s> and </s> are sentence start and end markers that
# match silence and small amounts of extraneous speech.
g = <s> $prefix $sentence? </s>;
place.txt
# Example place name class recognizer.

g = target | winco | susan's house;

The ~number and ~place classes referenced in enrollments-class.txt create two new dynamic settings for these classes: grammar-stream.number and grammar-stream.place. Specify these to create a complete recognizer:

% snsr-edit -v -t model/lvcsr-build-enUS-2.7.3.snsr \
    -f grammar-stream enrollments-class.txt \
    -g grammar-stream.number "g = 18 | 643 | 20 | 6;" \
    -o lvcsr-enrollments-class.snsr
Output written to "lvcsr-enrollments-class.snsr".

snsr-edit's -g option sets the grammar-stream.number stream to a string argument. We could also have used a file for the number grammar.

Run the recognizer:

% snsr-eval -v -t lvcsr-enrollments-class.snsr \
    -s partial-result-interval=0 \
    -f grammar-stream.number number.txt \
    data/enrollments/armadillo-1-0-c.wav
   390   3210 (0.00 sv) armadillo 18 percent of 643
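
When the class contents are only known at runtime, the class grammar text has to be generated by the application. The following is a minimal sketch of that step in C, not part of the SDK: build_class_grammar and append_text are hypothetical helper names, and the list of characters to escape is taken from the escape rule in the Syntax section. It only produces the grammar text in the same form as place.txt; passing that text to the recognizer is left out.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Characters the Syntax section lists as needing a backslash escape. */
static const char SPECIAL[] = "^|*+?=[]();#:";

/*
 * Hypothetical helper: appends text to out, prefixing special characters
 * with a backslash when escape is nonzero.
 * Returns 0 on success, -1 if out would overflow.
 */
static int append_text(char *out, size_t cap, size_t *used,
                       const char *text, int escape)
{
    for (; *text != '\0'; text++) {
        int special = escape && strchr(SPECIAL, *text) != NULL;
        size_t need = special ? 2 : 1;
        if (*used + need >= cap) return -1;  /* keep one byte for the NUL */
        if (special) out[(*used)++] = '\\';
        out[(*used)++] = *text;
    }
    out[*used] = '\0';
    return 0;
}

/*
 * Hypothetical helper: writes a class grammar of the form
 * "g = item1 | item2 | item3;" into out.
 * Returns 0 on success, -1 if the list is empty or out is too small.
 */
int build_class_grammar(const char *const *items, size_t count,
                        char *out, size_t cap)
{
    size_t used = 0;
    assert(items != NULL && out != NULL);
    if (count == 0 || cap == 0) return -1;
    out[0] = '\0';
    for (size_t i = 0; i < count; i++) {
        if (append_text(out, cap, &used, i == 0 ? "g = " : " | ", 0) != 0 ||
            append_text(out, cap, &used, items[i], 1) != 0)
            return -1;
    }
    return append_text(out, cap, &used, ";", 0);
}
```

Called with the items 18, 643, 20, and 6, this produces the same string that the snsr-edit example above passes with -g for grammar-stream.number.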

Class libraries 6.15.0

TrulyNatural 6.15.0 introduced support for pre-built binary class repositories. These contain classes built from frequently used grammar fragments such as dates, times, and numbers.

Load binary class repositories into the same Session as an LVCSR model to add this capability to the model. If a grammar references a class that's not explicitly defined, the class name is looked up in the provided class library or libraries. System class libraries provided by Sensory use a prefix of s. for all class names.

See lvcsr-lib-enUS-1.2.0.snsr for a description of the classes used below.

class-lib.txt
# Example recognizer with classes from a class library
call = call {number ~s.phone-number};
emergency = ~s.call-emergency;
timer = {timer ~s.timer-phrases};
commands = {call} | {emergency} | $timer;
g = <s> $commands </s>;

We'll use live audio for this example, so we need snsr-eval's -a flag to add a VAD that finds the end of each utterance and signals the recognizer to produce a final hypothesis.

% snsr-eval -a -t model/lvcsr-build-enUS-2.7.3.snsr \
    -t model/lvcsr-lib-enUS-1.2.0.snsr \
    -f grammar-stream class-lib.txt \
    -s partial-result-interval=0

# Say: Call 800 555 1212
NLU intent: call (0) = call eight hundred five five five one two one two
NLU entity:   number (0) = eight hundred five five five one two one two
  7815  11190 call eight hundred five five five one two one two

# Say: Set a timer for 31 minutes.
NLU intent: timer (0) = set a timer for thirty one minutes
 24375  27015 set a timer for thirty one minutes

# Say: Call the fire department.
NLU intent: emergency (0) = call the fire department
 40110  41595 call the fire department

Configuring class-based recognition with the C API:

SnsrSession s;

snsrNew(&s);
snsrLoad(s,   snsrStreamFromFileName("model/tpl-vad-lvcsr-3.17.0.snsr", "r"));
snsrSetStream(s, SNSR_SLOT_0,
              snsrStreamFromFileName("model/lvcsr-build-enUS-2.7.3.snsr", "r"));
snsrLoad(s,   snsrStreamFromFileName("model/lvcsr-lib-enUS-1.2.0.snsr", "r"));
snsrSetStream(s, SNSR_GRAMMAR_STREAM,
              snsrStreamFromFileName("class-lib.txt", "r"));
if (snsrRC(s) != SNSR_RC_OK) {
    fprintf(stderr, "ERROR: %s\n", snsrErrorDetail(s));
    return snsrRC(s);
}

Configuring class-based recognition with the Java API:

SnsrSession s = new SnsrSession();
try {
    s.load(SnsrStream.fromFileName("model/tpl-vad-lvcsr-3.17.0.snsr", "r"));
    s.setStream(Snsr.SLOT_0,
                SnsrStream.fromFileName("model/lvcsr-build-enUS-2.7.3.snsr", "r"));
    s.load(SnsrStream.fromFileName("model/lvcsr-lib-enUS-1.2.0.snsr", "r"));
    s.setStream(Snsr.GRAMMAR_STREAM,
                SnsrStream.fromFileName("class-lib.txt", "r"));
} catch (IOException e) {
    e.printStackTrace();
    return s.rC();
}

Syntax

A context-free grammar is a set of rules that describes the sequences of words that an LVCSR model can recognize.

Definition

  1. Grammars use UTF-8 encoding.
  2. # marks the start of a comment, which extends to the end of the line.
  3. A grammar is a series of rules representing variable definitions. The final rule in a grammar specifies the recognition vocabulary and typically references rules defined earlier. It should include the sentence start (<s>) and end (</s>) markers.
  4. A rule is an assignment of the form name = expr ; where name is a symbol and expr is a sequence of symbols and operators. expr is a type of regular expression.
  5. A symbol is a sequence of characters that does not include any whitespace or operators, optionally prefixed by sigils $ or ~. A symbol without a sigil is called a terminal and is part of the recognition vocabulary, for example temperature. Special symbols are predefined terminals that describe input characteristics such as pauses and the edges of an utterance.
  6. The $ sigil does rule substitution at build time. The parser substitutes the value of the rule named name for $name. Substitutions include an implicit grouping operator: Grammar a = 1 | 2 | 3; b = <s> $a </s>; is equivalent to b = <s> (1 | 2 | 3) </s>;.
  7. The ~ sigil substitutes a named recognition class at runtime.
    • Each class is a recognizer with its own grammar, separate from the main grammar.
    • All references to a class use instances of the same class recognizer.
    • You can update each class in isolation, without having to recompile the main grammar.
    • If you have a large rule that's referenced multiple times, converting it to a class can speed up build time significantly.
    • Use classes to augment a recognition vocabulary at runtime. In a voice dialing application, for example, one would define the entire recognition grammar at build time but use ~contacts instead of a predefined list of contact names. Once loaded, the application would scan the address book and build only the ~contacts class.
    • Specify class definitions with grammar-stream.classname or phrases-stream.classname, for example phrases-stream.contacts.
  8. Operators include grouping parentheses, brackets, and braces, infix operators that indicate logical AND and OR between symbols, and postfix operators that change how the preceding symbol matches input. The operator precedence table lists the order and direction in which the parser applies operators.
  9. Grouping
    • ( ) Parentheses enclose items that are grouped together.
    • [ ] Square brackets enclose optional items. [...] is equivalent to (...)?.
    • { } Braces implement slot-capturing lightweight NLU markup.
      • {slotName a b c} makes a b c available as the nlu-slot-value of nlu-slot-name slotName when the recognizer matches a b c to the input audio.
      • You can nest NLU slots to an arbitrary depth.
      • We define the outermost slots as intents and all the nested slots in each intent as entities.
      • Each identified intent invokes handlers registered for ^nlu-intent and ^nlu-slot.
      • {rule} is shorthand for {rule $rule}.
      • With this grammar:
        seconds = 1 | 2 | 4 | 8 | half:0.5 a:? | a:? quarter:0.25 [of: a:];
        shutterSpeed = set shutter speed to {seconds} ( second | seconds );
        cmd = <s> {shutterSpeed} </s>;
        
        an utterance of "set shutter speed to a quarter of a second" will produce set shutter speed to 0.25 second as recognition output, with an additional ^nlu-intent callback for the top-level shutterSpeed slot:
        NLU intent: shutterSpeed (0) = set shutter speed to 0.25 second
        NLU entity:   seconds (0) = 0.25
        
  10. Infix operators
    • These are valid between symbols and may be surrounded by whitespace.
    • ^ is the conjunction operator and is implied between adjacent terminals: Grammar g = one two three; will recognize only the sequence "one two three".
    • | is the disjunctive operator. It separates alternative items. Grammar g = one | two | three; will recognize "one", or "two", or "three".
  11. Postfix operators
    • These directly follow a symbol without any intervening whitespace.
    • ? A question mark following a symbol makes that symbol optional: It requires zero or one repetitions of the symbol.
    • + A plus sign following a symbol or a group requires one or more repetitions of it.
    • * An asterisk following a symbol or a group requires zero or more repetitions.
    • : is the rewrite operator.
      • left:right recognizes symbol left but produces terminal right as a recognition result.
      • left: recognizes symbol left but rewrites that to an empty string, eliding left from the recognition result.
      • :right inserts right into the recognition result. If you say "one two three", grammar g = <s> one :mississippi two :mississippi three </s>; produces "one mississippi two mississippi three".
    • / A forward slash following a symbol, followed by a floating point number, defines a weight to be associated with that symbol. If there's a rewrite operator (:) the slash must follow the rewritten-to terminal, for example: one:een/0.123. Weights are in the logprob domain. Convert from a probability \(p \in (0, 1]\) to a weight with \(w = -\log_{10}(p)\). The default symbol weight is 0, which corresponds to a probability of 1.0.
  12. \ escape symbol. To include a literal special character in a grammar specification, escape it with a backslash. The characters that support this are: ^, |, *, +, ?, =, [ ], ( ), ;, #, and :.
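
As a worked example of the weight conversion in item 11, using illustrative probabilities that are not taken from any shipped model:

\[ p = 1.0 \;\Rightarrow\; w = -\log_{10}(1.0) = 0 \]
\[ p = 0.5 \;\Rightarrow\; w = -\log_{10}(0.5) \approx 0.301 \]
\[ p = 0.1 \;\Rightarrow\; w = -\log_{10}(0.1) = 1 \]

Going the other way, the weight in one:een/0.123 corresponds to a probability of \(p = 10^{-0.123} \approx 0.753\). Larger weights therefore mean less likely symbols.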

grammar-stream, phrases-stream, nlu-grammar-stream, ^nlu-intent, ^nlu-slot

Operator precedence

The following table lists the precedence and associativity of grammar operators. Operators are listed in descending precedence: level 0 is applied first and level 5 last.

Precedence  Operator  Description                     Associativity
0           :         Rewrite output
0           /         Symbol weight
1           ( )       Grouping
1           [ ]       Optional group
1           { }       Slot-capturing semantic markup
2           ?         Zero-or-one symbol              left-to-right
2           +         One-or-more symbols             left-to-right
2           *         Zero-or-more symbols            left-to-right
3           ^         And, implied between symbols    right-to-left
4           |         Alternative                     right-to-left
5           =         Rule assignment                 right-to-left

This grammar:

a = one | two three four;
g = <s> ( $a | five six) </s>;

will recognize only these phrases:

one
two three four
five six

Special symbols

A grammar can include these special symbols:

  • <s> - The silence at the start of a sentence.
  • </s> - The silence at the end of a sentence.
  • <wp> - Short pauses between words. The grammar compiler automatically adds these where needed, so there is no need to do so explicitly. Do not add <wp> to NLU grammars; use <pause/> instead.
  • <pause/> - An explicit short pause.
  • <no-match/> - Matches when none of the alternatives are likely (i.e. "none of the above").
    • Recognition results at the phrase level can include <no-match/> even if this symbol was not explicitly used in the grammar. This is an indication that the result was rejected due to search.frame-nota, or that RAM or CPU constraints limited the recognizer's ability to produce a result.
  • <unknown/> - Similar to <no-match/>. In some models the threshold for determining whether this symbol matches better than any other is different from that of <no-match/>.
  • . - When used with lightweight NLU grammars a single period matches any input word. Use .:* to match any input words and remove them from the NLU result.

  1. LVCSR models created by VoiceHub include build components only if the grammar references at least one user-defined class, such as ~dynamic-1. If the grammar contains no unresolved classes, VoiceHub removes the build components to reduce model file size and RAM use.

  2. Contact your sales representative if you would like to explore using a custom language model for your application.