Homepage¶

This page describes the organization of a DSP application using Sensory’s TrulyHandsFree-Micro™ (THF-Micro™) speech recognition technology for embedded devices.

Application developers using THF-Micro™ will be supplied a ported, prebuilt static library by Sensory. The THF-Micro™ library performance will be optimized for a given platform by a Sensory developer. Sensory and the application developers will discuss the optimizations and customizations needed for the intended application throughout the porting process.

Licensing Information¶

The THF-Micro™ library supports license limits.
The library can be used for development and production purposes.
For development purposes, the library will be licensed with a development license. An application using this library might encounter a license limit error when it reaches the license usage limits. The user needs to restart the system when this error occurs.
For production purposes, the library will be licensed with a production license. An application using this library will not have any usage limits.

Code, Header, and Sample Files¶

A THF-Micro™ SDK delivery includes the following files:

libTHFMicro.a: Ported, prebuilt static library, which contains platform-specific optimizations.
sensorylib.h: Defines the API to the THF-Micro™ library. This includes definitions of the speech recognition routines and the data structures needed to pass information to the routines. It also includes definitions of error codes that are returned by the library. The version number of sensorylib.h must match the version of libTHFMicro.a; if not, the library will not perform speech recognition and it will return an error code ERR_HEADER_VERSION.
sensorytypes.h: Contains typedefs and other definitions used by both the application and the library.
SensoryDemo.c: Sample application program. It contains usage of the API data structure (t2siStruct) and uses audio samples read in from a raw (.pcm) or .wav file. It usually reads in the .bin files for the acoustic model and search. The sample demonstrates the program flow for initializing recognition and then repeatedly calling recognition with frames of samples until the recognition ends.
SensoryDemoMulti.c: Demonstrates handling of multiple audio channels and handling multiple frames from an audio channel simultaneously.

The SDK will also contain several data files for speech recognition, described in the next section.

Data Files¶

Each THF-Micro™ application requires two data files for the vocabulary: an acoustic model (net) and a grammar (search).

Usually these data files are formatted as binary data (-net.bin for net and -search.bin for grammar). In the sample application, we allocate memory for the net and grammar using malloc, read them in from binary files, and pass the allocated addresses on to the t2siStruct. In an actual application they must be loaded into DSP accessible memory at addresses known to the application, so that their addresses can be placed in the net and gram fields of the t2siStruct. The application finds out how much dynamic memory to allocate based on the size of the vocabulary.
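The malloc-and-read approach described for the sample application can be sketched with standard C file I/O. This is a minimal sketch, not the code from SensoryDemo.c: the helper names `load_stream` and `load_file` are hypothetical, and the step of placing the returned addresses in the net and gram fields is left to the application.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads an entire seekable binary stream into a malloc'd buffer.
 * Returns NULL on failure. On success the caller owns the buffer
 * and must free() it when recognition no longer needs the data. */
void *load_stream(FILE *fp, size_t *size_out)
{
    long len;
    void *buf;

    assert(fp != NULL && size_out != NULL);

    if (fseek(fp, 0L, SEEK_END) != 0)
        return NULL;
    len = ftell(fp);
    if (len <= 0)
        return NULL;
    if (fseek(fp, 0L, SEEK_SET) != 0)
        return NULL;

    buf = malloc((size_t)len);
    if (buf == NULL)
        return NULL;

    if (fread(buf, 1, (size_t)len, fp) != (size_t)len) {
        free(buf);
        return NULL;
    }
    *size_out = (size_t)len;
    return buf;
}

/* Convenience wrapper: loads a whole file by path,
 * for example a -net.bin or -search.bin file. */
void *load_file(const char *path, size_t *size_out)
{
    FILE *fp = fopen(path, "rb");
    void *buf;

    if (fp == NULL)
        return NULL;
    buf = load_stream(fp, size_out);
    fclose(fp);
    return buf;
}
```

On a target without a file system, the same data would instead live at fixed DSP-accessible addresses or in arrays compiled into the application.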

Alternatively the net and grammar data can be supplied as arrays declared in .c files. This may be easier to implement on hardware.

For Fixed Trigger or Command Set vocabularies, these .bin files are supplied by Sensory.

For Speaker Verification or User Defined Trigger vocabularies, .snsr files may be supplied by Sensory, or built by the user using Sensory's VoiceHub™ tool.

spot-convert™¶

Note that .snsr files cannot be used by THF-Micro™ directly. To produce the .bin files needed, the .snsr files must be processed by Sensory's spot-convert™ tool.

spot-convert™ requires a target, which is a code for a given vocabulary format; this will change attributes of the data, such as endianness, word width, and alignment. Sensory will provide the target needed for a specific port.

Example: Use spot-convert™ to convert the file 'trigger.snsr' to .bin and .c formats, for pc60 target

spot-convert -c -t trigger.snsr pc60w

The -c parameter generates .c files.
The -t parameter specifies the input task (.snsr).
The target is specified last.

snsr-eval™¶

Use Sensory's snsr-eval™ tool to simulate speech recognition. snsr-eval™ requires an input task (.snsr) and input utterance (.wav). It should produce timestamps and a score when a recognition occurs. The outputs from snsr-eval™ are taken as reference.

Example: Use snsr-eval™ to simulate speech recognition with 'trigger.snsr' and 'speech.wav'

snsr-eval -v -v -t trigger.snsr speech.wav

Definitions¶

A frame or brick is the basic unit of samples used by the recognition routine process and is 240 samples, or 15 msec, of audio data. Recognition works on 16-bit samples of audio data, collected at 16 kHz sample rate, mono.
A realtime application should continuously pass audio frames into the recognizer. The application may optionally use the Low Power Sound Detector (LPSD). The LPSD examines incoming frames and determines if they contain sound or silence; the recognizer can then be switched ON when sound is detected and OFF when silence is detected. Using the LPSD can allow for a low-power solution, if the DSP can remain in a low-power state during silence and quickly switch to a high-power state for recognition.
After a successful recognition, the application may use the provided endpoints, which allows the application to know the points in the audio buffer (start and end) where a successful recognition occurred. This is useful if succeeding audio samples will be passed to another process, or if the audio data is also being processed by another task.
The DSP application is a task which recognizes a single phrase or one of a group of phrases. There is a trigger (or wake word) phrase which will be listened for continuously. In a realtime application, recognizing the phrase should trigger other application actions, such as waking up another processor, beginning another phase of recognition, hardware I/O, etc.
In some applications, trigger recognition is followed by listening for one of a list of command phrases. For instance, controlling a CD player could involve a command set such as “Play Music” or “Next Song”. Usually command recognition is limited to a few seconds after the recognition of the trigger, after which the recognizer is reset to listen for the trigger phrase.
Typically the customer works with Sensory’s linguistics team or uses Sensory's VoiceHub™ tool to develop fixed vocabularies for the trigger and command phrases in the application. Usually these are spotted vocabularies, meaning that the phrases can be recognized even in noisy environments mixed with other sounds.
The library supports Fixed Trigger (FT) + Speaker Verification (SV), also called Enrolled Fixed Trigger (EFT), in which the Fixed Trigger acoustic model is adapted for a particular speaker by enrollment.
The library supports User Defined Trigger (UDT), in which both the acoustic model and the grammar are generated by a speaker for a particular phrase.
The library supports User Defined Trigger (UDT) + Speaker Verification (SV), also called User Defined Passphrase (UDP), in which the UDT acoustic model is adapted for a particular speaker by enrollment.
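The frame arithmetic from the definitions above (240 samples of 16-bit, 16 kHz mono audio per 15 msec frame) can be written out as a few helpers. This is an illustrative sketch only: the macro and function names are hypothetical and are not taken from sensorylib.h.

```c
#include <assert.h>
#include <stdint.h>

#define SAMPLE_RATE_HZ     16000u  /* 16 kHz, mono */
#define SAMPLES_PER_FRAME  240u    /* one frame (brick) */
#define BYTES_PER_SAMPLE   2u      /* 16-bit samples */

/* Duration of a whole number of frames, in msec (one frame = 15 msec). */
uint32_t frames_to_msec(uint32_t frames)
{
    return frames * (SAMPLES_PER_FRAME * 1000u / SAMPLE_RATE_HZ);
}

/* Number of whole frames needed to cover a duration, rounding up. */
uint32_t msec_to_frames(uint32_t msec)
{
    uint32_t samples;

    assert(msec <= UINT32_MAX / (SAMPLE_RATE_HZ / 1000u) - SAMPLES_PER_FRAME);
    samples = msec * (SAMPLE_RATE_HZ / 1000u);
    return (samples + SAMPLES_PER_FRAME - 1u) / SAMPLES_PER_FRAME;
}

/* Size in bytes of one frame of audio for one channel. */
uint32_t frame_bytes(void)
{
    return SAMPLES_PER_FRAME * BYTES_PER_SAMPLE;
}
```

For example, a command window of 3 seconds corresponds to `msec_to_frames(3000)`, which is 200 frames, and each frame occupies 480 bytes per channel.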

Accuracy Definitions¶

Sensory may deliver a whole set of searches, each of which represents one point on a frontier graph of False Accept/False Reject (FA/FR) points.
False Accept means erroneously recognizing an utterance as one of the vocabulary words.
False Reject means not recognizing an utterance that is one of the vocabulary words.
A search that allows more False Accept is said to be looser.
A search that allows more False Reject is said to be tighter.
THF-Micro™ strives to minimize both False Accept and False Reject, but there is a tradeoff.

Data Structures¶

t2siStruct¶

The THF-Micro™ application program must provide a t2siStruct, which contains the variables needed to describe the recognition task.

The t2siStruct contains the Sensory Persistent Pointer (SPP), the RAM memory THF-Micro™ will use for dynamic data. Note that the THF-Micro™ library is built so that it uses no static memory and does not dynamically allocate any of the memory it needs. The application program is responsible for allocating required RAM memory and passing its address in the spp field. The application should call SensoryAlloc to find out the amount of RAM the library needs, based on vocabulary size and recognition task descriptors in the t2siStruct. Note that the SPP must be available for the entire time of the recognition process; it cannot be reallocated between frames.

The application controls the t2siStruct parameters listed below. The application must set the net, gram, and spp fields. Consult Sensory before changing other fields.

`t2siStruct` Member	Description
`net`	Pointer to address of net data in memory.
`gram`	Pointer to address of search data in memory.
`spp`	Pointer to address of SPP in memory.
`extras`	Optional pointer to user resources.
`featureSource`	Can be set to another existing `t2siStruct`, if sharing features.
`sdet_type`	No LPSD: `sdet_type = SDET_NONE` (default). Use LPSD: `sdet_type = SDET_LPSD`.
`timeout`	Time to listen for a recognition, in seconds. 0 (default): 0 for triggers, 3 seconds for commands.
`maxTalkTime`	Maximum talk time, in seconds. 0 (default): Use the value from grammar.
`separator`	Sets trailing silence, in msec. 0 (default): Use the value from grammar. Range = [100, 900]. A larger value may be needed if a phrase contains internal silence (“cupcake”). Only used for non-spotted vocabulary.
`maxResults`	Max number of recognition results. 0 (default): 6.
`maxTokens`	Max number of tokens allocated for the search. 0 (default): 300. Use more for multiple phrases. Can be 1 for UDT/SV only.
`paramAOffset`	0 (default): Use the value from grammar.
`noVerify`	0 (default): Use the value from grammar. 1: Treat as UDT, so no `svScore` is calculated.
`lpsdFixedThresh`	Fixed part of LPSD threshold. 0 (default): Use the value from grammar.
`delay`	Delay for returning results, in msec. 0 (default): Use the value from grammar. Only used for spotted vocabulary.
`initFromLast`	0 (default): `SensoryProcessInit` performs total initialization. 1: `SensoryProcessInit` uses some history from previous processing.
`SvThreshold`	Speaker verification threshold score. 0 (default): Use the value from grammar.
`epqMinSNR`	0 (default): Use the value from grammar. 0xFFFF: Disable. Encoded as (u16)(256 * 10^(epqMinSNR / 10)). Range = [-24 dB, 24 dB], encoded to [1, 0xFB30].
`audioBufferLen`	Count of audio buffer samples for one channel. 0 (default): No buffering.
`audioBuffer`	Pointer to user’s audio buffer (optional). If NULL, `SensoryAlloc` will request memory for buffering.
`LPSDLatencyCounter`	Number of frames to 'catch up' when LPSD has detected sound. `LPSDLatencyCounter = 4` enables LPSD latency reduction by 4 frames = 4 * 15 msec = 60 msec.
`LPSDIncreasePowerMode`	Pointer to a callback function to be called when LPSD has detected sound (optional).
`LPSDDecreasePowerMode`	Pointer to a callback function to be called when LPSD has detected silence (optional).
`outOfMemory`	Set to TRUE during recognition process if the supply of search tokens has been exceeded.
`tokensPruned`	Set to TRUE during recognition process if the search has been pruned to free up search tokens.
`maxTokensUsed`	The maximum count of tokens used in a recognition.
`size`	Size of SPP in bytes.
`channels`	Number of audio channels being processed at one time.
`depth`	Number of frames in each channel to be processed at one time.
`brickCount`	Count of frames that have been processed in each audio channel.
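The `epqMinSNR` encoding in the table, 256 * 10^(dB / 10) cast to an unsigned 16-bit value, can be checked with a small helper. This is a sketch under stated assumptions: it handles whole-dB thresholds only, it assumes the cast truncates toward zero, and it uses a lookup table instead of `pow` so that it builds without a math library. The function and macro names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define EPQ_MIN_SNR_DEFAULT  0x0000u  /* use the value from the grammar */
#define EPQ_MIN_SNR_DISABLE  0xFFFFu  /* disable the check */

/* Encodes a whole-dB threshold in [-24, 24] as (u16)(256 * 10^(dB / 10)). */
uint16_t encode_epq_min_snr(int db)
{
    /* 10^(r / 10) for r = 0..9 */
    static const double frac[10] = {
        1.0,
        1.2589254117941673, 1.5848931924611136, 1.9952623149688795,
        2.5118864315095801, 3.1622776601683795, 3.9810717055349722,
        5.0118723362727229, 6.3095734448019325, 7.9432823472428150
    };
    double scaled = 256.0;
    int q, r, i;

    assert(db >= -24 && db <= 24);

    /* Split dB into tens (q, rounded toward minus infinity) and a remainder 0..9. */
    q = (db >= 0) ? db / 10 : -((-db + 9) / 10);
    r = db - 10 * q;

    for (i = 0; i < q; i++)
        scaled *= 10.0;
    for (i = 0; i > q; i--)
        scaled /= 10.0;

    return (uint16_t)(scaled * frac[r]);
}
```

The two ends of the documented range come out as `encode_epq_min_snr(-24) == 1` and `encode_epq_min_snr(24) == 0xFB30`, which matches the encoded range given in the table.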

infoStruct¶

The infoStruct is used to report the THF-Micro™ version number. Using it in an application is optional.

The infoStruct contains one member, version. The version format is eight hex digits: MMMmmppp. 'MMM' is the major revision, 'mm' is the minor revision, and 'ppp' is the point revision.

To get the THF-Micro™ version in the application, pass an existing infoStruct structure to SensoryInfo. Refer to section 'API Documentation' for more details.

RecoResult¶

Starting from THF-Micro™ version 8.0.0, the RecoResult structure is used to return recognition results. If a recognition did not happen, most fields will be zero.

The following RecoResult members may be used by an application:

`RecoResult` Member	Description
`error`	ERR_OK if recognition occurred. ERR_NOT_FINISHED if still processing audio. If other, refer to section 'API Documentation' for more details.
`channel`	Channel that this result came from.
`wordID`	When no recognition, `wordID = 0`. When recognition occurs, `wordID = index of wakeword or command recognized (non-zero)`. Refer to the grammar header file.
`duration`	When recognition occurs, the duration, in frames, of the utterance.
`recoState`	0: RecoStateNone. 1: RecoStateFailed. 2: RecoStatePending. 3: RecoStateDone. 4: RecoStateError.
`countDown`	If recognition is going to happen, the expected number of frames before it happens. May stay at the same number for a while.
`sdet_state`	0: SDET_LPSD_SILENCE. 1: SDET_MAKING_BLOCKS. 2: SDET_RECOGNIZING.
`brickStart`	When recognition occurs, the start `brickCount` of recognized utterance.
`brickEnd`	When recognition occurs, the end `brickCount` of recognized utterance.
`brickCount`	The current frame number for which this result is being created.
`startIndex`	When recognition occurs, the start index in the audio buffer of the utterance. Can be -1 if the start index is out of the buffer.
`endIndex`	When recognition occurs, the end index in the audio buffer of the utterance. Can be -1 if the end index is out of the buffer.
`startBackupFrames`	When recognition occurs, how far back the start of the utterance was.
`endBackupFrames`	When recognition occurs, how far back the end of the utterance was.

The following RecoResult members are useful for information/debugging, if recognition took place:

`RecoResult` Member	Description
`finalScore`	The final recognition score.
`nnpqPass`	NNPQ passed: `nnpqPass = TRUE`. NNPQ failed: `nnpqPass = FALSE`.
`nnpqThreshold`	Threshold that the score from the NNPQ check is compared to.
`svScore`	When speaker verification recognition occurs, the SV score.
`numResults`	The number of alternate results. Usually one.
`garbageScore`	The comparison score.
`nnpqScore`	The score from the NNPQ check.
`thf7Score`	The score from the THF 7 check.

Audio Buffer¶

The audio buffer is defined as an array of signed 16-bit integers. It can be allocated statically or dynamically. The application program manages this buffer using the t2siStruct, with the following constraints:

The audioBufferLen field sets the size of the buffer for one channel. It is usually defined as AUDIO_BUFFER_LEN in sensorylib.h. This can be changed at the app level without recompiling the THF-Micro™ library.
If the audioBuffer field is set, then this memory will be used. It must have room for (audioBufferLen * channels) samples. If not set, the audio buffer will come out of the space provided in spp.
The audio buffer size must be an integral multiple of 240 samples, or 15 msec.
The minimum audio buffer size depends on the application requirements, LPSD usage, and desire to save captured audio.
audioBufferLen can be zero, in which case there is no buffering.
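The constraints above can be checked before the buffer is handed to the library. This is a minimal sketch with hypothetical helper names; it only validates the per-channel length and computes the storage a caller-supplied audioBuffer would need.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SAMPLES_PER_FRAME 240u  /* 15 msec at 16 kHz */

/* Returns 1 if len is usable as a per-channel audioBufferLen:
 * either 0 (no buffering) or a whole number of frames. */
int audio_buffer_len_is_valid(size_t len)
{
    return len % SAMPLES_PER_FRAME == 0;
}

/* Bytes needed for a caller-supplied audioBuffer, which must hold
 * (audioBufferLen * channels) signed 16-bit samples. */
size_t audio_buffer_bytes(size_t audioBufferLen, unsigned channels)
{
    assert(audio_buffer_len_is_valid(audioBufferLen));
    assert(channels >= 1);
    return audioBufferLen * channels * sizeof(int16_t);
}
```

For example, a 270 msec buffer is 4320 samples per channel, so two channels need `audio_buffer_bytes(4320, 2)`, or 17280 bytes.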

Without trigger endpoint detection¶

For no LPSD, the minimum audio buffer size is 45 msec. It can be 0 for the default brick size (240 samples), which means no audio buffering. It needs to be larger if depth is more than 1.
For LPSD, the minimum audio buffer size should match BACKOFF_MS, which is defined in sensorylib.h. A typical value for BACKOFF_MS is 270 msec. Note that the application developer cannot change BACKOFF_MS, as it is integrated in the THF-Micro™ library.

With trigger endpoint detection¶

For no LPSD, the size depends on the delay value, and the value of ADDITIONAL_ENDPOINT_BACKOFF_FRAMES in sensorylib.h. The application developer cannot change the latter value, as it is integrated in THF-Micro™ library. Typically, for endpoint detection it is recommended that the default delay value be overridden by setting the delay to be 240 msec. A typical value for the AUDIO_BUFFER_MS in this scenario is 360 msec. This is suitable for a delay value of 240 msec and ADDITIONAL_ENDPOINT_BACKOFF_FRAMES of up to 6.
For LPSD, the size of the audio buffer should be the sum of the requirements of LPSD and endpoint detection. A typical value is 630 msec, which is the sum of 270 msec for LPSD and 360 msec for endpoint detection.
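The typical sizes quoted in the two subsections above can be combined into one lookup. This sketch hard-codes those typical figures (45, 270, and 360 msec) under hypothetical names; the real requirement depends on BACKOFF_MS, the delay value, and ADDITIONAL_ENDPOINT_BACKOFF_FRAMES for the delivered library, so the numbers should be confirmed against sensorylib.h.

```c
#include <assert.h>
#include <stdint.h>

#define FRAME_MSEC        15u
#define SAMPLES_PER_MSEC  16u   /* 16 kHz sample rate */

enum {
    TYPICAL_MIN_MSEC      = 45,   /* no LPSD, no endpoint detection */
    TYPICAL_LPSD_MSEC     = 270,  /* typical BACKOFF_MS */
    TYPICAL_ENDPOINT_MSEC = 360   /* delay of 240 msec plus endpoint backoff */
};

/* Typical audio buffer duration for a given configuration, in msec. */
uint32_t typical_buffer_msec(int use_lpsd, int use_endpoints)
{
    uint32_t msec = 0;

    if (use_lpsd)
        msec += TYPICAL_LPSD_MSEC;
    if (use_endpoints)
        msec += TYPICAL_ENDPOINT_MSEC;
    if (msec == 0)
        msec = TYPICAL_MIN_MSEC;  /* 0 (no buffering) is also allowed at depth 1 */
    return msec;
}

/* Converts a whole-frame duration to a per-channel sample count. */
uint32_t buffer_msec_to_samples(uint32_t msec)
{
    assert(msec % FRAME_MSEC == 0);
    return msec * SAMPLES_PER_MSEC;
}
```

With LPSD and endpoint detection both enabled, this gives 630 msec, or 10080 samples (42 frames) per channel.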

Rewinding Audio¶

When a wakeword (trigger) is followed by a command, backing up the audio a small amount after the wakeword recognition will improve accuracy for the command recognition.

Use the SensoryAudioRewind API to rewind the audio input pointer by a certain number of milliseconds (usually 90-300 msec).
Consult with Sensory to discover the appropriate amount to rewind for the wakeword (backoff in .snsr file).
Note that after “rewind,” the audio input pointer will be behind realtime by the number of milliseconds that were rewound.
Use SensoryAudioFastForward after command recognition or timeout to return the audio input pointer to the current realtime frame.
See details for SensoryAudioRewind and SensoryAudioFastForward in the API Documentation section.
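The rewind and fast-forward sequence can be modeled as a small state machine that tracks how far the input pointer lags behind realtime. This sketch does not call the library: the `ListenSession` type and its functions are hypothetical bookkeeping, the 3 second command window is an assumption based on the default command timeout, and the comments only mark where a real application would call SensoryAudioRewind and SensoryAudioFastForward.

```c
#include <assert.h>
#include <stddef.h>

#define FRAME_MSEC           15
#define COMMAND_WINDOW_MSEC  3000  /* assumed command listening window */

typedef enum { LISTEN_TRIGGER, LISTEN_COMMAND } ListenPhase;

typedef struct {
    ListenPhase phase;
    int lag_msec;             /* how far the input pointer is behind realtime */
    int command_frames_left;  /* frames remaining in the command window */
} ListenSession;

void session_init(ListenSession *s)
{
    assert(s != NULL);
    s->phase = LISTEN_TRIGGER;
    s->lag_msec = 0;
    s->command_frames_left = 0;
}

/* Wakeword recognized: a real application would call SensoryAudioRewind here. */
void session_on_trigger(ListenSession *s, int rewind_msec)
{
    assert(s != NULL && rewind_msec >= 0);
    s->phase = LISTEN_COMMAND;
    s->lag_msec = rewind_msec;
    s->command_frames_left = COMMAND_WINDOW_MSEC / FRAME_MSEC;
}

/* Called once per processed frame. Ends the command phase on a
 * recognized command or when the command window runs out. */
void session_on_frame(ListenSession *s, int command_recognized)
{
    assert(s != NULL);
    if (s->phase != LISTEN_COMMAND)
        return;
    if (command_recognized || --s->command_frames_left <= 0) {
        /* A real application would call SensoryAudioFastForward here. */
        s->phase = LISTEN_TRIGGER;
        s->lag_msec = 0;
    }
}
```

Keeping the lag in one place makes it easy to confirm that every path out of the command phase returns the input pointer to realtime.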

Recognition Results and Endpointing¶

Starting from version 8.0.0, recognition results are stored in the RecoResult structure, returned by SensoryProcessData. Important members of the RecoResult structure are described below.

Results¶

error: Error code set by SensoryProcessData. Usually ERR_OK or ERR_NOT_FINISHED.
wordID: ID for recognized phrase. This is specified in the grammar header file.

Endpointing¶

Starting from version 8.0.0, the user need not call SensoryFindEndpoint and SensoryFindStartpoint after recognition. Those values are automatically calculated and stored in the RecoResult.

brickStart: Brick where the recognized phrase started.
brickEnd: Brick where the recognized phrase ended.
brickCount: Current brick at time of result.
startIndex: Start index of the recognized phrase in the audio buffer.
endIndex: End index of the recognized phrase in the audio buffer.
startBackupFrames: Number of frames between the current brick and the start of the recognized phrase.
endBackupFrames: Number of frames between the current brick and the end of the recognized phrase.
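The brick counters above can be turned into time offsets when handing audio to another process. The sketch below uses a hypothetical `Endpoints` struct holding copies of three RecoResult values (the real field types are defined in sensorylib.h), and it assumes the utterance length is simply brickEnd minus brickStart; whether the end brick is inclusive should be confirmed with Sensory.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define FRAME_MSEC 15u

/* Copies of the endpoint counters from a recognition result (illustrative). */
typedef struct {
    uint32_t brickStart;  /* brick where the phrase started */
    uint32_t brickEnd;    /* brick where the phrase ended */
    uint32_t brickCount;  /* current brick at time of result */
} Endpoints;

/* Length of the recognized phrase, in msec. */
uint32_t utterance_msec(const Endpoints *e)
{
    assert(e != NULL && e->brickEnd >= e->brickStart);
    return (e->brickEnd - e->brickStart) * FRAME_MSEC;
}

/* How long ago the phrase started, relative to the current brick, in msec. */
uint32_t msec_since_start(const Endpoints *e)
{
    assert(e != NULL && e->brickCount >= e->brickStart);
    return (e->brickCount - e->brickStart) * FRAME_MSEC;
}

/* How long ago the phrase ended, relative to the current brick, in msec. */
uint32_t msec_since_end(const Endpoints *e)
{
    assert(e != NULL && e->brickCount >= e->brickEnd);
    return (e->brickCount - e->brickEnd) * FRAME_MSEC;
}
```

For a result with brickStart = 100, brickEnd = 160, and brickCount = 176, the phrase lasted 900 msec and ended 240 msec before the result was reported.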

Low Power Sound Detector¶

The THF-Micro™ library implements a Low Power Sound Detector (LPSD). Using the LPSD, an application can be constructed to run the DSP with lower power when no sound is detected and then switch to higher power to perform recognition when sound is detected. To use the LPSD:

Set the sdet_type field in the t2siStruct to SDET_LPSD, before calling SensoryProcessData.
Define an audio buffer large enough for LPSD. When using LPSD, SensoryProcessData does not fully process audio frames as they are acquired; it only decides whether they represent sound or not. When LPSD signals a transition from silence to sound, the speech recognition process goes back in the audio buffer by BACKOFF_MS, then begins the full recognition process.
The developer may optionally set callback functions in the t2siStruct. LPSDIncreasePowerMode should notify the THF-Micro™ application when sound is detected and increase the CPU cycle rate for recognition. The LPSDDecreasePowerMode should notify the THF-Micro™ application when silence is detected and decrease the CPU cycle rate to conserve power.

Using Multiple Channels/Depth¶

Usually, speech recognition will happen using one channel at a time (channels = 1) and one frame (depth = 1) at a time.

In some cases, if the vocabulary model is large and access to memory is slow, there may be great benefit in processing multiple audio frames at once. The model is loaded from memory once for all frames being processed.

If there are C audio channels (e.g., C microphone captures being processed), then the user should ask for C channels using SensoryAllocMulti. The model can then be loaded once for one call to SensoryProcessMultiData, for all the channels. Refer to section 'API Documentation' for more details.

Even if there is only one channel, the user can ask for depth D > 1 using SensoryAllocMulti. D audio frames will then be processed at once. The downside is an added latency of (D - 1) / 2 frames on average. If D = 2, this means an average added latency of 0.5 frames, or 7.5 msec.
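The latency cost of a given depth follows directly from the (D - 1) / 2 formula and the 15 msec frame length. A minimal sketch, with a hypothetical function name:

```c
#include <assert.h>

#define FRAME_MSEC 15.0

/* Average added latency, in msec, when processing `depth` frames at once. */
double added_latency_msec(unsigned depth)
{
    assert(depth >= 1);
    return (double)(depth - 1u) / 2.0 * FRAME_MSEC;
}
```

This gives 0 msec for D = 1, 7.5 msec for D = 2, and 22.5 msec for D = 4, which can be weighed against the saving from loading the model once per group of frames.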

If (C or D) > 1:

The user must use SensoryAllocMulti to set up the recognizer for multiple channels and/or depth.
The user must use SensoryProcessMulti to process audio and get results.
The user must use SensoryGetResult to get results for the specified channel and frame.
For THF-Micro™ version 8.x, the model used must be a newer (DNN) model.