Abstract:
The present invention combines a conventional audio microphone with an additional speech sensor that provides a speech sensor signal based on an input. The speech sensor signal is generated based on an action undertaken by a speaker during speech, such as facial movement, bone vibration, throat vibration, throat impedance changes, etc. A speech detector component receives an input from the speech sensor and outputs a speech detection signal indicative of whether a user is speaking. The speech detector generates the speech detection signal based on the microphone signal and the speech sensor signal.
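As a rough illustration of how a detector might combine the two signals, the sketch below flags a frame as speech only when both the microphone and the speech sensor show activity. The function name `detect_speech`, the per-frame energy inputs, the thresholds, and the AND fusion rule are all assumptions made for this example; the abstract does not specify them.

```python
from typing import List, Sequence


def detect_speech(mic_energy: Sequence[float],
                  sensor_energy: Sequence[float],
                  mic_threshold: float = 0.5,
                  sensor_threshold: float = 0.3) -> List[bool]:
    """Return one speech/no-speech flag per frame.

    Hypothetical fusion rule: a frame counts as speech only when both the
    microphone energy and the speech-sensor energy exceed their thresholds.
    """
    if len(mic_energy) != len(sensor_energy):
        raise ValueError("signals must have the same number of frames")
    return [m > mic_threshold and s > sensor_threshold
            for m, s in zip(mic_energy, sensor_energy)]


if __name__ == "__main__":
    # Frame 2 is loud on the microphone but quiet on the sensor
    # (for example, background noise), so it is not flagged as speech.
    print(detect_speech([0.9, 0.9, 0.1], [0.5, 0.1, 0.5]))
```

Requiring agreement between the two signals is only one possible choice; a weighted combination of the two would fit the description equally well.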
Abstract:
Prosodic databases hold fundamental frequency templates for use in a speech synthesis system. Prosodic database templates may hold fundamental frequency values for syllables in a given sentence. These fundamental frequency values may be applied in synthesizing a sentence of speech. The templates are indexed by tonal pattern markings. A predicted tonal marking pattern is generated for each sentence of text that is to be synthesized, and this predicted pattern of tonal markings is used to locate a best matching template. The templates are derived by calculating fundamental frequencies on a per-syllable basis for sentences that are spoken by a human trainer for a given unlabeled corpus.
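The template lookup described above can be sketched as a nearest-match search over tonal patterns. The marking symbols ("H", "L"), the mismatch-count distance, and the name `best_template` are illustrative assumptions; the abstract does not say how the best match is measured.

```python
from typing import Dict, List, Tuple

Pattern = Tuple[str, ...]


def best_template(predicted: Pattern,
                  templates: Dict[Pattern, List[float]]) -> List[float]:
    """Return the F0 values of the template whose tonal pattern is closest.

    Hypothetical distance: number of differing positions plus the
    difference in length between the two patterns.
    """
    def mismatches(pattern: Pattern) -> int:
        differing = sum(a != b for a, b in zip(pattern, predicted))
        return differing + abs(len(pattern) - len(predicted))

    return templates[min(templates, key=mismatches)]


if __name__ == "__main__":
    # Made-up database: tonal pattern -> per-syllable F0 values in Hz.
    database = {
        ("H", "L", "H"): [220.0, 180.0, 210.0],
        ("L", "L", "H"): [170.0, 175.0, 205.0],
    }
    print(best_template(("H", "L", "L"), database))
```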
Abstract:
A speech recognition system (36) is extensible in that new terms may be added to a list (42) of terms that are recognized by the speech recognition system. The speech recognition system provides audio feedback when new terms are added so that a user may hear how the system expects the word to be pronounced. The user may then accept the pronunciation or provide his own pronunciation. The user may also selectively change the pronunciation of words to avoid misrecognitions by the system. The system may provide appropriate user interface elements for enabling a user to change the pronunciation of words. The system may also include intelligence for automatically changing the pronunciation of words used in recognition based upon empirically derived information.
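A minimal sketch of the add-then-correct flow follows. It assumes a lexicon that proposes a pronunciation when a term is added and lets the user override it. The class `ExtensibleLexicon`, its methods, and the injected `guess` function are hypothetical; audio playback and the automatic, empirically driven updates are left out.

```python
from typing import Callable, Dict, List


class ExtensibleLexicon:
    """Hypothetical term list that stores one pronunciation per word."""

    def __init__(self, guess: Callable[[str], List[str]]) -> None:
        self._guess = guess  # stand-in for letter-to-sound rules
        self._pronunciations: Dict[str, List[str]] = {}

    def add_term(self, word: str) -> List[str]:
        """Add a word and return the pronunciation the system expects."""
        proposed = list(self._guess(word))
        self._pronunciations[word] = proposed
        return list(proposed)

    def set_pronunciation(self, word: str, phones: List[str]) -> None:
        """Replace the stored pronunciation with one chosen by the user."""
        if word not in self._pronunciations:
            raise KeyError(word)
        self._pronunciations[word] = list(phones)

    def pronunciation(self, word: str) -> List[str]:
        return list(self._pronunciations[word])


if __name__ == "__main__":
    # Toy guesser: spell the word out letter by letter.
    lexicon = ExtensibleLexicon(lambda w: list(w.upper()))
    print(lexicon.add_term("gif"))
    lexicon.set_pronunciation("gif", ["JH", "IH", "F"])
    print(lexicon.pronunciation("gif"))
```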
Abstract:
A method and system for editing words that have been misrecognized. The system allows a speaker to specify the number of alternative words to be displayed in a correction window by resizing the correction window. The system also displays the words in the correction window in alphabetical order. A preferred system eliminates the possibility, when a misrecognized word is respoken, that the respoken utterance will again be recognized as the same misrecognized word. The system, when operating with a word processor, allows the speaker to specify the amount of speech that is buffered before it is transferred to the word processor.
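The correction-window behavior can be sketched as a small filter-and-sort step. The function name, the `(word, score)` input format, and the choice to keep the highest-scoring alternatives before alphabetizing are assumptions made for illustration.

```python
from typing import Iterable, List, Set, Tuple


def correction_candidates(alternatives: Iterable[Tuple[str, float]],
                          misrecognized: Set[str],
                          window_rows: int) -> List[str]:
    """Pick the words to show in a correction window.

    Words already rejected as misrecognitions are excluded, the
    `window_rows` best-scoring remaining words are kept, and the result is
    returned in alphabetical order.
    """
    remaining = [(w, s) for w, s in alternatives if w not in misrecognized]
    top = sorted(remaining, key=lambda ws: ws[1], reverse=True)[:window_rows]
    return sorted(word for word, _ in top)


if __name__ == "__main__":
    scored = [("there", 0.9), ("their", 0.8), ("they're", 0.6), ("the", 0.2)]
    # "there" was the misrecognized word; the window has room for two rows.
    print(correction_candidates(scored, {"there"}, 2))
```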
Abstract:
A method of recognizing speech, comprising:
receiving input data indicative of the speech to be recognized; detecting pauses in the speech, based on the input data, to identify a phrase duration; generating a plurality of phrase hypotheses representative of likely word phrases represented by the input data between the pauses detected; comparing a word duration associated with each word in each phrase hypothesis, based on a number of words in the phrase hypothesis and based on the phrase duration, with an expected word duration for a phrase having a number of words equal to the number of words in the phrase hypothesis; and assigning a score to each phrase hypothesis based on the comparison of the word duration with the expected word duration to obtain a most likely phrase hypothesis represented by the input data.
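The duration comparison in the method above can be sketched as follows. The example assumes that the word duration of a hypothesis is the phrase duration divided by its word count, and that the expected duration for each word count is given as a mean and standard deviation. The Gaussian-style score, the function names, and the numbers are illustrative assumptions, not values from the abstract.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical table: word count -> (mean, std) of per-word duration in seconds.
ExpectedDurations = Dict[int, Tuple[float, float]]


def duration_score(num_words: int, phrase_duration: float,
                   expected: ExpectedDurations) -> float:
    """Score how well the average word duration fits the expectation.

    Returns a log-likelihood-like value (0 is a perfect fit, more negative
    is worse), up to an additive constant.
    """
    mean, std = expected[num_words]
    observed = phrase_duration / num_words
    z = (observed - mean) / std
    return -0.5 * z * z


def most_likely(hypotheses: Sequence[List[str]], phrase_duration: float,
                expected: ExpectedDurations) -> List[str]:
    """Return the phrase hypothesis with the best duration score."""
    return max(hypotheses,
               key=lambda h: duration_score(len(h), phrase_duration, expected))


if __name__ == "__main__":
    expected = {1: (0.60, 0.15), 2: (0.45, 0.10), 3: (0.35, 0.10)}
    hypotheses = [["speech"], ["recognize", "speech"], ["wreck", "a", "nice"]]
    # 0.9 s between pauses: two words of about 0.45 s each fit best.
    print(most_likely(hypotheses, 0.9, expected))
```

In a full recognizer a score like this would presumably be combined with other evidence rather than used alone; the sketch isolates the duration term.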
Abstract:
The present invention pertains to a concatenative speech synthesis system and method which produces more natural-sounding speech. The system provides for multiple instances of each acoustic unit which can be used to generate a speech waveform representing a linguistic expression. The multiple instances are formed during an analysis or training phase of the synthesis process and are limited to a robust representation of the highest probability instances. The provision of multiple instances enables the synthesizer to select the instance which most closely resembles the desired instance, thereby eliminating the need to alter the stored instance to match the desired instance. This in essence minimizes the spectral distortion between the boundaries of adjacent instances, thereby producing more natural-sounding speech.
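Instance selection of this kind can be sketched as a greedy search: for each unit, choose the stored instance whose starting spectrum is closest to the ending spectrum of the previously chosen instance. The feature vectors, the Euclidean distance, the greedy strategy, and all names are assumptions made for this example; the abstract does not describe the actual search procedure.

```python
from typing import Dict, List, Optional, Sequence, Tuple

# Hypothetical instance: (spectral features at start, spectral features at end).
Instance = Tuple[Sequence[float], Sequence[float]]


def distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two feature vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5


def select_instances(units: List[str],
                     inventory: Dict[str, List[Instance]]) -> List[int]:
    """Greedily pick one instance index per unit to reduce boundary mismatch."""
    chosen: List[int] = []
    prev_end: Optional[Sequence[float]] = None
    for unit in units:
        candidates = inventory[unit]
        if prev_end is None:
            index = 0  # no left neighbor yet; take the first stored instance
        else:
            end = prev_end
            index = min(range(len(candidates)),
                        key=lambda i: distance(end, candidates[i][0]))
        chosen.append(index)
        prev_end = candidates[index][1]
    return chosen


if __name__ == "__main__":
    inventory = {
        "a": [((0.0,), (1.0,))],
        "b": [((5.0,), (5.0,)), ((1.1,), (2.0,))],
    }
    # The second instance of "b" starts close to where "a" ends.
    print(select_instances(["a", "b"], inventory))
```

A greedy pass can miss the globally best sequence; a dynamic-programming search over all instances would be the natural refinement.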