Abstract:
The present invention describes a method and arrangement for reducing a sequence of initial frames into a reduced set of representative frames by combining the initial frames into a plurality of representative frames, the combining process including generating a distortion measure associated with each representative frame and comparing each distortion measure to a distortion threshold. From these representative frames, a set of mutually exclusive frames is determined to minimize the number of representative frames, whereby each representative frame in the set represents a unique set of contiguous initial frames and has an associated distortion measure which does not exceed the distortion threshold.
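The reduction described above can be illustrated with a minimal Python sketch. It is not the patented procedure: it assumes the representative frame is the mean of the frames it covers and that the distortion measure is the largest squared Euclidean distance from that mean to any covered frame. Neither choice is stated in the abstract, and the names `reduce_frames`, `average` and `distortion` are hypothetical. A dynamic program over contiguous segments then picks the fewest representatives whose distortion stays within the threshold.

```python
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def average(frames: Sequence[Frame]) -> List[float]:
    """Component-wise mean of a non-empty run of frames (assumed representative)."""
    n = len(frames)
    return [sum(f[k] for f in frames) / n for k in range(len(frames[0]))]


def distortion(frames: Sequence[Frame], rep: Frame) -> float:
    """Assumed distortion: worst squared Euclidean distance from rep to a member."""
    return max(sum((a - b) ** 2 for a, b in zip(f, rep)) for f in frames)


def reduce_frames(frames: Sequence[Frame], threshold: float) -> List[Tuple[List[float], int]]:
    """Return (representative, frames_combined) pairs covering all frames in order."""
    if threshold < 0:
        raise ValueError("threshold must be non-negative")
    n = len(frames)
    inf = float("inf")
    best = [0.0] + [inf] * n  # best[i]: fewest representatives covering frames[:i]
    back = [0] * (n + 1)      # back[i]: start index of the last segment ending at i
    for end in range(1, n + 1):
        for start in range(end - 1, -1, -1):
            segment = frames[start:end]
            within = distortion(segment, average(segment)) <= threshold
            if within and best[start] + 1 < best[end]:
                best[end] = best[start] + 1
                back[end] = start
    # Walk the back-pointers to recover the chosen contiguous segments.
    reduced = []
    end = n
    while end > 0:
        start = back[end]
        segment = frames[start:end]
        reduced.append((average(segment), len(segment)))
        end = start
    reduced.reverse()
    return reduced
```

Each returned pair keeps the count of combined frames alongside the representative, so a later matching stage can take that count into account.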
Abstract:
Described herein is an arrangement and method for processing speech information in a speech recognition system (300). In such a system where the speech information is depicted as words, each word representing a sequence of frames (510) and where the recognition system has means (120) for comparing present input speech to a word template, the word template stored in template memory and derived from one or more previous input words, the present invention is best employed. The invention describes combining contiguous acoustically similar frames (512) derived from the previous input word or words into representative frames to form a corresponding reduced word template, storing the reduced word template in template memory in an efficient manner, and comparing frames of the present input speech to the representative frames of the reduced word template according to the number of frames combined in the representative frames of the reduced word template. In doing so, a measure of similarity between the present input speech and the word template is generated.
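As a rough illustration of comparing input frames to representative frames "according to the number of frames combined", the Python sketch below aligns a reduced template to an input utterance. It rests on assumptions the abstract does not make: the template is a list of (representative, count) pairs, a representative that stands for `count` frames may absorb between half and twice that many input frames, and the frame distance is squared Euclidean. The names `match_reduced_template` and `frame_distance` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Frame = Sequence[float]


def frame_distance(a: Frame, b: Frame) -> float:
    """Assumed frame distance: squared Euclidean."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def match_reduced_template(template: List[Tuple[Frame, int]], speech: Sequence[Frame]) -> float:
    """Lowest accumulated distance aligning every template entry to the whole input.

    Returns math.inf when no alignment satisfies the assumed dwell limits.
    """
    n_in = len(speech)
    cost = [math.inf] * (n_in + 1)  # cost[i]: best score with i input frames consumed
    cost[0] = 0.0
    for rep, count in template:
        lo = max(1, (count + 1) // 2)  # assumed minimum dwell: half the count
        hi = 2 * count                 # assumed maximum dwell: twice the count
        nxt = [math.inf] * (n_in + 1)
        for start in range(n_in):
            if math.isinf(cost[start]):
                continue
            acc = 0.0
            for dwell in range(1, hi + 1):
                end = start + dwell
                if end > n_in:
                    break
                acc += frame_distance(rep, speech[end - 1])
                if dwell >= lo and cost[start] + acc < nxt[end]:
                    nxt[end] = cost[start] + acc
        cost = nxt
    return cost[n_in]
```

A lower return value means the input is more similar to the template; an infinite value means the input is too short or too long for the template under the assumed dwell limits.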
Abstract:
The present invention is a method and apparatus for preventing overflow and underflow of an encoder buffer in a video compression system. A virtual buffer is created in a rate controller to model the decoder buffer fullness (102). A sequence of bits is generated by an encoder (104). The encoder is controlled by the rate controller to prevent a decoder buffer underflow and overflow. Then, the sequence of bits is received by the encoder buffer to produce a bitstream (106). The bitstream corresponds to an instantaneous channel bitrate. The bitstream is transmitted from the encoder buffer to a decoder buffer following a delay (108). The delay is controlled by the rate controller to synchronize an encoder buffer fullness with a virtual buffer fullness (110). The synchronization prevents overflow and underflow of the encoder buffer.
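The role of the virtual buffer can be sketched in Python under a deliberately simple model that is an assumption, not the patent's: in each frame interval the channel first delivers `channel_bits` to the decoder buffer, and the decoder then removes one coded frame. The rate controller keeps the modelled fullness between zero and the buffer size by bounding the bits the encoder may spend on that frame. The class `VirtualBuffer` and its methods are hypothetical names.

```python
from typing import Tuple


class VirtualBuffer:
    """Rate-controller model of decoder buffer fullness, in bits."""

    def __init__(self, size: int, fullness: int = 0) -> None:
        if not 0 <= fullness <= size:
            raise ValueError("initial fullness must lie within the buffer")
        self.size = size
        self.fullness = fullness

    def limits(self, channel_bits: int) -> Tuple[int, int]:
        """Smallest and largest frame size that keep the model in range."""
        available = self.fullness + channel_bits
        min_bits = max(0, available - self.size)  # fewer bits would overflow
        max_bits = available                      # more bits would underflow
        return min_bits, max_bits

    def allocate(self, requested_bits: int, channel_bits: int) -> int:
        """Clamp the encoder's request to the limits and update the model."""
        min_bits, max_bits = self.limits(channel_bits)
        granted = min(max(requested_bits, min_bits), max_bits)
        self.fullness += channel_bits - granted
        return granted
```

In this model a request below the lower limit would have to be padded, for example with stuffing bits, and a request above the upper limit would have to be met by coarser quantization; both remedies are outside the sketch.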
Abstract:
A user-interactive control system for an electronic device which synthesizes speech from speech recognition templates to generate voice reply feedback to the user indicative of which template word was recognized. The acoustic features of the user-spoken speech are extracted by the acoustic processor (110) and applied to the training processor (170) to generate word recognition templates stored in the template memory (160). Recognition processor (120) compares the user-spoken features to the recognition templates to provide voice command data for the device controller (130) which controls the operating parameters of the electronic device (150). The device controller also produces device status data for the synthesis processor (140) which synthesizes a speech reply signal from the word recognition templates. In the preferred embodiment, a hands-free user-interactive control system for a mobile radiotelephone is provided utilizing speech synthesis from speech recognition templates.
Abstract:
A channel bank speech synthesizer for reconstructing speech from externally-generated acoustic feature information without using externally-generated voicing or pitch information is disclosed. An N-channel pitch-excited channel bank synthesizer (340) is provided having a first low-frequency group of channel gain values (1 to M) and a second high-frequency group of channel gain values (M+1 to N). The first group controls a first group of amplitude modulators (950) excited by a periodic pitch pulse source (920), and the second group controls amplitude modulators excited by a noise source (930). Both groups of modulated excitation signals are applied to the bandpass filters (960) to reconstruct the speech channels, and then combined at the summation network (970) to form a reconstructed synthesized speech signal. Additionally, the pitch pulse source (920) varies the pitch pulse period such that the pitch pulse rate decreases over the length of the word.
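The excitation stage of such a synthesizer can be sketched in Python. The sketch covers only the pulse source and the amplitude modulators; the bandpass filters (960) and summation network (970) are omitted. It assumes a pitch period that grows linearly across the word and uniform random noise for the high-frequency group; the abstract specifies neither, and the function names are hypothetical.

```python
import random
from typing import List, Sequence


def pitch_pulses(n_samples: int, start_period: float, end_period: float) -> List[float]:
    """Unit pulse train whose period moves linearly from start_period to end_period."""
    if start_period <= 0 or end_period <= 0:
        raise ValueError("periods must be positive")
    out = [0.0] * n_samples
    pos = 0.0
    while pos < n_samples:
        out[int(pos)] = 1.0
        progress = pos / n_samples  # fraction of the word already generated
        pos += start_period + (end_period - start_period) * progress
    return out


def modulated_excitations(
    gains: Sequence[float], m: int, pulses: Sequence[float], rng: random.Random
) -> List[List[float]]:
    """Scale the pulse source for the first m channels and a noise source for the rest."""
    channels = []
    for index, gain in enumerate(gains):
        if index < m:
            source = pulses  # low-frequency group: pitch-pulse excitation
        else:
            source = [rng.uniform(-1.0, 1.0) for _ in pulses]  # assumed noise source
        channels.append([gain * sample for sample in source])
    return channels
```

Choosing `end_period` larger than `start_period` makes the pulses spread out toward the end of the word, which corresponds to a falling pitch pulse rate.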
Abstract:
Arrangement and method for processing speech information in a speech recognition system. In such a system where the speech information is depicted as words, each word representing a sequence of frames and where the recognition system has means for comparing present input speech to a word template, the word template stored in template memory (160) and derived from one or more previous input words, the present invention is best employed. The invention describes combining (322) contiguous acoustically similar frames derived from the previous input word or words into representative frames to form a corresponding reduced word template, storing the reduced word template in template memory (160) in an efficient manner, and comparing (326) frames of the present input speech to the representative frames of the reduced word template according to the number of frames combined in the representative frames of the reduced word template. In doing so, a measure of similarity between the present input speech and the word template is generated.
Abstract:
Method and arrangement for a speech recognition system using channel bank information to represent speech. The method considers background noise included with the speech. The method includes determining three energy levels for each channel, the first representative of background noise energy (20), the second representative of the input frame energy (16) and the third representative of the word template frame energy (18). Values representing energy level differentials are assigned at each channel. If the second energy level is less than the first energy level, then a predetermined constant value is assigned at that particular channel. These values are combined to generate a distance measure depicting the similarity between the two frames.
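The per-channel rule above can be sketched in Python. The abstract does not say how the differential is computed or how the channel values are combined, so this sketch assumes the absolute difference between input and template energy, summed over channels. The function name and the `masked_value` parameter, which stands in for the predetermined constant, are hypothetical.

```python
from typing import Sequence


def noise_aware_distance(
    noise: Sequence[float],
    frame: Sequence[float],
    template: Sequence[float],
    masked_value: float = 0.0,
) -> float:
    """Distance between an input frame and a template frame, channel by channel."""
    if not (len(noise) == len(frame) == len(template)):
        raise ValueError("all three energy vectors need one value per channel")
    total = 0.0
    for noise_energy, frame_energy, template_energy in zip(noise, frame, template):
        if frame_energy < noise_energy:
            # Input is buried in background noise: use the predetermined constant.
            total += masked_value
        else:
            # Assumed differential: absolute energy difference.
            total += abs(frame_energy - template_energy)
    return total
```

With noise energies of 10 in every channel, an input frame of (20, 5, 30) and a template frame of (18, 40, 30), only the second channel falls below the noise level and receives the constant instead of its large raw difference.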