Abstract:
A system and method are presented for optimization of audio fingerprint search. In an embodiment, the audio fingerprints are organized into a recursive tree with different branches containing fingerprint sets that are dissimilar to each other. The tree is constructed using a clustering algorithm based on a similarity measure. The similarity measure may comprise a Hamming distance for binary fingerprints or a Euclidean distance for continuous-valued fingerprints. In another embodiment, each fingerprint is stored at a plurality of resolutions and clustering is performed hierarchically. The recognition of an incoming fingerprint begins from the root of the tree and proceeds down its branches until a match or mismatch is declared. In yet another embodiment, the fingerprint definition is generalized to include more detailed audio information than in the previous definition.
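A minimal sketch of such a tree, assuming binary fingerprints stored as NumPy arrays and Hamming distance as the similarity measure; the two-way split, leaf size, and match threshold are placeholder choices, and the crude seed-based split stands in for whatever clustering algorithm the system actually uses:

```python
import numpy as np

def hamming(a, b):
    """Hamming distance between two equal-length binary fingerprints."""
    return int(np.count_nonzero(a != b))

class FingerprintNode:
    """One node of the recursive fingerprint tree; leaves hold small
    fingerprint sets, inner nodes hold one representative per branch."""

    def __init__(self, fingerprints, leaf_size=8, branching=2, seed=0):
        self.fingerprints = fingerprints
        self.centroids, self.children = [], []
        rng = np.random.default_rng(seed)
        if len(fingerprints) <= leaf_size:
            return
        # Crude split: pick `branching` seed fingerprints and assign every
        # fingerprint to the nearest seed by Hamming distance.
        seeds = rng.choice(len(fingerprints), size=branching, replace=False)
        centroids = [fingerprints[i] for i in seeds]
        buckets = [[] for _ in range(branching)]
        for fp in fingerprints:
            buckets[int(np.argmin([hamming(fp, c) for c in centroids]))].append(fp)
        kept = [(c, b) for c, b in zip(centroids, buckets) if b]
        if len(kept) < 2:   # degenerate split -> keep this node as a leaf
            return
        self.centroids = [c for c, _ in kept]
        self.children = [FingerprintNode(b, leaf_size, branching, seed)
                         for _, b in kept]

    def search(self, query, threshold):
        """Descend into the most similar branch; at a leaf, declare a match
        if the closest fingerprint is within `threshold`, else a mismatch."""
        if not self.children:
            best = min(self.fingerprints, key=lambda fp: hamming(query, fp))
            return best if hamming(query, best) <= threshold else None
        i = int(np.argmin([hamming(query, c) for c in self.centroids]))
        return self.children[i].search(query, threshold)
```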
Abstract:
A method is presented for training the neural network of a neural-network-based speaker classifier for use in speaker change detection. The method comprises: a) preprocessing input speech data; b) extracting a plurality of feature frames from the preprocessed input speech data; c) normalizing the extracted feature frames of each speaker within the preprocessed input speech data with that speaker's mean and variance; d) concatenating the normalized feature frames to form overlapped longer frames having a frame length and a hop size; e) inputting the overlapped longer frames to the neural-network-based speaker classifier; and f) training the neural network through forward and backward propagation.
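A minimal sketch of steps (c) and (d), assuming the feature frames (for example, MFCCs) have already been extracted as NumPy arrays; the frame length and hop size shown are placeholder values, and the resulting overlapped longer frames would then be fed to the neural network classifier in steps (e) and (f):

```python
import numpy as np

def normalize_per_speaker(frames_by_speaker):
    """Step (c): z-normalize each speaker's feature frames (shape
    [n_frames, n_dims]) with that speaker's own mean and variance."""
    normalized = {}
    for speaker, frames in frames_by_speaker.items():
        mu = frames.mean(axis=0)
        sigma = frames.std(axis=0) + 1e-8   # guard against zero variance
        normalized[speaker] = (frames - mu) / sigma
    return normalized

def concatenate_frames(frames, frame_len=10, hop=3):
    """Step (d): stack `frame_len` consecutive feature frames into one
    overlapped longer frame, advancing by `hop` frames between windows.
    Assumes the input has at least `frame_len` frames."""
    longer = [frames[s:s + frame_len].reshape(-1)
              for s in range(0, len(frames) - frame_len + 1, hop)]
    return np.stack(longer)
```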
Abstract:
A system and method for learning alternate pronunciations for speech recognition are disclosed. Through pronunciation learning, alternative name pronunciations that are not covered in a general pronunciation dictionary may be learned. In an embodiment, the detection of phone-level and syllable-level mispronunciations in words and sentences may be based on acoustic models trained as Hidden Markov Models. Mispronunciations may be detected by comparing the likelihood of the potential state of the target pronunciation unit with a pre-determined threshold through a series of tests. It is also within the scope of an embodiment to detect accents.
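The abstract does not spell out the form of the likelihood tests; one common way to realize a per-unit likelihood comparison against a threshold is a goodness-of-pronunciation (GOP)-style score, sketched below under the assumption that per-unit log-likelihoods from HMM forced alignment are already available (the threshold value is a placeholder):

```python
def gop_score(target_loglik, competing_logliks, n_frames):
    """GOP-style score: per-frame log-likelihood of the expected unit minus
    that of the best competing unit under the HMM acoustic models; low
    scores suggest a mispronunciation."""
    return (target_loglik - max(competing_logliks)) / n_frames

def detect_mispronunciations(units, threshold=-2.0):
    """units: iterable of (label, target_loglik, competing_logliks, n_frames)
    tuples, e.g. obtained by forced alignment of the expected pronunciation.
    Returns the labels of units scoring below the pre-determined threshold."""
    return [label for label, tgt, comp, n in units
            if gop_score(tgt, comp, n) < threshold]
```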
Abstract:
A system and method are presented for the correction of packet loss in audio in automatic speech recognition (ASR) systems. Packet loss correction, as presented herein, occurs at the recognition stage without modifying any of the acoustic models generated during training. The behavior of the ASR engine in the absence of packet loss is thus not altered. To accomplish this, the actual input signal may be rectified, the recognition scores may be normalized to account for signal errors, and a best-estimate method using information from previous frames and acoustic models may be used to replace the noisy signal.
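As an illustration of the best-estimate idea at the feature and scoring level, the sketch below assumes per-frame features and acoustic scores are available together with a mask marking frames affected by packet loss; the last-good-frame substitution and the down-weighting factor are placeholder choices, not the specific method of the abstract:

```python
import numpy as np

def repair_lost_frames(features, lost_mask):
    """Replace feature frames flagged as lost with a best estimate derived
    from previous reliable frames (here simply the last good frame; a
    model-based prediction from the acoustic models could be substituted)."""
    repaired = features.copy()
    last_good = None
    for t in range(len(features)):
        if lost_mask[t] and last_good is not None:
            repaired[t] = last_good
        elif not lost_mask[t]:
            last_good = features[t]
    return repaired

def normalize_utterance_score(frame_scores, lost_mask, repaired_weight=0.5):
    """Down-weight the acoustic scores of repaired frames so that a burst of
    packet loss does not dominate the utterance-level recognition score
    (repaired_weight is a placeholder value)."""
    weights = np.where(lost_mask, repaired_weight, 1.0)
    return float(np.sum(frame_scores * weights) / np.sum(weights))
```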
Abstract:
Technologies for authenticating a speaker in a voice authentication system using voice biometrics include a speech collection computing device and a speech authentication computing device. The speech collection computing device is configured to collect a speech signal from a speaker and transmit the speech signal to the speech authentication computing device. The speech authentication computing device is configured to compute a speech signal feature vector for the received speech signal, retrieve a speech signal classifier associated with the speaker, and feed the speech signal feature vector to the retrieved speech signal classifier. Additionally, the speech authentication computing device is configured to determine whether the speaker is an authorized speaker based on an output of the retrieved speech signal classifier. Additional embodiments are described herein.
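A minimal sketch of the authentication flow, assuming speech-signal feature vectors have already been computed; the SVM classifier, the binary genuine/impostor labeling, and the acceptance threshold are placeholder choices standing in for whatever per-speaker classifier the system actually stores:

```python
import numpy as np
from sklearn.svm import SVC

classifiers = {}   # per-speaker classifier store, keyed by claimed identity

def enroll_speaker(speaker_id, feature_vectors, labels):
    """Train and store a classifier for one speaker. `labels` marks each
    enrollment vector as genuine (1) or impostor (0); an SVM stands in for
    the actual speech signal classifier."""
    clf = SVC(probability=True)
    clf.fit(feature_vectors, labels)
    classifiers[speaker_id] = clf

def authenticate(speaker_id, speech_feature_vector, accept_threshold=0.5):
    """Retrieve the claimed speaker's classifier, feed it the speech-signal
    feature vector, and accept or reject based on its output."""
    clf = classifiers.get(speaker_id)
    if clf is None:
        return False
    genuine_prob = clf.predict_proba(speech_feature_vector.reshape(1, -1))[0, 1]
    return bool(genuine_prob >= accept_threshold)
```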
Abstract:
A system and method are presented for neural-network-based feature extraction for acoustic model development. A neural network may be used to extract acoustic features from raw MFCCs or from the spectrum. Feature extraction may be performed by optimizing a cost function drawn from linear discriminant analysis; however, where classical linear discriminant analysis performs only linear operations on the MFCCs to generate lower-dimensional features, general non-linear functions generated by the neural network are used for the transformation. The extracted acoustic features may then be used for training acoustic models for speech recognition systems.
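A minimal sketch of the idea, assuming MFCC frames with frame-level class labels (for example, HMM state or phone labels) and using PyTorch; the network sizes, the exact form of the LDA-style scatter-ratio cost, and the training settings are placeholder choices:

```python
import torch
import torch.nn as nn

class FeatureNet(nn.Module):
    """Small non-linear network mapping MFCC frames to lower-dimensional
    acoustic features (layer sizes are placeholders)."""
    def __init__(self, in_dim=39, hidden=256, out_dim=13):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.Tanh(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)

def lda_cost(features, labels):
    """LDA-style cost: within-class scatter divided by between-class scatter
    of the extracted features, so that minimizing it separates the classes."""
    overall_mean = features.mean(dim=0)
    within = features.new_zeros(())
    between = features.new_zeros(())
    for c in labels.unique():
        cls = features[labels == c]
        mu = cls.mean(dim=0)
        within = within + ((cls - mu) ** 2).sum()
        between = between + cls.shape[0] * ((mu - overall_mean) ** 2).sum()
    return within / (between + 1e-8)

def train(mfccs, states, epochs=10, lr=1e-3):
    """mfccs: [n_frames, n_dims] float tensor; states: [n_frames] int labels."""
    model = FeatureNet(in_dim=mfccs.shape[1])
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        optimizer.zero_grad()
        loss = lda_cost(model(mfccs), states)
        loss.backward()
        optimizer.step()
    return model
```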