Speech recognition using connectionist temporal classification

    Publication No.: US10580432B2

    Publication Date: 2020-03-03

    Application No.: US15908115

    Application Date: 2018-02-28

    Abstract: Generally discussed herein are devices, systems, and methods for speech recognition. Processing circuitry can implement a connectionist temporal classification (CTC) neural network (NN) including an encode NN to receive an audio frame and generate a current encoded hidden feature vector, an attend NN to generate, based on the current encoded hidden feature vector and a first context vector from a previous time slice, a weight vector indicating how much the current encoded hidden feature vector, a previous encoded hidden feature vector, and a future encoded hidden feature vector from a future time slice contribute to a current, second context vector, an annotate NN to generate the current, second context vector based on the weight vector, the current encoded hidden feature vector, the previous encoded hidden feature vector, and the future encoded hidden feature vector, and a normal NN to generate a normalized output vector based on the second context vector.
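    The abstract above walks through a four-stage pipeline (encode, attend, annotate, normalize). Below is a minimal sketch of how such a pipeline could be wired together, assuming PyTorch; the class name, layer sizes, and the three-frame attention window over the previous, current, and future encodings are illustrative assumptions, not the patented implementation.

```python
# Illustrative sketch (not the patented implementation): a CTC-style model whose
# attention weights over past / current / future encoded frames are produced from
# the current encoding and the previous context vector. Layer sizes are arbitrary.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionCTC(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=320, num_tokens=29):
        super().__init__()
        self.encode = nn.Linear(feat_dim, hidden_dim)    # "encode NN"
        self.attend = nn.Linear(2 * hidden_dim, 3)       # "attend NN": weights for prev/cur/future
        self.normal = nn.Linear(hidden_dim, num_tokens)  # "normal NN": normalized output

    def forward(self, frames):
        # frames: (T, feat_dim) audio features for one utterance
        h = torch.tanh(self.encode(frames))              # encoded hidden feature vectors
        T, d = h.shape
        context = torch.zeros(d)                         # first context vector
        outputs = []
        for t in range(T):
            h_prev = h[t - 1] if t > 0 else torch.zeros(d)
            h_fut = h[t + 1] if t < T - 1 else torch.zeros(d)
            # attend NN: weight vector from current encoding + previous context
            w = F.softmax(self.attend(torch.cat([h[t], context])), dim=-1)
            # annotate step: weighted combination of previous / current / future encodings
            context = w[0] * h_prev + w[1] * h[t] + w[2] * h_fut
            outputs.append(F.log_softmax(self.normal(context), dim=-1))
        return torch.stack(outputs)                      # (T, num_tokens) for a CTC loss
```

    The (T, num_tokens) log-probabilities produced this way could then be trained with a standard CTC objective such as torch.nn.CTCLoss.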

    Speech recognition using connectionist temporal classification

    Publication No.: US20190267023A1

    Publication Date: 2019-08-29

    Application No.: US15908115

    Application Date: 2018-02-28

    Abstract: Generally discussed herein are devices, systems, and methods for speech recognition. Processing circuitry can implement a connectionist temporal classification (CTC) neural network (NN) including an encode NN to receive an audio frame and generate a current encoded hidden feature vector, an attend NN to generate, based on the current encoded hidden feature vector and a first context vector from a previous time slice, a weight vector indicating how much the current encoded hidden feature vector, a previous encoded hidden feature vector, and a future encoded hidden feature vector from a future time slice contribute to a current, second context vector, an annotate NN to generate the current, second context vector based on the weight vector, the current encoded hidden feature vector, the previous encoded hidden feature vector, and the future encoded hidden feature vector, and a normal NN to generate a normalized output vector based on the second context vector.

    Domain adaptation in speech recognition via teacher-student learning

    Publication No.: US20190051290A1

    Publication Date: 2019-02-14

    Application No.: US15675249

    Application Date: 2017-08-11

    Abstract: Improvements in speech recognition in a new domain are provided via student/teacher training of models for different speech domains. A student model for a new domain is created based on a teacher model trained in an existing domain. The student model is trained in parallel with the operation of the teacher model, with inputs in the new and existing domains respectively, to develop a neural network that is adapted to recognize speech in the new domain. The data in the new domain may lack transcription labels; instead, they are parallelized with the existing-domain data analyzed by the teacher model. The outputs from the teacher model are compared with the outputs of the student model, and the differences are used to adjust the parameters of the student model to better recognize speech in the new domain.
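    As a rough illustration of the training loop sketched in the abstract, the following assumes PyTorch; `teacher`, `student`, and `parallel_loader` are hypothetical placeholders, with the loader yielding frame-parallel feature pairs from the existing and new domains, and KL divergence standing in for the output-difference measure.

```python
# Sketch of the teacher-student adaptation loop described above, assuming PyTorch.
# `teacher`, `student`, and `parallel_loader` are placeholders: the loader yields
# parallel pairs (existing-domain features, new-domain features); no transcripts needed.
import torch
import torch.nn.functional as F

def adapt_student(teacher, student, parallel_loader, epochs=1, lr=1e-4):
    teacher.eval()                                   # teacher stays fixed
    opt = torch.optim.Adam(student.parameters(), lr=lr)
    for _ in range(epochs):
        for src_feats, tgt_feats in parallel_loader:
            with torch.no_grad():
                t_logp = F.log_softmax(teacher(src_feats), dim=-1)   # teacher posteriors (existing domain)
            s_logp = F.log_softmax(student(tgt_feats), dim=-1)       # student posteriors (new domain)
            # Match the student's outputs to the teacher's (KL divergence as the difference measure)
            loss = F.kl_div(s_logp, t_logp, log_target=True, reduction="batchmean")
            opt.zero_grad()
            loss.backward()
            opt.step()
    return student
```

    Because only the teacher's posteriors serve as targets, no transcription labels are required for the new-domain data.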

    Variable-component deep neural network for robust speech recognition

    Publication No.: US10019990B2

    Publication Date: 2018-07-10

    Application No.: US14414621

    Application Date: 2014-09-09

    CPC classification number: G10L15/20 G10L15/16 G10L19/24 G10L25/84

    Abstract: Systems and methods for speech recognition incorporating environmental variables are provided. The systems and methods capture speech to be recognized. The speech is then recognized utilizing a variable component deep neural network (DNN). The variable component DNN processes the captured speech by incorporating an environment variable. The environment variable may be any variable that is dependent on environmental conditions or the relation of the user, the client device, and the environment. For example, the environment variable may be based on noise of the environment and represented as a signal-to-noise ratio. The variable component DNN may incorporate the environment variable in different ways. For instance, the environment variable may be incorporated into weighting matrices and biases of the DNN, the outputs of the hidden layers of the DNN, or the activation functions of the nodes of the DNN.
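    One way to realize a variable-component layer of the kind described above is to make each weight matrix and bias a polynomial function of the environment variable (e.g., the utterance signal-to-noise ratio). The sketch below assumes PyTorch; the layer sizes and polynomial order are arbitrary assumptions, not the claimed design.

```python
# Sketch of a "variable-component" layer whose weight matrix and bias are polynomial
# functions of an environment variable v (e.g., SNR in dB): W(v) = sum_k W_k * v**k.
import torch
import torch.nn as nn

class VariableComponentLinear(nn.Module):
    def __init__(self, in_dim, out_dim, order=2):
        super().__init__()
        # One weight/bias component per polynomial order
        self.weights = nn.ParameterList(
            [nn.Parameter(torch.randn(out_dim, in_dim) * 0.01) for _ in range(order + 1)]
        )
        self.biases = nn.ParameterList(
            [nn.Parameter(torch.zeros(out_dim)) for _ in range(order + 1)]
        )

    def forward(self, x, v):
        # x: (batch, in_dim) hidden activations; v: scalar environment variable
        W = sum(w * (v ** k) for k, w in enumerate(self.weights))
        b = sum(bk * (v ** k) for k, bk in enumerate(self.biases))
        return torch.sigmoid(x @ W.t() + b)
```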

    Pre-training with alignments for recurrent neural network transducer based end-to-end speech recognition

    Publication No.: US11657799B2

    Publication Date: 2023-05-23

    Application No.: US16840311

    Application Date: 2020-04-03

    CPC classification number: G10L15/063 G06N3/0445 G06N3/08

    Abstract: Techniques performed by a data processing system for training a Recurrent Neural Network Transducer (RNN-T) herein include encoder pretraining by training a neural network-based token classification model using first token-aligned training data representing a plurality of utterances, where each utterance is associated with a plurality of frames of audio data and tokens representing each utterance are aligned with frame boundaries of the plurality of audio frames; obtaining a first cross-entropy (CE) criterion from the token classification model, wherein the CE criterion represents a divergence between expected outputs and reference outputs of the model; pretraining an encoder of an RNN-T based on the first CE criterion; and training the RNN-T with second training data after pretraining the encoder of the RNN-T. These techniques also include whole-network pre-training of the RNN-T. An RNN-T pretrained using these techniques may be used to process audio data that includes spoken content to obtain a textual representation.
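    A compact sketch of the encoder-pretraining step described above, assuming PyTorch: the RNN-T encoder is first trained as a frame-level token classifier with cross-entropy against token-aligned targets, then reused inside the RNN-T. The encoder interface, hidden size, and data loader are assumptions for illustration.

```python
# Sketch of encoder pretraining with a cross-entropy criterion on token-aligned frames.
# `encoder` and `aligned_loader` are placeholders; the loader yields
# feats (batch, T, feat_dim) and frame_tokens (batch, T) with one token id per frame.
import torch
import torch.nn as nn

def pretrain_encoder(encoder, num_tokens, aligned_loader, enc_dim=512, epochs=1, lr=1e-3):
    head = nn.Linear(enc_dim, num_tokens)            # temporary classification head for CE pretraining
    params = list(encoder.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, frame_tokens in aligned_loader:
            logits = head(encoder(feats))            # (batch, T, num_tokens)
            loss = ce(logits.reshape(-1, num_tokens), frame_tokens.reshape(-1))
            opt.zero_grad()
            loss.backward()
            opt.step()
    # The pretrained encoder is then placed inside the RNN-T and trained end to end.
    return encoder
```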

    Internal language model for E2E models

    Publication No.: US11527238B2

    Publication Date: 2022-12-13

    Application No.: US17154956

    Application Date: 2021-01-21

    Abstract: A computer device is provided that includes one or more processors configured to receive an end-to-end (E2E) model that has been trained for automatic speech recognition with training data from a source-domain, and receive an external language model that has been trained with training data from a target-domain. The one or more processors are configured to perform an inference of the probability of an output token sequence given a sequence of input speech features. Performing the inference includes computing an E2E model score, computing an external language model score, and computing an estimated internal language model score for the E2E model. The estimated internal language model score is computed by removing a contribution of an intrinsic acoustic model. The one or more processors are further configured to compute an integrated score based at least on the E2E model score, the external language model score, and the estimated internal language model score.
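    During decoding, the three scores can be combined per hypothesis roughly as shown below; the interpolation weights, the decoder interface, and the zeroed acoustic context used to estimate the internal language model score are illustrative assumptions, not the claimed method.

```python
# Sketch of internal-LM estimation and score integration for an E2E ASR hypothesis.
import torch

def estimate_internal_lm_logp(e2e_decoder, token_ids, enc_dim=512):
    # One common estimation trick: run the E2E decoder with the acoustic (encoder)
    # contribution zeroed out, so only the label-history (internal LM) part remains.
    # `e2e_decoder` is a placeholder taking (tokens, acoustic_context) and returning
    # per-token log-probabilities for the hypothesis.
    zero_context = torch.zeros(1, len(token_ids), enc_dim)   # encoder dimension assumed
    return e2e_decoder(token_ids, zero_context).sum()

def integrated_score(e2e_logp, ext_lm_logp, internal_lm_logp,
                     ext_weight=0.6, ilm_weight=0.3):
    # log P(y|x) ~ log P_E2E(y|x) + w_ext * log P_ELM(y) - w_ilm * log P_ILM(y)
    return e2e_logp + ext_weight * ext_lm_logp - ilm_weight * internal_lm_logp
```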
