-
Publication No.: US20250111671A1
Publication Date: 2025-04-03
Application No.: US18900457
Filing Date: 2024-09-27
Applicant: Google LLC
Inventor: Tao Zhu , Jiahui Yu , Jingchen Feng , Kai Chen , Pooya Abolghasemi , Gagan Bansal , Jieren Xu , Hui Miao , Yaping Zhang , Shuchao Bi , Yonghui Wu , Claire Cui , Rohan Anil
IPC: G06V20/40 , G06F40/284 , G10L25/57
Abstract: Methods and systems for media item characterization based on multimodal embeddings are provided herein. A media item including a sequence of video frames is identified. A set of video embeddings representing visual features of the sequence of video frames is obtained. A set of audio embeddings representing audio features of the sequence of video frames is obtained. A set of audiovisual embeddings is generated based on the set of video embeddings and the set of audio embeddings. Each of the set of audiovisual embeddings represents a visual feature and an audio feature of a respective video frame of the sequence of video frames. One or more media characteristics associated with the media item are determined based on the set of audiovisual embeddings.
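As a rough, hypothetical sketch of the idea (the patent does not specify the fusion operator), per-frame video and audio embeddings can be combined into one audiovisual embedding per frame, for example by concatenation followed by a learned projection. All names, shapes, and the random placeholder weights below are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
num_frames, d_video, d_audio, d_av = 8, 16, 12, 10

video_emb = rng.normal(size=(num_frames, d_video))  # visual features per frame
audio_emb = rng.normal(size=(num_frames, d_audio))  # audio features per frame

# Fuse by concatenating the two embeddings for each frame, then applying a
# linear projection (random stand-in weights). Each row of the result is an
# audiovisual embedding representing both modalities of one video frame.
W_fuse = rng.normal(size=(d_video + d_audio, d_av))
audiovisual_emb = np.concatenate([video_emb, audio_emb], axis=1) @ W_fuse
```

The fused embeddings could then feed a downstream classifier that predicts media characteristics.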
-
Publication No.: US20250111235A1
Publication Date: 2025-04-03
Application No.: US18900506
Filing Date: 2024-09-27
Applicant: Google LLC
Inventor: Siyuan Qiao , Chenxi Liu , Jiahui Yu , Yonghui Wu
IPC: G06N3/088
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for training neural networks through contrastive learning. In particular, the contrastive learning is modified to use a relative margin to adjust a training pair's contribution to optimization.
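A minimal sketch of the margin idea, assuming an InfoNCE-style contrastive loss: subtracting a margin from each positive pair's logit changes how strongly that pair contributes to the optimization. The exact margin formulation here is an assumption, not the patented method:

```python
import numpy as np

def contrastive_loss(z_a, z_b, margin=0.0, temperature=0.07):
    """InfoNCE-style contrastive loss over paired embeddings, with an assumed
    relative margin subtracted from each positive pair's logit."""
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = (z_a @ z_b.T) / temperature
    n = logits.shape[0]
    # reduce the positive (diagonal) logits by the margin
    logits[np.arange(n), np.arange(n)] -= margin / temperature
    # cross-entropy with the matching pair as the positive class
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(n), np.arange(n)].mean())
```

With a positive margin, each training pair must separate its positive from the negatives by a larger gap, so the pair keeps contributing gradient for longer.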
-
Publication No.: US12254865B2
Publication Date: 2025-03-18
Application No.: US18418246
Filing Date: 2024-01-20
Applicant: Google LLC
Inventor: Zhifeng Chen , Bo Li , Eugene Weinstein , Yonghui Wu , Pedro J. Moreno Mengibar , Ron J. Weiss , Khe Chai Sim , Tara N. Sainath , Patrick An Phu Nguyen
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer-readable medium, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihood of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.
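The cluster adaptive training mentioned above can be pictured as mixing a set of basis ("cluster") weight matrices with a dialect-specific interpolation vector, so one model serves multiple dialects. This is a generic illustration of the technique; the shapes, mixing vector, and layer placement are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
n_clusters, d_in, d_out = 3, 6, 5

# basis weight matrices shared across dialects
cluster_weights = rng.normal(size=(n_clusters, d_in, d_out))
# assumed per-dialect interpolation vector (one entry per cluster)
dialect_mix = np.array([0.7, 0.2, 0.1])

# effective layer for this dialect: a weighted sum of the cluster matrices
W_eff = np.tensordot(dialect_mix, cluster_weights, axes=1)

features = rng.normal(size=(d_in,))   # input features from the audio data
scores = features @ W_eff             # scores over linguistic units
```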
-
Publication No.: US20240362453A1
Publication Date: 2024-10-31
Application No.: US18766038
Filing Date: 2024-07-08
Applicant: Google LLC
Inventor: Anmol Gulati , Weikeng Qin , Zhengdong Zhang , Ruoming Pang , Niki Parmar , Jiahui Yu , Wei Han , Chung-Cheng Chiu , Yu Zhang , Yonghui Wu , Shibo Wang
Abstract: Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.
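The block ordering described above can be sketched in miniature: two half-step feed-forward blocks sandwich a self-attention block (global interactions) and a depthwise convolution block (local correlations), each applied residually. This toy version omits layer normalization, multiple heads, and relative positional encoding, and all weights are random placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 8  # sequence length, model width

def ffn(x, W1, W2):
    # simple ReLU feed-forward block
    return np.maximum(x @ W1, 0.0) @ W2

def self_attention(x):
    # dot-product self-attention over the whole sequence (global interactions)
    scores = x @ x.T / np.sqrt(x.shape[1])
    w = np.exp(scores - scores.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ x

def depthwise_conv(x, kernel):
    # 1-D convolution along time with a shared kernel (local correlations)
    k = kernel.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, pad), (0, 0)))
    return np.stack([(xp[t:t + k] * kernel[:, None]).sum(axis=0)
                     for t in range(x.shape[0])])

W1 = rng.normal(size=(d, 4 * d))
W2 = rng.normal(size=(4 * d, d))
kernel = rng.normal(size=(3,))
x = rng.normal(size=(T, d))

# conformer-style block: FFN/2 -> self-attention -> convolution -> FFN/2,
# each with a residual connection
y = x + 0.5 * ffn(x, W1, W2)
y = y + self_attention(y)
y = y + depthwise_conv(y, kernel)
y = y + 0.5 * ffn(y, W1, W2)
```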
-
Publication No.: US12079703B2
Publication Date: 2024-09-03
Application No.: US17139525
Filing Date: 2020-12-31
Applicant: Google LLC
Inventor: Anmol Gulati , Ruoming Pang , Niki Parmar , Jiahui Yu , Wei Han , Chung-Cheng Chiu , Yu Zhang , Yonghui Wu , Shibo Wang , Weikeng Qin , Zhengdong Zhang
Abstract: Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.
-
Publication No.: US12032920B2
Publication Date: 2024-07-09
Application No.: US17056554
Filing Date: 2020-03-07
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Melvin Johnson , Fadi Biadsy , Ron Weiss , Wolfgang Macherey
Abstract: The present disclosure provides systems and methods that train and use machine-learned models such as, for example, sequence-to-sequence models, to perform direct and text-free speech-to-speech translation. In particular, aspects of the present disclosure provide an attention-based sequence-to-sequence neural network which can directly translate speech from one language into speech in another language, without relying on an intermediate text representation.
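The attention step at the heart of such a model can be sketched generically: a decoder state attends over the encoder states of the source speech to form a context vector, with no text representation in between. The names and shapes below are illustrative, not taken from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)
T_src, d = 20, 8

# encoder states for the source-language speech (e.g. encoded spectrogram frames)
encoder_states = rng.normal(size=(T_src, d))
decoder_state = rng.normal(size=(d,))

# scaled dot-product attention weights over the source frames
scores = encoder_states @ decoder_state / np.sqrt(d)
weights = np.exp(scores - scores.max())
weights /= weights.sum()

# context vector the decoder would use to predict the next target-speech frame
context = weights @ encoder_states
```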
-
Publication No.: US20240161732A1
Publication Date: 2024-05-16
Application No.: US18418246
Filing Date: 2024-01-20
Applicant: Google LLC
Inventor: Zhifeng Chen , Bo Li , Eugene Weinstein , Yonghui Wu , Pedro J. Moreno Mengibar , Ron J. Weiss , Khe Chai Sim , Tara N. Sainath , Patrick An Phu Nguyen
CPC classification number: G10L15/005 , G10L15/07 , G10L15/16 , G10L2015/0631
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer-readable medium, for speech recognition using multi-dialect and multilingual models. In some implementations, audio data indicating audio characteristics of an utterance is received. Input features determined based on the audio data are provided to a speech recognition model that has been trained to output scores indicating the likelihood of linguistic units for each of multiple different languages or dialects. The speech recognition model can be one that has been trained using cluster adaptive training. Output that the speech recognition model generated in response to receiving the input features is received. A transcription of the utterance generated based on the output of the speech recognition model is provided.
-
Publication No.: US20240112667A1
Publication Date: 2024-04-04
Application No.: US18525475
Filing Date: 2023-11-30
Applicant: Google LLC
Inventor: Ye Jia , Zhifeng Chen , Yonghui Wu , Jonathan Shen , Ruoming Pang , Ron J. Weiss , Ignacio Lopez Moreno , Fei Ren , Yu Zhang , Quan Wang , Patrick An Phu Nguyen
Abstract: Methods, systems, and apparatus, including computer programs encoded on a computer storage medium, for speech synthesis. The methods, systems, and apparatus include actions of obtaining an audio representation of speech of a target speaker, obtaining input text for which speech is to be synthesized in a voice of the target speaker, generating a speaker vector by providing the audio representation to a speaker encoder engine that is trained to distinguish speakers from one another, generating an audio representation of the input text spoken in the voice of the target speaker by providing the input text and the speaker vector to a spectrogram generation engine that is trained using voices of reference speakers to generate audio representations, and providing the audio representation of the input text spoken in the voice of the target speaker for output.
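The conditioning step can be pictured with a toy stand-in: a speaker encoder maps reference audio to a fixed-size speaker vector, which is then broadcast across the text encoding before spectrogram generation. The pooling-based "encoder" and all shapes below are assumptions for illustration, not the trained speaker encoder the abstract describes:

```python
import numpy as np

rng = np.random.default_rng(0)

def speaker_encoder(audio_frames):
    # stand-in for a trained speaker encoder: average-pool frame features
    # and L2-normalize, yielding a fixed-size speaker vector
    v = audio_frames.mean(axis=0)
    return v / np.linalg.norm(v)

T_text, d_text, d_spk = 5, 8, 4
text_encoding = rng.normal(size=(T_text, d_text))       # encoded input text
reference_audio = rng.normal(size=(100, d_spk))         # target-speaker audio features

speaker_vec = speaker_encoder(reference_audio)

# condition the spectrogram generator by concatenating the speaker vector
# onto every timestep of the text encoding
conditioned = np.concatenate(
    [text_encoding, np.tile(speaker_vec, (T_text, 1))], axis=1)
```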
-
Publication No.: US11922932B2
Publication Date: 2024-03-05
Application No.: US18194586
Filing Date: 2023-03-31
Applicant: Google LLC
Inventor: Rohit Prakash Prabhavalkar , Tara N. Sainath , Yonghui Wu , Patrick An Phu Nguyen , Zhifeng Chen , Chung-Cheng Chiu , Anjuli Patricia Kannan
IPC: G10L15/197 , G10L15/02 , G10L15/06 , G10L15/16 , G10L15/22
CPC classification number: G10L15/197 , G10L15/02 , G10L15/063 , G10L15/16 , G10L15/22 , G10L2015/025
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer-readable storage media, for speech recognition using attention-based sequence-to-sequence models. In some implementations, audio data indicating acoustic characteristics of an utterance is received. A sequence of feature vectors indicative of the acoustic characteristics of the utterance is generated. The sequence of feature vectors is processed using a speech recognition model that has been trained using a loss function that uses a set of speech recognition hypothesis samples, the speech recognition model including an encoder, an attention module, and a decoder. The encoder and decoder each include one or more recurrent neural network layers. A sequence of output vectors representing distributions over a predetermined set of linguistic units is obtained. A transcription for the utterance is obtained based on the sequence of output vectors. Data indicating the transcription of the utterance is provided.
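A loss over a set of hypothesis samples can be sketched as an expected-error objective: renormalize the model's scores over the sampled hypotheses and minimize the expected error under that distribution. This MWER-style form is an assumption; the patent's exact loss may differ:

```python
import numpy as np

def expected_error_loss(hypothesis_log_probs, hypothesis_errors):
    """Sketch of a sampled-hypothesis loss: softmax the model's log-probs over
    the sampled hypotheses, then take the expected error (e.g. word errors)
    under that distribution."""
    p = np.exp(hypothesis_log_probs - np.max(hypothesis_log_probs))
    p /= p.sum()
    return float(p @ hypothesis_errors)
```

Minimizing this pushes probability mass toward low-error hypotheses rather than only toward the reference transcript.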
-
Publication No.: US11908448B2
Publication Date: 2024-02-20
Application No.: US17327076
Filing Date: 2021-05-21
Applicant: Google LLC
Inventor: Isaac Elias , Jonathan Shen , Yu Zhang , Ye Jia , Ron J. Weiss , Yonghui Wu , Byungha Chun
IPC: G10L13/08 , G10L13/047 , G06F40/126 , G10L21/10 , G06N3/08 , G06N3/088 , G06N3/044 , G06N3/045 , G06N3/048
CPC classification number: G10L13/08 , G06F40/126 , G06N3/044 , G06N3/045 , G06N3/08 , G06N3/088 , G10L13/047 , G10L21/10 , G06N3/048
Abstract: A method for training a non-autoregressive TTS model includes receiving training data that includes a reference audio signal and a corresponding input text sequence. The method also includes encoding the reference audio signal into a variational embedding that disentangles the style/prosody information from the reference audio signal and encoding the input text sequence into an encoded text sequence. The method also includes predicting a phoneme duration for each phoneme in the input text sequence and determining a phoneme duration loss based on the predicted phoneme durations and a reference phoneme duration. The method also includes generating one or more predicted mel-frequency spectrogram sequences for the input text sequence and determining a final spectrogram loss based on the predicted mel-frequency spectrogram sequences and a reference mel-frequency spectrogram sequence. The method also includes training the TTS model based on the final spectrogram loss and the corresponding phoneme duration loss.
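The two training terms can be sketched with common placeholder choices (squared error for durations, absolute error for spectrograms); the patent does not commit to these exact loss forms, so treat them as assumptions:

```python
import numpy as np

def phoneme_duration_loss(pred_durations, ref_durations):
    # squared error between predicted and reference per-phoneme durations
    return float(np.mean((pred_durations - ref_durations) ** 2))

def spectrogram_loss(pred_mel, ref_mel):
    # mean absolute error between predicted and reference mel-spectrograms
    return float(np.mean(np.abs(pred_mel - ref_mel)))

rng = np.random.default_rng(0)
pred_dur = rng.uniform(1, 10, size=6)   # predicted duration per phoneme (frames)
ref_dur = rng.uniform(1, 10, size=6)    # reference durations
pred_mel = rng.normal(size=(40, 80))    # predicted mel-spectrogram (frames x bins)
ref_mel = rng.normal(size=(40, 80))     # reference mel-spectrogram

# the model is trained on both terms together
total_loss = spectrogram_loss(pred_mel, ref_mel) + phoneme_duration_loss(pred_dur, ref_dur)
```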