-
Publication No.: US12154581B2
Publication Date: 2024-11-26
Application No.: US17237021
Filing Date: 2021-04-21
Applicant: Google LLC
Inventor: Arun Narayanan , Tara Sainath , Chung-Cheng Chiu , Ruoming Pang , Rohit Prabhavalkar , Jiahui Yu , Ehsan Variani , Trevor Strohman
Abstract: An automated speech recognition (ASR) model includes a first encoder, a second encoder, and a decoder. The first encoder receives, as input, a sequence of acoustic frames, and generates, at each of a plurality of output steps, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The second encoder receives, as input, the first higher order feature representation generated by the first encoder at each of the plurality of output steps, and generates, at each of the plurality of output steps, a second higher order feature representation for a corresponding first higher order feature frame. The decoder receives, as input, the second higher order feature representation generated by the second encoder at each of the plurality of output steps, and generates, at each of the plurality of output steps, a first probability distribution over possible speech recognition hypotheses.
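A minimal PyTorch sketch of the cascaded-encoder arrangement described in the abstract; the LSTM layers, dimensions, and vocabulary size are illustrative assumptions rather than the patented architecture:

import torch
import torch.nn as nn

class CascadedEncoderASR(nn.Module):
    def __init__(self, feat_dim=80, hidden_dim=256, vocab_size=128):
        super().__init__()
        # First encoder: acoustic frames -> first higher-order feature representations.
        self.encoder1 = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        # Second encoder: first representations -> second higher-order representations.
        self.encoder2 = nn.LSTM(hidden_dim, hidden_dim, num_layers=2, batch_first=True)
        # Decoder head: per-step distribution over speech recognition hypotheses.
        self.decoder = nn.Linear(hidden_dim, vocab_size)

    def forward(self, frames):                      # frames: (batch, time, feat_dim)
        h1, _ = self.encoder1(frames)               # first higher-order representation
        h2, _ = self.encoder2(h1)                   # second higher-order representation
        return torch.log_softmax(self.decoder(h2), dim=-1)  # per-step log-probabilities

x = torch.randn(2, 100, 80)          # two utterances, 100 acoustic frames each
log_probs = CascadedEncoderASR()(x)  # (2, 100, 128)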
-
Publication No.: US20230351149A1
Publication Date: 2023-11-02
Application No.: US18141340
Filing Date: 2023-04-28
Applicant: Google LLC
Inventor: Jiahui Yu , Zirui Wang , Vijay Vasudevan , Ho Man Yeung , Seyed Mojtaba Seyedhosseini Tarzjani , Yonghui Wu
IPC: G06N3/04
CPC classification number: G06N3/04
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for processing multi-modal inputs using contrastive captioning neural networks.
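A hedged sketch of the contrastive half of a contrastive-captioning setup; the projection size and temperature are placeholders, and the autoregressive captioning head is omitted:

import torch
import torch.nn.functional as F

def contrastive_loss(image_emb, text_emb, temperature=0.07):
    # Normalize both modalities, then score every image/text pair in the batch.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature     # (batch, batch) similarity matrix
    targets = torch.arange(logits.size(0))              # matching pairs sit on the diagonal
    # Symmetric InfoNCE: image-to-text plus text-to-image cross-entropy.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

loss = contrastive_loss(torch.randn(8, 512), torch.randn(8, 512))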
-
Publication No.: US20220405579A1
Publication Date: 2022-12-22
Application No.: US17613773
Filing Date: 2021-03-03
Applicant: Google LLC
Inventor: Jiahui Yu , Pengchong Jin , Hanxiao Liu , Gabriel Mintzer Bender , Pieter-Jan Kindermans , Mingxing Tan , Xiaodan Song , Ruoming Pang , Quoc V. Le
IPC: G06N3/08
Abstract: Methods, systems, and apparatus, including computer programs encoded on computer storage media, for selecting a neural network to perform a particular machine learning task while satisfying a set of constraints.
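A schematic sketch of constrained architecture selection in the spirit of the abstract; the candidate records, latency and parameter fields, and constraint values are invented for illustration:

from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    accuracy: float      # validation metric on the target task
    latency_ms: float    # measured or predicted inference latency
    params_m: float      # parameter count in millions

def select(candidates, max_latency_ms, max_params_m):
    # Keep only architectures that satisfy every constraint, then pick the most accurate one.
    feasible = [c for c in candidates
                if c.latency_ms <= max_latency_ms and c.params_m <= max_params_m]
    return max(feasible, key=lambda c: c.accuracy) if feasible else None

pool = [Candidate("a", 0.81, 12.0, 20.0), Candidate("b", 0.84, 25.0, 35.0)]
best = select(pool, max_latency_ms=15.0, max_params_m=30.0)   # -> Candidate "a"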
-
Publication No.: US12190869B2
Publication Date: 2025-01-07
Application No.: US17936547
Filing Date: 2022-09-29
Applicant: Google LLC
Inventor: Tara N. Sainath , Rami Botros , Anmol Gulati , Krzysztof Choromanski , Ruoming Pang , Trevor Strohman , Weiran Wang , Jiahui Yu
Abstract: A computer-implemented method includes receiving a sequence of acoustic frames as input to an automatic speech recognition (ASR) model. Here, the ASR model includes a causal encoder and a decoder. The method also includes generating, by the causal encoder, a first higher order feature representation for a corresponding acoustic frame in the sequence of acoustic frames. The method also includes generating, by the decoder, a first probability distribution over possible speech recognition hypotheses. Here, the causal encoder includes a stack of causal encoder layers each including a Recurrent Neural Network (RNN) Attention-Performer module that applies linear attention.
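A minimal sketch of the linear-attention computation that a Performer-style module relies on; the elu-based feature map and non-causal (full-context) form are common simplifications, not necessarily what the patent uses (a causal variant would accumulate prefix sums over time):

import torch
import torch.nn.functional as F

def linear_attention(q, k, v):
    # q, k, v: (batch, time, dim). Apply a positive feature map, then compute
    # attention in O(time * dim^2) instead of the quadratic O(time^2 * dim).
    q = F.elu(q) + 1.0
    k = F.elu(k) + 1.0
    kv = torch.einsum("btd,bte->bde", k, v)                 # sum_t phi(k_t) v_t^T
    z = 1.0 / (torch.einsum("btd,bd->bt", q, k.sum(dim=1)) + 1e-6)
    return torch.einsum("btd,bde,bt->bte", q, kv, z)        # normalized outputs

out = linear_attention(torch.randn(2, 50, 64), torch.randn(2, 50, 64), torch.randn(2, 50, 64))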
-
Publication No.: US12094453B2
Publication Date: 2024-09-17
Application No.: US17447285
Filing Date: 2021-09-09
Applicant: Google LLC
Inventor: Jiahui Yu , Chung-cheng Chiu , Bo Li , Shuo-yiin Chang , Tara Sainath , Wei Han , Anmol Gulati , Yanzhang He , Arun Narayanan , Yonghui Wu , Ruoming Pang
IPC: G10L15/06 , G10L15/16 , G10L15/187 , G10L15/22 , G10L15/30
CPC classification number: G10L15/063 , G10L15/16 , G10L15/22 , G10L15/30 , G10L15/187
Abstract: A computer-implemented method of training a streaming speech recognition model that includes receiving, as input to the streaming speech recognition model, a sequence of acoustic frames. The streaming speech recognition model is configured to learn an alignment probability between the sequence of acoustic frames and an output sequence of vocabulary tokens. The vocabulary tokens include a plurality of label tokens and a blank token. At each output step, the method includes determining a first probability of emitting one of the label tokens and determining a second probability of emitting the blank token. The method also includes generating the alignment probability at a sequence level based on the first probability and the second probability. The method also includes applying a tuning parameter to the alignment probability at the sequence level to maximize the first probability of emitting one of the label tokens.
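A toy sketch of the weighting idea only: a tuning parameter up-weights the label-emission portion of a sequence-level alignment log-probability. The actual lattice computation (forward-backward over a transducer) is omitted, and all names are assumptions:

import torch

def regularized_alignment_loss(log_p_align, log_p_labels, tuning_lambda=0.01):
    # log_p_align: sequence-level alignment log-probability of the output sequence.
    # log_p_labels: the label-emission portion of that alignment probability.
    # The tuning parameter rewards emitting label tokens over blanks, nudging the
    # model to emit labels rather than defer them.
    return -(log_p_align + tuning_lambda * log_p_labels)

loss = regularized_alignment_loss(torch.tensor(-12.3), torch.tensor(-7.8))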
-
Publication No.: US20230237993A1
Publication Date: 2023-07-27
Application No.: US18011571
Filing Date: 2021-10-01
Applicant: Google LLC
Inventor: Jiahui Yu , Ruoming Pang , Wei Han , Anmol Gulati , Chung-Cheng Chiu , Bo Li , Tara N. Sainath , Yonghui Wu
Abstract: Systems and methods of the present disclosure are directed to a computing system, including one or more processors and a machine-learned multi-mode speech recognition model configured to operate in a streaming recognition mode or a contextual recognition mode. The computing system can perform operations including obtaining speech data and a ground truth label and processing the speech data using the contextual recognition mode to obtain contextual prediction data. The operations can include evaluating a difference between the contextual prediction data and the ground truth label and processing the speech data using the streaming recognition mode to obtain streaming prediction data. The operations can include evaluating a difference between the streaming prediction data and the ground truth label, and a difference between the streaming prediction data and the contextual prediction data. The operations can include adjusting parameters of the speech recognition model.
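A hedged sketch of the training objective the abstract implies: both modes are supervised by the ground truth, and the streaming mode is additionally pulled toward the contextual predictions. The KL-based distillation term and weights are illustrative assumptions:

import torch
import torch.nn.functional as F

def dual_mode_loss(stream_logits, context_logits, labels, distill_weight=1.0):
    # Supervised losses for both modes against the ground-truth labels.
    loss_context = F.cross_entropy(context_logits, labels)
    loss_stream = F.cross_entropy(stream_logits, labels)
    # Distillation: streaming predictions move toward the (detached) contextual ones.
    distill = F.kl_div(F.log_softmax(stream_logits, dim=-1),
                       F.softmax(context_logits.detach(), dim=-1),
                       reduction="batchmean")
    return loss_context + loss_stream + distill_weight * distill

loss = dual_mode_loss(torch.randn(4, 30), torch.randn(4, 30), torch.randint(0, 30, (4,)))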
-
Publication No.: US20220207321A1
Publication Date: 2022-06-30
Application No.: US17139525
Filing Date: 2020-12-31
Applicant: Google LLC
Inventor: Anmol Gulati , Ruoming Pang , Niki Parmar , Jiahui Yu , Wei Han , Chung-Cheng Chiu , Yu Zhang , Yonghui Wu , Shibo Wang , Weikeng Qin , Zhengdong Zhang
Abstract: Systems and methods can utilize a conformer model to process a data set for various data processing tasks, including, but not limited to, speech recognition, sound separation, protein synthesis determination, video or other image set analysis, and natural language processing. The conformer model can use feed-forward blocks, a self-attention block, and a convolution block to process data to learn global interactions and relative-offset-based local correlations of the input data.
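A compact sketch of a Conformer-style block (half-step feed-forward, self-attention, convolution, half-step feed-forward); the layer sizes, stock nn.MultiheadAttention, and single depthwise convolution are simplifying assumptions:

import torch
import torch.nn as nn

class ConformerBlock(nn.Module):
    """Half-step feed-forward -> self-attention -> convolution -> half-step feed-forward."""
    def __init__(self, dim=256, heads=4, kernel=31):
        super().__init__()
        self.ff1 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.ff2 = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, 4 * dim), nn.SiLU(), nn.Linear(4 * dim, dim))
        self.attn_norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)            # global interactions
        self.conv_norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel, padding=kernel // 2, groups=dim)   # local correlations
        self.out_norm = nn.LayerNorm(dim)

    def forward(self, x):                                       # x: (batch, time, dim)
        x = x + 0.5 * self.ff1(x)                               # half-step feed-forward
        a = self.attn_norm(x)
        x = x + self.attn(a, a, a, need_weights=False)[0]       # self-attention block
        c = self.conv_norm(x).transpose(1, 2)                   # (batch, dim, time) for Conv1d
        x = x + self.conv(c).transpose(1, 2)                    # depthwise convolution block
        x = x + 0.5 * self.ff2(x)                               # half-step feed-forward
        return self.out_norm(x)

y = ConformerBlock()(torch.randn(2, 50, 256))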
-
Publication No.: US20250037426A1
Publication Date: 2025-01-30
Application No.: US18716912
Filing Date: 2022-12-09
Applicant: Google LLC
Inventor: Bowen Zhang , Jiahui Yu , Christopher Fifty , Wei Han , Andrew M. Dai , Ruoming Pang , Fei Sha
IPC: G06V10/764 , G06V10/774
Abstract: A method includes obtaining video datasets each including pairs of a training video and a ground-truth action classification of the training video. The method also includes generating an action recognition model that includes a shared encoder model and action classification heads. The number of action classification heads may be equal to the number of video datasets, and each action classification head may be configured to, based on an output of the shared encoder model, classify training videos sampled from a corresponding video dataset. The method also includes determining, by the action recognition model and for each training video sampled from the video datasets, an inferred action classification. The method further includes determining a loss value based on the inferred action classifications and the ground-truth action classifications, and adjusting parameters of the action recognition model based on the loss value.
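A minimal sketch of the shared-encoder, per-dataset-head arrangement; the encoder, feature size, and class counts are assumptions for illustration:

import torch
import torch.nn as nn

class MultiDatasetActionModel(nn.Module):
    def __init__(self, feat_dim=512, classes_per_dataset=(400, 174, 700)):
        super().__init__()
        # One shared encoder over video features; one classification head per dataset.
        self.encoder = nn.Sequential(nn.Linear(feat_dim, feat_dim), nn.ReLU())
        self.heads = nn.ModuleList(nn.Linear(feat_dim, n) for n in classes_per_dataset)

    def forward(self, video_features, dataset_idx):
        shared = self.encoder(video_features)          # output of the shared encoder
        return self.heads[dataset_idx](shared)         # inferred action-classification logits

model = MultiDatasetActionModel()
logits = model(torch.randn(4, 512), dataset_idx=1)     # videos sampled from dataset 1
loss = nn.functional.cross_entropy(logits, torch.randint(0, 174, (4,)))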
-
Publication No.: US20240404238A1
Publication Date: 2024-12-05
Application No.: US18698997
Filing Date: 2022-10-05
Applicant: Google LLC
Inventor: Jiahui Yu , Vijay Vasudevan , Alexander Yeong-Shiuh Ku , Yonghui Wu , Jason Michael Baldridge , Yuanzhong Xu , Jing Yu Koh , Thang Minh Luong , Gunjan Baid , Zirui Wang , Han Zhang , Xin Li
IPC: G06V10/28 , G06F40/284 , G06V10/764 , G06V10/766 , G06V10/82
Abstract: Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pre-training a machine learning model (e.g., Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
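A hedged sketch of the second-stage objective described above: given discrete image tokens from an image tokenizer, a decoder-only Transformer is trained to predict each token from the previous ones. The ViT-VQGAN tokenizer is assumed away, and a stock PyTorch Transformer encoder with a causal mask stands in for the model:

import torch
import torch.nn as nn

class ImageTokenLM(nn.Module):
    def __init__(self, vocab=8192, dim=512, heads=8, layers=4, max_len=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)
        self.pos = nn.Embedding(max_len, dim)
        block = nn.TransformerEncoderLayer(dim, heads, 4 * dim, batch_first=True)
        self.body = nn.TransformerEncoder(block, layers)
        self.head = nn.Linear(dim, vocab)

    def forward(self, tokens):                                      # tokens: (batch, seq) codebook ids
        seq = tokens.size(1)
        x = self.tok(tokens) + self.pos(torch.arange(seq, device=tokens.device))
        mask = nn.Transformer.generate_square_subsequent_mask(seq)  # causal mask: no peeking ahead
        return self.head(self.body(x, mask=mask))                   # next-token logits

tokens = torch.randint(0, 8192, (2, 256))                           # rasterized image tokens (assumed)
logits = ImageTokenLM()(tokens)
# Autoregressive training: predict token t+1 from tokens up to t.
loss = nn.functional.cross_entropy(logits[:, :-1].reshape(-1, 8192), tokens[:, 1:].reshape(-1))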
-
Publication No.: US20240112088A1
Publication Date: 2024-04-04
Application No.: US18520083
Filing Date: 2023-11-27
Applicant: Google LLC
Inventor: Jiahui Yu , Xin Li , Han Zhang , Vijay Vasudevan , Alexander Yeong-Shiuh Ku , Jason Michael Baldridge , Yuanzhong Xu , Jing Yu Koh , Thang Minh Luong , Gunjan Baid , Zirui Wang , Yonghui Wu
IPC: G06N20/00
CPC classification number: G06N20/00
Abstract: Systems and methods are provided for vector-quantized image modeling using vision transformers and improved codebook handling. In particular, the present disclosure provides a Vector-quantized Image Modeling (VIM) approach that involves pretraining a machine learning model (e.g., Transformer model) to predict rasterized image tokens autoregressively. The discrete image tokens can be encoded from a learned Vision-Transformer-based VQGAN (example implementations of which can be referred to as ViT-VQGAN). The present disclosure proposes multiple improvements over vanilla VQGAN from architecture to codebook learning, yielding better efficiency and reconstruction fidelity. The improved ViT-VQGAN further improves vector-quantized image modeling tasks, including unconditional image generation, conditioned image generation (e.g., class-conditioned image generation), and unsupervised representation learning.
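To complement the sketch under the related publication above, a hedged sketch of the codebook-lookup half: patch features from a vision encoder are quantized to their nearest codebook vectors. The l2-normalized (cosine) lookup and straight-through estimator shown here are common choices used only as illustrative assumptions:

import torch
import torch.nn.functional as F

def quantize(features, codebook):
    # features: (batch, patches, dim) from a ViT-style encoder; codebook: (codes, dim).
    # l2-normalize both sides so the nearest-neighbor search uses cosine similarity.
    f = F.normalize(features, dim=-1)
    c = F.normalize(codebook, dim=-1)
    ids = torch.argmax(f @ c.t(), dim=-1)              # nearest code per patch -> discrete token
    quantized = codebook[ids]                          # replace each patch feature by its code
    # Straight-through estimator so gradients still flow back to the encoder.
    return features + (quantized - features).detach(), ids

codebook = torch.randn(8192, 32)
quantized, token_ids = quantize(torch.randn(2, 256, 32), codebook)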