End-to-end multi-speaker audio-visual automatic speech recognition

Invention Grant

US11615781B2 End-to-end multi-speaker audio-visual automatic speech recognition (Active)

Patent Title: End-to-end multi-speaker audio-visual automatic speech recognition
Application No.: US17062538

Application Date: 2020-10-02
Publication No.: US11615781B2

Publication Date: 2023-03-28
Inventor: Otavio Braga
Applicant: Google LLC
Applicant Address: US CA Mountain View
Assignee: Google LLC
Current Assignee: Google LLC
Current Assignee Address: US CA Mountain View
Agency: Honigman LLP
Agent: Brett A. Krueger; Grant Griffith
Main IPC: G10L15/06
IPC: G10L15/06 ; G06N3/08 ; G10L15/22 ; G10L19/008

Abstract:

A single audio-visual automated speech recognition model for transcribing speech from audio-visual data includes an encoder frontend and a decoder. The encoder includes an attention mechanism configured to receive an audio track of the audio-visual data and a video portion of the audio-visual data. The video portion of the audio-visual data includes a plurality of video face tracks each associated with a face of a respective person. For each video face track of the plurality of video face tracks, the attention mechanism is configured to determine a confidence score indicating a likelihood that the face of the respective person associated with the video face track includes a speaking face of the audio track. The decoder is configured to process the audio track and the video face track of the plurality of video face tracks associated with the highest confidence score to determine a speech recognition result of the audio track.
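
The selection step described in the abstract can be illustrated with a minimal sketch. It is not the patented implementation: the abstract does not say how the confidence score is computed, so the time-averaged dot product between audio and video frame features, the softmax normalization, and every name below (`track_confidences`, `select_speaking_track`, the toy feature vectors) are assumptions made purely for illustration.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over a list of raw scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def track_confidences(
    audio: Sequence[Vector], tracks: Sequence[Sequence[Vector]]
) -> List[float]:
    """Return one confidence per face track; the confidences sum to 1.

    Assumption: the raw score of a track is the dot product between the
    audio and video features of each frame, averaged over time. A real
    model would learn this scoring function.
    """
    raw = []
    for track in tracks:
        sims = [
            sum(a * v for a, v in zip(a_t, v_t))
            for a_t, v_t in zip(audio, track)
        ]
        raw.append(sum(sims) / len(sims))
    return softmax(raw)


def select_speaking_track(
    audio: Sequence[Vector], tracks: Sequence[Sequence[Vector]]
) -> int:
    """Index of the face track with the highest confidence score."""
    conf = track_confidences(audio, tracks)
    return max(range(len(conf)), key=conf.__getitem__)


# Toy data (hypothetical): two time steps, two-dimensional features.
audio = [[1.0, 0.0], [0.0, 1.0]]
tracks = [
    [[0.0, 1.0], [1.0, 0.0]],  # face 0: features do not follow the audio
    [[1.0, 0.0], [0.0, 1.0]],  # face 1: features follow the audio
]
conf = track_confidences(audio, tracks)
best = select_speaking_track(audio, tracks)
```

In this toy setup the second track receives the higher confidence, so it is the one that would be passed to the decoder together with the audio track.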

Public/Granted literature

US20210118427A1 End-To-End Multi-Speaker Audio-Visual Automatic Speech Recognition Publication/Grant date: 2021-04-22

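IPC classification:

G	Physics
G10	Musical instruments; acoustics
G10L	Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
G10L15/00	Speech recognition (G10L17/00 takes precedence)
G10L15/06	. Creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice (G10L15/14 takes precedence)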