Hypothesis stitcher for speech recognition of long-form audio

Invention Grant

US11574639B2 Hypothesis stitcher for speech recognition of long-form audio (Active)

Patent Title: Hypothesis stitcher for speech recognition of long-form audio
Application No.: US17127938

Application Date: 2020-12-18
Publication No.: US11574639B2

Publication Date: 2023-02-07
Inventor: Naoyuki Kanda, Xuankai Chang, Yashesh Gaur, Xiaofei Wang, Zhong Meng, Takuya Yoshioka
Applicant: Microsoft Technology Licensing, LLC
Applicant Address: US WA Redmond
Assignee: Microsoft Technology Licensing, LLC
Current Assignee: Microsoft Technology Licensing, LLC
Current Assignee Address: US WA Redmond
Main IPC: G10L15/00
IPC: G10L15/00; G10L17/02; G10L15/22; G10L15/26; G10L19/022; G10L21/0272

Abstract:

A hypothesis stitcher for speech recognition of long-form audio provides superior performance, such as higher accuracy and reduced computational cost. An example disclosed operation includes: segmenting the audio stream into a plurality of audio segments; identifying a plurality of speakers within each of the plurality of audio segments; performing automatic speech recognition (ASR) on each of the plurality of audio segments to generate a plurality of short-segment hypotheses; merging at least a portion of the short-segment hypotheses into a first merged hypothesis set; inserting stitching symbols into the first merged hypothesis set, the stitching symbols including a window change (WC) symbol; and consolidating, with a network-based hypothesis stitcher, the first merged hypothesis set into a first consolidated hypothesis. Multiple variations are disclosed, including alignment-based stitchers and serialized stitchers, which may operate as speaker-specific stitchers or multi-speaker stitchers, and may further support multiple options for differing hypothesis configurations.
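
The segmenting, merging, and symbol-insertion steps named in the abstract can be illustrated with a minimal sketch. This is not the patented implementation: the function names, the `<WC>` spelling of the window change symbol, and the fixed-length windowing are assumptions made for illustration, and the speaker identification, ASR, and network-based stitcher stages are left out.

```python
from typing import List, Sequence

# Hypothetical spelling of the window change (WC) stitching symbol.
WC = "<WC>"


def segment(samples: Sequence[float], window: int, hop: int) -> List[Sequence[float]]:
    """Split a stream into fixed-length windows (assumed scheme).

    Windows overlap whenever hop < window. The last window may be shorter
    than the others if the stream length is not aligned to the hop.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    segments = []
    start = 0
    while start < len(samples):
        segments.append(samples[start:start + window])
        if start + window >= len(samples):
            break  # this window already reaches the end of the stream
        start += hop
    return segments


def merge_with_wc(hypotheses: List[List[str]]) -> List[str]:
    """Concatenate per-segment hypotheses, marking each boundary with WC."""
    merged: List[str] = []
    for index, hypothesis in enumerate(hypotheses):
        if index > 0:
            merged.append(WC)
        merged.extend(hypothesis)
    return merged
```

In this sketch, the merged token sequence is what a stitcher would receive as input; the WC symbol tells it where one window's hypothesis ends and the next begins, so that words repeated in the overlap can be consolidated.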

Public/Granted literature

US20220199091A1 HYPOTHESIS STITCHER FOR SPEECH RECOGNITION OF LONG-FORM AUDIO Public/Granted day: 2022-06-23

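IPC Classification:

G	Physics
G10	Musical instruments; acoustics
G10L	Speech analysis or synthesis; speech recognition; speech or voice processing; speech or audio coding or decoding
G10L15/00	Speech recognition (G10L17/00 takes precedence)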