Invention Grant
- Patent Title: Real-time neural text-to-speech
- Application No.: US17061433
- Application Date: 2020-10-01
- Publication No.: US11705107B2
- Publication Date: 2023-07-18
- Inventor: Sercan O. Arik, Mike Chrzanowski, Adam Coates, Gregory Diamos, Andrew Gibiansky, John Miller, Andrew Ng, Jonathan Raiman, Shubhahrata Sengupta, Mohammad Shoeybi
- Applicant: Baidu USA, LLC
- Applicant Address: US CA Sunnyvale
- Assignee: Baidu USA LLC
- Current Assignee: Baidu USA LLC
- Current Assignee Address: US CA Sunnyvale
- Agency: North Weber & Baugh LLP
- Main IPC: G10L13/08
- IPC: G10L13/08; G10L13/027; G10L25/30; G06N3/082; G06N3/044; G06N3/045; G06N3/02; G06F40/242; G06N3/047

Abstract:
Embodiments of a production-quality text-to-speech (TTS) system constructed from deep neural networks are described. System embodiments comprise five major building blocks: a segmentation model for locating phoneme boundaries, a grapheme-to-phoneme conversion model, a phoneme duration prediction model, a fundamental frequency prediction model, and an audio synthesis model. For embodiments of the segmentation model, phoneme boundary detection was performed with deep neural networks using Connectionist Temporal Classification (CTC) loss. For embodiments of the audio synthesis model, a variant of WaveNet was created that requires fewer parameters and trains faster than the original. By using a neural network for each component, system embodiments are simpler and more flexible than traditional TTS systems, where each component requires laborious feature engineering and extensive domain expertise. Inference with system embodiments may be performed faster than real time.
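The data flow between the building blocks named in the abstract can be illustrated with a minimal sketch. Everything here is hypothetical: the class, the function names, the order in which the models are chained, and the toy stand-in models are assumptions for illustration, not the patented implementation. The segmentation model is left out on the assumption that it serves to label phoneme boundaries in training data rather than to run at synthesis time. The helper `real_time_factor` shows one common way to read "faster than real time": compute time divided by audio duration is below 1.0.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class TTSPipeline:
    """Chains four hypothetical model callables into a text-to-audio flow."""
    grapheme_to_phoneme: Callable[[str], List[str]]
    predict_durations: Callable[[Sequence[str]], List[float]]  # seconds per phoneme
    predict_f0: Callable[[Sequence[str], Sequence[float]], List[float]]  # Hz per phoneme
    synthesize_audio: Callable[..., List[float]]  # returns raw audio samples
    sample_rate: int = 16000  # assumed value, not taken from the source

    def synthesize(self, text: str) -> List[float]:
        phonemes = self.grapheme_to_phoneme(text)
        durations = self.predict_durations(phonemes)
        f0 = self.predict_f0(phonemes, durations)
        return self.synthesize_audio(phonemes, durations, f0, self.sample_rate)


def real_time_factor(synthesis_seconds: float, num_samples: int, sample_rate: int) -> float:
    """Compute time divided by audio duration; below 1.0 is faster than real time."""
    return synthesis_seconds / (num_samples / sample_rate)


# Toy stand-ins for trained networks (assumptions, for demonstration only).
def toy_g2p(text: str) -> List[str]:
    return [ch for ch in text.lower() if ch.isalpha()]  # one "phoneme" per letter


def toy_durations(phonemes: Sequence[str]) -> List[float]:
    return [0.1 for _ in phonemes]  # fixed 100 ms per phoneme


def toy_f0(phonemes: Sequence[str], durations: Sequence[float]) -> List[float]:
    return [120.0 for _ in phonemes]  # constant pitch


def toy_vocoder(phonemes, durations, f0, sample_rate) -> List[float]:
    return [0.0] * round(sum(durations) * sample_rate)  # silence of the right length


pipeline = TTSPipeline(toy_g2p, toy_durations, toy_f0, toy_vocoder)
audio = pipeline.synthesize("hi")
rtf = real_time_factor(0.1, len(audio), pipeline.sample_rate)
```

Because each stage is just a callable, any one model can be swapped out without touching the others, which mirrors the abstract's point that using a neural network per component keeps the system flexible.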
Public/Granted literature:
- US20210027762A1 REAL-TIME NEURAL TEXT-TO-SPEECH, Public/Granted day: 2021-01-28