Acoustic trigger detection
    Invention grant

    Publication number: US10460722B1

    Publication date: 2019-10-29

    Application number: US15639175

    Filing date: 2017-06-30

    Abstract: A method for selective transmission of audio data to a speech processing server uses detection of an acoustic trigger in the audio data in determining the data to transmit. Detection of the acoustic trigger makes use of an efficient computation approach that reduces the amount of run-time computation required, or equivalently improves accuracy for a given amount of computation, by combining a “time delay” structure, in which intermediate results of computations are reused at various time delays so that new results need not be computed, with a decomposition of certain transformations that requires fewer arithmetic operations without sacrificing significant performance. For a given amount of computation capacity, the combination of these two techniques provides improved accuracy as compared to current approaches.
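
    The two ideas named in the abstract can be illustrated with a minimal sketch (not the patented implementation): per-frame hidden activations are cached and re-spliced at several time delays rather than recomputed, and a dense transform is replaced by a low-rank factorization U·V that needs fewer multiplies. The layer sizes, delay set, and rank below are hypothetical.

        import numpy as np

        rng = np.random.default_rng(0)
        D_IN, D_HID, RANK = 40, 128, 16               # hypothetical layer sizes
        W1 = rng.standard_normal((D_HID, D_IN))       # per-frame transform, applied once per frame
        U = rng.standard_normal((D_HID, RANK))        # low-rank factors: U @ V stands in for a
        V = rng.standard_normal((RANK, 3 * D_HID))    # dense (D_HID x 3*D_HID) transform

        frame_cache = {}                              # t -> hidden activation, reused at every delay

        def frame_hidden(features, t):
            # Compute the per-frame hidden vector once; later delays reuse the cached result.
            if t not in frame_cache:
                frame_cache[t] = np.tanh(W1 @ features[t])
            return frame_cache[t]

        def tdnn_output(features, t, delays=(-2, 0, 2)):
            # Splice cached hidden vectors at several time delays, then apply the
            # factored transform: 16*384 + 128*16 = 8,192 multiplies instead of
            # 128*384 = 49,152 for the equivalent dense matrix.
            spliced = np.concatenate([frame_hidden(features, t + d) for d in delays])
            return U @ (V @ spliced)

        features = rng.standard_normal((100, D_IN))   # 100 frames of toy input features
        outputs = [tdnn_output(features, t) for t in range(2, 98)]
        print(len(outputs), outputs[0].shape)         # 96 (128,)

    With the assumed sizes, the factored transform does about a sixth of the multiplies of the equivalent dense matrix, which is the kind of saving the abstract refers to.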

    Estimating speaker-specific affine transforms for neural network based speech recognition systems
    Invention grant (in force)

    Publication number: US09378735B1

    Publication date: 2016-06-28

    Application number: US14135474

    Filing date: 2013-12-19

    CPC classification number: G10L15/16

    Abstract: Features are disclosed for estimating affine transforms in Log Filter-Bank Energy Space (“LFBE” space) in order to adapt artificial neural network-based acoustic models to a new speaker or environment. Neural network-based acoustic models may be trained using concatenated LFBEs as input features. The affine transform may be estimated by minimizing the least squares error between corresponding linear and bias transform parts for the resultant neural network feature vector and some standard speaker-specific feature vector obtained for a GMM-based acoustic model using constrained Maximum Likelihood Linear Regression (“cMLLR”) techniques. Alternatively, the affine transform may be estimated by minimizing the least squares error between the resultant transformed neural network feature and some standard speaker-specific feature obtained for a GMM-based acoustic model.
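
    As a rough illustration of the least-squares estimation described above, the sketch below fits an affine transform (A, b) in LFBE space so that transformed frames match speaker-adapted reference features. The reference features here are synthetic stand-ins for cMLLR-adapted targets, and the frame count and dimensionality are assumed.

        import numpy as np

        rng = np.random.default_rng(0)
        T, D = 500, 23                       # hypothetical: 500 frames of 23-dim LFBEs
        X = rng.standard_normal((T, D))      # unadapted LFBE features
        A_true = np.eye(D) + 0.1 * rng.standard_normal((D, D))
        b_true = rng.standard_normal(D)
        Y = X @ A_true.T + b_true            # synthetic stand-in for cMLLR-adapted targets

        # Augment X with a column of ones so the bias is estimated jointly with A.
        X_aug = np.hstack([X, np.ones((T, 1))])          # (T, D+1)
        W, *_ = np.linalg.lstsq(X_aug, Y, rcond=None)    # (D+1, D) least-squares solution
        A_hat, b_hat = W[:D].T, W[D]                     # recover A (D x D) and b (D,)

        adapted = X @ A_hat.T + b_hat        # apply the estimated speaker-specific transform
        print(float(np.max(np.abs(adapted - Y))))        # near zero on this synthetic example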

    SPEECH RECOGNIZER WITH MULTI-DIRECTIONAL DECODING
    Invention application (in force)

    Publication number: US20150095026A1

    Publication date: 2015-04-02

    Application number: US14039383

    Filing date: 2013-09-27

    Abstract: In an automatic speech recognition (ASR) processing system, ASR processing may be configured to process speech based on multiple channels of audio received from a beamformer. The ASR processing system may include a microphone array and the beamformer to output multiple channels of audio such that each channel isolates audio in a particular direction. The multichannel audio signals may include spoken utterances/speech from one or more speakers as well as undesired audio, such as noise from a household appliance. The ASR device may simultaneously perform speech recognition on the multi-channel audio to provide more accurate speech recognition results.
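
    A minimal sketch of this decoding strategy, with a placeholder recognize_channel function standing in for a real ASR decoder: each beamformed channel is decoded independently and the best-scoring hypothesis is kept. The channel count and the energy-based scoring rule below are assumptions for illustration.

        import numpy as np

        rng = np.random.default_rng(0)
        NUM_BEAMS, NUM_SAMPLES = 6, 16000
        channels = rng.standard_normal((NUM_BEAMS, NUM_SAMPLES))   # one channel per look direction

        def recognize_channel(audio, beam_index):
            # Placeholder decoder: label the channel and score it by its energy.
            score = float(np.mean(audio ** 2))
            return f"hypothesis from beam {beam_index}", score

        # Decode every directional channel, then keep the result with the best score.
        results = [recognize_channel(channels[i], i) for i in range(NUM_BEAMS)]
        best_hypothesis, best_score = max(results, key=lambda r: r[1])
        print(best_hypothesis, best_score)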

    Deep multi-channel acoustic modeling using multiple microphone array geometries

    Publication number: US11574628B1

    Publication date: 2023-02-07

    Application number: US16368331

    Filing date: 2019-03-28

    Abstract: Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-geometry/multi-channel DNN) that is trained using a plurality of microphone array geometries. Thus, the first model may receive a variable number of microphone channels, generate multiple outputs using multiple microphone array geometries, and select the best output as a first feature vector that may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it to a second feature vector having a lower dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. The DNN front-end enables improved performance despite a reduction in microphones.
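
    One way to picture the multi-geometry first model is sketched below: a separate first-stage weight matrix per supported array geometry, each applied to whichever channels are available, with one output selected as the beamformed-like feature vector. The geometries, layer sizes, and the energy-based selection rule are assumptions, not details taken from the patent.

        import numpy as np

        rng = np.random.default_rng(0)
        FRAME_DIM, FEAT_DIM = 256, 128
        GEOMETRIES = {"2mic_linear": 2, "4mic_square": 4, "7mic_circular": 7}
        weights = {name: rng.standard_normal((FEAT_DIM, n * FRAME_DIM))
                   for name, n in GEOMETRIES.items()}   # one first-stage matrix per geometry

        def multi_geometry_features(frames):
            # frames: (num_mics, FRAME_DIM). Evaluate every geometry whose channel
            # count fits the available microphones and keep the best output.
            num_mics = frames.shape[0]
            candidates = {}
            for name, n in GEOMETRIES.items():
                if n <= num_mics:
                    stacked = frames[:n].reshape(-1)          # concatenate n channels
                    candidates[name] = np.tanh(weights[name] @ stacked)
            # Assumed selection rule: pick the geometry whose output has the most energy.
            best = max(candidates, key=lambda k: float(np.sum(candidates[k] ** 2)))
            return best, candidates[best]

        frames = rng.standard_normal((4, FRAME_DIM))          # a device with 4 microphones
        geometry, feature_vector = multi_geometry_features(frames)
        print(geometry, feature_vector.shape)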

    Device-directed utterance detection

    Publication number: US11551685B2

    Publication date: 2023-01-10

    Application number: US16822744

    Filing date: 2020-03-18

    Abstract: A speech interface device is configured to detect an interrupt event and process a voice command without detecting a wakeword. The device includes an on-device interrupt architecture configured to detect when device-directed speech is present and send audio data to a remote system for speech processing. This architecture includes an interrupt detector that detects an interrupt event (e.g., device-directed speech) with low latency, enabling the device to quickly lower a volume of output audio and/or perform other actions in response to a potential voice command. In addition, the architecture includes a device-directed classifier that processes an entire utterance and corresponding semantic information and detects device-directed speech with high accuracy. Using the device-directed classifier, the device may reject the interrupt event and increase a volume of the output audio, or may accept the interrupt event, causing the output audio to end and performing speech processing on the audio data.
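
    The control flow of this two-stage architecture can be sketched as follows, with the interrupt detector and the device-directed classifier replaced by hypothetical placeholders (simple thresholds): on a possible interrupt the device ducks its output audio immediately, then either accepts the event (stopping playback and handing the audio to speech processing) or rejects it and restores the volume.

        import numpy as np

        rng = np.random.default_rng(0)

        def interrupt_detector(audio_frame):
            # Low-latency stand-in: flag a potential device-directed interrupt.
            return float(np.mean(audio_frame ** 2)) > 0.5     # assumed energy threshold

        def device_directed_classifier(utterance_audio, semantics):
            # Slower stand-in for the utterance-level, semantics-aware classifier.
            return semantics.get("looks_like_command", False)

        output_volume = 1.0
        audio_frame = rng.standard_normal(160) * 1.2          # toy 10 ms frame at 16 kHz
        if interrupt_detector(audio_frame):
            previous_volume = output_volume
            output_volume = 0.2                               # duck output audio right away
            utterance = rng.standard_normal(16000)            # toy full utterance
            if device_directed_classifier(utterance, {"looks_like_command": True}):
                output_volume = 0.0       # accept: stop playback, hand audio to speech processing
            else:
                output_volume = previous_volume               # reject: restore playback volume
        print(output_volume)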

    Deep multi-channel acoustic modeling

    Publication number: US11475881B2

    Publication date: 2022-10-18

    Application number: US16932049

    Filing date: 2020-07-17

    Abstract: Techniques for speech processing using a deep neural network (DNN) based acoustic model front-end are described. A new modeling approach directly models multi-channel audio data received from a microphone array using a first model (e.g., multi-channel DNN) that takes in raw signals and produces a first feature vector that may be used similarly to beamformed features generated by an acoustic beamformer. A second model (e.g., feature extraction DNN) processes the first feature vector and transforms it to a second feature vector having a lower dimensional representation. A third model (e.g., classification DNN) processes the second feature vector to perform acoustic unit classification and generate text data. These three models may be jointly optimized for speech processing (as opposed to individually optimized for signal enhancement), enabling improved performance despite a reduction in microphones and a reduction in bandwidth consumption during real-time processing.
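
    A minimal sketch of the three-model data flow, with random weights standing in for the jointly optimized networks and all dimensions assumed: a multi-channel network maps stacked raw channels to a beamformed-like feature, a feature-extraction network reduces its dimensionality, and a classification network scores acoustic units.

        import numpy as np

        rng = np.random.default_rng(0)
        NUM_MICS, FRAME_DIM, FEAT1, FEAT2, NUM_UNITS = 2, 256, 128, 64, 4000

        W_multichannel = rng.standard_normal((FEAT1, NUM_MICS * FRAME_DIM))  # first model
        W_extract = rng.standard_normal((FEAT2, FEAT1))                      # second model
        W_classify = rng.standard_normal((NUM_UNITS, FEAT2))                 # third model

        def acoustic_front_end(frames):
            # frames: (NUM_MICS, FRAME_DIM) of raw per-channel features for one frame.
            f1 = np.tanh(W_multichannel @ frames.reshape(-1))   # beamformed-like feature vector
            f2 = np.tanh(W_extract @ f1)                        # lower-dimensional representation
            logits = W_classify @ f2                            # acoustic-unit scores
            m = logits.max()
            return logits - (m + np.log(np.sum(np.exp(logits - m))))   # log-softmax over units

        frames = rng.standard_normal((NUM_MICS, FRAME_DIM))
        log_probs = acoustic_front_end(frames)
        print(log_probs.shape)                                  # (4000,)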
