Abstract:
A method of reconstructing speech input at a communication device comprises receiving, at the communication device, encoded data that includes encoded spectral data and encoded energy data of the speech input, the encoded spectral data being encoded as a series of mel-frequency cepstral coefficients. The method further comprises decoding, at the communication device, the encoded spectral data and encoded energy data to determine the spectral data and energy data, wherein decoding comprises: performing an inverse discrete cosine transform on the mel-frequency cepstral coefficients at harmonic mel-frequencies corresponding to a pitch period of the speech input to determine log-spectral magnitudes of the speech input at the harmonic mel-frequencies, and exponentiating the log-spectral magnitudes to determine the spectral magnitudes of the speech input. The method also comprises combining the spectral data and energy data to reconstruct the speech input at the communication device. A communication device for use in a distributed speech recognition system is also disclosed.
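The decoding step described above can be illustrated with a minimal sketch. It assumes an unnormalized DCT-II over band centers spaced uniformly on the mel scale, and the common mel mapping 2595·log10(1 + f/700); neither convention is stated in the abstract. The function names and the use of a pitch frequency in Hz (the reciprocal of the pitch period) are hypothetical choices made for illustration.

```python
import math

def hz_to_mel(f_hz):
    # Common mel-scale formula (an assumption; the abstract does not give one).
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def dct_ii(log_mags):
    # Unnormalized DCT-II: turns log magnitudes into cepstral coefficients.
    n = len(log_mags)
    return [sum(log_mags[j] * math.cos(math.pi * i * (j + 0.5) / n)
                for j in range(n))
            for i in range(n)]

def idct_at(coeffs, n_points, t):
    # Inverse of dct_ii evaluated at a possibly fractional band index t.
    acc = coeffs[0] / n_points
    for i in range(1, len(coeffs)):
        acc += (2.0 / n_points) * coeffs[i] * math.cos(
            math.pi * i * (t + 0.5) / n_points)
    return acc

def harmonic_magnitudes(coeffs, n_points, pitch_hz, nyquist_hz):
    # Evaluate the inverse DCT at the mel positions of the pitch harmonics,
    # then exponentiate the log magnitudes to get spectral magnitudes.
    mel_max = hz_to_mel(nyquist_hz)
    mags = []
    k = 1
    while k * pitch_hz < nyquist_hz:
        # Assumed mapping: band j is centered at (j + 0.5) / n_points of mel_max.
        t = hz_to_mel(k * pitch_hz) / mel_max * n_points - 0.5
        mags.append(math.exp(idct_at(coeffs, n_points, t)))
        k += 1
    return mags
```

Evaluating the cosine series at fractional indices is what lets the decoder read the spectral envelope at the harmonic positions rather than only at the original band centers.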
Abstract:
A bandwidth extension method comprising: receiving an input digital audio signal comprising a narrow-band signal in a first frequency range; determining an estimated high-band energy level in a second frequency range, corresponding to the input digital audio signal, where the second frequency range is higher in frequency than the first frequency range and the estimated high-band energy lacks information to be estimated and used in the bandwidth extension; and modifying the estimated high-band energy level based on characteristics of the narrow-band signal; where the step of modifying the estimated high-band energy level comprises the step of modifying the estimated high-band energy level based on an occurrence of an onset/plosive; where the estimated high-band energy levels of a sequence of Kmax frames starting at a frame in which the onset/plosive has been detected are modified; where the first Kmin frames are set to a lowest possible energy level Emin; where the modification of the estimated high-band energy levels continues to the Kmax-th frame as long as the voicing level of a frame within the sequence of Kmax frames exceeds a threshold; and where the modification of the estimated high-band energy level is given by decreasing the high-band energy level by a fixed amount until a frame KT in which the voicing level of the frame exceeds a threshold, after which it is increased again toward the estimated high-band energy.
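The onset/plosive handling can be sketched roughly as follows. This is only one possible reading of the abstract: energies are assumed to be in dB, the "fixed amount" is applied as a constant offset below each frame's estimate, and the return toward the estimate after frame KT is assumed to be a linear ramp ending at frame Kmax. All parameter names are hypothetical.

```python
def modify_high_band_energy(est_db, voicing, onset, k_min, k_max,
                            e_min_db, step_db, v_thresh):
    # est_db:  estimated high-band energy per frame (dB, assumed unit)
    # voicing: voicing level per frame
    # onset:   index of the frame where the onset/plosive was detected
    out = list(est_db)
    end = min(onset + k_max, len(est_db))
    k_t = None  # first frame (relative to onset) whose voicing exceeds v_thresh
    for n in range(onset, end):
        k = n - onset
        if k < k_min:
            # The first k_min frames are forced to the lowest energy level.
            out[n] = e_min_db
            continue
        if k_t is None and voicing[n] > v_thresh:
            k_t = k
        if k_t is None:
            # Before frame KT: lower the estimate by a fixed amount.
            out[n] = max(e_min_db, est_db[n] - step_db)
        else:
            # From frame KT on: ramp back toward the estimate (assumed linear).
            frac = (k - k_t + 1) / (k_max - k_t)
            out[n] = est_db[n] - step_db * (1.0 - frac)
    return out
```

With `k_min=2`, `k_max=6` and voicing first exceeding the threshold at the fifth frame, the first two frames sit at `e_min_db`, the next two are lowered by `step_db`, and the last two climb back to the unmodified estimate.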
Abstract:
A method (100) includes receiving (101) an input digital audio signal comprising a narrow-band signal; the input digital audio signal is processed (102) to generate a processed digital audio signal; an estimate of the high-band energy level corresponding to an extended-bandwidth input digital audio signal is determined (103); the estimated high-band energy level is modified based on an accuracy of the estimate and/or characteristics of the narrow-band signal (104); a high-band digital audio signal is generated based on the modified estimate of the high-band energy level and an estimated high-band spectrum corresponding to the modified estimate of the high-band energy level (105).
Abstract:
A system or method for modeling a signal, such as a speech signal, in which harmonic frequencies and amplitudes are identified and the harmonic magnitudes are interpolated to obtain spectral magnitudes at a set of fixed frequencies. An inverse transform is applied to the spectral magnitudes to obtain a pseudo auto-correlation sequence, from which linear prediction coefficients are calculated. From the linear prediction coefficients, model harmonic magnitudes are generated by sampling the spectral envelope defined by the linear prediction coefficients. A set of scale factors is then calculated as the ratio of the harmonic magnitudes to the model harmonic magnitudes and interpolated to obtain a second set of scale factors at the set of fixed frequencies. The spectral envelope magnitudes at the set of fixed frequencies are multiplied by the second set of scale factors to obtain new spectral magnitudes, and the process is iterated to obtain final linear prediction coefficients. The signal is modeled by the linear prediction coefficients.
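The first pass of this procedure (interpolation, inverse transform, linear prediction) might look like the sketch below; the scale-factor iteration is not shown. It assumes linear interpolation, a fixed frequency grid uniform on [0, π], an inverse cosine transform of the squared magnitudes, and the Levinson-Durbin recursion. The abstract does not specify any of these choices, and the function names are hypothetical.

```python
import math

def interpolate(xs, ys, x):
    # Linear interpolation with flat extrapolation outside [xs[0], xs[-1]].
    if x <= xs[0]:
        return ys[0]
    if x >= xs[-1]:
        return ys[-1]
    for i in range(1, len(xs)):
        if x <= xs[i]:
            w = (x - xs[i - 1]) / (xs[i] - xs[i - 1])
            return ys[i - 1] + w * (ys[i] - ys[i - 1])

def pseudo_autocorrelation(power, order):
    # Inverse cosine transform (trapezoidal rule) of a power spectrum
    # sampled at w_k = pi * k / (K - 1), k = 0..K-1.
    last = len(power) - 1
    r = []
    for lag in range(order + 1):
        acc = 0.0
        for k, p in enumerate(power):
            weight = 0.5 if k in (0, last) else 1.0
            acc += weight * p * math.cos(math.pi * k * lag / last)
        r.append(acc / last)
    return r

def levinson_durbin(r, order):
    # Solves the normal equations; returns ([1, a1, ..., ap], residual energy).
    a = [1.0] + [0.0] * order
    err = r[0]
    for m in range(1, order + 1):
        acc = r[m] + sum(a[i] * r[m - i] for i in range(1, m))
        k = -acc / err
        new_a = a[:]
        for i in range(1, m):
            new_a[i] = a[i] + k * a[m - i]
        new_a[m] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err

def lp_from_harmonics(harm_freqs, harm_mags, n_fixed, order):
    # harm_freqs are in radians per sample, in increasing order.
    grid = [math.pi * k / (n_fixed - 1) for k in range(n_fixed)]
    power = [interpolate(harm_freqs, harm_mags, w) ** 2 for w in grid]
    return levinson_durbin(pseudo_autocorrelation(power, order), order)
```

A full implementation would go on to sample the resulting envelope at the harmonic frequencies, form the scale factors, and repeat these steps until the coefficients settle.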
Abstract:
A system, method and computer readable medium for quantizing pitch information of audio are disclosed. The method includes capturing audio representing a numbered frame of a plurality of numbered frames. The method further includes calculating a class of the frame, wherein a class is any one of a voiced or unvoiced class. If the frame is a voiced class, a pitch is calculated for the frame. If the frame is an even numbered frame and a voiced class, a codeword of a first length is calculated by absolutely quantizing the frame pitch. If the frame is an odd numbered frame and a voiced class and a reliable frame is available, a codeword of a second length is calculated by differentially quantizing the frame pitch. If there is no reliable frame available, a codeword of the second length is calculated by absolutely quantizing the frame pitch.
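The branching between absolute and differential quantization can be sketched as below. The codeword lengths (7 and 5 bits), the pitch-period range, the log-domain uniform quantizers, and the `max_ratio` limit are all illustrative assumptions; the abstract specifies only that two codeword lengths exist. How the decoder tells the two odd-frame modes apart, and how unvoiced frames are coded, is outside this sketch.

```python
import math

# Hypothetical pitch-period range in samples (not taken from the abstract).
P_MIN, P_MAX = 20.0, 147.0

def quantize_absolute(pitch, bits):
    # Uniform quantization of log-pitch over [P_MIN, P_MAX].
    levels = 1 << bits
    x = (math.log(pitch) - math.log(P_MIN)) / (math.log(P_MAX) - math.log(P_MIN))
    return min(levels - 1, max(0, int(x * levels)))

def quantize_differential(pitch, ref_pitch, bits, max_ratio=1.25):
    # Uniform quantization of the log pitch ratio relative to a reference pitch.
    levels = 1 << bits
    x = math.log(pitch / ref_pitch) / math.log(max_ratio)  # roughly in [-1, 1]
    idx = int(round((x + 1.0) / 2.0 * (levels - 1)))
    return min(levels - 1, max(0, idx))

def encode_pitch(frame_number, voiced, pitch, reliable_ref,
                 first_bits=7, second_bits=5):
    # reliable_ref: pitch of a previously decoded reliable frame, or None.
    # Returns (mode, codeword_length, codeword), or None for unvoiced frames.
    if not voiced:
        return None
    if frame_number % 2 == 0:
        return ("absolute", first_bits, quantize_absolute(pitch, first_bits))
    if reliable_ref is not None:
        return ("differential", second_bits,
                quantize_differential(pitch, reliable_ref, second_bits))
    return ("absolute", second_bits, quantize_absolute(pitch, second_bits))
```

For example, `encode_pitch(3, True, 50.0, None)` falls back to a short absolute codeword because no reliable reference is available for the odd-numbered frame.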
Abstract:
A method and apparatus for noise suppression within a distributed speech recognition system is provided herein. Mel-frequency cepstral coefficient (MFCC) values are converted to filter bank outputs (F'0 through F'22). The filter bank outputs are then used by a noise suppressor (303) for channel energy estimation, noise energy estimation, etc. Noise suppression takes place on F'0 through F'22, and the noise-suppressed filter bank outputs F''0 through F''22 are converted back to MFCC values.
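The round trip between MFCC values and the 23 filter bank outputs can be sketched as follows, assuming the MFCCs are an unnormalized DCT-II of the log filter bank outputs (the abstract does not state the transform convention). The noise suppressor itself is reduced here to a hypothetical per-band gain vector; the abstract's channel and noise energy estimation are not modeled.

```python
import math

N_BANDS = 23  # F'0 through F'22

def mfcc_to_filter_bank(mfcc):
    # Inverse DCT to log filter bank outputs, then exponentiate.
    out = []
    for j in range(N_BANDS):
        log_e = mfcc[0] / N_BANDS
        for i in range(1, len(mfcc)):
            log_e += (2.0 / N_BANDS) * mfcc[i] * math.cos(
                math.pi * i * (j + 0.5) / N_BANDS)
        out.append(math.exp(log_e))
    return out

def filter_bank_to_mfcc(fb, n_coeffs):
    # Log of each filter bank output, then forward DCT-II.
    return [sum(math.log(fb[j]) * math.cos(math.pi * i * (j + 0.5) / N_BANDS)
                for j in range(N_BANDS))
            for i in range(n_coeffs)]

def suppress_in_filter_bank_domain(mfcc, gains):
    # gains: one multiplicative gain per band, standing in for the suppressor.
    fb = mfcc_to_filter_bank(mfcc)
    suppressed = [g * f for g, f in zip(gains, fb)]
    return filter_bank_to_mfcc(suppressed, len(mfcc))
```

Under this convention a uniform gain changes only the zeroth coefficient, which is a quick sanity check that the two conversions are consistent.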
Abstract:
A signal that includes noise (301) is sampled to provide a plurality of digital information samples (303). A predetermined number of the digital information samples are grouped as a set (305). Noise suppression is performed on the signal using the following steps. One or more digital representations of silence are attached to the set, forming an extended set (401). A Fourier transform is performed on the extended set, yielding a set of frequency domain coefficients (403), at least some of which are scaled (405). An inverse Fourier transform is performed on the set of scaled frequency domain coefficients to provide a set of time domain samples (407), which are partially overlapped in time and added with a previously formed set of time domain samples (409 and 411); the result is provided together with the non-overlapping time domain samples as a noise suppressed version of the signal (413).
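The steps above can be sketched as a zero-padded overlap-add loop. This sketch assumes the silence is appended as zeros, that the padding is no longer than the frame, and that the per-bin gains are supplied by the caller and are symmetric (`gains[k] == gains[n - k]`) so the output stays real. A plain DFT stands in for a fast transform to keep the example dependency-free, and the names are hypothetical.

```python
import cmath

def dft(x, inverse=False):
    # Direct O(n^2) discrete Fourier transform; fine for short frames.
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
                  for t in range(n))
        out.append(acc / n if inverse else acc)
    return out

def suppress_noise(signal, frame_len, pad_len, gains):
    # gains: one scale factor per frequency bin (length frame_len + pad_len).
    tail = [0.0] * pad_len  # overlapping part carried over from the last block
    out = []
    for start in range(0, len(signal) - frame_len + 1, frame_len):
        # Attach silence (zeros) to the set to form the extended set.
        extended = list(signal[start:start + frame_len]) + [0.0] * pad_len
        spectrum = dft(extended)
        scaled = [g * c for g, c in zip(gains, spectrum)]
        block = [v.real for v in dft(scaled, inverse=True)]
        # Overlap-add the start of this block with the previous block's tail.
        for i in range(pad_len):
            block[i] += tail[i]
        # Emit the overlapped sum followed by the non-overlapping samples.
        out.extend(block[:frame_len])
        tail = block[frame_len:]
    return out
```

The zero padding gives the scaled spectrum room for its time-domain spread, so the spill-over lands in the tail and is added into the next block rather than wrapping around. Samples in a trailing partial frame are dropped in this sketch.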
Abstract:
A system or method for modeling a signal, such as a speech signal, wherein harmonic frequencies and amplitudes are identified (106) and the harmonic magnitudes are interpolated (110) to obtain spectral magnitudes at a set of fixed frequencies. An inverse transform is applied (112) to the spectral magnitudes to obtain a pseudo auto-correlation sequence, from which linear prediction coefficients are calculated (114). From the linear prediction coefficients, model harmonic magnitudes are generated by sampling the spectral envelope (118) defined by the linear prediction coefficients. A set of scale factors is then calculated (120) as the ratio of the harmonic magnitudes to the model harmonic magnitudes and interpolated to obtain a second set of scale factors (122) at the set of fixed frequencies. The spectral envelope magnitudes at the set of fixed frequencies (124) are multiplied by the second set of scale factors (126) to obtain new spectral magnitudes, and the process is iterated to obtain final linear prediction coefficients.
Abstract:
A system, method and computer readable medium for quantizing pitch information of audio are disclosed. The method includes capturing audio representing a numbered frame of a plurality of numbered frames. The method further includes calculating a class of the frame, wherein a class is any one of a voiced or unvoiced class. If the frame is a voiced class, a pitch is calculated for the frame (903). If the frame is an even numbered frame and a voiced class, a codeword of a first length is calculated by absolutely quantizing the frame pitch (910). If the frame is an odd numbered frame and a voiced class and a reliable frame is available, a codeword of a second length is calculated by differentially quantizing the frame pitch (905). If there is no reliable frame available, a codeword of the second length is calculated by absolutely quantizing the frame pitch.