Abstract:
An objective is to provide a technique for accurately reproducing the features of the fundamental frequency (F0) of a target speaker's voice from only a small amount of learning data. A learning apparatus learns shift amounts from a reference source F0 pattern to a target F0 pattern of the target speaker's voice. The learning apparatus associates a source F0 pattern of a learning text with a target F0 pattern of the same learning text by matching their peaks and troughs. For each point on the target F0 pattern, the learning apparatus uses the result of the association to obtain shift amounts in the time-axis direction and in the frequency-axis direction from the corresponding point on the source F0 pattern. It then learns a decision tree using, as an input feature vector, linguistic information obtained by parsing the learning text and, as an output feature vector, the calculated shift amounts.
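The shift-amount step can be illustrated with a minimal sketch. It assumes the peak-and-trough association has already been done, so that the source and target patterns arrive as equally long lists of paired points; the `Point` representation and the function name `shift_amounts` are hypothetical, and the decision-tree learning itself is not shown.

```python
from typing import List, Tuple

# Hypothetical representation: (time in seconds, log-F0 value).
Point = Tuple[float, float]


def shift_amounts(source: List[Point], target: List[Point]) -> List[Tuple[float, float]]:
    """Return (time shift, frequency shift) from each source point to its target point.

    Assumes source[i] and target[i] are already associated with each other.
    """
    if len(source) != len(target):
        raise ValueError("associated patterns must have the same number of points")
    return [
        (t_time - s_time, t_freq - s_freq)
        for (s_time, s_freq), (t_time, t_freq) in zip(source, target)
    ]


source_pattern = [(0.0, 5.0), (0.5, 5.5)]
target_pattern = [(0.25, 5.5), (0.5, 5.0)]
print(shift_amounts(source_pattern, target_pattern))
```

In such a setup, each pair of shifts would serve as the output feature vector for the linguistic features of the corresponding position in the learning text.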
Abstract:
A speech feature extraction apparatus, speech feature extraction method, and speech feature extraction program. The speech feature extraction apparatus includes: a first difference calculation module to (i) receive, as an input, a spectrum of a speech signal segmented into frames for each frequency bin, and (ii) calculate a delta spectrum for each frame, where the delta spectrum is the difference of the spectrum between consecutive frames for the frequency bin; and a first normalization module to normalize the delta spectrum of each frame for the frequency bin by dividing the delta spectrum by a function of an average spectrum, where the average spectrum is the average of the spectra over all frames that make up the overall speech for the frequency bin. The output of the first normalization module is defined as a first delta feature.
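A minimal sketch of the normalization follows. It makes two assumptions that the abstract does not fix: the delta spectrum is taken as the simple difference between adjacent frames, and the "function of an average spectrum" is taken to be the average itself. The function name `normalized_delta` is hypothetical.

```python
from typing import List


def normalized_delta(spectra: List[List[float]]) -> List[List[float]]:
    """spectra[t][k] is the spectrum of frame t at frequency bin k.

    All frames are assumed to belong to the speech being analyzed.
    Returns one normalized delta vector per frame, starting at frame 1.
    """
    n_frames = len(spectra)
    n_bins = len(spectra[0])
    # Average spectrum per frequency bin over all frames.
    average = [sum(frame[k] for frame in spectra) / n_frames for k in range(n_bins)]
    features = []
    for t in range(1, n_frames):
        # Assumed delta: difference between adjacent frames, divided by the bin average.
        features.append(
            [(spectra[t][k] - spectra[t - 1][k]) / average[k] for k in range(n_bins)]
        )
    return features


frames = [[1.0, 2.0], [3.0, 2.0], [2.0, 8.0]]
print(normalized_delta(frames))
```

Dividing by a per-bin average makes the delta relative to the typical energy in that bin, rather than an absolute difference.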
Abstract:
A speech feature extraction apparatus, comprising: a first difference calculation unit (600, 700, 800) for receiving a spectrum for each of a plurality of frequency bins of a speech signal, the speech signal being segmented into frames for each frequency bin, and for calculating, for each frame of each frequency bin, a difference of the spectrum between consecutive frames for the frequency bin as a delta spectrum; and a first normalization unit (605, 710, 810) for normalizing the delta spectrum for each frame of each frequency bin by dividing the delta spectrum by a function of the average spectrum, which is given by an average of the spectra over all frames representing speech.
Abstract:
The present invention relates to providing natural-sounding phonemes and accents for text. A system is provided that outputs the phonemes and accents of a text. The system has a storage section storing a first corpus in which the spellings, phonemes, and accents of previously input text are recorded separately for each segmentation of the words contained in the text. A text for which phonemes and accents are to be output is acquired, and the first corpus is searched to retrieve, from among its sets of contiguous spellings, at least one set of spellings that matches the spellings in the text. Then, the combination of a phoneme and an accent that has a higher probability of occurrence in the first corpus than a predetermined reference probability is selected as the phonemes and accent of the text.
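The selection rule can be sketched as follows. This is a simplified illustration that looks up a single spelling rather than sets of contiguous spellings, and that estimates the probability of occurrence as a relative frequency in the corpus; both choices, the `Entry` layout, and the name `select_reading` are assumptions.

```python
from collections import Counter
from typing import List, Optional, Tuple

# Hypothetical corpus entry: (spelling, phoneme, accent).
Entry = Tuple[str, str, str]


def select_reading(
    corpus: List[Entry], spelling: str, reference_prob: float
) -> Optional[Tuple[str, str]]:
    """Return the most frequent (phoneme, accent) pair for a spelling,
    but only if its relative frequency exceeds the reference probability."""
    counts = Counter((p, a) for s, p, a in corpus if s == spelling)
    total = sum(counts.values())
    if total == 0:
        return None  # spelling not found in the corpus
    (phoneme, accent), n = counts.most_common(1)[0]
    return (phoneme, accent) if n / total > reference_prob else None


corpus = [("read", "r iy d", "H"), ("read", "r eh d", "H"), ("read", "r iy d", "H")]
print(select_reading(corpus, "read", 0.5))
```

Returning `None` when no combination clears the threshold leaves room for a fallback method, which this sketch does not attempt to model.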
Abstract:
The invention independently vector-quantizes the spectrum representing the static feature of speech on the frequency axis and the variation pattern of the spectrum on the time axis. The resultant pair of label trains La and Lc is evaluated, based on the knowledge that there is only a small correlation between them, by the equation:

$$
\begin{aligned}
P(La, Lc \mid W) &= \sum_{I} P(La, Lc \mid I, W)\, P(I \mid W) \\
&= \sum_{I} P(La(1) \mid Ma(i_1))\, P(Lc(1) \mid Mc(i_1))\, P(B_{i_1,i_2} \mid Ma(i_1), Mc(i_1)) \\
&\qquad \cdot P(La(2) \mid Ma(i_2))\, P(Lc(2) \mid Mc(i_2))\, P(B_{i_2,i_3} \mid Ma(i_2), Mc(i_2)) \cdots \\
&\qquad \cdot P(La(T) \mid Ma(i_T))\, P(Lc(T) \mid Mc(i_T))\, P(B_{i_T,i_{T+1}} \mid Ma(i_T), Mc(i_T))
\end{aligned}
$$

where W designates a Markov model representing a word; $I = i_1, i_2, i_3, \ldots, i_T$ is a state train; Ma and Mc are the Markov models by label corresponding to the spectrum and the spectrum variation, respectively; and $B_{i,j}$ is a transition from state i to state j. $P(La, Lc \mid W)$ is calculated for each Markov model W representing a word, and the W giving the maximum value is determined as the recognition result.
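The sum over state trains in the equation above does not need to be enumerated explicitly; a forward recursion gives the same value. The sketch below is one way to compute it under stated assumptions: every model starts in state 0, the final transition may lead to any state, and the tables `trans`, `emit_a`, and `emit_c` are hypothetical stand-ins for the transition and label-output probabilities.

```python
from typing import Dict, List, Sequence, Tuple

# Hypothetical word model: (trans, emit_a, emit_c), where
#   trans[i][j]  ~ P(B_ij | Ma(i), Mc(i))
#   emit_a[i][l] ~ P(La = l | Ma(i)),  emit_c[i][l] ~ P(Lc = l | Mc(i))
Model = Tuple[List[List[float]], List[List[float]], List[List[float]]]


def likelihood(model: Model, la: Sequence[int], lc: Sequence[int]) -> float:
    """Forward computation of P(La, Lc | W), summing over all state trains."""
    trans, emit_a, emit_c = model
    n = len(trans)
    alpha = [1.0] + [0.0] * (n - 1)  # assumption: start in state 0
    for a_label, c_label in zip(la, lc):
        nxt = [0.0] * n
        for i in range(n):
            # The two label outputs are treated as independent given the state.
            w = alpha[i] * emit_a[i][a_label] * emit_c[i][c_label]
            for j in range(n):
                nxt[j] += w * trans[i][j]
        alpha = nxt
    return sum(alpha)  # assumption: any state may follow the last label


def recognize(models: Dict[str, Model], la: Sequence[int], lc: Sequence[int]) -> str:
    """Return the word whose model gives the maximum P(La, Lc | W)."""
    return max(models, key=lambda word: likelihood(models[word], la, lc))


word_model = ([[0.5, 0.5], [0.0, 1.0]], [[1.0, 0.0], [0.5, 0.5]], [[0.5, 0.5], [1.0, 0.0]])
print(likelihood(word_model, [0, 1], [0, 0]))
```

Multiplying the two label-output probabilities at each state reflects the small-correlation assumption between the spectrum labels and the spectrum-variation labels.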
Abstract:
PROBLEM TO BE SOLVED: To provide a practical voice recognition system and the like in which recognition performance is improved by taking utterance variation into account. SOLUTION: The system includes a voice recognition device 200 and a pre-processor 100 that creates a recognition graph used for voice recognition processing by the voice recognition device 200. The pre-processor 100 comprises: a language model estimation section 110 for estimating a language model; a recognition word dictionary section 130 holding correspondence information that relates a word to a phoneme string matching the written form of the word and to a phoneme string in which utterance variation is described; and a recognition graph creating section 140 for creating the recognition graph on the basis of the language model estimated by the language model estimation section 110 and the correspondence information held by the recognition word dictionary section 130 for the words included in the language model. The recognition graph creating section 140 creates the recognition graph by applying the phoneme strings that reflect utterance variation to the words included in a word string composed of more than a fixed number of words. COPYRIGHT: (C)2010,JPO&INPIT
Abstract:
PROBLEM TO BE SOLVED: To provide a highly accurate voice activity detection method for a low S/N environment. SOLUTION: Voice activity detection is performed by extracting a long-term spectrum variation component and a harmonic structure as feature vectors from a speech signal, and by increasing the difference in feature vectors between speech and non-speech included in the speech signal using the long-term spectrum variation component feature, or using both the long-term spectrum variation component extraction and the harmonic structure feature extraction. The correct rate and the accuracy rate of the voice activity detection are improved over conventional methods by using a long-term spectrum variation component whose window length exceeds the average phoneme duration of an utterance in the speech signal. The voice activity detection system and method provide speech processing, automatic speech recognition, and speech output capable of very accurate voice activity detection. COPYRIGHT: (C)2009,JPO&INPIT
Abstract:
PROBLEM TO BE SOLVED: To efficiently create high-quality synthesized voice by connecting a plurality of phonemes. SOLUTION: A system comprises: a phoneme storage section for storing a plurality of phoneme data; a synthesis section for creating voice data representing the synthesized voice of an input text by reading from the phoneme storage section, and connecting, the phoneme data corresponding to each phoneme that represents the pronunciation of the text; a calculation section for calculating, based on the voice data, an index value that indicates the unnaturalness of the synthesized voice of the text; a paraphrase storage section for storing, in association with each of a plurality of first notations, a second notation that is a paraphrase of that first notation; a replacing section for searching the text for a notation that matches any of the first notations and replacing the found notation with the corresponding second notation; and a determination section that outputs the created voice data on condition that the calculated index value is smaller than a reference value, and that inputs the replaced text to the synthesis section so that voice data of the replaced text is created, on condition that the index value is equal to or greater than the reference value. COPYRIGHT: (C)2008,JPO&INPIT
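The control flow of the determination section can be sketched as a loop. Everything concrete here is an assumption made for illustration: the callables `synthesize` and `unnaturalness` stand in for the synthesis and calculation sections, one notation is replaced per round, and the loop stops after `max_rounds` rounds or when no further replacement applies, which the abstract does not specify.

```python
from typing import Callable, Dict


def synthesize_with_paraphrase(
    text: str,
    synthesize: Callable[[str], str],
    unnaturalness: Callable[[str], float],
    paraphrases: Dict[str, str],
    reference: float,
    max_rounds: int = 10,
) -> str:
    """Synthesize text, paraphrasing and retrying while the index value is too high."""
    if max_rounds < 1:
        raise ValueError("max_rounds must be at least 1")
    voice = synthesize(text)
    for _ in range(max_rounds - 1):
        if unnaturalness(voice) < reference:
            break  # index value below the reference value: output as is
        # Replace the first matching first notation with its second notation.
        replaced = text
        for first, second in paraphrases.items():
            if first in replaced:
                replaced = replaced.replace(first, second)
                break
        if replaced == text:
            break  # assumption: give up when nothing can be paraphrased
        text = replaced
        voice = synthesize(text)
    return voice


# Toy stand-ins: "voice data" is the upper-cased text, unnaturalness counts "X".
result = synthesize_with_paraphrase(
    "an x-ray image", str.upper, lambda v: v.count("X"), {"x-ray": "scan"}, reference=1
)
print(result)
```

With real components, `synthesize` would return audio data and `unnaturalness` would score it, but the accept-or-retry structure would stay the same.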