Abstract:
PROBLEM TO BE SOLVED: To provide a system for supporting text-to-speech synthesis that can efficiently generate high-quality synthetic speech. SOLUTION: The system supports text-to-speech synthesis and comprises: a learning data generating part that recognizes input speech and generates first learning data associating the notation of each phrase with its reading; a frequency data generating part that creates, based on the first learning data, frequency data indicating the appearance frequency of each notation and reading of a phrase; and a setting part that sets the frequency data in a language processing part, which generates a reading for the notation of a text based on the appearance frequency of readings, so that the output speech of the text-to-speech synthesis approaches the input speech. COPYRIGHT: (C)2008,JPO&INPIT
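The frequency data described above can be illustrated with a minimal sketch. It assumes the learning data is simply a list of (notation, reading) pairs produced by speech recognition; the function names `build_frequency_data` and `pick_reading` and the sample pairs are hypothetical and do not come from the abstract.

```python
from collections import Counter

def build_frequency_data(learning_data):
    """Count how often each (notation, reading) pair appears in the learning data."""
    return Counter(learning_data)

def pick_reading(frequency_data, notation):
    """Return the most frequent reading recorded for a notation, or None if unseen."""
    candidates = {
        reading: count
        for (seen_notation, reading), count in frequency_data.items()
        if seen_notation == notation
    }
    if not candidates:
        return None
    return max(candidates, key=candidates.get)

# Hypothetical recognition results: the speaker mostly reads "read" as "R EH D".
learning_data = [
    ("read", "R EH D"),
    ("read", "R EH D"),
    ("read", "R IY D"),
    ("lead", "L IY D"),
]
frequency_data = build_frequency_data(learning_data)
chosen = pick_reading(frequency_data, "read")
```

Setting such counts in the language processing part biases its reading choice toward what the input speaker actually said, which is the effect the abstract aims for.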
Abstract:
PROBLEM TO BE SOLVED: To provide a method that efficiently eliminates background noise other than the sound source located in a target direction, thereby realizing highly precise speech recognition, and to provide a system using the method. SOLUTION: An angle-specific power distribution, observed by steering the directivity of a microphone array toward each candidate sound source direction, is approximated by a sum of coefficient multiples of two references: a reference angle-specific power distribution measured beforehand using a reference sound from the target sound source direction, and a reference angle-specific power distribution of non-directional background sound. Using this approximation, a noise suppression processing section extracts only the components from the target sound source direction. When the target sound source direction is unknown, a sound source localization section estimates it by selecting, from the reference angle-specific power distributions for the various sound source directions, the one that minimizes the approximation residual. Furthermore, maximum likelihood estimation is performed using the speech data of the components from the sound source direction being processed and a speech model obtained by applying a prescribed modeling to the speech data, and speech recognition is performed based on the resulting estimate. COPYRIGHT: (C)2004,JPO
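The two-component approximation and the residual-based direction search can be sketched as follows. This is an illustration under stated assumptions, not the patented procedure: it uses ordinary unconstrained least squares (the abstract does not say how the coefficients are obtained), and the function names and the toy five-angle distributions are hypothetical.

```python
def fit_two_components(observed, ref_target, ref_background):
    """Least-squares fit of observed ~ a * ref_target + b * ref_background.

    Returns (a, b, residual), where residual is the sum of squared errors.
    """
    stt = sum(t * t for t in ref_target)
    sbb = sum(g * g for g in ref_background)
    stb = sum(t * g for t, g in zip(ref_target, ref_background))
    sto = sum(t * o for t, o in zip(ref_target, observed))
    sbo = sum(g * o for g, o in zip(ref_background, observed))
    det = stt * sbb - stb * stb
    if abs(det) < 1e-12:
        raise ValueError("reference distributions are linearly dependent")
    # Solve the 2x2 normal equations by Cramer's rule.
    a = (sto * sbb - stb * sbo) / det
    b = (stt * sbo - stb * sto) / det
    residual = sum(
        (o - a * t - b * g) ** 2
        for o, t, g in zip(observed, ref_target, ref_background)
    )
    return a, b, residual

def estimate_direction(observed, references, ref_background):
    """Pick the candidate direction whose reference gives the smallest residual."""
    return min(
        references,
        key=lambda d: fit_two_components(observed, references[d], ref_background)[2],
    )

# Hypothetical reference distributions over five steering angles.
references = {
    0: [1.0, 0.6, 0.2, 0.1, 0.1],
    90: [0.1, 0.2, 1.0, 0.2, 0.1],
}
ref_background = [0.3, 0.3, 0.3, 0.3, 0.3]
# Synthetic observation: 2 x the 90-degree reference plus 1 x background.
observed = [2 * t + 1 * g for t, g in zip(references[90], ref_background)]

a, b, residual = fit_two_components(observed, references[90], ref_background)
direction = estimate_direction(observed, references, ref_background)
```

Here `a * ref_target` plays the role of the component from the target direction, and `b * ref_background` is the part attributed to non-directional noise.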
Abstract:
PURPOSE: To facilitate adaptation to an environment different from that at training time by converting adaptation speech into an adaptation label series, aligning it with each state or state transition of the corresponding Markov model, and finding the values of the parameters relating to the adaptation label set. CONSTITUTION: The adaptation speech is labeled, and the time-series correspondence is found between the label series of the adaptation speech and the states of the Markov model, which was estimated beforehand from a large amount of speech data. Based on this correspondence, the frequency with which each label corresponds to each state transition is counted over all the adaptation speech, and the conditional probability between labels and state transitions is estimated from the counts. This conditional probability is then used to convert the previously obtained parameters of the Markov model, thereby estimating new parameters. Consequently, a speech recognition system can be adapted in a short time using a small amount of data.
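The counting step can be sketched as below. The sketch assumes the alignment has already been computed and is given as (state transition, adaptation label) pairs; it covers only the estimation of the conditional probability from counts, not the subsequent parameter conversion, whose exact form the abstract does not give. All names and the sample alignment are hypothetical.

```python
from collections import Counter, defaultdict

def conditional_label_probs(aligned_pairs):
    """Estimate P(label | transition) from aligned (transition, label) pairs."""
    counts = defaultdict(Counter)
    for transition, label in aligned_pairs:
        counts[transition][label] += 1
    probs = {}
    for transition, label_counts in counts.items():
        total = sum(label_counts.values())
        probs[transition] = {
            label: count / total for label, count in label_counts.items()
        }
    return probs

# Hypothetical alignment of adaptation labels to Markov model state transitions.
aligned_pairs = [
    ("s0->s0", "A"),
    ("s0->s0", "A"),
    ("s0->s0", "C"),
    ("s0->s0", "A"),
    ("s0->s1", "B"),
]
probs = conditional_label_probs(aligned_pairs)
```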
Abstract:
The present invention relates to the provision of natural-sounding phonemes and accents for text. There is provided a system that outputs the phonemes and accents of a text. The system has a storage section storing a first corpus in which the spellings, phonemes, and accents of a previously input text are recorded separately for each segment of the words contained in that text. A text for which phonemes and accents are to be output is acquired, and the first corpus is searched, among its sets of contiguous spellings, for at least one set of spellings that matches the spellings in the text. Then, a combination of phonemes and accent whose probability of occurrence in the first corpus is higher than a predetermined reference probability is selected as the phonemes and accent of the text.
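The selection rule can be sketched for the simplest case of a single spelling segment. This is a simplification made for illustration: the abstract matches sets of contiguous spellings, whereas the sketch looks up one spelling at a time and uses relative frequency as the probability of occurrence. The corpus entries, accent codes, and function name are hypothetical.

```python
from collections import Counter

def select_readings(corpus, spelling, reference_probability):
    """Return (phoneme, accent) combinations for a spelling whose relative
    frequency in the corpus exceeds the reference probability."""
    counts = Counter(
        (phoneme, accent) for s, phoneme, accent in corpus if s == spelling
    )
    total = sum(counts.values())
    if total == 0:
        return []
    return [
        (combo, count / total)
        for combo, count in counts.most_common()
        if count / total > reference_probability
    ]

# Hypothetical corpus entries: (spelling, phonemes, accent pattern).
corpus = [
    ("record", "REH-kerd", "10"),
    ("record", "REH-kerd", "10"),
    ("record", "REH-kerd", "10"),
    ("record", "rih-KORD", "01"),
]
selected = select_readings(corpus, "record", reference_probability=0.5)
```

Raising the reference probability makes the system more conservative: with a threshold of 0.8, neither combination in this toy corpus would be selected.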
Abstract:
The present invention relates to a speech recognition system comprising means (4) for performing a frequency analysis of an input speech in a succession of time periods to obtain feature vectors, means (8) for producing a corresponding label train using a vector quantization code book (9), means (11) for matching a plurality of word baseforms, each expressed by a train of Markov models corresponding to labels, with said label train, means (14) for recognising the input speech on the basis of the matching result, and means (5, 6, 7, 9) for performing an adaptation operation on said system to improve its ability to recognise speech. According to the invention, the speech recognition system is characterised in that the means for performing an adaptation operation comprises: means (4) for dividing each of a plurality of input speech words into N segments (N being an integer greater than 1) and producing a representative value of the feature vector of each segment of each input speech word; means for dividing into segments the word baseforms each corresponding to one of said input speech words and for producing a representative value of each segment feature vector of each word baseform on the basis of a prototype vector of the vector quantization code book; means for producing a displacement vector indicating the displacement between the representative value of each segment of each input speech word and the representative value of the corresponding segment of the corresponding word baseform; means for storing the degree of relation between each segment of each input speech word and each label in a label group of the vector quantization code book; and prototype adaptation means for correcting the prototype vector of each label in the label group of the vector quantization code book by each displacement vector in accordance with the degree of relation between the label and the displacement vector.
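The prototype correction can be sketched as a relation-weighted shift. The abstract only says that each prototype is corrected by each displacement vector in accordance with the degree of relation; normalizing by the total relation weight, as done here, is an assumption, as are the function name, the data layout, and the sample values.

```python
def adapt_prototypes(prototypes, displacements, relation):
    """Shift each label's prototype by the relation-weighted mean displacement.

    prototypes:    {label: [float, ...]}
    displacements: {segment: [float, ...]}  input-minus-baseform representative values
    relation:      {segment: {label: weight}}  degree of relation (assumed non-negative)
    """
    adapted = {}
    for label, proto in prototypes.items():
        weights = {
            seg: relation.get(seg, {}).get(label, 0.0) for seg in displacements
        }
        total = sum(weights.values())
        if total == 0:
            # No segment relates to this label, so leave the prototype unchanged.
            adapted[label] = list(proto)
            continue
        shift = [
            sum(weights[seg] * displacements[seg][k] for seg in displacements) / total
            for k in range(len(proto))
        ]
        adapted[label] = [p + s for p, s in zip(proto, shift)]
    return adapted

# Hypothetical two-dimensional code book with two labels and two word segments.
prototypes = {"L1": [1.0, 1.0], "L2": [0.0, 0.0]}
displacements = {"seg1": [1.0, 0.0], "seg2": [0.0, 2.0]}
relation = {"seg1": {"L1": 3.0}, "seg2": {"L1": 1.0}}
adapted = adapt_prototypes(prototypes, displacements, relation)
```

Label `L1` is pulled mostly toward the displacement of `seg1`, with which it has the stronger relation, while `L2` has no related segment and stays where it was.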
Abstract:
A technique is provided for extracting features that are more robust against noise, multiple reflections (reverberation), and the like. A speech feature extraction apparatus includes difference calculation means for receiving, as input, a spectrum of a speech signal segmented into frames and for calculating, for each frame, a difference of the spectrum between consecutive frames (a difference in the linear domain) as a delta spectrum, and normalization means for normalizing the delta spectrum for the frame by dividing the delta spectrum by a function of an average spectrum. An output of the normalization means is defined as a delta feature.
Abstract:
This invention provides a technique for extracting, from audio signals, features that are more robust against noise and/or reverberation. An audio feature extracting apparatus comprises: difference calculating means operative to receive the spectra of framed audio signals and to calculate, as a delta spectrum, the difference in spectrum between the frames preceding and following each frame (a difference in the linear domain); and normalizing means operative to divide the delta spectrum by a function of the average spectrum, thereby normalizing the delta spectrum for each frame. The outputs of the normalizing means are used as delta features.
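The normalized delta feature can be sketched as follows. Several details are assumptions not fixed by the abstract: the delta is taken as the following frame minus the preceding frame with edge frames clamped, the average spectrum is the per-bin mean over all frames, the "function" of the average is the identity, and a small `eps` guards against division by zero. The function name and sample spectra are hypothetical.

```python
def normalized_delta(spectra, eps=1e-10):
    """Linear-domain delta spectrum divided by the per-bin average spectrum.

    spectra: list of frames, each a list of non-negative spectral magnitudes.
    """
    n_frames = len(spectra)
    n_bins = len(spectra[0])
    mean = [sum(frame[k] for frame in spectra) / n_frames for k in range(n_bins)]
    features = []
    for t in range(n_frames):
        prev_frame = spectra[max(t - 1, 0)]             # clamp at the first frame
        next_frame = spectra[min(t + 1, n_frames - 1)]  # clamp at the last frame
        features.append(
            [(next_frame[k] - prev_frame[k]) / (mean[k] + eps) for k in range(n_bins)]
        )
    return features

# Hypothetical three-frame, two-bin spectrum: bin 0 rises, bin 1 is constant.
spectra = [[1.0, 2.0], [2.0, 2.0], [3.0, 2.0]]
features = normalized_delta(spectra)

# The same signal with ten times the gain yields nearly identical features.
louder = [[10 * v for v in frame] for frame in spectra]
features_louder = normalized_delta(louder)
```

Because both the delta and the average scale with the signal level, dividing one by the other cancels a constant gain, which is one reason such a feature can be less sensitive to channel and level changes.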