Abstract:
PROBLEM TO BE SOLVED: To synthesize speech with high sound quality when many phonemes are available, by exploiting the advantages of waveform-concatenation speech synthesis, and to synthesize speech with accurate accent even when fewer phonemes are available. SOLUTION: Prosody that achieves both accuracy and high sound quality is provided by a two-pass search: a phoneme search followed by a search for a prosody correction amount. In a preferred embodiment, in both passes (phoneme selection and correction amount search), the consistency of the prosody is evaluated using a statistical model of the amount of change in the prosody (the slope of the fundamental frequency) to secure an accurate accent. In the search for the prosody correction amount, a sequence of prosody correction amounts that minimizes the corrected prosody cost is searched for. A sequence of correction amounts is thereby found that increases the likelihood under the statistical models of both the amount of change and the absolute value of the prosody, while keeping the correction amounts as small as possible. COPYRIGHT: (C)2009,JPO&INPIT
Abstract:
PROBLEM TO BE SOLVED: To provide a method, a means, and a program for high-accuracy speech recognition and natural-sounding synthesized speech output in a language having large variations in speech tone. SOLUTION: A statistical model is learned by observing the F0 tilt, computed from the F0 at the start point and the end point of a phoneme using a linear approximation method or a global smoothing method; the F0 tilt is evaluated at runtime, and synthesized speech in which the F0 is corrected based on a cost calculation is output. The change of the F0 tilt over time within a syllable is modeled by learning a decision tree for each of the regions into which the syllable is suitably and equally divided. Likelihood is evaluated by estimating an error range for the observed F0 tilt. By linking these operations, high-accuracy speech recognition and natural-tone synthesized speech output are obtained. COPYRIGHT: (C)2010,JPO&INPIT
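As an illustration of observing an F0 tilt by linear approximation, the following minimal sketch fits a least-squares line to F0 samples and returns its slope. The abstract does not specify the fitting procedure; the ordinary least-squares fit, the function name `f0_tilt`, and the units (seconds and Hz) are assumptions made here for illustration only.

```python
def f0_tilt(times, f0_values):
    """Return the least-squares slope of F0 over time (Hz per second).

    Hypothetical helper: an ordinary least-squares line fit is one
    possible "linear approximation"; the abstract does not say which
    method is actually used.
    """
    n = len(times)
    if n < 2 or n != len(f0_values):
        raise ValueError("need at least two samples of equal length")
    mean_t = sum(times) / n
    mean_f = sum(f0_values) / n
    # Covariance of (time, F0) divided by the variance of time gives the slope.
    num = sum((t - mean_t) * (f - mean_f) for t, f in zip(times, f0_values))
    den = sum((t - mean_t) ** 2 for t in times)
    if den == 0:
        raise ValueError("all time stamps are identical")
    return num / den


# Example: F0 rising steadily from 100 Hz to 130 Hz over 0.3 s.
tilt = f0_tilt([0.0, 0.1, 0.2, 0.3], [100.0, 110.0, 120.0, 130.0])
```

Applying the same function separately to each equal region of a syllable would give one tilt value per region, which is the kind of observation the per-region models could be trained on.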
Abstract:
PROBLEM TO BE SOLVED: To provide a method and a system for directly operating on information in compressed digital audio data. SOLUTION: A system which embeds additional information in compressed audio data has (1) a means for restoring MDCT (Modified Discrete Cosine Transform) coefficients from the compressed audio data, (2) a means which finds frequency components of the audio data by using the restored MDCT coefficients, (3) a means for embedding the additional information in the found frequency components in a frequency space, and (5) a means for generating compressed audio data from the MDCT coefficients in which the additional information is embedded.
Abstract:
PROBLEM TO BE SOLVED: To efficiently create high-quality synthesized voice by connecting a plurality of phonemes. SOLUTION: A system comprises: a phoneme storage section for storing a plurality of phoneme data; a synthesis section for creating voice data representing the synthesized voice of an input text by reading from the phoneme storage section, and connecting, the phoneme data corresponding to each phoneme that indicates the pronunciation of the text; a calculation section for calculating, based on the voice data, an index value which indicates the unnaturalness of the synthesized voice of the text; a paraphrase storage section for storing, in association with each of a plurality of first notations, a second notation which is a paraphrase of that first notation; a replacing section for searching the text for a notation that matches any of the first notations and replacing the matched notation with the second notation corresponding to that first notation; and a determination section which outputs the created voice data on condition that the calculated index value is smaller than a reference value, and which, on condition that the index value is equal to or greater than the reference value, inputs the replaced text to the synthesis section so that voice data of the replaced text is further created. COPYRIGHT: (C)2008,JPO&INPIT
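The control flow of the determination section can be sketched as a loop that resynthesizes after each paraphrase until the index value falls below the reference value. This is only a sketch under stated assumptions: `synthesize` and `unnaturalness` are hypothetical callables standing in for the synthesis and calculation sections, the paraphrase table is a plain dictionary, and the `max_rounds` cap is an addition of this example, not something the abstract describes.

```python
def replace_first_match(text, paraphrases):
    """Replace the first stored first-notation found in text with its paraphrase."""
    for first_notation, second_notation in paraphrases.items():
        if first_notation in text:
            return text.replace(first_notation, second_notation)
    return text


def synthesize_with_paraphrase(text, synthesize, unnaturalness,
                               paraphrases, reference_value, max_rounds=5):
    """Return (final_text, voice_data).

    synthesize(text) -> voice data and unnaturalness(voice) -> float are
    hypothetical stand-ins supplied by the caller.
    """
    voice = synthesize(text)
    for _ in range(max_rounds):
        # Output condition: index value smaller than the reference value.
        if unnaturalness(voice) < reference_value:
            break
        replaced = replace_first_match(text, paraphrases)
        if replaced == text:
            break  # nothing left to paraphrase; return the best effort
        text = replaced
        voice = synthesize(text)
    return text, voice


# Toy usage: the "voice data" is just the text, and the index value is
# high whenever the hard-to-synthesize word is still present.
final_text, voice = synthesize_with_paraphrase(
    "please utilize it",
    synthesize=lambda t: t,
    unnaturalness=lambda v: 1.0 if "utilize" in v else 0.0,
    paraphrases={"utilize": "use"},
    reference_value=0.5,
)
```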
Abstract:
PROBLEM TO BE SOLVED: To provide a system for supporting text-to-speech synthesis, capable of efficiently generating high-quality synthetic speech. SOLUTION: The system supports text-to-speech synthesis, and comprises a learning data generating part which recognizes input speech and generates first learning data associating the notation of a phrase with its reading; a frequency data generating part which creates, based on the first learning data, frequency data showing the appearance frequency of each combination of notation and reading of a phrase; and a setting part which sets the frequency data in a language processing part, which generates a reading corresponding to the notation of a text based on the appearance frequency of the reading, so as to bring the output speech of the text-to-speech synthesis close to the input speech. COPYRIGHT: (C)2008,JPO&INPIT
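One simple way to picture the frequency data is as a count of how often each reading was observed for each notation, from which a language processing part could pick the most frequent reading. The sketch below assumes the learning data is a list of (notation, reading) pairs; the function names and the data layout are hypothetical and are not taken from the abstract.

```python
from collections import Counter, defaultdict


def build_frequency_data(learning_data):
    """Count how often each reading appears for each notation.

    learning_data: iterable of (notation, reading) pairs, assumed here to
    come from recognizing the input speech.
    """
    counts = defaultdict(Counter)
    for notation, reading in learning_data:
        counts[notation][reading] += 1
    return counts


def most_frequent_reading(frequency_data, notation):
    """Return the reading seen most often for notation, or None if unseen."""
    readings = frequency_data.get(notation)
    if not readings:
        return None
    return readings.most_common(1)[0][0]


# Toy usage with romanized readings for an ambiguous notation.
pairs = [("read", "reed"), ("read", "red"), ("read", "red"), ("lead", "leed")]
freq = build_frequency_data(pairs)
chosen = most_frequent_reading(freq, "read")
```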
Abstract:
PROBLEM TO BE SOLVED: To provide a practicable and robust method and a system of digital watermark. SOLUTION: The digital watermark system consists in embedding additional information in digital data. One frame is composed of N samples fetched from the digital data and the following frame is determined, in such a way that it overlaps with the foregoing frame by M(0
Abstract:
PROBLEM TO BE SOLVED: To provide an information processor, an information processing method, an information processing system, and a program for analyzing a phrase reflecting information that is not recognized explicitly with words. SOLUTION: An information processor 120 uses voice data recording dialogs to identify information that is not clearly specified with words in the voice data, and comprises: an acoustic analysis unit 208 for executing acoustic analysis of the voice data by using acoustic data; a prosodic information acquisition unit 212 for identifying a region of the voice data delimited before and after by pauses, identifying a phrase in the identified region by using the acoustic analysis of the identified region, and generating one or more prosodic feature values for the phrase, with a prosodic feature value of the phrase as an element; an appearance frequency acquisition unit 210 for acquiring the appearance frequency, in the voice data, of the phrase acquired by the acoustic analysis unit 208; and a prosodic variation analysis unit 214 for calculating the degree of variation of the prosodic feature values of phrases with a high appearance frequency in the voice data, and determining a feature phrase.
Abstract:
PROBLEM TO BE SOLVED: To efficiently and accurately recognize the accent of input voice. SOLUTION: Notation data for learning showing the notation of each phrase of a text for learning, utterance data for learning showing characteristics of the utterance of each phrase, and boundary data for learning showing whether or not each phrase is the boundary of an accent phrase are stored. A candidate for the boundary data is input, and a first likelihood that the accent phrase boundaries of the phrases of the input text coincide with the input candidate is calculated from input notation data showing the notation of the input text representing the content of the input voice, the notation data for learning, and the boundary data for learning. A second likelihood that, when the input voice has the accent phrase boundaries indicated by the candidate for the boundary data, the utterance of each phrase of the input text becomes the utterance indicated by input utterance data is calculated from the input utterance data showing characteristics of the utterance of each phrase of the input voice, the utterance data for learning, and the boundary data for learning. The candidate for the boundary data that maximizes the product of the first likelihood and the second likelihood is searched for, and the result is output. COPYRIGHT: (C)2008,JPO&INPIT
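The final search step, choosing the boundary candidate that maximizes the product of the two likelihoods, can be sketched with an exhaustive enumeration. The abstract does not state how the search is carried out; brute-force enumeration is an assumption of this example, and `first_likelihood` and `second_likelihood` are hypothetical callables standing in for the models learned from the notation, utterance, and boundary data.

```python
from itertools import product


def best_boundaries(n_phrases, first_likelihood, second_likelihood):
    """Return (candidate, score) maximizing the product of both likelihoods.

    A candidate is a tuple of booleans, one per phrase, where True means
    "this phrase is an accent phrase boundary". Exhaustive enumeration is
    exponential in n_phrases and is used here only for illustration.
    """
    best, best_score = None, float("-inf")
    for candidate in product((False, True), repeat=n_phrases):
        score = first_likelihood(candidate) * second_likelihood(candidate)
        if score > best_score:
            best, best_score = candidate, score
    return best, best_score


# Toy usage with made-up likelihood functions for a two-phrase text.
candidate, score = best_boundaries(
    2,
    first_likelihood=lambda c: 0.9 if c == (True, False) else 0.1,
    second_likelihood=lambda c: 0.5 if c[0] else 0.2,
)
```

For longer texts, a dynamic-programming search would be a more practical choice if the likelihoods factor over adjacent phrases; whether they do is not stated in the abstract.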
Abstract:
PROBLEM TO BE SOLVED: To properly detect a digital watermark by improving the robustness of the digital watermark embedded in voice contents that have been processed in various ways. SOLUTION: This device is provided with watermark signal detection parts 11 for calculating, for each channel, detected values of a watermark signal by applying two or more keys to PCM data of voice contents; an adding part 12 for adding the detected values corresponding to each channel and each key for each possible combination of channel and key; and a comparison selection part for selecting and outputting one addition result from among the addition results produced by the adding part 12. Moreover, this device is provided with a message reconstruction part 13, which accumulates these detected values at different accumulation cycles, reconstructs a message embedded as a digital watermark from the accumulated detected values, and also performs boundary detection of the voice contents to detect the voice contents in which the digital watermark is embedded; and a detection result output part 14, which combines the results processed by the message reconstruction part 13 and outputs the result. COPYRIGHT: (C)2006,JPO&NCIPI
Abstract:
PROBLEM TO BE SOLVED: To detect the presence or absence and kind of processing performed to contents while discriminating the kind by combining plural kinds of digital watermarks. SOLUTION: This system is provided with an embedding device for adding a prescribed additional signal to the contents data of digital contents and a detector for detecting the additional signal from the digital contents. The embedding device is provided with a watermark designing part 110 and embedment signal generating parts 121-123 for generating plural kinds of mutually related additional signals whose resistance to the processing of the digital contents is different and a synthesizing part 130 for adding the plural kinds of additional signals to the contents data. Also, the detector for detecting the additional signal embedded by the embedding device is provided with an individual detecting part for individually detecting the plural kinds of additional signals from the contents data and a judging part for judging the kind of processing performed on the contents data by checking the level of the deterioration of the additional signal based on the detected result of the individual detecting part.