Abstract:
PROBLEM TO BE SOLVED: To improve the naturalness of synthesized speech and to synchronize multimedia with a text-to-speech converter (TTS) by defining, within the TTS, the additional prosodic information beyond the text, the information needed to interlock with multimedia, and the interface between this information and the TTS, and by using them in synthesized-speech generation.
SOLUTION: A multimedia information input part 10 consists of synchronization information for a text, prosodic information, and a moving picture. A data-by-media distributor 11 separates the multimedia information by medium, converts it into usable data structures, and transmits them. A language processing part 12 converts the text into phonemes and estimates and symbolizes the prosodic information. A prosody processing part 13 calculates the values of the prosodic control parameters other than those carried in the multimedia information. A synchronism adjuster 14 adjusts the duration of each phoneme to synchronize the synthesized speech with a video signal. A signal processing part 15 receives the prosodic information, etc., and generates synthesized speech by making use of a synthesis unit database 16.
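The synchronism adjuster described above can be sketched in a few lines: per-phoneme durations are uniformly rescaled so the synthesized speech span matches the duration of the corresponding video segment. This is a minimal illustrative sketch, not the patented method; the function name, the millisecond units, and uniform scaling are all assumptions.

```python
# Hypothetical sketch of a synchronism adjuster: scale each phoneme's
# duration so the synthesized speech fills the video segment exactly.

def adjust_durations(phoneme_durations_ms, video_span_ms):
    """Scale per-phoneme durations so their sum equals video_span_ms."""
    total = sum(phoneme_durations_ms)
    if total <= 0:
        raise ValueError("phoneme durations must sum to a positive value")
    scale = video_span_ms / total
    return [d * scale for d in phoneme_durations_ms]

# 300 ms of estimated speech stretched to fit a 450 ms video span
durations = adjust_durations([80, 120, 100], 450)
```

Uniform scaling preserves the relative timing of the phonemes, which is the simplest way to keep the prosody estimated by the language processing part intact while still hitting the video boundary.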
Abstract:
The system has a distributor for receiving multimedia input information and converting it into corresponding data structures for distribution to each medium. An image output unit and a speech processor receive image and speech information respectively from the distributor. The speech processor converts the text into a phoneme sequence in order to estimate and symbolise the prosodic information, from which a prosodic processor computes prosodic control parameter values. A synchroniser sets the duration of each phoneme to synchronise with the image signal, using synchronising information from the distributor, and sets each duration in the prosodic processor. A signal processor generates the synthesised speech, and a synthesis unit database selects the unit required for synthesis according to a request from the signal processor and transfers the required data.
Abstract:
Provided is a method and apparatus for encoding/decoding a multi-channel audio signal. The apparatus for encoding a multi-channel audio signal includes a frame converter for converting the multi-channel audio signal into a framed audio signal; means for downmixing the framed audio signal; means for encoding the downmixed audio signal; a source location information estimator for estimating source location information from the framed multi-channel audio signal; means for quantizing the estimated source location information; and means for multiplexing the encoded audio signal and the quantized source location information, to generate an encoded multi-channel audio signal.
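The encoder stages named in the abstract (framing, downmixing, source-location estimation, quantization, multiplexing) can be illustrated with a deliberately simplified sketch. All function names, the level-ratio location cue, and the 4-bit quantizer below are assumptions for illustration, not the claimed method.

```python
# Hypothetical sketch of the multi-channel encoder pipeline from the
# abstract, reduced to plain-Python list processing.

def frame_signal(channels, frame_len):
    """Framing stage: split each channel into fixed-length frames."""
    n = len(channels[0]) // frame_len
    return [[ch[i * frame_len:(i + 1) * frame_len] for ch in channels]
            for i in range(n)]

def downmix(frame):
    """Downmix stage: average the channels into one mono frame."""
    return [sum(samples) / len(frame) for samples in zip(*frame)]

def estimate_location(frame):
    """Location estimator: share of frame energy in channel 0 (crude cue)."""
    energies = [sum(x * x for x in ch) for ch in frame]
    total = sum(energies) or 1.0
    return energies[0] / total  # value in [0, 1]

def quantize(value, bits=4):
    """Quantizer: map the [0, 1] cue onto a small integer index."""
    return round(value * ((1 << bits) - 1))

def encode(channels, frame_len=4):
    """Multiplexer: pair each downmixed frame with its quantized cue."""
    return [(downmix(f), quantize(estimate_location(f)))
            for f in frame_signal(channels, frame_len)]
```

The point of the structure is that the decoder only needs the mono downmix plus a few bits of location cue per frame to reposition the source, rather than the full multi-channel waveform.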
Abstract:
The present invention provides a text-to-speech conversion system (TTS) for synchronizing with multimedia, and a method for organizing the input data of the TTS, which can enhance the naturalness of synthesized speech and accomplish the synchronization of multimedia with the TTS by defining additional prosody information, the information required to synchronize the TTS with multimedia, and an interface between this information and the TTS, for use in the production of the synthesized speech.
Abstract:
A method of formatting and normalizing continuous lip motions to events in a moving picture, in addition to text, in a Text-To-Speech converter is provided. Synthesized speech is synchronized with the moving picture by this method, wherein real speech data and the lip shapes in the moving picture are analyzed, and the estimated lip-shape information and the text information are used directly in generating the synthesized speech.
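One normalization step implied by this abstract can be sketched as follows: lip-motion event timestamps taken from the video are rescaled onto the time axis of the synthesized utterance, so each phoneme can be generated against the matching lip shape. The function name, millisecond units, and linear rescaling are illustrative assumptions only.

```python
# Hypothetical sketch: normalize lip-motion event times from the video
# onto the time span of the synthesized speech.

def normalize_events(event_times_ms, speech_span_ms):
    """Linearly map video event timestamps onto [0, speech_span_ms]."""
    start, end = event_times_ms[0], event_times_ms[-1]
    span = end - start
    if span <= 0:
        raise ValueError("event times must be strictly increasing")
    return [(t - start) / span * speech_span_ms for t in event_times_ms]

# lip events at 100/200/300 ms in the video, speech lasting 400 ms
boundaries = normalize_events([100, 200, 300], 400)
```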
Abstract:
The system includes a multimedia information input unit (10) for organising text, prosodic information and individual characteristics. A data distributor (11) distributes the information from the multimedia information input unit as information for each medium. A speech processor converts the text distributed by the data distributor into a phoneme stream, estimates the prosodic information and symbolises it. A prosodic processor (13) calculates the values of the prosodic control parameters from the symbolised prosodic information using rules and tables. A synchronisation adjusting unit (14) adjusts the duration of each phoneme using the distributed synchronisation information. A signal processor (15) generates synthetic speech using the prosodic control parameters and the data of a synthesis database (16). A picture output unit (17) outputs the distributed picture information onto a screen.
Abstract:
A method for compressing and decompressing a multi-channel signal using virtual source location information (VSLI) on a semicircular plane is provided. VSLI, rather than the inter-channel level difference (ICLD), is used as the spatial cue information, thereby minimizing the loss caused by quantization of the spatial cue information, improving the sound quality of the decompressed audio signal, and reproducing an excellent audio signal by reducing the distortion introduced when the decoder reconstructs the original signal.
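The core idea contrasted with ICLD can be sketched briefly: instead of coding the level difference between two channels directly, the levels are converted into the azimuth of a virtual source on a semicircular (here, quarter-circle per channel pair) plane, and that bounded angle is what gets quantized. The function names and the 4-bit uniform quantizer are assumptions for illustration; the patent's actual quantization scheme is not reproduced here.

```python
# Illustrative sketch of the VSLI idea: code a bounded source azimuth
# derived from two channel levels instead of the raw level difference.
import math

def vsli_azimuth(left_level, right_level):
    """Virtual source azimuth in [0, pi/2] from two channel levels."""
    return math.atan2(right_level, left_level)

def quantize(azimuth, bits=4):
    """Uniformly quantize the azimuth to an integer index."""
    step = (math.pi / 2) / ((1 << bits) - 1)
    return round(azimuth / step)

def dequantize(index, bits=4):
    """Recover the azimuth from its quantized index."""
    step = (math.pi / 2) / ((1 << bits) - 1)
    return index * step
```

Because the azimuth is bounded on [0, pi/2] while a level difference in dB is unbounded, a uniform quantizer of the angle spreads its error evenly over the whole range, which is one plausible reading of the claimed quantization-loss advantage.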