System and method for synthesizing photo-realistic video of a speech
Abstract:
A system and a method for obtaining a photo-realistic video from a text. The method includes: providing the text and an image of a talking person; synthesizing a speech audio from the text; extracting an acoustic feature from the speech audio by an acoustic feature extractor; and generating the photo-realistic video from the acoustic feature and the image by a video generation neural network. The video generating neural network is pre-trained by: providing a training video and a training image; extracting a training acoustic feature from training audio of the training video by the acoustic feature extractor; generating video frames from the training image and the training acoustic feature by the video generation neural network; and comparing the generated video frames with ground truth video frames using generative adversarial network (GAN). The ground truth video frames correspond to the training video frames.
Information query
Patent Agency Ranking
0/0