Systems and methods for unified vision-language understanding and generation
Abstract:
Embodiments described herein provide systems, methods, and devices for generating enhanced vision-language training data. A method may include: receiving, from a communication interface, a first training dataset of image-text pairs and a second training dataset of annotated image-text pairs; fine-tuning an image-grounded text decoder and an image-grounded text encoder using the second training dataset of annotated image-text pairs; generating, by the fine-tuned image-grounded text decoder, a predicted text based on a training image from the first training dataset; generating, by the fine-tuned image-grounded text encoder, a filtering decision based on the training image and the predicted text; adding the training image and the predicted text to a third training dataset of image-text pairs depending on the filtering decision; and training a vision-language model using the third training dataset of image-text pairs.
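The abstract describes a caption-and-filter bootstrapping loop: a fine-tuned decoder proposes captions for web images, a fine-tuned encoder decides which image-caption pairs to keep, and the surviving pairs form a new training dataset. The Python sketch below illustrates that flow only; the Captioner, CaptionFilter, and bootstrap_dataset names and their stub methods are illustrative assumptions, not interfaces defined by the patent.

```python
from typing import Iterable, List, Tuple

# A pair of (image identifier, caption text). Real systems would carry
# image tensors or file paths; a string identifier keeps the sketch simple.
ImageTextPair = Tuple[str, str]


class Captioner:
    """Stand-in for the image-grounded text decoder (hypothetical)."""

    def fine_tune(self, annotated: Iterable[ImageTextPair]) -> None:
        """Fine-tune the decoder on human-annotated image-text pairs (stub)."""

    def generate(self, image: str) -> str:
        """Predict a caption for an image (stub)."""
        return f"a generated caption for {image}"


class CaptionFilter:
    """Stand-in for the image-grounded text encoder (hypothetical)."""

    def fine_tune(self, annotated: Iterable[ImageTextPair]) -> None:
        """Fine-tune the encoder on human-annotated image-text pairs (stub)."""

    def matches(self, image: str, text: str) -> bool:
        """Return the filtering decision: does the text describe the image? (stub)"""
        return True


def bootstrap_dataset(
    web_pairs: Iterable[ImageTextPair],
    annotated_pairs: List[ImageTextPair],
    captioner: Captioner,
    caption_filter: CaptionFilter,
) -> List[ImageTextPair]:
    """Build the third training dataset of image-text pairs."""
    # Fine-tune decoder and encoder on the annotated (second) dataset.
    captioner.fine_tune(annotated_pairs)
    caption_filter.fine_tune(annotated_pairs)

    bootstrapped: List[ImageTextPair] = []
    for image, _noisy_web_text in web_pairs:
        predicted = captioner.generate(image)            # predicted text
        if caption_filter.matches(image, predicted):     # filtering decision
            bootstrapped.append((image, predicted))      # add to third dataset
    return bootstrapped
```

The bootstrapped dataset returned here would then be used to train the downstream vision-language model, as stated in the final step of the abstract.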