Publication No.: US20240414394A1
Publication Date: 2024-12-12
Application No.: US18737444
Filing Date: 2024-06-07
Applicant: SRI International
Inventor: Claire Christensen, Anirban Roy, Ajay Divakaran, Todd Grindal
IPC: H04N21/44, H04N21/439
Abstract: A computing system is configured to obtain a video that includes text elements and visual elements. The computing system is further configured to generate a plurality of text tokens representative of audio spoken in the video and a plurality of frame tokens representative of one or more frames of the video. The computing system is further configured to generate a set of features that includes a text feature, a frame feature, and a multi-modal feature, wherein the multi-modal feature is representative of multi-modal elements of the video, and wherein generating the set of features is based on the plurality of text tokens and the plurality of frame tokens. The computing system is further configured to associate the set of features with one or more labels to generate a multi-label classification of the video. The computing system is further configured to output an indication of the multi-label classification of the video.
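The flow described in the abstract (tokens, then a set of three features, then a multi-label output) can be illustrated with a minimal sketch. The abstract does not specify how tokens are produced, how the multi-modal feature is formed, or which classifier is used, so everything below is an assumption made for illustration only: tokens are taken to be pre-computed embedding vectors, `mean_pool`, `build_features`, and `classify` are hypothetical helper names, the multi-modal feature is an element-wise product of the pooled text and frame features, and each label gets an independent sigmoid score with a 0.5 threshold.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def mean_pool(tokens: Sequence[Vector]) -> Vector:
    """Average a sequence of token embeddings into one feature vector."""
    dim = len(tokens[0])
    return [sum(tok[i] for tok in tokens) / len(tokens) for i in range(dim)]


def build_features(text_tokens: Sequence[Vector],
                   frame_tokens: Sequence[Vector]) -> Dict[str, Vector]:
    """Build a text feature, a frame feature, and a multi-modal feature.

    Assumption: the multi-modal feature is an element-wise product of the
    two pooled features. The abstract does not say how it is computed.
    """
    text_feature = mean_pool(text_tokens)
    frame_feature = mean_pool(frame_tokens)
    multimodal_feature = [t * f for t, f in zip(text_feature, frame_feature)]
    return {
        "text": text_feature,
        "frame": frame_feature,
        "multimodal": multimodal_feature,
    }


def classify(features: Dict[str, Vector],
             label_weights: Dict[str, Vector],
             threshold: float = 0.5) -> Dict[str, bool]:
    """Score every label independently, so several labels can apply at once.

    Assumption: a linear layer plus sigmoid per label stands in for
    whatever model associates features with labels in the application.
    """
    joined = features["text"] + features["frame"] + features["multimodal"]
    result = {}
    for label, weights in label_weights.items():
        logit = sum(w * x for w, x in zip(weights, joined))
        score = 1.0 / (1.0 + math.exp(-logit))
        result[label] = score >= threshold
    return result


if __name__ == "__main__":
    # Toy 2-dimensional embeddings; real tokens would come from speech
    # transcription and a frame encoder, neither of which is modeled here.
    text_tokens = [[1.0, 0.0], [0.0, 1.0]]
    frame_tokens = [[2.0, 4.0]]
    feats = build_features(text_tokens, frame_tokens)
    # Hypothetical label names and weights, chosen only for the demo.
    weights = {"label_a": [1.0] * 6, "label_b": [-1.0] * 6}
    print(classify(feats, weights))
```

Independent per-label scores are used here because a multi-label classification lets any number of labels apply to the same video, unlike a softmax that would force a single choice.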