DEEP REINFORCEMENT LEARNING FRAMEWORK FOR CHARACTERIZING VIDEO CONTENT

    Publication No.: US20190163977A1

    Publication date: 2019-05-30

    Application No.: US16171018

    Filing date: 2018-10-25

    Abstract: Methods and systems for performing sequence-level prediction of a video scene are described. Video information in a video scene is represented as a sequence of features depicted in each frame. An environment state for each time step t, corresponding to each frame, is represented by the video information for time step t and the predicted affective information from the previous time step t−1. An action A(t) is taken with an agent controlled by a machine learning algorithm for the frame at time step t, wherein an output of the action A(t) represents an affective label prediction for the frame at time step t. A pool of predicted actions is transformed into a predicted affective history at the next time step t+1. The predicted affective history is included as part of the environment state for the next time step t+1. A reward R is generated on the predicted actions up to the current time step t by comparing them against the corresponding annotated movie scene affective labels.
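
The loop described in this abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the binary labels, the mean pooling in `pool_history`, the majority-vote `reward`, and the `policy` callable are all hypothetical choices made only to show how the state at step t combines the frame features with the affective history predicted up to step t−1.

```python
from typing import Callable, List, Sequence, Tuple

# State = (features of the current frame, predicted affective history so far)
State = Tuple[Sequence[float], float]


def pool_history(actions: List[int]) -> float:
    """Hypothetical pooling: mean of the binary labels predicted so far."""
    return sum(actions) / len(actions) if actions else 0.0


def reward(actions: List[int], annotated_label: int) -> float:
    """Hypothetical reward: +1 if the majority vote of the predictions made
    so far matches the annotated scene label, otherwise -1 (ties vote 1)."""
    vote = 1 if 2 * sum(actions) >= len(actions) else 0
    return 1.0 if vote == annotated_label else -1.0


def run_episode(
    frames: Sequence[Sequence[float]],
    policy: Callable[[State], int],
    annotated_label: int,
) -> Tuple[List[int], List[float]]:
    """Step through the frames, feeding the predicted history back into the state."""
    actions: List[int] = []
    rewards: List[float] = []
    history = 0.0  # no predictions exist before the first frame
    for features in frames:
        state = (features, history)          # environment state at step t
        action = policy(state)               # A(t): label prediction for this frame
        actions.append(action)
        history = pool_history(actions)      # becomes part of the state at t+1
        rewards.append(reward(actions, annotated_label))  # R on actions up to t
    return actions, rewards


def toy_policy(state: State) -> int:
    """Hypothetical stand-in for the learned agent: thresholds feature + history."""
    features, history = state
    return 1 if features[0] + history > 0.5 else 0


if __name__ == "__main__":
    print(run_episode([[0.9], [0.2], [0.1]], toy_policy, annotated_label=1))
```

In this toy run the later frames have weak features, yet the fed-back history keeps the prediction stable, which is the role the abstract assigns to the predicted affective history.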

    Deep reinforcement learning framework for characterizing video content

    Publication No.: US11386657B2

    Publication date: 2022-07-12

    Application No.: US17141028

    Filing date: 2021-01-04

    Abstract: Methods and systems for performing sequence-level prediction of a video scene are described. Video information in a video scene is represented as a sequence of features depicted in each frame. One or more scene affective labels are provided at the end of the sequence. Each label pertains to the entire sequence of frames of data. An action is taken with an agent controlled by a machine learning algorithm for a current frame of the sequence at a current time step. An output of the action represents an affective label prediction for the frame at the current time step. A pool of actions taken up until the current time step, including the action taken with the agent, is transformed into a predicted affective history for a subsequent time step. A reward is generated on the predicted actions up to the current time step by comparing the predicted actions against the corresponding annotated scene affective labels.

    SYSTEM AND METHOD FOR CONVERTING IMAGE DATA INTO A NATURAL LANGUAGE DESCRIPTION

    Publication No.: US20200175053A1

    Publication date: 2020-06-04

    Application No.: US16206439

    Filing date: 2018-11-30

    Abstract: For image captioning, such as for computer game images or other images, bottom-up attention is combined with top-down attention to provide a multi-level residual attention-based image captioning model. A residual attention mechanism is first applied in the Faster R-CNN network to learn better feature representations for each region by taking spatial information into consideration. In the image captioning network, which takes the extracted regional features as input, a second residual attention network is implemented to fuse the regional features through attention for subsequent caption generation.
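
The fusion step in this abstract can be illustrated with a small pure-Python sketch. The abstract does not specify the exact form of the residual connection, so this example assumes dot-product attention against a hypothetical `query` vector and adds the mean of the unweighted region features as the shortcut term. The names `softmax` and `residual_attention` are illustrative and do not come from the patent.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def residual_attention(regions: List[List[float]], query: List[float]) -> List[float]:
    """Fuse region features with attention weights, then add a residual term.

    Assumption: the residual shortcut is the plain mean of the region
    features, so the fused vector keeps unweighted information as well.
    """
    dim = len(query)
    # Dot-product relevance of each region to the query (e.g. a decoder state).
    scores = [sum(r[d] * query[d] for d in range(dim)) for r in regions]
    weights = softmax(scores)
    # Attention-weighted sum of the region features.
    attended = [sum(w * r[d] for w, r in zip(weights, regions)) for d in range(dim)]
    # Residual shortcut: mean-pooled region features.
    shortcut = [sum(r[d] for r in regions) / len(regions) for d in range(dim)]
    return [a + s for a, s in zip(attended, shortcut)]


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    print(residual_attention(regions, query=[2.0, 0.0]))
```

The shortcut means a region that receives little attention still contributes to the fused feature, which is the usual motivation for a residual connection.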

    INITIALIZATION OF CTC SPEECH RECOGNITION WITH STANDARD HMM

    Publication No.: US20190013015A1

    Publication date: 2019-01-10

    Application No.: US15645985

    Filing date: 2017-07-10

    Abstract: A method for improved initialization of a speech recognition system comprises mapping a trained hidden Markov model (HMM) based recognition node network to a Connectionist Temporal Classification (CTC) based node label scheme. The central states of each frame in the HMM are mapped to CTC-labeled output nodes, and the non-central states of each frame are mapped to CTC-blank nodes, to generate a CTC-labeled HMM. Each central state represents a phoneme from human speech detected and extracted by a computing device. Next, the CTC-labeled HMM is trained using a cost function that is not part of a CTC cost function. Finally, the CTC-labeled HMM is trained using a CTC cost function to produce a CTC node network. The CTC node network may be iteratively trained by repeating the initialization steps.
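
The state-to-label mapping in this abstract can be sketched in a few lines of Python. The sketch assumes a three-state left-to-right HMM per phoneme, with state index 1 as the central state, and a frame-level alignment given as `(phoneme, state_index)` pairs. Those conventions, the `BLANK` token, and the function names are hypothetical and are not taken from the patent.

```python
from typing import List, Tuple

BLANK = "<blank>"  # hypothetical token for the CTC blank label


def hmm_to_ctc_labels(
    alignment: List[Tuple[str, int]], central_state: int = 1
) -> List[str]:
    """Map each frame's HMM state to a CTC label.

    Central states keep their phoneme label; every other state becomes blank.
    """
    return [
        phoneme if state == central_state else BLANK
        for phoneme, state in alignment
    ]


def ctc_collapse(labels: List[str]) -> List[str]:
    """Standard CTC decoding rule: merge repeated labels, then drop blanks."""
    collapsed: List[str] = []
    previous = None
    for label in labels:
        if label != previous and label != BLANK:
            collapsed.append(label)
        previous = label
    return collapsed


if __name__ == "__main__":
    # Frame-level alignment for the phonemes k, ae, t (assumed example data).
    alignment = [
        ("k", 0), ("k", 1), ("k", 1), ("k", 2),
        ("ae", 0), ("ae", 1), ("ae", 2),
        ("t", 0), ("t", 1), ("t", 2),
    ]
    frame_labels = hmm_to_ctc_labels(alignment)
    print(frame_labels)
    print(ctc_collapse(frame_labels))
```

Collapsing the frame labels recovers the phoneme sequence, which shows why such frame-level targets are compatible with later training under a CTC cost function.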
