Abstract:
A video device for predicting driving situations while a person drives a car is presented. The video device includes multi-modal sensors and knowledge data for extracting feature maps, a deep neural network trained with training data to recognize real-time traffic scenes (TSs) from a viewpoint of the car, and a user interface (UI) for displaying the real-time TSs. The real-time TSs are compared to predetermined TSs to predict the driving situations. The video device can be a video camera. The video camera can be mounted to a windshield of the car or, alternatively, incorporated into the dashboard or console area of the car. The video camera can calculate speed, velocity, type, and/or position information related to other cars within the real-time TS. The video camera can also include warning indicators, such as light-emitting diodes (LEDs) that emit different colors for the different driving situations.
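A minimal sketch of how the comparison between a real-time TS and the predetermined TSs might work, assuming a hypothetical `SceneNet` embedding network and cosine similarity as the comparison metric (neither is specified by the abstract):

```python
# Minimal sketch (PyTorch). `SceneNet`, the embedding dimension, and the
# set of predetermined TS embeddings are all illustrative assumptions.
import torch
import torch.nn.functional as F

class SceneNet(torch.nn.Module):
    """Toy stand-in for the trained deep neural network."""
    def __init__(self, feat_dim=128):
        super().__init__()
        self.backbone = torch.nn.Sequential(
            torch.nn.Conv2d(3, 16, 3, stride=2, padding=1), torch.nn.ReLU(),
            torch.nn.AdaptiveAvgPool2d(1), torch.nn.Flatten(),
            torch.nn.Linear(16, feat_dim),
        )

    def forward(self, frame):
        return self.backbone(frame)  # embedding of the real-time TS

# Predetermined TS embeddings, one per known driving situation (assumed),
# e.g. merge, cut-in, hard stop, clear road.
predetermined = torch.randn(4, 128)

net = SceneNet()
frame = torch.randn(1, 3, 224, 224)           # one camera frame
realtime_ts = net(frame)                       # embed the live scene
sims = F.cosine_similarity(realtime_ts, predetermined)  # compare to known TSs
predicted_situation = int(sims.argmax())       # closest predetermined TS
```

The predicted index could then drive the UI and warning indicators, e.g. selecting an LED color per situation.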
Abstract:
A computer-implemented method for training a deep neural network to recognize traffic scenes (TSs) from multi-modal sensors and knowledge data is presented. The computer-implemented method includes receiving data from the multi-modal sensors and the knowledge data and extracting feature maps from the multi-modal sensors and the knowledge data by using a traffic participant (TP) extractor to generate a first set of data, using a static objects extractor to generate a second set of data, and using an additional information extractor to generate a third set of data. The computer-implemented method further includes training the deep neural network, with training data, to recognize the TSs from a viewpoint of a vehicle.
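A minimal training sketch under assumed shapes: the three extractors are modeled as simple linear layers whose outputs are concatenated and fed to a classification head. All module names and dimensions are hypothetical; the abstract specifies only the three-extractor structure and the training step.

```python
# Minimal sketch (PyTorch), assuming hypothetical extractor/head modules.
import torch
import torch.nn as nn

class TSRecognizer(nn.Module):
    def __init__(self, sensor_dim, knowledge_dim, num_scenes):
        super().__init__()
        self.tp_extractor = nn.Linear(sensor_dim, 64)       # traffic participants
        self.static_extractor = nn.Linear(sensor_dim, 64)   # static objects
        self.extra_extractor = nn.Linear(knowledge_dim, 64) # additional info
        self.head = nn.Sequential(nn.ReLU(), nn.Linear(192, num_scenes))

    def forward(self, sensors, knowledge):
        feats = torch.cat([
            self.tp_extractor(sensors),
            self.static_extractor(sensors),
            self.extra_extractor(knowledge),
        ], dim=-1)
        return self.head(feats)  # logits over traffic scenes

model = TSRecognizer(sensor_dim=32, knowledge_dim=16, num_scenes=10)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

sensors = torch.randn(8, 32)     # batch of multi-modal sensor features
knowledge = torch.randn(8, 16)   # batch of knowledge-data features
labels = torch.randint(0, 10, (8,))
loss = loss_fn(model(sensors, knowledge), labels)
opt.zero_grad(); loss.backward(); opt.step()
```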
Abstract:
Systems and methods for improving video understanding tasks based on higher-order object interactions (HOIs) between object features are provided. A plurality of frames of a video are obtained. A coarse-grained feature representation is generated by generating an image feature for each of a plurality of timesteps respectively corresponding to each of the frames and performing attention based on the image features. A fine-grained feature representation is generated by generating an object feature for each of the plurality of timesteps and generating the HOIs between the object features. The coarse-grained and the fine-grained feature representations are concatenated to generate a concatenated feature representation.
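A minimal sketch of the two-branch representation, with assumed shapes and a simple pairwise interaction standing in for the HOIs (the abstract does not specify the interaction function or attention mechanism):

```python
# Minimal sketch (PyTorch). Dimensions, the attention scheme, and the
# pairwise interaction module are illustrative assumptions.
import torch
import torch.nn as nn

T, N, D = 8, 5, 64                 # timesteps, objects per frame, feature dim
img_feats = torch.randn(T, D)      # one image feature per timestep
obj_feats = torch.randn(T, N, D)   # object features per timestep

# Coarse-grained branch: attention over per-timestep image features.
attn_w = torch.softmax(img_feats @ img_feats.mean(0), dim=0)  # (T,)
coarse = (attn_w.unsqueeze(-1) * img_feats).sum(0)            # (D,)

# Fine-grained branch: pairwise object interactions pooled over
# objects and time (a stand-in for the higher-order interactions).
inter = nn.Linear(2 * D, D)
pairs = torch.cat([
    obj_feats.unsqueeze(2).expand(T, N, N, D),   # object i
    obj_feats.unsqueeze(1).expand(T, N, N, D),   # object j
], dim=-1)
fine = inter(pairs).relu().mean(dim=(0, 1, 2))   # pool interactions -> (D,)

# Concatenate the two views for the downstream video-understanding task.
concatenated = torch.cat([coarse, fine], dim=-1)  # (2*D,)
```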
Abstract:
A computer-implemented method executed by a processor for training a neural network to recognize driving scenes from sensor data received from vehicle radar is presented. The computer-implemented method includes extracting substructures from the sensor data received from the vehicle radar to define a graph having a plurality of nodes and a plurality of edges, constructing a neural network for each extracted substructure, combining the outputs of each of the constructed neural networks for each of the plurality of edges into a single vector describing a driving scene of a vehicle, and classifying the single vector into a set of one or more dangerous situations involving the vehicle.
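A minimal sketch of the per-substructure networks and the edge-wise combination, assuming summation as the combination operator and toy substructure assignments (neither is fixed by the abstract):

```python
# Minimal sketch (PyTorch). The number of substructure types, the edge
# features, and the sum-combination are illustrative assumptions.
import torch
import torch.nn as nn

feat_dim, hid, num_danger_classes = 16, 32, 3

# One small network per extracted substructure type (assumed: 2 types).
subnets = nn.ModuleList([nn.Sequential(nn.Linear(feat_dim, hid), nn.ReLU())
                         for _ in range(2)])
classifier = nn.Linear(hid, num_danger_classes)

# Each edge of the radar-derived graph carries a feature vector and the
# index of the substructure it was extracted from (toy data).
edges = [(torch.randn(feat_dim), 0), (torch.randn(feat_dim), 1),
         (torch.randn(feat_dim), 0)]

# Combine per-edge outputs into a single vector describing the scene,
# then classify it into dangerous situations involving the vehicle.
scene_vec = torch.stack([subnets[s](x) for x, s in edges]).sum(0)
danger_logits = classifier(scene_vec)
```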
Abstract:
Systems and methods for matching job descriptions with job applicants are provided. The method includes allocating each of one or more job applicants' curricula vitae (CVs) into sections 320; applying max-pooled word embedding 330 to each section of the job applicants' CVs; using concatenated max-pooling and average-pooling 340 to compose the section embeddings into an applicant CV representation; allocating each of one or more job position descriptions into specified sections 220; applying max-pooled word embedding 230 to each section of the job position descriptions; using concatenated max-pooling and average-pooling 240 to compose the section embeddings into a job representation; calculating a cosine similarity 250, 350 between each of the job representations and each of the CV representations to perform job-to-applicant matching; and presenting an ordered list of the one or more job applicants 360 or an ordered list of the one or more job position descriptions 260 to a user.
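A minimal sketch of the pooling pipeline, assuming a hypothetical `embed(word)` lookup in place of whatever pretrained embeddings the method uses:

```python
# Minimal sketch (numpy). The toy `embed` lookup and example documents
# are assumptions; the pooling/similarity pipeline follows the abstract.
import numpy as np

rng = np.random.default_rng(0)
vocab = {}
def embed(word, dim=50):
    """Toy word-embedding lookup (stands in for pretrained vectors)."""
    if word not in vocab:
        vocab[word] = rng.standard_normal(dim)
    return vocab[word]

def section_embedding(words):
    return np.max([embed(w) for w in words], axis=0)   # max-pooled words

def doc_representation(sections):
    secs = np.stack([section_embedding(s) for s in sections])
    # Concatenated max-pooling and average-pooling over section embeddings.
    return np.concatenate([secs.max(axis=0), secs.mean(axis=0)])

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

job = doc_representation([["python", "developer"], ["remote", "senior"]])
cvs = {"alice": doc_representation([["python", "engineer"], ["senior"]]),
       "bob": doc_representation([["chef"], ["kitchen"]])}

# Rank applicants for the job by cosine similarity (descending).
ranked = sorted(cvs, key=lambda name: cosine(job, cvs[name]), reverse=True)
```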
Abstract:
A computer-implemented method for temporally localizing a video clip matching a natural language query is presented. The method includes temporally localizing a candidate clip (114) in a video stream (105) based on a natural language query (112), encoding a state, via a state processing module (120), into a joint visual and linguistic representation, feeding the joint visual and linguistic representation into a policy learning module (150), wherein the policy learning module employs a deep learning network to selectively extract features from select frames for video-text analysis and includes a fully connected linear layer (152) and a long short-term memory (LSTM) (154), outputting a value function (156) from the LSTM, generating an action policy based on the encoded state, wherein the action policy is a probabilistic distribution over a plurality of possible actions given the encoded state, and rewarding policy actions that return clips matching the natural language query.
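A minimal sketch of the policy learning module: a fully connected layer feeding an LSTM that emits both an action distribution and a value estimate. The action set, dimensions, and training signal shown here are assumptions, not details from the abstract.

```python
# Minimal sketch (PyTorch). `PolicyModule`, its dimensions, and the
# example actions are hypothetical; the FC -> LSTM -> policy/value
# structure follows the abstract.
import torch
import torch.nn as nn

class PolicyModule(nn.Module):
    def __init__(self, state_dim=256, hid=128, num_actions=5):
        super().__init__()
        self.fc = nn.Linear(state_dim, hid)        # fully connected layer
        self.lstm = nn.LSTM(hid, hid, batch_first=True)
        self.policy_head = nn.Linear(hid, num_actions)
        self.value_head = nn.Linear(hid, 1)        # value function

    def forward(self, state, hc=None):
        h, hc = self.lstm(self.fc(state).unsqueeze(1), hc)
        h = h.squeeze(1)
        action_dist = torch.distributions.Categorical(
            logits=self.policy_head(h))            # distribution over actions
        return action_dist, self.value_head(h), hc

policy = PolicyModule()
state = torch.randn(1, 256)    # joint visual-linguistic state encoding
dist, value, hc = policy(state)
action = dist.sample()         # e.g. shift/expand/shrink the clip, or stop
# Training would reward actions whose resulting clip matches the query,
# e.g. loss = -dist.log_prob(action) * (reward - value.detach()).
```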