Abstract:
Methods and apparatuses of the present invention generally relate to generating actionable data based on multimodal data from unsynchronized data sources. In an exemplary embodiment, the method comprises receiving multimodal data from one or more unsynchronized data sources; extracting concepts from the multimodal data, the concepts comprising at least one of objects, actions, scenes, and emotions; indexing the concepts for searchability; and generating actionable data based on the concepts.
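A minimal sketch of the pipeline this abstract describes, assuming Python; the Concept fields, the ConceptIndex class, and the notion of "actionable data" as the set of sources observing a queried concept are illustrative assumptions, not details from the patent:

```python
from dataclasses import dataclass
from collections import defaultdict

@dataclass
class Concept:
    label: str   # e.g. "bicycle", "running", "beach", "joy"
    kind: str    # "object", "action", "scene", or "emotion"
    source: str  # which unsynchronized source produced it
    time: float  # source-local timestamp; sources need not share a clock

class ConceptIndex:
    """Inverted index from concept labels to the concepts carrying them."""
    def __init__(self):
        self._by_label = defaultdict(list)

    def add(self, concept):
        self._by_label[concept.label].append(concept)

    def search(self, label):
        return self._by_label.get(label, [])

def generate_actionable_data(index, query_label):
    # Sketched here as the set of sources that observed the queried
    # concept, which a downstream system could act on.
    return {c.source for c in index.search(query_label)}

# Usage with mock extractions standing in for per-modality detectors:
index = ConceptIndex()
for c in [Concept("bicycle", "object", "camera_1", 3.2),
          Concept("cheering", "action", "microphone_2", 1.7)]:
    index.add(c)
print(generate_actionable_data(index, "bicycle"))  # {'camera_1'}
```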
Abstract:
A unified framework detects and classifies people interactions in unconstrained user-generated images. Previous approaches directly map people/face locations in two-dimensional image space into features for classification. Among other things, the disclosed framework estimates a camera viewpoint and people positions in three-dimensional space and then extracts spatial configuration features from the explicit three-dimensional people positions.
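A sketch of what "spatial configuration features from explicit 3D positions" could look like in Python, assuming people positions have already been lifted to 3D coordinates; the particular features (pairwise distance and ground-plane bearing) are illustrative choices, not the patent's:

```python
import numpy as np

def spatial_configuration_features(positions_3d):
    """Pairwise features from estimated 3D people positions (N x 3 array),
    e.g. ground-plane coordinates recovered via the estimated camera
    viewpoint. The feature choice here is illustrative."""
    positions_3d = np.asarray(positions_3d, dtype=float)
    feats = []
    for i in range(len(positions_3d)):
        for j in range(i + 1, len(positions_3d)):
            d = positions_3d[j] - positions_3d[i]
            feats.append([np.linalg.norm(d),        # interpersonal distance
                          np.arctan2(d[1], d[0])])  # bearing on the ground plane
    return np.asarray(feats)

# Two people ~0.8 m apart, one person farther away:
print(spatial_configuration_features([[0, 0, 0], [0.8, 0, 0], [4, 3, 0]]))
```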
Abstract:
A complex video event classification, search and retrieval system can generate a semantic representation of a video or of segments within the video, based on one or more complex events that are depicted in the video, without the need for manual tagging. The system can use the semantic representations to, among other things, provide enhanced video search and retrieval capabilities.
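One plausible reading of "semantic representation" is a fixed-length vector of concept-detector responses that supports tag-free retrieval. A minimal sketch under that assumption, with max-pooling and cosine ranking as illustrative choices:

```python
import numpy as np

def semantic_representation(frame_scores):
    """Max-pool per-frame concept-detector scores (frames x concepts)
    into one fixed-length semantic vector for a video or segment."""
    return np.max(np.asarray(frame_scores, dtype=float), axis=0)

def rank_videos(query_vec, video_vecs):
    """Rank videos by cosine similarity to a query expressed in the
    same concept space, enabling search without manual tags."""
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    sims = [np.dot(v, q) / np.linalg.norm(v)
            for v in (np.asarray(v, dtype=float) for v in video_vecs)]
    return np.argsort(sims)[::-1]
```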
Abstract:
Zero-shot content detection includes building/training a semantic space by embedding word-based document descriptions of a plurality of documents into a multi-dimensional space using a semantic embedding technique; detecting a plurality of features in the multimodal content by applying feature detection algorithms to the multimodal content; determining respective word-based concept descriptions for concepts identified in the multimodal content using the detected features; embedding the respective word-based concept descriptions into the semantic space; and in response to a content detection action, (i) embedding/mapping words representative of the content detection action into the semantic space, (ii) automatically determining, without the use of training examples, concepts in the semantic space relevant to the content detection action based on the embedded words, and (iii) identifying portions of the multimodal content responsive to the content detection action based on the concepts in the semantic space determined to be relevant to the content detection action.
Abstract:
A computer-implemented method for determining the vehicle type of a vehicle detected in an image is disclosed. An image having a detected vehicle is received. A number of vehicle models having salient feature points are projected onto the detected vehicle. A first set of features derived from each of the salient feature locations of the vehicle models is compared to a second set of features derived from corresponding salient feature locations of the detected vehicle to form a set of positive match scores (p-scores) and a set of negative match scores (n-scores). The detected vehicle is classified as one of the vehicle models based at least in part on the set of p-scores and the set of n-scores.
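A sketch of the classification step in Python. The abstract does not define how p-scores and n-scores are formed or combined, so the thresholded-similarity split and the additive combination rule below are assumptions:

```python
import numpy as np

def classify_vehicle(detected_feats, model_feats_by_type, threshold=0.5):
    """Score each candidate vehicle model against the detected vehicle.

    detected_feats: {salient_point_id: feature vector} from the image.
    model_feats_by_type: {model_name: {salient_point_id: feature vector}}.
    Similarities above `threshold` are treated as p-scores, the rest
    as n-scores; this reading of the abstract is illustrative.
    """
    best_model, best_score = None, float("-inf")
    for model, feats in model_feats_by_type.items():
        p_scores, n_scores = [], []
        for pid, f_model in feats.items():
            f_det = detected_feats.get(pid)
            if f_det is None:
                continue  # salient point not visible on the detected vehicle
            sim = float(np.dot(f_det, f_model) /
                        (np.linalg.norm(f_det) * np.linalg.norm(f_model)))
            (p_scores if sim >= threshold else n_scores).append(sim)
        score = sum(p_scores) - sum(1.0 - s for s in n_scores)
        if score > best_score:
            best_model, best_score = model, score
    return best_model
```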
Abstract:
An entity interaction recognition system algorithmically recognizes a variety of different types of entity interactions that may be captured in two-dimensional images. In some embodiments, the system estimates the three-dimensional spatial configuration or arrangement of entities depicted in the image. In some embodiments, the system applies a proxemics-based analysis to determine an interaction type. In some embodiments, the system infers, from a characteristic of an entity detected in an image, an area or entity of interest in the image.
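For the proxemics-based analysis, a minimal sketch using the standard distance zones from Hall's proxemics theory; the patent's own analysis may use different cues and boundaries:

```python
def proxemic_zone(distance_m):
    """Map an interpersonal distance (meters) to Hall's proxemic zones,
    using the conventional thresholds from proxemics theory."""
    if distance_m < 0.45:
        return "intimate"
    if distance_m < 1.2:
        return "personal"
    if distance_m < 3.6:
        return "social"
    return "public"

# e.g. two people estimated 0.8 m apart in the reconstructed 3D space:
print(proxemic_zone(0.8))  # personal
```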
Abstract:
A computing system for recognizing salient events depicted in a video utilizes learning algorithms to detect audio and visual features of the video. The computing system identifies one or more salient events depicted in the video based on the audio and visual features.
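The abstract does not say how audio and visual features are combined, so the following is a sketch of one simple possibility, late fusion of per-segment scores, with illustrative weights and threshold:

```python
import numpy as np

def salient_segments(audio_scores, visual_scores, w_audio=0.5, threshold=0.7):
    """Fuse per-segment audio and visual saliency scores; segments whose
    fused score passes the threshold are flagged as salient events."""
    a = np.asarray(audio_scores, dtype=float)
    v = np.asarray(visual_scores, dtype=float)
    fused = w_audio * a + (1.0 - w_audio) * v
    return np.where(fused >= threshold)[0]

print(salient_segments([0.2, 0.9, 0.4], [0.3, 0.8, 0.9]))  # [1]
```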