Abstract:
A method and apparatus for constructing voice templates for a speaker-independent voice recognition system includes segmenting a training utterance to generate time-clustered segments, each segment being represented by a mean. The means for all utterances of a given word are quantized to generate template vectors. Each template vector is compared with testing utterances to generate a comparison result. The comparison is typically a dynamic time warping computation. The training utterances are matched with the template vectors if the comparison result exceeds at least one predefined threshold value, to generate an optimal path result, and the training utterances are partitioned in accordance with the optimal path result. The partitioning is typically a K-means segmentation computation. The partitioned utterances may then be re-quantized and re-compared with the testing utterances until the at least one predefined threshold value is no longer exceeded.
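As a rough illustration of the training loop described above, the following sketch iterates DTW alignment and K-means-style repartitioning until the improvement in total distortion no longer exceeds a threshold. The function names (`dtw_path`, `train_word_template`), the uniform initial segmentation, and the convergence test are illustrative assumptions, not the patent's implementation.

```python
import numpy as np

def dtw_path(seq, template):
    """DTW between a frame sequence and a template; returns the total cost
    and, for each frame, the template vector it aligns to (the optimal path)."""
    n, m = len(seq), len(template)
    d = np.full((n + 1, m + 1), np.inf)
    d[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            c = np.linalg.norm(seq[i - 1] - template[j - 1])
            d[i, j] = c + min(d[i - 1, j], d[i, j - 1], d[i - 1, j - 1])
    i, j, assign = n, m, [0] * n
    while i > 0:                      # backtrack the optimal path
        assign[i - 1] = j - 1
        step = int(np.argmin([d[i - 1, j - 1], d[i - 1, j], d[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return d[n, m], assign

def train_word_template(utterances, n_segments=10, threshold=1e-3, max_iters=20):
    # Initial template: uniform time segmentation of each utterance, segment
    # means averaged over all utterances of the word (a crude quantization).
    template = np.mean(
        [[s.mean(axis=0) for s in np.array_split(u, n_segments)]
         for u in utterances], axis=0)
    prev = np.inf
    for _ in range(max_iters):
        total, assigns = 0.0, []
        for u in utterances:
            cost, a = dtw_path(u, template)
            total += cost
            assigns.append(a)
        if prev - total < threshold:  # threshold no longer exceeded: converged
            break
        prev = total
        # K-means-style repartition: pool the frames aligned to each template
        # vector along the optimal path and recompute that vector as a mean.
        for j in range(n_segments):
            frames = [u[i] for u, a in zip(utterances, assigns)
                      for i in range(len(u)) if a[i] == j]
            if frames:
                template[j] = np.mean(frames, axis=0)
    return template
```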
Abstract:
Method and apparatus for selecting a code vector in an algebraic codebook wherein the analysis window for the coder is extended beyond the length of the target speech frame. An input signal is filtered by a perceptual weighting filter (76). The filter is then allowed to ring out for a number of samples equal to the length of the perceptual weighting filter (76) while a zero input vector is applied as input. By extending the analysis window, the two-dimensional impulse response matrix can be stored as a one-dimensional autocorrelation matrix in memory (60, 80), greatly reducing the computational complexity and memory required for the search.
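A small numpy sketch of the key observation: once the weighting filter is allowed to ring out past the frame boundary, the correlation between any two pulse positions depends only on their lag, so the full L x L matrix collapses to an L-point autocorrelation. The subframe length and variable names are assumptions for illustration.

```python
import numpy as np

L = 40                        # assumed subframe (target frame) length
h = np.random.randn(L)        # impulse response of the weighted filter

# Conventional search: correlations Phi[i, j] = sum_n h[n-i] h[n-j], truncated
# at the frame boundary, form a full two-dimensional L x L matrix.
H = np.array([[h[i - j] if i >= j else 0.0 for j in range(L)]
              for i in range(L)])
phi_full = H.T @ H            # O(L^2) storage and lookups

# Extended analysis window: the filter rings out on zero input, the sums run
# past the frame boundary, and Phi[i, j] depends only on the lag |i - j|.
r = np.correlate(h, h, mode="full")[L - 1:]       # 1-D autocorrelation of h
phi_ring = np.array([[r[abs(i - j)] for j in range(L)] for i in range(L)])
# Only the L values of r need to be kept in memory for the codebook search.
```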
Abstract:
Systems and techniques are provided for facial expression recognition. In some examples, a system receives an image frame corresponding to a face of a person. The system also determines, based on a three-dimensional model of the face, landmark feature information associated with landmark features of the face. The system then inputs, to at least one layer of a neural network trained for facial expression recognition, the image frame and the landmark feature information. The system further determines, using the neural network, a facial expression associated with the face.
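A minimal PyTorch sketch of the described arrangement, assuming the landmark feature information is concatenated into a fully connected layer of the classifier; the network shape, the 68-landmark count, and the seven expression classes are illustrative assumptions, not the source's architecture.

```python
import torch
import torch.nn as nn

class ExpressionNet(nn.Module):
    def __init__(self, n_landmark_feats=68 * 3, n_expressions=7):
        super().__init__()
        # Convolutional trunk over the image frame of the face.
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        # The landmark feature information enters at this layer.
        self.head = nn.Sequential(
            nn.Linear(32 + n_landmark_feats, 64), nn.ReLU(),
            nn.Linear(64, n_expressions))

    def forward(self, frame, landmark_feats):
        x = self.trunk(frame)                       # image branch
        x = torch.cat([x, landmark_feats], dim=1)   # inject landmark info
        return self.head(x)                         # expression logits

net = ExpressionNet()
frame = torch.randn(1, 3, 112, 112)    # image frame of the face
landmarks = torch.randn(1, 68 * 3)     # e.g., 68 landmarks from the 3-D model
logits = net(frame, landmarks)
```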
Abstract:
Techniques and systems are provided for determining features for one or more objects in one or more video frames. For example, an image of an object, such as a face, can be received, and features of the object in the image can be identified. A size of the object can be determined based on the image, for example based on the inter-eye distance of a face. Based on the size, either a high-resolution set of features or a low-resolution set of features is selected for comparison with the features of the object. The object can then be identified by matching its features to features from the selected set.
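One plausible reading of the selection step, sketched below with an assumed inter-eye-distance cutoff and a nearest-neighbour match; the threshold value, gallery layout, and function names are not taken from the source.

```python
import numpy as np

EYE_DISTANCE_THRESHOLD = 40.0   # pixels; assumed cutoff between the two sets

def identify(face_features, left_eye, right_eye,
             high_res_gallery, low_res_gallery):
    """Each gallery maps identity -> stored feature vector at that resolution.
    face_features is a numpy vector extracted from the received image."""
    # Object size measured as inter-eye distance.
    size = np.linalg.norm(np.asarray(left_eye) - np.asarray(right_eye))
    # Select the feature set to compare against based on the size.
    gallery = high_res_gallery if size >= EYE_DISTANCE_THRESHOLD else low_res_gallery
    # Identify by nearest neighbour in feature space.
    best_id, best_dist = None, np.inf
    for identity, feats in gallery.items():
        d = np.linalg.norm(face_features - feats)
        if d < best_dist:
            best_id, best_dist = identity, d
    return best_id, best_dist
```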
Abstract:
Techniques and systems are provided for tracking objects in one or more video frames. For example, a first set of one or more bounding regions is determined for a video frame based on a trained classification network applied to the video frame. The first set of one or more bounding regions is associated with one or more objects in the video frame. One or more blobs can be detected for the video frame. A blob includes pixels of at least a portion of an object in the video frame. A second set of one or more bounding regions, associated with the one or more blobs, is determined for the video frame. A final set of one or more bounding regions is determined for the video frame using the first set of one or more bounding regions and the second set of one or more bounding regions. Object tracking can then be performed for the video frame using the final set of one or more bounding regions.
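A sketch of one way the two sets of bounding regions might be fused, using intersection-over-union overlap; the IoU threshold and the detector-first merge policy are illustrative assumptions rather than the patent's rules.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter) if inter else 0.0

def fuse_regions(detector_boxes, blob_boxes, iou_thresh=0.5):
    """Prefer the classification network's boxes (first set); keep any blob
    box (second set) that the detector missed. Returns the final set."""
    final = list(detector_boxes)
    for bb in blob_boxes:
        if all(iou(bb, db) < iou_thresh for db in detector_boxes):
            final.append(bb)
    return final   # final set, used afterwards for object tracking
```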
Abstract:
Techniques and systems are provided for detecting false positive faces in one or more video frames. For example, a video frame of a scene can be obtained. The video frame includes a face of a user associated with at least one characteristic feature. The face of the user is determined to match a representative face from stored representative data, where the representative face is associated with the at least one characteristic feature and the match is determined based on that feature. The face of the user can then be determined to be a false positive face based on the match with the representative face.
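A minimal sketch of the false-positive check, assuming cosine similarity between the detected face's features and the stored representative faces (e.g., faces that are static fixtures of the scene); the metric and threshold below are assumptions.

```python
import numpy as np

MATCH_THRESHOLD = 0.6   # assumed cosine-similarity cutoff for a match

def is_false_positive(face_features, representative_faces):
    """representative_faces: list of feature vectors from the stored
    representative data, each capturing a face's characteristic features."""
    for rep in representative_faces:
        # Cosine similarity between the detected face and the representative.
        sim = np.dot(face_features, rep) / (
            np.linalg.norm(face_features) * np.linalg.norm(rep))
        if sim > MATCH_THRESHOLD:
            return True   # matches a representative face -> false positive
    return False
```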
Abstract:
An example method includes processing a file including fisheye video data, the file including a syntax structure including a plurality of syntax elements that specify attributes of the fisheye video data, wherein the plurality of syntax elements includes: a first syntax element that explicitly indicates whether the fisheye video data is monoscopic or stereoscopic, and one or more syntax elements that implicitly indicate whether the fisheye video data is monoscopic or stereoscopic; determining, based on the first syntax element, whether the fisheye video data is monoscopic or stereoscopic; and rendering, based on the determination, the fisheye video data as monoscopic or stereoscopic.
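A hedged sketch of the decoder-side decision, assuming a container-level structure carrying an explicit mono/stereo element alongside implicit hints such as the lens count; the field names below are illustrative, not the actual syntax element names.

```python
from dataclasses import dataclass

@dataclass
class FisheyeVideoInfo:
    stereo_flag: int          # explicit element: 1 = stereoscopic (assumed coding)
    num_circular_images: int  # implicit hint: e.g., two images per view

def select_rendering_mode(info: FisheyeVideoInfo) -> str:
    """The explicit (first) syntax element drives the determination; the
    implicit elements could serve as a fallback or consistency check."""
    return "stereoscopic" if info.stereo_flag == 1 else "monoscopic"

mode = select_rendering_mode(FisheyeVideoInfo(stereo_flag=1,
                                              num_circular_images=4))
```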
Abstract:
Techniques are described related to generating image content. A graphics processing unit (GPU) is configured to: receive a first set of images generated from a first camera device in a first location, the first camera device having a first orientation; render for display the first set of images oriented to an orientation reference; receive a second, different set of images generated from a second, different camera device in a second, different location, the second camera device having a second orientation different than the first orientation; and render for display the second set of images oriented to the orientation reference.
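An illustrative (CPU-side) sketch of the orientation step, assuming yaw-only rotations: content from each camera is rotated by the difference between that camera's orientation and the shared orientation reference before rendering, so the view stays consistent when switching cameras. The angles and names are assumptions.

```python
import numpy as np

def rotation_to_reference(camera_yaw_deg, reference_yaw_deg):
    """Rotation to apply so content from this camera aligns with the
    orientation reference (yaw only, for brevity)."""
    theta = np.deg2rad(reference_yaw_deg - camera_yaw_deg)
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])   # 2-D rotation matrix

# Camera 1 and camera 2 have different orientations; both image sets are
# mapped onto the same orientation reference before being rendered.
reference_yaw = 0.0
R1 = rotation_to_reference(camera_yaw_deg=15.0, reference_yaw_deg=reference_yaw)
R2 = rotation_to_reference(camera_yaw_deg=80.0, reference_yaw_deg=reference_yaw)
```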
Abstract:
Techniques and systems are provided for processing video data. For example, techniques and systems are provided for determining costs for blob trackers and blobs. A blob can be detected in a video frame. The blob includes pixels of at least a portion of a foreground object. A physical distance between a blob tracker and the blob can be determined. A size ratio between the blob tracker and the blob can also be determined. A cost between the blob tracker and the blob can then be determined using the physical distance and the size ratio. In some cases, a spatial relationship between the blob tracker and the blob is determined, in which case the physical distance can be determined based on the spatial relationship. Blob trackers can be associated with blobs based on the determined costs between the blob trackers and the blobs.
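A sketch of one plausible cost combination; the abstract specifies only that the cost uses both the physical distance and the size ratio, so the product below is an assumption, as are the box layout and function name.

```python
import numpy as np

def tracker_blob_cost(tracker_box, blob_box):
    """Boxes are (x1, y1, x2, y2). Returns a matching cost that grows with
    both the physical distance and the size mismatch."""
    tc = np.array([(tracker_box[0] + tracker_box[2]) / 2.0,
                   (tracker_box[1] + tracker_box[3]) / 2.0])
    bc = np.array([(blob_box[0] + blob_box[2]) / 2.0,
                   (blob_box[1] + blob_box[3]) / 2.0])
    distance = np.linalg.norm(tc - bc)            # physical distance
    t_area = (tracker_box[2] - tracker_box[0]) * (tracker_box[3] - tracker_box[1])
    b_area = (blob_box[2] - blob_box[0]) * (blob_box[3] - blob_box[1])
    size_ratio = max(t_area, b_area) / float(min(t_area, b_area))  # >= 1
    return distance * size_ratio

# Trackers would then be associated with blobs by minimizing these costs,
# e.g., greedily or with an assignment solver.
```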
Abstract:
Techniques and systems are provided for generating a background picture. The background picture can be used for coding one or more pictures. For example, a method of generating a background picture includes generating a long-term background model for one or more pixels of a background picture. The long-term background model includes a statistical model for detecting long-term motion of the one or more pixels in a sequence of pictures. The method further includes generating a short-term background model for the one or more pixels of the background picture. The short-term background model detects short-term motion of the one or more pixels between two or more pictures. The method further includes determining a value for the one or more pixels of the background picture using the long-term background model and the short-term background model.
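A rough per-pixel sketch, assuming a running Gaussian for the long-term statistical model and a sliding window of recent frames for the short-term model; the blending rule used to pick the background value is an illustrative assumption.

```python
import numpy as np

class BackgroundModel:
    def __init__(self, first_frame, lt_rate=0.01, st_window=5):
        f = first_frame.astype(np.float64)
        self.lt_mean, self.lt_var = f.copy(), np.ones_like(f)  # long-term stats
        self.recent = [f] * st_window                          # short-term buffer
        self.lt_rate = lt_rate

    def update(self, frame):
        f = frame.astype(np.float64)
        # Long-term statistical model: slow running mean/variance per pixel,
        # capturing motion over the whole sequence of pictures.
        d = f - self.lt_mean
        self.lt_mean += self.lt_rate * d
        self.lt_var += self.lt_rate * (d * d - self.lt_var)
        # Short-term model: sliding window over the most recent pictures.
        self.recent = self.recent[1:] + [f]

    def background(self):
        st_mean = np.mean(self.recent, axis=0)
        # Where short-term motion is low, trust the recent frames; otherwise
        # fall back to the long-term model for the background pixel value.
        st_motion = np.abs(self.recent[-1] - self.recent[0])
        return np.where(st_motion < 3.0 * np.sqrt(self.lt_var),
                        st_mean, self.lt_mean)
```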