Representation learning from video with spatial audio

Invention Grant

US11308329B2 Representation learning from video with spatial audio (Active)

Patent Title: Representation learning from video with spatial audio
Application No.: US16868805

Application Date: 2020-05-07
Publication No.: US11308329B2

Publication Date: 2022-04-19
Inventor: Justin Salamon, Bryan Russell, Karren Yang
Applicant: Adobe Inc.
Applicant Address: US CA San Jose
Assignee: Adobe Inc.
Current Assignee: Adobe Inc.
Current Assignee Address: US CA San Jose
Agency: Kilpatrick Townsend & Stockton LLP
Main IPC: G06K9/00
IPC: G06K9/00; H04S7/00; G06K9/62

Abstract:

A computer system is trained to understand audio-visual spatial correspondence using audio-visual clips having multi-channel audio. The computer system includes an audio subnetwork, video subnetwork, and pretext subnetwork. The audio subnetwork receives the two channels of audio from the audio-visual clips, and the video subnetwork receives the video frames from the audio-visual clips. In a subset of the audio-visual clips the audio-visual spatial relationship is misaligned, causing the audio-visual spatial cues for the audio and video to be incorrect. The audio subnetwork outputs an audio feature vector for each audio-visual clip, and the video subnetwork outputs a video feature vector for each audio-visual clip. The audio and video feature vectors for each audio-visual clip are merged and provided to the pretext subnetwork, which is configured to classify the merged vector as either having a misaligned audio-visual spatial relationship or not. The subnetworks are trained based on the loss calculated from the classification.
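The training setup described in the abstract can be illustrated with a minimal, pure-Python sketch. It is not the patented implementation. Everything below is an assumption made for illustration: all function names are hypothetical, misalignment is produced by swapping the left and right audio channels, "merging" is taken to mean concatenation, the three subnetworks are replaced by trivial stand-ins (channel energy, half-frame brightness, logistic regression), and the loss is binary cross-entropy. The abstract itself does not specify any of these choices.

```python
import math
import random


def swap_channels(audio):
    """Misalign a clip by swapping its (left, right) audio channels (assumed method)."""
    left, right = audio
    return (right, left)


def make_training_examples(clips, misalign_prob=0.5, seed=0):
    """Return (audio, frames, label) triples; label 1 means misaligned."""
    rng = random.Random(seed)
    examples = []
    for audio, frames in clips:
        if rng.random() < misalign_prob:
            examples.append((swap_channels(audio), frames, 1))
        else:
            examples.append((audio, frames, 0))
    return examples


def audio_subnetwork(audio):
    """Stand-in audio encoder: mean energy of each channel."""
    return [sum(x * x for x in channel) / len(channel) for channel in audio]


def video_subnetwork(frames):
    """Stand-in video encoder: mean brightness of the left and right frame halves."""
    left, right = [], []
    for frame in frames:
        for row in frame:
            half = len(row) // 2
            left.extend(row[:half])
            right.extend(row[half:])
    return [sum(left) / len(left), sum(right) / len(right)]


def pretext_subnetwork(merged, weights, bias):
    """Stand-in classifier: probability that the merged vector is misaligned."""
    z = sum(w * x for w, x in zip(weights, merged)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def binary_cross_entropy(prob, label):
    """Classification loss for one example (assumed loss function)."""
    eps = 1e-12
    return -(label * math.log(prob + eps) + (1 - label) * math.log(1 - prob + eps))


def forward(example, weights, bias):
    """Encode both modalities, merge by concatenation, classify, compute the loss."""
    audio, frames, label = example
    merged = audio_subnetwork(audio) + video_subnetwork(frames)
    prob = pretext_subnetwork(merged, weights, bias)
    return merged, prob, binary_cross_entropy(prob, label)
```

In a real system the stand-ins would be trainable networks, and the gradient of this loss would update the audio, video, and pretext subnetworks together, which is what lets the encoders learn spatial cues without manual labels.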

Public/Granted literature

US20210350135A1 REPRESENTATION LEARNING FROM VIDEO WITH SPATIAL AUDIO Public/Granted day: 2021-11-11

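IPC classification:

G	Physics
G06	Computing; calculating or counting
G06K	Graphical data reading (image or video recognition or understanding G06V); presentation of data; record carriers; handling record carriers
G06K9/00	Methods or arrangements for recognising patterns (methods or arrangements for graph-reading or for converting the pattern of mechanical parameters, e.g. force or presence, into electrical signals G06K11/00) (image or video recognition or understanding G06V) (speech recognition G10L15/00)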