-
1.
Publication No.: WO2023063880A2
Publication Date: 2023-04-20
Application No.: PCT/SG2022/050704
Application Date: 2022-09-29
Applicant: LEMON INC.
Inventor: LU, Wei Tsung; WANG, Ju-Chiang; WON, Minz; CHOI, Keunwoo; SONG, Xuchen
Abstract: Devices, systems, and methods are disclosed herein for causing an apparatus to generate music information from audio data using a transformer-based neural network model. The model includes a multilevel transformer for audio analysis that comprises a spectral transformer and a temporal transformer. A processor generates a time-frequency representation of obtained audio data to be applied as input to the transformer-based neural network model; determines spectral embeddings and first temporal embeddings of the audio data based on the time-frequency representation; determines each vector of a second frequency class token (FCT) by passing each vector of a first FCT in the spectral embeddings through the spectral transformer; determines second temporal embeddings by adding a linear projection of the second FCT to the first temporal embeddings; determines third temporal embeddings by passing the second temporal embeddings through the temporal transformer; and generates music information based on the third temporal embeddings.
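The data flow in this abstract can be illustrated with a minimal sketch. It is not the applicant's implementation: the parameter-free `self_attention` function stands in for both the spectral and the temporal transformer, the weights are random rather than trained, and every name, dimension, and the sigmoid output head are assumptions made for illustration only.

```python
import math
import random


def self_attention(tokens):
    """Parameter-free single-head self-attention with a residual connection.
    Assumption: a stand-in for a full transformer, used only to show data flow."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        mixed = [sum(w * v[j] for w, v in zip(weights, tokens)) / total
                 for j in range(d)]
        out.append([x + m for x, m in zip(q, mixed)])
    return out


def linear(vec, weight):
    """Multiply a weight matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weight]


def rand_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]


def analyze(spectrogram, d_spec=4, d_temp=6, seed=0):
    """spectrogram: list of time frames, each a list of frequency-bin magnitudes.
    Returns one score in (0, 1) per frame as a toy form of 'music information'."""
    rng = random.Random(seed)
    n_bins = len(spectrogram[0])
    # Hypothetical untrained parameters (random here, learned in practice).
    bin_embed = rand_matrix(n_bins, d_spec, rng)      # one vector per frequency bin
    first_fct = [rng.gauss(0.0, 0.1) for _ in range(d_spec)]
    frame_to_temp = rand_matrix(d_temp, n_bins, rng)  # frame -> first temporal embedding
    fct_to_temp = rand_matrix(d_temp, d_spec, rng)    # linear projection of the FCT
    head = [rng.gauss(0.0, 0.1) for _ in range(d_temp)]

    second_temporal = []
    for frame in spectrogram:
        # Spectral embeddings for this frame: the first FCT, then one token per bin.
        spectral = [first_fct] + [[mag * e for e in bin_embed[k]]
                                  for k, mag in enumerate(frame)]
        # The spectral "transformer" turns the first FCT into the second FCT.
        second_fct = self_attention(spectral)[0]
        first_temporal = linear(frame, frame_to_temp)
        projected = linear(second_fct, fct_to_temp)
        second_temporal.append([a + b for a, b in zip(first_temporal, projected)])

    # The temporal "transformer" mixes information across time frames.
    third_temporal = self_attention(second_temporal)
    # Toy output head: a sigmoid score per frame.
    return [1.0 / (1.0 + math.exp(-sum(h * x for h, x in zip(head, vec))))
            for vec in third_temporal]


scores = analyze([[0.1, 0.5, 0.2], [0.3, 0.0, 0.9], [0.4, 0.4, 0.1], [0.0, 0.2, 0.7]])
```

The sketch keeps the two levels separate: attention over frequency tokens runs independently for each time frame, and only the projected FCT carries spectral information into the sequence that is attended over time.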
-
Publication No.: WO2023063881A2
Publication Date: 2023-04-20
Application No.: PCT/SG2022/050705
Application Date: 2022-09-29
Applicant: LEMON INC.
Inventor: SMITH, Jordan; WANG, Ju-Chiang; LU, Wei Tsung; SONG, Xuchen
Abstract: Devices, systems, and methods for implementing supervised metric learning during training of a deep neural network model are disclosed herein. In examples, audio input may be received that includes a plurality of song fragments from a plurality of songs. For each song fragment, an aligning function may be performed to center the song fragment based on determined beat information, thereby creating a plurality of aligned song fragments. For each song fragment of the plurality of song fragments, an embedding vector may be obtained from the deep neural network. A batch of aligned song fragments may then be selected from the plurality of aligned song fragments so that a training tuple may be selected. A loss metric may be generated based on the selected training tuple, and one or more weights of the deep neural network model may be updated based on the loss metric.
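A minimal sketch of three of the steps named in this abstract follows: beat-based centering, tuple selection, and the loss metric. The abstract does not specify the tuple type or the loss, so the triplet (anchor, positive, negative) and the margin-based triplet loss used here are assumptions, as are the per-fragment labels and all function names. The weight update is omitted because it would normally rely on an automatic-differentiation framework.

```python
import math


def center_on_beat(start, duration, beats):
    """Return a new start time so the fragment's midpoint lands on the nearest beat.
    Assumption: beat information is given as a list of beat times in seconds."""
    mid = start + duration / 2.0
    nearest = min(beats, key=lambda b: abs(b - mid))
    return nearest - duration / 2.0


def distance(u, v):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def pick_triplet(labels):
    """Return indices (anchor, positive, negative) of the first valid triplet in a
    batch, or None. Assumption: supervision is a label per fragment."""
    n = len(labels)
    for a in range(n):
        for p in range(n):
            if p != a and labels[p] == labels[a]:
                for q in range(n):
                    if labels[q] != labels[a]:
                        return a, p, q
    return None


def triplet_loss(anchor, positive, negative, margin=0.2):
    """Hinge-style triplet loss: the positive should sit closer to the anchor
    than the negative does, by at least `margin`."""
    return max(0.0, distance(anchor, positive) - distance(anchor, negative) + margin)


# Usage on a toy batch with made-up embeddings and labels.
beats = [0.0, 0.5, 1.0, 1.5, 2.0]
aligned_start = center_on_beat(0.3, 1.0, beats)

embeddings = [[0.0, 0.0], [0.6, 0.8], [0.3, 0.4]]
labels = ["verse", "chorus", "verse"]
a, p, q = pick_triplet(labels)
loss = triplet_loss(embeddings[a], embeddings[p], embeddings[q])
```

A loss of zero means the selected tuple already satisfies the margin, so it would contribute no gradient; practical systems often mine harder tuples from the batch instead of taking the first valid one.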
-