System and method for dynamic scheduling of distributed deep learning training jobs

Invention Grant

US11693706B2 System and method for dynamic scheduling of distributed deep learning training jobs (in force)

Patent Title: System and method for dynamic scheduling of distributed deep learning training jobs
Application No.: US16690999

Application Date: 2019-11-21
Publication No.: US11693706B2

Publication Date: 2023-07-04
Inventor: Timothy Capes, Iqbal Mohomed, Vishal Raheja, Mete Kemertas
Applicant: SAMSUNG ELECTRONICS CO., LTD.
Applicant Address: KR Gyeonggi-do
Assignee: SAMSUNG ELECTRONICS CO., LTD.
Current Assignee: SAMSUNG ELECTRONICS CO., LTD.
Current Assignee Address: KR Suwon-si
Agency: Sughrue Mion, PLLC
Main IPC: G06F9/50
IPC: G06F9/50; G06V10/82; G06N3/08; G06N7/08; G06F18/214; G06N5/01; G06V10/764; G06V10/94; G06V10/96; G06N3/084

Abstract:

A scheduling algorithm for training deep neural network (DNN) weights on processing units (PUs) identifies the next job to be provisionally assigned a PU based on a doubling heuristic. The doubling heuristic uses an estimated number of training sets needed to complete training of the weights for a given job and/or a training speed function that indicates how fast the weights are converging. The scheduling algorithm addresses the problem of assigning PUs efficiently when multiple DNN weight data structures must be trained. In some embodiments, the training of the weights uses a ring-based message passing architecture. In some embodiments, scheduling is performed in a nested-loop fashion: in inner iterations of the nested loop, PUs are scheduled and jobs are launched or restarted; in outer iterations, jobs are stopped, parameters are updated, and the inner iteration is re-entered.
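
The scheduling idea in the abstract can be illustrated with a minimal Python sketch. It is not the claimed method. The abstract does not give the selection rule, so everything concrete here is an assumption made for illustration: the `Job` fields, the `remaining_work` estimate, the shape of the `speed` function, the choice to start every job at one PU, and the rule of picking the job whose doubling saves the most estimated time per additional PU. The `simulate` function loosely mirrors the nested loop: allocation and launch in the inner step, then stop and update in the outer step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Job:
    name: str
    remaining_work: float          # hypothetical estimate of work left (e.g. epochs)
    speed: Callable[[int], float]  # hypothetical: work done per time unit on n PUs


def remaining_time(job: Job, pus: int) -> float:
    """Estimated time to finish the job if it keeps running on `pus` PUs."""
    return job.remaining_work / job.speed(pus)


def doubling_allocate(jobs: List[Job], total_pus: int) -> Dict[str, int]:
    """Greedy allocation in which a job's PU count only grows by doubling.

    Assumed rule (not taken from the patent): repeatedly double the job with
    the largest estimated time saved per additional PU.
    """
    alloc = {job.name: 0 for job in jobs}
    free = total_pus
    # Start every job on one PU while PUs last.
    for job in jobs:
        if free == 0:
            break
        alloc[job.name] = 1
        free -= 1
    while free > 0:
        best, best_gain = None, 0.0
        for job in jobs:
            cur = alloc[job.name]
            if cur == 0 or cur > free:
                continue  # doubling needs `cur` extra PUs
            saved = remaining_time(job, cur) - remaining_time(job, 2 * cur)
            gain = saved / cur
            if gain > best_gain:
                best, best_gain = job, gain
        if best is None:
            break  # no job can be doubled profitably
        free -= alloc[best.name]
        alloc[best.name] *= 2
    return alloc


def simulate(jobs: List[Job], total_pus: int, interval: float,
             max_rounds: int) -> List[Dict[str, int]]:
    """Outer loop: stop jobs, update estimates, re-enter the allocation step."""
    history = []
    for _ in range(max_rounds):
        active = [j for j in jobs if j.remaining_work > 0]
        if not active:
            break
        alloc = doubling_allocate(active, total_pus)  # inner step: schedule PUs
        history.append(alloc)
        for job in active:                            # outer step: update state
            if alloc[job.name] > 0:
                done = job.speed(alloc[job.name]) * interval
                job.remaining_work = max(0.0, job.remaining_work - done)
    return history
```

For example, calling `doubling_allocate` with two jobs whose assumed speed functions are `lambda p: p ** 0.9` and `lambda p: p ** 0.5` on 8 PUs returns a power-of-two PU count for each job. Keeping counts at powers of two is a design choice of this sketch, not something stated in the abstract.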

Public/Granted literature

US20200159589A1 SYSTEM AND METHOD FOR DYNAMIC SCHEDULING OF DISTRIBUTED DEEP LEARNING TRAINING JOBS Public/Granted day: 2020-05-21

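IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models G06N)
G06F9/00	Arrangements for program control, e.g. control units (program control for peripheral devices G06F13/10)
G06F9/06	.using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
G06F9/46	..Multiprogramming arrangements
G06F9/50	...Allocation of resources, e.g. of the central processing unit [CPU]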