Proactive cluster compute node migration at next checkpoint of cluster cluster upon predicted node failure

Invention Grant

US10776225B2 Proactive cluster compute node migration at next checkpoint of cluster cluster upon predicted node failure (in force)

Patent Title: Proactive cluster compute node migration at next checkpoint of cluster cluster upon predicted node failure
Application No.: US16022990

Application Date: 2018-06-29
Publication No.: US10776225B2

Publication Date: 2020-09-15
Inventor: Cong Xu, Naveen Muralimanohar, Harumi Kuno
Applicant: Hewlett Packard Enterprise Development LP
Applicant Address: US TX Houston
Assignee: Hewlett Packard Enterprise Development LP
Current Assignee: Hewlett Packard Enterprise Development LP
Current Assignee Address: US TX Houston
Agent: Michael A. Dryja
Main IPC: G06F11/20
IPC: G06F11/20; G06F11/14; G06F11/07; G06F11/00; G06F11/36; G06F9/48; G06F9/52; G06F9/54; G06F9/455; G06N20/00

Abstract:

While scheduled checkpoints are being taken of a cluster of active compute nodes distributively executing an application in parallel, a likelihood of failure of the active compute nodes is periodically and independently predicted. Responsive to the likelihood of failure of a given active compute node exceeding a threshold, the given active compute node is proactively migrated to a spare compute node of the cluster at a next scheduled checkpoint. Another spare compute node of the cluster can perform prediction and migration. Prediction can be based on both hardware events and software events regarding the active compute nodes.
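
The flow described in the abstract can be illustrated with a minimal Python sketch. It is not the patented implementation: the class `Cluster`, its methods, the node names, and the value of `FAILURE_THRESHOLD` are all hypothetical, and the failure likelihoods are passed in directly rather than derived from hardware and software events. The sketch only shows the separation between periodic prediction, which flags a node, and the next scheduled checkpoint, which is when the flagged node is actually replaced by a spare. It does not model a separate spare node running the prediction and migration.

```python
from dataclasses import dataclass, field

# Hypothetical threshold; the abstract does not give a value.
FAILURE_THRESHOLD = 0.8


@dataclass
class Cluster:
    active: list                                # active compute nodes
    spares: list                                # idle spare compute nodes
    pending: set = field(default_factory=set)   # nodes flagged for migration

    def predict(self, likelihoods):
        """Periodic prediction: flag active nodes whose predicted
        likelihood of failure exceeds the threshold."""
        for node, p in likelihoods.items():
            if node in self.active and p > FAILURE_THRESHOLD:
                self.pending.add(node)

    def checkpoint(self):
        """Scheduled checkpoint: migrate each flagged node to a spare,
        for as long as spares remain. Returns (old, new) pairs."""
        migrations = []
        for node in sorted(self.pending):
            if not self.spares:
                break  # no spare left; the node stays flagged
            spare = self.spares.pop(0)
            self.active[self.active.index(node)] = spare
            migrations.append((node, spare))
        self.pending -= {old for old, _ in migrations}
        return migrations


cluster = Cluster(active=["n0", "n1", "n2"], spares=["s0"])
cluster.predict({"n0": 0.10, "n1": 0.93, "n2": 0.40})  # n1 is flagged only
moved = cluster.checkpoint()                            # n1 is replaced here
print(moved, cluster.active)
```

Deferring the migration to `checkpoint()` mirrors the abstract's wording that the node is migrated "at a next scheduled checkpoint" rather than at the moment the prediction crosses the threshold.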

Public/Granted literature

US20200004648A1 PROACTIVE CLUSTER COMPUTE NODE MIGRATION AT NEXT CHECKPOINT OF CLUSTER CLUSTER UPON PREDICTED NODE FAILURE, Public/Granted day: 2020-01-02

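IPC Classification:

G	Physics
G06	Computing; calculating or counting
G06F	Electric digital data processing (computer systems based on specific computational models, see G06N)
G06F11/00	Error detection; error correction; monitoring (methods or arrangements for verifying the correctness of marking on a record carrier, see G06K5/00; methods or arrangements used in information storage based on relative movement between record carrier and transducer, see G11B, e.g. G11B20/18; methods or arrangements used in static stores, see G11C29/00)
G06F11/07	. Responding to the occurrence of a fault, e.g. fault tolerance
G06F11/16	.. Error detection or correction of the data by redundancy in hardware
G06F11/20	... Using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements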