Serializing machine check exceptions for predictive failure analysis

    公开(公告)号:US11163623B2

    公开(公告)日:2021-11-02

    申请号:US16866485

    申请日:2020-05-04

    Abstract: Upon occurrence of multiple errors in a central processing unit (CPU) package, data indicating the errors is stored in machine check (MC) banks. A timestamp corresponding to each error is stored, the timestamp indicating a time of occurrence for each error. A machine check exception (MCE) handler is generated to address the errors based on the timestamps. The timestamps can be stored in the MC banks or in a utility box (U-box). The MCE handler can then address the errors based on order of occurrence, for example by determining that the first error in time causes the remaining error. The MCE can isolate hardware/software associated with the first error to recover from a failure. The MCE can report only the first error to the operating system (OS) or other error management software/hardware. The U-Box may also convert the timestamps into real time to support user debugging.

    DISAMBIGUATION OF ERROR LOGGING DURING SYSTEM RESET

    公开(公告)号:US20180341537A1

    公开(公告)日:2018-11-29

    申请号:US15606799

    申请日:2017-05-26

    CPC classification number: G06F11/0766 G06F11/0787

    Abstract: A mechanism for disambiguation of error logging during a warm reset is disclosed. A system agent detects an error occurring during bootstrapping of a processor package. The error occurs prior to initiation of a machine check system. A wide pulse event is initiated to signal a wide pulse register to store a wide pulse time stamp counter value. The wide pulse event also signals a lap register to store a lap time stamp counter value. The wide pulse register maintains the wide pulse time stamp counter value during a warm reset, and the lap register clears the lap time stamp counter value during the warm reset. The system agent obtains the wide pulse time stamp counter value and the lap time stamp counter value after bootstrapping is complete to determine an order of occurrence of the error relative to the warm reset.

Patent Agency Ranking