Serializing machine check exceptions for predictive failure analysis

    公开(公告)号:US11687391B2

    公开(公告)日:2023-06-27

    申请号:US17516584

    申请日:2021-11-01

    Abstract: Upon occurrence of multiple errors in a central processing unit (CPU) package, data indicating the errors is stored in machine check (MC) banks. A timestamp corresponding to each error is stored, the timestamp indicating a time of occurrence for each error. A machine check exception (MCE) handler is generated to address the errors based on the timestamps. The timestamps can be stored in the MC banks or in a utility box (U-box). The MCE handler can then address the errors based on order of occurrence, for example by determining that the first error in time causes the remaining error. The MCE can isolate hardware/software associated with the first error to recover from a failure. The MCE can report only the first error to the operating system (OS) or other error management software/hardware. The U-Box may also convert the timestamps into real time to support user debugging.

    Device, system, and method to concurrently store multiple PMON counts in a single register

    公开(公告)号:US12044730B2

    公开(公告)日:2024-07-23

    申请号:US17131477

    申请日:2020-12-22

    CPC classification number: G01R31/317

    Abstract: Techniques and mechanisms for providing performance monitoring information. In an embodiment, a performance monitor circuit receives a communication which indicates a format comprising multiple fields which are each to store a respective count of monitored events. A programming of the performance monitor circuit, based on the communication, designates first bits and second bits of the register to provide, respectively, a first first field and a second field according to the format. Performance monitoring subsequent to the programming successively tallies a first count of first events which occur during a first period of time, and a second count of second events which occur during a second period of time. In another embodiment, performance monitoring results in the register concurrently storing both the first count and the second count.

    SERIALIZING MACHINE CHECK EXCEPTIONS FOR PREDICTIVE FAILURE ANALYSIS

    公开(公告)号:US20180150345A1

    公开(公告)日:2018-05-31

    申请号:US15362522

    申请日:2016-11-28

    Abstract: Upon occurrence of multiple errors in a central processing unit (CPU) package, data indicating the errors is stored in machine check (MC) banks. A timestamp corresponding to each error is stored, the timestamp indicating a time of occurrence for each error. A machine check exception (MCE) handler is generated to address the errors based on the timestamps. The timestamps can be stored in the MC banks or in a utility box (U-box). The MCE handler can then address the errors based on order of occurrence, for example by determining that the first error in time causes the remaining error. The MCE can isolate hardware/software associated with the first error to recover from a failure. The MCE can report only the first error to the operating system (OS) or other error management software/hardware. The U-Box may also convert the timestamps into real time to support user debugging.

    DEVICE, SYSTEM, AND METHOD TO CONCURRENTLY STORE MULTIPLE PMON COUNTS IN A SINGLE REGISTER

    公开(公告)号:US20220196733A1

    公开(公告)日:2022-06-23

    申请号:US17131477

    申请日:2020-12-22

    Abstract: Techniques and mechanisms for providing performance monitoring information. In an embodiment, a performance monitor circuit receives a communication which indicates a format comprising multiple fields which are each to store a respective count of monitored events. A programming of the performance monitor circuit, based on the communication, designates first bits and second bits of the register to provide, respectively, a first first field and a second field according to the format. Performance monitoring subsequent to the programming successively tallies a first count of first events which occur during a first period of time, and a second count of second events which occur during a second period of time. In another embodiment, performance monitoring results in the register concurrently storing both the first count and the second count.

    Disambiguation of error logging during system reset

    公开(公告)号:US10824493B2

    公开(公告)日:2020-11-03

    申请号:US15606799

    申请日:2017-05-26

    Abstract: A mechanism for disambiguation of error logging during a warm reset is disclosed. A system agent detects an error occurring during bootstrapping of a processor package. The error occurs prior to initiation of a machine check system. A wide pulse event is initiated to signal a wide pulse register to store a wide pulse time stamp counter value. The wide pulse event also signals a lap register to store a lap time stamp counter value. The wide pulse register maintains the wide pulse time stamp counter value during a warm reset, and the lap register clears the lap time stamp counter value during the warm reset. The system agent obtains the wide pulse time stamp counter value and the lap time stamp counter value after bootstrapping is complete to determine an order of occurrence of the error relative to the warm reset.

    DELAYED ERROR PROCESSING
    7.
    发明申请

    公开(公告)号:US20180349231A1

    公开(公告)日:2018-12-06

    申请号:US15610067

    申请日:2017-05-31

    Abstract: A computing apparatus, including: a hardware platform including a processor and memory; and a system management interrupt (SMI) handler; first logic configured to provide a first container and a second container via the hardware platform; and second logic configured to: detect an uncorrectable error in the first container; responsive to the detecting, generate a degraded system state; provide a degraded state message to the SMI handler; instruct the second container to seek a recoverable state; determine that the second container has entered a recoverable state; and initiate a recovery operation.

    Delayed error processing
    8.
    发明授权

    公开(公告)号:US10929232B2

    公开(公告)日:2021-02-23

    申请号:US15610067

    申请日:2017-05-31

    Abstract: A computing apparatus, including: a hardware platform including a processor and memory; and a system management interrupt (SMI) handler; first logic configured to provide a first container and a second container via the hardware platform; and second logic configured to: detect an uncorrectable error in the first container; responsive to the detecting, generate a degraded system state; provide a degraded state message to the SMI handler; instruct the second container to seek a recoverable state; determine that the second container has entered a recoverable state; and initiate a recovery operation.

    Apparatus and method for vectored machine check bank reporting

    公开(公告)号:US10824496B2

    公开(公告)日:2020-11-03

    申请号:US15857376

    申请日:2017-12-28

    Abstract: An apparatus and method for machine check bank reporting in a processor. For example, one embodiment includes a processor comprising: one or more cores to execute instructions and process data; a plurality of machine check architecture banks to store errors detected during execution of the instructions; error monitoring circuitry to detect the errors and responsively update the MCA banks; and a first error register (FERR) into which a first error vector is to be stored to identify an MCA bank containing a first error in an error sequence, the error monitoring circuitry to update the first error vector responsive to detecting the first error; and one or more next error registers (NERRs) to store one or more error vectors to one or more other MCA banks containing subsequent errors occurring after the first error.

    Serializing machine check exceptions for predictive failure analysis

    公开(公告)号:US10671465B2

    公开(公告)日:2020-06-02

    申请号:US15362522

    申请日:2016-11-28

    Abstract: Upon occurrence of multiple errors in a central processing unit (CPU) package, data indicating the errors is stored in machine check (MC) banks. A timestamp corresponding to each error is stored, the timestamp indicating a time of occurrence for each error. A machine check exception (MCE) handler is generated to address the errors based on the timestamps. The timestamps can be stored in the MC banks or in a utility box (U-box). The MCE handler can then address the errors based on order of occurrence, for example by determining that the first error in time causes the remaining error. The MCE can isolate hardware/software associated with the first error to recover from a failure. The MCE can report only the first error to the operating system (OS) or other error management software/hardware. The U-Box may also convert the timestamps into real time to support user debugging.

Patent Agency Ranking