AUTOMATIC REMEDIATION OF FAILURES WITHIN A COMPUTATIONAL ENVIRONMENT USING INDEPENDENT EXECUTION UNITS
Abstract:
A system includes a computer system, a memory, and a processor. The computer system includes active units of system resources, each executing a workload unit, and redundant units of system resources. The memory stores a reinforcement learning algorithm configured to generate a sequence of resets. Executing each reset includes exchanging the active unit of system resources associated with the reset with a redundant unit of system resources assigned to that active unit. The processor measures performance metric values and determines, based on the values, that a first probability that a failure will occur is greater than a threshold. In response, the processor generates and executes a sequence of resets. The processor then measures new performance metric values and determines, based on the new values, a second probability that the failure will occur. Finally, the processor updates the reinforcement learning algorithm based on a difference between the first and second probabilities.
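The cycle described in the abstract (measure, compare against a threshold, execute resets, re-measure, update the learner) can be illustrated with a minimal Python sketch. This is not the claimed implementation: the names `Unit`, `ResetPolicy`, and `remediate`, the tabular value update, the greedy sequence selection, and the `learning_rate` parameter are all assumptions introduced for illustration. The abstract does not specify which reinforcement learning algorithm is used, how the failure probability is estimated, or how performance metrics are measured, so the last two are passed in as caller-supplied functions.

```python
from dataclasses import dataclass, field


@dataclass
class Unit:
    """An active unit of system resources with one assigned redundant unit."""
    name: str
    active: str      # resource currently executing the workload unit
    redundant: str   # standby resource assigned to this active unit

    def reset(self) -> None:
        # A reset exchanges the active resource with its assigned redundant one.
        self.active, self.redundant = self.redundant, self.active


@dataclass
class ResetPolicy:
    """Hypothetical tabular learner: one value estimate per unit name."""
    learning_rate: float = 0.5
    values: dict = field(default_factory=dict)

    def generate_sequence(self, units, length):
        # Greedy choice (an assumption): reset the highest-valued units first.
        ranked = sorted(units, key=lambda u: self.values.get(u.name, 0.0),
                        reverse=True)
        return ranked[:length]

    def update(self, sequence, p_before, p_after):
        # The reward is the difference between the first and second
        # failure probabilities, as in the abstract.
        reward = p_before - p_after
        for unit in sequence:
            old = self.values.get(unit.name, 0.0)
            self.values[unit.name] = old + self.learning_rate * (reward - old)


def remediate(units, policy, measure, estimate_failure, threshold, length=1):
    """Run one measure / reset / re-measure / update cycle.

    Returns the sequence of units that were reset (empty if none were needed).
    """
    p_before = estimate_failure(measure(units))
    if p_before <= threshold:
        return []                       # failure not likely enough to act
    sequence = policy.generate_sequence(units, length)
    for unit in sequence:
        unit.reset()
    p_after = estimate_failure(measure(units))
    policy.update(sequence, p_before, p_after)
    return sequence
```

In use, a caller would supply `measure` (returning performance metric values for the units) and `estimate_failure` (mapping those values to a probability). Sequences that lower the estimated failure probability raise the stored value of the units they reset, so the greedy selection favors those units in later cycles.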