Fault tolerance for a distributed computing system
Abstract:
In one embodiment, a method detects a failure of a container in a controller node, where the container includes a service that is being performed and is isolated from other services being performed in other containers on the controller node. The controller node terminates the container including the service and determines a known state for the service. The known state is known to be operational and does not include a cause of the failure; while the service operated from the known state, changes to the known state were saved separately from the known state. The controller node restarts the service in a new container that replaces the terminated container, where the restarted service starts from the known state without using the changes.
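A minimal, hypothetical sketch of the recovery flow described above, assuming the known state can be modeled as a read-only mapping and the changes made during operation as a separate overlay (similar in spirit to a copy-on-write layer). No real container runtime is used; the names Container, ControllerNode, start, and check_and_recover are illustrative assumptions, not part of the described method.

```python
from dataclasses import dataclass, field
from types import MappingProxyType
from typing import Mapping


@dataclass
class Container:
    """Hypothetical stand-in for an isolated container running one service."""
    container_id: int
    service: str
    known_state: Mapping[str, object]            # read-only, known-good base state
    changes: dict = field(default_factory=dict)  # writes kept separately from the base
    running: bool = True
    healthy: bool = True

    def write(self, key, value):
        # Changes never modify the known state; they go to a separate layer.
        self.changes[key] = value

    def read(self, key):
        # Overlay lookup: a saved change shadows the known-state value.
        if key in self.changes:
            return self.changes[key]
        return self.known_state[key]


class ControllerNode:
    """Hypothetical controller that replaces failed containers."""

    def __init__(self):
        self._next_id = 0
        self.containers = {}    # service name -> current Container
        self.known_states = {}  # service name -> read-only known state

    def start(self, service, known_state):
        # Record the known state as an immutable view so it cannot be altered.
        self.known_states[service] = MappingProxyType(dict(known_state))
        return self._launch(service)

    def _launch(self, service):
        self._next_id += 1
        container = Container(self._next_id, service, self.known_states[service])
        self.containers[service] = container
        return container

    def check_and_recover(self, service):
        container = self.containers[service]
        if container.healthy:
            return container
        # Failure detected: terminate the container that holds the service.
        container.running = False
        # Restart in a new container from the known state; the failed
        # container's saved changes are not carried over.
        return self._launch(service)
```

For example, after `c1 = ctrl.start("svc", {"v": 1})` and `c1.write("v", 2)`, marking `c1` unhealthy and calling `ctrl.check_and_recover("svc")` yields a new container whose `read("v")` returns 1. Keeping the changes in a separate layer is what makes discarding them on restart trivial.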