    Standalone Scheduler fault recovery · d5a96fec
    Aaron Davidson authored
    Implements a basic form of fault recovery for the Standalone Scheduler. In particular,
    a failed Master can now be recovered manually by restarting the Master process on the
    same machine. This is the majority of the code needed for general fault tolerance,
    which will first elect a leader and then recover the Master state.
    
    To enable fault recovery, the Master persists to disk a small amount of state related
    to the registration of Workers and Applications. If the Master starts and finds that
    this state is still present, it enters Recovery mode, during which it does not
    schedule any new Executors on Workers (but it does accept the registration of
    new Clients and Workers).
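
The persistence and startup check described above can be sketched roughly as follows. This is a minimal illustration, not the code from this commit: `FilePersistence`, `MasterStartup`, the one-file-per-registration layout, and the `kind_id` file naming are all hypothetical names and choices made for the example.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path}
import scala.jdk.CollectionConverters._

// Hypothetical names throughout; not the classes introduced by the commit.
sealed trait MasterState
case object Alive extends MasterState
case object Recovering extends MasterState

/** Keeps one small file per registered Worker or Application (assumed layout). */
final class FilePersistence(dir: Path) {
  Files.createDirectories(dir)

  /** Records a registration, e.g. persist("worker", "w1", "host1:7078"). */
  def persist(kind: String, id: String, address: String): Unit = {
    Files.write(dir.resolve(s"${kind}_$id"), address.getBytes(StandardCharsets.UTF_8))
    ()
  }

  /** Forgets a registration once the node has been removed. */
  def remove(kind: String, id: String): Unit = {
    Files.deleteIfExists(dir.resolve(s"${kind}_$id"))
    ()
  }

  /** Returns (kind, id, address) for every record still on disk. */
  def readAll(): Seq[(String, String, String)] = {
    val stream = Files.list(dir)
    try {
      stream.iterator().asScala.toList.map { path =>
        val name = path.getFileName.toString
        val sep = name.indexOf('_')
        val address = new String(Files.readAllBytes(path), StandardCharsets.UTF_8)
        (name.take(sep), name.drop(sep + 1), address)
      }
    } finally stream.close()
  }
}

object MasterStartup {
  /** Leftover state means a previous Master died, so start in Recovery mode. */
  def initialState(store: FilePersistence): MasterState =
    if (store.readAll().isEmpty) Alive else Recovering
}
```

With this shape, a freshly started Master with an empty state directory goes straight to normal scheduling, while any surviving record puts it into Recovery mode first.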
    
    While in Recovery mode, the Master attempts to reconnect to all Workers and Client
    applications that were registered at the time of failure. After confirming either the
    existence or nonexistence of all such nodes (within a certain timeout), the Master
    exits Recovery mode and resumes normal scheduling.
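
One way to picture the exit condition for Recovery mode is a small tracker over the recovered registrations. This is a sketch under assumptions, not the commit's implementation: `RecoveryTracker`, its method names, and the explicit millisecond clock are hypothetical, and the choice to treat nodes that stay silent until the timeout as nonexistent is one reading of the description above.

```scala
// Hypothetical names throughout; not the classes introduced by the commit.
sealed trait NodeStatus
case object Unknown extends NodeStatus   // no answer yet
case object Responded extends NodeStatus // node confirmed it still exists
case object Gone extends NodeStatus      // node confirmed as nonexistent

final class RecoveryTracker(recoveredIds: Set[String], timeoutMs: Long, startedAtMs: Long) {
  private var status: Map[String, NodeStatus] =
    recoveredIds.map(id => id -> (Unknown: NodeStatus)).toMap

  /** A recovered Worker or Application answered the reconnect attempt. */
  def acknowledge(id: String): Unit =
    if (status.contains(id)) status = status.updated(id, Responded)

  /** The reconnect attempt showed that the node no longer exists. */
  def markGone(id: String): Unit =
    if (status.contains(id)) status = status.updated(id, Gone)

  /** Recovery may end once every node is resolved or the timeout has elapsed. */
  def canComplete(nowMs: Long): Boolean =
    !status.values.exists(_ == Unknown) || nowMs - startedAtMs >= timeoutMs

  /** Nodes to keep after recovery; anything still Unknown is dropped with the Gone ones. */
  def survivors(): Set[String] =
    status.collect { case (id, Responded) => id }.toSet
}
```

For example, with three recovered nodes and a 1000 ms timeout, two acknowledgements are not enough to finish early, but once the timeout passes the Master would resume scheduling with only the two nodes that answered.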