bin/stop-slaves.sh · d5a96feccb15dd290b282af9e2f94479c8e4554e · cs525-sp18-g07 / spark

Sep 17, 2013

Standalone Scheduler fault recovery · d5a96fec

Aaron Davidson authored Sep 17, 2013

Implements a basic form of Standalone Scheduler fault recovery. In particular,
faults can now be recovered from manually by restarting the Master process on
the same machine. This is the majority of the code necessary for general fault
tolerance, which will first elect a leader and then recover the Master state.

In order to enable fault recovery, the Master persists to disk a small amount of state
about the registration of Workers and Applications. If the Master is started and finds
this state still present, it enters Recovery mode, during which it does not schedule any
new Executors on Workers (but it does accept the registration of new Clients and
Workers).
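
The startup behaviour described above can be illustrated with a minimal sketch. This is
not the code from this commit (Spark's Master is written in Scala); the class names, the
`registrations.json` file name, and the JSON layout are all hypothetical and chosen only
to show the idea: registrations are written to disk, and a Master that finds leftover
state at startup comes up in Recovery mode, where it accepts registrations but schedules
nothing.

```python
import json
import os
import tempfile
from enum import Enum


class MasterState(Enum):
    ALIVE = "alive"            # normal scheduling
    RECOVERING = "recovering"  # persisted state was found at startup


class Master:
    """Toy master that persists registrations under a directory (hypothetical layout)."""

    def __init__(self, state_dir):
        self.state_path = os.path.join(state_dir, "registrations.json")
        self.workers = set()
        self.apps = set()
        self.launched = []  # (app, worker) pairs for which an executor was scheduled
        if os.path.exists(self.state_path):
            # Leftover state means a previous Master stopped without cleaning up:
            # reload the registrations and hold off scheduling.
            with open(self.state_path) as f:
                saved = json.load(f)
            self.workers = set(saved["workers"])
            self.apps = set(saved["apps"])
            self.state = MasterState.RECOVERING
        else:
            self.state = MasterState.ALIVE

    def _persist(self):
        with open(self.state_path, "w") as f:
            json.dump({"workers": sorted(self.workers), "apps": sorted(self.apps)}, f)

    def register_worker(self, worker_id):
        # Registration is accepted in both states.
        self.workers.add(worker_id)
        self._persist()
        self.schedule()

    def register_app(self, app_id):
        self.apps.add(app_id)
        self._persist()
        self.schedule()

    def schedule(self):
        # No new executors are placed while recovering.
        if self.state is not MasterState.ALIVE:
            return
        for app in sorted(self.apps):
            for worker in sorted(self.workers):
                if (app, worker) not in self.launched:
                    self.launched.append((app, worker))


state_dir = tempfile.mkdtemp()
first = Master(state_dir)
first.register_worker("worker-1")
first.register_app("app-1")

restarted = Master(state_dir)          # simulates restarting the Master on the same machine
restarted.register_worker("worker-2")  # accepted, but nothing is scheduled
```

In this sketch the first Master schedules normally, while the restarted one reloads
`worker-1` and `app-1`, accepts `worker-2`, and leaves `launched` empty until something
moves it out of the recovering state.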

While in Recovery mode, the Master attempts to reconnect to all Workers and Client
applications that were registered at the time of failure. After confirming either the
existence or nonexistence of all such nodes (within a certain timeout), the Master will
exit Recovery mode and resume normal scheduling.
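
The exit condition can be sketched as follows. Again, this is an illustration under
stated assumptions rather than the commit's implementation: `RecoveringMaster`,
`NodeState`, the callback name, and the 60-second timeout are hypothetical. Every
recovered node starts as unknown; recovery completes once all nodes have answered or the
timeout has elapsed, and nodes that never answered are treated as nonexistent.

```python
import time
from enum import Enum


class NodeState(Enum):
    UNKNOWN = "unknown"  # recovered from disk, not yet heard from
    ALIVE = "alive"      # confirmed after the restart


class RecoveringMaster:
    def __init__(self, recovered_ids, timeout_s, clock=time.monotonic):
        self.clock = clock  # injectable so the timeout can be exercised without sleeping
        self.deadline = clock() + timeout_s
        self.nodes = {node_id: NodeState.UNKNOWN for node_id in recovered_ids}
        self.recovering = True

    def on_node_responded(self, node_id):
        # A recovered Worker or Client application answered the reconnect attempt.
        if node_id in self.nodes:
            self.nodes[node_id] = NodeState.ALIVE
        self.maybe_complete_recovery()

    def maybe_complete_recovery(self):
        if not self.recovering:
            return
        all_confirmed = all(s is NodeState.ALIVE for s in self.nodes.values())
        timed_out = self.clock() >= self.deadline
        if all_confirmed or timed_out:
            # Nodes still unknown at this point are treated as gone and dropped.
            self.nodes = {n: s for n, s in self.nodes.items() if s is NodeState.ALIVE}
            self.recovering = False  # normal scheduling may resume


now = [0.0]  # fake clock, in seconds
master = RecoveringMaster(
    ["worker-1", "worker-2", "app-1"], timeout_s=60.0, clock=lambda: now[0]
)
master.on_node_responded("worker-1")
master.on_node_responded("app-1")
still_recovering = master.recovering  # worker-2 has not answered yet

now[0] = 61.0                         # the timeout elapses
master.maybe_complete_recovery()
```

A real Master would trigger the timeout check from a scheduled message or timer rather
than an explicit call; the fake clock here only keeps the example deterministic.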
