[SPARK-16554][CORE] Automatically Kill Executors and Nodes when they are Blacklisted
## What changes were proposed in this pull request?

In SPARK-8425, we introduced a mechanism for blacklisting executors and nodes (hosts). After a certain number of failures, these resources would be "blacklisted" and no further work would be assigned to them for some period of time.

In some scenarios, it is better to fail fast and simply kill these unreliable resources. This change proposes to do so by having the BlacklistTracker kill unreliable resources when they would otherwise be "blacklisted".

To be thread safe, this code depends on the CoarseGrainedSchedulerBackend sending a message to the driver backend, which does the actual killing. This also helps to prevent a race that would permit work to begin on a resource (executor or node) between the time the resource is marked for killing and the time at which it is finally killed.

## How was this patch tested?

./dev/run-tests

Ran https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh, and checked logs to see executors and nodes being killed.

Testing can likely be improved here; suggestions welcome.

Author: José Hiram Soltren <jose@cloudera.com>

Closes #16650 from jsoltren/SPARK-16554-submit.
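The kill-on-blacklist flow described in the message can be illustrated with a small, self-contained toy model. This is only a sketch and not the code from this commit: `ToyDriverEndpoint`, `ToyBlacklistTracker`, the `maxFailures` threshold, and the `killEnabled` flag are hypothetical names chosen for illustration, and the case class only stands in for the message sent to the driver backend. The point it tries to show is that the tracker never kills anything itself; it enqueues a message, and a single sequential message loop performs the kill.

```scala
import scala.collection.mutable

// Illustrative message type; stands in for the message sent to the driver backend.
final case class KillExecutorsOnHost(host: String)

// Toy stand-in for the driver endpoint: messages are queued and handled one at a
// time, so executor state is only mutated from this single place.
final class ToyDriverEndpoint(executorsByHost: mutable.Map[String, mutable.Set[String]]) {
  private val inbox = mutable.Queue.empty[KillExecutorsOnHost]
  val killed = mutable.ListBuffer.empty[String]

  def send(msg: KillExecutorsOnHost): Unit = inbox.enqueue(msg)

  // Drain the inbox sequentially, as a single-threaded message loop would.
  def processAll(): Unit =
    while (inbox.nonEmpty) {
      val KillExecutorsOnHost(host) = inbox.dequeue()
      // Removing the host first means no new work can be placed on it.
      executorsByHost.remove(host).foreach(execs => killed ++= execs.toSeq.sorted)
    }
}

// Toy tracker: counts failures per host and, once the (hypothetical) threshold is
// reached, blacklists the host and optionally asks the endpoint to kill its executors.
final class ToyBlacklistTracker(
    endpoint: ToyDriverEndpoint,
    maxFailures: Int,
    killEnabled: Boolean) {
  private val failures = mutable.Map.empty[String, Int].withDefaultValue(0)
  val blacklisted = mutable.Set.empty[String]

  def recordFailure(host: String): Unit = {
    failures(host) += 1
    // `add` returns true only the first time, so the kill message is sent once.
    if (failures(host) >= maxFailures && blacklisted.add(host) && killEnabled) {
      endpoint.send(KillExecutorsOnHost(host))
    }
  }
}
```

With `killEnabled = false` the model falls back to plain blacklisting, which mirrors the behavior described for SPARK-8425; the flag is a stand-in for whatever configuration option gates the new behavior.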
Changed files:
- core/src/main/scala/org/apache/spark/ExecutorAllocationClient.scala (20 additions, 1 deletion)
- core/src/main/scala/org/apache/spark/internal/config/package.scala (5 additions, 0 deletions)
- core/src/main/scala/org/apache/spark/scheduler/BlacklistTracker.scala (28 additions, 3 deletions)
- core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala (5 additions, 1 deletion)
- core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedClusterMessage.scala (3 additions, 0 deletions)
- core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala (33 additions, 14 deletions)
- core/src/test/scala/org/apache/spark/ExecutorAllocationManagerSuite.scala (8 additions, 1 deletion)
- core/src/test/scala/org/apache/spark/deploy/StandaloneDynamicAllocationSuite.scala (60 additions, 2 deletions)
- core/src/test/scala/org/apache/spark/scheduler/BlacklistTrackerSuite.scala (76 additions, 3 deletions)
- core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala (1 addition, 1 deletion)
- docs/configuration.md (9 additions, 0 deletions)