[SPARK-20540][CORE] Fix unstable executor requests.
There are two problems fixed in this commit.

First, the ExecutorAllocationManager sets a timeout to avoid requesting executors too often. However, the next timeout is always computed from the previous timeout plus the interval, not from the current time. If a call is delayed by lock contention for longer than the scheduling interval, the deadline falls behind the clock and the manager requests more executors on every run (see the first sketch below). This seems to be the main cause of SPARK-20540.

The second problem is that the CoarseGrainedSchedulerBackend does not track the total number of requested executors. Instead, it derives the value from the current status of three variables: the number of known executors, the number of executors that have been killed, and the number of pending executors. But the number of pending executors is never less than 0, even though there may be more executors known than requested. When executors are killed and not replaced, this can make the request sent to YARN too large, because the scheduler's state is slightly out of date (see the second sketch below). This is fixed by tracking the currently requested size explicitly.

## How was this patch tested?

Existing tests.

Author: Ryan Blue <blue@apache.org>

Closes #17813 from rdblue/SPARK-20540-fix-dynamic-allocation.

(cherry picked from commit 2b2dd08e)

Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
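To illustrate the first problem, here is a minimal, runnable sketch of the deadline drift. The `intervalMs` constant and the `advanceBuggy`/`advanceFixed` helpers are hypothetical stand-ins, not the actual ExecutorAllocationManager internals:

```scala
// A self-contained sketch of the deadline bug, under assumed names.
object DeadlineSketch {
  val intervalMs = 100L

  // Buggy: the next deadline is derived from the old deadline. If a run
  // is delayed past the deadline, the new deadline is already in the
  // past, so every later check fires immediately.
  def advanceBuggy(deadline: Long): Long = deadline + intervalMs

  // Fixed: the next deadline is always one interval after "now".
  def advanceFixed(nowMs: Long): Long = nowMs + intervalMs

  def main(args: Array[String]): Unit = {
    val nowMs = 350L                // a run delayed 350 ms by lock contention
    val buggy = advanceBuggy(0L)    // 100: still behind the clock
    val fixed = advanceFixed(nowMs) // 450: strictly in the future
    println(s"buggy deadline=$buggy expired=${nowMs >= buggy}")
    println(s"fixed deadline=$fixed expired=${nowMs >= fixed}")
  }
}
```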
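For the second problem, the following sketch shows why a total derived from a zero-clamped pending count overshoots. The names, the simplified derivation, and the counts are illustrative, not the backend's real fields:

```scala
// A sketch of the derived-total problem, assuming a simplified formula:
// a pending count clamped at zero cannot express "fewer executors than
// are currently known", so the derived total overshoots while kills are
// still in flight.
object RequestedTotalSketch {
  // Derived total: known executors plus pending requests, where pending
  // is never allowed to go below zero.
  def derivedTotal(known: Int, pending: Int): Int =
    known + math.max(0, pending)

  def main(args: Array[String]): Unit = {
    val known   = 10             // killed executors are still registered
    val wanted  = 5              // 5 were killed without replacement
    val pending = wanted - known // -5, but the clamp discards the sign
    println(s"derived=${derivedTotal(known, pending)} wanted=$wanted")
    // Prints "derived=10 wanted=5": the request sent upstream is too big.
    // Tracking the requested total explicitly, as this commit does,
    // avoids deriving it from out-of-date cluster state.
  }
}
```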
Showing 3 changed files with 33 additions and 7 deletions:

- core/src/main/scala/org/apache/spark/ExecutorAllocationManager.scala (1 addition, 1 deletion)
- core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala (28 additions, 4 deletions)
- core/src/test/scala/org/apache/spark/deploy/StandaloneDynamicAllocationSuite.scala (4 additions, 2 deletions)