[SPARK-5945] Spark should not retry a stage infinitely on a FetchFailedException
The `Stage` class now tracks whether a sufficient number of consecutive failures of that stage have occurred to trigger an abort. To avoid an infinite loop of stage retries, we abort the job completely after 4 consecutive failures of a single stage. A stage may still fail more than 4 times over the lifetime of a job as long as successful attempts intervene, so that in very long-lived applications, where a stage may be reused many times, we don't abort the job after failures that have already been recovered from. I've added test cases to exercise the most obvious scenarios.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #5636 from ilganeli/SPARK-5945.
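A minimal sketch in Scala of the abort-after-consecutive-failures idea described above. The class and method names (`StageFailureTracker`, `recordFailure`, `recordSuccess`) are illustrative assumptions, not the actual `Stage`/`DAGScheduler` internals; only the threshold of 4 consecutive failures mirrors the commit description.

```scala
// Sketch of the failure-tracking logic; names are hypothetical, not Spark's.
object ConsecutiveFailureDemo {
  // Mirrors the limit described in the commit message.
  val MaxConsecutiveStageFailures = 4

  class StageFailureTracker {
    private var consecutiveFailures = 0

    // Called when a stage attempt fails (e.g. on a FetchFailedException).
    // Returns true if the job should now be aborted.
    def recordFailure(): Boolean = {
      consecutiveFailures += 1
      consecutiveFailures >= MaxConsecutiveStageFailures
    }

    // Called on a successful attempt: resets the counter so a long-lived
    // application that reuses the stage is not aborted for recovered failures.
    def recordSuccess(): Unit = {
      consecutiveFailures = 0
    }
  }

  def main(args: Array[String]): Unit = {
    val tracker = new StageFailureTracker
    // Three failures, a recovery, then more failures: the abort only fires
    // once four failures accumulate without an intervening success.
    (1 to 3).foreach(_ => assert(!tracker.recordFailure()))
    tracker.recordSuccess()
    (1 to 3).foreach(_ => assert(!tracker.recordFailure()))
    assert(tracker.recordFailure()) // fourth consecutive failure -> abort
    println("Abort triggered after 4 consecutive failures, as expected.")
  }
}
```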
Showing 3 changed files:

- core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala (12 additions, 1 deletion)
- core/src/main/scala/org/apache/spark/scheduler/Stage.scala (29 additions, 1 deletion)
- core/src/test/scala/org/apache/spark/scheduler/DAGSchedulerSuite.scala (279 additions, 3 deletions)