Merge pull request #189 from tgravescs/sparkYarnErrorHandling
Improve Spark on Yarn error handling

Improve CLI error handling and allow only a certain number of worker failures before failing the application. This will help prevent users from doing foolish things that leave their jobs running forever. For instance, using 32-bit Java while trying to allocate 8G containers loops forever without this change; now it errors out after a certain number of retries. The number of retries is configurable.

Also increase the frequency at which we ping the RM, to speed up getting containers if they die. The Yarn MR app defaults to pinging the RM every 1 second, so the default of 5 seconds here is fine. That is configurable as well, in case people want to change it.

I do want to make sure there aren't any cases where calling stopExecutors in CoarseGrainedSchedulerBackend would cause problems. I couldn't think of any, and I tested on a standalone cluster as well as on Yarn.
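The bounded-failure behaviour described above can be illustrated with a minimal sketch. This is not the code from this pull request: `BoundedWorkerFailures`, `allocate`, `launchWorker`, `maxFailures`, and `pollIntervalMs` are hypothetical names chosen for illustration, and the choice to give up as soon as the failure count reaches the cap is an assumption.

```scala
import scala.annotation.tailrec

// Hypothetical sketch: these names are illustrative and are not Spark's API.
object BoundedWorkerFailures {
  sealed trait Outcome
  case object Succeeded extends Outcome
  final case class Failed(failures: Int) extends Outcome

  /** Request workers until enough are running or too many have failed.
    *
    * @param workersNeeded  number of workers the application asked for
    * @param maxFailures    configurable cap on worker failures (assumed semantics:
    *                       give up once the failure count reaches this value)
    * @param pollIntervalMs wait between allocation attempts, in milliseconds
    * @param launchWorker   hypothetical launcher; returns true if a worker started
    */
  def allocate(workersNeeded: Int, maxFailures: Int, pollIntervalMs: Long)(
      launchWorker: () => Boolean): Outcome = {
    @tailrec
    def loop(running: Int, failures: Int): Outcome =
      if (failures >= maxFailures) Failed(failures) // stop instead of retrying forever
      else if (running >= workersNeeded) Succeeded
      else {
        Thread.sleep(pollIntervalMs)
        if (launchWorker()) loop(running + 1, failures)
        else loop(running, failures + 1)
      }

    loop(running = 0, failures = 0)
  }
}
```

With a launcher that always fails (as a 32-bit JVM asked for an 8G container would), `BoundedWorkerFailures.allocate(2, 3, 0L)(() => false)` returns `Failed(3)` rather than looping indefinitely.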
Showing 6 changed files:
- core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala (1 addition, 0 deletions)
- core/src/main/scala/org/apache/spark/scheduler/cluster/SimrSchedulerBackend.scala (0 additions, 1 deletion)
- docs/running-on-yarn.md (2 additions, 0 deletions)
- yarn/src/main/scala/org/apache/spark/deploy/yarn/ApplicationMaster.scala (26 additions, 13 deletions)
- yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala (20 additions, 12 deletions)
- yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala (12 additions, 4 deletions)