Skip to content
Snippets Groups Projects
Commit fdf9f94f authored by Imran Rashid's avatar Imran Rashid
Browse files

[SPARK-15865][CORE] Blacklist should not result in job hanging with less than 4 executors

## What changes were proposed in this pull request?

Before this change, when you turn on blacklisting with `spark.scheduler.executorTaskBlacklistTime`, but you have fewer than `spark.task.maxFailures` executors, you can end with a job "hung" after some task failures.

Whenever a taskset is unable to schedule anything on resourceOfferSingleTaskSet, we check whether the last pending task can be scheduled on *any* known executor.  If not, the taskset (and any corresponding jobs) are failed.
* Worst case, this is O(maxTaskFailures + numTasks).  But unless many executors are bad, this should be small
* This does not fail as fast as possible -- when a task becomes unschedulable, we keep scheduling other tasks.  This is to avoid an O(numPendingTasks * numExecutors) operation
* Also, it is conceivable this fails too quickly.  You may be 1 millisecond away from unblacklisting a place for a task to run, or acquiring a new executor.

## How was this patch tested?

Added unit test which failed before the change, ran new test 5k times manually, ran all scheduler tests manually, and the full suite via jenkins.

Author: Imran Rashid <irashid@cloudera.com>

Closes #13603 from squito/progress_w_few_execs_and_blacklist.
parent 07f46afc
No related branches found
No related tags found
No related merge requests found
Showing with 194 additions and 15 deletions
Loading
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment