Commit dccc0aa3 authored 7 years ago by Sital Kedia Committed by Wenchen Fan 7 years ago

[SPARK-19753][CORE] Un-register all shuffle output on a host in case of slave lost or fetch failure

## What changes were proposed in this pull request?

Currently, when we detect fetch failure, we only remove the shuffle files produced by the executor, while the host itself might be down and all the shuffle files are not accessible. In case we are running multiple executors on a host, any host going down currently results in multiple fetch failures and multiple retries of the stage, which is very inefficient. If we remove all the shuffle files on that host, on first fetch failure, we can rerun all the tasks on that host in a single stage retry.

## How was this patch tested?

Unit testing and also ran a job on the cluster and made sure multiple retries are gone.

Author: Sital Kedia <skedia@fb.com>
Author: Imran Rashid <irashid@cloudera.com>

Closes #18150 from sitalkedia/cleanup_shuffle.

parent 8b5b2e27

No related branches found

No related tags found

No related merge requests found

Hide whitespace changes

Inline Side-by-side

Showing with 160 additions and 15 deletions

Please register or to comment