Commit 65e896a6 authored by Eric Liang, committed by Wenchen Fan

[SPARK-18679][SQL] Fix regression in file listing performance for non-catalog tables


## What changes were proposed in this pull request?

In Spark 2.1, `ListingFileCatalog` was significantly refactored (and renamed to `InMemoryFileIndex`). This introduced a regression where parallelism could only be introduced at the very top of the directory tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, that sub-tree is listed serially on executors).
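To make the descent concrete, here is a minimal sketch of the idea in Scala. It is not the actual implementation (that is `PartitioningAwareFileIndex.bulkListLeafFiles` / `listLeafFiles` in the diff below): the `ListingSketch` object, the explicit `threshold` parameter, and returning plain path strings are simplifications, and the real code reads the threshold from `spark.sql.sources.parallelPartitionDiscovery.threshold` and ships the Hadoop configuration and file statuses in serializable wrappers.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.spark.sql.SparkSession

// Sketch only; Serializable so the Spark closure below can reference these helpers.
object ListingSketch extends Serializable {

  // Lists one directory level; files are kept, sub-directories recurse. When a
  // SparkSession is present (driver side), the recursion may fan out into a Spark
  // job at *this* level rather than only at the root.
  def listLeafFiles(
      path: Path,
      conf: Configuration,
      threshold: Int,
      session: Option[SparkSession]): Seq[String] = {
    val fs = path.getFileSystem(conf)
    val (dirs, files) = fs.listStatus(path).partition(_.isDirectory)
    val nested: Seq[String] = session match {
      case Some(s) => bulkListLeafFiles(dirs.map(_.getPath), conf, threshold, s)
      case None    => dirs.flatMap(d => listLeafFiles(d.getPath, conf, threshold, None))
    }
    files.map(_.getPath.toString) ++ nested
  }

  // Fans out only once enough directories accumulate at a single level; each
  // sub-tree is then listed serially by one task (executors get no SparkSession).
  def bulkListLeafFiles(
      paths: Seq[Path],
      conf: Configuration,
      threshold: Int,
      session: SparkSession): Seq[String] = {
    if (paths.length < threshold) {
      paths.flatMap(p => listLeafFiles(p, conf, threshold, Some(session)))
    } else {
      session.sparkContext
        .parallelize(paths.map(_.toString), paths.length)
        // Executors rebuild a default Hadoop conf here; the real code broadcasts it.
        .flatMap(p => listLeafFiles(new Path(p), new Configuration(), threshold, None))
        .collect().toSeq
    }
  }
}
```

With this shape, `spark.read.parquet(topLevelDir)` starts as a single serial listing at the root, and a listing job is launched as soon as some level of the tree exposes enough directories, instead of never.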

cc mallman  cloud-fan

## How was this patch tested?

Checked metrics in unit tests: the new FileIndexSuite cases assert on the `parallelListingJobCount` metric.
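Abridged from the new tests in the diff below, each case follows this pattern (`expectedNumPar` is the number of parallel listing jobs expected for the directory layout under test):

```scala
HiveCatalogMetrics.reset()
assert(HiveCatalogMetrics.METRIC_PARALLEL_LISTING_JOB_COUNT.getCount() == 0)
new InMemoryFileIndex(spark, Seq(new Path(dir.getCanonicalPath)), Map.empty, None)
assert(HiveCatalogMetrics.METRIC_PARALLEL_LISTING_JOB_COUNT.getCount() == expectedNumPar)
```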

Author: Eric Liang <ekl@databricks.com>

Closes #16112 from ericl/spark-18679.

(cherry picked from commit 294163ee)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
parent a7f8ebb8
@@ -90,6 +90,12 @@ object HiveCatalogMetrics extends Source {
    */
   val METRIC_HIVE_CLIENT_CALLS = metricRegistry.counter(MetricRegistry.name("hiveClientCalls"))
 
+  /**
+   * Tracks the total number of Spark jobs launched for parallel file listing.
+   */
+  val METRIC_PARALLEL_LISTING_JOB_COUNT = metricRegistry.counter(
+    MetricRegistry.name("parallelListingJobCount"))
+
   /**
    * Resets the values of all metrics to zero. This is useful in tests.
    */
@@ -98,6 +104,7 @@ object HiveCatalogMetrics extends Source {
     METRIC_FILES_DISCOVERED.dec(METRIC_FILES_DISCOVERED.getCount())
     METRIC_FILE_CACHE_HITS.dec(METRIC_FILE_CACHE_HITS.getCount())
     METRIC_HIVE_CLIENT_CALLS.dec(METRIC_HIVE_CLIENT_CALLS.getCount())
+    METRIC_PARALLEL_LISTING_JOB_COUNT.dec(METRIC_PARALLEL_LISTING_JOB_COUNT.getCount())
   }
 
   // clients can use these to avoid classloader issues with the codahale classes
@@ -105,4 +112,5 @@ object HiveCatalogMetrics extends Source {
   def incrementFilesDiscovered(n: Int): Unit = METRIC_FILES_DISCOVERED.inc(n)
   def incrementFileCacheHits(n: Int): Unit = METRIC_FILE_CACHE_HITS.inc(n)
   def incrementHiveClientCalls(n: Int): Unit = METRIC_HIVE_CLIENT_CALLS.inc(n)
+  def incrementParallelListingJobCount(n: Int): Unit = METRIC_PARALLEL_LISTING_JOB_COUNT.inc(n)
 }
@@ -249,12 +249,9 @@ abstract class PartitioningAwareFileIndex(
           pathsToFetch += path
       }
     }
-    val discovered = if (pathsToFetch.length >=
-        sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
-      PartitioningAwareFileIndex.listLeafFilesInParallel(pathsToFetch, hadoopConf, sparkSession)
-    } else {
-      PartitioningAwareFileIndex.listLeafFilesInSerial(pathsToFetch, hadoopConf)
-    }
+    val filter = FileInputFormat.getInputPathFilter(new JobConf(hadoopConf, this.getClass))
+    val discovered = PartitioningAwareFileIndex.bulkListLeafFiles(
+      pathsToFetch, hadoopConf, filter, sparkSession)
     discovered.foreach { case (path, leafFiles) =>
       HiveCatalogMetrics.incrementFilesDiscovered(leafFiles.size)
       fileStatusCache.putLeafFiles(path, leafFiles.toArray)
@@ -286,31 +283,28 @@ object PartitioningAwareFileIndex extends Logging {
       blockLocations: Array[SerializableBlockLocation])
 
   /**
-   * List a collection of path recursively.
-   */
-  private def listLeafFilesInSerial(
-      paths: Seq[Path],
-      hadoopConf: Configuration): Seq[(Path, Seq[FileStatus])] = {
-    // Dummy jobconf to get to the pathFilter defined in configuration
-    val jobConf = new JobConf(hadoopConf, this.getClass)
-    val filter = FileInputFormat.getInputPathFilter(jobConf)
-    paths.map { path =>
-      val fs = path.getFileSystem(hadoopConf)
-      (path, listLeafFiles0(fs, path, filter))
-    }
-  }
-
-  /**
-   * List a collection of path recursively in parallel (using Spark executors).
-   * Each task launched will use [[listLeafFilesInSerial]] to list.
+   * Lists a collection of paths recursively. Picks the listing strategy adaptively depending
+   * on the number of paths to list.
    *
    * This may only be called on the driver.
    *
    * @return for each input path, the set of discovered files for the path
    */
-  private def listLeafFilesInParallel(
+  private def bulkListLeafFiles(
       paths: Seq[Path],
       hadoopConf: Configuration,
+      filter: PathFilter,
       sparkSession: SparkSession): Seq[(Path, Seq[FileStatus])] = {
-    assert(paths.size >= sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold)
+
+    // Short-circuits parallel listing when serial listing is likely to be faster.
+    if (paths.size < sparkSession.sessionState.conf.parallelPartitionDiscoveryThreshold) {
+      return paths.map { path =>
+        (path, listLeafFiles(path, hadoopConf, filter, Some(sparkSession)))
+      }
+    }
+
     logInfo(s"Listing leaf files and directories in parallel under: ${paths.mkString(", ")}")
+    HiveCatalogMetrics.incrementParallelListingJobCount(1)
 
     val sparkContext = sparkSession.sparkContext
     val serializableConfiguration = new SerializableConfiguration(hadoopConf)
@@ -322,9 +316,11 @@ object PartitioningAwareFileIndex extends Logging {
 
     val statusMap = sparkContext
       .parallelize(serializedPaths, numParallelism)
-      .mapPartitions { paths =>
+      .mapPartitions { pathStrings =>
         val hadoopConf = serializableConfiguration.value
-        listLeafFilesInSerial(paths.map(new Path(_)).toSeq, hadoopConf).iterator
+        pathStrings.map(new Path(_)).toSeq.map { path =>
+          (path, listLeafFiles(path, hadoopConf, filter, None))
+        }.iterator
       }.map { case (path, statuses) =>
         val serializableStatuses = statuses.map { status =>
           // Turn FileStatus into SerializableFileStatus so we can send it back to the driver
@@ -372,11 +368,20 @@ object PartitioningAwareFileIndex extends Logging {
   }
 
   /**
-   * List a single path, provided as a FileStatus, in serial.
+   * Lists a single filesystem path recursively. If a SparkSession object is specified, this
+   * function may launch Spark jobs to parallelize listing.
+   *
+   * If sessionOpt is None, this may be called on executors.
+   *
+   * @return all children of path that match the specified filter.
    */
-  private def listLeafFiles0(
-      fs: FileSystem, path: Path, filter: PathFilter): Seq[FileStatus] = {
+  private def listLeafFiles(
+      path: Path,
+      hadoopConf: Configuration,
+      filter: PathFilter,
+      sessionOpt: Option[SparkSession]): Seq[FileStatus] = {
     logTrace(s"Listing $path")
+    val fs = path.getFileSystem(hadoopConf)
     val name = path.getName.toLowerCase
     if (shouldFilterOut(name)) {
       Seq.empty[FileStatus]
@@ -391,9 +396,15 @@ object PartitioningAwareFileIndex extends Logging {
     }
 
     val allLeafStatuses = {
-      val (dirs, files) = statuses.partition(_.isDirectory)
-      val stats = files ++ dirs.flatMap(dir => listLeafFiles0(fs, dir.getPath, filter))
-      if (filter != null) stats.filter(f => filter.accept(f.getPath)) else stats
+      val (dirs, topLevelFiles) = statuses.partition(_.isDirectory)
+      val nestedFiles: Seq[FileStatus] = sessionOpt match {
+        case Some(session) =>
+          bulkListLeafFiles(dirs.map(_.getPath), hadoopConf, filter, session).flatMap(_._2)
+        case _ =>
+          dirs.flatMap(dir => listLeafFiles(dir.getPath, hadoopConf, filter, sessionOpt))
+      }
+      val allFiles = topLevelFiles ++ nestedFiles
+      if (filter != null) allFiles.filter(f => filter.accept(f.getPath)) else allFiles
     }
 
     allLeafStatuses.filterNot(status => shouldFilterOut(status.getPath.getName)).map {
@@ -25,6 +25,7 @@ import scala.language.reflectiveCalls
 
 import org.apache.hadoop.fs.{FileStatus, Path, RawLocalFileSystem}
 
+import org.apache.spark.metrics.source.HiveCatalogMetrics
 import org.apache.spark.sql.catalyst.util._
 import org.apache.spark.sql.test.SharedSQLContext
@@ -81,6 +82,58 @@ class FileIndexSuite extends SharedSQLContext {
     }
   }
 
+  test("PartitioningAwareFileIndex listing parallelized with many top level dirs") {
+    for ((scale, expectedNumPar) <- Seq((10, 0), (50, 1))) {
+      withTempDir { dir =>
+        val topLevelDirs = (1 to scale).map { i =>
+          val tmp = new File(dir, s"foo=$i.txt")
+          tmp.mkdir()
+          new Path(tmp.getCanonicalPath)
+        }
+        HiveCatalogMetrics.reset()
+        assert(HiveCatalogMetrics.METRIC_PARALLEL_LISTING_JOB_COUNT.getCount() == 0)
+        new InMemoryFileIndex(spark, topLevelDirs, Map.empty, None)
+        assert(HiveCatalogMetrics.METRIC_PARALLEL_LISTING_JOB_COUNT.getCount() == expectedNumPar)
+      }
+    }
+  }
+
+  test("PartitioningAwareFileIndex listing parallelized with large child dirs") {
+    for ((scale, expectedNumPar) <- Seq((10, 0), (50, 1))) {
+      withTempDir { dir =>
+        for (i <- 1 to scale) {
+          new File(dir, s"foo=$i.txt").mkdir()
+        }
+        HiveCatalogMetrics.reset()
+        assert(HiveCatalogMetrics.METRIC_PARALLEL_LISTING_JOB_COUNT.getCount() == 0)
+        new InMemoryFileIndex(spark, Seq(new Path(dir.getCanonicalPath)), Map.empty, None)
+        assert(HiveCatalogMetrics.METRIC_PARALLEL_LISTING_JOB_COUNT.getCount() == expectedNumPar)
+      }
+    }
+  }
+
+  test("PartitioningAwareFileIndex listing parallelized with large, deeply nested child dirs") {
+    for ((scale, expectedNumPar) <- Seq((10, 0), (50, 4))) {
+      withTempDir { dir =>
+        for (i <- 1 to 2) {
+          val subdirA = new File(dir, s"a=$i")
+          subdirA.mkdir()
+          for (j <- 1 to 2) {
+            val subdirB = new File(subdirA, s"b=$j")
+            subdirB.mkdir()
+            for (k <- 1 to scale) {
+              new File(subdirB, s"foo=$k.txt").mkdir()
+            }
+          }
+        }
+        HiveCatalogMetrics.reset()
+        assert(HiveCatalogMetrics.METRIC_PARALLEL_LISTING_JOB_COUNT.getCount() == 0)
+        new InMemoryFileIndex(spark, Seq(new Path(dir.getCanonicalPath)), Map.empty, None)
+        assert(HiveCatalogMetrics.METRIC_PARALLEL_LISTING_JOB_COUNT.getCount() == expectedNumPar)
+      }
+    }
+  }
+
   test("PartitioningAwareFileIndex - file filtering") {
     assert(!PartitioningAwareFileIndex.shouldFilterOut("abcd"))
     assert(PartitioningAwareFileIndex.shouldFilterOut(".ab"))