[SPARK-18679][SQL] Fix regression in file listing performance for non-catalog tables
## What changes were proposed in this pull request?

In Spark 2.1 ListingFileCatalog was significantly refactored (and renamed to InMemoryFileIndex). This introduced a regression where parallelism could only be introduced at the very top of the tree. However, in many cases (e.g. `spark.read.parquet(topLevelDir)`), the top of the tree is only a single directory.

This PR simplifies and fixes the parallel recursive listing code to allow parallelism to be introduced at any level during recursive descent (though note that once we decide to list a sub-tree in parallel, the sub-tree is listed in serial on executors).

cc mallman cloud-fan

## How was this patch tested?

Checked metrics in unit tests.

Author: Eric Liang <ekl@databricks.com>

Closes #16112 from ericl/spark-18679.
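The listing strategy described above can be illustrated with a minimal sketch. This is not the code from the patch: it uses `java.io.File` and Scala `Future`s on the local filesystem in place of Hadoop paths and a Spark job, and the names `ParallelListingSketch`, `bulkList`, `listLeaves` and the `parallelThreshold` value are all hypothetical. It only shows the control flow: stay serial while the fan-out is small, switch to parallel at whatever depth the fan-out becomes large, and list each sub-tree serially once it has been handed to a parallel task.

```scala
import java.io.File
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

object ParallelListingSketch {
  // Illustrative threshold (an assumption, not read from any Spark config).
  val parallelThreshold = 32

  /** Lists all leaf files under `dirs`, going parallel once the fan-out is large enough. */
  def bulkList(dirs: Seq[File], threshold: Int = parallelThreshold): Seq[File] =
    if (dirs.size < threshold) {
      // Few directories at this level: list them serially here, but let
      // deeper levels re-check the threshold and go parallel if they fan out.
      dirs.flatMap(d => listLeaves(d, allowParallel = true, threshold))
    } else {
      // Many directories: one task per sub-tree. Inside a task the sub-tree
      // is listed serially, mirroring "listed in serial on executors".
      val tasks = dirs.map(d => Future(listLeaves(d, allowParallel = false, threshold)))
      Await.result(Future.sequence(tasks), Duration.Inf).flatten
    }

  private def listLeaves(dir: File, allowParallel: Boolean, threshold: Int): Seq[File] = {
    // listFiles() returns null for unreadable or missing directories.
    val children = Option(dir.listFiles()).map(_.toSeq).getOrElse(Seq.empty)
    val (subDirs, files) = children.partition(_.isDirectory)
    val nested =
      if (allowParallel) bulkList(subDirs, threshold)
      else subDirs.flatMap(d => listLeaves(d, allowParallel = false, threshold))
    files ++ nested
  }
}
```

With a single top-level directory, `bulkList(Seq(topLevelDir))` starts out serial, and the threshold check is repeated on the sub-directories found at each level, so parallelism can still kick in below the root. That is the case the regression affected.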
Showing 3 changed files:
- core/src/main/scala/org/apache/spark/metrics/source/StaticSources.scala: 8 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala: 45 additions, 34 deletions
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileIndexSuite.scala: 53 additions, 0 deletions