[SPARK-14259] [SQL] Merging small files together based on the cost of opening
## What changes were proposed in this pull request?

This PR redoes the work in #12068 with a different model, which should work better for small files of different sizes.

## How was this patch tested?

Updated existing tests.

Ran a query on thousands of partitioned small files locally, with all default settings (under which the cost to open a file should be overestimated). Task durations become progressively shorter, which is good: the last few tasks are the shortest.

Author: Davies Liu <davies@databricks.com>

Closes #12095 from davies/file_cost.
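The idea of merging small files based on the cost of opening them can be illustrated with a minimal sketch. This is not the code from this patch: the names `FileInfo`, `pack_files`, `max_partition_bytes`, and `open_cost_bytes` are hypothetical, and the greedy largest-first packing is an assumption chosen for illustration. Each file is charged its length plus a fixed open cost, so many tiny files fill a partition sooner than their raw sizes alone would suggest.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class FileInfo:
    """Hypothetical stand-in for a file to be scanned."""
    path: str
    length: int  # size in bytes


def pack_files(files: List[FileInfo],
               max_partition_bytes: int,
               open_cost_bytes: int) -> List[List[FileInfo]]:
    """Greedily group files into partitions.

    Assumption for illustration: every file weighs
    `length + open_cost_bytes`, and files are visited largest first.
    A partition is closed when adding the next file would push its
    accumulated weight past `max_partition_bytes`.
    """
    partitions: List[List[FileInfo]] = []
    current: List[FileInfo] = []
    current_size = 0

    for f in sorted(files, key=lambda x: x.length, reverse=True):
        weight = f.length + open_cost_bytes
        # Close the current partition if this file does not fit.
        # A non-empty check keeps an oversized file from producing
        # an empty partition.
        if current and current_size + weight > max_partition_bytes:
            partitions.append(current)
            current = []
            current_size = 0
        current.append(f)
        current_size += weight

    if current:
        partitions.append(current)
    return partitions


if __name__ == "__main__":
    demo = [FileInfo("big", 100)] + [FileInfo(f"small{i}", 10) for i in range(4)]
    for cost in (20, 50):
        sizes = [[f.length for f in p] for p in pack_files(demo, 128, cost)]
        print(f"open cost {cost}: {sizes}")
```

In this sketch, raising the open cost spreads small files across more partitions, because each file consumes more of the partition budget. Overestimating the cost, as the test description mentions for the default settings, therefore errs toward more, smaller tasks rather than a few tasks that each open a large number of files.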
Showing 3 changed files:
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala (5 additions, 8 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (8 additions, 5 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala (8 additions, 6 deletions)