[SPARK-15307][SQL] speed up listing files for data source
## What changes were proposed in this pull request?

Currently, listing files is very slow when there are thousands of files, especially on the local file system, because:

1) FileStatus.getPermission() is very slow on the local file system: it launches a subprocess and parses its stdout.
2) Creating a JobConf is very expensive (ClassUtil.findContainingJar() is slow).

This PR improves this by:

1) Using another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission(); the permissions are not used for data sources.
2) Creating the JobConf only once within a task.

## How was this patch tested?

Manually tested on a partitioned table with 1828 partitions: the time to load the table decreased from 22 seconds to 1.6 seconds (most of the time is now spent merging schemas).

Author: Davies Liu <davies@databricks.com>

Closes #13094 from davies/listing.
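The two changes described above can be sketched roughly as follows. This is a minimal illustration, not the code from the patch: the object `ListingSketch` and the helper `listFiles` are hypothetical names, it only lists files directly under each directory, and it assumes the Hadoop client libraries (including `org.apache.hadoop.mapred`) are on the classpath.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, LocatedFileStatus, Path}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf}

object ListingSketch {
  /** Hypothetical helper: lists the files directly under each directory in `dirs`. */
  def listFiles(dirs: Seq[Path], hadoopConf: Configuration): Seq[LocatedFileStatus] = {
    // Change 2: build the (expensive) JobConf once per call, not once per directory.
    val jobConf = new JobConf(hadoopConf, getClass)
    // getInputPathFilter returns null when no filter class is configured.
    val pathFilter = FileInputFormat.getInputPathFilter(jobConf)

    dirs.flatMap { dir =>
      val fs = dir.getFileSystem(hadoopConf)
      val statuses: Array[FileStatus] =
        if (pathFilter != null) fs.listStatus(dir, pathFilter) else fs.listStatus(dir)

      statuses.filter(_.isFile).map { f =>
        val locations = fs.getFileBlockLocations(f, 0, f.getLen)
        // Change 1: use the field-by-field constructor instead of
        // new LocatedFileStatus(f, locations), so permission, owner and group
        // are never read from `f`. They are passed as null because this
        // sketch, like a data source, does not use them.
        new LocatedFileStatus(
          f.getLen, f.isDirectory, f.getReplication, f.getBlockSize,
          f.getModificationTime, 0, null, null, null, null, f.getPath, locations)
      }.toSeq
    }
  }
}
```

A caller would pass all directories handled by one task in a single call, for example `ListingSketch.listFiles(partitionDirs, new Configuration())`, so that the `JobConf` cost is paid once for the whole batch.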
Showing
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala (5 additions, 4 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala (28 additions, 10 deletions)