-
- Downloads
[SPARK-8125] [SQL] Accelerates Parquet schema merging and partition discovery
This PR tries to accelerate Parquet schema discovery and `HadoopFsRelation` partition discovery. The acceleration is done by the following means: - Turning off schema merging by default Schema merging is not the most common case, but requires reading footers of all Parquet part-files and can be very slow. - Avoiding `FileSystem.globStatus()` call when possible `FileSystem.globStatus()` may issue multiple synchronous RPC calls, and can be very slow (esp. on S3). This PR adds `SparkHadoopUtil.globPathIfNecessary()`, which only issues RPC calls when the path contain glob-pattern specific character(s) (`{}[]*?\`). This is especially useful when converting a metastore Parquet table with lots of partitions, since Spark SQL adds all partition directories as the input paths, and currently we do a `globStatus` call on each input path sequentially. - Listing leaf files in parallel when the number of input paths exceeds a threshold Listing leaf files is required by partition discovery. Currently it is done on driver side, and can be slow when there are lots of (nested) directories, since each `FileSystem.listStatus()` call issues an RPC. In this PR, we list leaf files in a BFS style, and resort to a Spark job once we found that the number of directories need to be listed exceed a threshold. The threshold is controlled by `SQLConf` option `spark.sql.sources.parallelPartitionDiscovery.threshold`, which defaults to 32. - Discovering Parquet schema in parallel Currently, schema merging is also done on driver side, and needs to read footers of all part-files. This PR uses a Spark job to do schema merging. Together with task side metadata reading in Parquet 1.7.0, we never read any footers on driver side now. Author: Cheng Lian <lian@databricks.com> Closes #7396 from liancheng/accel-parquet and squashes the following commits: 5598efc [Cheng Lian] Uses ParquetInputFormat[InternalRow] instead of ParquetInputFormat[Row] ff32cd0 [Cheng Lian] Excludes directories while listing leaf files 3c580f1 [Cheng Lian] Fixes test failure caused by making "mergeSchema" default to "false" b1646aa [Cheng Lian] Should allow empty input paths 32e5f0d [Cheng Lian] Moves schema merging to executor side
Showing
- core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala 8 additions, 0 deletions.../main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala
- sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala 9 additions, 3 deletions...src/main/scala/org/apache/spark/sql/DataFrameReader.scala
- sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala 9 additions, 1 deletionsql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
- sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala 1 addition, 13 deletions...org/apache/spark/sql/parquet/ParquetTableOperations.scala
- sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala 114 additions, 44 deletions.../main/scala/org/apache/spark/sql/parquet/newParquet.scala
- sql/core/src/main/scala/org/apache/spark/sql/sources/ddl.scala 6 additions, 2 deletions...ore/src/main/scala/org/apache/spark/sql/sources/ddl.scala
- sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala 94 additions, 26 deletions.../main/scala/org/apache/spark/sql/sources/interfaces.scala
- sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetPartitionDiscoverySuite.scala 17 additions, 1 deletion...he/spark/sql/parquet/ParquetPartitionDiscoverySuite.scala
Loading
Please register or sign in to comment