-
- Downloads
[SPARK-13664][SQL] Add a strategy for planning partitioned and bucketed scans of files
This PR adds a new strategy, `FileSourceStrategy`, that can be used for planning scans of collections of files that might be partitioned or bucketed. Compared with the existing planning logic in `DataSourceStrategy` this version has the following desirable properties: - It removes the need to have `RDD`, `broadcastedHadoopConf` and other distributed concerns in the public API of `org.apache.spark.sql.sources.FileFormat` - Partition column appending is delegated to the format to avoid an extra copy / devectorization when appending partition columns - It minimizes the amount of data that is shipped to each executor (i.e. it does not send the whole list of files to every worker in the form of a hadoop conf) - it natively supports bucketing files into partitions, and thus does not require coalescing / creating a `UnionRDD` with the correct partitioning. - Small files are automatically coalesced into fewer tasks using an approximate bin-packing algorithm. Currently only a testing source is planned / tested using this strategy. In follow-up PRs we will port the existing formats to this API. A stub for `FileScanRDD` is also added, but most methods remain unimplemented. Other minor cleanups: - partition pruning is pushed into `FileCatalog` so both the new and old code paths can use this logic. This will also allow future implementations to use indexes or other tricks (i.e. a MySQL metastore) - The partitions from the `FileCatalog` now propagate information about file sizes all the way up to the planner so we can intelligently spread files out. - `Array` -> `Seq` in some internal APIs to avoid unnecessary `toArray` calls - Rename `Partition` to `PartitionDirectory` to differentiate partitions used earlier in pruning from those where we have already enumerated the files and their sizes. Author: Michael Armbrust <michael@databricks.com> Closes #11646 from marmbrus/fileStrategy.
Showing
- mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala 1 addition, 1 deletion...la/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/CatalystConf.scala 14 additions, 0 deletions...in/scala/org/apache/spark/sql/catalyst/CatalystConf.scala
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala 18 additions, 0 deletions...apache/spark/sql/catalyst/plans/logical/LogicalPlan.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/ExistingRDD.scala 1 addition, 1 deletion...in/scala/org/apache/spark/sql/execution/ExistingRDD.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanner.scala 2 additions, 1 deletion...n/scala/org/apache/spark/sql/execution/SparkPlanner.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala 17 additions, 15 deletions...g/apache/spark/sql/execution/datasources/DataSource.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSourceStrategy.scala 9 additions, 37 deletions.../spark/sql/execution/datasources/DataSourceStrategy.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala 57 additions, 0 deletions.../apache/spark/sql/execution/datasources/FileScanRDD.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala 202 additions, 0 deletions.../spark/sql/execution/datasources/FileSourceStrategy.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningUtils.scala 12 additions, 6 deletions...e/spark/sql/execution/datasources/PartitioningUtils.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVRelation.scala 1 addition, 1 deletion...che/spark/sql/execution/datasources/csv/CSVRelation.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/DefaultSource.scala 1 addition, 1 deletion...e/spark/sql/execution/datasources/csv/DefaultSource.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JSONRelation.scala 2 additions, 2 deletions...e/spark/sql/execution/datasources/json/JSONRelation.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetRelation.scala 1 addition, 1 deletion...k/sql/execution/datasources/parquet/ParquetRelation.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/DefaultSource.scala 1 addition, 1 deletion.../spark/sql/execution/datasources/text/DefaultSource.scala
- sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 8 additions, 1 deletion...rc/main/scala/org/apache/spark/sql/internal/SQLConf.scala
- sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala 103 additions, 9 deletions.../main/scala/org/apache/spark/sql/sources/interfaces.scala
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala 345 additions, 0 deletions...k/sql/execution/datasources/FileSourceStrategySuite.scala
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetPartitionDiscoverySuite.scala 1 addition, 1 deletion.../datasources/parquet/ParquetPartitionDiscoverySuite.scala
- sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala 6 additions, 6 deletions...cala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala
Loading
Please register or sign in to comment