-
- Downloads
[SPARK-8890] [SQL] Fallback on sorting when writing many dynamic partitions
Previously, we would open a new file for each new dynamic written out using `HadoopFsRelation`. For formats like parquet this is very costly due to the buffers required to get good compression. In this PR I refactor the code allowing us to fall back on an external sort when many partitions are seen. As such each task will open no more than `spark.sql.sources.maxFiles` files. I also did the following cleanup: - Instead of keying the file HashMap on an expensive to compute string representation of the partition, we now use a fairly cheap UnsafeProjection that avoids heap allocations. - The control flow for instantiating and invoking a writer container has been simplified. Now instead of switching in two places based on the use of partitioning, the specific writer container must implement a single method `writeRows` that is invoked using `runJob`. - `InternalOutputWriter` has been removed. Instead we have a `private[sql]` method `writeInternal` that converts and calls the public method. This method can be overridden by internal datasources to avoid the conversion. This change remove a lot of code duplication and per-row `asInstanceOf` checks. - `commands.scala` has been split up. Author: Michael Armbrust <michael@databricks.com> Closes #8010 from marmbrus/fsWriting and squashes the following commits: 00804fe [Michael Armbrust] use shuffleMemoryManager.pageSizeBytes 775cc49 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into fsWriting 17b690e [Michael Armbrust] remove comment 40f0372 [Michael Armbrust] address comments f5675bd [Michael Armbrust] char -> string 7e2d0a4 [Michael Armbrust] make sure we close current writer 8100100 [Michael Armbrust] delete empty commands.scala 71cc717 [Michael Armbrust] update comment 8ec75ac [Michael Armbrust] [SPARK-8890][SQL] Fallback on sorting when writing many dynamic partitions
Showing
- sql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala 6 additions, 2 deletionssql/core/src/main/scala/org/apache/spark/sql/SQLConf.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoDataSource.scala 64 additions, 0 deletions...park/sql/execution/datasources/InsertIntoDataSource.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelation.scala 165 additions, 0 deletions...ql/execution/datasources/InsertIntoHadoopFsRelation.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/WriterContainer.scala 404 additions, 0 deletions...che/spark/sql/execution/datasources/WriterContainer.scala
- sql/core/src/main/scala/org/apache/spark/sql/json/JSONRelation.scala 4 additions, 2 deletions...c/main/scala/org/apache/spark/sql/json/JSONRelation.scala
- sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala 4 additions, 2 deletions.../scala/org/apache/spark/sql/parquet/ParquetRelation.scala
- sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala 8 additions, 9 deletions.../main/scala/org/apache/spark/sql/sources/interfaces.scala
- sql/core/src/test/scala/org/apache/spark/sql/sources/PartitionedWriteSuite.scala 56 additions, 0 deletions.../org/apache/spark/sql/sources/PartitionedWriteSuite.scala
- sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala 4 additions, 2 deletions...ain/scala/org/apache/spark/sql/hive/orc/OrcRelation.scala
Loading
Please register or sign in to comment