[SPARK-18775][SQL] Limit the max number of records written per file
## What changes were proposed in this pull request?

Currently, Spark writes a single file out per task, sometimes leading to very large files. It would be great to have an option to limit the max number of records written per file in a task, to avoid humongous files.

This patch introduces a new write config option `maxRecordsPerFile` (defaulting to the session-wide setting `spark.sql.files.maxRecordsPerFile`) that limits the maximum number of records written to a single file. A non-positive value indicates there is no limit (the same behavior as not setting the option).

## How was this patch tested?

Added test cases in PartitionedWriteSuite for both dynamic partition insert and non-dynamic partition insert.

Author: Reynold Xin <rxin@databricks.com>

Closes #16204 from rxin/SPARK-18775.
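The sketch below is not part of the patch; it is a minimal illustration of how the new knob could be exercised from user code, assuming a local `SparkSession` and a throwaway output path. The config key `spark.sql.files.maxRecordsPerFile` and the writer option `maxRecordsPerFile` come from the commit message above; everything else (app name, record counts, path) is made up for the example.

```scala
import org.apache.spark.sql.SparkSession

object MaxRecordsPerFileExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("MaxRecordsPerFileExample") // hypothetical app name
      .master("local[*]")
      .getOrCreate()

    // Session-wide default: cap files written by SQL data sources at 10,000 records each.
    spark.conf.set("spark.sql.files.maxRecordsPerFile", 10000L)

    val df = spark.range(0L, 100000L).toDF("id")

    // Per-write override: the DataFrameWriter option takes precedence over the session setting,
    // so this write produces files of at most 5,000 records each.
    df.write
      .option("maxRecordsPerFile", 5000L)
      .mode("overwrite")
      .parquet("/tmp/max-records-per-file-demo") // hypothetical output path

    spark.stop()
  }
}
```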
Showing 4 changed files with 179 additions and 39 deletions

- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala (79 additions, 30 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (17 additions, 9 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/BucketingUtilsSuite.scala (46 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/sources/PartitionedWriteSuite.scala (37 additions, 0 deletions)