[SPARK-20703][SQL] Associate metrics with data writes onto DataFrameWriter operations
## What changes were proposed in this pull request?

Right now in the UI, after SPARK-20213, we can show the operations that write data out. However, there is no way to associate metrics with data writes. We should show the relevant metrics on those operations.

#### Supported commands

This change supports updating metrics for the file-based data writing commands `InsertIntoHadoopFsRelationCommand` and `InsertIntoHiveTable`.

Supported metrics:

* number of written files
* number of dynamic partitions
* total bytes of written data
* total number of output rows
* average time to write data out (ms)
* (TODO) min/med/max number of output rows per file/partition
* (TODO) min/med/max bytes of written data per file/partition

#### Commands not supported

`InsertIntoDataSourceCommand`, `SaveIntoDataSourceCommand`: these two commands use the DataSource APIs to write data out, i.e., the writing logic is delegated to the DataSource implementations, such as `InsertableRelation.insert` and `CreatableRelationProvider.createRelation`, so we cannot obtain metrics from the delegated methods for now.

`CreateHiveTableAsSelectCommand`, `CreateDataSourceTableAsSelectCommand`: these two commands invoke other commands to write data out, and the invoked commands can even write to non-file-based data sources. We leave them as a future TODO.

#### How to update metrics of writing files out

A `RunnableCommand` that wants to update metrics needs to override its `metrics` and provide that metrics data structure to `ExecutedCommandExec`. The metrics are prepared during the execution of `FileFormatWriter`: a callback function passed to `FileFormatWriter` accepts the metrics and updates them accordingly. The metrics-updating function lives in `RunnableCommand`; at runtime it is bound to the Spark context and the `metrics` of `ExecutedCommandExec`, then passed to `FileFormatWriter`. (A simplified sketch of this pattern appears after the commit message below.)

## How was this patch tested?

Updated unit tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #18159 from viirya/SPARK-20703-2.
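To make the callback mechanism concrete, here is a minimal, self-contained Scala sketch of the pattern the commit describes: a command exposes a map of named metrics, and the writer reports per-task summaries back through a callback that aggregates them into those metrics. The names `SimpleMetric`, `WriteSummary`, `FileWritingCommand`, and `updateWritingMetrics` are hypothetical stand-ins for illustration only; Spark's actual implementation uses `SQLMetric` accumulators and posts driver-side updates to the UI.

```scala
import java.util.concurrent.atomic.AtomicLong

// Driver-side accumulator-like metric. Spark's real counterpart is
// SQLMetric (an AccumulatorV2); SimpleMetric is a hypothetical stand-in.
final class SimpleMetric(val name: String) {
  private val value = new AtomicLong(0L)
  def add(v: Long): Unit = { value.addAndGet(v); () }
  def get: Long = value.get()
}

// Per-task summary the writer reports back, analogous to the basic write
// statistics collected while writing files out.
final case class WriteSummary(
    numFiles: Long,
    numPartitions: Long,
    numBytes: Long,
    numRows: Long,
    writeTimeMs: Long)

// A command that wants write metrics exposes a `metrics` map and an
// updating callback, mirroring the `metrics` override on RunnableCommand.
trait FileWritingCommand {
  val metrics: Map[String, SimpleMetric] = Map(
    "numFiles"       -> new SimpleMetric("number of written files"),
    "numParts"       -> new SimpleMetric("number of dynamic partitions"),
    "numOutputBytes" -> new SimpleMetric("bytes of written output"),
    "numOutputRows"  -> new SimpleMetric("number of output rows"),
    "avgTime"        -> new SimpleMetric("average writing time (ms)"))

  // Invoked once the write tasks finish, in the same spirit as the
  // callback passed into FileFormatWriter: aggregate per-task summaries
  // into the command's metrics.
  def updateWritingMetrics(summaries: Seq[WriteSummary]): Unit = {
    var totalTimeMs = 0L
    summaries.foreach { s =>
      metrics("numFiles").add(s.numFiles)
      metrics("numParts").add(s.numPartitions)
      metrics("numOutputBytes").add(s.numBytes)
      metrics("numOutputRows").add(s.numRows)
      totalTimeMs += s.writeTimeMs
    }
    if (summaries.nonEmpty) {
      metrics("avgTime").add(totalTimeMs / summaries.size)
    }
    // In Spark, the driver would then post these values to the UI for the
    // current SQL execution id.
  }
}

// Minimal usage: two finished write tasks feeding the metrics.
object MetricsDemo extends App {
  val cmd = new FileWritingCommand {}
  cmd.updateWritingMetrics(Seq(
    WriteSummary(numFiles = 2, numPartitions = 1, numBytes = 4096, numRows = 100, writeTimeMs = 30),
    WriteSummary(numFiles = 3, numPartitions = 2, numBytes = 8192, numRows = 200, writeTimeMs = 50)))
  cmd.metrics.values.foreach(m => println(s"${m.name}: ${m.get}"))
}
```

The key design point sketched here is the inversion of control: the writer stays agnostic of which command invoked it and simply feeds summaries to whatever callback it was given, so any command that overrides `metrics` gets UI-visible write statistics for free.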
Showing 8 changed files:

- core/src/main/scala/org/apache/spark/util/Utils.scala (9 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/command/DataWritingCommand.scala (75 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/command/commands.scala (12 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileFormatWriter.scala (101 additions, 20 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/InsertIntoHadoopFsRelationCommand.scala (14 additions, 4 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/sources/PartitionedWriteSuite.scala (7 additions, 14 deletions)
- sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/InsertIntoHiveTable.scala (5 additions, 3 deletions)
- sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLMetricsSuite.scala (139 additions, 0 deletions)