[SPARK-20770][SQL] Improve ColumnStats
## What changes were proposed in this pull request?

This PR improves the implementation of `ColumnStats` by using the following approaches.

1. Declare subclasses of `ColumnStats` as `final`
2. Remove the unnecessary call of `row.isNullAt(ordinal)`
3. Remove the dependency on `GenericInternalRow`

For 1., this declaration encourages method inlining and other optimizations by the JIT compiler.

For 2., in `gatherStats()`, the previous code in subclasses of `ColumnStats` always called `row.isNullAt()` twice, while this PR calls `row.isNullAt()` only once.

For 3., `collectedStatistics()` returns `Array[Any]` instead of `GenericInternalRow`. This removes the dependency on an unnecessary package and reduces the number of allocations of `GenericInternalRow`. In addition, in the future, `gatherValueStats()`, which is specialized for each data type, can be called effectively from generated code without using the generic data structure `InternalRow`.

## How was this patch tested?

Tested by the existing test suite.

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #18002 from kiszk/SPARK-20770.
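The three approaches above can be illustrated with a minimal, Spark-free sketch. It is not the code from this PR: `Row`, `SimpleRow`, `ColumnStatsSketch`, `IntColumnStatsSketch`, and `Demo` are hypothetical stand-ins, and the 4-byte size charged per null and the ordering of the returned statistics are assumptions made only for illustration.

```scala
// Hypothetical stand-in for Spark's InternalRow, so the sketch runs without Spark.
trait Row {
  def isNullAt(ordinal: Int): Boolean
  def getInt(ordinal: Int): Int
}

final class SimpleRow(values: Array[Any]) extends Row {
  def isNullAt(ordinal: Int): Boolean = values(ordinal) == null
  def getInt(ordinal: Int): Int = values(ordinal).asInstanceOf[Int]
}

// Base class holding the counters shared by every column type.
abstract class ColumnStatsSketch {
  protected var count = 0
  protected var nullCount = 0
  protected var sizeInBytes = 0L

  def gatherStats(row: Row, ordinal: Int): Unit

  // Called for null values; no second isNullAt() check is needed here.
  protected def gatherNullStats(): Unit = {
    nullCount += 1
    sizeInBytes += 4 // assumed per-null size, for illustration only
    count += 1
  }

  // Approach 3: return a plain Array[Any] rather than a row object.
  def collectedStatistics: Array[Any]
}

// Approach 1: the concrete subclass is final, which helps the JIT inline calls.
final class IntColumnStatsSketch extends ColumnStatsSketch {
  private var upper = Int.MinValue
  private var lower = Int.MaxValue

  // Approach 2: isNullAt() is evaluated exactly once per value.
  override def gatherStats(row: Row, ordinal: Int): Unit = {
    if (!row.isNullAt(ordinal)) gatherValueStats(row.getInt(ordinal))
    else gatherNullStats()
  }

  // Type-specialized entry point that takes a primitive Int directly.
  def gatherValueStats(value: Int): Unit = {
    if (value > upper) upper = value
    if (value < lower) lower = value
    sizeInBytes += 4 // size of an Int in bytes
    count += 1
  }

  override def collectedStatistics: Array[Any] =
    Array[Any](lower, upper, nullCount, count, sizeInBytes)
}

object Demo {
  // Feeds three single-column rows (3, null, -7) through the collector.
  def run(): Array[Any] = {
    val stats = new IntColumnStatsSketch
    Seq[Any](3, null, -7).foreach { v =>
      stats.gatherStats(new SimpleRow(Array[Any](v)), 0)
    }
    stats.collectedStatistics
  }
}
```

In this sketch `gatherValueStats(value: Int)` takes a primitive directly, which is the property the description points to when it mentions calling the type-specialized method from generated code without going through `InternalRow`.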
Showing 3 changed files

- sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/ColumnStats.scala (155 additions, 86 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/InMemoryRelation.scala (2 additions, 2 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/execution/columnar/ColumnStatsSuite.scala (23 additions, 27 deletions)