[SPARK-19634][ML] Multivariate summarizer - dataframes API
## What changes were proposed in this pull request?

This patch adds a DataFrame API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of `MultivariateOnlineSummarizer`, it also allows the user to select a subset of the metrics.

## How was this patch tested?

Test cases added.

## Performance

Resolves several performance issues raised in #17419; further optimization is pending on the SQL team's work. One of the SQL-layer performance issues related to this feature has been resolved in #18712, thanks to liancheng and cloud-fan.

### Performance data

Tested on my laptop using 2 partitions (tries out = 20, warm up = 10). Results are in records per millisecond (higher is better).

Vector size/records number | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000
----|------|----|---|----|----
Dataframe | 15149 | 7441 | 2118 | 224 | 21
RDD from Dataframe | 4992 | 4440 | 2328 | 320 | 33
raw RDD | 53931 | 20683 | 3966 | 528 | 53

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.
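To illustrate the idea of a summarizer that processes records one at a time, merges partial results across partitions, and returns only the metrics the caller selects, here is a minimal Python sketch. It is not the code added by this patch: the class `OnlineSummary` and its methods `add`, `merge`, and `metrics` are hypothetical names chosen for illustration, and the variance update uses the standard Welford/Chan online formulas, which may differ from what `Summarizer.scala` does internally.

```python
class OnlineSummary:
    """Hypothetical mergeable per-dimension summary (illustration only)."""

    def __init__(self, dim):
        self.dim = dim
        self.count = 0
        self.mean = [0.0] * dim
        self.m2 = [0.0] * dim  # running sum of squared deviations from the mean

    def add(self, vec):
        """Fold one record into the summary (Welford's update)."""
        self.count += 1
        for i, x in enumerate(vec):
            delta = x - self.mean[i]
            self.mean[i] += delta / self.count
            self.m2[i] += delta * (x - self.mean[i])
        return self

    def merge(self, other):
        """Combine with a summary built on another partition (Chan's formula)."""
        if other.count == 0:
            return self
        total = self.count + other.count
        for i in range(self.dim):
            delta = other.mean[i] - self.mean[i]
            self.m2[i] += other.m2[i] + delta * delta * self.count * other.count / total
            self.mean[i] += delta * other.count / total
        self.count = total
        return self

    def metrics(self, *names):
        """Return only the requested metrics, keyed by name."""
        available = {
            "count": lambda: self.count,
            "mean": lambda: list(self.mean),
            # Sample variance; defined as 0.0 when fewer than two records were seen.
            "variance": lambda: [
                m / (self.count - 1) if self.count > 1 else 0.0 for m in self.m2
            ],
        }
        return {name: available[name]() for name in names}


# Usage: summarize two "partitions" independently, then merge them.
part_a = OnlineSummary(2).add([1.0, 10.0]).add([2.0, 20.0])
part_b = OnlineSummary(2).add([3.0, 30.0]).add([4.0, 40.0])
print(part_a.merge(part_b).metrics("mean", "variance"))
```

Because each partition's state is a fixed-size triple (count, mean, sum of squared deviations), partial summaries can be combined in any order, which is what makes this style of computation suitable for a distributed aggregate.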
Showing
- mllib/src/main/scala/org/apache/spark/ml/linalg/VectorUDT.scala (13 additions, 11 deletions)
- mllib/src/main/scala/org/apache/spark/ml/stat/Summarizer.scala (596 additions, 0 deletions)
- mllib/src/test/scala/org/apache/spark/ml/stat/SummarizerSuite.scala (582 additions, 0 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Projection.scala (6 additions, 0 deletions)
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (6 additions, 0 deletions)