Implement ApproximateCountDistinct for SparkSql
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partition, and 2) merging the HyperLogLog results from different partitions.

A simple serializer and test cases are added as well.

Author: larvaboy <larvaboy@gmail.com>

Closes #737 from larvaboy/master and squashes the following commits:

bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.
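The two-phase scheme described above can be illustrated with a small, self-contained sketch. This is not the code from this commit and it does not use stream-lib: `SimpleHll` and `TwoPhaseCount` are hypothetical names, the hash function is a generic 64-bit mixer chosen for illustration, and the register count is fixed by `p` rather than derived from a user-provided standard deviation. It only shows why per-partition sketches can be merged: each register keeps a maximum, so taking the register-wise maximum of two sketches gives the same sketch as processing all values in one pass.

```scala
// Hypothetical minimal HyperLogLog with 2^p registers (illustration only, not stream-lib).
final class SimpleHll(val p: Int = 12) {
  require(p >= 7 && p <= 16, "p must be in [7, 16]")
  private val m = 1 << p
  private val registers = new Array[Byte](m)

  // Generic 64-bit finalizer (splitmix64-style) applied to hashCode; an assumption of this sketch.
  private def hash64(value: Any): Long = {
    var z = value.hashCode.toLong + 0x9E3779B97F4A7C15L
    z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L
    z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL
    z ^ (z >>> 31)
  }

  // Record one value: the top p bits pick a register, the rest give the rank.
  def offer(value: Any): Unit = {
    val h = hash64(value)
    val idx = (h >>> (64 - p)).toInt
    val rank = math.min(java.lang.Long.numberOfLeadingZeros(h << p) + 1, 64 - p + 1)
    if (rank > registers(idx)) registers(idx) = rank.toByte
  }

  // Phase 2 building block: register-wise maximum of two sketches.
  def merge(other: SimpleHll): SimpleHll = {
    require(other.p == p, "sketches must use the same p")
    val out = new SimpleHll(p)
    var i = 0
    while (i < m) {
      out.registers(i) = math.max(registers(i), other.registers(i)).toByte
      i += 1
    }
    out
  }

  // Standard HyperLogLog estimate with the small-range (linear counting) correction.
  def cardinality: Long = {
    val alpha = 0.7213 / (1.0 + 1.079 / m)
    val sum = registers.foldLeft(0.0)((acc, r) => acc + math.pow(2.0, -r.toDouble))
    val raw = alpha * m * m / sum
    val zeros = registers.count(_ == 0)
    val est = if (raw <= 2.5 * m && zeros > 0) m * math.log(m.toDouble / zeros) else raw
    math.round(est)
  }
}

object TwoPhaseCount {
  // Phase 1: build one sketch per partition. Phase 2: merge the sketches and estimate.
  def approxCountDistinct(partitions: Seq[Seq[Any]], p: Int = 12): Long = {
    val partials = partitions.map { part =>
      val hll = new SimpleHll(p)
      part.foreach(hll.offer)
      hll
    }
    partials.foldLeft(new SimpleHll(p))(_ merge _).cardinality
  }
}
```

For example, `TwoPhaseCount.approxCountDistinct(Seq(Seq(1, 2, 3), Seq(3, 4)))` estimates the number of distinct values across both partitions. With `p = 12` the expected relative error is on the order of 1 to 2 percent for large inputs; a smaller `p` trades accuracy for memory.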
Showing 5 changed files
- core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala: 3 additions, 3 deletions
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala: 7 additions, 0 deletions
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala: 76 additions, 2 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlSerializer.scala: 17 additions, 0 deletions
- sql/core/src/test/scala/org/apache/spark/sql/SQLQuerySuite.scala: 19 additions, 2 deletions