[SPARK-18559][SQL] Fix HLL++ with small relative error
## What changes were proposed in this pull request?

In `HyperLogLogPlusPlus`, if the relative error is so small that `p >= 19`, the lookup `THRESHOLDS(p-4)` throws an `ArrayIndexOutOfBoundsException`. We should check `p` and, when `p >= 19`, fall back to the original HLL result and use the small-range correction from the original HLL algorithm.

The PR also fixes the upper bound reported in the log message in `require()`. The upper bound is computed by:

```
val relativeSD = 1.106d / Math.pow(Math.E, p * Math.log(2.0d) / 2.0d)
```

which is derived from the equation for computing `p`:

```
val p = 2.0d * Math.log(1.106d / relativeSD) / Math.log(2.0d)
```

## How was this patch tested?

Added test cases for:

1. checking the validity of the parameter `relativeSD`
2. estimation with a smaller relative error so that `p >= 19`

Author: Zhenhua Wang <wzh_zju@163.com>
Author: wangzhenhua <wangzhenhua@huawei.com>

Closes #15990 from wzhfy/hllppRsd.
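The relationship between `relativeSD` and `p` described in the commit message can be sketched as follows. This is a minimal, standalone illustration, not the code from the patch: the object and method names (`HllPrecisionSketch`, `precisionFor`, `relativeSDFor`, `hasThreshold`, `smallRangeCorrected`) are hypothetical, rounding `p` up to an integer is an assumption, the valid range of `THRESHOLDS` (p from 4 to 18) is inferred from the failure at `p >= 19`, and the small-range rule shown is the linear-counting correction from the original HyperLogLog paper rather than Spark's exact implementation.

```scala
object HllPrecisionSketch {
  // Number of index bits needed for a target relative standard deviation.
  // Assumption: the real-valued result is rounded up to the next integer.
  def precisionFor(relativeSD: Double): Int =
    Math.ceil(2.0d * Math.log(1.106d / relativeSD) / Math.log(2.0d)).toInt

  // Inverse of the formula above: the relative standard deviation for a given p.
  // Math.pow(Math.E, p * ln(2) / 2) is simply 2^(p/2).
  def relativeSDFor(p: Int): Double =
    1.106d / Math.pow(Math.E, p * Math.log(2.0d) / 2.0d)

  // Assumption: the threshold table is indexed by p - 4 and only covers
  // p in [4, 18], which is why p >= 19 runs past the end of the array.
  def hasThreshold(p: Int): Boolean = p >= 4 && p <= 18

  // Small-range correction from the original HLL algorithm: when the raw
  // estimate is at most 2.5 * m and some registers are still zero,
  // use linear counting; otherwise keep the raw estimate.
  def smallRangeCorrected(rawEstimate: Double, m: Int, zeroRegisters: Int): Double =
    if (rawEstimate <= 2.5d * m && zeroRegisters > 0) {
      m * Math.log(m.toDouble / zeroRegisters)
    } else {
      rawEstimate
    }
}
```

Under these assumptions, `relativeSDFor(4)` evaluates to roughly 0.2765, which is the largest relative error compatible with the minimum of 4 index bits, and a request such as `precisionFor(0.001)` yields a `p` above 18, the case where no threshold entry exists and the fallback path is needed.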
Changed files:
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala (6 additions, 3 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala (8 additions, 1 deletion)