Commit 8b8e70eb authored by Reynold Xin

Merge pull request #73 from falaki/ApproximateDistinctCount

Approximate distinct count

Added countApproxDistinct() to RDD and countApproxDistinctByKey() to PairRDDFunctions. They approximately count the number of distinct elements in an RDD and the number of distinct values per key, respectively. Both functions use HyperLogLog from stream-lib for counting, and both take a parameter that controls the trade-off between accuracy and memory consumption. Also added Scala docs and test suites for both methods.
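
To illustrate the accuracy/memory trade-off described above, here is a minimal, self-contained HyperLogLog sketch in Python. It is not the stream-lib implementation and does not use the Spark API. The class `HyperLogLog`, the parameter name `relative_sd`, and the helper `count_approx_distinct_by_key` are hypothetical names chosen for this example, and the sizing rule assumes the usual HyperLogLog standard error of roughly 1.04 / sqrt(m) for m registers.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter (hypothetical, not stream-lib's code)."""

    def __init__(self, relative_sd=0.05):
        # Assumption: standard error is about 1.04 / sqrt(m) for m registers.
        # Choose the smallest power-of-two m that meets the requested error.
        self.p = max(4, math.ceil(math.log2((1.04 / relative_sd) ** 2)))
        self.m = 1 << self.p
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a deterministic 64-bit hash from SHA-1 of the item's repr.
        digest = hashlib.sha1(repr(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                 # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank: 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        m = self.m
        # Bias-correction constant from the HyperLogLog paper.
        if m == 16:
            alpha = 0.673
        elif m == 32:
            alpha = 0.697
        elif m == 64:
            alpha = 0.709
        else:
            alpha = 0.7213 / (1 + 1.079 / m)
        estimate = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if estimate <= 2.5 * m and zeros > 0:
            # Small-range correction: fall back to linear counting.
            estimate = m * math.log(m / zeros)
        return round(estimate)


def count_approx_distinct_by_key(pairs, relative_sd=0.05):
    """Approximate number of distinct values per key (hypothetical helper)."""
    sketches = {}
    for key, value in pairs:
        sketches.setdefault(key, HyperLogLog(relative_sd)).add(value)
    return {key: sketch.count() for key, sketch in sketches.items()}
```

A smaller `relative_sd` allocates more registers, so the estimate tightens at the cost of memory: in this sketch, `HyperLogLog(0.05)` uses 512 registers while `HyperLogLog(0.01)` uses 16384.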
parents 63b411dd bee445c9
Showing changes with 1595 additions and 233 deletions