[SPARK-19118][SQL] Percentile support for frequency distribution table
## What changes were proposed in this pull request?

I have a frequency distribution table with the following entries:

    Age, No of person
    21, 10
    22, 15
    23, 18
    ..
    ..
    30, 14

It is common to have data in frequency distribution format and to need the percentile or median from it. With the current implementation, finding the percentile of such data is difficult and complex. Therefore I am proposing an enhancement to the current Percentile and Approx Percentile implementations to take a frequency distribution column into consideration.

## How was this patch tested?

1) Enhanced /sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileSuite.scala to cover the additional functionality.
2) Ran a performance benchmark with 20 million rows in a local environment and did not see any performance degradation.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: gagan taneja <tanejagagan@gagans-MacBook-Pro.local>

Closes #16497 from tanejagagan/branch-18940.
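To illustrate what "taking a frequency distribution column into consideration" means, here is a minimal sketch in plain Scala. It is not the code from this patch: the object `FrequencyPercentile` and its `percentile` method are hypothetical names chosen for illustration, and the sketch assumes an exact percentile with linear interpolation between neighbouring ranks. Each `(value, frequency)` pair is treated as if the value appeared `frequency` times, without ever expanding the rows.

```scala
// Hypothetical helper, for illustration only; not part of Spark or of this patch.
object FrequencyPercentile {

  /** Exact percentile of a frequency distribution.
    *
    * @param freq pairs of (value, number of occurrences)
    * @param p    requested percentage in [0, 1], e.g. 0.5 for the median
    * @return None if there is no row with a positive frequency
    */
  def percentile(freq: Seq[(Double, Long)], p: Double): Option[Double] = {
    require(p >= 0.0 && p <= 1.0, "percentage must be in [0, 1]")

    // Ignore rows that contribute no observations and order by value.
    val sorted = freq.filter(_._2 > 0).sortBy(_._1)
    if (sorted.isEmpty) {
      None
    } else {
      // Running totals: cumulative(i) = number of observations <= sorted(i).
      val cumulative = sorted.scanLeft(0L)(_ + _._2).tail
      val total = cumulative.last

      // Zero-based fractional rank of the requested percentile.
      val position = (total - 1) * p
      val lower = math.floor(position).toLong
      val higher = math.ceil(position).toLong

      // Value found at a zero-based rank in the (virtual) expanded data.
      def valueAt(rank: Long): Double =
        sorted(cumulative.indexWhere(_ > rank))._1

      val lo = valueAt(lower)
      val hi = valueAt(higher)

      // Interpolate linearly when the rank falls between two observations.
      Some(
        if (lower == higher) lo
        else (higher - position) * lo + (position - lower) * hi
      )
    }
  }
}
```

Using only the first three rows of the table above, `FrequencyPercentile.percentile(Seq((21.0, 10L), (22.0, 15L), (23.0, 18L)), 0.5)` gives the median age of those 43 people. A linear scan over the running totals keeps the sketch short; a binary search would be the natural choice for large distributions.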
Showing 2 changed files:
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/Percentile.scala (34 additions, 13 deletions)
- sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/PercentileSuite.scala (107 additions, 42 deletions)