    [SPARK-2871] [PySpark] add histgram() API · 3cedc4f4
    Davies Liu authored
    RDD.histogram(buckets)
    
            Compute a histogram using the provided buckets. The buckets
            are all open to the right except for the last which is closed.
            e.g. [1,10,20,50] means the buckets are [1,10) [10,20) [20,50],
            which means 1<=x<10, 10<=x<20, 20<=x<=50. And on the input of 1
            and 50 we would have a histogram of 1,0,1.
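
The bucket rule above can be sketched in plain Python, without Spark. This is only an illustration: `local_histogram` is a hypothetical helper, not the code in this commit, and it assumes that values outside the overall range are simply not counted.

```python
from bisect import bisect_right

def local_histogram(values, buckets):
    """Count values into buckets that are half-open, except the closed last one."""
    counts = [0] * (len(buckets) - 1)
    lo, hi = buckets[0], buckets[-1]
    for x in values:
        if x < lo or x > hi:
            continue  # assumption: values outside [lo, hi] are skipped
        # bisect_right counts the boundaries <= x; subtract 1 for the bucket index
        i = bisect_right(buckets, x) - 1
        if i == len(counts):
            i -= 1  # x equals the last boundary: it belongs to the closed last bucket
        counts[i] += 1
    return counts

print(local_histogram([1, 50], [1, 10, 20, 50]))  # [1, 0, 1]
```

Because only comparisons are used, the same sketch also works for string boundaries such as ("a", "b", "c").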
    
            If your histogram is evenly spaced (e.g. [0, 10, 20, 30]),
            this can be switched from an O(log n) insertion to O(1) per
            element (where n = # buckets).
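
A minimal sketch of why evenly spaced buckets allow an O(1) lookup: the index follows from one subtraction and one division, with no search over the boundaries. The function name and parameters are hypothetical, and the caller is assumed to have already dropped values outside the overall range.

```python
def even_bucket_index(x, minv, inc, num_buckets):
    """O(1) bucket lookup for evenly spaced buckets of width inc starting at minv."""
    i = int((x - minv) / inc)
    # The maximum lands exactly on the last boundary; fold it into the closed last bucket.
    return min(i, num_buckets - 1)

# Buckets [0, 10, 20, 30]: minv = 0, inc = 10, three buckets
print([even_bucket_index(x, 0, 10, 3) for x in (0, 9, 10, 29, 30)])  # [0, 0, 1, 2, 2]
```

For unevenly spaced buckets no such formula exists, so each element needs a binary search over the boundaries, which is where the O(log n) cost comes from.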
    
            Buckets must be sorted, must not contain any duplicates, and
            must have at least two elements.
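
These three constraints could be checked up front along the following lines. `check_buckets` and its error messages are hypothetical and only illustrate the rules; they are not taken from the commit.

```python
def check_buckets(buckets):
    """Raise ValueError unless buckets holds at least two sorted, distinct boundaries."""
    if len(buckets) < 2:
        raise ValueError("buckets should have at least two elements")
    if list(buckets) != sorted(buckets):
        raise ValueError("buckets should be sorted")
    if len(set(buckets)) != len(buckets):
        raise ValueError("buckets should not contain duplicates")
    return buckets

check_buckets([0, 5, 25, 50])    # passes
check_buckets(("a", "b", "c"))   # passes, strings are comparable too
```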
    
            If `buckets` is a number, it will generate buckets that are
            evenly spaced between the minimum and maximum of the RDD. For
            example, if the min value is 0 and the max is 100, given buckets
            as 2, the resulting buckets will be [0,50) [50,100]. `buckets` must
            be at least 1. If the RDD contains infinity or NaN, an exception
            is thrown. If the elements in the RDD do not vary (max == min),
            a single bucket is always returned.
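
One way to derive the boundaries from a bucket count is sketched below. `make_buckets` is a hypothetical helper; where exactly the real implementation rejects infinity and NaN is an assumption here, and Python 3 true division is used, so the boundaries come out as floats.

```python
from math import isinf, isnan

def make_buckets(minv, maxv, n):
    """Return n + 1 evenly spaced boundaries covering [minv, maxv]."""
    if n < 1:
        raise ValueError("number of buckets must be at least 1")
    if any(isinf(v) or isnan(v) for v in (minv, maxv)):
        raise ValueError("cannot generate buckets from infinity or NaN")
    if minv == maxv:
        return [minv, maxv]  # elements do not vary: a single bucket
    inc = (maxv - minv) / n
    boundaries = [minv + i * inc for i in range(n)]
    boundaries.append(maxv)  # reuse the exact maximum to avoid floating-point drift
    return boundaries

print(make_buckets(0, 100, 2))  # [0.0, 50.0, 100]
```

Appending `maxv` itself rather than `minv + n * inc` keeps the last boundary equal to the true maximum, so the largest element always falls into the closed last bucket.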
    
            It will return a tuple of buckets and histogram.
    
            >>> rdd = sc.parallelize(range(51))
            >>> rdd.histogram(2)
            ([0, 25, 50], [25, 26])
            >>> rdd.histogram([0, 5, 25, 50])
            ([0, 5, 25, 50], [5, 20, 26])
            >>> rdd.histogram([0, 15, 30, 45, 60])  # evenly spaced buckets
            ([0, 15, 30, 45, 60], [15, 15, 15, 6])
            >>> rdd = sc.parallelize(["ab", "ac", "b", "bd", "ef"])
            >>> rdd.histogram(("a", "b", "c"))
            (('a', 'b', 'c'), [2, 2])
    
    closes #122, it's duplicated.
    
    Author: Davies Liu <davies.liu@gmail.com>
    
    Closes #2091 from davies/histgram and squashes the following commits:
    
    a322f8a [Davies Liu] fix deprecation of e.message
    84e85fa [Davies Liu] remove evenBuckets, add more tests (including str)
    d9a0722 [Davies Liu] address comments
    0e18a2d [Davies Liu] add histgram() API