    [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size · dc965364
    Doris Xin authored
    Implemented stratified sampling that guarantees an exact sample size using ScaSRS, with two passes over the RDD for sampling without replacement and three passes for sampling with replacement.
    
    Author: Doris Xin <doris.s.xin@gmail.com>
    Author: Xiangrui Meng <meng@databricks.com>
    
    Closes #1025 from dorx/stratified and squashes the following commits:
    
    245439e [Doris Xin] moved minSamplingRate to getUpperBound
    eaf5771 [Doris Xin] bug fixes.
    17a381b [Doris Xin] fixed a merge issue and a failed unit
    ea7d27f [Doris Xin] merge master
    b223529 [Xiangrui Meng] use approx bounds for poisson; fix poisson mean for waitlisting; add unit tests for Java
    b3013a4 [Xiangrui Meng] move math3 back to test scope
    eecee5f [Doris Xin] Merge branch 'master' into stratified
    f4c21f3 [Doris Xin] Reviewer comments
    a10e68d [Doris Xin] style fix
    a2bf756 [Doris Xin] Merge branch 'master' into stratified
    680b677 [Doris Xin] use mapPartitionWithIndex instead
    9884a9f [Doris Xin] style fix
    bbfb8c9 [Doris Xin] Merge branch 'master' into stratified
    ee9d260 [Doris Xin] addressed reviewer comments
    6b5b10b [Doris Xin] Merge branch 'master' into stratified
    254e03c [Doris Xin] minor fixes and Java API.
    4ad516b [Doris Xin] remove unused imports from PairRDDFunctions
    bd9dc6e [Doris Xin] unit bug and style violation fixed
    1fe1cff [Doris Xin] Changed fractionByKey to a map to enable arg check
    944a10c [Doris Xin] [SPARK-2145] Add lower bound on sampling rate
    0214a76 [Doris Xin] cleanUp
    90d94c0 [Doris Xin] merge master
    9e74ab5 [Doris Xin] Separated out most of the logic in sampleByKey
    7327611 [Doris Xin] merge master
    50581fc [Doris Xin] added a TODO for logging in python
    46f6c8c [Doris Xin] fixed the NPE caused by closures being cleaned before being passed into the aggregate function
    7e1a481 [Doris Xin] changed the permission on SamplingUtil
    1d413ce [Doris Xin] fixed checkstyle issues
    9ee94ee [Doris Xin] [SPARK-2082] stratified sampling in PairRDDFunctions that guarantees exact sample size
    e3fd6a6 [Doris Xin] Merge branch 'master' into takeSample
    7cab53a [Doris Xin] fixed import bug in rdd.py
    ffea61a [Doris Xin] SPARK-1939: Refactor takeSample method in RDD
    1441977 [Doris Xin] SPARK-1939 Refactor takeSample method in RDD to use ScaSRS
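
The two-pass scheme for sampling without replacement described in the commit message can be illustrated with a small, Spark-free sketch. This is not the code from this commit: the function names, the `delta` failure probability, the `ceil(fraction * count)` target size, and the threshold formulas (written from memory of the ScaSRS accept/waitlist bounds) are all assumptions made for illustration. Pass 1 counts records per key; pass 2 gives each record a uniform random number, accepts it outright below a low threshold, waitlists it below a high threshold, and then tops up each stratum from its sorted waitlist. The with-replacement variant is not covered.

```python
import math
import random
from collections import Counter, defaultdict


def scasrs_thresholds(n, s, delta=1e-4):
    """Accept/waitlist thresholds for drawing s of n items.

    Assumed ScaSRS-style bounds; delta is the tolerated failure probability.
    """
    p = s / n
    g1 = -math.log(delta) / n
    g2 = -2.0 * math.log(delta) / (3.0 * n)
    q_high = min(1.0, p + g1 + math.sqrt(g1 * g1 + 2.0 * g1 * p))
    q_low = max(0.0, p + g2 - math.sqrt(g2 * g2 + 3.0 * g2 * p))
    return q_low, q_high


def sample_by_key_exact(pairs, fractions, seed=0):
    """Return {key: sample} with exactly ceil(fraction * count) values per key.

    `pairs` is a list of (key, value) tuples; `fractions` maps every key
    to a sampling fraction in [0, 1]. Hypothetical helper, not Spark's API.
    """
    if any(not 0.0 <= f <= 1.0 for f in fractions.values()):
        raise ValueError("fractions must lie in [0, 1]")
    rng = random.Random(seed)

    # Pass 1: count records per key and derive the exact target sizes.
    counts = Counter(k for k, _ in pairs)
    targets = {k: math.ceil(fractions[k] * n) for k, n in counts.items()}
    bounds = {k: scasrs_thresholds(counts[k], s)
              for k, s in targets.items() if s > 0}

    # Pass 2: accept, waitlist, or drop each record based on a random number.
    accepted = defaultdict(list)
    waitlist = defaultdict(list)
    for k, v in pairs:
        if targets[k] == 0:
            continue
        q_low, q_high = bounds[k]
        u = rng.random()
        if u < q_low:
            accepted[k].append(v)
        elif u < q_high:
            waitlist[k].append((u, v))

    result = {}
    for k, s in targets.items():
        acc, wait = accepted[k], waitlist[k]
        if len(acc) > s or len(acc) + len(wait) < s:
            # Rare failure of the bounds: redo this stratum with a full sort.
            keyed = [(rng.random(), v) for kk, v in pairs if kk == k]
            keyed.sort(key=lambda t: t[0])
            result[k] = [v for _, v in keyed[:s]]
        else:
            # Fill the remaining slots with the smallest waitlisted numbers.
            wait.sort(key=lambda t: t[0])
            result[k] = acc + [v for _, v in wait[: s - len(acc)]]
    return result
```

For example, `sample_by_key_exact(pairs, {"a": 0.5, "b": 0.25})` on 100 records keyed `"a"` and 40 keyed `"b"` returns 50 and 10 values respectively. The point of the waitlist is that only the small band of records between the two thresholds has to be sorted, rather than the whole stratum.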