    9c507577
    [SPARK-8598] [MLLIB] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs · 9c507577
    jose.cambronero authored
    This contribution is my original work and I license it to the project under its open source license.
    
    Author: jose.cambronero <jose.cambronero@cloudera.com>
    
    Closes #6994 from josepablocam/master and squashes the following commits:
    
    bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name
    0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md
    1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity
    a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)
    1bb44bd [jose.cambronero] style and doc changes. Factored out ks test into 2 separate tests
    2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict
    a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly
    7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info
    e760ebd [jose.cambronero] line length changes to fit style check
    3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
    9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
    1226b30 [jose.cambronero] reindent multi-line lambdas, prior interpretation of style guide was wrong on my part
    9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs
    3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity
    992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach.
    6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)
    4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below
    0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
    16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
    c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly
    f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request
    b9cff3a [jose.cambronero] made small changes to pass style check
    ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite
    4da189b [jose.cambronero] added user facing ks test functions
    c659ea1 [jose.cambronero] created KS test class
    13dfe4d [jose.cambronero] created test result class for ks test
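
For readers unfamiliar with the statistic this commit computes, the following is a minimal single-machine sketch of the 1-sample, two-sided Kolmogorov-Smirnov statistic in plain Python. It is not the code from this commit: the commit's implementation runs over sorted RDD partitions, which this sketch does not attempt to reproduce. The names `ks_statistic` and `std_normal_cdf` are hypothetical and chosen only for illustration. Checking the empirical CDF on both sides of each jump is presumably the kind of detail the "off by 1/N" fix (4b8ba61) concerns.

```python
import math


def ks_statistic(sample, cdf):
    """Return D = sup_x |F_n(x) - F(x)| for a sample and a theoretical CDF.

    Hypothetical helper for illustration; not part of any Spark API.
    """
    xs = sorted(sample)
    n = len(xs)
    if n == 0:
        raise ValueError("sample must be non-empty")
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i + 1)/n at x, so the
        # largest gap can occur just before or just after the jump.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def std_normal_cdf(x):
    """CDF of the standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))
```

Called as `ks_statistic(data, std_normal_cdf)`, this tests a sample against the standard normal, the default the commit log describes for the case where no distribution parameters are passed. Any other CDF can be supplied as a plain function.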