  1. Nov 30, 2016
    • Yanbo Liang's avatar
      [SPARK-18318][ML] ML, Graph 2.1 QA: API: New Scala APIs, docs · 60022bfd
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      API review for 2.1, except the ```LSH```-related classes, which are still under development.
      
      ## How was this patch tested?
      Only doc changes, no new tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16009 from yanboliang/spark-18318.
      60022bfd
    • Anthony Truchet's avatar
      [SPARK-18612][MLLIB] Delete broadcasted variable in LBFGS CostFun · c5a64d76
      Anthony Truchet authored
      ## What changes were proposed in this pull request?
      
      Fix a broadcasted variable leak occurring at each invocation of CostFun in L-BFGS.
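
A minimal sketch of the leak pattern described above, assuming a hypothetical `FakeBroadcast` class that stands in for a real broadcast variable (this is not Spark code, and `cost_fun` is an illustrative name). It shows why a cost function that broadcasts its weights on every invocation should release the broadcast before returning:

```python
class FakeBroadcast:
    """Hypothetical stand-in for a broadcast variable; counts live instances."""
    live = 0

    def __init__(self, value):
        self.value = value
        FakeBroadcast.live += 1

    def destroy(self):
        FakeBroadcast.live -= 1


def cost_fun(weights, data):
    """Squared-error loss of a linear model; broadcasts the weights on every call."""
    bc_weights = FakeBroadcast(weights)
    try:
        loss = 0.0
        for features, label in data:
            prediction = sum(w * x for w, x in zip(bc_weights.value, features))
            loss += (prediction - label) ** 2
        return loss
    finally:
        # Without this call, one broadcast would be left behind per invocation.
        bc_weights.destroy()


data = [([1.0, 2.0], 5.0), ([0.0, 1.0], 2.0)]
for _ in range(100):  # an optimizer evaluates the cost function many times
    cost_fun([1.0, 2.0], data)
```

After the loop, `FakeBroadcast.live` is back to zero; dropping the `finally` block would leave 100 instances alive.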
      
      ## How was this patch tested?
      
      Unit tests, plus a check that the fix resolves the fatal memory consumption seen in Criteo's use cases.
      
      This contribution is made on behalf of Criteo S.A.
      (http://labs.criteo.com/) under the terms of the Apache v2 License.
      
      Author: Anthony Truchet <a.truchet@criteo.com>
      
      Closes #16040 from AnthonyTruchet/SPARK-18612-lbfgs-cost-fun.
      c5a64d76
  2. Nov 29, 2016
  3. Nov 28, 2016
    • Yun Ni's avatar
      [SPARK-18408][ML] API Improvements for LSH · 05f7c6ff
      Yun Ni authored
      ## What changes were proposed in this pull request?
      
      (1) Change output schema to `Array of Vector` instead of `Vectors`
      (2) Use `numHashTables` as the dimension of Array
      (3) Rename `RandomProjection` to `BucketedRandomProjectionLSH`, `MinHash` to `MinHashLSH`
      (4) Make `randUnitVectors/randCoefficients` private
      (5) Make Multi-Probe NN Search and `hashDistance` private for future discussion
      
      Saved for future PRs:
      (1) AND-amplification and `numHashFunctions` as the dimension of Vector.
      (2) `hashDistance` and Multi-Probe NN Search need more discussion. The current implementation is just a backward-compatible one.
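
To illustrate the output schema described above, here is a rough MinHash sketch in plain Python. The hash family, the prime, and the function names are assumptions made for the example and are not taken from the Spark implementation. The output is an array with one entry per hash table, and each entry is a vector of length 1, since a larger vector dimension (`numHashFunctions`) is left to a future PR:

```python
import random

PRIME = 2038074743  # a large prime chosen for this sketch (assumption)


def make_hash_tables(num_hash_tables, seed=0):
    """Draw one (a, b) coefficient pair per hash table."""
    rng = random.Random(seed)
    return [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
            for _ in range(num_hash_tables)]


def min_hash(active_indices, tables):
    """Return an array of length numHashTables, each element a 1-dimensional vector."""
    if not active_indices:
        raise ValueError("MinHash needs at least one non-zero index")
    return [[float(min(((1 + i) * a + b) % PRIME for i in active_indices))]
            for a, b in tables]


tables = make_hash_tables(num_hash_tables=3)
signature = min_hash({0, 2, 5}, tables)
```

Here `signature` has three entries, one per hash table, and hashing the same set again with the same tables yields the same signature.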
      
      ## How was this patch tested?
      Related unit tests are modified to check that LSH performance is maintained and that the outputs of the APIs meet expectations.
      
      Author: Yun Ni <yunn@uber.com>
      Author: Yunni <Euler57721@gmail.com>
      
      Closes #15874 from Yunni/SPARK-18408-yunn-api-improvements.
      05f7c6ff
  4. Nov 26, 2016
  5. Nov 25, 2016
    • Zakaria_Hili's avatar
      [SPARK-18356][ML] Improve MLKmeans Performance · 445d4d9e
      Zakaria_Hili authored
      ## What changes were proposed in this pull request?
      
      Spark KMeans `fit()` doesn't cache the RDD, which generates a lot of warnings:
       WARN KMeans: The input data is not directly cached, which may hurt performance if its parent RDDs are also uncached.
      So KMeans should cache the internal RDD before calling the `mllib.KMeans` algorithm. This improved Spark KMeans performance by 14%.
      
      https://github.com/ZakariaHili/spark/commit/a9cf905cf7dbd50eeb9a8b4f891f2f41ea672472
      
      hhbyyh
      ## How was this patch tested?
      Pass Kmeans tests and existing tests
      
      Author: Zakaria_Hili <zakahili@gmail.com>
      Author: HILI Zakaria <zakahili@gmail.com>
      
      Closes #15965 from ZakariaHili/zakbranch.
      445d4d9e
    • hyukjinkwon's avatar
      [SPARK-3359][BUILD][DOCS] More changes to resolve javadoc 8 errors that will help unidoc/genjavadoc compatibility · 51b1c155
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR only tries to fix things that look pretty straightforward and were fixed the same way in previous PRs.
      
      This PR roughly fixes several things as below:
      
      - Fix unrecognisable class and method links in javadoc by changing them from `[[..]]` to `` `...` ``
      
        ```
        [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/DataStreamReader.java:226: error: reference not found
        [error]    * Loads text files and returns a {link DataFrame} whose schema starts with a string column named
        ```
      
      - Fix an exception annotation and remove code backticks in the `@throws` annotation
      
        Currently, sbt unidoc with Java 8 complains as below:
      
        ```
        [error] .../java/org/apache/spark/sql/streaming/StreamingQuery.java:72: error: unexpected text
        [error]    * throws StreamingQueryException, if <code>this</code> query has terminated with an exception.
        ```
      
        `@throws` should name the exception class exactly: change `StreamingQueryException,` (with a trailing comma) to `StreamingQueryException`, and do not wrap it in backticks (see [JDK-8007644](https://bugs.openjdk.java.net/browse/JDK-8007644)).
      
      - Fix `[[http..]]` to `<a href="http..."></a>`.
      
        ```diff
        -   * [[https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https Oracle
        -   * blog page]].
        +   * <a href="https://blogs.oracle.com/java-platform-group/entry/diagnosing_tls_ssl_and_https">
        +   * Oracle blog page</a>.
        ```
      
         `[[http...]]` link markdown in scaladoc is unrecognisable in javadoc.
      
      - It seems a class can't have a `@return` annotation, so the two occurrences of this were removed.
      
        ```
        [error] .../java/org/apache/spark/mllib/regression/IsotonicRegression.java:27: error: invalid use of return
        [error]    * return New instance of IsotonicRegression.
        ```
      
      - Fix < to `&lt;` and > to `&gt;` according to HTML rules.
      
      - Fix `</p>` complaint
      
      - Exclude `@constructor`, `@todo` and `@groupname`, which are unrecognisable in javadoc.
      
      ## How was this patch tested?
      
      Manually tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not yet make sbt unidoc succeed with Java 8, but it reduces the number of errors.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15999 from HyukjinKwon/SPARK-3359-errors.
      51b1c155
  6. Nov 24, 2016
  7. Nov 22, 2016
    • Yanbo Liang's avatar
      [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data · 982b82e3
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      * Fix SparkR ```spark.glm``` errors when fitting on collinear data, since ```standard error of coefficients, t value and p value``` are not available in this condition.
      * The Scala/Python GLM summary should throw an exception if users request ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs".
      
      ## How was this patch tested?
      Add unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15930 from yanboliang/spark-18501.
      982b82e3
  8. Nov 21, 2016
    • sethah's avatar
      [SPARK-18282][ML][PYSPARK] Add python clustering summaries for GMM and BKM · e811fbf9
      sethah authored
      ## What changes were proposed in this pull request?
      
      Add model summary APIs for `GaussianMixtureModel` and `BisectingKMeansModel` in pyspark.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15777 from sethah/pyspark_cluster_summaries.
      e811fbf9
  9. Nov 19, 2016
    • sethah's avatar
      [SPARK-18456][ML][FOLLOWUP] Use matrix abstraction for coefficients in LogisticRegression training · 856e0042
      sethah authored
      ## What changes were proposed in this pull request?
      
      This is a follow up to some of the discussion [here](https://github.com/apache/spark/pull/15593). During LogisticRegression training, we store the coefficients combined with intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use matrix where possible, which makes the code more readable and greatly simplifies the indexing.
      
      Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit `MutableInnerProductModule[DenseMatrix[Double], Double]` which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the `regParamL1Fun` because OWLQN in Breeze requires a `MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]]` (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark.
      
      ## How was this patch tested?
      
      This is internal code refactoring - the current unit tests passing show us that the change did not break anything. No added functionality in this patch.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15893 from sethah/logreg_refactor.
      856e0042
    • hyukjinkwon's avatar
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation · d5b1d5fc
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      In the Scala/Java API documentation, notes are currently written in several different ways:

      - `Note:`
      - `NOTE:`
      - `Note that`
      - `'''Note:'''`
      - `@note`

      This PR proposes to change all of them to `@note` for consistency.
      
      **Before**
      
      - Scala
        ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
      
      - Java
        ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
      
      **After**
      
      - Scala
        ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
      
      - Java
        ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png)
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
      -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note that " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documentation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      And then fixed one by one comparing with API documentation/access modifiers.
      
      After that, manually tested via `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15889 from HyukjinKwon/SPARK-18437.
      d5b1d5fc
  10. Nov 17, 2016
    • Zheng RuiFeng's avatar
      [SPARK-18480][DOCS] Fix wrong links for ML guide docs · cdaf4ce9
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      1, There are two `[Graph.partitionBy]` in `graphx-programming-guide.md`; the first one had no effect.
      2, `DataFrame`, `Transformer`, `Pipeline` and `Parameter` in `ml-pipeline.md` were linked to `ml-guide.html` by mistake.
      3, `PythonMLLibAPI` in `mllib-linear-methods.md` was not accessible, because class `PythonMLLibAPI` is private.
      4, Other link updates.
      ## How was this patch tested?
       manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15912 from zhengruifeng/md_fix.
      cdaf4ce9
    • VinceShieh's avatar
      [SPARK-17462][MLLIB]use VersionUtils to parse Spark version strings · de77c677
      VinceShieh authored
      ## What changes were proposed in this pull request?
      
      Several places in MLlib use custom regexes or other approaches to parse Spark versions.
      Those should be fixed to use the VersionUtils. This PR replaces custom regexes with
      VersionUtils to get Spark version numbers.
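
As a rough illustration of the kind of parsing being centralized, the sketch below extracts the major and minor numbers from a version string with a single regular expression. The function name and the exact pattern are assumptions for this example and are not a copy of the VersionUtils code:

```python
import re

# Major and minor numbers, optionally followed by ".<anything>" (e.g. ".0-SNAPSHOT").
_MAJOR_MINOR = re.compile(r"^(\d+)\.(\d+)(\..*)?$")


def major_minor_version(spark_version):
    """Return (major, minor) as integers, or raise if the string is not a version."""
    match = _MAJOR_MINOR.match(spark_version)
    if match is None:
        raise ValueError(
            "Could not find major and minor version numbers in %r" % spark_version)
    return int(match.group(1)), int(match.group(2))


snapshot_version = major_minor_version("2.1.0-SNAPSHOT")
release_version = major_minor_version("2.0.2")
```

Keeping this in one helper means every caller rejects malformed strings the same way instead of each carrying its own regex.
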
      ## How was this patch tested?
      
      Existing tests.
      
      Signed-off-by: VinceShieh vincent.xie@intel.com
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #15055 from VinceShieh/SPARK-17462.
      de77c677
  11. Nov 16, 2016
    • Zheng RuiFeng's avatar
      [SPARK-18434][ML] Add missing ParamValidations for ML algos · c68f1a38
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add missing ParamValidations for ML algos
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15881 from zhengruifeng/arg_checking.
      c68f1a38
    • Yanbo Liang's avatar
      [SPARK-18438][SPARKR][ML] spark.mlp should support RFormula. · 95eb06bd
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      ```spark.mlp``` should support ```RFormula``` like other ML algorithm wrappers.
      BTW, I did some cleanup and improvement for ```spark.mlp```.
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15883 from yanboliang/spark-18438.
      95eb06bd
  12. Nov 14, 2016
    • actuaryzhang's avatar
      [SPARK-18166][MLLIB] Fix Poisson GLM bug due to wrong requirement of response values · ae6cddb7
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      
      The current implementation of Poisson GLM seems to allow only positive response values. This is incorrect, since the support of the Poisson distribution includes zero. The bug is easily fixed by changing the check on the Poisson response from 'require(y **>** 0.0' to 'require(y **>=** 0.0'.
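
The reasoning can be checked numerically. The sketch below, with illustrative function names that are not part of Spark, evaluates the Poisson probability mass function at zero and mirrors the relaxed requirement:

```python
import math


def poisson_pmf(y, mu):
    """P(Y = y) for a Poisson variable with mean mu; y is a non-negative integer."""
    return math.exp(-mu) * mu ** y / math.factorial(y)


def check_poisson_response(y):
    """Mirror of the relaxed check: zero is a valid Poisson response."""
    if not y >= 0.0:
        raise ValueError(
            "The response variable of Poisson family should be non-negative, "
            "but got %s" % y)
    return y


prob_zero = poisson_pmf(0, 2.0)      # equals exp(-2), strictly positive
accepted = check_poisson_response(0.0)
```

Because `prob_zero` is strictly positive for any finite mean, zero counts are legitimate observations and must pass the check.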
      
      mengxr  srowen
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      Author: actuaryzhang <actuaryzhang@uber.com>
      
      Closes #15683 from actuaryzhang/master.
      ae6cddb7
  13. Nov 13, 2016
    • Yanbo Liang's avatar
      [SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms training on libsvm data · 07be232e
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      * Fix the following exception, which is thrown when ```spark.randomForest```(classification), ```spark.gbt```(classification), ```spark.naiveBayes``` and ```spark.glm```(binomial family) are fitted on libsvm data.
      ```
      java.lang.IllegalArgumentException: requirement failed: If label column already exists, forceIndexLabel can not be set with true.
      ```
      See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for more detail about how to reproduce this bug.
      * Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of ML algorithm wrappers use this function.
      * Drop some unwanted columns when making predictions.
      
      ## How was this patch tested?
      Add unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15851 from yanboliang/spark-18412.
      07be232e
  14. Nov 12, 2016
    • Yanbo Liang's avatar
      [SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes · 22cb3a06
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      * Refactor out ```trainWithLabelCheck``` and make ```mllib.NaiveBayes``` call into it.
      * Avoid capturing the outer object for ```modelType```.
      * Move ```requireNonnegativeValues``` and ```requireZeroOneBernoulliValues``` to companion object.
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15826 from yanboliang/spark-14077-2.
      22cb3a06
  15. Nov 11, 2016
    • sethah's avatar
      [SPARK-18060][ML] Avoid unnecessary computation for MLOR · 46b2550b
      sethah authored
      ## What changes were proposed in this pull request?
      
      Before this patch, the gradient updates for multinomial logistic regression were computed by an outer loop over the number of classes and an inner loop over the number of features. Inside the inner loop, we standardized the feature value (`value / featuresStd(index)`), which means we performed the computation `numFeatures * numClasses` times. We only need to perform that computation `numFeatures` times, however. If we re-order the inner and outer loop, we can avoid this, but then we lose sequential memory access. In this patch, we instead lay out the coefficients in column major order while we train, so that we can avoid the extra computation and retain sequential memory access. We convert back to row-major order when we create the model.
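
The loop re-ordering can be sketched in plain Python. For brevity this simplified example computes per-class margins rather than full gradient updates, and all names are illustrative assumptions. It counts how often the standardization `x[j] / std[j]` is evaluated under each layout:

```python
def margins_row_major(coef, x, std, num_classes):
    """Outer loop over classes; coefficients stored row-major (class by feature)."""
    num_features = len(x)
    margins, divisions = [0.0] * num_classes, 0
    for k in range(num_classes):
        for j in range(num_features):
            scaled = x[j] / std[j]  # recomputed for every class
            divisions += 1
            margins[k] += coef[k * num_features + j] * scaled
    return margins, divisions


def margins_col_major(coef, x, std, num_classes):
    """Outer loop over features; coefficients stored column-major (feature by class)."""
    num_features = len(x)
    margins, divisions = [0.0] * num_classes, 0
    for j in range(num_features):
        scaled = x[j] / std[j]  # computed once per feature
        divisions += 1
        for k in range(num_classes):
            margins[k] += coef[j * num_classes + k] * scaled  # sequential access
    return margins, divisions


def to_col_major(row_major, num_classes, num_features):
    """Re-lay out a flat row-major coefficient array in column-major order."""
    return [row_major[k * num_features + j]
            for j in range(num_features) for k in range(num_classes)]


row_coef = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]  # 3 classes x 2 features
x, std = [2.0, 9.0], [2.0, 3.0]
old_margins, old_divisions = margins_row_major(row_coef, x, std, 3)
new_margins, new_divisions = margins_col_major(to_col_major(row_coef, 3, 2), x, std, 3)
```

Both versions walk the flat coefficient array sequentially and produce the same margins, but the column-major version performs `numFeatures` divisions instead of `numFeatures * numClasses`.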
      
      ## How was this patch tested?
      
      This is an implementation detail only, so the original behavior should be maintained. All tests pass. I ran some performance tests to verify speedups. The results are below, and show significant speedups.
      ## Performance Tests
      
      **Setup**
      
      3 node bare-metal cluster
      120 cores total
      384 GB RAM total
      
      **Results**
      
      NOTE: The `currentMasterTime` and `thisPatchTime` are times in seconds for a single iteration of L-BFGS or OWL-QN.
      
      |    |   numPoints |   numFeatures |   numClasses |   regParam |   elasticNetParam |   currentMasterTime (sec) |   thisPatchTime (sec) |   pctSpeedup |
      |----|-------------|---------------|--------------|------------|-------------------|---------------------------|-----------------------|--------------|
      |  0 |       1e+07 |           100 |          500 |       0.5  |                 0 |                        90 |                    18 |           80 |
      |  1 |       1e+08 |           100 |           50 |       0.5  |                 0 |                        90 |                    19 |           78 |
      |  2 |       1e+08 |           100 |           50 |       0.05 |                 1 |                        72 |                    19 |           73 |
      |  3 |       1e+06 |           100 |         5000 |       0.5  |                 0 |                        93 |                    53 |           43 |
      |  4 |       1e+07 |           100 |         5000 |       0.5  |                 0 |                       900 |                   390 |           56 |
      |  5 |       1e+08 |           100 |          500 |       0.5  |                 0 |                       840 |                   174 |           79 |
      |  6 |       1e+08 |           100 |          200 |       0.5  |                 0 |                       360 |                    72 |           80 |
      |  7 |       1e+08 |          1000 |            5 |       0.5  |                 0 |                         9 |                     3 |           66 |
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15593 from sethah/MLOR_PERF_COL_MAJOR_COEF.
      46b2550b
  16. Nov 10, 2016
  17. Nov 08, 2016
    • Felix Cheung's avatar
      [SPARK-18239][SPARKR] Gradient Boosted Tree for R · 55964c15
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Gradient Boosted Tree in R.
      With a few minor improvements to RandomForest in R.
      
      Since this is relatively isolated I'd like to target this for branch-2.1
      
      ## How was this patch tested?
      
      manual tests, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15746 from felixcheung/rgbt.
      55964c15
    • Joseph K. Bradley's avatar
      [SPARK-17748][ML] Minor cleanups to one-pass linear regression with elastic net · 26e1c53a
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      * Made SingularMatrixException `private[ml]`
      * WeightedLeastSquares: Changed to allow tol >= 0 instead of only tol > 0
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15779 from jkbradley/wls-cleanups.
      26e1c53a
  18. Nov 07, 2016
    • Hyukjin Kwon's avatar
      [SPARK-14914][CORE] Fix Resource not closed after using, mostly for unit tests · 8f0ea011
      Hyukjin Kwon authored
      ## What changes were proposed in this pull request?
      
      Close `FileStreams`, `ZipFiles`, etc. to release the resources after use. Not closing the resources causes an IOException to be raised while deleting temp files.
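
The same discipline, sketched in Python rather than in the Scala/Java code the patch touches: every stream is closed (here through `with` blocks) before the temporary file is deleted. The helper name is hypothetical.

```python
import os
import tempfile


def write_and_delete(payload):
    """Write payload to a temp file, read it back, then delete the file."""
    fd, path = tempfile.mkstemp()
    os.close(fd)  # release the descriptor returned by mkstemp
    with open(path, "wb") as stream:
        stream.write(payload)
    # The write stream is closed here, before the file is reopened.
    with open(path, "rb") as stream:
        data = stream.read()
    # Both streams are closed, so the delete cannot fail on an open handle.
    os.remove(path)
    return data, os.path.exists(path)


data, still_exists = write_and_delete(b"spark")
```

On platforms that refuse to delete files with open handles, skipping either close would make the final `os.remove` fail.
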
      ## How was this patch tested?
      
      Existing tests
      
      Author: U-FAREAST\tl <tl@microsoft.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Tao LI <tl@microsoft.com>
      
      Closes #15618 from HyukjinKwon/SPARK-14914-1.
      8f0ea011
    • Yanbo Liang's avatar
      [SPARK-18291][SPARKR][ML] SparkR glm predict should output original label when family = binomial. · daa975f4
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      SparkR ```spark.glm``` predict should output original label when family = "binomial".
      
      ## How was this patch tested?
      Add unit test.
      You can also run the following code to test:
      ```R
      training <- suppressWarnings(createDataFrame(iris))
      training <- training[training$Species %in% c("versicolor", "virginica"), ]
      model <- spark.glm(training, Species ~ Sepal_Length + Sepal_Width, family = binomial(link = "logit"))
      showDF(predict(model, training))
      ```
      Before this change:
      ```
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|         prediction|
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| 0.8271421517601544|
      |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| 0.6044595910413112|
      |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| 0.7916340858281998|
      |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|0.16080518180591158|
      |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| 0.6112229217050189|
      |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0| 0.2555087295500885|
      |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| 0.5681507664364834|
      |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|0.05990570219972002|
      |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| 0.6644434078306246|
      |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|0.11293577405862379|
      |         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|0.06152372321585971|
      |         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|0.35250697207602555|
      |         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|0.32267018290814303|
      |         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|  0.433391153814592|
      |         5.6|        2.9|         3.6|        1.3|versicolor|  0.0| 0.2280744262436993|
      |         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| 0.7219848389339459|
      |         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|0.23527698971404695|
      |         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|  0.285024533520016|
      |         6.2|        2.2|         4.5|        1.5|versicolor|  0.0| 0.4107047877447493|
      |         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|0.20083561961645083|
      +------------+-----------+------------+-----------+----------+-----+-------------------+
      ```
      After this change:
      ```
      +------------+-----------+------------+-----------+----------+-----+----------+
      |Sepal_Length|Sepal_Width|Petal_Length|Petal_Width|   Species|label|prediction|
      +------------+-----------+------------+-----------+----------+-----+----------+
      |         7.0|        3.2|         4.7|        1.4|versicolor|  0.0| virginica|
      |         6.4|        3.2|         4.5|        1.5|versicolor|  0.0| virginica|
      |         6.9|        3.1|         4.9|        1.5|versicolor|  0.0| virginica|
      |         5.5|        2.3|         4.0|        1.3|versicolor|  0.0|versicolor|
      |         6.5|        2.8|         4.6|        1.5|versicolor|  0.0| virginica|
      |         5.7|        2.8|         4.5|        1.3|versicolor|  0.0|versicolor|
      |         6.3|        3.3|         4.7|        1.6|versicolor|  0.0| virginica|
      |         4.9|        2.4|         3.3|        1.0|versicolor|  0.0|versicolor|
      |         6.6|        2.9|         4.6|        1.3|versicolor|  0.0| virginica|
      |         5.2|        2.7|         3.9|        1.4|versicolor|  0.0|versicolor|
      |         5.0|        2.0|         3.5|        1.0|versicolor|  0.0|versicolor|
      |         5.9|        3.0|         4.2|        1.5|versicolor|  0.0|versicolor|
      |         6.0|        2.2|         4.0|        1.0|versicolor|  0.0|versicolor|
      |         6.1|        2.9|         4.7|        1.4|versicolor|  0.0|versicolor|
      |         5.6|        2.9|         3.6|        1.3|versicolor|  0.0|versicolor|
      |         6.7|        3.1|         4.4|        1.4|versicolor|  0.0| virginica|
      |         5.6|        3.0|         4.5|        1.5|versicolor|  0.0|versicolor|
      |         5.8|        2.7|         4.1|        1.0|versicolor|  0.0|versicolor|
      |         6.2|        2.2|         4.5|        1.5|versicolor|  0.0|versicolor|
      |         5.6|        2.5|         3.9|        1.1|versicolor|  0.0|versicolor|
      +------------+-----------+------------+-----------+----------+-----+----------+
      ```
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15788 from yanboliang/spark-18291.
      daa975f4
  19. Nov 06, 2016
    • Wojciech Szymanski's avatar
      [SPARK-18210][ML] Pipeline.copy does not create an instance with the same UID · b89d0556
      Wojciech Szymanski authored
      ## What changes were proposed in this pull request?
      
      Motivation:
      `org.apache.spark.ml.Pipeline.copy(extra: ParamMap)` does not create an instance with the same UID. It does not conform to the method specification from its base class `org.apache.spark.ml.param.Params.copy(extra: ParamMap)`
      
      Solution:
      - fix for Pipeline UID
      - introduced new tests for `org.apache.spark.ml.Pipeline.copy`
      - minor improvements in test for `org.apache.spark.ml.PipelineModel.copy`
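
The contract at stake can be sketched with a small stand-in class (hypothetical, not the Spark `Pipeline`): `copy` must return a new instance that keeps the original UID while applying the extra parameters.

```python
import uuid


class PipelineSketch:
    """Hypothetical stand-in for a pipeline with a UID and a parameter map."""

    def __init__(self, stages=None, uid=None):
        self.uid = uid if uid is not None else "pipeline_" + uuid.uuid4().hex[:12]
        self.stages = list(stages or [])
        self.params = {}

    def copy(self, extra=None):
        # Pass the existing UID along instead of letting the constructor mint a new one.
        copied = PipelineSketch(stages=self.stages, uid=self.uid)
        copied.params = dict(self.params)
        copied.params.update(extra or {})
        return copied


original = PipelineSketch(stages=["tokenizer", "lr"])
copied = original.copy({"maxIter": 10})
```

The bug pattern is a `copy` that calls the constructor without the UID, which silently generates a fresh identifier for the copy.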
      
      ## How was this patch tested?
      
      Introduced new unit test: `org.apache.spark.ml.PipelineSuite."Pipeline.copy"`
      Improved existing unit test: `org.apache.spark.ml.PipelineSuite."PipelineModel.copy"`
      
      Author: Wojciech Szymanski <wk.szymanski@gmail.com>
      
      Closes #15759 from wojtek-szymanski/SPARK-18210.
      b89d0556
    • sethah's avatar
      [SPARK-18276][ML] ML models should copy the training summary and set parent · 23ce0d1e
      sethah authored
      ## What changes were proposed in this pull request?
      
      Only some of the models which contain a training summary currently set the summary in the copy method: Linear/Logistic regression do; GLR, GMM, KM, and BKM do not. Additionally, these copy methods did not set the parent pointer of the copied model. This patch modifies the copy methods of the four models mentioned above to copy the training summary and set the parent.
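
A minimal sketch of the intended behavior, using a hypothetical model class rather than any Spark type: the copy carries over both the training summary and the parent pointer.

```python
class ModelSketch:
    """Hypothetical fitted model with a parent estimator and a training summary."""

    def __init__(self, centers):
        self.centers = centers
        self.parent = None
        self.summary = None

    def copy(self):
        copied = ModelSketch(list(self.centers))
        copied.parent = self.parent    # keep the link to the estimator that produced it
        copied.summary = self.summary  # keep the training summary available
        return copied


model = ModelSketch([[0.0, 0.0], [1.0, 1.0]])
model.parent = "kmeans-estimator"
model.summary = {"k": 2}
copied_model = model.copy()
```

Without the two assignments in `copy`, the copied model would report no summary and no parent even though the original has both.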
      
      ## How was this patch tested?
      
      Add unit tests in Linear/Logistic/GeneralizedLinear regression and GaussianMixture/KMeans/BisectingKMeans to check the parent pointer of the copied model and check that the copied model has a summary.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15773 from sethah/SPARK-18276.
      23ce0d1e
  20. Nov 02, 2016
  21. Nov 01, 2016
    • Joseph K. Bradley's avatar
      [SPARK-18088][ML] Various ChiSqSelector cleanups · 91c33a0c
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      - Renamed kbest to numTopFeatures
      - Renamed alpha to fpr
      - Added missing Since annotations
      - Doc cleanups
      ## How was this patch tested?
      
      Added new standardized unit tests for spark.ml.
      Improved existing unit test coverage a bit.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15647 from jkbradley/chisqselector-follow-ups.
      91c33a0c
    • Zheng RuiFeng's avatar
      [SPARK-17848][ML] Move LabelCol datatype cast into Predictor.fit · 8ac09108
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      1, move cast to `Predictor`
      2, and then, remove unnecessary cast
      ## How was this patch tested?
      
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15414 from zhengruifeng/move_cast.
      8ac09108
    • Reynold Xin's avatar
      [SPARK-18024][SQL] Introduce an internal commit protocol API · d9d14650
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch introduces an internal commit protocol API that is used by the batch data source to do write commits. It currently has only one implementation that uses Hadoop MapReduce's OutputCommitter API. In the future, this commit API can be used to unify streaming and batch commits.
      
      ## How was this patch tested?
      Should be covered by existing write tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15707 from rxin/SPARK-18024-2.
      d9d14650
  22. Oct 30, 2016
    • Felix Cheung's avatar
      [SPARK-16137][SPARKR] randomForest for R · b6879b8b
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Random Forest Regression and Classification for R
      Clean-up/reordering generics.R
      
      ## How was this patch tested?
      
      manual tests, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15607 from felixcheung/rrandomforest.
      b6879b8b
    • Sean Owen's avatar
      [SPARK-3261][MLLIB] KMeans clusterer can return duplicate cluster centers · a489567e
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Return potentially fewer than k cluster centers in cases where k distinct centroids aren't available or aren't selected.
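
      The effect can be illustrated with a deliberately simplified sketch. This is not the initialization logic in MLlib; `DistinctCenters.initialCenters` is a hypothetical helper that picks centers from the distinct input points, so it returns fewer than k centers when fewer than k distinct points exist.

      ```scala
      object DistinctCenters {
        type Point = Vector[Double]

        /** Pick up to k distinct initial centers, in first-seen order. */
        def initialCenters(points: Seq[Point], k: Int): Seq[Point] = {
          require(k > 0, "k must be positive")
          // Duplicated points collapse to one candidate, so the result may be shorter than k.
          points.distinct.take(k)
        }
      }
      ```

      Callers therefore should not assume that the number of returned centers equals k.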
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15450 from srowen/SPARK-3261.
      a489567e
  23. Oct 28, 2016
    • Yunni's avatar
      [SPARK-5992][ML] Locality Sensitive Hashing · ac26e9cf
      Yunni authored
      ## What changes were proposed in this pull request?
      
      Implement Locality Sensitive Hashing along with approximate nearest neighbors and approximate similarity join based on the [design doc](https://docs.google.com/document/d/1D15DTDMF_UWTTyWqXfG7y76iZalky4QmifUYQ6lH5GM/edit).
      
      Detailed changes are as follows:
      (1) Implement abstract LSH, LSHModel classes as Estimator-Model
      (2) Implement approxNearestNeighbors and approxSimilarityJoin in the abstract LSHModel
      (3) Implement Random Projection as LSH subclass for Euclidean distance, Min Hash for Jaccard Distance
      (4) Implement unit test utility methods including checkLshProperty, checkNearestNeighbor and checkSimilarityJoin
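
      The two hash families named in (3) can be sketched in a few lines. This is a dependency-free illustration under stated assumptions, not the implementation in this PR: `LshSketch` and its method names are hypothetical, the projection vector and the MinHash coefficients are passed in rather than drawn at random, and set elements are assumed to be non-negative integer ids.

      ```scala
      object LshSketch {
        /** Euclidean-distance LSH: project onto a unit vector and bucket the result. */
        def randomProjectionHash(v: Seq[Double], unitVector: Seq[Double], bucketLength: Double): Long = {
          require(v.length == unitVector.length, "dimension mismatch")
          require(bucketLength > 0.0, "bucketLength must be positive")
          val dot = v.zip(unitVector).map { case (a, b) => a * b }.sum
          math.floor(dot / bucketLength).toLong
        }

        /** Jaccard distance between two sets: 1 - |intersection| / |union|. */
        def jaccardDistance(a: Set[Int], b: Set[Int]): Double = {
          require(a.nonEmpty || b.nonEmpty, "at least one set must be non-empty")
          1.0 - a.intersect(b).size.toDouble / a.union(b).size.toDouble
        }

        /** MinHash: the minimum of a linear hash ((1 + e) * a + b) mod prime over all elements. */
        def minHash(elems: Set[Int], a: Long, b: Long, prime: Long): Long = {
          require(elems.nonEmpty, "set must be non-empty")
          require(elems.forall(_ >= 0), "element ids must be non-negative")
          elems.iterator.map(e => ((1L + e) * a + b) % prime).min
        }
      }
      ```

      Points that are close in Euclidean distance tend to land in the same projection bucket, and sets with a small Jaccard distance tend to share a MinHash value; that is the property the approximate nearest neighbor search and similarity join rely on.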
      
      Things that will be implemented in a follow-up PR:
       - Bit Sampling for Hamming Distance, SignRandomProjection for Cosine Distance
       - PySpark Integration for the scala classes and methods.
      
      ## How was this patch tested?
      Unit tests are implemented for all the new classes and algorithms. A scalability test on Uber's dataset was performed internally.
      
      Tested the methods on [WEX dataset](https://aws.amazon.com/items/2345) from AWS, with the steps and results [here](https://docs.google.com/document/d/19BXg-67U83NVB3M0I84HVBVg3baAVaESD_mrg_-vLro/edit).
      
      ## References
      Gionis, Aristides, Piotr Indyk, and Rajeev Motwani. "Similarity search in high dimensions via hashing." VLDB 7 Sep. 1999: 518-529.
      Wang, Jingdong et al. "Hashing for similarity search: A survey." arXiv preprint arXiv:1408.2927 (2014).
      
      Author: Yunni <Euler57721@gmail.com>
      Author: Yun Ni <yunn@uber.com>
      
      Closes #15148 from Yunni/SPARK-5992-yunn-lsh.
      ac26e9cf
    • Zheng RuiFeng's avatar
      [SPARK-18109][ML] Add instrumentation to GMM · 569788a5
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      Add instrumentation to GMM
      
      ## How was this patch tested?
      
      Test in spark-shell
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15636 from zhengruifeng/gmm_instr.
      569788a5
  24. Oct 27, 2016
    • VinceShieh's avatar
      [SPARK-17219][ML] enhanced NaN value handling in Bucketizer · 0b076d4c
      VinceShieh authored
      ## What changes were proposed in this pull request?
      
      This PR is an enhancement of the PR with commit ID 57dc326b.
      NaN is commonly treated as an invalid value, but in some cases NaN values carry useful information and need special handling. This PR gives users three options for handling NaN values: reserve an extra bucket for them, remove them, or report an error, by setting handleNaN to "keep", "skip", or "error" (the default), respectively.
      
      Before:
      val bucketizer: Bucketizer = new Bucketizer()
        .setInputCol("feature")
        .setOutputCol("result")
        .setSplits(splits)

      After:
      val bucketizer: Bucketizer = new Bucketizer()
        .setInputCol("feature")
        .setOutputCol("result")
        .setSplits(splits)
        .setHandleNaN("keep")
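
      The three options can be illustrated with a small, dependency-free sketch. It is not the Bucketizer implementation: `NaNBucketing.bucket` is a hypothetical helper, and it assumes sorted splits where every bucket is half-open except the last, which includes its upper bound.

      ```scala
      object NaNBucketing {
        /**
         * Return the bucket index of `x` for sorted `splits`, or None when a NaN is skipped.
         * With "keep", NaN goes to an extra bucket after the regular ones.
         */
        def bucket(splits: Array[Double], x: Double, handleNaN: String): Option[Int] = {
          require(splits.length >= 2, "need at least two split points")
          val numBuckets = splits.length - 1
          if (x.isNaN) {
            handleNaN match {
              case "keep"  => Some(numBuckets) // extra bucket reserved for NaN
              case "skip"  => None             // caller drops the row
              case "error" => throw new IllegalArgumentException("NaN value found")
              case other   => throw new IllegalArgumentException(s"Unknown handleNaN option: $other")
            }
          } else if (x == splits.last) {
            Some(numBuckets - 1) // the last bucket includes its upper bound
          } else {
            val idx = splits.lastIndexWhere(_ <= x)
            require(idx >= 0 && idx < numBuckets, s"value $x is outside the splits range")
            Some(idx)
          }
        }
      }
      ```

      With splits `Array(0.0, 1.0, 2.0)` there are two regular buckets, so "keep" places NaN in a third bucket with index 2, "skip" yields no bucket, and "error" throws.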
      
      ## How was this patch tested?
      Tests added in QuantileDiscretizerSuite, BucketizerSuite and DataFrameStatSuite
      
      Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      Author: Vincent Xie <vincent.xie@intel.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15428 from VinceShieh/spark-17219_followup.
      0b076d4c