  1. Sep 14, 2017
    • Ming Jiang's avatar
      [SPARK-21854] Added LogisticRegressionTrainingSummary for... · 8d8641f1
      Ming Jiang authored
      [SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API
      
      ## What changes were proposed in this pull request?
      
      Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API
      
      ## How was this patch tested?
      
      Added unit test
      
      
      Author: Ming Jiang <mjiang@fanatics.com>
      Author: Ming Jiang <jmwdpk@gmail.com>
      Author: jmwdpk <jmwdpk@gmail.com>
      
      Closes #19185 from jmwdpk/SPARK-21854.
      8d8641f1
  2. Sep 13, 2017
    • Zheng RuiFeng's avatar
      [SPARK-21690][ML] one-pass imputer · 0fa5b7ca
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      parallelize the computation of all columns
      
      performance tests:
      
      |numColumns| Mean(Old) | Median(Old) | Mean(RDD) | Median(RDD) | Mean(DF) | Median(DF) |
      |------|----------|------------|----------|------------|----------|------------|
      |1|0.0771394713|0.0658712813|0.080779802|0.048165981499999996|0.10525509870000001|0.0499620203|
      |10|0.7234340630999999|0.5954440414|0.0867935197|0.13263428659999998|0.09255724889999999|0.1573943635|
      |100|7.3756451568|6.2196631259|0.1911931552|0.8625376817000001|0.5557462431|1.7216837982000002|
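
      The gain in the table comes from handling all columns in one traversal instead of one job per column. A minimal plain-Scala sketch of that idea for the mean strategy follows; it is an illustration only, not the Imputer code. `OnePassMean` and `columnMeans` are hypothetical names, and missing values are assumed to be encoded as NaN.

      ```scala
      object OnePassMean {
        // Mean of every column in a single traversal of the rows.
        // NaN entries are treated as missing and skipped (assumption of this sketch).
        def columnMeans(rows: Iterable[Array[Double]], numCols: Int): Array[Double] = {
          val sums = new Array[Double](numCols)
          val counts = new Array[Long](numCols)
          for (row <- rows; j <- 0 until numCols) {
            val v = row(j)
            if (!v.isNaN) {
              sums(j) += v
              counts(j) += 1L
            }
          }
          // A column with no valid entries has no surrogate value; report NaN for it.
          Array.tabulate(numCols)(j => if (counts(j) > 0) sums(j) / counts(j) else Double.NaN)
        }
      }
      ```

      The per-column accumulators are what make a single pass possible: each row updates every column's running sum and count at once.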
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #18902 from zhengruifeng/parallelize_imputer.
      0fa5b7ca
    • WeichenXu's avatar
      [SPARK-21027][MINOR][FOLLOW-UP] add missing since tag · f6c5d8f6
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      add missing since tag for `setParallelism` in #19110
      
      ## How was this patch tested?
      
      N/A
      
      Author: WeichenXu <weichen.xu@databricks.com>
      
      Closes #19214 from WeichenXu123/minor01.
      f6c5d8f6
  3. Sep 12, 2017
    • Zheng RuiFeng's avatar
      [SPARK-18608][ML] Fix double caching · c5f9b89d
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      `df.rdd.getStorageLevel` => `df.storageLevel`
      
      Used the command `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure every algorithm affected by this issue is fixed.
      
      Previous discussion in other PRs: https://github.com/apache/spark/pull/19107, https://github.com/apache/spark/pull/17014
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #19197 from zhengruifeng/double_caching.
      c5f9b89d
    • Ajay Saini's avatar
      [SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark · 720c94fe
      Ajay Saini authored
      ## What changes were proposed in this pull request?
      
      Added tunable parallelism to the pyspark implementation of one vs. rest classification. Added a parallelism parameter to the Scala implementation of one vs. rest along with functionality for using the parameter to tune the level of parallelism.
      
      I am taking over PR #18281 because the original author is busy and this change needs to be merged soon.
      Once this is merged, #18281 can be closed.
      
      ## How was this patch tested?
      
      Test suite added.
      
      Author: Ajay Saini <ajays725@gmail.com>
      Author: WeichenXu <weichen.xu@databricks.com>
      
      Closes #19110 from WeichenXu123/spark-21027.
      720c94fe
    • Marco Gaido's avatar
      [SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine... · dd781675
      Marco Gaido authored
      [SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette.
      
      ## What changes were proposed in this pull request?
      
      This PR adds the ClusteringEvaluator Evaluator which contains two metrics:
       - **cosineSilhouette**: the Silhouette measure using the cosine distance;
       - **squaredSilhouette**: the Silhouette measure using the squared Euclidean distance.
      
      The implementation of the two metrics follows the algorithm proposed and explained [here](https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view). These algorithms were designed for a distributed, parallel environment, so they perform reasonably well, unlike a naive Silhouette implementation that follows the definition directly.
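
      For the squared Euclidean case, one way to avoid the pairwise distance computation is to keep three aggregates per cluster (point count, vector sum, sum of squared norms) and derive each point's mean distance to a cluster from them. The sketch below shows that identity in plain Scala. It is an assumption-laden illustration, not the code in this PR: `SquaredSilhouette` and all its members are hypothetical names, at least two clusters are assumed, and singleton clusters are scored 0 by convention.

      ```scala
      object SquaredSilhouette {
        private def dot(a: Array[Double], b: Array[Double]): Double =
          a.indices.map(i => a(i) * b(i)).sum

        // Per-cluster aggregates: point count, element-wise sum, sum of squared norms.
        final case class ClusterStats(n: Long, sum: Array[Double], sumSqNorm: Double)

        def clusterStats(points: Seq[(Array[Double], Int)]): Map[Int, ClusterStats] =
          points.groupBy(_._2).map { case (c, ps) =>
            val dim = ps.head._1.length
            val sum = Array.tabulate(dim)(i => ps.map(_._1(i)).sum)
            c -> ClusterStats(ps.size.toLong, sum, ps.map(p => dot(p._1, p._1)).sum)
          }

        // Mean squared distance from x to all points of a cluster, from aggregates only:
        // (1/N) * sum_y ||x - y||^2 = ||x||^2 - 2 * (x . sum) / N + sumSqNorm / N
        def meanSqDist(x: Array[Double], s: ClusterStats): Double =
          dot(x, x) - 2.0 * dot(x, s.sum) / s.n + s.sumSqNorm / s.n

        def silhouette(points: Seq[(Array[Double], Int)]): Double = {
          val stats = clusterStats(points)
          val scores = points.map { case (x, c) =>
            val own = stats(c)
            if (own.n == 1) 0.0
            else {
              // Rescale by N / (N - 1) so the point itself is excluded from its own cluster.
              val a = meanSqDist(x, own) * own.n / (own.n - 1)
              val b = (stats - c).values.map(meanSqDist(x, _)).min
              if (a < b) 1.0 - a / b else if (a > b) b / a - 1.0 else 0.0
            }
          }
          scores.sum / scores.size
        }
      }
      ```

      Each point is compared against one small record per cluster, so the cost grows with points times clusters rather than with the number of point pairs.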
      
      ## How was this patch tested?
      
      The patch has been tested with the additional unit tests added (comparing the results with the ones provided by [Python sklearn library](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html)).
      
      Author: Marco Gaido <mgaido@hortonworks.com>
      
      Closes #18538 from mgaido91/SPARK-14516.
      dd781675
  4. Sep 11, 2017
    • caoxuewen's avatar
      [MINOR][SQL] remove unuse import class · dc74c0e6
      caoxuewen authored
      ## What changes were proposed in this pull request?
      
      This PR removes imported classes that are unused.
      
      ## How was this patch tested?
      
      N/A
      
      Author: caoxuewen <cao.xuewen@zte.com.cn>
      
      Closes #19131 from heary-cao/unuse_import.
      dc74c0e6
  5. Sep 08, 2017
  6. Sep 06, 2017
    • Bryan Cutler's avatar
      [SPARK-19357][ML] Adding parallel model evaluation in ML tuning · 16c4c03c
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Modified `CrossValidator` and `TrainValidationSplit` to be able to evaluate models in parallel for a given parameter grid.  The level of parallelism is controlled by a parameter `numParallelEval` used to schedule a number of models to be trained/evaluated so that the jobs can be run concurrently.  This is a naive approach that does not check the cluster for needed resources, so care must be taken by the user to tune the parameter appropriately.  The default value is `1` which will train/evaluate in serial.
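
      The scheduling idea described above can be sketched with plain Scala futures on a bounded thread pool. This is an illustration under stated assumptions, not the `CrossValidator` implementation: `ParallelGridSearch`, `best`, and `evaluate` are hypothetical names, and a larger metric is assumed to be better.

      ```scala
      import java.util.concurrent.Executors
      import scala.concurrent.{Await, ExecutionContext, Future}
      import scala.concurrent.duration.Duration

      object ParallelGridSearch {
        // Evaluates every candidate with at most `numParallelEval` running at once
        // and returns the candidate with the highest metric.
        def best[P](grid: Seq[P], numParallelEval: Int)(evaluate: P => Double): (P, Double) = {
          require(grid.nonEmpty && numParallelEval >= 1)
          val pool = Executors.newFixedThreadPool(numParallelEval)
          implicit val ec: ExecutionContext = ExecutionContext.fromExecutorService(pool)
          try {
            val futures = grid.map(p => Future((p, evaluate(p))))
            // Results come back in grid order, so selection does not depend on timing.
            val results = Await.result(Future.sequence(futures), Duration.Inf)
            results.maxBy(_._2)
          } finally {
            pool.shutdown()
          }
        }
      }
      ```

      With `numParallelEval = 1` the pool has a single thread and the candidates are evaluated one after another, which matches the serial default described above. Like the approach in the description, the sketch does not check whether resources are actually available.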
      
      ## How was this patch tested?
      Added unit tests for CrossValidator and TrainValidationSplit to verify that model selection is the same when run in serial vs parallel.  Manual testing to verify tasks run in parallel when param is > 1. Added parameter usage to relevant examples.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #16774 from BryanCutler/parallel-model-eval-SPARK-19357.
      16c4c03c
  7. Sep 01, 2017
    • WeichenXu's avatar
      [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure... · 900f14f6
      WeichenXu authored
      [SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns
      
      ## What changes were proposed in this pull request?
      
      Add test for prediction using the model with all combinations of output columns turned on/off.
      Make sure the output column values match, presumably by comparing vs. the case with all 3 output columns turned on.
      
      ## How was this patch tested?
      
      Test updated.
      
      Author: WeichenXu <weichen.xu@databricks.com>
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #19065 from WeichenXu123/generic_test_for_prob_classifier.
      900f14f6
    • Sean Owen's avatar
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala... · 12ab7f7e
      Sean Owen authored
      [SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation
      
      …build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure
      
      ## What changes were proposed in this pull request?
      
      This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.
      
      In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.
      
      It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.
      
      - Scalatest 2.x -> 3.0.3
      - Chill 0.8.0 -> 0.8.4
      - Clapper 1.0.x -> 1.1.2
      - json4s 3.2.x -> 3.4.2
      - Jackson 2.6.x -> 2.7.9 (required by json4s)
      
      This change does _not_ fully enable a Scala 2.12 build:
      
      - It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
      - It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.
      
      What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.
      
      ## How was this patch tested?
      
      Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18645 from srowen/SPARK-14280.
      12ab7f7e
  8. Aug 31, 2017
  9. Aug 30, 2017
    • Bryan Cutler's avatar
      [SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher · 4133c1b0
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      This PR adds ML examples for the FeatureHasher transform in Scala, Java, Python.
      
      ## How was this patch tested?
      
      Manually ran examples and verified that output is consistent for different APIs
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #19024 from BryanCutler/ml-examples-FeatureHasher-SPARK-21810.
      4133c1b0
    • Sean Owen's avatar
      [SPARK-21806][MLLIB] BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading · 734ed7a7
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Prepend (0, p) to the precision-recall curve instead of (0, 1), where p is the precision at the lowest-recall point.
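
      A small plain-Scala sketch of the described behaviour: build (recall, precision) points from scored labels, then prepend a point at recall 0 that reuses the precision of the first computed point. `PrCurveSketch` and `curve` are hypothetical names, labels above 0.5 are assumed to be positive, and at least one positive label is assumed.

      ```scala
      object PrCurveSketch {
        // Input: (score, label) pairs. Output: (recall, precision) points by descending threshold.
        def curve(scoreAndLabels: Seq[(Double, Double)]): Seq[(Double, Double)] = {
          val totalPos = scoreAndLabels.count(_._2 > 0.5)
          require(totalPos > 0, "need at least one positive label")
          // One threshold per distinct score, highest first.
          val groups = scoreAndLabels.groupBy(_._1).toSeq.sortBy(-_._1)
          // Cumulative (true positives, false positives) at each threshold.
          val cumulative = groups.scanLeft((0, 0)) { case ((tp, fp), (_, group)) =>
            val pos = group.count(_._2 > 0.5)
            (tp + pos, fp + group.size - pos)
          }.tail
          val points = cumulative.map { case (tp, fp) =>
            (tp.toDouble / totalPos, tp.toDouble / (tp + fp))
          }
          // Prepend (0, p) with p taken from the first point, not a fixed (0, 1).
          (0.0, points.head._2) +: points
        }
      }
      ```

      When the highest-scored examples include a negative, a fixed (0, 1) start would draw a segment of precision the classifier never achieves; reusing the first point's precision keeps the curve flat there.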
      
      ## How was this patch tested?
      
      Updated tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19038 from srowen/SPARK-21806.
      734ed7a7
  10. Aug 29, 2017
  11. Aug 28, 2017
    • Weichen Xu's avatar
      [SPARK-17139][ML] Add model summary for MultinomialLogisticRegression · c7270a46
      Weichen Xu authored
      ## What changes were proposed in this pull request?
      
      Add 4 traits, using the following hierarchy:
       - LogisticRegressionSummary
       - LogisticRegressionTrainingSummary: LogisticRegressionSummary
       - BinaryLogisticRegressionSummary: LogisticRegressionSummary
       - BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary

      Public methods such as `def summary` return only the trait types listed above.

      Then implement 4 concrete classes:
       - LogisticRegressionSummaryImpl (multiclass case)
       - LogisticRegressionTrainingSummaryImpl (multiclass case)
       - BinaryLogisticRegressionSummaryImpl (binary case)
       - BinaryLogisticRegressionTrainingSummaryImpl (binary case)
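
      The shape of this hierarchy can be sketched as follows. Only the type names and their inheritance come from the description above; the members (`accuracy`, `objectiveHistory`, `areaUnderROC`) and constructor signatures are placeholders chosen for illustration and should not be read as the actual API.

      ```scala
      // Traits: the only types exposed by public methods.
      trait LogisticRegressionSummary { def accuracy: Double }
      trait LogisticRegressionTrainingSummary extends LogisticRegressionSummary {
        def objectiveHistory: Array[Double]
      }
      trait BinaryLogisticRegressionSummary extends LogisticRegressionSummary {
        def areaUnderROC: Double
      }
      trait BinaryLogisticRegressionTrainingSummary
        extends BinaryLogisticRegressionSummary with LogisticRegressionTrainingSummary

      // Concrete classes: multiclass first, then the binary specializations.
      class LogisticRegressionSummaryImpl(val accuracy: Double) extends LogisticRegressionSummary
      class LogisticRegressionTrainingSummaryImpl(accuracy: Double, val objectiveHistory: Array[Double])
        extends LogisticRegressionSummaryImpl(accuracy) with LogisticRegressionTrainingSummary
      class BinaryLogisticRegressionSummaryImpl(accuracy: Double, val areaUnderROC: Double)
        extends LogisticRegressionSummaryImpl(accuracy) with BinaryLogisticRegressionSummary
      class BinaryLogisticRegressionTrainingSummaryImpl(
          accuracy: Double, areaUnderROC: Double, val objectiveHistory: Array[Double])
        extends BinaryLogisticRegressionSummaryImpl(accuracy, areaUnderROC)
        with BinaryLogisticRegressionTrainingSummary
      ```

      Returning trait types keeps the `Impl` classes free to change, while callers that know the model is binary can still narrow to the binary traits.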
      
      ## How was this patch tested?
      
      Existing tests & added tests.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #15435 from WeichenXu123/mlor_summary.
      c7270a46
    • WeichenXu's avatar
      [SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSummarizer.variance generate negative result · 0456b405
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Because of numerical error, `MultivariateOnlineSummarizer.variance` can return a negative variance.

      **This is a serious bug because many algorithms in MLlib**
      **compute stddev as** `sqrt(variance)`,
      **which then yields NaN and crashes the whole algorithm.**

      The bug can be reproduced with the following code:
      ```
          val summarizer1 = (new MultivariateOnlineSummarizer)
            .add(Vectors.dense(3.0), 0.7)
          val summarizer2 = (new MultivariateOnlineSummarizer)
            .add(Vectors.dense(3.0), 0.4)
          val summarizer3 = (new MultivariateOnlineSummarizer)
            .add(Vectors.dense(3.0), 0.5)
          val summarizer4 = (new MultivariateOnlineSummarizer)
            .add(Vectors.dense(3.0), 0.4)
      
          val summarizer = summarizer1
            .merge(summarizer2)
            .merge(summarizer3)
            .merge(summarizer4)
      
          println(summarizer.variance(0))
      ```
      This PR fixes the bug in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and in several places in `WeightedLeastSquares`.
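
      A common guard against this failure mode is to clamp the computed variance at zero before taking the square root. The sketch below shows the guard on a simple weighted variance; the formula, `SafeVariance`, and its methods are assumptions made for illustration and are not the summarizer's actual computation.

      ```scala
      object SafeVariance {
        // Weighted population variance via E[x^2] - E[x]^2, clamped at zero.
        // Each sample is a (value, weight) pair.
        def variance(samples: Seq[(Double, Double)]): Double = {
          val weightSum = samples.map(_._2).sum
          val mean = samples.map { case (x, w) => w * x }.sum / weightSum
          val meanSq = samples.map { case (x, w) => w * x * x }.sum / weightSum
          // Rounding can push the difference slightly below zero for (near-)constant data.
          math.max(meanSq - mean * mean, 0.0)
        }

        // Safe to call: the clamp guarantees sqrt never sees a negative input.
        def stddev(samples: Seq[(Double, Double)]): Double = math.sqrt(variance(samples))
      }
      ```

      The clamp changes nothing when the variance is already non-negative, so it only affects inputs where rounding would otherwise produce NaN downstream.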
      
      ## How was this patch tested?
      
      test cases added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #19029 from WeichenXu123/fix_summarizer_var_bug.
      0456b405
  12. Aug 25, 2017
    • Sean Owen's avatar
      [MINOR][BUILD] Fix build warnings and Java lint errors · de7af295
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix build warnings and Java lint errors. This just helps a bit in evaluating (new) warnings in another PR I have open.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #19051 from srowen/JavaWarnings.
      de7af295
  13. Aug 24, 2017
  14. Aug 22, 2017
    • Weichen Xu's avatar
      [SPARK-12664][ML] Expose probability in mlp model · d6b30edd
      Weichen Xu authored
      ## What changes were proposed in this pull request?
      
      Modify the MLP model to inherit from `ProbabilisticClassificationModel` so that it can expose the probability column when transforming data.
      
      ## How was this patch tested?
      
      Test added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #17373 from WeichenXu123/expose_probability_in_mlp_model.
      d6b30edd
    • Yanbo Liang's avatar
      [ML][MINOR] Make sharedParams update. · 34296190
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      `sharedParams.scala` is generated by `SharedParamsCodeGen`, but the copy in master is out of date, perhaps because someone edited `sharedParams.scala` by hand. This PR fixes the issue.
      
      ## How was this patch tested?
      Offline check.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #19011 from yanboliang/sharedParams.
      34296190
    • Weichen Xu's avatar
      [SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero · d56c2621
      Weichen Xu authored
      ## What changes were proposed in this pull request?
      
      Fix a bug where MLOR does not work correctly when featureStd contains zero.

      The bug can be reproduced with a dataset like the following (the features include one with zero variance), which produces a wrong result (all coefficients become 0):
      ```
          val multinomialDatasetWithZeroVar = {
            val nPoints = 100
            val coefficients = Array(
              -0.57997, 0.912083, -0.371077,
              -0.16624, -0.84355, -0.048509)
      
            val xMean = Array(5.843, 3.0)
            val xVariance = Array(0.6856, 0.0)  // including zero variance
      
            val testData = generateMultinomialLogisticInput(
              coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)
      
            val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
            df.cache()
            df
          }
      ```
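
      One way to keep a zero standard deviation from corrupting the computation is to skip the division for that feature and treat its scaled value as 0. The sketch below shows only that guard; whether the patch handles it exactly this way is not stated in the description, and `SafeStandardize` and `scale` are hypothetical names.

      ```scala
      object SafeStandardize {
        // Scales each feature by its standard deviation.
        // A constant feature (std == 0) carries no information, so it maps to 0.0
        // instead of producing Infinity or NaN.
        def scale(features: Array[Double], featuresStd: Array[Double]): Array[Double] = {
          require(features.length == featuresStd.length)
          features.zip(featuresStd).map { case (x, std) => if (std != 0.0) x / std else 0.0 }
        }
      }
      ```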
      ## How was this patch tested?
      
      testcase added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #18896 from WeichenXu123/fix_mlor_stdvalue_zero_bug.
      d56c2621
  15. Aug 21, 2017
    • Yanbo Liang's avatar
      [SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization. · c108a5d3
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      MLlib `LinearRegression`/`LogisticRegression`/`LinearSVC` always standardize the data during training to improve the rate of convergence, regardless of whether _standardization_ is true or false. If _standardization_ is false, we reverse the standardization by penalizing each component differently, which gives effectively the same objective function as training on the unstandardized dataset. We should keep these comments in the code so that developers understand how this is handled.
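
      The per-component penalty can be written out explicitly. The notation here is introduced for illustration and is not taken from the code comments: feature $j$ has standard deviation $\sigma_j > 0$, training uses the scaled feature $\tilde{x}_j$ with coefficient $\tilde{w}_j$, $w_j$ is the coefficient on the original scale, and $\lambda$ is the regularization strength.

      ```latex
      \tilde{x}_j = \frac{x_j}{\sigma_j}, \qquad
      \tilde{w}_j \tilde{x}_j = w_j x_j \;\Rightarrow\; w_j = \frac{\tilde{w}_j}{\sigma_j}

      % standardization = true: penalize the scaled coefficients directly
      R_{\text{std}}(\tilde{w}) = \frac{\lambda}{2} \sum_j \tilde{w}_j^2

      % standardization = false: penalize the original-scale coefficients,
      % expressed in the scaled variables that the optimizer actually updates
      R_{\text{raw}}(\tilde{w}) = \frac{\lambda}{2} \sum_j w_j^2
        = \frac{\lambda}{2} \sum_j \frac{\tilde{w}_j^2}{\sigma_j^2}
      ```

      Dividing each squared coefficient by $\sigma_j^2$ is the "penalizing each component differently" step: the optimizer still works in the scaled space, but the objective matches the one defined on the unstandardized data.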
      
      ## How was this patch tested?
      Existing tests, only adding some comments in code.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18992 from yanboliang/SPARK-19762.
      c108a5d3
    • Nick Pentreath's avatar
      [SPARK-21468][PYSPARK][ML] Python API for FeatureHasher · 988b84d7
      Nick Pentreath authored
      Add Python API for `FeatureHasher` transformer.
      
      ## How was this patch tested?
      
      New doc test.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #18970 from MLnick/SPARK-21468-pyspark-hasher.
      988b84d7
  16. Aug 20, 2017
  17. Aug 16, 2017
    • Peng Meng's avatar
      [SPARK-21680][ML][MLLIB] optimize Vector compress · a0345cbe
      Peng Meng authored
      ## What changes were proposed in this pull request?
      
      When `Vector.compressed` is used to convert a Vector to a SparseVector, it is much slower than `Vector.toSparse`.
      This is because `Vector.compressed` scans the values three times, whereas `Vector.toSparse` needs only two scans.
      When the vector is long, the performance difference between the two methods is significant.
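
      The saving comes from counting the non-zeros once and reusing that count when building the sparse form, rather than counting again inside the conversion. A plain-Scala sketch of that structure follows. `CompressSketch`, its types, and `toSparseWithSize` are names used here for illustration, and the storage threshold is an assumption of this sketch rather than something stated in the description.

      ```scala
      object CompressSketch {
        sealed trait Vec
        final case class Dense(values: Array[Double]) extends Vec
        final case class Sparse(size: Int, indices: Array[Int], values: Array[Double]) extends Vec

        // Builds the sparse form when the non-zero count is already known,
        // so the values are scanned only once more.
        def toSparseWithSize(values: Array[Double], nnz: Int): Sparse = {
          val idx = new Array[Int](nnz)
          val vs = new Array[Double](nnz)
          var k = 0
          for (i <- values.indices if values(i) != 0.0) {
            idx(k) = i
            vs(k) = values(i)
            k += 1
          }
          Sparse(values.length, idx, vs)
        }

        def compressed(values: Array[Double]): Vec = {
          val nnz = values.count(_ != 0.0) // scan 1: count non-zeros
          // Assumed heuristic: go sparse only when it needs less storage than dense.
          if (1.5 * (nnz + 1.0) < values.length) toSparseWithSize(values, nnz) // scan 2: fill
          else Dense(values)
        }
      }
      ```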
      
      ## How was this patch tested?
      
      The existing UT
      
      Author: Peng Meng <peng.meng@intel.com>
      
      Closes #18899 from mpjlu/optVectorCompress.
      a0345cbe
    • Nick Pentreath's avatar
      [SPARK-13969][ML] Add FeatureHasher transformer · 0bb8d1f3
      Nick Pentreath authored
      This PR adds a `FeatureHasher` transformer, modeled on [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) and [Vowpal wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction).
      
      The transformer operates on multiple input columns in one pass. Current behavior is:
      * for numerical columns, the values are assumed to be real values and the feature index is `hash(column_name)`, while the feature value is `feature_value`
      * for string columns, the values are assumed to be categorical and the feature index is `hash(column_name=feature_value)`, while the feature value is `1.0`
      * for hash collisions, feature values are summed
      * `null` (missing) values are ignored
      
      The following dataframe illustrates the basic semantics:
      ```
      +---+------+-----+---------+------+-----------------------------------------+
      |int|double|float|stringNum|string|features                                 |
      +---+------+-----+---------+------+-----------------------------------------+
      |3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
      |6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
      +---+------+-----+---------+------+-----------------------------------------+
      ```
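
      The four rules above can be sketched in plain Scala. This is not the transformer's implementation: `FeatureHashSketch` and its types are hypothetical, missing values are modelled as `None`, and the hash function from the Scala standard library is a stand-in, so the indices it produces will not match the ones in the table.

      ```scala
      import scala.util.hashing.MurmurHash3

      object FeatureHashSketch {
        sealed trait Value
        final case class Num(v: Double) extends Value
        final case class Cat(v: String) extends Value

        // Maps a key to a bucket in [0, numFeatures).
        private def index(key: String, numFeatures: Int): Int = {
          val raw = MurmurHash3.stringHash(key) % numFeatures
          if (raw < 0) raw + numFeatures else raw
        }

        // Returns sparse (index -> value) entries for one row.
        def hashRow(row: Seq[(String, Option[Value])], numFeatures: Int): Map[Int, Double] = {
          val entries = row.collect {
            // Numeric column: index from the column name, value kept as is.
            case (col, Some(Num(v))) => index(col, numFeatures) -> v
            // String column: index from "column=value", value fixed at 1.0.
            case (col, Some(Cat(s))) => index(s"$col=$s", numFeatures) -> 1.0
            // None (missing) matches neither case and is dropped.
          }
          // Entries that collide on an index are summed.
          entries.groupBy(_._1).map { case (i, es) => i -> es.map(_._2).sum }
        }
      }
      ```

      Because collisions are summed, the total of the output values always equals the total of the inputs (numeric values plus 1.0 per categorical entry), whatever the bucket count.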
      
      ## How was this patch tested?
      
      New unit tests and manual experiments.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #18513 from MLnick/FeatureHasher.
      0bb8d1f3
    • Jan Vrsovsky's avatar
      [SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures) · 8321c141
      Jan Vrsovsky authored
      ## What changes were proposed in this pull request?
      
      Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception. After the change it will ignore the option completely. liancheng HyukjinKwon
      
      (Maybe the usage should be forbidden when writing, in a major version change?).
      
      ## How was this patch tested?
      
      Manual test, that loading and writing LibSVM files work fine, both with and without the numFeatures option.
      
      
      Author: Jan Vrsovsky <jan.vrsovsky@firma.seznam.cz>
      
      Closes #18872 from ProtD/master.
      8321c141
  18. Aug 15, 2017
    • WeichenXu's avatar
      [SPARK-19634][ML] Multivariate summarizer - dataframes API · 07549b20
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.
      
      ## How was this patch tested?
      
      Testcases added.
      
      ## Performance
      Resolves several performance issues in #17419; further optimization is pending on the SQL team's work. One SQL-layer performance issue related to this feature was resolved in #18712, thanks liancheng and cloud-fan.
      
      ### Performance data
      
      (Tested on my laptop with 2 partitions; tries = 20, warm-up = 10.)

      Test results are in records per millisecond (higher is better).
      
      |Vector size/number of records| 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000 |
      |----|------|----|---|----|----|
      |Dataframe | 15149 | 7441 | 2118 | 224 | 21 |
      |RDD from Dataframe | 4992 | 4440 | 2328 | 320 | 33 |
      |raw RDD | 53931 | 20683 | 3966 | 528 | 53 |
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.
      07549b20
    • Marcelo Vanzin's avatar
      [SPARK-21731][BUILD] Upgrade scalastyle to 0.9. · 3f958a99
      Marcelo Vanzin authored
      This version fixes a few issues in the import order checker; it provides
      better error messages, and detects more improper ordering (thus the need
      to change a lot of files in this patch). The main fix is that it correctly
      complains about the order of packages vs. classes.
      
      As part of the above, I moved some "SparkSession" import in ML examples
      inside the "$example on$" blocks; that didn't seem consistent across
      different source files to start with, and avoids having to add more on/off blocks
      around specific imports.
      
      The new scalastyle also seems to have a better header detector, so a few
      license headers had to be updated to match the expected indentation.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18943 from vanzin/SPARK-21731.
      3f958a99
  19. Aug 10, 2017
    • Peng Meng's avatar
      [SPARK-21638][ML] Fix RF/GBT Warning message error · ca695585
      Peng Meng authored
      ## What changes were proposed in this pull request?
      
      When training an RF model, there are many warning messages like this:
      
      > WARN  RandomForest: Tree learning is using approximately 268492800 bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. This allows splitting 2622 nodes in this iteration.
      
      This warning message is unnecessary and the data is not accurate.

      In fact, the warning is shown whenever not all nodes can be split in one iteration. In most cases the nodes cannot all be split in a single iteration, so the warning appears on nearly every iteration.
      
      ## How was this patch tested?
      The existing UT
      
      Author: Peng Meng <peng.meng@intel.com>
      
      Closes #18868 from mpjlu/fixRFwarning.
      ca695585
  20. Aug 09, 2017
  21. Aug 07, 2017
    • Ajay Saini's avatar
      [SPARK-21542][ML][PYTHON] Python persistence helper functions · fdcee028
      Ajay Saini authored
      ## What changes were proposed in this pull request?
      
      Added DefaultParamsWriteable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to Python to support Python-only persistence of JSON-serializable parameters.
      
      ## How was this patch tested?
      
      Instantiated an estimator with JSON-serializable parameters (e.g. LogisticRegression), saved it using the added helper functions, loaded it back, and compared it to the original instance to make sure it is the same. This test was both run in the Python REPL and implemented in the unit tests.
      
      Note to reviewers: there are a few excess comments that I left in the code for clarity but will remove before the code is merged to master.
      
      Author: Ajay Saini <ajays725@gmail.com>
      
      Closes #18742 from ajaysaini725/PythonPersistenceHelperFunctions.
      fdcee028
    • Peng Meng's avatar
      [SPARK-21623][ML] fix RF doc · 1426eea8
      Peng Meng authored
      ## What changes were proposed in this pull request?
      
      The comments on parentStats in RF are wrong.
      parentStats is not used only in the first iteration; it is used in every iteration for unordered features.
      
      ## How was this patch tested?
      
      Author: Peng Meng <peng.meng@intel.com>
      
      Closes #18832 from mpjlu/fixRFDoc.
      1426eea8
  22. Aug 06, 2017
  23. Aug 01, 2017
  24. Jul 31, 2017
    • wangmiao1981's avatar
      [SPARK-21381][SPARKR] SparkR: pass on setHandleInvalid for classification algorithms · 9570e81a
      wangmiao1981 authored
      ## What changes were proposed in this pull request?
      
      SPARK-20307 added a handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for the other classification algorithms in SparkR.
      
      This is a followup PR for SPARK-20307.
      
      ## How was this patch tested?
      
      New Unit tests are added.
      
      Author: wangmiao1981 <wm624@hotmail.com>
      
      Closes #18605 from wangmiao1981/class.
      9570e81a
  25. Jul 27, 2017
    • Yan Facai (颜发才)'s avatar
      [SPARK-21306][ML] OneVsRest should support setWeightCol · a5a31899
      Yan Facai (颜发才) authored
      ## What changes were proposed in this pull request?
      
      add `setWeightCol` method for OneVsRest.
      
      `weightCol` is ignored if classifier doesn't inherit HasWeightCol trait.
      
      ## How was this patch tested?
      
      + [x] add a unit test.
      
      Author: Yan Facai (颜发才) <facai.yan@gmail.com>
      
      Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.
      a5a31899
    • actuaryzhang's avatar
      [SPARK-19270][ML] Add summary table to GLM summary · ddcd2e82
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      
      Add an R-like summary table to the GLM summary, which includes the feature name (if it exists), parameter estimate, standard error, t-stat and p-value. This allows Scala users to easily gather these commonly used inference results.
      
      srowen yanboliang  felixcheung
      
      ## How was this patch tested?
      New tests: one for the feature names and one for the summary table.
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      Author: Wayne Zhang <actuaryzhang10@gmail.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #16630 from actuaryzhang/glmTable.
      ddcd2e82