Commits · 054ddb2f54ab8e6b0088fbf9d576c7770e5abcbf · cs525-sp18-g07 / spark

Sep 14, 2017

[SPARK-21854] Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API · 8d8641f1

Ming Jiang authored 7 years ago

## What changes were proposed in this pull request?

Added LogisticRegressionTrainingSummary for MultinomialLogisticRegression in Python API

## How was this patch tested?

Added unit test

Author: Ming Jiang <mjiang@fanatics.com>
Author: Ming Jiang <jmwdpk@gmail.com>
Author: jmwdpk <jmwdpk@gmail.com>

Closes #19185 from jmwdpk/SPARK-21854.

8d8641f1

Sep 13, 2017

[SPARK-21690][ML] one-pass imputer · 0fa5b7ca

Zheng RuiFeng authored 7 years ago

## What changes were proposed in this pull request?
parallelize the computation of all columns

performance tests:

|numColumns| Mean(Old) | Median(Old) | Mean(RDD) | Median(RDD) | Mean(DF) | Median(DF) |
|------|----------|------------|----------|------------|----------|------------|
|1|0.0771394713|0.0658712813|0.080779802|0.048165981499999996|0.10525509870000001|0.0499620203|
|10|0.7234340630999999|0.5954440414|0.0867935197|0.13263428659999998|0.09255724889999999|0.1573943635|
|100|7.3756451568|6.2196631259|0.1911931552|0.8625376817000001|0.5557462431|1.7216837982000002|
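
As a rough illustration of the one-pass idea (plain Python, not the Spark code from this patch), the sketch below accumulates a sum and a count for every column while walking the rows once, instead of scanning the data once per column. The function name and the treatment of `None`/NaN as missing are assumptions made for the example.

```python
import math


def column_means_one_pass(rows):
    """Return per-column means, skipping missing values, in a single pass over rows."""
    sums, counts = [], []
    for row in rows:
        if not sums:
            # Size the accumulators from the first row.
            sums = [0.0] * len(row)
            counts = [0] * len(row)
        for j, value in enumerate(row):
            # Assumption for this sketch: None and NaN both mean "missing".
            is_missing = value is None or (isinstance(value, float) and math.isnan(value))
            if not is_missing:
                sums[j] += value
                counts[j] += 1
    # A column with no observed values has no mean.
    return [s / c if c else None for s, c in zip(sums, counts)]
```

Each row is touched once regardless of how many columns are imputed, which is the property the table above measures.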

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #18902 from zhengruifeng/parallelize_imputer.

0fa5b7ca

[SPARK-21027][MINOR][FOLLOW-UP] add missing since tag · f6c5d8f6

WeichenXu authored 7 years ago

## What changes were proposed in this pull request?

add missing since tag for `setParallelism` in #19110

## How was this patch tested?

N/A

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19214 from WeichenXu123/minor01.

f6c5d8f6

Sep 12, 2017

[SPARK-18608][ML] Fix double caching · c5f9b89d

Zheng RuiFeng authored 7 years ago

## What changes were proposed in this pull request?
`df.rdd.getStorageLevel` => `df.storageLevel`

Used the command `find . -name '*.scala' | xargs -i bash -c 'egrep -in "\.rdd\.getStorageLevel" {} && echo {}'` to make sure all algorithms affected by this issue are fixed.

Previous discussion in other PRs: https://github.com/apache/spark/pull/19107, https://github.com/apache/spark/pull/17014

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #19197 from zhengruifeng/double_caching.

c5f9b89d

[SPARK-21027][ML][PYTHON] Added tunable parallelism to one vs. rest in both Scala mllib and Pyspark · 720c94fe

Ajay Saini authored 7 years ago

## What changes were proposed in this pull request?

Added tunable parallelism to the pyspark implementation of one vs. rest classification. Added a parallelism parameter to the Scala implementation of one vs. rest along with functionality for using the parameter to tune the level of parallelism.

I am taking over PR #18281 because the original author is busy and we need to merge this change soon.
After this is merged, we can close #18281.

## How was this patch tested?

Test suite added.

Author: Ajay Saini <ajays725@gmail.com>
Author: WeichenXu <weichen.xu@databricks.com>

Closes #19110 from WeichenXu123/spark-21027.

720c94fe

[SPARK-14516][ML] Adding ClusteringEvaluator with the implementation of Cosine silhouette and squared Euclidean silhouette. · dd781675

Marco Gaido authored 7 years ago

## What changes were proposed in this pull request?

This PR adds the ClusteringEvaluator Evaluator which contains two metrics:
 - **cosineSilhouette**: the Silhouette measure using the cosine distance;
 - **squaredSilhouette**: the Silhouette measure using the squared Euclidean distance.

The implementation of the two metrics refers to the algorithm proposed and explained [here](https://drive.google.com/file/d/0B0Hyo%5f%5fbG%5f3fdkNvSVNYX2E3ZU0/view). These algorithms are designed for a distributed, parallel environment, so they perform reasonably, unlike a naive Silhouette implementation that follows the definition directly.
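
To give a feel for why the squared Euclidean variant can avoid pairwise comparisons, here is a small plain-Python sketch. It relies on a standard algebraic identity: the mean squared distance from a point `x` to a cluster depends only on the cluster size, the sum of its vectors, and the sum of their squared norms. Whether this matches the linked write-up step for step is an assumption; the function names are made up for the example.

```python
def cluster_stats(points):
    """Precompute the size, the per-dimension sum, and the sum of squared norms of a cluster."""
    n = len(points)
    dim = len(points[0])
    vec_sum = [sum(p[d] for p in points) for d in range(dim)]
    sq_norm_sum = sum(sum(v * v for v in p) for p in points)
    return n, vec_sum, sq_norm_sum


def mean_sq_dist(x, stats):
    """Mean squared Euclidean distance from x to a cluster, using only its aggregates."""
    n, vec_sum, sq_norm_sum = stats
    x_sq = sum(v * v for v in x)
    dot = sum(a * b for a, b in zip(x, vec_sum))
    # (1/n) * sum_i ||x - c_i||^2 = ||x||^2 + (sum_i ||c_i||^2)/n - 2 * x.(sum_i c_i)/n
    return x_sq + sq_norm_sum / n - 2.0 * dot / n


def naive_mean_sq_dist(x, points):
    """Reference implementation that visits every point of the cluster."""
    return sum(sum((a - b) ** 2 for a, b in zip(x, p)) for p in points) / len(points)
```

The aggregates can be computed once per cluster, so scoring each point costs one dot product per cluster rather than one distance per other point.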

## How was this patch tested?

The patch has been tested with additional unit tests that compare the results with those provided by the [Python sklearn library](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.silhouette_score.html).

Author: Marco Gaido <mgaido@hortonworks.com>

Closes #18538 from mgaido91/SPARK-14516.

dd781675

Sep 11, 2017

[MINOR][SQL] remove unuse import class · dc74c0e6

caoxuewen authored 7 years ago

## What changes were proposed in this pull request?

This PR removes imported classes that are unused.

## How was this patch tested?

N/A

Author: caoxuewen <cao.xuewen@zte.com.cn>

Closes #19131 from heary-cao/unuse_import.

dc74c0e6

Sep 08, 2017

[SPARK-19866][ML][PYSPARK] Add local version of Word2Vec findSynonyms for spark.ml: Python API · 31c74fec

Xin Ren authored 7 years ago

https://issues.apache.org/jira/browse/SPARK-19866

## What changes were proposed in this pull request?

Add Python API for findSynonymsArray matching Scala API.

## How was this patch tested?

Manual test
`./python/run-tests --python-executables=python2.7 --modules=pyspark-ml`

Author: Xin Ren <iamshrek@126.com>
Author: Xin Ren <renxin.ubc@gmail.com>
Author: Xin Ren <keypointt@users.noreply.github.com>

Closes #17451 from keypointt/SPARK-19866.

31c74fec

Sep 06, 2017

[SPARK-19357][ML] Adding parallel model evaluation in ML tuning · 16c4c03c

Bryan Cutler authored 7 years ago

## What changes were proposed in this pull request?
Modified `CrossValidator` and `TrainValidationSplit` to be able to evaluate models in parallel for a given parameter grid.  The level of parallelism is controlled by a parameter `numParallelEval` used to schedule a number of models to be trained/evaluated so that the jobs can be run concurrently.  This is a naive approach that does not check the cluster for needed resources, so care must be taken by the user to tune the parameter appropriately.  The default value is `1` which will train/evaluate in serial.
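
The scheduling idea can be sketched outside Spark with a thread pool whose size plays the role of the parallelism parameter. This is only an illustration under stated assumptions: `evaluate_grid`, `toy_score`, and the `regParam` grid are hypothetical names, and the scoring function stands in for "fit a model, then evaluate it".

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_grid(param_grid, fit_and_score, parallelism=1):
    """Score every parameter setting, running at most `parallelism` fits at once."""
    if parallelism < 1:
        raise ValueError("parallelism must be >= 1")
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # pool.map keeps results in input order, so serial and parallel runs agree.
        scores = list(pool.map(fit_and_score, param_grid))
    best = max(range(len(param_grid)), key=lambda i: scores[i])
    return param_grid[best], scores


def toy_score(params):
    # Hypothetical stand-in for training and evaluating one model.
    return -(params["regParam"] - 0.1) ** 2


grid = [{"regParam": r} for r in (0.01, 0.1, 1.0)]
serial_best, serial_scores = evaluate_grid(grid, toy_score, parallelism=1)
parallel_best, parallel_scores = evaluate_grid(grid, toy_score, parallelism=3)
```

As in the description above, nothing here checks whether the machine has the resources for the requested parallelism; choosing the value is left to the caller.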

## How was this patch tested?
Added unit tests for CrossValidator and TrainValidationSplit to verify that model selection is the same when run in serial vs parallel.  Manual testing to verify tasks run in parallel when param is > 1. Added parameter usage to relevant examples.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #16774 from BryanCutler/parallel-model-eval-SPARK-19357.

16c4c03c

Sep 01, 2017

[SPARK-21729][ML][TEST] Generic test for ProbabilisticClassifier to ensure consistent output columns · 900f14f6

WeichenXu authored 7 years ago

## What changes were proposed in this pull request?

Add test for prediction using the model with all combinations of output columns turned on/off.
Make sure the output column values match, presumably by comparing vs. the case with all 3 output columns turned on.

## How was this patch tested?

Test updated.

Author: WeichenXu <weichen.xu@databricks.com>
Author: WeichenXu <WeichenXu123@outlook.com>

Closes #19065 from WeichenXu123/generic_test_for_prob_classifier.

900f14f6

[SPARK-14280][BUILD][WIP] Update change-version.sh and pom.xml to add Scala 2.12 profiles and enable 2.12 compilation · 12ab7f7e

Sean Owen authored 7 years ago

…build; fix some things that will be warnings or errors in 2.12; restore Scala 2.12 profile infrastructure

## What changes were proposed in this pull request?

This change adds back the infrastructure for a Scala 2.12 build, but does not enable it in the release or Python test scripts.

In order to make that meaningful, it also resolves compile errors that the code hits in 2.12 only, in a way that still works with 2.11.

It also updates dependencies to the earliest minor release of dependencies whose current version does not yet support Scala 2.12. This is in a sense covered by other JIRAs under the main umbrella, but implemented here. The versions below still work with 2.11, and are the _latest_ maintenance release in the _earliest_ viable minor release.

- Scalatest 2.x -> 3.0.3
- Chill 0.8.0 -> 0.8.4
- Clapper 1.0.x -> 1.1.2
- json4s 3.2.x -> 3.4.2
- Jackson 2.6.x -> 2.7.9 (required by json4s)

This change does _not_ fully enable a Scala 2.12 build:

- It will also require dropping support for Kafka before 0.10. Easy enough, just didn't do it yet here
- It will require recreating `SparkILoop` and `Main` for REPL 2.12, which is SPARK-14650. Possible to do here too.

What it does do is make changes that resolve much of the remaining gap without affecting the current 2.11 build.

## How was this patch tested?

Existing tests and build. Manually tested with `./dev/change-scala-version.sh 2.12` to verify it compiles, modulo the exceptions above.

Author: Sean Owen <sowen@cloudera.com>

Closes #18645 from srowen/SPARK-14280.

12ab7f7e

Aug 31, 2017

[SPARK-21862][ML] Add overflow check in PCA · f5e10a34

WeichenXu authored 7 years ago

## What changes were proposed in this pull request?

Add an overflow check in PCA; otherwise a `NegativeArraySizeException` can be thrown when `k` and `numFeatures` are too large.
The overflow checking formula is here:
https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/svd.scala#L87
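
The failure mode can be illustrated in plain Python: a size computed in 32-bit signed arithmetic wraps around to a negative number, while the same computation in wider arithmetic can be compared against the 32-bit limit first. The size formula below is a made-up placeholder for illustration only; the real one is in the Breeze source linked above.

```python
INT_MAX = 2**31 - 1


def to_int32(value):
    """Wrap an integer the way 32-bit signed JVM arithmetic would."""
    value &= 0xFFFFFFFF
    return value - 2**32 if value >= 2**31 else value


def workspace_size(k, num_features):
    # Hypothetical formula, NOT the one used by Breeze or Spark.
    m = min(k, num_features)
    return 3 * m * m + max(k, num_features)


def check_no_overflow(k, num_features):
    """Compute the size without wrapping and reject values that do not fit in an int."""
    size = workspace_size(k, num_features)  # Python ints are arbitrary precision
    if size > INT_MAX:
        raise ValueError(
            f"k={k} and numFeatures={num_features} need {size} elements, "
            f"which exceeds {INT_MAX}"
        )
    return size
```

With the placeholder formula, `to_int32(workspace_size(30000, 30000))` is negative, which is the kind of value that would surface as a `NegativeArraySizeException` on the JVM, whereas `check_no_overflow(30000, 30000)` fails early with a readable message.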

## How was this patch tested?

N/A

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19078 from WeichenXu123/SVD_overflow_check.

f5e10a34

[SPARK-17139][ML][FOLLOW-UP] Add convenient method `asBinary` for casting to BinaryLogisticRegressionSummary · 96028e36

WeichenXu authored 7 years ago

## What changes were proposed in this pull request?

Add an `asBinary` method to `LogisticRegressionSummary` for convenient casting to `BinaryLogisticRegressionSummary`.

## How was this patch tested?

Testcase updated.

Author: WeichenXu <weichen.xu@databricks.com>

Closes #19072 from WeichenXu123/mlor_summary_as_binary.

96028e36

Aug 30, 2017

[SPARK-21469][ML][EXAMPLES] Adding Examples for FeatureHasher · 4133c1b0

Bryan Cutler authored 7 years ago

## What changes were proposed in this pull request?

This PR adds ML examples for the FeatureHasher transformer in Scala, Java, and Python.

## How was this patch tested?

Manually ran examples and verified that output is consistent for different APIs

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #19024 from BryanCutler/ml-examples-FeatureHasher-SPARK-21810.

4133c1b0

[SPARK-21806][MLLIB] BinaryClassificationMetrics pr(): first point (0.0, 1.0) is misleading · 734ed7a7

Sean Owen authored 7 years ago

## What changes were proposed in this pull request?

Prepend (0, p) to the precision-recall curve instead of (0, 1), where p is the precision at the lowest-recall point.
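
A minimal plain-Python sketch of the new behavior, assuming the curve is given as `(recall, precision)` pairs (the function name is made up for the example):

```python
def pr_curve(points):
    """Sort (recall, precision) pairs by recall and prepend (0.0, p).

    p is the precision of the lowest-recall point, so the curve starts with a
    flat segment instead of jumping down from an artificial (0.0, 1.0).
    """
    ordered = sorted(points)
    if not ordered:
        return []
    first_precision = ordered[0][1]
    return [(0.0, first_precision)] + ordered
```

For example, a curve whose lowest-recall point is `(0.25, 0.8)` now starts at `(0.0, 0.8)` rather than `(0.0, 1.0)`.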

## How was this patch tested?

Updated tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #19038 from srowen/SPARK-21806.

734ed7a7

Aug 29, 2017

[MINOR][ML] Document treatment of instance weights in logreg summary · 840ba053

Joseph K. Bradley authored 7 years ago

## What changes were proposed in this pull request?

Add Scaladoc noting that instance weights are currently ignored in the logistic regression summary traits.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #19071 from jkbradley/lr-summary-minor.

840ba053

Aug 28, 2017

[SPARK-17139][ML] Add model summary for MultinomialLogisticRegression · c7270a46

Weichen Xu authored 7 years ago

## What changes were proposed in this pull request?

Add 4 traits, using the following hierarchy:
- LogisticRegressionSummary
- LogisticRegressionTrainingSummary: LogisticRegressionSummary
- BinaryLogisticRegressionSummary: LogisticRegressionSummary
- BinaryLogisticRegressionTrainingSummary: LogisticRegressionTrainingSummary, BinaryLogisticRegressionSummary

Public methods such as `def summary` return only the trait types listed above.

Then implement 4 concrete classes:
- LogisticRegressionSummaryImpl (multiclass case)
- LogisticRegressionTrainingSummaryImpl (multiclass case)
- BinaryLogisticRegressionSummaryImpl (binary case)
- BinaryLogisticRegressionTrainingSummaryImpl (binary case)

## How was this patch tested?

Existing tests & added tests.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #15435 from WeichenXu123/mlor_summary.

c7270a46

[SPARK-21818][ML][MLLIB] Fix bug of MultivariateOnlineSummarizer.variance generate negative result · 0456b405

WeichenXu authored 7 years ago

## What changes were proposed in this pull request?

Because of numerical error, `MultivariateOnlineSummarizer.variance` can produce a negative variance.

**This is a serious bug because many algorithms in MLlib**
**use a stddev computed from** `sqrt(variance)`**;**
**a negative variance yields NaN and crashes the whole algorithm.**

We can reproduce this bug using the following code:
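```
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.stat.MultivariateOnlineSummarizer

    // Note: the weighted add(vector, weight) overload is private[spark],
    // so this snippet runs from Spark's own sources or tests.
    val summarizer1 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.7)
    val summarizer2 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)
    val summarizer3 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.5)
    val summarizer4 = (new MultivariateOnlineSummarizer)
      .add(Vectors.dense(3.0), 0.4)

    // Every sample is 3.0, so the true variance is exactly 0.
    val summarizer = summarizer1
      .merge(summarizer2)
      .merge(summarizer3)
      .merge(summarizer4)

    println(summarizer.variance(0))
```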
This PR fixes the bugs in `mllib.stat.MultivariateOnlineSummarizer.variance` and `ml.stat.SummarizerBuffer.variance`, and in several places in `WeightedLeastSquares`.
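
The description does not say how the fix is done. One common guard, shown here as an assumption rather than as the patch itself, is to clamp the variance at zero before taking the square root:

```python
import math


def safe_std(variance):
    """Standard deviation that treats tiny negative variances from rounding as zero."""
    return math.sqrt(max(variance, 0.0))
```

Note that `math.sqrt` of a negative number raises `ValueError` in Python, whereas on the JVM it returns NaN, which is the value that then propagates through the algorithms mentioned above.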

## How was this patch tested?

test cases added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #19029 from WeichenXu123/fix_summarizer_var_bug.

0456b405

Aug 25, 2017

[MINOR][BUILD] Fix build warnings and Java lint errors · de7af295

Sean Owen authored 7 years ago

## What changes were proposed in this pull request?

Fix build warnings and Java lint errors. This just helps a bit in evaluating (new) warnings in another PR I have open.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #19051 from srowen/JavaWarnings.

de7af295

Aug 24, 2017

[SPARK-21108][ML] convert LinearSVC to aggregator framework · f3676d63

Yuhao Yang authored 7 years ago

## What changes were proposed in this pull request?

convert LinearSVC to new aggregator framework

## How was this patch tested?

existing unit test.

Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #18315 from hhbyyh/svcAggregator.

f3676d63

Aug 22, 2017

[SPARK-12664][ML] Expose probability in mlp model · d6b30edd

Weichen Xu authored 7 years ago

## What changes were proposed in this pull request?

Modify the MLP model to inherit from `ProbabilisticClassificationModel` so that it can expose the probability column when transforming data.

## How was this patch tested?

Test added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #17373 from WeichenXu123/expose_probability_in_mlp_model.

d6b30edd

[ML][MINOR] Make sharedParams update. · 34296190

Yanbo Liang authored 7 years ago

## What changes were proposed in this pull request?
`sharedParams.scala` is generated by `SharedParamsCodeGen`, but the copy in master is out of date. Someone may have edited `sharedParams.scala` manually; this PR fixes the issue.

## How was this patch tested?
Offline check.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #19011 from yanboliang/sharedParams.

34296190

[SPARK-21681][ML] fix bug of MLOR do not work correctly when featureStd contains zero · d56c2621

Weichen Xu authored 7 years ago

## What changes were proposed in this pull request?

Fix a bug where MLOR does not work correctly when featureStd contains zero.

The bug can be reproduced with a dataset like the following, which includes a zero-variance feature; it produces a wrong result (all coefficients become 0):
```
    val multinomialDatasetWithZeroVar = {
      val nPoints = 100
      val coefficients = Array(
        -0.57997, 0.912083, -0.371077,
        -0.16624, -0.84355, -0.048509)

      val xMean = Array(5.843, 3.0)
      val xVariance = Array(0.6856, 0.0)  // including zero variance

      val testData = generateMultinomialLogisticInput(
        coefficients, xMean, xVariance, addIntercept = true, nPoints, seed)

      val df = sc.parallelize(testData, 4).toDF().withColumn("weight", lit(1.0))
      df.cache()
      df
    }
```
## How was this patch tested?

testcase added.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18896 from WeichenXu123/fix_mlor_stdvalue_zero_bug.

d56c2621

Aug 21, 2017

[SPARK-19762][ML][FOLLOWUP] Add necessary comments to L2Regularization. · c108a5d3

Yanbo Liang authored 7 years ago

## What changes were proposed in this pull request?
MLlib `LinearRegression`/`LogisticRegression`/`LinearSVC` always standardizes the data during training to improve the rate of convergence, regardless of whether _standardization_ is true or false. If _standardization_ is false, we perform reverse standardization by penalizing each component differently, which gives effectively the same objective function as training on the unstandardized dataset. We should keep these comments in the code so that developers understand how this is handled.
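
A small plain-Python sketch of the idea, under stated assumptions: training on features scaled by `1/sigma_j` means the optimizer sees coefficients `w_std_j = w_orig_j * sigma_j`, so penalizing `w_orig_j ** 2` is the same as penalizing `(w_std_j / sigma_j) ** 2`. The function name and argument layout are made up for the example, and features with zero standard deviation are not handled here.

```python
def l2_penalty(coef_std, reg_param, features_std, standardization):
    """L2 loss and gradient for coefficients expressed in the standardized feature space."""
    loss = 0.0
    grad = []
    for w, sigma in zip(coef_std, features_std):
        # With standardization=False, undo the scaling so that the penalty
        # effectively applies to the coefficient on the original feature scale.
        scale = 1.0 if standardization else 1.0 / (sigma * sigma)
        loss += 0.5 * reg_param * scale * w * w
        grad.append(reg_param * scale * w)
    return loss, grad
```

With `standardization=False`, the loss equals `0.5 * reg_param * sum(w_orig_j ** 2)`, which is the penalty one would have written for the unstandardized problem.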

## How was this patch tested?
Existing tests, only adding some comments in code.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #18992 from yanboliang/SPARK-19762.

c108a5d3

[SPARK-21468][PYSPARK][ML] Python API for FeatureHasher · 988b84d7

Nick Pentreath authored 7 years ago

Add Python API for `FeatureHasher` transformer.

## How was this patch tested?

New doc test.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18970 from MLnick/SPARK-21468-pyspark-hasher.

988b84d7

Aug 20, 2017

[MINOR] Correct validateAndTransformSchema in GaussianMixture and AFTSurvivalRegression · 73e04ecc

Cédric Pelvet authored 7 years ago

## What changes were proposed in this pull request?

The line `SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)` did not modify the variable `schema`, hence only the last line had any effect. A temporary variable is used to correctly append the two columns `predictionCol` and `probabilityCol`.
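
The same pitfall can be reproduced in plain Python with an immutable tuple standing in for the schema. All names below are hypothetical; the point is only that a function returning a new value has no effect when its result is discarded.

```python
def append_column(schema, name, data_type):
    """Return a new schema with the column added; the input tuple is left untouched."""
    return schema + ((name, data_type),)


def buggy(schema):
    append_column(schema, "prediction", "int")  # result discarded, so nothing changes
    return append_column(schema, "probability", "vector")


def fixed(schema):
    # Keep the intermediate schema in a temporary variable and build on it.
    with_prediction = append_column(schema, "prediction", "int")
    return append_column(with_prediction, "probability", "vector")
```

Starting from a schema with a single `features` column, `buggy` returns only `features` and `probability`, while `fixed` returns all three columns.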

## How was this patch tested?

Manually.

Author: Cédric Pelvet <cedric.pelvet@gmail.com>

Closes #18980 from sharp-pixel/master.

73e04ecc

Aug 16, 2017

[SPARK-21680][ML][MLLIB] optimize Vector compress · a0345cbe

Peng Meng authored 7 years ago

## What changes were proposed in this pull request?

When Vector.compressed is used to convert a Vector to a SparseVector, performance is much lower than with Vector.toSparse.
This is because Vector.compressed scans the values three times, whereas Vector.toSparse needs only two scans.
When the vector is long, the performance difference between the two methods is significant.
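
One way to avoid the extra scan, sketched here in plain Python, is to count the non-zeros once and reuse that count both for the sparse-versus-dense decision and for building the sparse form. The function name, the tuple layout, and the `1.5 * (nnz + 1.0) < size` threshold are assumptions made for the example, not a statement of what the patch does.

```python
def compress(values):
    """Pick a sparse or dense representation with two scans at most."""
    size = len(values)
    nnz = sum(1 for v in values if v != 0.0)  # scan 1: count non-zeros once
    # Assumed size heuristic: go sparse only when it should take less space.
    if 1.5 * (nnz + 1.0) < size:
        indices, nonzeros = [], []
        for i, v in enumerate(values):  # scan 2: collect the non-zeros
            if v != 0.0:
                indices.append(i)
                nonzeros.append(v)
        return ("sparse", size, indices, nonzeros)
    return ("dense", list(values))
```

The count from the first scan is not recomputed inside the conversion, which is where the saving comes from.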

## How was this patch tested?

The existing UT

Author: Peng Meng <peng.meng@intel.com>

Closes #18899 from mpjlu/optVectorCompress.

a0345cbe

[SPARK-13969][ML] Add FeatureHasher transformer · 0bb8d1f3

Nick Pentreath authored 7 years ago

This PR adds a `FeatureHasher` transformer, modeled on [scikit-learn](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.FeatureHasher.html) and [Vowpal wabbit](https://github.com/JohnLangford/vowpal_wabbit/wiki/Feature-Hashing-and-Extraction).

The transformer operates on multiple input columns in one pass. Current behavior is:
* For numerical columns, the values are assumed to be real values; the feature index is `hash(column_name)` and the feature value is `feature_value`
* For string columns, the values are assumed to be categorical; the feature index is `hash(column_name=feature_value)` and the feature value is `1.0`
* For hash collisions, feature values are summed
* `null` (missing) values are ignored

The following dataframe illustrates the basic semantics:
```
+---+------+-----+---------+------+-----------------------------------------+
|int|double|float|stringNum|string|features                                 |
+---+------+-----+---------+------+-----------------------------------------+
|3  |4.0   |5.0  |1        |foo   |(16,[0,8,11,12,15],[5.0,3.0,1.0,4.0,1.0])|
|6  |7.0   |8.0  |2        |bar   |(16,[0,8,11,12,15],[8.0,6.0,1.0,7.0,1.0])|
+---+------+-----+---------+------+-----------------------------------------+
```
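
The rules above can be mimicked in a few lines of plain Python. This is an illustrative sketch only: it uses `zlib.crc32` as a stand-in hash, so the indices it produces will not match the table above, and the function names are made up for the example.

```python
import zlib


def bucket(key, num_features):
    """Map a string key to a feature index (crc32 is a stand-in hash for this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % num_features


def hash_features(row, num_features=16):
    """Hash a dict of column -> value into a sparse {index: value} mapping."""
    features = {}
    for column, value in row.items():
        if value is None:
            continue  # missing values are ignored
        if isinstance(value, str):
            # Categorical: index from "column=value", feature value 1.0.
            index, amount = bucket(f"{column}={value}", num_features), 1.0
        else:
            # Numerical: index from the column name, feature value is the number.
            index, amount = bucket(column, num_features), float(value)
        # Colliding features are summed.
        features[index] = features.get(index, 0.0) + amount
    return dict(sorted(features.items()))
```

Because collisions are summed, the total of all feature values is the same whatever the number of buckets; with a single bucket everything lands on index 0.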

## How was this patch tested?

New unit tests and manual experiments.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #18513 from MLnick/FeatureHasher.

0bb8d1f3

[SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures) · 8321c141

Jan Vrsovsky authored 7 years ago

## What changes were proposed in this pull request?

Check the option "numFeatures" only when reading LibSVM, not when writing. When writing, Spark was raising an exception. After the change it will ignore the option completely. liancheng HyukjinKwon

(Maybe the usage should be forbidden when writing, in a major version change?).

## How was this patch tested?

Manually tested that loading and writing LibSVM files works, both with and without the numFeatures option.

Author: Jan Vrsovsky <jan.vrsovsky@firma.seznam.cz>

Closes #18872 from ProtD/master.

8321c141

Aug 15, 2017

[SPARK-19634][ML] Multivariate summarizer - dataframes API · 07549b20

WeichenXu authored 7 years ago

## What changes were proposed in this pull request?

This patch adds the DataFrames API to the multivariate summarizer (mean, variance, etc.). In addition to all the features of MultivariateOnlineSummarizer, it also allows the user to select a subset of the metrics.

## How was this patch tested?

Testcases added.

## Performance
Resolves several performance issues from #17419; further optimization is pending on the SQL team's work. One SQL-layer performance issue related to this feature has been resolved in #18712; thanks to liancheng and cloud-fan.

### Performance data

(Tested on my laptop using 2 partitions; tries = 20, warm-up = 10.)

Test results are in records per millisecond (higher is better).

Vector size/records number | 1/10000000 | 10/1000000 | 100/1000000 | 1000/100000 | 10000/10000
----|------|----|---|----|----
Dataframe | 15149  | 7441 | 2118 | 224 | 21
RDD from Dataframe | 4992  | 4440 | 2328 | 320 | 33
raw RDD | 53931  | 20683 | 3966 | 528 | 53

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18798 from WeichenXu123/SPARK-19634-dataframe-summarizer.

07549b20

[SPARK-21731][BUILD] Upgrade scalastyle to 0.9. · 3f958a99

Marcelo Vanzin authored 7 years ago

This version fixes a few issues in the import order checker; it provides
better error messages, and detects more improper ordering (thus the need
to change a lot of files in this patch). The main fix is that it correctly
complains about the order of packages vs. classes.

As part of the above, I moved some "SparkSession" import in ML examples
inside the "$example on$" blocks; that didn't seem consistent across
different source files to start with, and avoids having to add more on/off blocks
around specific imports.

The new scalastyle also seems to have a better header detector, so a few
license headers had to be updated to match the expected indentation.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #18943 from vanzin/SPARK-21731.

3f958a99

Aug 10, 2017

[SPARK-21638][ML] Fix RF/GBT Warning message error · ca695585

Peng Meng authored 7 years ago

## What changes were proposed in this pull request?

When training an RF model, many warning messages like this appear:

> WARN  RandomForest: Tree learning is using approximately 268492800 bytes per iteration, which exceeds requested limit maxMemoryUsage=268435456. This allows splitting 2622 nodes in this iteration.

This warning message is unnecessary and the data in it is not accurate.

In fact, the warning is shown whenever not all nodes can be split in one iteration. In most cases not all nodes can be split in a single iteration, so the warning is shown on nearly every iteration.

## How was this patch tested?
The existing UT

Author: Peng Meng <peng.meng@intel.com>

Closes #18868 from mpjlu/fixRFwarning.

ca695585

Aug 09, 2017

[SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search · b35660dd

WeichenXu authored 7 years ago

## What changes were proposed in this pull request?

Update breeze to 0.13.2 for an emergency bugfix in strong wolfe line search
https://github.com/scalanlp/breeze/pull/651

## How was this patch tested?

N/A

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #18797 from WeichenXu123/update-breeze.

b35660dd

Aug 07, 2017

[SPARK-21542][ML][PYTHON] Python persistence helper functions · fdcee028

Ajay Saini authored 7 years ago

## What changes were proposed in this pull request?

Added DefaultParamsWritable, DefaultParamsReadable, DefaultParamsWriter, and DefaultParamsReader to Python to support Python-only persistence of JSON-serializable parameters.

## How was this patch tested?

Instantiated an estimator with JSON-serializable parameters (e.g. LogisticRegression), saved it using the added helper functions, loaded it back, and compared it to the original instance to make sure it is the same. This test was done both in the Python REPL and in the unit tests.

Note to reviewers: there are a few excess comments that I left in the code for clarity but will remove before the code is merged to master.

Author: Ajay Saini <ajays725@gmail.com>

Closes #18742 from ajaysaini725/PythonPersistenceHelperFunctions.

fdcee028

[SPARK-21623][ML] fix RF doc · 1426eea8

Peng Meng authored 7 years ago

## What changes were proposed in this pull request?

The comments on parentStats in RF are wrong.
parentStats is used not only in the first iteration; it is used in every iteration for unordered features.

## How was this patch tested?

Author: Peng Meng <peng.meng@intel.com>

Closes #18832 from mpjlu/fixRFDoc.

1426eea8

Aug 06, 2017

[SPARK-21622][ML][SPARKR] Support offset in SparkR GLM · 55aa4da2

actuaryzhang authored 7 years ago

## What changes were proposed in this pull request?
Support offset in SparkR GLM #16699

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #18831 from actuaryzhang/sparkROffset.

55aa4da2

Aug 01, 2017

[SPARK-21388][ML][PYSPARK] GBTs inherit from HasStepSize & LInearSVC from HasThreshold · 253a07e4

Zheng RuiFeng authored 7 years ago

## What changes were proposed in this pull request?
GBTs inherit from HasStepSize & LinearSVC/Binarizer from HasThreshold

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>
Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #18612 from zhengruifeng/override_HasXXX.

253a07e4

Jul 31, 2017

[SPARK-21381][SPARKR] SparkR: pass on setHandleInvalid for classification algorithms · 9570e81a

wangmiao1981 authored 7 years ago

## What changes were proposed in this pull request?

SPARK-20307 Added handleInvalid option to RFormula for tree-based classification algorithms. We should add this parameter for other classification algorithms in SparkR.

This is a followup PR for SPARK-20307.

## How was this patch tested?

New Unit tests are added.

Author: wangmiao1981 <wm624@hotmail.com>

Closes #18605 from wangmiao1981/class.

9570e81a

Jul 27, 2017

[SPARK-21306][ML] OneVsRest should support setWeightCol · a5a31899

Yan Facai (颜发才) authored 7 years ago

## What changes were proposed in this pull request?

add `setWeightCol` method for OneVsRest.

`weightCol` is ignored if classifier doesn't inherit HasWeightCol trait.

## How was this patch tested?

+ [x] add a unit test.

Author: Yan Facai (颜发才) <facai.yan@gmail.com>

Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.

a5a31899

[SPARK-19270][ML] Add summary table to GLM summary · ddcd2e82

actuaryzhang authored 7 years ago

## What changes were proposed in this pull request?

Add an R-like summary table to the GLM summary, which includes the feature name (if it exists), parameter estimate, standard error, t-stat and p-value. This allows Scala users to easily gather these commonly used inference results.

srowen yanboliang  felixcheung

## How was this patch tested?
New tests: one for testing the feature names, and one for testing the summary table.

Author: actuaryzhang <actuaryzhang10@gmail.com>
Author: Wayne Zhang <actuaryzhang10@gmail.com>
Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16630 from actuaryzhang/glmTable.

ddcd2e82