  1. Aug 24, 2016
    • Xin Ren's avatar
      [SPARK-16445][MLLIB][SPARKR] Multilayer Perceptron Classifier wrapper in SparkR · 2fbdb606
      Xin Ren authored
      https://issues.apache.org/jira/browse/SPARK-16445
      
      ## What changes were proposed in this pull request?
      
      Create Multilayer Perceptron Classifier wrapper in SparkR
      
      ## How was this patch tested?
      
      Tested manually on local machine
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14447 from keypointt/SPARK-16445.
      2fbdb606
    • VinceShieh's avatar
      [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer... · 92c0eaf3
      VinceShieh authored
      [SPARK-17086][ML] Fix InvalidArgumentException issue in QuantileDiscretizer when some quantiles are duplicated
      
      ## What changes were proposed in this pull request?
      
      In cases where QuantileDiscretizer is called on a numeric array with duplicated elements, we take the unique elements generated from approxQuantile as input for Bucketizer.
      
      ## How was this patch tested?
      
      A unit test is added in QuantileDiscretizerSuite.
      
      QuantileDiscretizer.fit will throw an IllegalArgumentException when setSplits is called on a list of splits
      with duplicated elements. Bucketizer.setSplits should only accept a numeric vector of two
      or more unique cut points, although that may produce fewer buckets than requested.
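
      A minimal sketch of the deduplication idea (the helper, column names, and relative error are illustrative, not the code in this patch):

      ```scala
      import org.apache.spark.ml.feature.Bucketizer
      import org.apache.spark.sql.DataFrame

      // Sketch only: compute approximate quantiles, drop duplicated boundaries, and feed
      // the unique splits to Bucketizer. Fewer buckets than requested may result.
      def fitDiscretizer(df: DataFrame, inputCol: String, numBuckets: Int): Bucketizer = {
        val probabilities = (0 to numBuckets).map(_.toDouble / numBuckets).toArray
        val quantiles = df.stat.approxQuantile(inputCol, probabilities, 0.001)
        val splits = (Double.NegativeInfinity +: quantiles :+ Double.PositiveInfinity).distinct.sorted
        require(splits.length >= 3, "need at least two distinct cut points")
        new Bucketizer()
          .setInputCol(inputCol)
          .setOutputCol(s"${inputCol}_bucket")
          .setSplits(splits)
      }
      ```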
      
      Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #14747 from VinceShieh/SPARK-17086.
      92c0eaf3
  2. Aug 23, 2016
    • Zheng RuiFeng's avatar
      [TRIVIAL] Typo Fix · 6555ef0c
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Fix a typo
      
      ## How was this patch tested?
      no tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #14772 from zhengruifeng/minor_numClasses.
      6555ef0c
    • Jagadeesan's avatar
      [SPARK-17095] [Documentation] [Latex and Scala doc do not play nicely] · 97d461b7
      Jagadeesan authored
      ## What changes were proposed in this pull request?
      
      In LaTeX, it is common to find "}}}" when closing several expressions at once. [SPARK-16822](https://issues.apache.org/jira/browse/SPARK-16822) added MathJax to render LaTeX equations in Scaladoc. However, when Scaladoc sees "}}}" or "{{{" it treats it as a special marker for a code block. This results in some very strange output.
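
      A hypothetical illustration of the clash (not the actual doc comment touched by this patch):

      ```scala
      /**
       * MathJax should render the Gaussian density below, but Scaladoc also scans comments
       * for {{{ ... }}} wiki syntax, so the run of closing braces at the end of the
       * exponent can be misread as code-block markup and the rendered page comes out garbled.
       *
       * $$
       * f(x) = \frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^{2}}{2\sigma^{2}}}
       * $$
       */
      object LatexScaladocExample
      ```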
      
      Author: Jagadeesan <as2@us.ibm.com>
      
      Closes #14688 from jagadeesanas2/SPARK-17095.
      97d461b7
  3. Aug 22, 2016
    • hqzizania's avatar
      [SPARK-17090][FOLLOW-UP][ML] Add expert param support to SharedParamsCodeGen · 37f0ab70
      hqzizania authored
      ## What changes were proposed in this pull request?
      
      Add expert param support to SharedParamsCodeGen, where aggregationDepth, an expert param, is added.
      
      Author: hqzizania <hqzizania@gmail.com>
      
      Closes #14738 from hqzizania/SPARK-17090-minor.
      37f0ab70
    • Holden Karau's avatar
      [SPARK-15113][PYSPARK][ML] Add missing num features num classes · b264cbb1
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Add missing `numFeatures` and `numClasses` to the wrapped Java models in PySpark ML pipelines. Also tag `DecisionTreeClassificationModel` as Experimental to match the Scala doc.
      
      ## How was this patch tested?
      
      Extended doctests
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12889 from holdenk/SPARK-15113-add-missing-numFeatures-numClasses.
      b264cbb1
    • Wenchen Fan's avatar
      [SPARK-16498][SQL] move hive hack for data source table into HiveExternalCatalog · b2074b66
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Spark SQL doesn't have its own metastore yet and currently uses Hive's. However, Hive's metastore has some limitations (e.g. the number of columns is capped, names are not case-preserving, decimal type support is poor, etc.), so we have some hacks to successfully store data source table metadata in the Hive metastore, i.e. we put all the information in table properties.
      
      This PR moves these hacks into `HiveExternalCatalog`, trying to isolate Hive-specific logic in one place.
      
      changes overview:
      
      1.  **Before this PR**: we had to put data source table metadata (schema, partition columns, etc.) into table properties before saving it to the external catalog, even when the external catalog doesn't use the Hive metastore (e.g. `InMemoryCatalog`).
      **After this PR**: the table-properties tricks are only in `HiveExternalCatalog`; the caller side no longer needs to take care of them.
      
      2. **Before this PR**: because the table-properties tricks were done outside the external catalog, we also had to revert them when reading table metadata back from the external catalog, e.g. `DescribeTableCommand` read the schema and partition columns from table properties.
      **After this PR**: the table metadata read from the external catalog is exactly the same as what we saved to it.
      
      Bonus: now we can create data source tables using `SessionCatalog`, if a schema is specified.
      Breaks: `schemaStringLengthThreshold` is not configurable anymore. `hive.default.rcfile.serde` is not configurable anymore.
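
      A schematic sketch of the table-properties encoding that now lives only inside `HiveExternalCatalog` (the property keys and chunking threshold are illustrative):

      ```scala
      import org.apache.spark.sql.types.StructType

      // Sketch only: the schema JSON is split into chunks so that every table property
      // value stays below the metastore's length limit, and the number of chunks is
      // recorded so the schema can be reassembled on read.
      def schemaToTableProperties(schema: StructType, threshold: Int): Map[String, String] = {
        val parts = schema.json.grouped(threshold).toSeq
        Map("spark.sql.sources.schema.numParts" -> parts.size.toString) ++
          parts.zipWithIndex.map { case (part, i) => s"spark.sql.sources.schema.part.$i" -> part }
      }
      ```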
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #14155 from cloud-fan/catalog-table.
      b2074b66
  4. Aug 20, 2016
    • hqzizania's avatar
      [SPARK-17090][ML] Make tree aggregation level in linear/logistic regression configurable · 61ef74f2
      hqzizania authored
      ## What changes were proposed in this pull request?
      
      Linear/logistic regression use treeAggregate with the default depth (always 2) for collecting coefficient gradient updates on the driver. For high-dimensional problems, this can cause an OOM error on the driver. This patch makes the aggregation depth configurable to avoid this problem when users' input data has many features. It adds a HasTreeDepth API in `sharedParams.scala` and extends it to both linear regression and logistic regression in .ml.
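
      A hedged usage sketch (the `setAggregationDepth` setter name follows the `aggregationDepth` expert param in the follow-up above and is illustrative here, not quoted from this patch):

      ```scala
      import org.apache.spark.ml.classification.LogisticRegression

      // Raise the treeAggregate depth above the default of 2 when the coefficient vector
      // is wide enough that a two-level aggregation would overwhelm the driver.
      val lr = new LogisticRegression()
        .setMaxIter(100)
        .setRegParam(0.01)
        .setAggregationDepth(4)
      ```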
      
      Author: hqzizania <hqzizania@gmail.com>
      
      Closes #14717 from hqzizania/SPARK-17090.
      61ef74f2
  5. Aug 19, 2016
    • Junyang Qian's avatar
      [SPARK-16443][SPARKR] Alternating Least Squares (ALS) wrapper · acac7a50
      Junyang Qian authored
      ## What changes were proposed in this pull request?
      
      Add Alternating Least Squares wrapper in SparkR. Unit tests have been updated.
      
      ## How was this patch tested?
      
      SparkR unit tests.
      
      
      ![screen shot 2016-07-27 at 3 50 31 pm](https://cloud.githubusercontent.com/assets/15318264/17195347/f7a6352a-5411-11e6-8e21-61a48070192a.png)
      ![screen shot 2016-07-27 at 3 50 46 pm](https://cloud.githubusercontent.com/assets/15318264/17195348/f7a7d452-5411-11e6-845f-6d292283bc28.png)
      
      Author: Junyang Qian <junyangq@databricks.com>
      
      Closes #14384 from junyangq/SPARK-16443.
      acac7a50
    • Yanbo Liang's avatar
      [SPARK-17141][ML] MinMaxScaler should preserve NaN values. · 864be935
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      In the existing code, ```MinMaxScaler``` handles ```NaN``` values indeterminately.
      * If a column is constant, that is, ```max == min```, the ```MinMaxScalerModel``` transformation will output ```0.5``` for all rows even if the original value is ```NaN```.
      * Otherwise, the value will remain ```NaN``` after transformation.

      I think we should unify the behavior by preserving ```NaN``` values in all conditions, since we don't know how to transform a ```NaN``` value. In Python's scikit-learn, an exception is thrown when there are ```NaN``` values in the dataset.
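
      A minimal sketch of the intended per-element behavior, assuming the usual MinMaxScaler rescaling formula with target range `[lower, upper]` (not the patch itself):

      ```scala
      // NaN inputs are passed through untouched instead of being rescaled or mapped to
      // the midpoint of the target range.
      def rescale(x: Double, colMin: Double, colMax: Double, lower: Double, upper: Double): Double = {
        if (x.isNaN) Double.NaN
        else if (colMax == colMin) 0.5 * (upper + lower)   // constant column: range midpoint
        else (x - colMin) / (colMax - colMin) * (upper - lower) + lower
      }
      ```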
      
      ## How was this patch tested?
      Unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14716 from yanboliang/spark-17141.
      864be935
    • sethah's avatar
      [SPARK-7159][ML] Add multiclass logistic regression to Spark ML · 287bea13
      sethah authored
      ## What changes were proposed in this pull request?
      
      This patch adds a new estimator/transformer `MultinomialLogisticRegression` to spark ML.
      
      JIRA: [SPARK-7159](https://issues.apache.org/jira/browse/SPARK-7159)
      
      ## How was this patch tested?
      
      Added new test suite `MultinomialLogisticRegressionSuite`.
      
      ## Approach
      
      ### Do not use a "pivot" class in the algorithm formulation
      
      Many implementations of multinomial logistic regression treat the problem as K - 1 independent binary logistic regression models, where K is the number of possible outcomes in the output variable. In this case, one outcome is chosen as a "pivot" and the other K - 1 outcomes are regressed against the pivot. This is somewhat undesirable since the coefficients returned will be different for different choices of pivot variables. An alternative approach to the problem models class conditional probabilities using the softmax function and will return uniquely identifiable coefficients (assuming regularization is applied). This second approach is used in R's glmnet and was also recommended by dbtsai.
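
      For reference, the softmax class-conditional probabilities referred to above (standard definition, not quoted from the patch):

      ```latex
      P(y = k \mid \mathbf{x}) =
        \frac{\exp(\boldsymbol{\beta}_k^{T} \mathbf{x})}
             {\sum_{j=1}^{K} \exp(\boldsymbol{\beta}_j^{T} \mathbf{x})},
        \qquad k = 1, \dots, K
      ```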
      
      ### Separate multinomial logistic regression and binary logistic regression
      
      The initial design makes multinomial logistic regression a separate estimator/transformer than the existing LogisticRegression estimator/transformer. An alternative design would be to merge them into one.
      
      **Arguments for:**
      
      * The multinomial case without pivot is distinctly different from the current binary case since the binary case uses a pivot class.
      * The current logistic regression model in ML uses a vector of coefficients and a scalar intercept. In the multinomial case, we require a matrix of coefficients and a vector of intercepts. There are potential workarounds for this issue if we were to merge the two estimators, but none are particularly elegant.
      
      **Arguments against:**
      
      * It may be inconvenient for users to have to switch the estimator class when transitioning between binary and multiclass (although the new multinomial estimator can be used for two-class outcomes).
      * Some portions of the code are repeated.
      
      This is a major design point and warrants more discussion.
      
      ### Mean centering
      
      When no regularization is applied, the coefficients will not be uniquely identifiable. This is not hard to show and is discussed in further detail [here](https://core.ac.uk/download/files/153/6287975.pdf). R's glmnet deals with this by choosing the minimum l2 regularized solution (i.e. mean centering). Additionally, the intercepts are never regularized so they are always mean centered. This is the approach taken in this PR as well.
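
      Concretely, adding any constant vector to every class's coefficient vector leaves the softmax probabilities unchanged, so the minimum-l2 (mean-centered) representative subtracts the class mean (standard identity, not quoted from the patch):

      ```latex
      \boldsymbol{\beta}_k \leftarrow \boldsymbol{\beta}_k - \frac{1}{K} \sum_{j=1}^{K} \boldsymbol{\beta}_j,
        \qquad k = 1, \dots, K
      ```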
      
      ### Feature scaling
      
      In current ML logistic regression, the features are always standardized when running the optimization algorithm. They are always returned to the user in the original feature space, however. This same approach is maintained in this patch as well, but the implementation details are different. In ML logistic regression, the unregularized feature values are divided by the column standard deviation in every gradient update iteration. In contrast, MLlib transforms the entire input dataset to the scaled space _before_ optimization. In ML, this means that `numFeatures * numClasses` extra scalar divisions are required in every iteration. Performance testing shows that this causes significant (4x in some cases) slowdowns in each iteration. This can be avoided by transforming the input to the scaled space once, as MLlib does, before iteration begins. This does add some overhead initially, but can yield significant time savings in some cases.
      
      One issue with this approach is that if the input data is already cached, there may not be enough memory to cache the transformed data, which would make the algorithm _much_ slower. The tradeoffs here merit more discussion.
      
      ### Specifying and inferring the number of outcome classes
      
      The estimator checks the dataframe label column for metadata which specifies the number of values. If they are not specified, the length of the `histogram` variable is used, which is essentially the maximum value found in the column. The assumption then, is that the labels are zero-indexed when they are provided to the algorithm.
      
      ## Performance
      
      Below are some performance tests I have run so far. I am happy to add more cases or trials if we deem them necessary.
      
      Test cluster: 4 bare metal nodes, 128 GB RAM each, 48 cores each
      
      Notes:
      
      * Time in units of seconds
      * Metric is classification accuracy
      
      | algo   |   elasticNetParam | fitIntercept   |   metric |   maxIter |   numPoints |   numClasses |   numFeatures |    time | standardization   |   regParam |
      |--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
      | ml     |                 0 | true           | 0.746415 |        30 |      100000 |            3 |        100000 | 327.923 | true              |          0 |
      | mllib  |                 0 | true           | 0.743785 |        30 |      100000 |            3 |        100000 | 390.217 | true              |          0 |
      
      | algo   |   elasticNetParam | fitIntercept   |   metric |   maxIter |   numPoints |   numClasses |   numFeatures |    time | standardization   |   regParam |
      |--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
      | ml     |                 0 | true           | 0.973238 |        30 |     2000000 |            3 |         10000 | 385.476 | true              |          0 |
      | mllib  |                 0 | true           | 0.949828 |        30 |     2000000 |            3 |         10000 | 550.403 | true              |          0 |
      
      | algo   |   elasticNetParam | fitIntercept   |   metric |   maxIter |   numPoints |   numClasses |   numFeatures |    time | standardization   |   regParam |
      |--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
      | mllib  |                 0 | true           | 0.864358 |        30 |     2000000 |            3 |         10000 | 543.359 | true              |        0.1 |
      | ml     |                 0 | true           | 0.867418 |        30 |     2000000 |            3 |         10000 | 401.955 | true              |        0.1 |
      
      | algo   |   elasticNetParam | fitIntercept   |   metric |   maxIter |   numPoints |   numClasses |   numFeatures |    time | standardization   |   regParam |
      |--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
      | ml     |                 1 | true           | 0.807449 |        30 |     2000000 |            3 |         10000 | 334.892 | true              |       0.05 |
      
      | algo   |   elasticNetParam | fitIntercept   |   metric |   maxIter |   numPoints |   numClasses |   numFeatures |    time | standardization   |   regParam |
      |--------|-------------------|----------------|----------|-----------|-------------|--------------|---------------|---------|-------------------|------------|
      | ml     |                 0 | true           | 0.602006 |        30 |     2000000 |          500 |           100 | 112.319 | true              |          0 |
      | mllib  |                 0 | true           | 0.567226 |        30 |     2000000 |          500 |           100 | 263.768 | true              |          0 |
      
      ## References
      
      Friedman, et al. ["Regularization Paths for Generalized Linear Models via Coordinate Descent"](https://core.ac.uk/download/files/153/6287975.pdf)
      [http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html](http://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html)
      
      ## Follow up items
      * Consider using level 2 BLAS routines in the gradient computations - [SPARK-17134](https://issues.apache.org/jira/browse/SPARK-17134)
      * Add model summary for MLOR - [SPARK-17139](https://issues.apache.org/jira/browse/SPARK-17139)
      * Add initial model to MLOR and add test for intercept priors - [SPARK-17140](https://issues.apache.org/jira/browse/SPARK-17140)
      * Python API - [SPARK-17138](https://issues.apache.org/jira/browse/SPARK-17138)
      * Consider changing the tree aggregation level for MLOR/BLOR or making it user configurable to avoid memory problems with high dimensional data - [SPARK-17090](https://issues.apache.org/jira/browse/SPARK-17090)
      * Refactor helper classes out of `LogisticRegression.scala` - [SPARK-17135](https://issues.apache.org/jira/browse/SPARK-17135)
      * Design optimizer interface for added flexibility in ML algos - [SPARK-17136](https://issues.apache.org/jira/browse/SPARK-17136)
      * Support compressing the coefficients and intercepts for MLOR models - [SPARK-17137](https://issues.apache.org/jira/browse/SPARK-17137)
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #13796 from sethah/SPARK-7159_M.
      287bea13
  6. Aug 18, 2016
    • Xusen Yin's avatar
      [SPARK-16447][ML][SPARKR] LDA wrapper in SparkR · b72bb62d
      Xusen Yin authored
      ## What changes were proposed in this pull request?
      
      Add LDA Wrapper in SparkR with the following interfaces:
      
      - spark.lda(data, ...)
      
      - spark.posterior(object, newData, ...)
      
      - spark.perplexity(object, ...)
      
      - summary(object)
      
      - write.ml(object)
      
      - read.ml(path)
      
      ## How was this patch tested?
      
      Tested with SparkR unit tests.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #14229 from yinxusen/SPARK-16447.
      b72bb62d
  7. Aug 17, 2016
    • Yanbo Liang's avatar
      [SPARK-16446][SPARKR][ML] Gaussian Mixture Model wrapper in SparkR · 4d92af31
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Gaussian Mixture Model wrapper in SparkR, similar to R's ```mvnormalmixEM```.
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14392 from yanboliang/spark-16446.
      4d92af31
    • wm624@hotmail.com's avatar
      [SPARK-16444][SPARKR] Isotonic Regression wrapper in SparkR · 363793f2
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      
      Add Isotonic Regression wrapper in SparkR:

      * Wrappers in R and Scala are added
      * Unit tests
      * Documentation
      
      ## How was this patch tested?
      Manually tested with `sudo ./R/run-tests.sh`.
      
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #14182 from wangmiao1981/isoR.
      363793f2
  8. Aug 15, 2016
  9. Aug 14, 2016
  10. Aug 12, 2016
    • Yanbo Liang's avatar
      [SPARK-17033][ML][MLLIB] GaussianMixture should use treeAggregate to improve performance · bbae20ad
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      ```GaussianMixture``` should use ```treeAggregate``` rather than ```aggregate``` to improve performance and scalability. In my test on a dataset with 200 features and 1M instances, I found a 20% performance increase.
      BTW, we should destroy the broadcast variable ```compute``` at the end of each iteration.
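
      A minimal sketch of the shape of the change (the array aggregator stands in for the real per-iteration sufficient statistics; names are illustrative):

      ```scala
      import org.apache.spark.rdd.RDD

      // aggregate() ships every partition's partial sum straight to the driver;
      // treeAggregate() combines them in log-depth stages on the executors first,
      // which matters when the per-partition state (means, covariances) is large.
      def sumStats(data: RDD[Array[Double]], dim: Int): Array[Double] =
        data.treeAggregate(new Array[Double](dim))(
          seqOp = (acc, x) => { var i = 0; while (i < dim) { acc(i) += x(i); i += 1 }; acc },
          combOp = (a, b) => { var i = 0; while (i < dim) { a(i) += b(i); i += 1 }; a },
          depth = 2
        )
      ```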
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14621 from yanboliang/spark-17033.
      bbae20ad
  11. Aug 10, 2016
    • Yanbo Liang's avatar
      [SPARK-16710][SPARKR][ML] spark.glm should support weightCol · d4a91224
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Training GLMs on weighted datasets is a very important use case, but it is not currently supported by SparkR. In native R, users can pass the ```weights``` argument to specify the weights vector. For ```spark.glm```, we can pass in ```weightCol```, which is consistent with MLlib.
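
      For reference, a sketch of the underlying Scala estimator that the R wrapper forwards to (column names are illustrative; the R-side signature is not shown here):

      ```scala
      import org.apache.spark.ml.regression.GeneralizedLinearRegression

      // Each row's contribution to the loss is scaled by the value in "freq_weight".
      val glm = new GeneralizedLinearRegression()
        .setFamily("gaussian")
        .setLink("identity")
        .setLabelCol("label")
        .setFeaturesCol("features")
        .setWeightCol("freq_weight")
      ```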
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14346 from yanboliang/spark-16710.
      d4a91224
  12. Aug 09, 2016
  13. Aug 08, 2016
  14. Aug 05, 2016
    • Yanbo Liang's avatar
      [SPARK-16750][FOLLOW-UP][ML] Add transformSchema for... · 6cbde337
      Yanbo Liang authored
      [SPARK-16750][FOLLOW-UP][ML] Add transformSchema for StringIndexer/VectorAssembler and fix failed tests.
      
      ## What changes were proposed in this pull request?
      This is a follow-up for #14378. When adding ```transformSchema``` to all estimators and transformers, I found that some tests failed for ```StringIndexer``` and ```VectorAssembler```. So I moved these parts of the work into a separate PR, to make it clearer to review.
      The corresponding tests should throw ```IllegalArgumentException``` during schema validation after we add ```transformSchema```. It's more efficient to throw the exception at the start of ```fit``` or ```transform``` rather than partway through the job.
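
      A schematic sketch of the early-validation pattern (illustrative helper, not the StringIndexer/VectorAssembler code in this PR):

      ```scala
      import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

      // Fail fast at the start of fit()/transform(): a bad input schema raises
      // IllegalArgumentException (via require) before any distributed work is launched.
      def validateAndTransformSchema(schema: StructType, inputCol: String, outputCol: String): StructType = {
        val field = schema(inputCol)   // throws if the column does not exist
        require(field.dataType == StringType,
          s"Input column $inputCol must be of type StringType but was ${field.dataType}.")
        require(!schema.fieldNames.contains(outputCol), s"Output column $outputCol already exists.")
        StructType(schema.fields :+ StructField(outputCol, DoubleType, nullable = false))
      }
      ```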
      
      ## How was this patch tested?
      Modified unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14455 from yanboliang/transformSchema.
      6cbde337
  15. Aug 04, 2016
    • Zheng RuiFeng's avatar
      [SPARK-16863][ML] ProbabilisticClassifier.fit check thresholds' length · 0e2e5d7d
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      Add thresholds length checking for classifiers which extend ProbabilisticClassifier.
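
      A minimal sketch of the kind of check being added (names and message are illustrative, not the patch itself):

      ```scala
      // Fail fast if the user supplies a thresholds vector whose length does not match
      // the number of classes the classifier was trained for.
      def checkThresholds(thresholds: Array[Double], numClasses: Int): Unit = {
        require(thresholds.length == numClasses,
          s"ProbabilisticClassifier was given thresholds of length ${thresholds.length}" +
            s" but numClasses is $numClasses.")
      }
      ```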
      
      ## How was this patch tested?
      
      unit tests and manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #14470 from zhengruifeng/classifier_check_setThreshoulds_length.
      0e2e5d7d
    • WeichenXu's avatar
      [SPARK-16880][ML][MLLIB] make ann training data persisted if needed · 462784ff
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Make sure the ANN layer's input training data is persisted,
      so that we avoid the overhead of recomputing the RDD from its lineage.
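
      A minimal sketch of the guard, under the assumption that we only persist when the caller has not (names are illustrative):

      ```scala
      import org.apache.spark.rdd.RDD
      import org.apache.spark.storage.StorageLevel

      // Persist the training data only if it is not already persisted, so repeated passes
      // during optimization do not recompute the RDD from its lineage.
      def persistIfNeeded[T](data: RDD[T]): Boolean = {
        val notYetPersisted = data.getStorageLevel == StorageLevel.NONE
        if (notYetPersisted) data.persist(StorageLevel.MEMORY_AND_DISK)
        notYetPersisted  // caller remembers whether it owns the persistence and must unpersist later
      }
      ```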
      
      ## How was this patch tested?
      
      Existing Tests.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14483 from WeichenXu123/add_ann_persist_training_data.
      462784ff
  16. Aug 02, 2016
  17. Aug 01, 2016
    • Shuai Lin's avatar
      [SPARK-16485][DOC][ML] Remove useless LaTeX in a log message. · 2a0de7dc
      Shuai Lin authored
      ## What changes were proposed in this pull request?
      
      Removed useless LaTeX in a log message.
      
      ## How was this patch tested?
      
      Check generated scaladoc.
      
      Author: Shuai Lin <linshuai2012@gmail.com>
      
      Closes #14380 from lins05/fix-docs-formatting.
      2a0de7dc
  18. Jul 30, 2016
    • WeichenXu's avatar
      [SPARK-16696][ML][MLLIB] destroy KMeans bcNewCenters when loop finished and... · bce354c1
      WeichenXu authored
      [SPARK-16696][ML][MLLIB] destroy KMeans bcNewCenters when loop finished and update code where should release unused broadcast/RDD in proper time
      
      ## What changes were proposed in this pull request?
      
      Update the unused broadcasts in KMeans/Word2Vec and use destroy(false) to release memory in time.

      In several places, update destroy() to destroy(false) so that it is called asynchronously, which is better than a blocking call.

      Also update bcNewCenters in KMeans so that it is destroyed at the correct time: a list stores all the historical `bcNewCenters` generated in each loop iteration, and their release is delayed until the end of the loop.

      Finally, fix the TODO in `BisectingKMeans.run` ("unpersist old indices") by implementing the pattern "persist the current step's RDD, and unpersist the previous one" in the loop iteration; a simplified schematic follows below.
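
      A simplified schematic of this broadcast lifecycle (not the actual KMeans code; the per-iteration job is a placeholder):

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.broadcast.Broadcast
      import org.apache.spark.rdd.RDD

      // Index of the center closest to p (squared Euclidean distance).
      def nearestCenter(p: Array[Double], cs: Array[Array[Double]]): Int =
        cs.indices.minBy(i => cs(i).zip(p).map { case (c, x) => (c - x) * (c - x) }.sum)

      // Keep each iteration's broadcast alive until the next iteration's job has run,
      // then destroy it. (Inside Spark itself the non-blocking destroy(false) overload
      // is used so the release does not stall the loop.)
      def iterate(sc: SparkContext, points: RDD[Array[Double]], initial: Array[Array[Double]], iters: Int): Unit = {
        var centers = initial   // in the real code, updated from each iteration's result
        var previous: Option[Broadcast[Array[Array[Double]]]] = None
        for (_ <- 0 until iters) {
          val bcCenters = sc.broadcast(centers)
          // Placeholder for the real assignment/update job; count() forces evaluation.
          points.map(p => nearestCenter(p, bcCenters.value)).count()
          previous.foreach(_.destroy())   // the old centers can no longer be referenced
          previous = Some(bcCenters)
        }
        previous.foreach(_.destroy())
      }
      ```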
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14333 from WeichenXu123/broadvar_unpersist_to_destroy.
      bce354c1
    • Sean Owen's avatar
      [SPARK-16694][CORE] Use for/foreach rather than map for Unit expressions whose... · 0dc4310b
      Sean Owen authored
      [SPARK-16694][CORE] Use for/foreach rather than map for Unit expressions whose side effects are required
      
      ## What changes were proposed in this pull request?
      
      Use foreach/for instead of map where the operation requires executing the body for its side effects rather than actually defining a transformation.
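
      A minimal before/after illustration of the rule (generic example, not a diff from this patch):

      ```scala
      val names = Seq("alpha", "beta", "gamma")

      // Before: map is a transformation, so the Seq[Unit] it builds is silently discarded
      // and the side effect is the only thing that matters.
      names.map(n => println(n))

      // After: foreach states the intent directly and allocates nothing.
      names.foreach(n => println(n))
      ```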
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14332 from srowen/SPARK-16694.
      0dc4310b
  19. Jul 29, 2016
    • Yanbo Liang's avatar
      [SPARK-16750][ML] Fix GaussianMixture training failed due to feature column type mistake · 0557a454
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      ML ```GaussianMixture``` training failed due to a feature column type mistake: the feature column type should be ```ml.linalg.VectorUDT``` but was ```mllib.linalg.VectorUDT```.
      See [SPARK-16750](https://issues.apache.org/jira/browse/SPARK-16750) for how to reproduce this bug.
      Why didn't the unit tests complain about this error? Because some estimators/transformers missed calling ```transformSchema(dataset.schema)``` first during ```fit``` or ```transform```. In this PR I also add this call to all estimators/transformers that missed it.
      
      ## How was this patch tested?
      No new tests, should pass existing ones.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14378 from yanboliang/spark-16750.
      0557a454
  20. Jul 27, 2016
    • krishnakalyan3's avatar
      [SPARK-15254][DOC] Improve ML pipeline Cross Validation Scaladoc & PyDoc · 7e8279fd
      krishnakalyan3 authored
      ## What changes were proposed in this pull request?
      Updated ML pipeline Cross Validation Scaladoc & PyDoc.
      
      ## How was this patch tested?
      
      Documentation update
      
      (If this patch involves UI changes, please attach a screenshot; otherwise, remove this)
      
      Author: krishnakalyan3 <krishnakalyan3@gmail.com>
      
      Closes #13894 from krishnakalyan3/kfold-cv.
      7e8279fd
    • Yanbo Liang's avatar
      [MINOR][ML] Fix some mistake in LinearRegression formula. · 3c3371bb
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Fix some mistakes in the ```LinearRegression``` formula.
      
      ## How was this patch tested?
      Documents change, no tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #14369 from yanboliang/LiR-formula.
      3c3371bb
  21. Jul 26, 2016
    • WeichenXu's avatar
      [SPARK-16697][ML][MLLIB] improve LDA submitMiniBatch method to avoid redundant RDD computation · 4c969559
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      In `LDAOptimizer.submitMiniBatch`, persist `stats: RDD[(BDM[Double], List[BDV[Double]])]`,
      and also move the point where the `expElogbetaBc` broadcast variable is unpersisted,
      to avoid `expElogbetaBc` being unpersisted too early.
      Also update the previous `expElogbetaBc.unpersist()` to `expElogbetaBc.destroy(false)`.
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14335 from WeichenXu123/improve_LDA.
      4c969559
  22. Jul 25, 2016
    • WeichenXu's avatar
      [SPARK-16653][ML][OPTIMIZER] update ANN convergence tolerance param default to 1e-6 · ad3708e7
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Replace the ANN convergence tolerance param default of 1e-4 with 1e-6,
      so that it matches the other algorithms in MLlib that use LBFGS as the optimizer.
      
      ## How was this patch tested?
      
      Existing Test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14286 from WeichenXu123/update_ann_tol.
      ad3708e7
  23. Jul 23, 2016
    • WeichenXu's avatar
      [SPARK-16561][MLLIB] fix multivarOnlineSummary min/max bug · 25db5167
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Rename variables to make the code clearer:
      nnz => weightSum
      weightSum => totalWeightSum

      Add a new member vector `nnz` (not the `nnz` in the previous code, which is renamed to `weightSum`) to count the number of non-zero values in each dimension.
      Use this new `nnz` instead of `weightSum` when calculating min/max, which fixes several numerical errors in some extreme cases.
      
      ## How was this patch tested?
      
      A new test case was added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #14216 from WeichenXu123/multivarOnlineSummary.
      25db5167
  24. Jul 20, 2016
    • Anthony Truchet's avatar
      [SPARK-16440][MLLIB] Destroy broadcasted variables even on driver · 0dc79ffd
      Anthony Truchet authored
      ## What changes were proposed in this pull request?
      Forgotten broadcast variables were unpersisted in a previous PR (#14153). This PR turns those `unpersist()` calls into `destroy()` so that memory is freed even on the driver.
      
      ## How was this patch tested?
      Unit Tests in Word2VecSuite were run locally.
      
      This contribution is done on behalf of Criteo, according to the
      terms of the Apache license 2.0.
      
      Author: Anthony Truchet <a.truchet@criteo.com>
      
      Closes #14268 from AnthonyTruchet/SPARK-16440.
      0dc79ffd
  25. Jul 19, 2016