  1. Apr 24, 2017
  2. Apr 21, 2017
    • WeichenXu's avatar
      [SPARK-20423][ML] fix MLOR coeffs centering when reg == 0 · eb00378f
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      When reg == 0, MLOR has multiple solutions, and we need to center the coefficients to get a consistent result.
      However, the current implementation centers the `coefficientMatrix` by the global mean of the coefficients.
      
      In fact, the `coefficientMatrix` should be centered on each feature index separately.
      According to the MLOR probability distribution function, it can easily be proven that
      if `{ w0, w1, ..., w(K-1) }` make up the `coefficientMatrix`,
      then `{ w0 + c, w1 + c, ..., w(K-1) + c }` is an equivalent solution,
      where `c` is an arbitrary vector of dimension `numFeatures`.
      Reference: https://core.ac.uk/download/pdf/6287975.pdf
      
      So we need to center the `coefficientMatrix` on each feature dimension separately.
      
      **We can also confirm this through the R library `glmnet`: MLOR in `glmnet` always generates coefficients whose sum over each dimension is zero when reg == 0.**
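
      The invariance argument above can be checked numerically. The following is a minimal pure-Python sketch, not the Spark implementation: the helper names `softmax_probs` and `center_per_feature` and the sample numbers are illustrative assumptions, and intercepts are ignored for brevity.

```python
import math

def softmax_probs(weights, x):
    """Class probabilities for one sample; weights[k][j] is the coefficient of class k, feature j."""
    margins = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    m = max(margins)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in margins]
    total = sum(exps)
    return [e / total for e in exps]

def center_per_feature(weights):
    """For each feature index, subtract the mean of that feature's coefficients over all classes."""
    num_classes = len(weights)
    num_features = len(weights[0])
    means = [sum(weights[k][j] for k in range(num_classes)) / num_classes
             for j in range(num_features)]
    return [[weights[k][j] - means[j] for j in range(num_features)]
            for k in range(num_classes)]

# Hypothetical coefficients: K = 3 classes, numFeatures = 2
weights = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
x = [0.7, -1.3]
centered = center_per_feature(weights)
```

      Subtracting the per-feature means is the same as adding one common vector `c` to every class, so `softmax_probs(weights, x)` and `softmax_probs(centered, x)` agree up to floating-point error, and each feature column of `centered` sums to zero.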
      
      ## How was this patch tested?
      
      Tests added.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #17706 from WeichenXu123/mlor_center.
      eb00378f
  3. Apr 13, 2017
    • Syrux's avatar
      [SPARK-20265][MLLIB] Improve PrefixSpan pre-processing efficiency · 095d1cb3
      Syrux authored
      ## What changes were proposed in this pull request?
      
      Improve PrefixSpan pre-processing efficiency by preventing sequences of zeros in the cleaned database.
      The efficiency gain is reflected in the following graph : https://postimg.org/image/9x6ireuvn/
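
      As a rough illustration of the idea, the sketch below assumes that the cleaned database encodes each sequence as a flat integer list with `0` as the itemset delimiter, and that infrequent items are dropped during cleaning. The function name `clean_sequence` and the `frequent_item_ids` mapping are hypothetical and are not taken from the patch.

```python
def clean_sequence(sequence, frequent_item_ids):
    """Flatten a sequence of itemsets into [0, ids..., 0, ids..., 0].

    Items missing from frequent_item_ids are dropped. An itemset that
    becomes empty is skipped entirely, so no run of consecutive zeros
    is ever emitted.
    """
    encoded = [0]
    for itemset in sequence:
        kept = sorted(frequent_item_ids[item] for item in itemset
                      if item in frequent_item_ids)
        if kept:  # only emit a delimiter for non-empty itemsets
            encoded.extend(kept)
            encoded.append(0)
    # A sequence with no frequent items at all is dropped
    return encoded if len(encoded) > 1 else []

# Hypothetical data: only "a" and "b" are frequent
frequent_item_ids = {"a": 1, "b": 2}
cleaned = clean_sequence([["a", "x"], ["y"], ["b"]], frequent_item_ids)
```

      Under these assumptions `cleaned` is `[0, 1, 0, 2, 0]`. Emitting a delimiter for the emptied middle itemset would instead give `[0, 1, 0, 0, 2, 0]`, which is the kind of zero run the change avoids.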
      
      ## How was this patch tested?
      
      Using MLlib's existing PrefixSpan tests and tests of my own on the 8 datasets shown in the graph. All
      results obtained were strictly the same as with the original implementation (without this change).
      `dev/run-tests` was also run; no errors were found.
      
      Author : Cyril de Vogelaere <cyril.devogelaeregmail.com>
      
      Author: Syrux <pokcyril@hotmail.com>
      
      Closes #17575 from Syrux/SPARK-20265.
      095d1cb3
  4. Apr 12, 2017
    • hyukjinkwon's avatar
      [SPARK-18692][BUILD][DOCS] Test Java 8 unidoc build on Jenkins · ceaf77ae
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to run Spark unidoc to test the Javadoc 8 build, as the Javadoc 8 build is easily broken again.
      
      There are several problems with it:
      
      - It adds a little extra time to the test run. In my case, it took 1.5 mins more (`Elapsed :[94.8746569157]`). How it was tested is described in "How was this patch tested?".
      
      - > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up.
      
        (see  joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))
      
      To complete this automated build, it also proposes fixing existing Javadoc breaks and the ones introduced by test code, as described above.
      
      These fixes are similar to instances that were fixed previously. Please refer to https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013
      
      Note that this only fixes **errors** not **warnings**. Please see my observation https://github.com/apache/spark/pull/17389#issuecomment-288438704 for spurious errors by warnings.
      
      ## How was this patch tested?
      
      Manually via `jekyll build` for building tests. Also, tested via running `./dev/run-tests`.
      
      This was tested via manually adding `time.time()` as below:
      
      ```diff
           profiles_and_goals = build_profiles + sbt_goals
      
           print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
                 " ".join(profiles_and_goals))
      
      +    import time
      +    st = time.time()
           exec_sbt(profiles_and_goals)
      +    print("Elapsed :[%s]" % str(time.time() - st))
      ```
      
      produces
      
      ```
      ...
      ========================================================================
      Building Unidoc API Documentation
      ========================================================================
      ...
      [info] Main Java API documentation successful.
      ...
      Elapsed :[94.8746569157]
      ...
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17477 from HyukjinKwon/SPARK-18692.
      ceaf77ae
  5. Apr 11, 2017
  6. Apr 10, 2017
    • Sean Owen's avatar
      [SPARK-20156][CORE][SQL][STREAMING][MLLIB] Java String toLowerCase "Turkish locale bug" causes Spark problems · a26e3ed5
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Add Locale.ROOT to internal calls to String `toLowerCase`, `toUpperCase`, to avoid inadvertent locale-sensitive variation in behavior (aka the "Turkish locale problem").
      
      The change looks large but it is just adding `Locale.ROOT` (the locale with no country or language specified) to every call to these methods.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17527 from srowen/SPARK-20156.
      a26e3ed5
  7. Apr 09, 2017
    • Vijay Ramesh's avatar
      [SPARK-20260][MLLIB] String interpolation required for error message · 261eaf51
      Vijay Ramesh authored
      ## What changes were proposed in this pull request?
      This error message doesn't get properly formatted because of a missing `s`.  Currently the error looks like:
      
      ```
      Caused by: java.lang.IllegalArgumentException: requirement failed: indices should be one-based and in ascending order; found current=$current, previous=$previous; line="$line"
      ```
      (note the literal `$current` instead of the interpolated value)
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Vijay Ramesh <vramesh@demandbase.com>
      
      Closes #17572 from vijaykramesh/master.
      261eaf51
  8. Apr 07, 2017
  9. Apr 06, 2017
    • Bryan Cutler's avatar
      [SPARK-19953][ML] Random Forest Models use parent UID when being fit · e156b5dd
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      The ML `RandomForestClassificationModel` and `RandomForestRegressionModel` were not using the estimator parent UID when being fit.  This change fixes that so the models can be properly identified with their parents.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Added check to verify that model uid matches that of the parent, then renamed `checkCopy` to `checkCopyAndUids` and verified that it was called by one test for each ML algorithm.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17296 from BryanCutler/rfmodels-use-parent-uid-SPARK-19953.
      e156b5dd
  10. Apr 04, 2017
    • Yuhao Yang's avatar
      [SPARK-20003][ML] FPGrowthModel setMinConfidence should affect rules generation and transform · b28bbffb
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      jira: https://issues.apache.org/jira/browse/SPARK-20003
      I was doing some testing and found the issue: ml.fpm.FPGrowthModel `setMinConfidence` should always affect rules generation and transform.
      Currently `associationRules` in FPGrowthModel is a lazy val, so `setMinConfidence` has no effect once `associationRules` has been computed.
      
      I tried to cache the `associationRules` to avoid recomputation when `minConfidence` has not changed, but this makes FPGrowthModel somewhat stateful. Let me know if there's any concern.
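
      The caching idea can be sketched as follows. This is a language-neutral illustration in Python rather than the Scala code of the patch; the class `RuleCache`, its fields, and the sample rules are all hypothetical.

```python
class RuleCache:
    """Recompute association rules only when min_confidence changed since the last computation."""

    def __init__(self, candidate_rules, min_confidence=0.8):
        # candidate_rules: (antecedent, consequent, confidence) tuples (illustrative data)
        self._candidates = candidate_rules
        self.min_confidence = min_confidence
        self._cached_conf = None   # threshold the cached rules were built with
        self._cached_rules = None
        self.compute_count = 0     # counts recomputations, for demonstration only

    def set_min_confidence(self, value):
        self.min_confidence = value
        return self

    def association_rules(self):
        stale = self._cached_rules is None or self._cached_conf != self.min_confidence
        if stale:
            self._cached_rules = [r for r in self._candidates
                                  if r[2] >= self.min_confidence]
            self._cached_conf = self.min_confidence
            self.compute_count += 1
        return self._cached_rules

model = RuleCache([(("a",), "b", 0.9), (("a",), "c", 0.5)])
```

      Calling `model.association_rules()` twice computes the rules once. After `model.set_min_confidence(0.4)`, the next call recomputes and returns both rules. A plain lazy value would keep returning the first result, which is the bug described above.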
      
      ## How was this patch tested?
      
      New unit test. I also strengthened the unit test for model save/load to cover the cache mechanism.
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17336 from hhbyyh/fpmodelminconf.
      b28bbffb
    • Seth Hendrickson's avatar
      [SPARK-20183][ML] Added outlierRatio arg to MLTestingUtils.testOutliersWithSmallWeights · a59759e6
      Seth Hendrickson authored
      ## What changes were proposed in this pull request?
      
      This is a small piece from https://github.com/apache/spark/pull/16722 which ultimately will add sample weights to decision trees.  This is to allow more flexibility in testing outliers since linear models and trees behave differently.
      
      Note: The primary author when this is committed should be sethah since this is taken from his code.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #17501 from jkbradley/SPARK-20183.
      a59759e6
    • zero323's avatar
      [SPARK-19825][R][ML] spark.ml R API for FPGrowth · b34f7665
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Adds SparkR API for FPGrowth: [SPARK-19825](https://issues.apache.org/jira/browse/SPARK-19825):
      
      - `spark.fpGrowth` - model training.
      - `freqItemsets` and `associationRules` methods with new corresponding generics.
      - Scala helper: `org.apache.spark.ml.r.FPGrowthWrapper`
      - unit tests.
      
      ## How was this patch tested?
      
      Feature specific unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17170 from zero323/SPARK-19825.
      b34f7665
  11. Apr 03, 2017
    • Yuhao Yang's avatar
      [SPARK-19969][ML] Imputer doc and example · 4d28e843
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      Add docs and examples for spark.ml.feature.Imputer. Currently Scala and Java examples are included. A Python example will be added after https://github.com/apache/spark/pull/17316
      
      ## How was this patch tested?
      
      local doc generation and example execution
      
      Author: Yuhao Yang <yuhao.yang@intel.com>
      
      Closes #17324 from hhbyyh/imputerdoc.
      4d28e843
    • Bryan Cutler's avatar
      [SPARK-19985][ML] Fixed copy method for some ML Models · 2a903a1e
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Some ML Models were using `defaultCopy` which expects a default constructor, and others were not setting the parent estimator.  This change fixes these by creating a new instance of the model and explicitly setting values and parent.
      
      ## How was this patch tested?
      Added `MLTestingUtils.checkCopy` to the tests of the offending models to verify that the copy is made and the parent is set.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #17326 from BryanCutler/ml-model-copy-error-SPARK-19985.
      2a903a1e
  12. Mar 30, 2017
    • Jacek Laskowski's avatar
      [DOCS] Docs-only improvements · 0197262a
      Jacek Laskowski authored
      …adoc
      
      ## What changes were proposed in this pull request?
      
      Use recommended values for row boundaries in Window's scaladoc, i.e. `Window.unboundedPreceding`, `Window.unboundedFollowing`, and `Window.currentRow` (that were introduced in 2.1.0).
      
      ## How was this patch tested?
      
      Local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #17417 from jaceklaskowski/window-expression-scaladoc.
      0197262a
  13. Mar 28, 2017
  14. Mar 25, 2017
    • sethah's avatar
      [SPARK-17137][ML][WIP] Compress logistic regression coefficients · be85245a
      sethah authored
      ## What changes were proposed in this pull request?
      
      Use the new `compressed` method on matrices to store the logistic regression coefficients as sparse or dense, whichever requires less memory.
      
      Marked as WIP so we can add some performance test results. Basically, we should see if prediction is slower because of using a sparse matrix over a dense one. This can happen since sparse matrices do not use native BLAS operations when computing the margins.
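
      The sparse-versus-dense decision can be sketched as a simple size comparison. The byte counts below are assumptions for illustration (8 bytes per double, 4 bytes per integer index, compressed sparse column layout, object overhead ignored), and the accounting inside Spark's `compressed` may differ.

```python
def dense_size_bytes(num_rows, num_cols):
    # One 8-byte double per entry
    return 8 * num_rows * num_cols

def sparse_csc_size_bytes(num_cols, nnz):
    # Per non-zero: 8-byte value + 4-byte row index; plus (num_cols + 1) 4-byte column pointers
    return 12 * nnz + 4 * (num_cols + 1)

def choose_storage(matrix):
    """Return "sparse" or "dense", whichever this estimate says needs less memory."""
    num_rows, num_cols = len(matrix), len(matrix[0])
    nnz = sum(1 for row in matrix for v in row if v != 0.0)
    if sparse_csc_size_bytes(num_cols, nnz) < dense_size_bytes(num_rows, num_cols):
        return "sparse"
    return "dense"

# Hypothetical 3 x 4 coefficient matrices
mostly_zero = [[0.0, 0.0, 0.0, 0.0], [0.0, 2.5, 0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
all_nonzero = [[1.0] * 4 for _ in range(3)]
```

      With these estimates `mostly_zero` needs 32 bytes sparse against 96 bytes dense, while `all_nonzero` needs 164 bytes sparse against 96 bytes dense, so the first is stored sparse and the second dense.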
      
      ## How was this patch tested?
      
      Unit tests added.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #17426 from sethah/SPARK-17137.
      be85245a
  15. Mar 24, 2017
    • Nick Pentreath's avatar
      [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark · d9f4ce69
      Nick Pentreath authored
      Add Python wrapper for `Imputer` feature transformer.
      
      ## How was this patch tested?
      
      New doc tests and tweak to PySpark ML `tests.py`
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17316 from MLnick/SPARK-15040-pyspark-imputer.
      d9f4ce69
  16. Mar 23, 2017
    • Timothy Hunter's avatar
      [SPARK-19636][ML] Feature parity for correlation statistics in MLlib · d27daa54
      Timothy Hunter authored
      ## What changes were proposed in this pull request?
      
      This patch adds DataFrame-based support for the correlation statistics found in `org.apache.spark.mllib.stat.correlation.Statistics`, following the design doc discussed in the JIRA ticket.
      
      The current implementation is a simple wrapper around the `spark.mllib` implementation. Future optimizations can be implemented at a later stage.
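
      For readers unfamiliar with the statistic itself, here is a small pure-Python sketch of a Pearson correlation matrix over the columns of a dataset. It only illustrates what a correlation matrix contains; the function names and data are hypothetical, and it does not reflect how the Spark wrapper is implemented.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def correlation_matrix(rows):
    """Pairwise Pearson correlations between the columns of `rows`."""
    columns = list(zip(*rows))
    return [[pearson(a, b) for b in columns] for a in columns]

# Hypothetical data: column 1 is exactly twice column 0; column 2 decreases
rows = [[1.0, 2.0, 9.0], [2.0, 4.0, 7.0], [3.0, 6.0, 2.0], [4.0, 8.0, 1.0]]
matrix = correlation_matrix(rows)
```

      Here `matrix[0][1]` is 1 up to rounding, because the two columns are linearly related, and `matrix[0][2]` is negative.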
      
      ## How was this patch tested?
      
      ```
      build/sbt "testOnly org.apache.spark.ml.stat.StatisticsSuite"
      ```
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #17108 from thunterdb/19636.
      d27daa54
    • hyukjinkwon's avatar
      [MINOR][BUILD] Fix javadoc8 break · aefe7989
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Several javadoc8 breaks have been introduced. This PR proposes to fix those instances so that we can build Scala/Java API docs.
      
      ```
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:6: error: reference not found
      [error]  * <code>flatMapGroupsWithState</code> operations on {link KeyValueGroupedDataset}.
      [error]                                                             ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:10: error: reference not found
      [error]  * Both, <code>mapGroupsWithState</code> and <code>flatMapGroupsWithState</code> in {link KeyValueGroupedDataset}
      [error]                                                                                            ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:51: error: reference not found
      [error]  *    {link GroupStateTimeout.ProcessingTimeTimeout}) or event time (i.e.
      [error]              ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:52: error: reference not found
      [error]  *    {link GroupStateTimeout.EventTimeTimeout}).
      [error]              ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:158: error: reference not found
      [error]  *           Spark SQL types (see {link Encoder} for more details).
      [error]                                          ^
      [error] .../spark/mllib/target/java/org/apache/spark/ml/fpm/FPGrowthParams.java:26: error: bad use of '>'
      [error]    * Number of partitions (>=1) used by parallel FP-growth. By default the param is not set, and
      [error]                            ^
      [error] .../spark/sql/core/src/main/java/org/apache/spark/api/java/function/FlatMapGroupsWithStateFunction.java:30: error: reference not found
      [error]  * {link org.apache.spark.sql.KeyValueGroupedDataset#flatMapGroupsWithState(
      [error]           ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:211: error: reference not found
      [error]    * See {link GroupState} for more details.
      [error]                 ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:232: error: reference not found
      [error]    * See {link GroupState} for more details.
      [error]                 ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:254: error: reference not found
      [error]    * See {link GroupState} for more details.
      [error]                 ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:277: error: reference not found
      [error]    * See {link GroupState} for more details.
      [error]                 ^
      [error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
      [error]  * {link TaskMetrics} &amp; {link MetricsSystem} objects are not thread safe.
      [error]           ^
      [error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
      [error]  * {link TaskMetrics} &amp; {link MetricsSystem} objects are not thread safe.
      [error]                                     ^
      [info] 13 errors
      ```
      
      ```
      jekyll 3.3.1 | Error:  Unidoc generation failed
      ```
      
      ## How was this patch tested?
      
      Manually via `jekyll build`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17389 from HyukjinKwon/minor-javadoc8-fix.
      aefe7989
  17. Mar 21, 2017
    • Joseph K. Bradley's avatar
      [SPARK-20039][ML] rename ChiSquare to ChiSquareTest · ae4b91d1
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      I realized that since ChiSquare is in the package stat, it's pretty unclear if it's the hypothesis test, distribution, or what. This PR renames it to ChiSquareTest to clarify this.
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #17368 from jkbradley/SPARK-20039.
      ae4b91d1
    • Zheng RuiFeng's avatar
      [SPARK-20041][DOC] Update docs for NaN handling in approxQuantile · 63f077fb
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Update docs for NaN handling in approxQuantile.
      
      ## How was this patch tested?
      existing tests.
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #17369 from zhengruifeng/doc_quantiles_nan.
      63f077fb
    • christopher snow's avatar
      [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter · 7620aed8
      christopher snow authored
      ## What changes were proposed in this pull request?
      
      API documentation and collaborative filtering documentation page changes to clarify inconsistent description of ALS rank parameter.
      
       - [DOCS] was previously: "rank is the number of latent factors in the model."
       - [API] was previously:  "rank - number of features to use"
      
      This change describes rank in both places consistently as:
      
       - "Number of features to use (also referred to as the number of latent factors)"
      
      Author: Chris Snow <chris.snowuk.ibm.com>
      
      Author: christopher snow <chsnow123@gmail.com>
      
      Closes #17345 from snowch/SPARK-20011.
      7620aed8
  18. Mar 20, 2017
  19. Mar 16, 2017
    • Joseph K. Bradley's avatar
      [SPARK-19635][ML] DataFrame-based API for chi square test · 4c320054
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Wrapper taking and returning a DataFrame
      
      ## How was this patch tested?
      
      Copied unit tests from RDD-based API
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #17110 from jkbradley/df-hypotests.
      4c320054
    • Yuhao Yang's avatar
      [SPARK-13568][ML] Create feature transformer to impute missing values · d647aae2
      Yuhao Yang authored
      ## What changes were proposed in this pull request?
      
      jira: https://issues.apache.org/jira/browse/SPARK-13568
      It is quite common to encounter missing values in data sets. It would be useful to implement a Transformer that can impute missing data points, similar to e.g. Imputer in scikit-learn.
      Initially, options for imputation could include mean, median and most frequent, but we could add various other approaches, where possible existing DataFrame code can be used (e.g. for approximate quantiles etc).
      
      Currently this PR supports imputation for Double and Vector (null and NaN in Vector).
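
      The mean and median strategies mentioned above can be sketched in a few lines of plain Python. This is an illustration of the idea only: the function names are hypothetical, both `None` and NaN are treated as missing, and the median here is exact, whereas a DataFrame-based implementation may use approximate quantiles.

```python
import math
import statistics

def is_missing(v):
    """Treat None and NaN as missing values."""
    return v is None or (isinstance(v, float) and math.isnan(v))

def fit_surrogate(column, strategy="mean"):
    """Compute the replacement value from the observed (non-missing) entries."""
    observed = [v for v in column if not is_missing(v)]
    if strategy == "mean":
        return statistics.fmean(observed)
    if strategy == "median":
        return statistics.median(observed)
    raise ValueError(f"unsupported strategy: {strategy}")

def impute(column, surrogate):
    """Replace every missing entry with the surrogate value."""
    return [surrogate if is_missing(v) else v for v in column]

column = [1.0, None, 3.0, float("nan"), 8.0]
filled = impute(column, fit_surrogate(column, "mean"))
```

      Splitting the work into `fit_surrogate` and `impute` mirrors the usual estimator and model split: the surrogate is learned once and can then be applied to other data.
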
      ## How was this patch tested?
      
      new unit tests and manual test
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      Author: Yuhao Yang <yuhao.yang@intel.com>
      Author: Yuhao <yuhao.yang@intel.com>
      
      Closes #11601 from hhbyyh/imputer.
      d647aae2
  20. Mar 14, 2017
    • Menglong TAN's avatar
      [SPARK-11569][ML] Fix StringIndexer to handle null value properly · 85941ecf
      Menglong TAN authored
      ## What changes were proposed in this pull request?
      
      This PR is to enhance StringIndexer with NULL values handling.
      
      Before this PR, StringIndexer threw an exception when it encountered NULL values.
      With this PR:
      - handleInvalid=error: Throw an exception as before
      - handleInvalid=skip: Skip null values as well as unseen labels
      - handleInvalid=keep: Give null values an additional index as well as unseen labels
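
      The three modes listed above can be modeled with a short pure-Python sketch. It is not the Spark code: `fit_labels` and `index_values` are hypothetical helpers, `None` stands in for NULL, and the alphabetical tie-break between equally frequent labels is an assumption made for determinism.

```python
def fit_labels(values):
    """Order labels by descending frequency; None (NULL) is never a label."""
    counts = {}
    for v in values:
        if v is not None:
            counts[v] = counts.get(v, 0) + 1
    return sorted(counts, key=lambda label: (-counts[label], label))

def index_values(values, labels, handle_invalid="error"):
    """Map values to label indices; NULLs and unseen labels are 'invalid'."""
    lookup = {label: i for i, label in enumerate(labels)}
    out = []
    for v in values:
        if v in lookup:
            out.append(lookup[v])
        elif handle_invalid == "skip":
            continue                      # drop the row
        elif handle_invalid == "keep":
            out.append(len(labels))       # one additional index, numLabels
        elif handle_invalid == "error":
            raise ValueError(f"invalid value: {v!r}")
        else:
            raise ValueError(f"unknown handle_invalid: {handle_invalid}")
    return out

labels = fit_labels(["a", "b", "a", "c", "a", "b"])   # ["a", "b", "c"]
```

      With these labels, `index_values(["a", None, "d", "c"], labels, "keep")` gives `[0, 3, 3, 2]`, the `"skip"` mode gives `[0, 2]`, and the `"error"` mode raises on the first NULL.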
      
      BTW, I noticed someone was trying to solve the same problem ( #9920 ), but it seems to have had no progress or response for a long time. Would you mind giving me a chance to solve it? I'm eager to help. :-)
      
      ## How was this patch tested?
      
      new unit tests
      
      Author: Menglong TAN <tanmenglong@renrenche.com>
      Author: Menglong TAN <tanmenglong@gmail.com>
      
      Closes #17233 from crackcell/11569_StringIndexer_NULL.
      85941ecf
    • zero323's avatar
      [SPARK-19940][ML][MINOR] FPGrowthModel.transform should skip duplicated items · d4a637cd
      zero323 authored
      ## What changes were proposed in this pull request?
      
      This commit moves `distinct` to its intended place to avoid duplicated predictions and adds a unit test covering the issue.
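
      To see why deduplication matters, consider a simplified model of rule-based prediction in which every rule whose antecedent is contained in the input items contributes its consequents. The function `predict` and the sample rules below are hypothetical and only approximate what `transform` does.

```python
def predict(items, rules):
    """Collect consequents of all matching rules, without duplicates and
    without items that are already present in the input."""
    item_set = set(items)
    predictions = []
    seen = set()
    for antecedent, consequent in rules:
        if set(antecedent) <= item_set:          # rule applies to this input
            for c in consequent:
                if c not in item_set and c not in seen:
                    seen.add(c)
                    predictions.append(c)
    return predictions

# Two different rules both predict "c"
rules = [(["a"], ["c"]), (["b"], ["c"]), (["a", "b"], ["d"])]
result = predict(["a", "b"], rules)
```

      Without the `seen` check, `result` would contain `"c"` twice, once for each matching rule. With it, the result is `["c", "d"]`.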
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17283 from zero323/SPARK-19940.
      d4a637cd
    • Asher Krim's avatar
      [SPARK-19922][ML] small speedups to findSynonyms · 5e96a57b
      Asher Krim authored
      Currently generating synonyms using a large model (I've tested with 3m words) is very slow. These efficiencies have sped things up for us by ~17%
      
      I wasn't sure if such small changes were worthy of a jira, but the guidelines seemed to suggest that that is the preferred approach
      
      ## What changes were proposed in this pull request?
      
      Address a few small issues in the findSynonyms logic:
      1) remove usage of ``Array.fill`` to zero out the ``cosineVec`` array. The default float value in Scala and Java is 0.0f, so explicitly setting the values to zero is not needed
      2) use Floats throughout. The conversion to Doubles before doing the ``priorityQueue`` is totally superfluous, since all the similarity computations are done using Floats anyway. Creating a second large array just serves to put extra strain on the GC
      3) convert the slow ``for(i <- cosVec.indices)`` to an ugly, but faster, ``while`` loop
      
      These efficiencies are really only apparent when working with a large model
      ## How was this patch tested?
      
      Existing unit tests + some in-house tests to time the difference
      
      cc jkbradley MLNick srowen
      
      Author: Asher Krim <krim.asher@gmail.com>
      Author: Asher Krim <krim.asher@gmail>
      
      Closes #17263 from Krimit/fasterFindSynonyms.
      5e96a57b
    • actuaryzhang's avatar
      [SPARK-19391][SPARKR][ML] Tweedie GLM API for SparkR · f6314eab
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      Port Tweedie GLM  #16344  to SparkR
      
      felixcheung yanboliang
      
      ## How was this patch tested?
      new test in SparkR
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #16729 from actuaryzhang/sparkRTweedie.
      f6314eab
  21. Mar 13, 2017
    • Joseph K. Bradley's avatar
      [MINOR][ML] Improve MLWriter overwrite error message · 72c66dbb
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      Give proper syntax for Java and Python in addition to Scala.
      
      ## How was this patch tested?
      
      Manually.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #17215 from jkbradley/write-err-msg.
      72c66dbb
  22. Mar 12, 2017
    • Xin Ren's avatar
      [SPARK-19282][ML][SPARKR] RandomForest Wrapper and GBT Wrapper return param "maxDepth" to R models · 9f8ce482
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      RandomForest R Wrapper and GBT R Wrapper return param `maxDepth` to R models.
      
      The following 4 R wrappers are changed:
      * `RandomForestClassificationWrapper`
      * `RandomForestRegressionWrapper`
      * `GBTClassificationWrapper`
      * `GBTRegressionWrapper`
      
      ## How was this patch tested?
      
      Test manually on my local machine.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #17207 from keypointt/SPARK-19282.
      9f8ce482
  23. Mar 08, 2017
  24. Mar 07, 2017
    • Asher Krim's avatar
      [SPARK-17629][ML] methods to return synonyms directly · 56e1bd33
      Asher Krim authored
      ## What changes were proposed in this pull request?
      Provide methods to return synonyms directly, without wrapping them in a DataFrame.
      
      In performance-sensitive applications (such as user-facing APIs), the round trip to and from DataFrames is costly and unnecessary.
      
      The methods are named ``findSynonymsArray`` to make the return type clear, which also implies a local data structure.
      ## How was this patch tested?
      updated word2vec tests
      
      Author: Asher Krim <akrim@hubspot.com>
      
      Closes #16811 from Krimit/w2vFindSynonymsLocal.
      56e1bd33
    • VinceShieh's avatar
      [SPARK-17498][ML] StringIndexer enhancement for handling unseen labels · 4a9034b1
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR is an enhancement to ML StringIndexer.
      Before this PR, StringIndexer only supported the "skip" and "error" options for dealing with unseen records.
      But those unseen records might still be useful, and users would like to keep the unseen labels in
      certain use cases. This PR enables StringIndexer to keep unseen labels by assigning them the
      index `numLabels`.
      
      Before:
      - `StringIndexer().setHandleInvalid("skip")`
      - `StringIndexer().setHandleInvalid("error")`
      
      After, a third option "keep" is supported:
      - `StringIndexer().setHandleInvalid("keep")`
      
      ## How was this patch tested?
      Test added in StringIndexerSuite
      
      Signed-off-by: VinceShieh <vincent.xieintel.com>
      (Please fill in changes proposed in this fix)
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #16883 from VinceShieh/spark-17498.
      4a9034b1
  25. Mar 06, 2017
    • wm624@hotmail.com's avatar
      [SPARK-19382][ML] Test sparse vectors in LinearSVCSuite · 92654366
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Add unit tests for testing SparseVector.
      
      We can't add a test case mixing DenseVector and SparseVector, as discussed in JIRA 19382.
      
          def merge(other: MultivariateOnlineSummarizer): this.type = {
            if (this.totalWeightSum != 0.0 && other.totalWeightSum != 0.0) {
              require(n == other.n, s"Dimensions mismatch when merging with another summarizer. " +
                s"Expecting $n but got ${other.n}.")
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      Author: Miao Wang <wangmiao1981@users.noreply.github.com>
      
      Closes #16784 from wangmiao1981/bk.
      92654366