  1. May 22, 2015
    • [SPARK-7578] [ML] [DOC] User guide for spark.ml Normalizer, IDF, StandardScaler · 2728c3df
      Joseph K. Bradley authored
      Added user guide sections with code examples.
      Also added small Java unit tests to test Java example in guide.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6127 from jkbradley/feature-guide-2 and squashes the following commits:
      
      cd47f4b [Joseph K. Bradley] Updated based on code review
      f16bcec [Joseph K. Bradley] Fixed merge issues and update Python examples print calls for Python 3
      0a862f9 [Joseph K. Bradley] Added Normalizer, StandardScaler to ml-features doc, plus small Java unit tests
      a21c2d6 [Joseph K. Bradley] Updated ml-features.md with IDF
      2728c3df
    • [SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4 · 8f11c611
      Xiangrui Meng authored
      Some changes to the pipeline APIs:
      
      1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does.
      2. Move Evaluator to ml.evaluation.
      3. Mention larger metric values are better.
      4. PipelineModel doc. “compiled” -> “fitted”
      5. Hide object PolynomialExpansion.
      6. Hide object VectorAssembler.
      7. Word2Vec.minCount (and other) -> group param
      8. ParamValidators -> DeveloperApi
      9. Hide MetadataUtils/SchemaUtils.
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6322 from mengxr/SPARK-7535.0 and squashes the following commits:
      
      9e9c7da [Xiangrui Meng] move JavaEvaluator to ml.evaluation as well
      e179480 [Xiangrui Meng] move Evaluation to ml.evaluation in PySpark
      08ef61f [Xiangrui Meng] update pipieline APIs
      8f11c611
  2. May 21, 2015
    • [SPARK-7219] [MLLIB] Output feature attributes in HashingTF · 85b96372
      Xiangrui Meng authored
      This PR updates `HashingTF` to output ML attributes that tell the number of features in the output column. We need to expand `UnaryTransformer` to support output metadata. A `def outputMetadata: Metadata` is not sufficient because the metadata may also depend on the input data. Though this is not true for `HashingTF`, I think it is reasonable to update `UnaryTransformer` in a separate PR. `checkParams` is added to verify common requirements for params. I will send a separate PR to use it in other test suites. jkbradley
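To illustrate what an output attribute carrying the number of features could look like, here is a minimal pure-Python sketch. It is not Spark's implementation: the function name `hashing_tf`, the metadata key `num_attrs`, and the use of `zlib.crc32` as the hash function are all assumptions made for this example.

```python
import zlib
from collections import Counter

def hashing_tf(terms, num_features=16):
    """Map terms to sparse term frequencies plus metadata for the output column."""
    counts = Counter()
    for term in terms:
        # crc32 is a deterministic stand-in hash (assumption, not Spark's hash).
        index = zlib.crc32(term.encode("utf-8")) % num_features
        counts[index] += 1
    # For hashing TF the metadata depends only on the parameter, not on the data.
    metadata = {"num_attrs": num_features}
    return dict(counts), metadata

vector, metadata = hashing_tf(["spark", "ml", "spark"], num_features=16)
```

Downstream stages can then read the vector size from the metadata instead of scanning the data.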
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6308 from mengxr/SPARK-7219 and squashes the following commits:
      
      9bd2922 [Xiangrui Meng] address comments
      e82a68a [Xiangrui Meng] remove sqlContext from test suite
      995535b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7219
      2194703 [Xiangrui Meng] add test for attributes
      178ae23 [Xiangrui Meng] update HashingTF with tests
      91a6106 [Xiangrui Meng] WIP
      85b96372
    • [SPARK-7794] [MLLIB] update RegexTokenizer default settings · f5db4b41
      Xiangrui Meng authored
      The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley
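The difference between the two settings can be sketched in plain Python. This is an illustration only: `regex_tokenize` is a hypothetical helper, and because Python's `re` module does not support `\p{L}`, the old-style pattern is approximated with `\w`, which also matches digits and underscores.

```python
import re

def regex_tokenize(text, pattern=r"\s+", gaps=True):
    """Tokenize text; the pattern matches separators if gaps=True, else tokens."""
    if gaps:
        tokens = re.split(pattern, text)
    else:
        tokens = re.findall(pattern, text)
    # This sketch drops empty strings left by leading or trailing separators.
    return [t for t in tokens if t]

# New-style default: split on runs of whitespace.
new_default = regex_tokenize("Hello, Spark  ML!")
# Old-style behaviour, approximated: match words and punctuation runs as tokens.
old_style = regex_tokenize("Hello, Spark  ML!", pattern=r"\w+|[^\w\s]+", gaps=False)
```

With `gaps=True` punctuation stays attached to the neighbouring word, while the token-matching form splits it off.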
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6330 from mengxr/SPARK-7794 and squashes the following commits:
      
      5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
      f5db4b41
    • [SPARK-7498] [MLLIB] add varargs back to setDefault · cdc7c055
      Xiangrui Meng authored
      We removed `varargs` due to Java compilation issues. That was a false alarm because I didn't run `build/sbt clean`. So this PR reverts the changes. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6320 from mengxr/SPARK-7498 and squashes the following commits:
      
      74a7259 [Xiangrui Meng] add varargs back to setDefault
      cdc7c055
    • [SPARK-7585] [ML] [DOC] VectorIndexer user guide section · 6d75ed7e
      Joseph K. Bradley authored
      Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6255 from jkbradley/vector-indexer-guide and squashes the following commits:
      
      dbb8c4c [Joseph K. Bradley] simplified VectorIndexerModel.javaCategoryMaps
      f692084 [Joseph K. Bradley] Added VectorIndexer section to ML user guide.  Also added javaCategoryMaps() method and Java unit test for it.
      6d75ed7e
    • [SPARK-7793] [MLLIB] Use getOrElse for getting the threshold of SVM model · 4f572008
      Shuo Xiang authored
      Same issue and fix as in SPARK-7694.
      
      Author: Shuo Xiang <shuoxiangpub@gmail.com>
      
      Closes #6321 from coderxiang/nb and squashes the following commits:
      
      a5e6de4 [Shuo Xiang] use getOrElse for svmmodel.tostring
      2cb0177 [Shuo Xiang] Merge remote-tracking branch 'upstream/master' into nb
      5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      98804c9 [Shuo Xiang] fix bug in topBykey and update test
      4f572008
    • [SPARK-7752] [MLLIB] Use lowercase letters for NaiveBayes.modelType · 13348e21
      Xiangrui Meng authored
      to be consistent with other string names in MLlib. This PR also updates the implementation to use vals instead of hardcoded strings. jkbradley leahmcguire
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6277 from mengxr/SPARK-7752 and squashes the following commits:
      
      f38b662 [Xiangrui Meng] add another case _ back in test
      ae5c66a [Xiangrui Meng] model type -> modelType
      711d1c6 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7752
      40ae53e [Xiangrui Meng] fix Java test suite
      264a814 [Xiangrui Meng] add case _ back
      3c456a8 [Xiangrui Meng] update NB user guide
      17bba53 [Xiangrui Meng] update naive Bayes to use lowercase model type strings
      13348e21
    • [SPARK-7753] [MLLIB] Update KernelDensity API · 947ea1cf
      Xiangrui Meng authored
      Update `KernelDensity` API to make it extensible to different kernels in the future. `bandwidth` is used instead of `standardDeviation`. The static `kernelDensity` method is removed from `Statistics`. The implementation is updated using BLAS, while the algorithm remains the same. sryza srowen
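For readers unfamiliar with the role of `bandwidth`, the following is a minimal sketch of a Gaussian kernel density estimate in plain Python, where the bandwidth acts as the standard deviation of each kernel. The function `gaussian_kde` is hypothetical and uses a plain loop rather than BLAS; it is not the MLlib code.

```python
import math

def gaussian_kde(samples, bandwidth, points):
    """Evaluate a Gaussian kernel density estimate at each point."""
    n = len(samples)
    # Normalisation of a Gaussian with standard deviation `bandwidth`, averaged over n.
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))
    densities = []
    for x in points:
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        densities.append(norm * total)
    return densities

densities = gaussian_kde([0.0], bandwidth=1.0, points=[0.0, 1.0])
```

With a single sample at 0 and bandwidth 1, the estimate is simply the standard normal density.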
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6279 from mengxr/SPARK-7753 and squashes the following commits:
      
      4cdfadc [Xiangrui Meng] add example code in the doc
      767fd5a [Xiangrui Meng] update KernelDensity API
      947ea1cf
  3. May 20, 2015
    • [SPARK-7774] [MLLIB] add sqlContext to MLlibTestSparkContext · ddec173c
      Xiangrui Meng authored
      to simplify test suites that require a SQLContext.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6303 from mengxr/SPARK-7774 and squashes the following commits:
      
      0622b5a [Xiangrui Meng] update some other test suites
      e1f9b8d [Xiangrui Meng] add sqlContext to MLlibTestSparkContext
      ddec173c
    • [SPARK-7762] [MLLIB] set default value for outputCol · c330e52d
      Xiangrui Meng authored
      Set a default value for `outputCol` instead of forcing users to name it. This is useful for intermediate transformers in the pipeline. jkbradley
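One way such a default could work is to derive the column name from the stage's unique ID. The sketch below is an assumption for illustration: the class `ToyTransformer` and the `__output` suffix are not taken from this commit message.

```python
import uuid

class ToyTransformer:
    """Toy pipeline stage with an optional output column name."""

    def __init__(self, prefix, output_col=None):
        # A short random suffix keeps uids distinct across instances.
        self.uid = f"{prefix}_{uuid.uuid4().hex[:12]}"
        self._output_col = output_col

    @property
    def output_col(self):
        # Fall back to a uid-derived name when the user did not set one.
        if self._output_col is None:
            return f"{self.uid}__output"
        return self._output_col

tokenizer = ToyTransformer("tok")
named = ToyTransformer("tok", output_col="words")
```

Intermediate stages can then be chained without naming every temporary column, while explicitly named columns still take precedence.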
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6289 from mengxr/SPARK-7762 and squashes the following commits:
      
      54edebc [Xiangrui Meng] merge master
      bff8667 [Xiangrui Meng] update unit test
      171246b [Xiangrui Meng] add unit test for outputCol
      a4321bd [Xiangrui Meng] set default value for outputCol
      c330e52d
    • [SPARK-7537] [MLLIB] spark.mllib API updates · 2ad4837c
      Xiangrui Meng authored
      Minor updates to the spark.mllib APIs:
      
      1. Add `DeveloperApi` to `PMMLExportable` and add `Experimental` to `toPMML` methods.
      2. Mention `RankingMetrics.of` in the `RankingMetrics` constructor.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6280 from mengxr/SPARK-7537 and squashes the following commits:
      
      1bd2583 [Xiangrui Meng] organize imports
      94afa7a [Xiangrui Meng] mark all toPMML methods experimental
      4c40da1 [Xiangrui Meng] mention the factory method for RankingMetrics for Java users
      88c62d0 [Xiangrui Meng] add DeveloperApi to PMMLExportable
      2ad4837c
    • [SPARK-6094] [MLLIB] Add MultilabelMetrics in PySpark/MLlib · 98a46f9d
      Yanbo Liang authored
      Add MultilabelMetrics in PySpark/MLlib
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #6276 from yanboliang/spark-6094 and squashes the following commits:
      
      b8e3343 [Yanbo Liang] Add MultilabelMetrics in PySpark/MLlib
      98a46f9d
    • [SPARK-7654] [MLLIB] Migrate MLlib to the DataFrame reader/writer API · 589b12f8
      Xiangrui Meng authored
      parquetFile -> read.parquet rxin
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6281 from mengxr/SPARK-7654 and squashes the following commits:
      
      a79b612 [Xiangrui Meng] parquetFile -> read.parquet
      589b12f8
    • [SPARK-7663] [MLLIB] Add requirement for word2vec model · b3abf0b8
      Xusen Yin authored
      JIRA issue [link](https://issues.apache.org/jira/browse/SPARK-7663).
      
      We should check the model size of word2vec to prevent an unexpectedly empty model.
      
      CC srowen.
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #6228 from yinxusen/SPARK-7663 and squashes the following commits:
      
      21770c5 [Xusen Yin] check the vocab size
      54ae63e [Xusen Yin] add requirement for word2vec model
      b3abf0b8
  4. May 19, 2015
    • [SPARK-7652] [MLLIB] Update the implementation of naive Bayes prediction with BLAS · c12dff9b
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7652
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6189 from viirya/naive_bayes_blas_prediction and squashes the following commits:
      
      ab611fd [Liang-Chi Hsieh] Remove unnecessary space.
      ddc48b9 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into naive_bayes_blas_prediction
      b5772b4 [Liang-Chi Hsieh] Fix binary compatibility.
      2f65186 [Liang-Chi Hsieh] Remove toDense.
      1b6cdfe [Liang-Chi Hsieh] Update the implementation of naive Bayes prediction with BLAS.
      c12dff9b
    • [SPARK-7586] [ML] [DOC] Add docs of Word2Vec in ml package · 68fb2a46
      Xusen Yin authored
      CC jkbradley.
      
      JIRA [issue](https://issues.apache.org/jira/browse/SPARK-7586).
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #6181 from yinxusen/SPARK-7586 and squashes the following commits:
      
      77014c5 [Xusen Yin] comment fix
      57a4c07 [Xusen Yin] small fix for docs
      1178c8f [Xusen Yin] remove the correctness check in java suite
      1c3f389 [Xusen Yin] delete sbt commit
      1af152b [Xusen Yin] check python example code
      1b5369e [Xusen Yin] add docs of word2vec
      68fb2a46
    • [SPARK-7678] [ML] Fix default random seed in HasSeed · 7b16e9f2
      Joseph K. Bradley authored
      Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.
      Also, removed fixed random seeds from Word2Vec and ALS.
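The idea of a seed derived from the class name can be sketched in Python by reproducing Java's `String.hashCode` for ASCII names. The helper names `java_string_hash` and `default_seed` and the toy classes are hypothetical, and the exact expression used in `HasSeed` is not shown in this log.

```python
def java_string_hash(text):
    """Reproduce Java's String.hashCode as a signed 32-bit integer (ASCII input)."""
    h = 0
    for ch in text:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Convert from the unsigned to the signed 32-bit representation.
    return h - 0x100000000 if h >= 0x80000000 else h

def default_seed(cls):
    """Derive a deterministic default seed from a class's qualified name."""
    return java_string_hash(f"{cls.__module__}.{cls.__qualname__}")

class ToyWord2Vec:
    pass

class ToyALS:
    pass

seed_a = default_seed(ToyWord2Vec)
seed_b = default_seed(ToyWord2Vec)
```

Unlike a random default, this seed is the same on every run and usually differs between algorithm classes.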
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6251 from jkbradley/scala-fixed-seed and squashes the following commits:
      
      0e37184 [Joseph K. Bradley] Fixed Word2VecSuite, ALSSuite in spark.ml to use original fixed random seeds
      678ec3a [Joseph K. Bradley] Removed fixed random seeds from Word2Vec and ALS. Changed shared param HasSeed to have default based on hashCode of class name, instead of random number.
      7b16e9f2
    • [SPARK-7047] [ML] ml.Model optional parent support · fb902732
      Joseph K. Bradley authored
      Made Model.parent transient.  Added Model.hasParent to test for null parent
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5914 from jkbradley/parent-optional and squashes the following commits:
      
      d501774 [Joseph K. Bradley] Made Model.parent transient.  Added Model.hasParent to test for null parent
      fb902732
    • [SPARK-7581] [ML] [DOC] User guide for spark.ml PolynomialExpansion · 6008ec14
      Xusen Yin authored
      JIRA [here](https://issues.apache.org/jira/browse/SPARK-7581).
      
      CC jkbradley
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #6113 from yinxusen/SPARK-7581 and squashes the following commits:
      
      1a7d80d [Xusen Yin] merge with master
      892a8e9 [Xusen Yin] fix python 3 compatibility
      ec935bf [Xusen Yin] small fix
      3e9fa1d [Xusen Yin] delete note
      69fcf85 [Xusen Yin] simplify and add python example
      81d21dc [Xusen Yin] add programming guide for Polynomial Expansion
      40babfb [Xusen Yin] add java test suite for PolynomialExpansion
      6008ec14
  5. May 18, 2015
    • [SPARK-7681] [MLLIB] Add SparseVector support for gemv · d03638cc
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7681
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6209 from viirya/sparsevector_gemv and squashes the following commits:
      
      ce0bb8b [Liang-Chi Hsieh] Still need to scal y when beta is 0.0 because it clears out y.
      b890e63 [Liang-Chi Hsieh] Do not delete multiply for DenseVector.
      57a8c1e [Liang-Chi Hsieh] Add MimaExcludes for v1.4.
      458d1ae [Liang-Chi Hsieh] List DenseMatrix.multiply and SparseMatrix.multiply to MimaExcludes too.
      054f05d [Liang-Chi Hsieh] Fix scala style.
      410381a [Liang-Chi Hsieh] Address comments. Make Matrix.multiply more generalized.
      4616696 [Liang-Chi Hsieh] Add support for SparseVector with SparseMatrix.
      5d6d07a [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into sparsevector_gemv
      c069507 [Liang-Chi Hsieh] Add SparseVector support for gemv with DenseMatrix.
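For context, gemv computes y := alpha * A * x + beta * y. The sketch below shows the operation for a sparse x stored as parallel index and value lists, including the scaling step that the first squashed commit mentions for beta = 0.0. The function `gemv_sparse` and its argument layout are assumptions for illustration, not the MLlib BLAS signature.

```python
def gemv_sparse(alpha, matrix, x_indices, x_values, beta, y):
    """Compute y := alpha * A * x + beta * y in place for a sparse vector x."""
    # Scale y first; with beta == 0.0 this also clears out stale values in y.
    for i in range(len(y)):
        y[i] *= beta
    # Only the nonzero entries of x contribute to the product.
    for col, value in zip(x_indices, x_values):
        for i in range(len(y)):
            y[i] += alpha * matrix[i][col] * value
    return y

A = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
# x = (1, 0, 1) stored sparsely; y starts with stale values that beta = 0 removes.
result = gemv_sparse(2.0, A, [0, 2], [1.0, 1.0], 0.0, [9.0, 9.0])
```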
      d03638cc
    • [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python · 9c7e802a
      Xiangrui Meng authored
      This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:
      
      1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
      2. Accept a list of param maps in `fit`.
      3. Use parent uid and name to identify param.
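The interplay of copyable stages and a `fit` that accepts a list of param maps can be sketched as follows. `ToyEstimator` is a hypothetical stand-in and its "model" is just a dict; only the `_paramMap` name comes from the list above.

```python
import copy

class ToyEstimator:
    """Toy estimator whose fit accepts one param map or a list of them."""

    def __init__(self):
        self._paramMap = {"maxIter": 10}

    def copy(self, extra=None):
        # Copy the stage, then overlay any extra params on the copy only.
        that = copy.copy(self)
        that._paramMap = dict(self._paramMap)
        that._paramMap.update(extra or {})
        return that

    def fit(self, dataset, params=None):
        if isinstance(params, (list, tuple)):
            # One fitted model per param map.
            return [self.fit(dataset, p) for p in params]
        fitted = self.copy(params)
        return {"maxIter": fitted._paramMap["maxIter"], "rows": len(dataset)}

estimator = ToyEstimator()
models = estimator.fit([1, 2, 3], [{"maxIter": 5}, {"maxIter": 20}])
```

Because each fit works on a copy, the original estimator's params are left untouched.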
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:
      
      413c463 [Xiangrui Meng] remove unnecessary doc
      4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
      611c719 [Xiangrui Meng] fix python style
      68862b8 [Xiangrui Meng] update _java_obj initialization
      927ad19 [Xiangrui Meng] fix ml/tests.py
      0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
      9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
      c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
      7e0d27f [Xiangrui Meng] merge master
      46840fb [Xiangrui Meng] update wrappers
      b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
      46cb6ed [Xiangrui Meng] merge master
      a163413 [Xiangrui Meng] fix style
      1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
      9630eae [Xiangrui Meng] fix Identifiable._randomUID
      13bd70a [Xiangrui Meng] update ml/tests.py
      64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
      02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
      66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
      7431272 [Joseph K. Bradley] Rebased with master
      9c7e802a
  6. May 17, 2015
    • [SPARK-7694] [MLLIB] Use getOrElse for getting the threshold of LR model · 775e6f99
      Shuo Xiang authored
      The `toString` method of `LogisticRegressionModel` calls the `get` method on an Option (threshold) without a safeguard. In spark-shell, the code `val model = algorithm.run(data).clearThreshold()` in the lbfgs code will fail, because the REPL calls `toString` right after `clearThreshold()` to show the result.
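The same failure mode and fix can be shown in Python, where `None` plays the role of an empty Option. `ToyLogisticRegressionModel` is a hypothetical class for illustration; formatting `None` with `:.2f` would raise a `TypeError`, much like calling `get` on an empty Option throws.

```python
from typing import Optional

class ToyLogisticRegressionModel:
    """Toy model whose threshold may be cleared (set to None)."""

    def __init__(self, threshold: Optional[float] = 0.5):
        self.threshold = threshold

    def clear_threshold(self):
        self.threshold = None
        return self

    def __repr__(self):
        # Mirror Option.getOrElse: fall back to a label instead of failing.
        shown = f"{self.threshold:.2f}" if self.threshold is not None else "None"
        return f"ToyLogisticRegressionModel(threshold={shown})"

model = ToyLogisticRegressionModel().clear_threshold()
text = repr(model)
```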
      
      Author: Shuo Xiang <shuoxiangpub@gmail.com>
      
      Closes #6224 from coderxiang/getorelse and squashes the following commits:
      
      d5f53c9 [Shuo Xiang] use getOrElse for getting the threshold of LR model
      5f109b4 [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      c5c5bfe [Shuo Xiang] Merge remote-tracking branch 'upstream/master'
      98804c9 [Shuo Xiang] fix bug in topBykey and update test
      775e6f99
  7. May 16, 2015
  8. May 15, 2015
  9. May 14, 2015
    • [SPARK-7407] [MLLIB] use uid + name to identify parameters · 1b8625f4
      Xiangrui Meng authored
      A param instance is strongly attached to a parent in the current implementation. So if we make a copy of an estimator or a transformer in pipelines and other meta-algorithms, it becomes error-prone to copy the params to the copied instances. In this PR, a param is identified by its parent's UID and the param name. So it becomes loosely attached to its parent and all its derivatives. The UID is preserved during copying or fitting. All components now have a default constructor and a constructor that takes a UID as input. I keep the constructors for Param in this PR to reduce the size of the diff and made `parent` a mutable field.
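A small Python sketch of identifying a param by parent UID plus name, rather than by object identity, is given below. The classes `Param` and `ToyStage` and the uid `lr_1234` are hypothetical and only mirror the idea described here.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Param:
    """A param identified by its parent's uid and its own name."""
    parent_uid: str
    name: str

class ToyStage:
    def __init__(self, uid):
        self.uid = uid
        self.max_iter = Param(uid, "maxIter")
        self.param_map = {}

    def set(self, param, value):
        self.param_map[param] = value
        return self

    def copy(self):
        # The uid is preserved, so params of the original still match the copy.
        that = ToyStage(self.uid)
        that.param_map = dict(self.param_map)
        return that

original = ToyStage("lr_1234").set(Param("lr_1234", "maxIter"), 10)
clone = original.copy()
```

The clone holds a distinct `Param` object, yet a lookup with the original's param succeeds because equality and hashing depend only on the UID and the name.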
      
      This PR still needs some clean-ups, and there are several spark.ml PRs pending. I'll try to get them merged first and then update this PR.
      
      jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6019 from mengxr/SPARK-7407 and squashes the following commits:
      
      c4c8120 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
      520f0a2 [Xiangrui Meng] address comments
      2569168 [Xiangrui Meng] fix tests
      873caca [Xiangrui Meng] fix tests in OneVsRest; fix a racing condition in shouldOwn
      409ea08 [Xiangrui Meng] minor updates
      83a163c [Xiangrui Meng] update JavaDeveloperApiExample
      5db5325 [Xiangrui Meng] update OneVsRest
      7bde7ae [Xiangrui Meng] merge master
      697fdf9 [Xiangrui Meng] update Bucketizer
      7b4f6c2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7407
      629d402 [Xiangrui Meng] fix LRSuite
      154516f [Xiangrui Meng] merge master
      aa4a611 [Xiangrui Meng] fix examples/compile
      a4794dd [Xiangrui Meng] change Param to use  to reduce the size of diff
      fdbc415 [Xiangrui Meng] all tests passed
      c255f17 [Xiangrui Meng] fix tests in ParamsSuite
      818e1db [Xiangrui Meng] merge master
      e1160cf [Xiangrui Meng] fix tests
      fbc39f0 [Xiangrui Meng] pass test:compile
      108937e [Xiangrui Meng] pass compile
      8726d39 [Xiangrui Meng] use parent uid in Param
      eaeed35 [Xiangrui Meng] update Identifiable
      1b8625f4
    • [SPARK-7620] [ML] [MLLIB] Removed calling size, length in while condition to avoid extra JVM call · d3db2fd6
      DB Tsai authored
      Author: DB Tsai <dbt@netflix.com>
      
      Closes #6137 from dbtsai/clean and squashes the following commits:
      
      185816d [DB Tsai] fix compilication issue
      f418d08 [DB Tsai] first commit
      d3db2fd6
  10. May 13, 2015
    • [SPARK-7612] [MLLIB] update NB training to use mllib's BLAS · d5f18de1
      Xiangrui Meng authored
      This is similar to the changes to k-means, which give us better control over performance. dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6128 from mengxr/SPARK-7612 and squashes the following commits:
      
      b5c24c5 [Xiangrui Meng] merge master
      a90e3ec [Xiangrui Meng] update NB training to use mllib's BLAS
      d5f18de1
    • [SPARK-7545] [MLLIB] Added check in Bernoulli Naive Bayes to make sure that both training and predict features have values of 0 or 1 · 61e05fc5
      leahmcguire authored
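The kind of check named in this commit title can be sketched as a small validation helper. `require_zero_one` and its error message are hypothetical; the actual MLlib check is not reproduced in this log.

```python
def require_zero_one(features):
    """Raise ValueError unless every feature value is exactly 0 or 1."""
    for value in features:
        if value != 0.0 and value != 1.0:
            raise ValueError(
                f"Bernoulli naive Bayes requires 0 or 1 feature values, found {value}."
            )
    return features

checked = require_zero_one([0.0, 1.0, 1.0])
```

The same helper can be applied both to training rows and to vectors passed to prediction.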
      
      Author: leahmcguire <lmcguire@salesforce.com>
      
      Closes #6073 from leahmcguire/binaryCheckNB and squashes the following commits:
      
      b8442c2 [leahmcguire] changed to if else for value checks
      911bf83 [leahmcguire] undid reformat
      4eedf1e [leahmcguire] moved bernoulli check
      9ee9e84 [leahmcguire] fixed style error
      3f3b32c [leahmcguire] fixed zero one check so only called in combiner
      831fd27 [leahmcguire] got test working
      f44bb3c [leahmcguire] removed changes from CV branch
      67253f0 [leahmcguire] added check to bernoulli to ensure feature values are zero or one
      f191c71 [leahmcguire] fixed name
      58d060b [leahmcguire] changed param name and test according to comments
      04f0d3c [leahmcguire] Added stats from cross validation as a val in the cross validation model to save them for user access
      61e05fc5
    • [SPARK-7593] [ML] Python Api for ml.feature.Bucketizer · 5db18ba6
      Burak Yavuz authored
      Added `ml.feature.Bucketizer` to PySpark.
      
      cc mengxr
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #6124 from brkyvz/ml-bucket and squashes the following commits:
      
      05285be [Burak Yavuz] added sphinx doc
      6abb6ed [Burak Yavuz] added support for Bucketizer
      5db18ba6
  11. May 12, 2015
    • [SPARK-7528] [MLLIB] make RankingMetrics Java-friendly · 2713bc65
      Xiangrui Meng authored
      `RankingMetrics` contains a ClassTag, which is hard to create in Java. This PR adds a factory method `of` for Java users. coderxiang
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6098 from mengxr/SPARK-7528 and squashes the following commits:
      
      e5d57ae [Xiangrui Meng] make RankingMetrics Java-friendly
      2713bc65
    • [SPARK-7573] [ML] OneVsRest cleanups · 96c4846d
      Joseph K. Bradley authored
      Minor cleanups discussed with [~mengxr]:
      * move OneVsRest from reduction to classification sub-package
      * make model constructor private
      
      Some doc cleanups too
      
      CC: harsha2010  Could you please verify this looks OK?  Thanks!
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6097 from jkbradley/onevsrest-cleanup and squashes the following commits:
      
      4ecd48d [Joseph K. Bradley] org imports
      430b065 [Joseph K. Bradley] moved OneVsRest from reduction subpackage to classification.  small java doc style fixes
      9f8b9b9 [Joseph K. Bradley] Small cleanups to OneVsRest.  Made model constructor private to ml package.
      96c4846d
    • [SPARK-7557] [ML] [DOC] User guide for spark.ml HashingTF, Tokenizer · f0c1bc34
      Joseph K. Bradley authored
      Added feature transformer subsection to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide.
      
      I've run Scala, Python examples in the Spark/PySpark shells.  I ran the Java examples via the test suite (with small modifications for printing).
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #6093 from jkbradley/hashingtf-guide and squashes the following commits:
      
      d5d213f [Joseph K. Bradley] small fix
      dd6e91a [Joseph K. Bradley] fixes from code review of user guide
      33c3ff9 [Joseph K. Bradley] small fix
      bc6058c [Joseph K. Bradley] fix link
      361a174 [Joseph K. Bradley] Added subsection for feature transformers to spark.ml guide, with HashingTF and Tokenizer.  Added JavaHashingTFSuite to test Java examples in new guide
      f0c1bc34
    • [SPARK-7571] [MLLIB] rename Math to math · a4874b0d
      Xiangrui Meng authored
      `scala.Math` has been deprecated since Scala 2.8. This PR only touches `Math` usages in MLlib. dbtsai
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6092 from mengxr/SPARK-7571 and squashes the following commits:
      
      fe8f8d3 [Xiangrui Meng] Math -> math
      a4874b0d
    • [SPARK-7559] [MLLIB] Bucketizer should include the right most boundary in the last bucket. · 23b9863e
      Xiangrui Meng authored
      We currently special-case +inf in `Bucketizer`. This could be simplified by always including the largest split value in the last bucket. E.g., (x1, x2, x3) defines buckets [x1, x2) and [x2, x3]. This shouldn't affect user code much, and there are applications that need to include the right-most value. For example, we can bucketize ratings from 0 to 10 into bad, neutral, and good with splits 0, 4, 6, 10. It would read oddly if users had to write 0, 4, 6, 10.1 (or 11) instead.
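The bucketing rule described here can be sketched with Python's `bisect` module. The function `bucketize` and the label list are hypothetical; the sketch only mirrors the stated rule that every bucket is left-closed and the last one also includes its upper bound.

```python
from bisect import bisect_right

def bucketize(value, splits):
    """Return the bucket index of value; the last bucket includes its upper bound."""
    if value < splits[0] or value > splits[-1]:
        raise ValueError(f"{value} is outside [{splits[0]}, {splits[-1]}]")
    if value == splits[-1]:
        # Special case: the largest split belongs to the last bucket.
        return len(splits) - 2
    # Other buckets are [splits[i], splits[i+1]), located by binary search.
    return bisect_right(splits, value) - 1

splits = [0, 4, 6, 10]
labels = ["bad", "neutral", "good"]
ratings = [labels[bucketize(r, splits)] for r in (0, 3.9, 4, 6, 10)]
```

A rating of exactly 10 lands in the last bucket without padding the final split to 10.1 or 11.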
      
      This also updates the implementation to use `Arrays.binarySearch` and `withClue` in the tests.
      
      yinxusen jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6075 from mengxr/SPARK-7559 and squashes the following commits:
      
      e28f910 [Xiangrui Meng] update bucketizer impl
      23b9863e
    • [SPARK-7015] [MLLIB] [WIP] Multiclass to Binary Reduction: One Against All · 595a6758
      Ram Sriharsha authored
      Initial cut of one against all. The test code is scaffolding and is not fully implemented.
      This WIP is to gather early feedback.
      
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #5830 from harsha2010/reduction and squashes the following commits:
      
      5f4b495 [Ram Sriharsha] Fix Test
      386e98b [Ram Sriharsha] Style fix
      49b4a17 [Ram Sriharsha] Simplify the test
      02279cc [Ram Sriharsha] Output Label Metadata in Prediction Col
      bc78032 [Ram Sriharsha] Code Review Updates
      8ce4845 [Ram Sriharsha] Merge with Master
      2a807be [Ram Sriharsha] Merge branch 'master' into reduction
      e21bfcc [Ram Sriharsha] Style Fix
      5614f23 [Ram Sriharsha] Style Fix
      c75583a [Ram Sriharsha] Cleanup
      7a5f136 [Ram Sriharsha] Fix TODOs
      804826b [Ram Sriharsha] Merge with Master
      1448a5f [Ram Sriharsha] Style Fix
      6e47807 [Ram Sriharsha] Style Fix
      d63e46b [Ram Sriharsha] Incorporate Code Review Feedback
      ced68b5 [Ram Sriharsha] Refactor OneVsAll to implement Predictor
      78fa82a [Ram Sriharsha] extra line
      0dfa1fb [Ram Sriharsha] Fix inexhaustive match cases that may arise from UnresolvedAttribute
      a59a4f4 [Ram Sriharsha] @Experimental
      4167234 [Ram Sriharsha] Merge branch 'master' into reduction
      868a4fd [Ram Sriharsha] @Experimental
      041d905 [Ram Sriharsha] Code Review Fixes
      df188d8 [Ram Sriharsha] Style fix
      612ec48 [Ram Sriharsha] Style Fix
      6ef43d3 [Ram Sriharsha] Prefer Unresolved Attribute to Option: Java APIs are cleaner
      6bf6bff [Ram Sriharsha] Update OneHotEncoder to new API
      e29cb89 [Ram Sriharsha] Merge branch 'master' into reduction
      1c7fa44 [Ram Sriharsha] Fix Tests
      ca83672 [Ram Sriharsha] Incorporate Code Review Feedback + Rename to OneVsRestClassifier
      221beeed [Ram Sriharsha] Upgrade to use Copy method for cloning Base Classifiers
      26f1ddb [Ram Sriharsha] Merge with SPARK-5956 API changes
      9738744 [Ram Sriharsha] Merge branch 'master' into reduction
      1a3e375 [Ram Sriharsha] More efficient Implementation: Use withColumn to generate label column dynamically
      32e0189 [Ram Sriharsha] Restrict reduction to Margin Based Classifiers
      ff272da [Ram Sriharsha] Style fix
      28771f5 [Ram Sriharsha] Add Tests for Multiclass to Binary Reduction
      b60f874 [Ram Sriharsha] Fix Style issues in Test
      3191cdf [Ram Sriharsha] Remove this test, accidental commit
      23f056c [Ram Sriharsha] Fix Headers for test
      1b5e929 [Ram Sriharsha] Fix Style issues and add Header
      8752863 [Ram Sriharsha] [SPARK-7015][MLLib][WIP] Multiclass to Binary Reduction: One Against All
      595a6758
    • [SPARK-7485] [BUILD] Remove pyspark files from assembly. · 82e890fb
      Marcelo Vanzin authored
      The sbt part of the build is hacky; it basically tricks sbt
      into generating the zip by using a generator, but returns
      an empty list for the generated files so that nothing is
      actually added to the assembly.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6022 from vanzin/SPARK-7485 and squashes the following commits:
      
      22c1e04 [Marcelo Vanzin] Remove unneeded code.
      4893622 [Marcelo Vanzin] [SPARK-7485] [build] Remove pyspark files from assembly.
      82e890fb
  12. May 11, 2015
    • [SPARK-5893] [ML] Add bucketizer · 35fb42a0
      Xusen Yin authored
      JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-5893).
      
      One thing to make clear: the `buckets` parameter, an array of `Double`, serves as the list of split points. For example,
      
      ```scala
      buckets = Array(-0.5, 0.0, 0.5)
      ```
      
      splits the real line into 4 ranges, (-inf, -0.5], (-0.5, 0.0], (0.0, 0.5], (0.5, +inf), which are encoded as 0, 1, 2, 3.
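The right-closed encoding described in this message can be sketched with a binary search in Python. `bucket_index` is a hypothetical helper that mirrors only the ranges listed here, not the final Bucketizer behaviour after later changes.

```python
from bisect import bisect_left

def bucket_index(value, buckets):
    """Encode value by the right-closed range (b[i-1], b[i]] it falls into."""
    # bisect_left counts the split points strictly less than value.
    return bisect_left(buckets, value)

buckets = [-0.5, 0.0, 0.5]
codes = [bucket_index(x, buckets) for x in (-1.0, -0.5, -0.4, 0.0, 0.5, 0.6)]
```

Values equal to a split point fall into the lower range, matching the closed right ends in the list of ranges.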
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #5980 from yinxusen/SPARK-5893 and squashes the following commits:
      
      dc8c843 [Xusen Yin] Merge pull request #4 from jkbradley/yinxusen-SPARK-5893
      1ca973a [Joseph K. Bradley] one more bucketizer test
      34f124a [Joseph K. Bradley] Removed lowerInclusive, upperInclusive params from Bucketizer, and used splits instead.
      eacfcfa [Xusen Yin] change ML attribute from splits into buckets
      c3cc770 [Xusen Yin] add more unit test for binary search
      3a16cc2 [Xusen Yin] refine comments and names
      ac77859 [Xusen Yin] fix style error
      fb30d79 [Xusen Yin] fix and test binary search
      2466322 [Xusen Yin] refactor Bucketizer
      11fb00a [Xusen Yin] change it into an Estimator
      998bc87 [Xusen Yin] check buckets
      4024cf1 [Xusen Yin] add test suite
      5fe190e [Xusen Yin] add bucketizer
      35fb42a0