  1. Jul 24, 2015
    • [SPARK-9222] [MLlib] Make class instantiation variables in DistributedLDAModel private[clustering] · e2531245
      MechCoder authored
      This makes it easier to test all the class variables of the DistributedLDAModel.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7573 from MechCoder/lda_test and squashes the following commits:
      
      2f1a293 [MechCoder] [SPARK-9222] [MLlib] Make class instantiation variables in DistributedLDAModel private[clustering]
    • [SPARK-9285][SQL] Remove InternalRow's inheritance from Row. · 431ca39b
      Reynold Xin authored
      I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7626 from rxin/internalRow and squashes the following commits:
      
      e124daf [Reynold Xin] Fixed test case.
      805ceb7 [Reynold Xin] Commented out the failed test suite.
      f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining.
      76d9081 [Reynold Xin] Fixed data sources.
      7807f70 [Reynold Xin] Fixed DataFrameSuite.
      cb60cd2 [Reynold Xin] Code review & small bug fixes.
      0a2948b [Reynold Xin] Fixed style.
      3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
    • [SPARK-8092] [ML] Allow OneVsRest Classifier feature and label column names to be configurable. · d4d762f2
      Ram Sriharsha authored
      The base classifier input and output columns are ignored in favor of the ones specified in OneVsRest.
      
      Author: Ram Sriharsha <rsriharsha@hw11853.local>
      
      Closes #6631 from harsha2010/SPARK-8092 and squashes the following commits:
      
      6591dc6 [Ram Sriharsha] add documentation for params
      b7024b1 [Ram Sriharsha] cleanup
      f0e2bfb [Ram Sriharsha] merge with master
      108d3d7 [Ram Sriharsha] merge with master
      4f74126 [Ram Sriharsha] Allow label/ features columns to be configurable
  2. Jul 23, 2015
    • [SPARK-9069] [SPARK-9264] [SQL] remove unlimited precision support for DecimalType · 8a94eb23
      Davies Liu authored
      Remove Decimal.Unlimited (changed to support precision up to 38, to match Hive and other databases).
      
      To keep backward source compatibility, Decimal.Unlimited is still there, but it now means Decimal(38, 18).
      
      If no precision and scale are provided, the type is Decimal(10, 0) as before.
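
The precision rules above can be sketched in a few lines of Python. This is only an illustration: the function name `resolve_decimal` and the tuple representation are hypothetical and not part of Spark, and the range checks are an assumption of the sketch. Only the values 38, (38, 18) and (10, 0) come from the description.

```python
MAX_PRECISION = 38           # upper bound stated in the description
UNLIMITED_ALIAS = (38, 18)   # what Decimal.Unlimited now stands for
DEFAULT = (10, 0)            # used when no precision and scale are given


def resolve_decimal(precision=None, scale=None, unlimited=False):
    """Return the (precision, scale) pair implied by the rules above.

    Hypothetical helper: either both precision and scale are given,
    or neither is.
    """
    if unlimited:
        return UNLIMITED_ALIAS
    if precision is None and scale is None:
        return DEFAULT
    if precision is None or scale is None:
        raise ValueError("give both precision and scale, or neither")
    # Assumption of this sketch: reject anything outside 1..38 and
    # any scale larger than the precision.
    if not 1 <= precision <= MAX_PRECISION:
        raise ValueError("precision must be between 1 and 38")
    if not 0 <= scale <= precision:
        raise ValueError("scale must be between 0 and precision")
    return (precision, scale)
```

For example, `resolve_decimal(unlimited=True)` yields `(38, 18)` and `resolve_decimal()` yields `(10, 0)`.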
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7605 from davies/decimal_unlimited and squashes the following commits:
      
      aa3f115 [Davies Liu] fix tests and style
      fb0d20d [Davies Liu] address comments
      bfaae35 [Davies Liu] fix style
      df93657 [Davies Liu] address comments and clean up
      06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited
      4c28969 [Davies Liu] fix tests
      8d783cc [Davies Liu] fix tests
      788631c [Davies Liu] fix double with decimal in Union/except
      1779bde [Davies Liu] fix scala style
      c9c7c78 [Davies Liu] remove Decimal.Unlimited
    • [SPARK-7254] [MLLIB] Run PowerIterationClustering directly on graph · 825ab1e4
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-7254
      
      Author: Liang-Chi Hsieh <viirya@appier.com>
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #6054 from viirya/pic_on_graph and squashes the following commits:
      
      8b87b81 [Liang-Chi Hsieh] Fix scala style.
      a22fb8b [Liang-Chi Hsieh] For comment.
      ef565a0 [Liang-Chi Hsieh] Fix indentation.
      d249aa1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into pic_on_graph
      82d7351 [Liang-Chi Hsieh] Run PowerIterationClustering directly on graph.
    • [SPARK-9268] [ML] Removed varargs annotation from Params.setDefault taking multiple params · 410dd41c
      Joseph K. Bradley authored
      Removed varargs annotation from Params.setDefault taking multiple params.
      
      Though varargs is technically correct, it often requires developers to run a clean assembly rather than an incremental (not clean) assembly, which is a nuisance during development.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7604 from jkbradley/params-setdefault-varargs and squashes the following commits:
      
      6016dc6 [Joseph K. Bradley] removed varargs annotation from Params.setDefault taking multiple params
  3. Jul 22, 2015
    • [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled · b217230f
      Josh Rosen authored
      Spark has an option called spark.localExecution.enabled; according to the docs:
      
      > Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.
      
      This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.
      
      This pull request simply brings #7484 up to date.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7585 from rxin/remove-local-exec and squashes the following commits:
      
      84bd10e [Reynold Xin] Python fix.
      1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
      eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
      b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
      8975d96 [Josh Rosen] Remove local execution tests.
      ffa8c9b [Josh Rosen] Remove documentation for configuration
    • [SPARK-9262][build] Treat Scala compiler warnings as errors · d71a13f4
      Reynold Xin authored
      In the past few weeks I've seen a few cases where compiler warnings were caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings.
      
      Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop).
      
      Most of the work was done by ericl.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #7598 from rxin/warnings and squashes the following commits:
      
      beb311b [Reynold Xin] Fixed tests.
      542c031 [Reynold Xin] Fixed one more warning.
      87c354a [Reynold Xin] Fixed all non-deprecation warnings.
      78660ac [Eric Liang] first effort to fix warnings
    • [SPARK-8484] [ML] Added TrainValidationSplit for hyper-parameter tuning. · a721ee52
      martinzapletal authored
      - [X] Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation sets and uses an evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive.
      - [X] Simplified replacement of https://github.com/apache/spark/pull/6996
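
The selection procedure described above (one random split, fit on the train part, score each candidate on the validation part) can be sketched in plain Python. This is not the Spark API: the function `train_validation_split`, its parameters, and the convention that a larger metric is better are all assumptions made for this illustration.

```python
import random


def train_validation_split(data, candidates, fit, evaluate,
                           train_ratio=0.75, seed=0):
    """Pick the best candidate using a single random train/validation split.

    fit(train, params) returns a model; evaluate(model, validation)
    returns a score where larger is better (an assumption of this sketch).
    """
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    train, validation = shuffled[:cut], shuffled[cut:]

    best = None
    for params in candidates:
        model = fit(train, params)
        metric = evaluate(model, validation)
        if best is None or metric > best[0]:
            best = (metric, params, model)
    return best[1], best[2]


def fit_line(train, slope):
    """Toy estimator: the slope is the hyper-parameter, the intercept is fitted."""
    intercept = sum(y - slope * x for x, y in train) / len(train)
    return slope, intercept


def negative_mse(model, validation):
    """Toy evaluator: negated mean squared error, so larger is better."""
    slope, intercept = model
    errors = [(slope * x + intercept - y) ** 2 for x, y in validation]
    return -sum(errors) / len(errors)
```

Unlike k-fold cross-validation, each candidate is fitted only once, which is where the lower cost comes from; the price is a noisier estimate because only one split is used.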
      
      Author: martinzapletal <zapletal-martin@email.cz>
      
      Closes #7337 from zapletal-martin/SPARK-8484-TrainValidationSplit and squashes the following commits:
      
      cafc949 [martinzapletal] Review comments https://github.com/apache/spark/pull/7337.
      511b398 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8484-TrainValidationSplit
      f4fc9c4 [martinzapletal] SPARK-8484 Resolved feedback to https://github.com/apache/spark/pull/7337
      00c4f5a [martinzapletal] SPARK-8484. Styling.
      d699506 [martinzapletal] SPARK-8484. Styling.
      93ed2ee [martinzapletal] Styling.
      3bc1853 [martinzapletal] SPARK-8484. Styling.
      2aa6f43 [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.
      21662eb [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.
    • [SPARK-9244] Increase some memory defaults · fe26584a
      Matei Zaharia authored
      There are a few memory limits that people hit often and that we could
      make higher, especially now that memory sizes have grown.
      
      - spark.akka.frameSize: This defaults to 10 but is often hit for map
        output statuses in large shuffles. This memory is not fully allocated
        up-front, so we can just make this larger and still not affect jobs
        that never send a status that large. We increase it to 128.
      
      - spark.executor.memory: Defaults to 512m, which is really small. We
        increase it to 1g.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #7586 from mateiz/configs and squashes the following commits:
      
      ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
    • [SPARK-8536] [MLLIB] Generalize OnlineLDAOptimizer to asymmetric document-topic Dirichlet priors · 1aca9c13
      Feynman Liang authored
      Modify `LDA` to take asymmetric document-topic prior distributions and `OnlineLDAOptimizer` to use the asymmetric prior during variational inference.
      
      This PR only generalizes `OnlineLDAOptimizer` and the associated `LocalLDAModel`; `EMLDAOptimizer` and `DistributedLDAModel` still only support symmetric `alpha` (checked during `EMLDAOptimizer.initialize`).
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7575 from feynmanliang/SPARK-8536-LDA-asymmetric-priors and squashes the following commits:
      
      af8fbb7 [Feynman Liang] Fix merge errors
      ef5821d [Feynman Liang] Merge remote-tracking branch 'apache/master' into SPARK-8536-LDA-asymmetric-priors
      58f1d7b [Feynman Liang] Fix from review feedback
      a6dcf70 [Feynman Liang] Change docConcentration interface and move LDAOptimizer validation to initialize, add sad path tests
      72038ff [Feynman Liang] Add tests referenced against gensim
      d4284fa [Feynman Liang] Generalize OnlineLDA to asymmetric priors, no tests
    • [SPARK-9224] [MLLIB] OnlineLDA Performance Improvements · 8486cd85
      Feynman Liang authored
      In-place updates, reduce number of transposes, and vectorize operations in OnlineLDA implementation.
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7454 from feynmanliang/OnlineLDA-perf-improvements and squashes the following commits:
      
      78b0f5a [Feynman Liang] Make in-place variables vals, fix BLAS error
      7f62a55 [Feynman Liang] --amend
      c62cb1e [Feynman Liang] Outer product for stats, revert Range slicing
      aead650 [Feynman Liang] Range slice, in-place update, reduce transposes
  4. Jul 21, 2015
    • [SPARK-5989] [MLLIB] Model save/load for LDA · 89db3c0b
      MechCoder authored
      Add support for saving and loading LDA models, both the local and distributed versions.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #6948 from MechCoder/lda_save_load and squashes the following commits:
      
      49bcdce [MechCoder] minor style fixes
      cc14054 [MechCoder] minor
      4587d1d [MechCoder] Minor changes
      c753122 [MechCoder] Load and save the model in private methods
      2782326 [MechCoder] [SPARK-5989] Model save/load for LDA
    • [SPARK-8915] [DOCUMENTATION, MLLIB] Added @since tags to mllib.classification · df4ddb31
      petz2000 authored
      Added @since tags for methods in mllib.classification
      
      Author: petz2000 <petz2000@gmail.com>
      
      Closes #7371 from petz2000/add_since_mllib.classification and squashes the following commits:
      
      39fe291 [petz2000] Removed whitespace in block comment
      c9b1e03 [petz2000] Removed @since tags again from protected and private methods
      cd759b6 [petz2000] Added @since tags to methods
    • [SPARK-9204][ML] Add default params test for linearyregression suite · 4d97be95
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #7553 from holdenk/SPARK-9204-add-default-params-test-to-linear-regression and squashes the following commits:
      
      630ba19 [Holden Karau] style fix
      faa08a3 [Holden Karau] Add default params test for linearyregression suite
  5. Jul 20, 2015
    • [SPARK-9201] [ML] Initial integration of MLlib + SparkR using RFormula · 1cbdd899
      Eric Liang authored
      This exposes the SparkR:::glm() and SparkR:::predict() APIs. It was necessary to change RFormula to silently drop the label column if it was missing from the input dataset, which is kind of a hack but necessary to integrate with the Pipeline API.
      
      The umbrella design doc for MLlib + SparkR integration can be viewed here: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit
      
      mengxr
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #7483 from ericl/spark-8774 and squashes the following commits:
      
      3dfac0c [Eric Liang] update
      17ef516 [Eric Liang] more comments
      1753a0f [Eric Liang] make glm generic
      b0f50f8 [Eric Liang] equivalence test
      550d56d [Eric Liang] export methods
      c015697 [Eric Liang] second pass
      117949a [Eric Liang] comments
      5afbc67 [Eric Liang] test label columns
      6b7f15f [Eric Liang] Fri Jul 17 14:20:22 PDT 2015
      3a63ae5 [Eric Liang] Fri Jul 17 13:41:52 PDT 2015
      ce61367 [Eric Liang] Fri Jul 17 13:41:17 PDT 2015
      0299c59 [Eric Liang] Fri Jul 17 13:40:32 PDT 2015
      e37603f [Eric Liang] Fri Jul 17 12:15:03 PDT 2015
      d417d0c [Eric Liang] Merge remote-tracking branch 'upstream/master' into spark-8774
      29a2ce7 [Eric Liang] Merge branch 'spark-8774-1' into spark-8774
      d1959d2 [Eric Liang] clarify comment
      2db68aa [Eric Liang] second round of comments
      dc3c943 [Eric Liang] address comments
      5765ec6 [Eric Liang] fix style checks
      1f361b0 [Eric Liang] doc
      d33211b [Eric Liang] r support
      fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer
    • [SPARK-9175] [MLLIB] BLAS.gemm fails to update matrix C when alpha==0 and beta!=1 · ff3c72db
      Meihua Wu authored
      Fix BLAS.gemm to update matrix C when alpha==0 and beta!=1
      Also include unit tests to verify the fix.
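
For reference, gemm computes C := alpha * A * B + beta * C, so when alpha is 0 the product term vanishes but C must still be scaled by beta. The sketch below illustrates that contract on plain lists of lists; it is not the MLlib implementation, and the function signature and shape conventions (A is m x k, B is k x n, C is m x n) are assumptions of this example.

```python
def gemm(alpha, A, B, beta, C):
    """Update C in place to alpha * A * B + beta * C and return it."""
    m, k, n = len(A), len(B), len(B[0])
    if alpha == 0.0:
        # Shortcut: no product to compute, but C still has to be
        # scaled by beta unless beta is exactly 1.
        if beta != 1.0:
            for i in range(m):
                for j in range(n):
                    C[i][j] *= beta
        return C
    for i in range(m):
        for j in range(n):
            acc = sum(A[i][p] * B[p][j] for p in range(k))
            C[i][j] = alpha * acc + beta * C[i][j]
    return C
```

With alpha = 0 and beta = 2, a matrix C of ones becomes a matrix of twos; an early return that skipped the scaling step would leave C unchanged.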
      
      mengxr brkyvz
      
      Author: Meihua Wu <meihuawu@umich.edu>
      
      Closes #7503 from rotationsymmetry/fix_BLAS_gemm and squashes the following commits:
      
      fce199c [Meihua Wu] Fix BLAS.gemm to update C when alpha==0 and beta!=1
    • [SPARK-8996] [MLLIB] [PYSPARK] Python API for Kolmogorov-Smirnov Test · d0b4e93f
      MechCoder authored
      Python API for the KS-test
      
      Statistics.kolmogorovSmirnovTest(data, distName, *params)
      I'm not quite sure how to support the callable function since it is not serializable.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7430 from MechCoder/spark-8996 and squashes the following commits:
      
      2dd009d [MechCoder] minor
      021d233 [MechCoder] Remove one wrapper and other minor stuff
      49d07ab [MechCoder] [SPARK-8996] [MLlib] Python API for Kolmogorov-Smirnov Test
    • [SPARK-7422] [MLLIB] Add argmax to Vector, SparseVector · 3f7de7db
      George Dittmar authored
      Modify Vector, DenseVector, and SparseVector to implement argmax functionality. This work sets the stage for changes to be done in SPARK-7423.
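
The subtle part of argmax on a sparse vector is that entries which are not stored are implicit zeros, and they win (or tie) whenever every stored value is non-positive. The sketch below shows one way to handle this; it is not the MLlib code. The representation (a size plus sorted `indices` and matching `values`) and the choice to return -1 for an empty vector are assumptions of this example.

```python
def sparse_argmax(size, indices, values):
    """Return the index of the first maximal element of a sparse vector.

    indices must be strictly increasing; entries that are not stored
    are treated as 0.0.
    """
    if size == 0:
        return -1            # convention chosen for this sketch
    if not indices:
        return 0             # every entry is an implicit zero

    # First maximal value among the stored entries.
    best_pos = 0
    for pos in range(1, len(values)):
        if values[pos] > values[best_pos]:
            best_pos = pos
    max_value = values[best_pos]
    max_index = indices[best_pos]

    if max_value <= 0.0 and len(indices) < size:
        # Some entries are implicit zeros; find the first one.
        first_implicit = len(indices)
        for k, idx in enumerate(indices):
            if idx != k:
                first_implicit = k
                break
        if max_value < 0.0:
            return first_implicit
        # Stored max is exactly 0: the earliest zero wins.
        return min(max_index, first_implicit)
    return max_index
```

For instance, a vector of size 5 with stored entries {0: -1.0, 2: -3.0} has its maximum at index 1, which is an implicit zero.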
      
      Author: George Dittmar <georgedittmar@gmail.com>
      Author: George <dittmar@Georges-MacBook-Pro.local>
      Author: dittmarg <george.dittmar@webtrends.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #6112 from GeorgeDittmar/SPARK-7422 and squashes the following commits:
      
      3e0a939 [George Dittmar] Merge pull request #1 from mengxr/SPARK-7422
      127dec5 [Xiangrui Meng] update argmax impl
      2ea6a55 [George Dittmar] Added MimaExcludes for Vectors.argmax
      98058f4 [George Dittmar] Merge branch 'master' of github.com:apache/spark into SPARK-7422
      5fd9380 [George Dittmar] fixing style check error
      42341fb [George Dittmar] refactoring arg max check to better handle zero values
      b22af46 [George Dittmar] Fixing spaces between commas in unit test
      f2eba2f [George Dittmar] Cleaning up unit tests to be fewer lines
      aa330e3 [George Dittmar] Fixing some last if else spacing issues
      ac53c55 [George Dittmar] changing dense vector argmax unit test to be one line call vs 2
      d5b5423 [George Dittmar] Fixing code style and updating if logic on when to check for zero values
      ee1a85a [George Dittmar] Cleaning up unit tests a bit and modifying a few cases
      3ee8711 [George Dittmar] Fixing corner case issue with zeros in the active values of the sparse vector. Updated unit tests
      b1f059f [George Dittmar] Added comment before we start arg max calculation. Updated unit tests to cover corner cases
      f21dcce [George Dittmar] commit
      af17981 [dittmarg] Initial work fixing bug that was made clear in pr
      eeda560 [George] Fixing SparseVector argmax function to ignore zero values while doing the calculation.
      4526acc [George] Merge branch 'master' of github.com:apache/spark into SPARK-7422
      df9538a [George] Added argmax to sparse vector and added unit test
      3cffed4 [George] Adding unit tests for argmax functions for Dense and Sparse vectors
      04677af [George] initial work on adding argmax to Vector and SparseVector
  6. Jul 17, 2015
    • [SPARK-9118] [ML] Implement IntArrayParam in mllib · 10179082
      Rekha Joshi authored
      Implement IntArrayParam in mllib
      
      Author: Rekha Joshi <rekhajoshm@gmail.com>
      Author: Joshi <rekhajoshm@gmail.com>
      
      Closes #7481 from rekhajoshm/SPARK-9118 and squashes the following commits:
      
      d3b1766 [Joshi] Implement IntArrayParam
      0be142d [Rekha Joshi] Merge pull request #3 from apache/master
      106fd8e [Rekha Joshi] Merge pull request #2 from apache/master
      e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
    • [SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines · 34a889db
      Yu ISHIKAWA authored
      I implemented the KMeans API for spark.ml Pipelines. It does not include the clustering abstractions for spark.ml (SPARK-7610), which would fit better in a separate issue. I'll work on that later, since we are trying to add hierarchical clustering algorithms in another issue. Thanks.
      
      [SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits:
      
      be752de [Yu ISHIKAWA] Add assertions
      a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst
      4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python
      fb2417c [Yu ISHIKAWA] Use getInt, instead of get
      f397be4 [Yu ISHIKAWA] Switch the comparisons.
      ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter.
      effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam
      c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test
      19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests
      1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst
      f8338bc [Yu ISHIKAWA] Add the placeholders in Python
      4a03003 [Yu ISHIKAWA] Test for contains in Python
      6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply`
      288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names
      5a7d574 [Yu ISHIKAWA] Renamce `validateInitializationMode` to `validateInitMode` and remove throwing exception
      97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy`
      e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class
      978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans
      2ec80bc [Yu ISHIKAWA] Fit on 1 line
      e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones
      b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python
      f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python
      3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation
      4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon
      2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam`
      19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF
      4d2ad1e [Yu ISHIKAWA] Modify the indentations
      0ae422f [Yu ISHIKAWA] Add a test for `setParams`
      4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala
      11ffdf1 [Yu ISHIKAWA] Use `===` and the variable
      220a176 [Yu ISHIKAWA] Set a random seed in the unit testing
      92c3efc [Yu ISHIKAWA] Make the points for a test be fewer
      c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python
      6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods
      687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala
      a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations
      5bedc51 [Yu ISHIKAWA] Remve an extra new line
      444c289 [Yu ISHIKAWA] Add the validation for `runs`
      e41989c [Yu ISHIKAWA] Modify how to validate `initStep`
      7ea133a [Yu ISHIKAWA] Change how to validate `initMode`
      7991e15 [Yu ISHIKAWA] Add a validation for `k`
      c2df35d [Yu ISHIKAWA] Make `predict` private
      93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform`
      d3a79f7 [Yu ISHIKAWA] Remove the inhefited docs
      e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private
      8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans
      6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps`
      99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode`
      79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs
      6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault`
      11c2a12 [Yu ISHIKAWA] Limit the imports
      badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel}
      f80319a [Yu ISHIKAWA] Rebase mater branch and add copy methods
      85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol`
      aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x
      c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline
      598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python
      63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala
    • [SPARK-7127] [MLLIB] Adding broadcast of model before prediction for ensembles · 8b8be1f5
      Bryan Cutler authored
      Broadcast of ensemble models in transformImpl before call to predict
      
      Author: Bryan Cutler <bjcutler@us.ibm.com>
      
      Closes #6300 from BryanCutler/bcast-ensemble-models-7127 and squashes the following commits:
      
      86e73de [Bryan Cutler] [SPARK-7127] Replaced deprecated callUDF with udf
      40a139d [Bryan Cutler] Merge branch 'master' into bcast-ensemble-models-7127
      9afad56 [Bryan Cutler] [SPARK-7127] Simplified calls by overriding transformImpl and using broadcasted model in callUDF to make prediction
      1f34be4 [Bryan Cutler] [SPARK-7127] Removed accidental newline
      171a6ce [Bryan Cutler] [SPARK-7127] Used modelAccessor parameter in predictImpl to access broadcasted model
      6fd153c [Bryan Cutler] [SPARK-7127] Applied broadcasting to remaining ensemble models
      aaad77b [Bryan Cutler] [SPARK-7127] Removed abstract class for broadcasting model, instead passing a prediction function as param to transform
      83904bb [Bryan Cutler] [SPARK-7127] Adding broadcast of model before prediction in RandomForestClassifier
    • [SPARK-9090] [ML] Fix definition of residual in LinearRegressionSummary, EnsembleTestHelper, and SquaredError · 6da10696
      Feynman Liang authored
      
      Make the definition of residuals in Spark consistent with literature. We have been using `prediction - label` for residuals, but literature usually defines `residual = label - prediction`.
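
The sign convention can be made concrete with a short Python sketch. The helper name `residuals` is hypothetical and used only for illustration; only the formula `residual = label - prediction` comes from the description.

```python
def residuals(labels, predictions):
    """Return label - prediction for each pair (the convention adopted here)."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")
    return [label - prediction
            for label, prediction in zip(labels, predictions)]
```

Under this convention a positive residual means the model under-predicted the label, and a negative residual means it over-predicted.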
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7435 from feynmanliang/SPARK-9090-Fix-LinearRegressionSummary-Residuals and squashes the following commits:
      
      f4b39d8 [Feynman Liang] Fix doc
      bc12a92 [Feynman Liang] Tweak EnsembleTestHelper and SquaredError residuals
      63f0d60 [Feynman Liang] Fix definition of residual
    • [SPARK-8600] [ML] Naive Bayes API for spark.ml Pipelines · 99746428
      Yanbo Liang authored
      Naive Bayes API for spark.ml Pipelines
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7284 from yanboliang/spark-8600 and squashes the following commits:
      
      bc890f7 [Yanbo Liang] remove labels valid check
      c3de687 [Yanbo Liang] remove labels from ml.NaiveBayesModel
      a2b3088 [Yanbo Liang] address comments
      3220b82 [Yanbo Liang] trigger jenkins
      3018a41 [Yanbo Liang] address comments
      208e166 [Yanbo Liang] Naive Bayes API for spark.ml Pipelines
    • [SPARK-9062] [ML] Change output type of Tokenizer to Array(String, true) · 806c579f
      Yuhao Yang authored
      jira: https://issues.apache.org/jira/browse/SPARK-9062
      
      Currently the output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec and other transformers, since their input type is Array(String, true). Seq[String] in a udf is treated as Array(String, true) by default.
      
      I'm not sure what the recommended way is for Tokenizer to handle null values in the input. Any suggestions are welcome.
      
      Author: Yuhao Yang <hhbyyh@gmail.com>
      
      Closes #7414 from hhbyyh/tokenizer and squashes the following commits:
      
      c01bd7a [Yuhao Yang] change output type of tokenizer
    • [MINOR] [ML] fix wrong annotation of RFormula.formula · 441e072a
      Yanbo Liang authored
      fix wrong annotation of RFormula.formula
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #7470 from yanboliang/RFormula and squashes the following commits:
      
      61f1919 [Yanbo Liang] fix wrong annotation
    • [SPARK-9126] [MLLIB] do not assert on time taken by Thread.sleep() · 358e7bf6
      Xiangrui Meng authored
      Measure lower and upper bounds for task time and use them for validation. This PR also implements `Stopwatch.toString`. This suite should finish in less than 1 second.
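
The idea of bounding the measured time, rather than asserting on the nominal sleep duration, can be sketched as follows. This is a Python illustration, not the Scala test from the PR: the `Stopwatch` class and the `measure_bounds` helper are hypothetical names, and `time.monotonic_ns` stands in for whatever clock the real suite uses.

```python
import time


class Stopwatch:
    """Minimal stopwatch that accumulates elapsed nanoseconds."""

    def __init__(self):
        self._started_at = None
        self.elapsed_ns = 0

    def start(self):
        self._started_at = time.monotonic_ns()

    def stop(self):
        duration = time.monotonic_ns() - self._started_at
        self._started_at = None
        self.elapsed_ns += duration
        return duration


def measure_bounds(stopwatch, work):
    """Run work() under the stopwatch and return (lower, measured, upper).

    lower is timed strictly inside the start/stop pair and upper strictly
    outside it, so lower <= measured <= upper holds on a monotonic clock
    no matter how long work() really takes.
    """
    outer_start = time.monotonic_ns()
    stopwatch.start()
    inner_start = time.monotonic_ns()
    work()
    inner_end = time.monotonic_ns()
    measured = stopwatch.stop()
    outer_end = time.monotonic_ns()
    return inner_end - inner_start, measured, outer_end - outer_start
```

A test can then assert `lower <= measured <= upper` without assuming anything about how precisely the sleep call honors its argument.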
      
      jkbradley pwendell
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7457 from mengxr/SPARK-9126 and squashes the following commits:
      
      4b40faa [Xiangrui Meng] simplify tests
      739f5bd [Xiangrui Meng] do not assert on time taken by Thread.sleep()
    • [SPARK-7131] [ML] Copy Decision Tree, Random Forest impl to spark.ml · 322d286b
      Joseph K. Bradley authored
      This PR copies the RandomForest implementation from spark.mllib to spark.ml. Note that this includes the DecisionTree implementation, but not the GradientBoostedTrees one (which will come later).
      
      I essentially copied a minimal amount of code to spark.ml, removed the use of bins (and only used splits), and modified code only as much as necessary to get it to compile. The spark.ml implementation still uses some spark.mllib classes (privately), which can be moved in future PRs.
      
      This refactoring will be helpful in extending the node representation to include more information, such as class probabilities.
      
      Specifically:
      * Copied code from spark.mllib to spark.ml:
        * mllib.tree.DecisionTree, mllib.tree.RandomForest copied to ml.tree.impl.RandomForest (main implementation)
        * NodeIdCache (needed to use splits instead of bins)
        * TreePoint (use splits instead of bins)
      * Added ml.tree.LearningNode used in RandomForest training (needed vars)
      * Removed bins from implementation, and only used splits
      * Small fix in JavaDecisionTreeRegressorSuite
      
      CC: mengxr  manishamde  codedeft chouqin
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7294 from jkbradley/dt-move-impl and squashes the following commits:
      
      48749be [Joseph K. Bradley] cleanups based on code review, mostly style
      bea9703 [Joseph K. Bradley] scala style fixes.  added some scala doc
      4e6d2a4 [Joseph K. Bradley] removed unnecessary use of copyValues, setParent for trees
      9a4d721 [Joseph K. Bradley] cleanups. removed InfoGainStats from ml, using old one for now.
      836e7d4 [Joseph K. Bradley] Fixed test suite failures
      bd5e063 [Joseph K. Bradley] fixed bucketizing issue
      0df3759 [Joseph K. Bradley] Need to remove use of Bucketizer
      d5224a9 [Joseph K. Bradley] modified tree and forest to use moved impl
      cc01823 [Joseph K. Bradley] still editing RF to get it to work
      19143fb [Joseph K. Bradley] More progress, but not done yet.  Rebased with master after 1.4 release.
  7. Jul 15, 2015
    • [SPARK-9018] [MLLIB] add stopwatches · 73d92b00
      Xiangrui Meng authored
      Add stopwatches for easy instrumentation of MLlib algorithms. This is based on the `TimeTracker` used in decision trees. The distributed version uses a Spark accumulator. jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7415 from mengxr/SPARK-9018 and squashes the following commits:
      
      40b4347 [Xiangrui Meng] == -> ===
      c477745 [Xiangrui Meng] address Joseph's comments
      f981a49 [Xiangrui Meng] add stopwatches
      73d92b00
    • Eric Liang's avatar
      [SPARK-8774] [ML] Add R model formula with basic support as a transformer · 6960a793
      Eric Liang authored
      This implements minimal R formula support as a feature transformer. Both numeric and string labels are supported, but features must be numeric for now.
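
      To make the idea concrete, the sketch below parses a formula of the form `label ~ f1 + f2` and applies it to a single record. It is a plain-Python illustration, not the transformer added here: the function names are hypothetical, only `+` terms are handled, and the label is passed through unchanged.

```python
def parse_formula(formula):
    """Split 'label ~ a + b' into ('label', ['a', 'b']).

    Hypothetical helper supporting only '+'-separated terms.
    """
    label, sep, rhs = formula.partition("~")
    terms = [term.strip() for term in rhs.split("+")]
    if not sep or not label.strip() or any(not term for term in terms):
        raise ValueError("expected a formula like 'label ~ a + b'")
    return label.strip(), terms


def apply_formula(formula, row):
    """Return (features, label) for one record given as a dict.

    Features are converted with float(), so non-numeric feature values
    raise an error; the label may be numeric or a string.
    """
    label, terms = parse_formula(formula)
    features = [float(row[term]) for term in terms]
    return features, row[label]


features, label = apply_formula(
    "clicked ~ age + income",
    {"clicked": "yes", "age": 31, "income": 52000.0},
)
print(features, label)
```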
      
      cc mengxr
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #7381 from ericl/spark-8774-1 and squashes the following commits:
      
      d1959d2 [Eric Liang] clarify comment
      2db68aa [Eric Liang] second round of comments
      dc3c943 [Eric Liang] address comments
      5765ec6 [Eric Liang] fix style checks
      1f361b0 [Eric Liang] doc
      fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer
      6960a793
    • Feynman Liang's avatar
      [SPARK-9005] [MLLIB] Fix RegressionMetrics computation of explainedVariance · 536533ca
      Feynman Liang authored
      Fixes implementation of `explainedVariance` and `r2` to be consistent with their definitions as described in [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005).
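
      For reference, one common set of definitions is sketched below: `r2` as one minus the ratio of the residual sum of squares to the total sum of squares, and explained variance as the mean squared deviation of the predictions from the mean label. That these match the formulas adopted in the JIRA exactly is an assumption; the function name is hypothetical.

```python
def regression_metrics(labels, predictions):
    """Compute explained variance and r2 for paired labels/predictions.

    Assumed definitions (check against the JIRA before relying on them):
      explained_variance = sum((pred - mean(label))^2) / n
      r2                 = 1 - sum((label - pred)^2) / sum((label - mean(label))^2)
    """
    n = len(labels)
    if n == 0 or n != len(predictions):
        raise ValueError("labels and predictions must be non-empty and equal length")
    mean_label = sum(labels) / n
    ss_err = sum((y - p) ** 2 for y, p in zip(labels, predictions))
    ss_tot = sum((y - mean_label) ** 2 for y in labels)
    ss_reg = sum((p - mean_label) ** 2 for p in predictions)
    return {"explained_variance": ss_reg / n, "r2": 1.0 - ss_err / ss_tot}


perfect = regression_metrics([1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0])
constant = regression_metrics([1.0, 2.0, 3.0, 4.0], [2.5, 2.5, 2.5, 2.5])
print(perfect, constant)
```

      Neither quantity in this sketch assumes the predictor is unbiased, which is the relaxation mentioned in the test changes below.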
      
      Author: Feynman Liang <fliang@databricks.com>
      
      Closes #7361 from feynmanliang/SPARK-9005-RegressionMetrics-bugs and squashes the following commits:
      
      f1112fc [Feynman Liang] Add explainedVariance formula
      1a3d098 [Feynman Liang] SROwen code review comments
      08a0e1b [Feynman Liang] Fix pyspark tests
      db8605a [Feynman Liang] Style fix
      bde9761 [Feynman Liang] Fix RegressionMetrics tests, relax assumption predictor is unbiased
      c235de0 [Feynman Liang] Fix RegressionMetrics tests
      4c4e56f [Feynman Liang] Fix RegressionMetrics computation of explainedVariance and r2
      536533ca
    • Feynman Liang's avatar
      [SPARK-8997] [MLLIB] Performance improvements in LocalPrefixSpan · 1bb8accb
      Feynman Liang authored
      Improves the performance of LocalPrefixSpan by implementing optimizations proposed in [SPARK-8997](https://issues.apache.org/jira/browse/SPARK-8997)
      
      Author: Feynman Liang <fliang@databricks.com>
      Author: Feynman Liang <feynman.liang@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #7360 from feynmanliang/SPARK-8997-improve-prefixspan and squashes the following commits:
      
      59db2f5 [Feynman Liang] Merge pull request #1 from mengxr/SPARK-8997
      91e4357 [Xiangrui Meng] update LocalPrefixSpan impl
      9212256 [Feynman Liang] MengXR code review comments
      f055d82 [Feynman Liang] Fix failing scalatest
      2e00cba [Feynman Liang] Depth first projections
      70b93e3 [Feynman Liang] Performance improvements in LocalPrefixSpan, fix tests
      1bb8accb
    • FlytxtRnD's avatar
      [SPARK-8018] [MLLIB] KMeans should accept initial cluster centers as param · 3f6296fe
      FlytxtRnD authored
      This allows KMeans to be initialized using an existing set of cluster centers provided as a KMeansModel object. This mode of initialization performs a single run.
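
      The effect of starting from supplied centers can be sketched with plain Lloyd iterations, as below. This is a local, pure-Python illustration and not the MLlib code: the function names are hypothetical, and a single run is performed because the starting point is fixed.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_from_initial_centers(points, initial_centers, max_iterations=20):
    """Run Lloyd's algorithm once, starting from the given centers.

    Hypothetical helper. Empty clusters keep their previous center.
    """
    centers = [tuple(c) for c in initial_centers]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(
                range(len(centers)),
                key=lambda i: squared_distance(p, centers[i]),
            )
            clusters[nearest].append(p)
        # Update step: move each center to the mean of its members.
        new_centers = [
            tuple(sum(col) / len(members) for col in zip(*members))
            if members else centers[i]
            for i, members in enumerate(clusters)
        ]
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers


points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
centers = kmeans_from_initial_centers(points, [(1.0, 1.0), (9.0, 1.0)])
print(centers)
```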
      
      Author: FlytxtRnD <meethu.mathew@flytxt.com>
      
      Closes #6737 from FlytxtRnD/Kmeans-8018 and squashes the following commits:
      
      94b56df [FlytxtRnD] style correction
      ef95ee2 [FlytxtRnD] style correction
      c446c58 [FlytxtRnD] documentation and numRuns warning change
      06d13ef [FlytxtRnD] numRuns corrected
      d12336e [FlytxtRnD] numRuns variable modifications
      07f8554 [FlytxtRnD] remove setRuns from setIntialModel
      e721dfe [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
      242ead1 [FlytxtRnD] corrected == to === in assert
      714acb5 [FlytxtRnD] added numRuns
      60c8ce2 [FlytxtRnD] ignore runs parameter and initialModel test suite changed
      582e6d9 [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
      3f5fc8e [FlytxtRnD] test case modified and one runs condition added
      cd5dc5c [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018
      16f1b53 [FlytxtRnD] Merge branch 'Kmeans-8018', remote-tracking branch 'upstream/master' into Kmeans-8018
      e9c35d7 [FlytxtRnD] Remove getInitialModel and match cluster count criteria
      6959861 [FlytxtRnD] Accept initial cluster centers in KMeans
      3f6296fe
    • Yu ISHIKAWA's avatar
      [SPARK-6259] [MLLIB] Python API for LDA · 46927696
      Yu ISHIKAWA authored
      I implemented the Python API for LDA. However, I did not implement a method for `LDAModel.describeTopics()`, because it is a little hard to implement right now. Documentation and example code for it would also fit better in a separate issue.
      
      TODO: `LDAModel.describeTopics()` must also be implemented in Python, but that would fit better in a separate issue. Implementing it is a little hard, since the return value of `describeTopics` in Scala consists of Tuple classes.
      
      Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
      
      Closes #6791 from yu-iskw/SPARK-6259 and squashes the following commits:
      
      6855f59 [Yu ISHIKAWA] LDA inherits object
      28bd165 [Yu ISHIKAWA] Change the place of testing code
      d7a332a [Yu ISHIKAWA] Remove the doc comment about the optimizer's default value
      083e226 [Yu ISHIKAWA] Add the comment about the supported values and the default value of `optimizer`
      9f8bed8 [Yu ISHIKAWA] Simplify casting
      faa9764 [Yu ISHIKAWA] Add some comments for the LDA parameters
      98f645a [Yu ISHIKAWA] Remove the interface for `describeTopics`. Because it is not implemented.
      57ac03d [Yu ISHIKAWA] Remove the unnecessary import in Python unit testing
      73412c3 [Yu ISHIKAWA] Fix the typo
      2278829 [Yu ISHIKAWA] Fix the indentation
      39514ec [Yu ISHIKAWA] Modify how to cast the input data
      8117e18 [Yu ISHIKAWA] Fix the validation problems by `lint-scala`
      77fd1b7 [Yu ISHIKAWA] Not use LabeledPoint
      68f0653 [Yu ISHIKAWA] Support some parameters for `ALS.train()` in Python
      25ef2ac [Yu ISHIKAWA] Resolve conflicts with rebasing
      46927696
  8. Jul 14, 2015
  9. Jul 13, 2015
  10. Jul 10, 2015
    • Joseph K. Bradley's avatar
      [SPARK-8994] [ML] tiny cleanups to Params, Pipeline · 0c5207c6
      Joseph K. Bradley authored
      Made default impl of Params.validateParams empty
      CC mengxr
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #7349 from jkbradley/pipeline-small-cleanups and squashes the following commits:
      
      4e0f013 [Joseph K. Bradley] small cleanups after SPARK-5956
      0c5207c6
    • zhangjiajin's avatar
      [SPARK-6487] [MLLIB] Add sequential pattern mining algorithm PrefixSpan to Spark MLlib · 7f6be1f2
      zhangjiajin authored
      Add parallel PrefixSpan algorithm and test file.
      Support non-temporal sequences.
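
      For readers unfamiliar with the algorithm, the sketch below shows the core prefix-projection recursion on a small in-memory dataset. It is a simplified, single-machine illustration in Python: each sequence is a flat list of items, the names are hypothetical, and none of the parallelization in this patch is modeled.

```python
def prefix_span(sequences, min_support, max_length=10):
    """Return (pattern, support) pairs for frequent sequential patterns.

    Simplified sketch: sequences are flat lists of hashable items.
    """
    results = []

    def mine(prefix, projected):
        # Count, for each item, how many projected suffixes contain it.
        counts = {}
        for suffix in projected:
            for item in set(suffix):
                counts[item] = counts.get(item, 0) + 1
        for item in sorted(counts):
            support = counts[item]
            if support < min_support:
                continue
            pattern = prefix + [item]
            results.append((pattern, support))
            if len(pattern) < max_length:
                # Project: keep what follows the first occurrence of item.
                next_projected = [
                    s[s.index(item) + 1:] for s in projected if item in s
                ]
                mine(pattern, next_projected)

    mine([], [list(s) for s in sequences])
    return results


patterns = prefix_span([[1, 2, 3], [1, 3], [2, 3]], min_support=2)
print(patterns)
```

      Each recursive call only scans the suffixes that follow the current prefix, so the search space shrinks as patterns grow.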
      
      Author: zhangjiajin <zhangjiajin@huawei.com>
      Author: zhang jiajin <zhangjiajin@huawei.com>
      
      Closes #7258 from zhangjiajin/master and squashes the following commits:
      
      ca9c4c8 [zhangjiajin] Modified the code according to the review comments.
      574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization.
      ba5df34 [zhangjiajin] Fix a Scala style error.
      4c60fb3 [zhangjiajin] Fix some Scala style errors.
      1dd33ad [zhangjiajin] Modified the code according to the review comments.
      89bc368 [zhangjiajin] Fixed a Scala style error.
      a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala
      951fd42 [zhang jiajin] Delete Prefixspan.scala
      575995f [zhangjiajin] Modified the code according to the review comments.
      91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
      7f6be1f2
    • jose.cambronero's avatar
      [SPARK-8598] [MLLIB] Implementation of 1-sample, two-sided, Kolmogorov Smirnov Test for RDDs · 9c507577
      jose.cambronero authored
      This contribution is my original work and I license it to the project under its open source license.
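
      As background on the statistic itself, the sketch below computes the one-sample, two-sided Kolmogorov-Smirnov statistic for a local list of values against a given CDF, defaulting to the standard normal. It is a plain-Python illustration with hypothetical names; it does not compute a p-value and does not reflect the distributed, per-partition approach used for RDDs.

```python
import math


def standard_normal_cdf(x):
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def ks_statistic(sample, cdf=standard_normal_cdf):
    """Largest gap between the empirical CDF of `sample` and `cdf`.

    At each sorted point the empirical CDF jumps from i/n to (i+1)/n,
    so both sides of the jump are compared with the theoretical value.
    """
    data = sorted(sample)
    n = len(data)
    if n == 0:
        raise ValueError("sample must not be empty")
    d = 0.0
    for i, x in enumerate(data):
        f = cdf(x)
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


print(ks_statistic([-1.2, -0.3, 0.1, 0.8, 1.9]))
```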
      
      Author: jose.cambronero <jose.cambronero@cloudera.com>
      
      Closes #6994 from josepablocam/master and squashes the following commits:
      
      bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name
      0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md
      1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity
      a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf)
      1bb44bd [jose.cambronero]  style and doc changes. Factored out ks test into 2 separate tests
      2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict
      a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly
      7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info
      e760ebd [jose.cambronero] line length changes to fit style check
      3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
      9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty
      1226b30 [jose.cambronero] reindent multi-line lambdas, prior interpretation of style guide was wrong on my part
      9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs
      3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity
      992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach.
      6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal)
      4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below
      0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm
      16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP
      c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly
      f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request
      b9cff3a [jose.cambronero] made small changes to pass style check
      ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite
      4da189b [jose.cambronero] added user facing ks test functions
      c659ea1 [jose.cambronero] created KS test class
      13dfe4d [jose.cambronero] created test result class for ks test
      9c507577
    • rahulpalamuttam's avatar
      [SPARK-8923] [DOCUMENTATION, MLLIB] Add @since tags to mllib.fpm · 0772026c
      rahulpalamuttam authored
      Author: rahulpalamuttam <rahulpalamut@gmail.com>
      
      Closes #7341 from rahulpalamuttam/TaggingMLlibfpm and squashes the following commits:
      
      bef2843 [rahulpalamuttam] fix @since tags in mllib.fpm
      cd86252 [rahulpalamuttam] Add @since tags to mllib.fpm
      0772026c