- Jul 24, 2015
-
-
MechCoder authored
This makes it easier to test all the class variables of the DistributedLDAmodel. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7573 from MechCoder/lda_test and squashes the following commits: 2f1a293 [MechCoder] [SPARK-9222] [MLlib] Make class instantiation variables in DistributedLDAModel private[clustering]
-
Reynold Xin authored
I also changed InternalRow's size/length function to numFields, to make it more obvious that it is not about bytes, but the number of fields. Author: Reynold Xin <rxin@databricks.com> Closes #7626 from rxin/internalRow and squashes the following commits: e124daf [Reynold Xin] Fixed test case. 805ceb7 [Reynold Xin] Commented out the failed test suite. f8a9ca5 [Reynold Xin] Fixed more bugs. Still at least one more remaining. 76d9081 [Reynold Xin] Fixed data sources. 7807f70 [Reynold Xin] Fixed DataFrameSuite. cb60cd2 [Reynold Xin] Code review & small bug fixes. 0a2948b [Reynold Xin] Fixed style. 3280d03 [Reynold Xin] [SPARK-9285][SQL] Remove InternalRow's inheritance from Row.
-
Ram Sriharsha authored
The base classifier input and output columns are ignored in favor of the ones specified in OneVsRest. Author: Ram Sriharsha <rsriharsha@hw11853.local> Closes #6631 from harsha2010/SPARK-8092 and squashes the following commits: 6591dc6 [Ram Sriharsha] add documentation for params b7024b1 [Ram Sriharsha] cleanup f0e2bfb [Ram Sriharsha] merge with master 108d3d7 [Ram Sriharsha] merge with master 4f74126 [Ram Sriharsha] Allow label/ features columns to be configurable
-
- Jul 23, 2015
-
-
Davies Liu authored
Romove Decimal.Unlimited (change to support precision up to 38, to match with Hive and other databases). In order to keep backward source compatibility, Decimal.Unlimited is still there, but change to Decimal(38, 18). If no precision and scale is provide, it's Decimal(10, 0) as before. Author: Davies Liu <davies@databricks.com> Closes #7605 from davies/decimal_unlimited and squashes the following commits: aa3f115 [Davies Liu] fix tests and style fb0d20d [Davies Liu] address comments bfaae35 [Davies Liu] fix style df93657 [Davies Liu] address comments and clean up 06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited 4c28969 [Davies Liu] fix tests 8d783cc [Davies Liu] fix tests 788631c [Davies Liu] fix double with decimal in Union/except 1779bde [Davies Liu] fix scala style c9c7c78 [Davies Liu] remove Decimal.Unlimited
-
Liang-Chi Hsieh authored
JIRA: https://issues.apache.org/jira/browse/SPARK-7254 Author: Liang-Chi Hsieh <viirya@appier.com> Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #6054 from viirya/pic_on_graph and squashes the following commits: 8b87b81 [Liang-Chi Hsieh] Fix scala style. a22fb8b [Liang-Chi Hsieh] For comment. ef565a0 [Liang-Chi Hsieh] Fix indentation. d249aa1 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into pic_on_graph 82d7351 [Liang-Chi Hsieh] Run PowerIterationClustering directly on graph.
-
Joseph K. Bradley authored
Removed varargs annotation from Params.setDefault taking multiple params. Though varargs is technically correct, it often requires that developers do clean assembly, rather than (not clean) assembly, which is a nuisance during development. CC: mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #7604 from jkbradley/params-setdefault-varargs and squashes the following commits: 6016dc6 [Joseph K. Bradley] removed varargs annotation from Params.setDefault taking multiple params
-
- Jul 22, 2015
-
-
Josh Rosen authored
Spark has an option called spark.localExecution.enabled; according to the docs: > Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver. This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5. This pull request simply brings #7484 up to date. Author: Josh Rosen <joshrosen@databricks.com> Author: Reynold Xin <rxin@databricks.com> Closes #7585 from rxin/remove-local-exec and squashes the following commits: 84bd10e [Reynold Xin] Python fix. 1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it. b0835dc [Josh Rosen] Remove local execution code in DAGScheduler 8975d96 [Josh Rosen] Remove local execution tests. ffa8c9b [Josh Rosen] Remove documentation for configuration
-
Reynold Xin authored
I've seen a few cases in the past few weeks that the compiler is throwing warnings that are caused by legitimate bugs. This patch upgrades warnings to errors, except deprecation warnings. Note that ideally we should be able to mark deprecation warnings as errors as well. However, due to the lack of ability to suppress individual warning messages in the Scala compiler, we cannot do that (since we do need to access deprecated APIs in Hadoop). Most of the work are done by ericl. Author: Reynold Xin <rxin@databricks.com> Author: Eric Liang <ekl@databricks.com> Closes #7598 from rxin/warnings and squashes the following commits: beb311b [Reynold Xin] Fixed tests. 542c031 [Reynold Xin] Fixed one more warning. 87c354a [Reynold Xin] Fixed all non-deprecation warnings. 78660ac [Eric Liang] first effort to fix warnings
-
martinzapletal authored
- [X] Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model. It should be similar to CrossValidator, but simpler and less expensive. - [X] Simplified replacement of https://github.com/apache/spark/pull/6996 Author: martinzapletal <zapletal-martin@email.cz> Closes #7337 from zapletal-martin/SPARK-8484-TrainValidationSplit and squashes the following commits: cafc949 [martinzapletal] Review comments https://github.com/apache/spark/pull/7337. 511b398 [martinzapletal] Merge remote-tracking branch 'upstream/master' into SPARK-8484-TrainValidationSplit f4fc9c4 [martinzapletal] SPARK-8484 Resolved feedback to https://github.com/apache/spark/pull/7337 00c4f5a [martinzapletal] SPARK-8484. Styling. d699506 [martinzapletal] SPARK-8484. Styling. 93ed2ee [martinzapletal] Styling. 3bc1853 [martinzapletal] SPARK-8484. Styling. 2aa6f43 [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model. 21662eb [martinzapletal] SPARK-8484. Added TrainValidationSplit for hyper-parameter tuning. It randomly splits the input dataset into train and validation and use evaluation metric on the validation set to select the best model.
-
Matei Zaharia authored
There are a few memory limits that people hit often and that we could make higher, especially now that memory sizes have grown. - spark.akka.frameSize: This defaults at 10 but is often hit for map output statuses in large shuffles. This memory is not fully allocated up-front, so we can just make this larger and still not affect jobs that never sent a status that large. We increase it to 128. - spark.executor.memory: Defaults at 512m, which is really small. We increase it to 1g. Author: Matei Zaharia <matei@databricks.com> Closes #7586 from mateiz/configs and squashes the following commits: ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
-
Feynman Liang authored
Modify `LDA` to take asymmetric document-topic prior distributions and `OnlineLDAOptimizer` to use the asymmetric prior during variational inference. This PR only generalizes `OnlineLDAOptimizer` and the associated `LocalLDAModel`; `EMLDAOptimizer` and `DistributedLDAModel` still only support symmetric `alpha` (checked during `EMLDAOptimizer.initialize`). Author: Feynman Liang <fliang@databricks.com> Closes #7575 from feynmanliang/SPARK-8536-LDA-asymmetric-priors and squashes the following commits: af8fbb7 [Feynman Liang] Fix merge errors ef5821d [Feynman Liang] Merge remote-tracking branch 'apache/master' into SPARK-8536-LDA-asymmetric-priors 58f1d7b [Feynman Liang] Fix from review feedback a6dcf70 [Feynman Liang] Change docConcentration interface and move LDAOptimizer validation to initialize, add sad path tests 72038ff [Feynman Liang] Add tests referenced against gensim d4284fa [Feynman Liang] Generalize OnlineLDA to asymmetric priors, no tests
-
Feynman Liang authored
In-place updates, reduce number of transposes, and vectorize operations in OnlineLDA implementation. Author: Feynman Liang <fliang@databricks.com> Closes #7454 from feynmanliang/OnlineLDA-perf-improvements and squashes the following commits: 78b0f5a [Feynman Liang] Make in-place variables vals, fix BLAS error 7f62a55 [Feynman Liang] --amend c62cb1e [Feynman Liang] Outer product for stats, revert Range slicing aead650 [Feynman Liang] Range slice, in-place update, reduce transposes
-
- Jul 21, 2015
-
-
MechCoder authored
Add support for saving and loading LDA both the local and distributed versions. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #6948 from MechCoder/lda_save_load and squashes the following commits: 49bcdce [MechCoder] minor style fixes cc14054 [MechCoder] minor 4587d1d [MechCoder] Minor changes c753122 [MechCoder] Load and save the model in private methods 2782326 [MechCoder] [SPARK-5989] Model save/load for LDA
-
petz2000 authored
Created since tags for methods in mllib.classification Author: petz2000 <petz2000@gmail.com> Closes #7371 from petz2000/add_since_mllib.classification and squashes the following commits: 39fe291 [petz2000] Removed whitespace in block comment c9b1e03 [petz2000] Removed @since tags again from protected and private methods cd759b6 [petz2000] Added @since tags to methods
-
Holden Karau authored
Author: Holden Karau <holden@pigscanfly.ca> Closes #7553 from holdenk/SPARK-9204-add-default-params-test-to-linear-regression and squashes the following commits: 630ba19 [Holden Karau] style fix faa08a3 [Holden Karau] Add default params test for linearyregression suite
-
- Jul 20, 2015
-
-
Eric Liang authored
This exposes the SparkR:::glm() and SparkR:::predict() APIs. It was necessary to change RFormula to silently drop the label column if it was missing from the input dataset, which is kind of a hack but necessary to integrate with the Pipeline API. The umbrella design doc for MLlib + SparkR integration can be viewed here: https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/edit mengxr Author: Eric Liang <ekl@databricks.com> Closes #7483 from ericl/spark-8774 and squashes the following commits: 3dfac0c [Eric Liang] update 17ef516 [Eric Liang] more comments 1753a0f [Eric Liang] make glm generic b0f50f8 [Eric Liang] equivalence test 550d56d [Eric Liang] export methods c015697 [Eric Liang] second pass 117949a [Eric Liang] comments 5afbc67 [Eric Liang] test label columns 6b7f15f [Eric Liang] Fri Jul 17 14:20:22 PDT 2015 3a63ae5 [Eric Liang] Fri Jul 17 13:41:52 PDT 2015 ce61367 [Eric Liang] Fri Jul 17 13:41:17 PDT 2015 0299c59 [Eric Liang] Fri Jul 17 13:40:32 PDT 2015 e37603f [Eric Liang] Fri Jul 17 12:15:03 PDT 2015 d417d0c [Eric Liang] Merge remote-tracking branch 'upstream/master' into spark-8774 29a2ce7 [Eric Liang] Merge branch 'spark-8774-1' into spark-8774 d1959d2 [Eric Liang] clarify comment 2db68aa [Eric Liang] second round of comments dc3c943 [Eric Liang] address comments 5765ec6 [Eric Liang] fix style checks 1f361b0 [Eric Liang] doc d33211b [Eric Liang] r support fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer
-
Meihua Wu authored
Fix BLAS.gemm to update matrix C when alpha==0 and beta!=1 Also include unit tests to verify the fix. mengxr brkyvz Author: Meihua Wu <meihuawu@umich.edu> Closes #7503 from rotationsymmetry/fix_BLAS_gemm and squashes the following commits: fce199c [Meihua Wu] Fix BLAS.gemm to update C when alpha==0 and beta!=1
-
MechCoder authored
Python API for the KS-test Statistics.kolmogorovSmirnovTest(data, distName, *params) I'm not quite sure how to support the callable function since it is not serializable. Author: MechCoder <manojkumarsivaraj334@gmail.com> Closes #7430 from MechCoder/spark-8996 and squashes the following commits: 2dd009d [MechCoder] minor 021d233 [MechCoder] Remove one wrapper and other minor stuff 49d07ab [MechCoder] [SPARK-8996] [MLlib] Python API for Kolmogorov-Smirnov Test
-
George Dittmar authored
Modifying Vector, DenseVector, and SparseVector to implement argmax functionality. This work is to set the stage for changes to be done in Spark-7423. Author: George Dittmar <georgedittmar@gmail.com> Author: George <dittmar@Georges-MacBook-Pro.local> Author: dittmarg <george.dittmar@webtrends.com> Author: Xiangrui Meng <meng@databricks.com> Closes #6112 from GeorgeDittmar/SPARK-7422 and squashes the following commits: 3e0a939 [George Dittmar] Merge pull request #1 from mengxr/SPARK-7422 127dec5 [Xiangrui Meng] update argmax impl 2ea6a55 [George Dittmar] Added MimaExcludes for Vectors.argmax 98058f4 [George Dittmar] Merge branch 'master' of github.com:apache/spark into SPARK-7422 5fd9380 [George Dittmar] fixing style check error 42341fb [George Dittmar] refactoring arg max check to better handle zero values b22af46 [George Dittmar] Fixing spaces between commas in unit test f2eba2f [George Dittmar] Cleaning up unit tests to be fewer lines aa330e3 [George Dittmar] Fixing some last if else spacing issues ac53c55 [George Dittmar] changing dense vector argmax unit test to be one line call vs 2 d5b5423 [George Dittmar] Fixing code style and updating if logic on when to check for zero values ee1a85a [George Dittmar] Cleaning up unit tests a bit and modifying a few cases 3ee8711 [George Dittmar] Fixing corner case issue with zeros in the active values of the sparse vector. Updated unit tests b1f059f [George Dittmar] Added comment before we start arg max calculation. Updated unit tests to cover corner cases f21dcce [George Dittmar] commit af17981 [dittmarg] Initial work fixing bug that was made clear in pr eeda560 [George] Fixing SparseVector argmax function to ignore zero values while doing the calculation. 4526acc [George] Merge branch 'master' of github.com:apache/spark into SPARK-7422 df9538a [George] Added argmax to sparse vector and added unit test 3cffed4 [George] Adding unit tests for argmax functions for Dense and Sparse vectors 04677af [George] initial work on adding argmax to Vector and SparseVector
-
- Jul 17, 2015
-
-
Rekha Joshi authored
Implement IntArrayParam in mllib Author: Rekha Joshi <rekhajoshm@gmail.com> Author: Joshi <rekhajoshm@gmail.com> Closes #7481 from rekhajoshm/SPARK-9118 and squashes the following commits: d3b1766 [Joshi] Implement IntArrayParam 0be142d [Rekha Joshi] Merge pull request #3 from apache/master 106fd8e [Rekha Joshi] Merge pull request #2 from apache/master e3677c9 [Rekha Joshi] Merge pull request #1 from apache/master
-
Yu ISHIKAWA authored
I Implemented the KMeans API for spark.ml Pipelines. But it doesn't include clustering abstractions for spark.ml (SPARK-7610). It would fit for another issues. And I'll try it later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks. [SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879 Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits: be752de [Yu ISHIKAWA] Add assertions a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst 4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python fb2417c [Yu ISHIKAWA] Use getInt, instead of get f397be4 [Yu ISHIKAWA] Switch the comparisons. ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter. effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test 19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests 1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst f8338bc [Yu ISHIKAWA] Add the placeholders in Python 4a03003 [Yu ISHIKAWA] Test for contains in Python 6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply` 288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names 5a7d574 [Yu ISHIKAWA] Renamce `validateInitializationMode` to `validateInitMode` and remove throwing exception 97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy` e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class 978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans 2ec80bc [Yu ISHIKAWA] Fit on 1 line e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python 3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation 4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon 2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam` 19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF 4d2ad1e [Yu ISHIKAWA] Modify the indentations 0ae422f [Yu ISHIKAWA] Add a test for `setParams` 4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala 11ffdf1 [Yu ISHIKAWA] Use `===` and the variable 220a176 [Yu ISHIKAWA] Set a random seed in the unit testing 92c3efc [Yu ISHIKAWA] Make the points for a test be fewer c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python 6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods 687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations 5bedc51 [Yu ISHIKAWA] Remve an extra new line 444c289 [Yu ISHIKAWA] Add the validation for `runs` e41989c [Yu ISHIKAWA] Modify how to validate `initStep` 7ea133a [Yu ISHIKAWA] Change how to validate `initMode` 7991e15 [Yu ISHIKAWA] Add a validation for `k` c2df35d [Yu ISHIKAWA] Make `predict` private 93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform` d3a79f7 [Yu ISHIKAWA] Remove the inhefited docs e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private 8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans 6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps` 99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode` 79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs 6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault` 20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault` 11c2a12 [Yu ISHIKAWA] Limit the imports badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel} f80319a [Yu ISHIKAWA] Rebase mater branch and add copy methods 85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol` aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline 598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python 63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala
-
Bryan Cutler authored
Broadcast of ensemble models in transformImpl before call to predict Author: Bryan Cutler <bjcutler@us.ibm.com> Closes #6300 from BryanCutler/bcast-ensemble-models-7127 and squashes the following commits: 86e73de [Bryan Cutler] [SPARK-7127] Replaced deprecated callUDF with udf 40a139d [Bryan Cutler] Merge branch 'master' into bcast-ensemble-models-7127 9afad56 [Bryan Cutler] [SPARK-7127] Simplified calls by overriding transformImpl and using broadcasted model in callUDF to make prediction 1f34be4 [Bryan Cutler] [SPARK-7127] Removed accidental newline 171a6ce [Bryan Cutler] [SPARK-7127] Used modelAccessor parameter in predictImpl to access broadcasted model 6fd153c [Bryan Cutler] [SPARK-7127] Applied broadcasting to remaining ensemble models aaad77b [Bryan Cutler] [SPARK-7127] Removed abstract class for broadcasting model, instead passing a prediction function as param to transform 83904bb [Bryan Cutler] [SPARK-7127] Adding broadcast of model before prediction in RandomForestClassifier
-
Feynman Liang authored
[SPARK-9090] [ML] Fix definition of residual in LinearRegressionSummary, EnsembleTestHelper, and SquaredError Make the definition of residuals in Spark consistent with literature. We have been using `prediction - label` for residuals, but literature usually defines `residual = label - prediction`. Author: Feynman Liang <fliang@databricks.com> Closes #7435 from feynmanliang/SPARK-9090-Fix-LinearRegressionSummary-Residuals and squashes the following commits: f4b39d8 [Feynman Liang] Fix doc bc12a92 [Feynman Liang] Tweak EnsembleTestHelper and SquaredError residuals 63f0d60 [Feynman Liang] Fix definition of residual
-
Yanbo Liang authored
Naive Bayes API for spark.ml Pipelines Author: Yanbo Liang <ybliang8@gmail.com> Closes #7284 from yanboliang/spark-8600 and squashes the following commits: bc890f7 [Yanbo Liang] remove labels valid check c3de687 [Yanbo Liang] remove labels from ml.NaiveBayesModel a2b3088 [Yanbo Liang] address comments 3220b82 [Yanbo Liang] trigger jenkins 3018a41 [Yanbo Liang] address comments 208e166 [Yanbo Liang] Naive Bayes API for spark.ml Pipelines
-
Yuhao Yang authored
jira: https://issues.apache.org/jira/browse/SPARK-9062 Currently output type of Tokenizer is Array(String, false), which is not compatible with Word2Vec and Other transformers since their input type is Array(String, true). Seq[String] in udf will be treated as Array(String, true) by default. I'm not sure what's the recommended way for Tokenizer to handle the null value in the input. Any suggestion will be welcome. Author: Yuhao Yang <hhbyyh@gmail.com> Closes #7414 from hhbyyh/tokenizer and squashes the following commits: c01bd7a [Yuhao Yang] change output type of tokenizer
-
Yanbo Liang authored
fix wrong annotation of RFormula.formula Author: Yanbo Liang <ybliang8@gmail.com> Closes #7470 from yanboliang/RFormula and squashes the following commits: 61f1919 [Yanbo Liang] fix wrong annotation
-
Xiangrui Meng authored
Measure lower and upper bounds for task time and use them for validation. This PR also implements `Stopwatch.toString`. This suite should finish in less than 1 second. jkbradley pwendell Author: Xiangrui Meng <meng@databricks.com> Closes #7457 from mengxr/SPARK-9126 and squashes the following commits: 4b40faa [Xiangrui Meng] simplify tests 739f5bd [Xiangrui Meng] do not assert on time taken by Thread.sleep()
-
Joseph K. Bradley authored
This PR copies the RandomForest implementation from spark.mllib to spark.ml. Note that this includes the DecisionTree implementation, but not the GradientBoostedTrees one (which will come later). I essentially copied a minimal amount of code to spark.ml, removed the use of bins (and only used splits), and modified code only as much as necessary to get it to compile. The spark.ml implementation still uses some spark.mllib classes (privately), which can be moved in future PRs. This refactoring will be helpful in extending the node representation to include more information, such as class probabilities. Specifically: * Copied code from spark.mllib to spark.ml: * mllib.tree.DecisionTree, mllib.tree.RandomForest copied to ml.tree.impl.RandomForest (main implementation) * NodeIdCache (needed to use splits instead of bins) * TreePoint (use splits instead of bins) * Added ml.tree.LearningNode used in RandomForest training (needed vars) * Removed bins from implementation, and only used splits * Small fix in JavaDecisionTreeRegressorSuite CC: mengxr manishamde codedeft chouqin Author: Joseph K. Bradley <joseph@databricks.com> Closes #7294 from jkbradley/dt-move-impl and squashes the following commits: 48749be [Joseph K. Bradley] cleanups based on code review, mostly style bea9703 [Joseph K. Bradley] scala style fixes. added some scala doc 4e6d2a4 [Joseph K. Bradley] removed unnecessary use of copyValues, setParent for trees 9a4d721 [Joseph K. Bradley] cleanups. removed InfoGainStats from ml, using old one for now. 836e7d4 [Joseph K. Bradley] Fixed test suite failures bd5e063 [Joseph K. Bradley] fixed bucketizing issue 0df3759 [Joseph K. Bradley] Need to remove use of Bucketizer d5224a9 [Joseph K. Bradley] modified tree and forest to use moved impl cc01823 [Joseph K. Bradley] still editing RF to get it to work 19143fb [Joseph K. Bradley] More progress, but not done yet. Rebased with master after 1.4 release.
-
- Jul 15, 2015
-
-
Xiangrui Meng authored
Add stopwatches for easy instrumentation of MLlib algorithms. This is based on the `TimeTracker` used in decision trees. The distributed version uses Spark accumulator. jkbradley Author: Xiangrui Meng <meng@databricks.com> Closes #7415 from mengxr/SPARK-9018 and squashes the following commits: 40b4347 [Xiangrui Meng] == -> === c477745 [Xiangrui Meng] address Joseph's comments f981a49 [Xiangrui Meng] add stopwatches
-
Eric Liang authored
This implements minimal R formula support as a feature transformer. Both numeric and string labels are supported, but features must be numeric for now. cc mengxr Author: Eric Liang <ekl@databricks.com> Closes #7381 from ericl/spark-8774-1 and squashes the following commits: d1959d2 [Eric Liang] clarify comment 2db68aa [Eric Liang] second round of comments dc3c943 [Eric Liang] address comments 5765ec6 [Eric Liang] fix style checks 1f361b0 [Eric Liang] doc fb0826b [Eric Liang] [SPARK-8774] Add R model formula with basic support as a transformer
-
Feynman Liang authored
Fixes implementation of `explainedVariance` and `r2` to be consistent with their definitions as described in [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005). Author: Feynman Liang <fliang@databricks.com> Closes #7361 from feynmanliang/SPARK-9005-RegressionMetrics-bugs and squashes the following commits: f1112fc [Feynman Liang] Add explainedVariance formula 1a3d098 [Feynman Liang] SROwen code review comments 08a0e1b [Feynman Liang] Fix pyspark tests db8605a [Feynman Liang] Style fix bde9761 [Feynman Liang] Fix RegressionMetrics tests, relax assumption predictor is unbiased c235de0 [Feynman Liang] Fix RegressionMetrics tests 4c4e56f [Feynman Liang] Fix RegressionMetrics computation of explainedVariance and r2
-
Feynman Liang authored
Improves the performance of LocalPrefixSpan by implementing optimizations proposed in [SPARK-8997](https://issues.apache.org/jira/browse/SPARK-8997) Author: Feynman Liang <fliang@databricks.com> Author: Feynman Liang <feynman.liang@gmail.com> Author: Xiangrui Meng <meng@databricks.com> Closes #7360 from feynmanliang/SPARK-8997-improve-prefixspan and squashes the following commits: 59db2f5 [Feynman Liang] Merge pull request #1 from mengxr/SPARK-8997 91e4357 [Xiangrui Meng] update LocalPrefixSpan impl 9212256 [Feynman Liang] MengXR code review comments f055d82 [Feynman Liang] Fix failing scalatest 2e00cba [Feynman Liang] Depth first projections 70b93e3 [Feynman Liang] Performance improvements in LocalPrefixSpan, fix tests
-
FlytxtRnD authored
This allows Kmeans to be initialized using an existing set of cluster centers provided as a KMeansModel object. This mode of initialization performs a single run. Author: FlytxtRnD <meethu.mathew@flytxt.com> Closes #6737 from FlytxtRnD/Kmeans-8018 and squashes the following commits: 94b56df [FlytxtRnD] style correction ef95ee2 [FlytxtRnD] style correction c446c58 [FlytxtRnD] documentation and numRuns warning change 06d13ef [FlytxtRnD] numRuns corrected d12336e [FlytxtRnD] numRuns variable modifications 07f8554 [FlytxtRnD] remove setRuns from setIntialModel e721dfe [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018 242ead1 [FlytxtRnD] corrected == to === in assert 714acb5 [FlytxtRnD] added numRuns 60c8ce2 [FlytxtRnD] ignore runs parameter and initialModel test suite changed 582e6d9 [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018 3f5fc8e [FlytxtRnD] test case modified and one runs condition added cd5dc5c [FlytxtRnD] Merge remote-tracking branch 'upstream/master' into Kmeans-8018 16f1b53 [FlytxtRnD] Merge branch 'Kmeans-8018', remote-tracking branch 'upstream/master' into Kmeans-8018 e9c35d7 [FlytxtRnD] Remove getInitialModel and match cluster count criteria 6959861 [FlytxtRnD] Accept initial cluster centers in KMeans
-
Yu ISHIKAWA authored
I implemented the Python API for LDA. But I didn't implemented a method for `LDAModel.describeTopics()`, beause it's a little hard to implement it now. And adding document about that and an example code would fit for another issue. TODO: LDAModel.describeTopics() in Python must be also implemented. But it would be nice to fit for another issue. Implementing it is a little hard, since the return value of `describeTopics` in Scala consists of Tuple classes. Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com> Closes #6791 from yu-iskw/SPARK-6259 and squashes the following commits: 6855f59 [Yu ISHIKAWA] LDA inherits object 28bd165 [Yu ISHIKAWA] Change the place of testing code d7a332a [Yu ISHIKAWA] Remove the doc comment about the optimizer's default value 083e226 [Yu ISHIKAWA] Add the comment about the supported values and the default value of `optimizer` 9f8bed8 [Yu ISHIKAWA] Simplify casting faa9764 [Yu ISHIKAWA] Add some comments for the LDA paramters 98f645a [Yu ISHIKAWA] Remove the interface for `describeTopics`. Because it is not implemented. 57ac03d [Yu ISHIKAWA] Remove the unnecessary import in Python unit testing 73412c3 [Yu ISHIKAWA] Fix the typo 2278829 [Yu ISHIKAWA] Fix the indentation 39514ec [Yu ISHIKAWA] Modify how to cast the input data 8117e18 [Yu ISHIKAWA] Fix the validation problems by `lint-scala` 77fd1b7 [Yu ISHIKAWA] Not use LabeledPoint 68f0653 [Yu ISHIKAWA] Support some parameters for `ALS.train()` in Python 25ef2ac [Yu ISHIKAWA] Resolve conflicts with rebasing
-
- Jul 14, 2015
-
-
Sean Owen authored
Add predictProbabilities to Naive Bayes, return class probabilities. Continues https://github.com/apache/spark/pull/6761 Author: Sean Owen <sowen@cloudera.com> Closes #7376 from srowen/SPARK-4362 and squashes the following commits: 23d5a76 [Sean Owen] Fix model.labels -> model.theta 95d91fb [Sean Owen] Check that predicted probabilities sum to 1 b32d1c8 [Sean Owen] Add predictProbabilities to Naive Bayes, return class probabilities
-
- Jul 13, 2015
-
-
Vinod K C authored
Removed private[ml] from Generated documentation Author: Vinod K C <vinod.kc@huawei.com> Closes #7367 from vinodkc/fix_sharedparmascodegen and squashes the following commits: 4fa3c8f [Vinod K C] Adding auto generated code 7e19025 [Vinod K C] Removed private[ml]
-
- Jul 10, 2015
-
-
Joseph K. Bradley authored
Made default impl of Params.validateParams empty CC mengxr Author: Joseph K. Bradley <joseph@databricks.com> Closes #7349 from jkbradley/pipeline-small-cleanups and squashes the following commits: 4e0f013 [Joseph K. Bradley] small cleanups after SPARK-5956
-
zhangjiajin authored
Add parallel PrefixSpan algorithm and test file. Support non-temporal sequences. Author: zhangjiajin <zhangjiajin@huawei.com> Author: zhang jiajin <zhangjiajin@huawei.com> Closes #7258 from zhangjiajin/master and squashes the following commits: ca9c4c8 [zhangjiajin] Modified the code according to the review comments. 574e56c [zhangjiajin] Add new object LocalPrefixSpan, and do some optimization. ba5df34 [zhangjiajin] Fix a Scala style error. 4c60fb3 [zhangjiajin] Fix some Scala style errors. 1dd33ad [zhangjiajin] Modified the code according to the review comments. 89bc368 [zhangjiajin] Fixed a Scala style error. a2eb14c [zhang jiajin] Delete PrefixspanSuite.scala 951fd42 [zhang jiajin] Delete Prefixspan.scala 575995f [zhangjiajin] Modified the code according to the review comments. 91fd7e6 [zhangjiajin] Add new algorithm PrefixSpan and test file.
-
jose.cambronero authored
This contribution is my original work and I license it to the project under it's open source license. Author: jose.cambronero <jose.cambronero@cloudera.com> Closes #6994 from josepablocam/master and squashes the following commits: bbb30b1 [jose.cambronero] renamed KSTestResult to KolmogorovSmirnovTestResult, to stay consistent with method name 0d0c201 [jose.cambronero] kstTest -> kolmogorovSmirnovTest in statistics.md 1f56371 [jose.cambronero] changed ksTest in public API to kolmogorovSmirnovTest for clarity a48ae7b [jose.cambronero] refactor code to account for serializable RealDistribution. Reuse testOneSample( _, cdf) 1bb44bd [jose.cambronero] style and doc changes. Factored out ks test into 2 separate tests 2ec2aa6 [jose.cambronero] initialize to stdnormal when no params passed (and log). Change unit tests to approximate equivalence rather than strict a4bc0c7 [jose.cambronero] changed ksTest(data, distName) to ksTest(data, distName, params*) after api discussions. Changed tests and docs accordingly 7e66f57 [jose.cambronero] copied implementation note to public api docs, and added @see for links to wiki info e760ebd [jose.cambronero] line length changes to fit style check 3288e42 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty 9026895 [jose.cambronero] addressed style changes, correctness change to simpler approach, and fixed edge case for foldLeft in searchOneSampleCandidates when a partition is empty 1226b30 [jose.cambronero] reindent multi-line lambdas, prior intepretation of style guide was wrong on my part 9c0f1af [jose.cambronero] additional style changes incorporated and added documentation to mllib statistics docs 3f81ad2 [jose.cambronero] renamed ks1 sample test for clarity 992293b [jose.cambronero] Style changes as per comments and added implementation note explaining the distributed approach. 6a4784f [jose.cambronero] specified what distributions are available for the convenience method ksTest(data, name) (solely standard normal) 4b8ba61 [jose.cambronero] fixed off by 1/N in cases when post-constant adjustment ecdf is above cdf, but prior to adj it was below 0b5e8ec [jose.cambronero] changed KS one sample test to perform just 1 distributed pass (in addition to the sorting pass), operates on each partition separately. Implementation of Sandy Ryza's algorithm 16b5c4c [jose.cambronero] renamed dat to data and eliminated recalc of RDD size by sharing as argument between empirical and evalOneSampleP c18dc66 [jose.cambronero] removed ksTestOpt from API and changed comments in HypothesisTestSuite accordingly f6951b6 [jose.cambronero] changed style and some comments based on feedback from pull request b9cff3a [jose.cambronero] made small changes to pass style check ce8e9a1 [jose.cambronero] added kstest testing in HypothesisTestSuite 4da189b [jose.cambronero] added user facing ks test functions c659ea1 [jose.cambronero] created KS test class 13dfe4d [jose.cambronero] created test result class for ks test
-
rahulpalamuttam authored
Author: rahulpalamuttam <rahulpalamut@gmail.com> Closes #7341 from rahulpalamuttam/TaggingMLlibfpm and squashes the following commits: bef2843 [rahulpalamuttam] fix @since tags in mmlib.fpm cd86252 [rahulpalamuttam] Add @since tags to mllib.fpm
-