  1. Aug 06, 2014
  2. Aug 05, 2014
  3. Aug 04, 2014
    • Liquan Pei's avatar
      [MLlib] [SPARK-2510] Word2Vec: Distributed Representation of Words · e053c558
      Liquan Pei authored
      This is a pull request regarding SPARK-2510 at https://issues.apache.org/jira/browse/SPARK-2510. Word2Vec creates vector representation of words in a text corpus. The algorithm first constructs a vocabulary from the corpus and then learns vector representation of words in the vocabulary. The vector representation can be used as features in natural language processing and machine learning algorithms.
      
      To make our implementation more scalable, we train each partition separately and merge the model of each partition after each iteration. To make the model more accurate, multiple iterations may be needed.
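
      A minimal usage sketch of the new API (assumptions: `sc` is an existing SparkContext, e.g. in spark-shell, and "corpus.txt" is a hypothetical whitespace-tokenized corpus):

      ~~~
      import org.apache.spark.mllib.feature.Word2Vec

      // One whitespace-tokenized document per line; the file name is made up.
      val corpus = sc.textFile("corpus.txt").map(_.split(" ").toSeq)
      val model = new Word2Vec().fit(corpus)

      // Top 20 closest words to a query word, as in the experiment below.
      model.findSynonyms("china", 20).foreach { case (word, similarity) =>
        println(s"$word $similarity")
      }
      ~~~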
      
      One way to investigate the vector representations is to find the closest words to a query word. For example, with 1 partition and 1 iteration, the top 20 closest words to "china" are:
      
      taiwan 0.8077646146334014
      korea 0.740913304563621
      japan 0.7240667798885471
      republic 0.7107151279078352
      thailand 0.6953217332072862
      tibet 0.6916782118129544
      mongolia 0.6800858715972612
      macau 0.6794925677480378
      singapore 0.6594048695593799
      manchuria 0.658989931844148
      laos 0.6512978726001666
      nepal 0.6380792327845325
      mainland 0.6365469459587788
      myanmar 0.6358614338840394
      macedonia 0.6322366180313249
      xinjiang 0.6285291551708028
      russia 0.6279951236068411
      india 0.6272874944023487
      shanghai 0.6234544135576999
      macao 0.6220588462925876
      
      The result with 10 partitions and 5 iterations is:
      taiwan 0.8310495079388313
      india 0.7737171315919039
      japan 0.756777901233668
      korea 0.7429767187102452
      indonesia 0.7407557427278356
      pakistan 0.712883426985585
      mainland 0.7053379963140822
      thailand 0.696298191073948
      mongolia 0.693690656871415
      laos 0.6913069680735292
      macau 0.6903427690029617
      republic 0.6766381604813666
      malaysia 0.676460699141784
      singapore 0.6728790997360923
      malaya 0.672345232966194
      manchuria 0.6703732292753156
      macedonia 0.6637955686322028
      myanmar 0.6589462882439646
      kazakhstan 0.657017801081494
      cambodia 0.6542383836451932
      
      Author: Liquan Pei <lpei@gopivotal.com>
      Author: Xiangrui Meng <meng@databricks.com>
      Author: Liquan Pei <liquanpei@gmail.com>
      
      Closes #1719 from Ishiihara/master and squashes the following commits:
      
      2ba9483 [Liquan Pei] minor fix for Word2Vec test
      e248441 [Liquan Pei] minor style change
      26a948d [Liquan Pei] Merge pull request #1 from mengxr/Ishiihara-master
      c14da41 [Xiangrui Meng] fix styles
      384c771 [Xiangrui Meng] remove minCount and window from constructor change model to use float instead of double
      e93e726 [Liquan Pei] use treeAggregate instead of aggregate
      1a8fb41 [Liquan Pei] use weighted sum in combOp
      7efbb6f [Liquan Pei] use broadcast version of vocab in aggregate
      6bcc8be [Liquan Pei] add multiple iteration support
      720b5a3 [Liquan Pei] Add test for Word2Vec algorithm, minor fixes
      2e92b59 [Liquan Pei] modify according to feedback
      57dc50d [Liquan Pei] code formatting
      e4a04d3 [Liquan Pei] minor fix
      0aafb1b [Liquan Pei] Add comments, minor fixes
      8d6befe [Liquan Pei] initial commit
      e053c558
  4. Aug 03, 2014
    • DB Tsai's avatar
      SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent... · ae58aea2
      DB Tsai authored
      SPARK-2272 [MLlib] Feature scaling which standardizes the range of independent variables or features of data
      
      Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is generally performed during the data preprocessing step.
      
      In this work, a trait called `VectorTransformer` is defined for generic transformation on a vector. It contains one method to be implemented, `transform` which applies transformation on a vector.
      
      There are two implementations of `VectorTransformer` now, and both can easily be extended with PMML transformation support; a usage sketch follows the list below.
      
      1) `StandardScaler` - Standardizes features by removing the mean and scaling to unit variance using column summary statistics on the samples in the training set.
      
      2) `Normalizer` - Normalizes samples individually to unit L^p norm.
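
      A minimal usage sketch of the two transformers (assumptions: `sc` is an existing SparkContext and the tiny dataset is made up):

      ~~~
      import org.apache.spark.mllib.feature.{Normalizer, StandardScaler}
      import org.apache.spark.mllib.linalg.Vectors

      val data = sc.parallelize(Seq(
        Vectors.dense(1.0, 10.0, 100.0),
        Vectors.dense(2.0, 20.0, 200.0),
        Vectors.dense(3.0, 30.0, 300.0)))

      // Learn column summary statistics on the training set, then standardize it.
      val scaler = new StandardScaler(withMean = true, withStd = true).fit(data)
      val standardized = scaler.transform(data)

      // Normalize each sample to unit L^2 norm (p = 2).
      val normalized = new Normalizer(2.0).transform(data)
      ~~~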
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #1207 from dbtsai/dbtsai-feature-scaling and squashes the following commits:
      
      78c15d3 [DB Tsai] Alpine Data Labs
      ae58aea2
    • Joseph K. Bradley's avatar
      [SPARK-2197] [mllib] Java DecisionTree bug fix and ease of use · 2998e38a
      Joseph K. Bradley authored
      Bug fix: Before, when an RDD was created in Java and passed to DecisionTree.train(), the fake class tag caused problems.
      * Fix: DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java.
      
      Other improvements to Decision Trees for ease of use with Java:
      * impurity classes: Added instance() methods to help with Java interface.
      * Strategy: Added Java-friendly constructor
      --> Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.
      
      CC: mengxr
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1740 from jkbradley/dt-java-new and squashes the following commits:
      
      0805dc6 [Joseph K. Bradley] Changed Strategy to use JavaConverters instead of JavaConversions
      519b1b7 [Joseph K. Bradley] * Organized imports in JavaDecisionTreeSuite.java * Using JavaConverters instead of JavaConversions in DecisionTreeSuite.scala
      f7b5ca1 [Joseph K. Bradley] Improvements to make it easier to run DecisionTree from Java. * DecisionTree: Used new RDD.retag() method to allow passing RDDs from Java. * impurity classes: Added instance() methods to help with Java interface. * Strategy: Added Java-friendly constructor ** Note: I removed quantileCalculationStrategy from the Java-friendly constructor since (a) it is a special class and (b) there is only 1 option currently.  I suspect we will redo the API before the other options are included.
      d78ada6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
      320853f [Joseph K. Bradley] Added JavaDecisionTreeSuite, partly written
      13a585e [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dt-java
      f1a8283 [Joseph K. Bradley] Added old JavaDecisionTreeSuite, to be updated later
      225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
      2998e38a
  5. Aug 02, 2014
    • Joseph K. Bradley's avatar
      [SPARK-2478] [mllib] DecisionTree Python API · 3f67382e
      Joseph K. Bradley authored
      Added experimental Python API for Decision Trees.
      
      API:
      * class DecisionTreeModel
      ** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
      ** numNodes()
      ** depth()
      ** __str__()
      * class DecisionTree
      ** trainClassifier()
      ** trainRegressor()
      ** train()
      
      Examples and testing:
      * Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
      * Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors
      
      Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.
      
      CC mengxr manishamde
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:
      
      3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py Small updates based on github review.
      6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
      67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
      aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
      fa10ea7 [Joseph K. Bradley] Small style update
      7968692 [Joseph K. Bradley] small braces typo fix
      e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
      db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
      6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
      93953f1 [Joseph K. Bradley] Likely done with Python API.
      6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
      188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
      2b20c61 [Joseph K. Bradley] Small doc and style updates
      1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
      8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
      376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
      e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
      52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
      8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
      cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
      8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      2283df8 [Joseph K. Bradley] 2 bug fixes.
      73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
      f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.
      3f67382e
    • GuoQiang Li's avatar
      [SPARK-1470][SPARK-1842] Use the scala-logging wrapper instead of the slf4j API directly · adc83032
      GuoQiang Li authored
      Author: GuoQiang Li <witgo@qq.com>
      
      Closes #1369 from witgo/SPARK-1470_new and squashes the following commits:
      
      66a1641 [GuoQiang Li] IncompatibleResultTypeProblem
      73a89ba [GuoQiang Li] Use the scala-logging wrapper instead of the slf4j API directly.
      adc83032
    • Burak's avatar
      [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator.... · fda47598
      Burak authored
      [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator. RandomRDD is now of generic type
      
      The RandomRDDGenerators used to only output RDD[Double].
      Now RandomRDDGenerators.randomRDD can be used to generate a random RDD[T] via a class that extends RandomDataGenerator, by supplying the type T and overriding the nextValue() function as desired.
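
      A sketch of a custom generator (assumptions: `sc` is an existing SparkContext; `CoinFlipGenerator` and the size/partitions/seed arguments are made up):

      ~~~
      import org.apache.spark.mllib.random.{RandomDataGenerator, RandomRDDGenerators}

      // A made-up generator: supply T = Boolean and override nextValue().
      class CoinFlipGenerator extends RandomDataGenerator[Boolean] {
        private val rng = new java.util.Random()
        override def nextValue(): Boolean = rng.nextBoolean()
        override def setSeed(seed: Long): Unit = rng.setSeed(seed)
        override def copy(): CoinFlipGenerator = new CoinFlipGenerator
      }

      // 1000 coin flips in 10 partitions with seed 42.
      val flips = RandomRDDGenerators.randomRDD(sc, new CoinFlipGenerator, 1000L, 10, 42L)
      ~~~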
      
      Author: Burak <brkyvz@gmail.com>
      
      Closes #1732 from brkyvz/SPARK-2801 and squashes the following commits:
      
      c94a694 [Burak] [SPARK-2801][MLlib] Missing ClassTags added
      22d96fe [Burak] [SPARK-2801][MLlib]: DistributionGenerator renamed to RandomDataGenerator, generic types added for RandomRDD instead of Double
      fda47598
  6. Aug 01, 2014
    • Tor Myklebust's avatar
      [SPARK-1580][MLLIB] Estimate ALS communication and computation costs. · e25ec061
      Tor Myklebust authored
      Continue the work from #493.
      
      Closes #493 and Closes #593
      
      Author: Tor Myklebust <tmyklebu@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1731 from mengxr/tmyklebu-alscost and squashes the following commits:
      
      9b56a8b [Xiangrui Meng] updated API and added a simple test
      68a3229 [Xiangrui Meng] merge master
      217bd1d [Tor Myklebust] Documentation and choleskies -> subproblems.
      8cbb718 [Tor Myklebust] Braces get spaces.
      0455cd4 [Tor Myklebust] Parens for collectAsMap.
      2b2febe [Tor Myklebust] Use `makeLinkRDDs` when estimating costs.
      2ab7a5d [Tor Myklebust] Reindent estimateCost's declaration and make it return Seqs.
      8b21e6d [Tor Myklebust] Fix overlong lines.
      8cbebf1 [Tor Myklebust] Rename and clean up the return format of cost estimator.
      6615ed5 [Tor Myklebust] It's more useful to give per-partition estimates.  Do that.
      5530678 [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark into alscost
      6c31324 [Tor Myklebust] Make it actually build...
      a1184d1 [Tor Myklebust] Mark ALS.evaluatePartitioner DeveloperApi.
      657a71b [Tor Myklebust] Simple-minded estimates of computation and communication costs in ALS.
      dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
      23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
      495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
      674933a [Tor Myklebust] Fix style.
      40edc23 [Tor Myklebust] Fix missing space.
      f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
      5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
      36a0f43 [Tor Myklebust] Make the partitioner private.
      d872b09 [Tor Myklebust] Add negative id ALS test.
      df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
      c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
      c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
      e25ec061
    • Michael Giannakopoulos's avatar
      [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods. · c2811892
      Michael Giannakopoulos authored
      Related to issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC).
      
      Author: Michael Giannakopoulos <miccagiann@gmail.com>
      
      Closes #1624 from miccagiann/new-branch and squashes the following commits:
      
      c02e5f5 [Michael Giannakopoulos] Merge cleanly with upstream/master.
      8dcb888 [Michael Giannakopoulos] Putting the if/else if statements in brackets.
      fed8eaa [Michael Giannakopoulos] Adding a space in the message related to the IllegalArgumentException.
      44e6ff0 [Michael Giannakopoulos] Adding a blank line before python class LinearRegressionWithSGD.
      8eba9c5 [Michael Giannakopoulos] Change function signatures. Exception is thrown from the scala component and not from the python one.
      638be47 [Michael Giannakopoulos] Modified code to comply with code standards.
      ec50ee9 [Michael Giannakopoulos] Shorten the if-elif-else statement in regression.py file
      b962744 [Michael Giannakopoulos] Replaced the enum classes, with strings-keywords for defining the values of 'regType' parameter.
      78853ec [Michael Giannakopoulos] Providing intercept and regularizer functionality for linear methods in only one function.
      3ac8874 [Michael Giannakopoulos] Added support for regularizer and intercept parameters for linear regression method.
      c2811892
    • Jeremy Freeman's avatar
      Streaming mllib [SPARK-2438][MLLIB] · f6a18993
      Jeremy Freeman authored
      This PR implements a streaming linear regression analysis, in which a linear regression model is trained online as new data arrive. The design is based on discussions with tdas and mengxr, in which we determined how to add this functionality in a general way, with minimal changes to existing libraries.
      
      __Summary of additions:__
      
      _StreamingLinearAlgorithm_
      - An abstract class for fitting generalized linear models online to streaming data, including training on (and updating) a model, and making predictions.
      
      _StreamingLinearRegressionWithSGD_
      - Class and companion object for running streaming linear regression
      
      _StreamingLinearRegressionTestSuite_
      - Unit tests
      
      _StreamingLinearRegression_
      - Example use case: fitting a model online to data from one stream, and making predictions on other data
      
      __Notes__
      - If this looks good, I can use the StreamingLinearAlgorithm class to easily implement other analyses that follow the same logic (Ridge, Lasso, Logistic, SVM).
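
      A minimal sketch of the intended workflow (assumptions: `sc` is an existing SparkContext; the stream directories, batch interval, and feature count are made up; labeled points arrive as text parseable by `LabeledPoint.parse`):

      ~~~
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.{LabeledPoint, StreamingLinearRegressionWithSGD}
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      val ssc = new StreamingContext(sc, Seconds(1))
      val numFeatures = 3
      val trainingData = ssc.textFileStream("training/").map(LabeledPoint.parse)
      val testData = ssc.textFileStream("test/").map(LabeledPoint.parse)

      // Weights must be initialized explicitly before training starts.
      val model = new StreamingLinearRegressionWithSGD()
        .setInitialWeights(Vectors.dense(new Array[Double](numFeatures)))

      model.trainOn(trainingData)        // update the model as new batches arrive
      model.predictOn(testData).print()  // predict on a second stream

      ssc.start()
      ssc.awaitTermination()
      ~~~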
      
      Author: Jeremy Freeman <the.freeman.lab@gmail.com>
      Author: freeman <the.freeman.lab@gmail.com>
      
      Closes #1361 from freeman-lab/streaming-mllib and squashes the following commits:
      
      775ea29 [Jeremy Freeman] Throw error if user doesn't initialize weights
      4086fee [Jeremy Freeman] Fixed current weight formatting
      8b95b27 [Jeremy Freeman] Restored broadcasting
      29f27ec [Jeremy Freeman] Formatting
      8711c41 [Jeremy Freeman] Used return to avoid indentation
      777b596 [Jeremy Freeman] Restored treeAggregate
      74cf440 [Jeremy Freeman] Removed static methods
      d28cf9a [Jeremy Freeman] Added usage notes
      c3326e7 [Jeremy Freeman] Improved documentation
      9541a41 [Jeremy Freeman] Merge remote-tracking branch 'upstream/master' into streaming-mllib
      66eba5e [Jeremy Freeman] Fixed line lengths
      2fe0720 [Jeremy Freeman] Minor cleanup
      7d51378 [Jeremy Freeman] Moved streaming loader to MLUtils
      b9b69f6 [Jeremy Freeman] Added setter methods
      c3f8b5a [Jeremy Freeman] Modified logging
      00aafdc [Jeremy Freeman] Add modifiers
      14b801e [Jeremy Freeman] Name changes
      c7d38a3 [Jeremy Freeman] Move check for empty data to GradientDescent
      4b0a5d3 [Jeremy Freeman] Cleaned up tests
      74188d6 [Jeremy Freeman] Eliminate dependency on commons
      50dd237 [Jeremy Freeman] Removed experimental tag
      6bfe1e6 [Jeremy Freeman] Fixed imports
      a2a63ad [freeman] Makes convergence test more robust
      86220bc [freeman] Streaming linear regression unit tests
      fb4683a [freeman] Minor changes for scalastyle consistency
      fd31e03 [freeman] Changed logging behavior
      453974e [freeman] Fixed indentation
      c4b1143 [freeman] Streaming linear regression
      604f4d7 [freeman] Expanded private class to include mllib
      d99aa85 [freeman] Helper methods for streaming MLlib apps
      0898add [freeman] Added dependency on streaming
      f6a18993
    • Joseph K. Bradley's avatar
      [SPARK-2796] [mllib] DecisionTree bug fix: ordered categorical features · 7058a539
      Joseph K. Bradley authored
      Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
      
      Added new test to DecisionTreeSuite to catch this: "regression stump with categorical variables of arity 2"
      
      Bug fix: Modified upper bound discussed above.
      
      Also: Small improvements to coding style in DecisionTree.
      
      CC mengxr manishamde
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1720 from jkbradley/decisiontree-bugfix2 and squashes the following commits:
      
      225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
      7058a539
    • Doris Xin's avatar
      [SPARK-2786][mllib] Python correlations · d88e6956
      Doris Xin authored
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1713 from dorx/pythonCorrelation and squashes the following commits:
      
      5f1e60c [Doris Xin] reviewer comments.
      46ff6eb [Doris Xin] reviewer comments.
      ad44085 [Doris Xin] style fix
      e69d446 [Doris Xin] fixed missed conflicts.
      eb5bf56 [Doris Xin] merge master
      cc9f725 [Doris Xin] units passed.
      9141a63 [Doris Xin] WIP2
      d199f1f [Doris Xin] Moved correlation names into a public object
      cd163d6 [Doris Xin] WIP
      d88e6956
    • Xiangrui Meng's avatar
      [HOTFIX] downgrade breeze version to 0.7 · 5328c0aa
      Xiangrui Meng authored
      breeze-0.8.1 causes dependency issues, as discussed in #940 .
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1718 from mengxr/revert-breeze and squashes the following commits:
      
      99c4681 [Xiangrui Meng] downgrade breeze version to 0.7
      5328c0aa
    • witgo's avatar
      [SPARK-1997] update breeze to version 0.8.1 · 0dacb1ad
      witgo authored
      `breeze 0.8.1` depends on `scala-logging-slf4j 2.1.1`. The relevant code is in #1369.
      
      Author: witgo <witgo@qq.com>
      
      Closes #940 from witgo/breeze-8.0.1 and squashes the following commits:
      
      65cc65e [witgo] update breeze  to version 0.8.1
      0dacb1ad
    • Sean Owen's avatar
      SPARK-2768 [MLLIB] Add product, user recommend method to MatrixFactorizationModel · 82d209d4
      Sean Owen authored
      Right now, `MatrixFactorizationModel` can only predict a score for one or more `(user,product)` tuples. As a comment in the file notes, it would be more useful to expose a recommend method, that computes top N scoring products for a user (or vice versa – users for a product).
      
      (This also corrects some long lines in the Java ALS test suite.)
      
      As you can see, it's a little messy to access the class from Java. Should there be a Java-friendly wrapper for it? With a pointer about where that should go, I could add that.
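
      A sketch of the new methods (assumptions: `model` is a trained `MatrixFactorizationModel`, e.g. from `ALS.train`, and the IDs are made up):

      ~~~
      // Top 10 scoring products for user 42, as Array[Rating].
      val topProducts = model.recommendProducts(42, 10)

      // And vice versa: top 10 users for product 7.
      val topUsers = model.recommendUsers(7, 10)
      ~~~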
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1687 from srowen/SPARK-2768 and squashes the following commits:
      
      b349675 [Sean Owen] Additional review changes
      c9edb04 [Sean Owen] Updates from code review
      7bc35f9 [Sean Owen] Add recommend methods to MatrixFactorizationModel
      82d209d4
  7. Jul 31, 2014
    • Doris Xin's avatar
      [SPARK-2782][mllib] Bug fix for getRanks in SpearmanCorrelation · c4755403
      Doris Xin authored
      Before this patch, getRanks computed the wrong rank when numPartitions >= size of the input RDDs. Added unit tests to cover this bug.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1710 from dorx/correlationBug and squashes the following commits:
      
      733def4 [Doris Xin] bugs and reviewer comments.
      31db920 [Doris Xin] revert unnecessary change
      043ff83 [Doris Xin] bug fix for spearman corner case
      c4755403
    • Xiangrui Meng's avatar
      [SPARK-2777][MLLIB] change ALS factors storage level to MEMORY_AND_DISK · b1900832
      Xiangrui Meng authored
      Now the factors are persisted in memory only. If they get evicted by later jobs, we might have to restart the computation from the very beginning. A better solution is changing the storage level to `MEMORY_AND_DISK`.
      
      srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1700 from mengxr/als-level and squashes the following commits:
      
      c103d76 [Xiangrui Meng] change ALS factors storage level to MEMORY_AND_DISK
      b1900832
    • Joseph K. Bradley's avatar
      [SPARK-2756] [mllib] Decision tree bug fixes · b124de58
      Joseph K. Bradley authored
      (1) Inconsistent aggregate (agg) indexing for unordered features.
      (2) Fixed gain calculations for edge cases.
      (3) Off-by-one error in choosing thresholds for continuous features for small datasets.
      (4) (not a bug) Changed meaning of tree depth by 1 to fit scikit-learn and rpart. (Depth 1 used to mean 1 leaf node; depth 0 now means 1 leaf node.)
      
      Other updates, to help with tests:
      * Updated DecisionTreeRunner to print more info.
      * Added utility functions to DecisionTreeModel: toString, depth, numNodes
      * Improved internal DecisionTree documentation
      
      Bug fix details:
      
      (1) Indexing was inconsistent for aggregate calculations for unordered features (in multiclass classification with categorical features, where the features had few enough values such that they could be considered unordered, i.e., isSpaceSufficientForAllCategoricalSplits=true).
      
      * updateBinForUnorderedFeature indexed agg as (node, feature, featureValue, binIndex), where
      ** featureValue was from arr (so it was a feature value)
      ** binIndex was in [0, 2^(maxFeatureValue-1) - 1)
      * The rest of the code indexed agg as (node, feature, binIndex, label).
      * Corrected this bug by changing updateBinForUnorderedFeature to use the second indexing pattern.
      
      Unit tests in DecisionTreeSuite
      * Updated a few tests to train a model and test its training accuracy, which catches the indexing bug from updateBinForUnorderedFeature() discussed above.
      * Added new test (“stump with categorical variables for multiclass classification, with just enough bins”) to test bin extremes.
      
      (2) Bug fix: calculateGainForSplit (for classification):
      * It used to return dummy prediction values when either the right or left child had 0 weight.  These were incorrect for multiclass classification.  This has been corrected.
      
      Updated impurities to allow for count = 0.  This was related to the above bug fix for calculateGainForSplit (for classification).
      
      Small updates to documentation and coding style.
      
      (3) Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
      
      * Exhibited bug in new test in DecisionTreeSuite: “stump with 1 continuous variable for binary classification, to check off-by-1 error”
      * Description: When finding thresholds for possible splits for continuous features in DecisionTree.findSplitsBins, the thresholds were set according to individual training examples’ feature values.
      * Fix: The threshold is set to be the average of 2 consecutive (sorted) examples' feature values.  E.g.: If the old code set the threshold using example i, the new code sets the threshold to the average of examples i and i+1.
      * Note: In 4 DecisionTreeSuite tests with all labels identical, removed check of threshold since it is somewhat arbitrary.
      
      CC: mengxr manishamde  Please let me know if I missed something!
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1673 from jkbradley/decisiontree-bugfix and squashes the following commits:
      
      2b20c61 [Joseph K. Bradley] Small doc and style updates
      dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
      8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
      376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
      59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
      52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
      8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
      2283df8 [Joseph K. Bradley] 2 bug fixes.
      73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
      b124de58
    • Doris Xin's avatar
      [SPARK-2724] Python version of RandomRDDGenerators · d8430148
      Doris Xin authored
      The Python version of RandomRDDGenerators, but without support for randomRDD and randomVectorRDD, which take in an arbitrary DistributionGenerator.
      
      `randomRDD.py` is named to avoid collision with the built-in Python `random` package.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1628 from dorx/pythonRDD and squashes the following commits:
      
      55c6de8 [Doris Xin] review comments. all python units passed.
      f831d9b [Doris Xin] moved default args logic into PythonMLLibAPI
      2d73917 [Doris Xin] fix for linalg.py
      8663e6a [Doris Xin] reverting back to a single python file for random
      f47c481 [Doris Xin] docs update
      687aac0 [Doris Xin] add RandomRDDGenerators.py to run-tests
      4338f40 [Doris Xin] renamed randomRDD to rand and import as random
      29d205e [Doris Xin] created mllib.random package
      bd2df13 [Doris Xin] typos
      07ddff2 [Doris Xin] units passed.
      23b2ecd [Doris Xin] WIP
      d8430148
    • Xiangrui Meng's avatar
      [SPARK-2511][MLLIB] add HashingTF and IDF · dc0865bc
      Xiangrui Meng authored
      This is roughly the TF-IDF implementation used in the Databricks Cloud Demo: http://databricks.com/cloud/ .
      
      Both `HashingTF` and `IDF` are implemented as transformers, similar to scikit-learn.
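
      A minimal usage sketch (assumptions: `sc` is an existing SparkContext and "docs.txt" is a hypothetical corpus with one whitespace-tokenized document per line):

      ~~~
      import org.apache.spark.mllib.feature.{HashingTF, IDF}

      val documents = sc.textFile("docs.txt").map(_.split(" ").toSeq)

      // Hash terms into fixed-dimensional term-frequency vectors, one per document.
      val tf = new HashingTF().transform(documents)
      tf.cache()

      // Fit inverse document frequencies on the corpus, then rescale the TF vectors.
      val idf = new IDF().fit(tf)
      val tfidf = idf.transform(tf)
      ~~~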
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1671 from mengxr/tfidf and squashes the following commits:
      
      7d65888 [Xiangrui Meng] use JavaConverters._
      5fe9ec4 [Xiangrui Meng] fix unit test
      6e214ec [Xiangrui Meng] add apache header
      cfd9aed [Xiangrui Meng] add Java-friendly methods move classes to mllib.feature
      3814440 [Xiangrui Meng] add HashingTF and IDF
      dc0865bc
  8. Jul 30, 2014
    • Sean Owen's avatar
      SPARK-2341 [MLLIB] loadLibSVMFile doesn't handle regression datasets · e9b275b7
      Sean Owen authored
      Per discussion at https://issues.apache.org/jira/browse/SPARK-2341 , this is a look at deprecating the multiclass parameter. Thoughts welcome of course.
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1663 from srowen/SPARK-2341 and squashes the following commits:
      
      8a3abd7 [Sean Owen] Suppress MIMA error for removed package private classes
      18a8c8e [Sean Owen] Updates from review
      83d0092 [Sean Owen] Deprecated methods with multiclass, and instead always parse target as a double (ie. multiclass = true)
      e9b275b7
    • Sean Owen's avatar
      SPARK-2749 [BUILD]. Spark SQL Java tests aren't compiling in Jenkins' Maven... · 6ab96a6f
      Sean Owen authored
      SPARK-2749 [BUILD]. Spark SQL Java tests aren't compiling in Jenkins' Maven builds; missing junit:junit dep
      
      The Maven-based builds in the build matrix have been failing for a few days:
      
      https://amplab.cs.berkeley.edu/jenkins/view/Spark/
      
      On inspection, it looks like the Spark SQL Java tests don't compile:
      
      https://amplab.cs.berkeley.edu/jenkins/view/Spark/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/244/consoleFull
      
      I confirmed it by repeating the command against master:
      
      `mvn -Dhadoop.version=1.0.4 -Dlabel=centos -DskipTests clean package`
      
      The problem is that this module doesn't depend on JUnit. In fact, none of the modules do, but `com.novocode:junit-interface` (the SBT-JUnit bridge) pulls it in, in most places. However, this module doesn't depend on `com.novocode:junit-interface`.
      
      Adding the `junit:junit` dependency fixes the compile problem. In fact, the other modules with Java tests should probably depend on it explicitly instead of happening to get it via `com.novocode:junit-interface`, since that is a bit SBT/Scala-specific (and I am not even sure it's needed).
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1660 from srowen/SPARK-2749 and squashes the following commits:
      
      858ff7c [Sean Owen] Add explicit junit dep to other modules with Java tests for robustness
      9636794 [Sean Owen] Add junit dep so that Spark SQL Java tests compile
      6ab96a6f
    • GuoQiang Li's avatar
      [SPARK-2544][MLLIB] Improve ALS algorithm resource usage · fc47bb69
      GuoQiang Li authored
      Author: GuoQiang Li <witgo@qq.com>
      Author: witgo <witgo@qq.com>
      
      Closes #929 from witgo/improve_als and squashes the following commits:
      
      ea25033 [GuoQiang Li] checkpoint products 3,6,9 ...
      154dccf [GuoQiang Li] checkpoint products only
      c5779ff [witgo] Improve ALS algorithm resource usage
      fc47bb69
    • Sean Owen's avatar
      SPARK-2748 [MLLIB] [GRAPHX] Loss of precision for small arguments to Math.exp, Math.log · ee07541e
      Sean Owen authored
      In a few places in MLlib, an expression of the form `log(1.0 + p)` is evaluated. When p is so small that `1.0 + p == 1.0`, the result is 0.0. However, the correct answer is very near `p`. This is why `Math.log1p` exists.
      
      Similarly for one instance of `exp(m) - 1` in GraphX; there's a special `Math.expm1` method.
      
      While the errors occur only for very small arguments, such arguments are entirely possible given these expressions' use in machine learning algorithms.
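
      A quick illustration of the precision loss (plain Scala, no Spark needed):

      ~~~
      math.log(1.0 + 1e-20)  // 0.0, because 1.0 + 1e-20 rounds to exactly 1.0
      math.log1p(1e-20)      // 1.0E-20, correct to full double precision
      math.exp(1e-20) - 1.0  // 0.0, for the same reason
      math.expm1(1e-20)      // 1.0E-20
      ~~~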
      
      Also note the related PR for Python: https://github.com/apache/spark/pull/1652
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1659 from srowen/SPARK-2748 and squashes the following commits:
      
      c5926d4 [Sean Owen] Use log1p, expm1 for better precision for tiny arguments
      ee07541e
  9. Jul 29, 2014
    • Xiangrui Meng's avatar
      [SPARK-2174][MLLIB] treeReduce and treeAggregate · 20424dad
      Xiangrui Meng authored
      In `reduce` and `aggregate`, the driver node spends time linear in the number of partitions. It becomes a bottleneck when there are many partitions and the data from each partition is big.
      
      SPARK-1485 (#506) tracks the progress of implementing AllReduce on Spark. I did several implementations including butterfly, reduce + broadcast, and treeReduce + broadcast. treeReduce + BT broadcast seems to be the right way to go for Spark. Using a binary tree may introduce some overhead in communication, because the driver still needs to coordinate the data shuffling. In my experiments, n -> sqrt(n) -> 1 gives the best performance in general, which is why I set "depth = 2" in MLlib algorithms. But it certainly needs more testing.
      
      I left `treeReduce` and `treeAggregate` public for easy testing. Some numbers from a test on a 32-node m3.2xlarge cluster.
      
      code:
      
      ~~~
      import breeze.linalg._
      import org.apache.log4j._
      
      Logger.getRootLogger.setLevel(Level.OFF)
      
      for (n <- Seq(1, 10, 100, 1000, 10000, 100000, 1000000)) {
        val vv = sc.parallelize(0 until 1024, 1024).map(i => DenseVector.zeros[Double](n))
        var start = System.nanoTime(); vv.treeReduce(_ + _, 2); println((System.nanoTime() - start) / 1e9)
        start = System.nanoTime(); vv.reduce(_ + _); println((System.nanoTime() - start) / 1e9)
      }
      ~~~
      
      out:
      
      | n | treeReduce(_ + _, 2) time (s) | reduce time (s) |
      |---|---------------------|-----------|
      | 10 | 0.215538731 | 0.204206899 |
      | 100 | 0.278405907 | 0.205732582 |
      | 1000 | 0.208972182 | 0.214298272 |
      | 10000 | 0.194792071 | 0.349353687 |
      | 100000 | 0.347683285 | 6.086671892 |
      | 1000000 | 2.589350682 | 66.572906702 |
      
      CC: @pwendell
      
      This is clearly more scalable than the default implementation. My question is whether we should use this implementation in `reduce` and `aggregate` or keep them as separate methods. The concern is that users may use `reduce` and `aggregate` the way they use `collect`, where having multiple stages doesn't reduce the data size; in that case, `collect` is more appropriate.
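
      For reference, a sketch of `treeAggregate` in use (assumptions: `sc` is an existing SparkContext, and the implicit conversion lives in `org.apache.spark.mllib.rdd.RDDFunctions`, where commit d58a087 above moved these methods):

      ~~~
      import org.apache.spark.mllib.rdd.RDDFunctions._

      // Sum of squares with a depth-2 tree: per-partition results are combined
      // level by level instead of all at once on the driver.
      val data = sc.parallelize(1 to 1000000, 1024)
      val sumSq = data.treeAggregate(0.0)(
        (acc, x) => acc + x.toDouble * x,  // seqOp within a partition
        _ + _,                             // combOp across partitions
        2)                                 // depth
      ~~~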
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1110 from mengxr/tree and squashes the following commits:
      
      c6cd267 [Xiangrui Meng] make depth default to 2
      b04b96a [Xiangrui Meng] address comments
      9bcc5d3 [Xiangrui Meng] add depth for readability
      7495681 [Xiangrui Meng] fix compile error
      142a857 [Xiangrui Meng] merge master
      d58a087 [Xiangrui Meng] move treeReduce and treeAggregate to mllib
      8a2a59c [Xiangrui Meng] Merge branch 'master' into tree
      be6a88a [Xiangrui Meng] use treeAggregate in mllib
      0f94490 [Xiangrui Meng] add docs
      eb71c33 [Xiangrui Meng] add treeReduce
      fe42a5e [Xiangrui Meng] add treeAggregate
      20424dad
  10. Jul 28, 2014
    • Cheng Lian's avatar
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server (with Maven profile fix) · a7a9d144
      Cheng Lian authored
      JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      
      Another try for #1399 & #1600. Those two PRs broke Jenkins builds because we made a separate profile `hive-thriftserver` in sub-project `assembly`, but the `hive-thriftserver` module was defined outside the `hive-thriftserver` profile. Thus every pull request, even one that didn't touch SQL code, also executed the test suites defined in `hive-thriftserver`, and the tests failed because the related .class files were not included in the assembly jar.
      
      In the most recent commit, module `hive-thriftserver` is moved into its own profile to fix this problem. All previous commits are squashed for clarity.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1620 from liancheng/jdbc-with-maven-fix and squashes the following commits:
      
      629988e [Cheng Lian] Moved hive-thriftserver module definition into its own profile
      ec3c7a7 [Cheng Lian] Cherry picked the Hive Thrift server
      a7a9d144
    • DB Tsai's avatar
      [SPARK-2479][MLlib] Comparing floating-point numbers using relative error in UnitTests · 255b56f9
      DB Tsai authored
      Floating point math is not exact, and most floating-point numbers end up being slightly imprecise due to rounding errors.
      
      Simple values like 0.1 cannot be precisely represented using binary floating point numbers, and the limited precision of floating point numbers means that slight changes in the order of operations or the precision of intermediates can change the result.
      
      That means that comparing two floats to see if they are equal is usually not what we want. As long as this imprecision stays small, it can usually be ignored.
      
      Based on discussion in the community, we have implemented two different APIs: one for relative tolerance and one for absolute tolerance. It makes sense that test writers should know which one they need depending on their circumstances.
      
      Developers also need to specify the eps explicitly; there is no default value, since a default would sometimes cause confusion.
      
      When comparing against zero using relative tolerance, an exception will be raised to warn users that it's meaningless.
      
      For relative tolerance, users can now write
      
          assert(23.1 ~== 23.52 relTol 0.02)
          assert(23.1 ~== 22.74 relTol 0.02)
          assert(23.1 ~= 23.52 relTol 0.02)
          assert(23.1 ~= 22.74 relTol 0.02)
          assert(!(23.1 !~= 23.52 relTol 0.02))
          assert(!(23.1 !~= 22.74 relTol 0.02))
      
          // This will throw exception with the following message.
          // "Did not expect 23.1 and 23.52 to be within 0.02 using relative tolerance."
          assert(23.1 !~== 23.52 relTol 0.02)
      
          // "Expected 23.1 and 22.34 to be within 0.02 using relative tolerance."
          assert(23.1 ~== 22.34 relTol 0.02)
      
      For absolute error,
      
          assert(17.8 ~== 17.99 absTol 0.2)
          assert(17.8 ~== 17.61 absTol 0.2)
          assert(17.8 ~= 17.99 absTol 0.2)
          assert(17.8 ~= 17.61 absTol 0.2)
          assert(!(17.8 !~= 17.99 absTol 0.2))
          assert(!(17.8 !~= 17.61 absTol 0.2))
      
          // This will throw exception with the following message.
          // "Did not expect 17.8 and 17.99 to be within 0.2 using absolute error."
          assert(17.8 !~== 17.99 absTol 0.2)
      
          // "Expected 17.8 and 17.59 to be within 0.2 using absolute error."
          assert(17.8 ~== 17.59 absTol 0.2)
      
      Authors:
        DB Tsai <dbtsai@alpinenow.com>
        Marek Kolodziej <marek@alpinenow.com>
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #1425 from dbtsai/SPARK-2479_comparing_floating_point and squashes the following commits:
      
      8c7cbcc [DB Tsai] Alpine Data Labs
      255b56f9
  11. Jul 27, 2014
    • Patrick Wendell's avatar
      Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server" · e5bbce9a
      Patrick Wendell authored
      This reverts commit f6ff2a61.
      e5bbce9a
    • Doris Xin's avatar
      [SPARK-2514] [mllib] Random RDD generator · 81fcdd22
      Doris Xin authored
      Utilities for generating random RDDs.
      
      RandomRDD and RandomVectorRDD are created instead of using `sc.parallelize(range:Range)` because `Range` objects in Scala can only have `size <= Int.MaxValue`.
      
      The object `RandomRDDGenerators` can be transformed into a generator class to reduce the number of auxiliary methods for optional arguments.
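
      A sketch of the factory methods (assumptions: `sc` is an existing SparkContext; sizes, partition counts, and seeds are made up):

      ~~~
      import org.apache.spark.mllib.random.RandomRDDGenerators._

      val uniform = uniformRDD(sc, 1000000L, 10, 11L)        // 10^6 draws from U[0, 1]
      val normal = normalRDD(sc, 1000000L, 10, 13L)          // standard normal draws
      val vecs = normalVectorRDD(sc, 10000L, 100, 10, 17L)   // 10^4 vectors of dimension 100
      ~~~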
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1520 from dorx/randomRDD and squashes the following commits:
      
      01121ac [Doris Xin] reviewer comments
      6bf27d8 [Doris Xin] Merge branch 'master' into randomRDD
      a8ea92d [Doris Xin] Reviewer comments
      063ea0b [Doris Xin] Merge branch 'master' into randomRDD
      aec68eb [Doris Xin] newline
      bc90234 [Doris Xin] units passed.
      d56cacb [Doris Xin] impl with RandomRDD
      92d6f1c [Doris Xin] solution for Cloneable
      df5bcff [Doris Xin] Merge branch 'generator' into randomRDD
      f46d928 [Doris Xin] WIP
      49ed20d [Doris Xin] alternative poisson distribution generator
      7cb0e40 [Doris Xin] fix for data inconsistency
      8881444 [Doris Xin] RandomRDDGenerator: initial design
      81fcdd22
    • Cheng Lian's avatar
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · f6ff2a61
      Cheng Lian authored
      (This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)
      
      JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      
      Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1600 from liancheng/jdbc and squashes the following commits:
      
      ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
      f6ff2a61
    • Doris Xin's avatar
      [SPARK-2679] [MLLib] Ser/De for Double · 3a69c72e
      Doris Xin authored
      Added a set of serializer/deserializer for Double in _common.py and PythonMLLibAPI in MLLib.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1581 from dorx/doubleSerDe and squashes the following commits:
      
      86a85b3 [Doris Xin] Merge branch 'master' into doubleSerDe
      2bfe7a4 [Doris Xin] Removed magic byte
      ad4d0d9 [Doris Xin] removed a space in unit
      a9020bc [Doris Xin] units passed
      7dad9af [Doris Xin] WIP
      3a69c72e
    • Xiangrui Meng's avatar
      [SPARK-2361][MLLIB] Use broadcast instead of serializing data directly into task closure · aaf2b735
      Xiangrui Meng authored
      We saw task serialization problems with large feature dimensions, which can be avoided if we don't serialize data directly into the task closure but use broadcast variables instead. This PR uses broadcast in both training and prediction and adds tests to make sure the task size is small.
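
      The pattern in a nutshell (a sketch with made-up data; `sc` is an existing SparkContext):

      ~~~
      val weights = Array.fill(1000000)(0.1)  // large model vector
      val data = sc.parallelize(Seq(Array.fill(1000000)(1.0)))

      // Bad: `weights` is captured by the closure and shipped with every task.
      val scores = data.map(x => x.zip(weights).map { case (a, w) => a * w }.sum)

      // Better: broadcast once; each task dereferences the broadcast value.
      val bcWeights = sc.broadcast(weights)
      val scores2 = data.map(x => x.zip(bcWeights.value).map { case (a, w) => a * w }.sum)
      ~~~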
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1427 from mengxr/broadcast-new and squashes the following commits:
      
      b9a1228 [Xiangrui Meng] style update
      b97c184 [Xiangrui Meng] minimal change to LBFGS
      9ebadcc [Xiangrui Meng] add task size test to RowMatrix
      9427bf0 [Xiangrui Meng] add task size tests to linear methods
      e0a5cf2 [Xiangrui Meng] add task size test to GD
      28a8411 [Xiangrui Meng] add test for NaiveBayes
      380778c [Xiangrui Meng] update KMeans test
      bccab92 [Xiangrui Meng] add task size test to LBFGS
      02103ba [Xiangrui Meng] remove print
      e73d68e [Xiangrui Meng] update tests for k-means
      174cb15 [Xiangrui Meng] use local-cluster for test with a small akka.frameSize
      1928a5a [Xiangrui Meng] add test for KMeans task size
      e00c2da [Xiangrui Meng] use broadcast in GD, KMeans
      010d076 [Xiangrui Meng] modify NaiveBayesModel and GLM to use broadcast
      aaf2b735
  12. Jul 25, 2014
    • Michael Armbrust's avatar
      Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server" · afd757a2
      Michael Armbrust authored
      This reverts commit 06dc0d2c.
      
      #1399 is making Jenkins fail.  We should investigate and put this back after it passes tests.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1594 from marmbrus/revertJDBC and squashes the following commits:
      
      59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
      afd757a2
    • Cheng Lian's avatar
      [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · 06dc0d2c
      Cheng Lian authored
      JIRA issue:
      
      - Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      - Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
      
      Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      (Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)
      
      TODO
      
      - [x] Use `spark-submit` to launch the server, the CLI and beeline
      - [x] Migration guideline draft for Shark users
      
      ----
      
      Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:
      
      ```bash
      $ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
      ```
      
      This actually shows usage information of `SparkSubmit` rather than `BeeLine`.
      
      ~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~
      
      **UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert changes related to this bug since it involves more subtle considerations and is worth a separate PR.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1399 from liancheng/thriftserver and squashes the following commits:
      
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
      06dc0d2c
    • Matei Zaharia's avatar
      SPARK-2657 Use more compact data structures than ArrayBuffer in groupBy & cogroup · 8529ced3
      Matei Zaharia authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-2657
      
      Our current code uses ArrayBuffers for each group of values in groupBy, as well as for the key's elements in CoGroupedRDD. ArrayBuffers have a lot of overhead if there are few values in them, which is likely to happen in cases such as join. In particular, they have a pointer to an Object[] of size 16 by default, which is 24 bytes for the array header + 128 for the pointers in there, plus at least 32 for the ArrayBuffer data structure. This patch replaces the per-group buffers with a CompactBuffer class that can store up to 2 elements more efficiently (in fields of itself) and acts like an ArrayBuffer beyond that. For a key's elements in CoGroupedRDD, we use an Array of CompactBuffers instead of an ArrayBuffer of ArrayBuffers.
      
      There are some changes throughout the code to deal with CoGroupedRDD returning Array instead. We could also decide not to do that, but CoGroupedRDD is a `DeveloperAPI`, so I think it's okay to change it here.
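
      An illustrative sketch of the idea (not the actual CompactBuffer code): keep the first two elements in fields so the tiny groups common in joins never allocate a backing array.

      ~~~
      class TinyBuffer[T] {
        private var elem0: T = _
        private var elem1: T = _
        private var otherElements: Array[AnyRef] = null  // allocated lazily
        private var curSize = 0

        def +=(value: T): this.type = {
          if (curSize == 0) {
            elem0 = value
          } else if (curSize == 1) {
            elem1 = value
          } else {
            if (otherElements == null) {
              otherElements = new Array[AnyRef](8)
            } else if (curSize - 2 == otherElements.length) {
              // Grow like ArrayBuffer once we spill past the two inline fields.
              otherElements = java.util.Arrays.copyOf(otherElements, otherElements.length * 2)
            }
            otherElements(curSize - 2) = value.asInstanceOf[AnyRef]
          }
          curSize += 1
          this
        }

        // Bounds checks omitted for brevity.
        def apply(i: Int): T =
          if (i == 0) elem0
          else if (i == 1) elem1
          else otherElements(i - 2).asInstanceOf[T]
      }
      ~~~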
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #1555 from mateiz/compact-groupby and squashes the following commits:
      
      845a356 [Matei Zaharia] Lower initial size of CompactBuffer's vector to 8
      07621a7 [Matei Zaharia] Review comments
      0c1cd12 [Matei Zaharia] Don't use varargs in CompactBuffer.apply
      bdc8a39 [Matei Zaharia] Small tweak to +=, and typos
      f61f040 [Matei Zaharia] Fix line lengths
      59da88b0 [Matei Zaharia] Fix line lengths
      197cde8 [Matei Zaharia] Make CompactBuffer extend Seq to make its toSeq more efficient
      775110f [Matei Zaharia] Change CoGroupedRDD to give (K, Array[Iterable[_]]) to avoid wrappers
      9b4c6e8 [Matei Zaharia] Use CompactBuffer in CoGroupedRDD
      ed577ab [Matei Zaharia] Use CompactBuffer in groupByKey
      10f0de1 [Matei Zaharia] A CompactBuffer that's more memory-efficient than ArrayBuffer for small buffers
      8529ced3
  13. Jul 24, 2014
  14. Jul 23, 2014
    • Xiangrui Meng's avatar
      [SPARK-2617] Correct doc and usages of preservesPartitioning · 4c7243e1
      Xiangrui Meng authored
      The name `preservesPartitioning` is ambiguous: 1) preserves the indices of partitions, 2) preserves the partitioner. The latter is correct, and `preservesPartitioning` should really be called `preservesPartitioner` to avoid confusion. Unfortunately, it is already part of the API and we cannot change it. We should be clear in the doc and fix wrong usages.
      
      This PR
      
      1. adds notes in `mapPartitions*` (illustrated in the sketch after this list),
      2. makes `RDD.sample` preserve the partitioner,
      3. changes `preservesPartitioning` to false in `RDD.zip` because the keys of the first RDD are no longer the keys of the zipped RDD,
      4. fixes some wrong usages in MLlib.
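
      The flag in action on `mapPartitions` (a sketch with made-up pair data; `sc` is an existing SparkContext):

      ~~~
      import org.apache.spark.HashPartitioner

      val pairs = sc.parallelize(Seq((1, "a"), (2, "b"))).partitionBy(new HashPartitioner(4))

      // Safe: values change but keys do not, so the partitioner still holds
      // and a later reduceByKey or join on `upper` avoids a shuffle.
      val upper = pairs.mapPartitions(
        iter => iter.map { case (k, v) => (k, v.toUpperCase) },
        preservesPartitioning = true)

      // If the function changed the keys, preservesPartitioning = true would
      // silently yield wrong results downstream.
      ~~~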
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1526 from mengxr/preserve-partitioner and squashes the following commits:
      
      b361e65 [Xiangrui Meng] update doc based on pwendell's comments
      3b1ba19 [Xiangrui Meng] update doc
      357575c [Xiangrui Meng] fix unit test
      20b4816 [Xiangrui Meng] Merge branch 'master' into preserve-partitioner
      d1caa65 [Xiangrui Meng] add doc to explain preservesPartitioning fix wrong usage of preservesPartitioning make sample preserse partitioning
      4c7243e1