  1. Jul 27, 2014
    • [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · f6ff2a61
      Cheng Lian authored
      (This is a replacement of #1399, trying to fix potential `HiveThriftServer2` port collision between parallel builds. Please refer to [these comments](https://github.com/apache/spark/pull/1399#issuecomment-50212572) for details.)
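
      For illustration, one common way to avoid such collisions is to let the OS hand out an ephemeral port; a minimal sketch (an assumption about the general technique, not this commit's actual code):

      ```scala
      import java.net.ServerSocket

      // Bind to port 0 so the OS assigns a free ephemeral port, then release it.
      // The port could in principle be taken again before it is reused.
      def randomFreePort(): Int = {
        val socket = new ServerSocket(0)
        try socket.getLocalPort
        finally socket.close()
      }
      ```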
      
      JIRA issue: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      
      Merging the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1600 from liancheng/jdbc and squashes the following commits:
      
      ac4618b [Cheng Lian] Uses random port for HiveThriftServer2 to avoid collision with parallel builds
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
  2. Jul 25, 2014
    • Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server" · afd757a2
      Michael Armbrust authored
      This reverts commit 06dc0d2c.
      
      #1399 is making Jenkins fail. We should investigate and put this back after it's passing tests.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1594 from marmbrus/revertJDBC and squashes the following commits:
      
      59748da [Michael Armbrust] Revert "[SPARK-2410][SQL] Merging Hive Thrift/JDBC server"
    • [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · 06dc0d2c
      Cheng Lian authored
      JIRA issue:
      
      - Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      - Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
      
      Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      (Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)
      
      TODO
      
      - [x] Use `spark-submit` to launch the server, the CLI and beeline
      - [x] Migration guideline draft for Shark users
      
      ----
      
      Hit by a bug in `SparkSubmitArguments` while working on this PR: all application options that are recognized by `SparkSubmitArguments` are stolen as `SparkSubmit` options. For example:
      
      ```bash
      $ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
      ```
      
      This actually shows usage information of `SparkSubmit` rather than `BeeLine`.
      
      ~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~
      
      **UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). I decided to revert the changes for this bug since it involves more subtle considerations and is worth a separate PR.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1399 from liancheng/thriftserver and squashes the following commits:
      
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
  3. Jul 21, 2014
    • [SPARK-2434][MLlib]: Warning messages that point users to original MLlib implementations added to Examples · a4d60208
      Burak authored
      
      [SPARK-2434][MLlib]: Warning messages that refer users to the original MLlib implementations of some popular example machine-learning algorithms were added, both in the comments and in the code. The following examples have been modified:
      Scala:
      * LocalALS
      * LocalFileLR
      * LocalKMeans
      * LocalLR
      * SparkALS
      * SparkHdfsLR
      * SparkKMeans
      * SparkLR
      Python:
       * kmeans.py
       * als.py
       * logistic_regression.py
      
      Author: Burak <brkyvz@gmail.com>
      
      Closes #1515 from brkyvz/SPARK-2434 and squashes the following commits:
      
      7505da9 [Burak] [SPARK-2434][MLlib]: Warning messages added, scalastyle errors fixed, and added missing punctuation
      b96b522 [Burak] [SPARK-2434][MLlib]: Warning messages added and scalastyle errors fixed
      4762f39 [Burak] [SPARK-2434]: Warning messages added
      17d3d83 [Burak] SPARK-2434: Added warning messages to the naive implementations of the example algorithms
      2cb5301 [Burak] SPARK-2434: Warning messages redirecting to original implementations added.
  4. Jul 18, 2014
    • [MLlib] SPARK-1536: multiclass classification support for decision tree · d88f6be4
      Manish Amde authored
      The ability to perform multiclass classification is a big advantage for using decision trees and was a highly requested feature for MLlib. This pull request adds multiclass classification support to the MLlib decision tree. It also adds sample-weight support using a WeightedLabeledPoint class for handling unbalanced datasets during classification. It will also support algorithms such as AdaBoost, which require instances to be weighted.
      
      It handles the special case where the categorical variables cannot be ordered for multiclass classification, so the optimizations used to speed up binary classification cannot be applied directly. More specifically, for m categories in a categorical feature, it analyzes all ```2^(m-1) - 1``` categorical splits, provided the number of splits is less than the maxBins given in the input. This condition will not be met for features with a large number of categories -- using decision trees is not recommended for such datasets in general, since the categorical features are favored over continuous features. Moreover, the user can use a combination of tricks (increasing the bin size of the tree algorithms, using binary encoding for categorical features, or using a one-vs-all classification strategy) to avoid these constraints.
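
      As a quick illustration of the arithmetic above, the candidate-split count for an unordered categorical feature can be computed as follows (a sketch, not code from this PR):

      ```scala
      // Each unordered split sends a nonempty proper subset of the m categories to
      // the left child; halving to avoid counting mirrored splits twice gives
      // 2^(m-1) - 1 candidates.
      def numUnorderedSplits(m: Int): Int = (1 << (m - 1)) - 1

      // e.g. m = 4 categories -> 7 candidate splits, which must stay below maxBins
      ```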
      
      The new code is accompanied by unit tests and has also been tested on the iris and covtype datasets.
      
      cc: mengxr, etrain, hirakendu, atalwalkar, srowen
      
      Author: Manish Amde <manish9ue@gmail.com>
      Author: manishamde <manish9ue@gmail.com>
      Author: Evan Sparks <sparks@cs.berkeley.edu>
      
      Closes #886 from manishamde/multiclass and squashes the following commits:
      
      26f8acc [Manish Amde] another attempt at fixing mima
      c5b2d04 [Manish Amde] more MIMA fixes
      1ce7212 [Manish Amde] change problem filter for mima
      10fdd82 [Manish Amde] fixing MIMA excludes
      e1c970d [Manish Amde] merged master
      abf2901 [Manish Amde] adding classes to MimaExcludes.scala
      45e767a [Manish Amde] adding developer api annotation for overridden methods
      c8428c4 [Manish Amde] fixing weird multiline bug
      afced16 [Manish Amde] removed label weights support
      2d85a48 [Manish Amde] minor: fixed scalastyle issues reprise
      4e85f2c [Manish Amde] minor: fixed scalastyle issues
      b2ae41f [Manish Amde] minor: scalastyle
      e4c1321 [Manish Amde] using while loop for regression histograms
      d75ac32 [Manish Amde] removed WeightedLabeledPoint from this PR
      0fecd38 [Manish Amde] minor: add newline to EOF
      2061cf5 [Manish Amde] merged from master
      06b1690 [Manish Amde] fixed off-by-one error in bin to split conversion
      9cc3e31 [Manish Amde] added implicit conversion import
      5c1b2ca [Manish Amde] doc for PointConverter class
      485eaae [Manish Amde] implicit conversion from LabeledPoint to WeightedLabeledPoint
      3d7f911 [Manish Amde] updated doc
      8e44ab8 [Manish Amde] updated doc
      adc7315 [Manish Amde] support ordered categorical splits for multiclass classification
      e3e8843 [Manish Amde] minor code formatting
      23d4268 [Manish Amde] minor: another minor code style
      34ee7b9 [Manish Amde] minor: code style
      237762d [Manish Amde] renaming functions
      12e6d0a [Manish Amde] minor: removing line in doc
      9a90c93 [Manish Amde] Merge branch 'master' into multiclass
      1892a2c [Manish Amde] tests and use multiclass binaggregate length when at least one categorical feature is present
      f5f6b83 [Manish Amde] multiclass for continuous variables
      8cfd3b6 [Manish Amde] working for categorical multiclass classification
      828ff16 [Manish Amde] added categorical variable test
      bce835f [Manish Amde] code cleanup
      7e5f08c [Manish Amde] minor doc
      1dd2735 [Manish Amde] bin search logic for multiclass
      f16a9bb [Manish Amde] fixing while loop
      d811425 [Manish Amde] multiclass bin aggregate logic
      ab5cb21 [Manish Amde] multiclass logic
      d8e4a11 [Manish Amde] sample weights
      ed5a2df [Manish Amde] fixed classification requirements
      d012be7 [Manish Amde] fixed while loop
      18d2835 [Manish Amde] changing default values for num classes
      6b912dc [Manish Amde] added numclasses to tree runner, predict logic for multiclass, add multiclass option to train
      75f2bfc [Manish Amde] minor code style fix
      e547151 [Manish Amde] minor modifications
      34549d0 [Manish Amde] fixing error during merge
      098e8c5 [Manish Amde] merged master
      e006f9d [Manish Amde] changing variable names
      5c78e1a [Manish Amde] added multiclass support
      6c7af22 [Manish Amde] prepared for multiclass without breaking binary classification
      46e06ee [Manish Amde] minor mods
      3f85a17 [Manish Amde] tests for multiclass classification
      4d5f70c [Manish Amde] added multiclass support for find splits bins
      46f909c [Manish Amde] todo for multiclass support
      455bea9 [Manish Amde] fixed tests
      14aea48 [Manish Amde] changing instance format to weighted labeled point
      a1a6e09 [Manish Amde] added weighted point class
      968ca9d [Manish Amde] merged master
      7fc9545 [Manish Amde] added docs
      ce004a1 [Manish Amde] minor formatting
      b27ad2c [Manish Amde] formatting
      426bb28 [Manish Amde] programming guide blurb
      8053fed [Manish Amde] more formatting
      5eca9e4 [Manish Amde] grammar
      4731cda [Manish Amde] formatting
      5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
      cbd9f14 [Manish Amde] modified scala.math to math
      dad9652 [Manish Amde] removed unused imports
      e0426ee [Manish Amde] renamed parameter
      718506b [Manish Amde] added unit test
      1517155 [Manish Amde] updated documentation
      9dbdabe [Manish Amde] merge from master
      719d009 [Manish Amde] updating user documentation
      fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
      0287772 [Evan Sparks] Fixing scalastyle issue.
      2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
      2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
      abc5a23 [Evan Sparks] Parameterizing max memory.
      50b143a [Manish Amde] adding support for very deep trees
    • [SPARK-2570] [SQL] Fix the bug of ClassCastException · 29809a6d
      Cheng Hao authored
      An exception is thrown when running the HiveFromSpark example:
      ```
      Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
      	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
      	at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145)
      	at org.apache.spark.examples.sql.hive.HiveFromSpark$.main(HiveFromSpark.scala:45)
      	at org.apache.spark.examples.sql.hive.HiveFromSpark.main(HiveFromSpark.scala)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
      	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
      	at java.lang.reflect.Method.invoke(Method.java:606)
      	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:303)
      	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:55)
      	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
      ```
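
      The trace shows `GenericRow.getInt` unboxing a `Long` (a `COUNT(*)` result is a `Long`). A minimal sketch of the failure and the kind of fix it implies (hypothetical, not this PR's diff):

      ```scala
      import org.apache.spark.sql.catalyst.expressions.GenericRow

      val row = new GenericRow(Array[Any](42L)) // a COUNT(*) result is a Long
      // row.getInt(0)   // throws ClassCastException: Long cannot be cast to Integer
      row.getLong(0)     // reading the value as a Long succeeds
      ```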
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #1475 from chenghao-intel/hive_from_spark and squashes the following commits:
      
      d4c0500 [Cheng Hao] Fix the bug of ClassCastException
  5. Jul 10, 2014
    • SPARK-2427: Fix Scala examples that use the wrong command line arguments index · ae8ca4df
      Artjom-Metro authored
      The Scala examples HBaseTest and HdfsTest don't use the correct indexes for the command line arguments. This is due to the fix for SPARK-1565, where these examples were not correctly adapted to the new usage of the submit script.
      
      Author: Artjom-Metro <Artjom-Metro@users.noreply.github.com>
      Author: Artjom-Metro <artjom31415@googlemail.com>
      
      Closes #1353 from Artjom-Metro/fix_examples and squashes the following commits:
      
      6111801 [Artjom-Metro] Reduce the default number of iterations
      cfaa73c [Artjom-Metro] Fix some examples that use the wrong index to access the command line arguments
    • [SPARK-1776] Have Spark's SBT build read dependencies from Maven. · 628932b8
      Prashant Sharma authored
      This patch introduces the new way of working while also retaining the existing ways of doing things.
      
      For example, the build instruction for YARN in Maven is
      `mvn -Pyarn -Phadoop-2.2 clean package -DskipTests`
      while in SBT it can become
      `MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
      It also supports
      `sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:
      
      a8ac951 [Prashant Sharma] Updated sbt version.
      62b09bb [Prashant Sharma] Improvements.
      fa6221d [Prashant Sharma] Excluding sql from mima
      4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
      72651ca [Prashant Sharma] Addresses code review comments.
      acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
      ac4312c [Prashant Sharma] Revert "minor fix"
      6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
      65cf06c [Prashant Sharma] Servlet API jars mess up with the other servlet jars on the class path.
      446768e [Prashant Sharma] minor fix
      89b9777 [Prashant Sharma] Merge conflicts
      d0a02f2 [Prashant Sharma] Bumped up pom versions; since the build now depends on the pom, it is better updated there. + general cleanups.
      dccc8ac [Prashant Sharma] updated mima to check against 1.0
      a49c61b [Prashant Sharma] Fix for tools jar
      a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
      cf88758 [Prashant Sharma] cleanup
      9439ea3 [Prashant Sharma] Small fix to run-examples script.
      96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
      36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
      4973dbd [Patrick Wendell] Example build using pom reader.
    • Clean up SparkKMeans example's code · 2b18ea98
      Raymond Liu authored
      Remove unused code.
      
      Author: Raymond Liu <raymond.liu@intel.com>
      
      Closes #1352 from colorant/kmeans and squashes the following commits:
      
      ddcd1dd [Raymond Liu] Clean up SparkKMeans example's code
  6. Jul 07, 2014
  7. Jun 17, 2014
    • [SPARK-2060][SQL] Querying JSON Datasets with SQL and DSL in Spark SQL · d2f4f30b
      Yin Huai authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-2060
      
      Programming guide: http://yhuai.github.io/site/sql-programming-guide.html
      
      Scala doc of SQLContext: http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #999 from yhuai/newJson and squashes the following commits:
      
      227e89e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      ce8eedd [Yin Huai] rxin's comments.
      bc9ac51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      94ffdaa [Yin Huai] Remove "get" from method names.
      ce31c81 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      e2773a6 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      79ea9ba [Yin Huai] Fix typos.
      5428451 [Yin Huai] Newline
      1f908ce [Yin Huai] Remove extra line.
      d7a005c [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      7ea750e [Yin Huai] marmbrus's comments.
      6a5f5ef [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      83013fb [Yin Huai] Update Java Example.
      e7a6c19 [Yin Huai] SchemaRDD.javaToPython should convert a field with the StructType to a Map.
      6d20b85 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      4fbddf0 [Yin Huai] Programming guide.
      9df8c5a [Yin Huai] Python API.
      7027634 [Yin Huai] Java API.
      cff84cc [Yin Huai] Use a SchemaRDD for a JSON dataset.
      d0bd412 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      ab810b0 [Yin Huai] Make JsonRDD private.
      6df0891 [Yin Huai] Apache header.
      8347f2e [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      66f9e76 [Yin Huai] Update docs and use the entire dataset to infer the schema.
      8ffed79 [Yin Huai] Update the example.
      a5a4b52 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      4325475 [Yin Huai] If a sampled dataset is used for schema inferring, update the schema of the JsonTable after first execution.
      65b87f0 [Yin Huai] Fix sampling...
      8846af5 [Yin Huai] API doc.
      52a2275 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      0387523 [Yin Huai] Address PR comments.
      666b957 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      a2313a6 [Yin Huai] Address PR comments.
      f3ce176 [Yin Huai] After type conflict resolution, if a NullType is found, StringType is used.
      0576406 [Yin Huai] Add Apache license header.
      af91b23 [Yin Huai] Merge remote-tracking branch 'upstream/master' into newJson
      f45583b [Yin Huai] Infer the schema of a JSON dataset (a text file with one JSON object per line or an RDD[String] with one JSON object per string) and return a SchemaRDD.
      f31065f [Yin Huai] A query plan or a SchemaRDD can print out its schema.
  8. Jun 11, 2014
    • [SPARK-1672][MLLIB] Separate user and product partitioning in ALS · d9203350
      Tor Myklebust authored
      Some clean up work following #593.
      
      1. Allow setting different numbers of user blocks and product blocks in `ALS` (see the sketch below).
      2. Update `MovieLensALS` to reflect the change.
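
      A hedged usage sketch of change 1 (the setter names are assumed from the description above):

      ```scala
      import org.apache.spark.mllib.recommendation.ALS

      // Configure separate block counts for the user side and the product side,
      // instead of a single number of blocks for both.
      val als = new ALS()
        .setRank(10)
        .setIterations(20)
        .setUserBlocks(10)
        .setProductBlocks(40)
      ```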
      
      Author: Tor Myklebust <tmyklebu@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #1014 from mengxr/SPARK-1672 and squashes the following commits:
      
      0e910dd [Xiangrui Meng] change private[this] to private[recommendation]
      36420c7 [Xiangrui Meng] set exclusion rules for ALS
      9128b77 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
      294efe9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-1672
      9bab77b [Xiangrui Meng] clean up; add numUserBlocks and numProductBlocks to MovieLensALS
      84c8e8c [Xiangrui Meng] Merge branch 'master' into SPARK-1672
      d17a8bf [Xiangrui Meng] merge master
      a4925fd [Tor Myklebust] Style.
      bd8a75c [Tor Myklebust] Merge branch 'master' of github.com:apache/spark into alsseppar
      021f54b [Tor Myklebust] Separate user and product blocks.
      dcf583a [Tor Myklebust] Remove the partitioner member variable; instead, thread that needle everywhere it needs to go.
      23d6f91 [Tor Myklebust] Stop making the partitioner configurable.
      495784f [Tor Myklebust] Merge branch 'master' of https://github.com/apache/spark
      674933a [Tor Myklebust] Fix style.
      40edc23 [Tor Myklebust] Fix missing space.
      f841345 [Tor Myklebust] Fix daft bug creating 'pairs', also for -> foreach.
      5ec9e6c [Tor Myklebust] Clean a couple of things up using 'map'.
      36a0f43 [Tor Myklebust] Make the partitioner private.
      d872b09 [Tor Myklebust] Add negative id ALS test.
      df27697 [Tor Myklebust] Support custom partitioners.  Currently we use the same partitioner for users and products.
      c90b6d8 [Tor Myklebust] Scramble user and product ids before bucketing.
      c774d7d [Tor Myklebust] Make the partitioner a member variable and use it instead of modding directly.
  9. Jun 10, 2014
    • SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats · f971d6cb
      Nick Pentreath authored
      So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it.
      
      This adds initial support for reading Hadoop ```SequenceFile```s, as well as arbitrary Hadoop ```InputFormat```s, in PySpark.
      
      # Overview
      The basics are as follows:
      1. ```PythonRDD``` object contains the relevant methods, that are in turn invoked by ```SparkContext``` in PySpark
      2. The SequenceFile or InputFormat is read on the Scala side and converted from ```Writable``` instances to the relevant Scala classes (in the case of primitives)
      3. Pyrolite is used to serialize Java objects. If this fails, the fallback is ```toString```
      4. ```PickleSerializer``` on the Python side deserializes.
      
      This works "out the box" for simple ```Writable```s:
      * ```Text```
      * ```IntWritable```, ```DoubleWritable```, ```FloatWritable```
      * ```NullWritable```
      * ```BooleanWritable```
      * ```BytesWritable```
      * ```MapWritable```
      
      It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans conventions (i.e. with fields, a no-arg constructor, and getters/setters). (Perhaps in the future some sugar for case classes and reflection could be added.)
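
      For example, a hypothetical class that satisfies these conventions:

      ```scala
      // A "struct-like" class with a no-arg constructor and JavaBeans-style
      // getters/setters, which Pyrolite can map to a Python object.
      class PointBean {  // hypothetical name
        private var x: Double = 0.0
        private var y: Double = 0.0
        def getX(): Double = x
        def setX(v: Double): Unit = { x = v }
        def getY(): Double = y
        def setY(v: Double): Unit = { y = v }
      }
      ```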
      
      I've tested it out with ```ESInputFormat```  as an example and it works very nicely:
      ```python
      conf = {"es.resource" : "index/type" }
      rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
      rdd.first()
      ```
      
      I suspect that for things like HBase/Cassandra it will be a bit trickier to get it to work out of the box.
      
      # Some things still outstanding:
      1. ~~Requires ```msgpack-python``` and will fail without it. As originally discussed with Josh, add a ```as_strings``` argument that defaults to ```False```, that can be used if ```msgpack-python``` is not available~~
      2. ~~I see from https://github.com/apache/spark/pull/363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the ```msgpack```-based SerDe here to use Pyrolite wouldn't be too hard~~
      3. ~~Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or ```java.util.Map``` that can be easily serialized)~~
      4. Support ```saveAsSequenceFile``` and ```saveAsHadoopFile``` etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR
      
      Author: Nick Pentreath <nick.pentreath@gmail.com>
      
      Closes #455 from MLnick/pyspark-inputformats and squashes the following commits:
      
      268df7e [Nick Pentreath] Documentation changes per @pwendell comments
      761269b [Nick Pentreath] Address @pwendell comments, simplify default writable conversions and remove registry.
      4c972d8 [Nick Pentreath] Add license headers
      d150431 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      cde6af9 [Nick Pentreath] Parameterize converter trait
      5ebacfa [Nick Pentreath] Update docs for PySpark input formats
      a985492 [Nick Pentreath] Move Converter examples to own package
      365d0be [Nick Pentreath] Make classes private[python]. Add docs and @Experimental annotation to Converter interface.
      eeb8205 [Nick Pentreath] Fix path relative to SPARK_HOME in tests
      1eaa08b [Nick Pentreath] HBase -> Cassandra app name oversight
      3f90c3e [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      2c18513 [Nick Pentreath] Add examples for reading HBase and Cassandra InputFormats from Python
      b65606f [Nick Pentreath] Add converter interface
      5757f6e [Nick Pentreath] Default key/value classes for sequenceFile are None
      085b55f [Nick Pentreath] Move input format tests to tests.py and clean up docs
      43eb728 [Nick Pentreath] PySpark InputFormats docs into programming guide
      94beedc [Nick Pentreath] Clean up args in PythonRDD. Set key/value converter defaults to None for PySpark context.py methods
      1a4a1d6 [Nick Pentreath] Address @mateiz style comments
      01e0813 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      15a7d07 [Nick Pentreath] Remove default args for key/value classes. Arg names to camelCase
      9fe6bd5 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      84fe8e3 [Nick Pentreath] Python programming guide space formatting
      d0f52b6 [Nick Pentreath] Python programming guide
      7caa73a [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      93ef995 [Nick Pentreath] Add back context.py changes
      9ef1896 [Nick Pentreath] Recover earlier changes lost in previous merge for serializers.py
      077ecb2 [Nick Pentreath] Recover earlier changes lost in previous merge for context.py
      5af4770 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      35b8e3a [Nick Pentreath] Another fix for test ordering
      bef3afb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      e001b94 [Nick Pentreath] Fix test failures due to ordering
      78978d9 [Nick Pentreath] Add doc for SequenceFile and InputFormat support to Python programming guide
      64eb051 [Nick Pentreath] Scalastyle fix
      e7552fa [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      44f2857 [Nick Pentreath] Remove msgpack dependency and switch serialization to Pyrolite, plus some clean up and refactoring
      c0ebfb6 [Nick Pentreath] Change sequencefile test data generator to easily be called from PySpark tests
      1d7c17c [Nick Pentreath] Amend tests to auto-generate sequencefile data in temp dir
      17a656b [Nick Pentreath] remove binary sequencefile for tests
      f60959e [Nick Pentreath] Remove msgpack dependency and serializer from PySpark
      450e0a2 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      31a2fff [Nick Pentreath] Scalastyle fixes
      fc5099e [Nick Pentreath] Add Apache license headers
      4e08983 [Nick Pentreath] Clean up docs for PySpark context methods
      b20ec7e [Nick Pentreath] Clean up merge duplicate dependencies
      951c117 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      f6aac55 [Nick Pentreath] Bring back msgpack
      9d2256e [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      1bbbfb0 [Nick Pentreath] Clean up SparkBuild from merge
      a67dfad [Nick Pentreath] Clean up Msgpack serialization and registering
      7237263 [Nick Pentreath] Add back msgpack serializer and hadoop file code lost during merging
      25da1ca [Nick Pentreath] Add generator for nulls, bools, bytes and maps
      65360d5 [Nick Pentreath] Adding test SequenceFiles
      0c612e5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      d72bf18 [Nick Pentreath] msgpack
      dd57922 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      e67212a [Nick Pentreath] Add back msgpack dependency
      f2d76a0 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      41856a5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      97ef708 [Nick Pentreath] Remove old writeToStream
      2beeedb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      795a763 [Nick Pentreath] Change name to WriteInputFormatTestDataGenerator. Cleanup some var names. Use SPARK_HOME in path for writing test sequencefile data.
      174f520 [Nick Pentreath] Add back graphx settings
      703ee65 [Nick Pentreath] Add back msgpack
      619c0fa [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      1c8efbc [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      eb40036 [Nick Pentreath] Remove unused comment lines
      4d7ef2e [Nick Pentreath] Fix indentation
      f1d73e3 [Nick Pentreath] mergeConfs returns a copy rather than mutating one of the input arguments
      0f5cd84 [Nick Pentreath] Remove unused pair UTF8 class. Add comments to msgpack deserializer
      4294cbb [Nick Pentreath] Add old Hadoop api methods. Clean up and expand comments. Clean up argument names
      818a1e6 [Nick Pentreath] Add sequencefile and Hadoop InputFormat support to PythonRDD
      4e7c9e3 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      c304cc8 [Nick Pentreath] Adding supporting sequencefiles for tests. Cleaning up
      4b0a43f [Nick Pentreath] Refactoring utils into own objects. Cleaning up old commented-out code
      d86325f [Nick Pentreath] Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop InputFormat
  10. Jun 08, 2014
  11. Jun 05, 2014
  12. Jun 04, 2014
    • [SPARK-1752][MLLIB] Standardize text format for vectors and labeled points · 189df165
      Xiangrui Meng authored
      We should standardize the text format used to represent vectors and labeled points. The proposed formats are the following:
      
      1. dense vector: `[v0,v1,..]`
      2. sparse vector: `(size,[i0,i1],[v0,v1])`
      3. labeled point: `(label,vector)`
      
      where "(..)" indicates a tuple and "[...]" indicate an array. `loadLabeledPoints` is added to pyspark's `MLUtils`. I didn't add `loadVectors` to pyspark because `RDD.saveAsTextFile` cannot stringify dense vectors in the proposed format automatically.
      
      `MLUtils#saveLabeledData` and `MLUtils#loadLabeledData` are deprecated. Users should use `RDD#saveAsTextFile` and `MLUtils#loadLabeledPoints` instead. In Scala, `MLUtils#loadLabeledPoints` is compatible with the format used by `MLUtils#loadLabeledData`.
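
      A minimal sketch of the proposed formats in use (assuming the 1.0-era MLlib API and an existing `SparkContext` named `sc`):

      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      import org.apache.spark.mllib.regression.LabeledPoint
      import org.apache.spark.mllib.util.MLUtils

      val dense  = Vectors.dense(0.5, 0.0, 1.5)                    // [0.5,0.0,1.5]
      val sparse = Vectors.sparse(3, Array(0, 2), Array(0.5, 1.5)) // (3,[0,2],[0.5,1.5])
      val lp     = LabeledPoint(1.0, sparse)                       // (1.0,(3,[0,2],[0.5,1.5]))

      // Round-trip: save with RDD#saveAsTextFile, load with the new loader.
      sc.parallelize(Seq(lp)).saveAsTextFile("points")
      val loaded = MLUtils.loadLabeledPoints(sc, "points")
      ```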
      
      CC: @mateiz, @srowen
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #685 from mengxr/labeled-io and squashes the following commits:
      
      2d1116a [Xiangrui Meng] make loadLabeledData/saveLabeledData deprecated since 1.0.1
      297be75 [Xiangrui Meng] change LabeledPoint.parse to LabeledPointParser.parse to maintain binary compatibility
      d6b1473 [Xiangrui Meng] Merge branch 'master' into labeled-io
      56746ea [Xiangrui Meng] replace # by .
      623a5f0 [Xiangrui Meng] merge master
      f06d5ba [Xiangrui Meng] add docs and minor updates
      640fe0c [Xiangrui Meng] throw SparkException
      5bcfbc4 [Xiangrui Meng] update test to add scientific notations
      e86bf38 [Xiangrui Meng] remove NumericTokenizer
      050fca4 [Xiangrui Meng] use StringTokenizer
      6155b75 [Xiangrui Meng] merge master
      f644438 [Xiangrui Meng] remove parse methods based on eval from pyspark
      a41675a [Xiangrui Meng] python loadLabeledPoint uses Scala's implementation
      ce9a475 [Xiangrui Meng] add deserialize_labeled_point to pyspark with tests
      e9fcd49 [Xiangrui Meng] add serializeLabeledPoint and tests
      aea4ae3 [Xiangrui Meng] minor updates
      810d6df [Xiangrui Meng] update tokenizer/parser implementation
      7aac03a [Xiangrui Meng] remove Scala parsers
      c1885c1 [Xiangrui Meng] add headers and minor changes
      b0c50cb [Xiangrui Meng] add customized parser
      d731817 [Xiangrui Meng] style update
      63dc396 [Xiangrui Meng] add loadLabeledPoints to pyspark
      ea122b5 [Xiangrui Meng] Merge branch 'master' into labeled-io
      cd6c78f [Xiangrui Meng] add __str__ and parse to LabeledPoint
      a7a178e [Xiangrui Meng] add stringify to pyspark's Vectors
      5c2dbfa [Xiangrui Meng] add parse to pyspark's Vectors
      7853f88 [Xiangrui Meng] update pyspark's SparseVector.__str__
      e761d32 [Xiangrui Meng] make LabelPoint.parse compatible with the dense format used before v1.0 and deprecate loadLabeledData and saveLabeledData
      9e63a02 [Xiangrui Meng] add loadVectors and loadLabeledPoints
      19aa523 [Xiangrui Meng] update toString and add parsers for Vectors and LabeledPoint
  13. Jun 03, 2014
    • Synthetic GraphX Benchmark · 894ecde0
      Joseph E. Gonzalez authored
      This PR accomplishes two things:
      
      1. It introduces a Synthetic Benchmark application that generates an arbitrarily large log-normal graph and executes either PageRank or connected components on the graph. This can be used to profile the GraphX system on arbitrary clusters without access to large graph datasets.
      
      2. This PR improves the implementation of the log-normal graph generator.
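
      The core of a log-normal generator is small; a sketch of the sampling step (parameter names are illustrative):

      ```scala
      import java.util.Random

      // Sample an out-degree from a log-normal distribution: exponentiate a
      // Gaussian draw scaled by sigma and shifted by mu, capped at maxDegree.
      def sampleLogNormal(mu: Double, sigma: Double, maxDegree: Int, rand: Random): Int = {
        val d = math.exp(mu + sigma * rand.nextGaussian()).round.toInt
        math.min(d, maxDegree)
      }
      ```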
      
      Author: Joseph E. Gonzalez <joseph.e.gonzalez@gmail.com>
      Author: Ankur Dave <ankurdave@gmail.com>
      
      Closes #720 from jegonzal/graphx_synth_benchmark and squashes the following commits:
      
      e40812a [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
      bccccad [Ankur Dave] Fix long lines
      374678a [Ankur Dave] Bugfix and style changes
      1bdf39a [Joseph E. Gonzalez] updating options
      d943972 [Joseph E. Gonzalez] moving the benchmark application into the examples folder.
      f4f839a [Joseph E. Gonzalez] Creating a synthetic benchmark script.
    • fix java.lang.ClassCastException · aa41a522
      baishuo(白硕) authored
      An exception occurs when running: bin/run-example org.apache.spark.examples.sql.RDDRelation
      The exception detail is:
      ```
      Exception in thread "main" java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer
      	at scala.runtime.BoxesRunTime.unboxToInt(BoxesRunTime.java:106)
      	at org.apache.spark.sql.catalyst.expressions.GenericRow.getInt(Row.scala:145)
      	at org.apache.spark.examples.sql.RDDRelation$.main(RDDRelation.scala:49)
      	at org.apache.spark.examples.sql.RDDRelation.main(RDDRelation.scala)
      ```
      change sql("SELECT COUNT(*) FROM records").collect().head.getInt(0) to sql("SELECT COUNT(*) FROM records").collect().head.getLong(0), then the Exception do not occur any more
      
      Author: baishuo(白硕) <vc_java@hotmail.com>
      
      Closes #949 from baishuo/master and squashes the following commits:
      
      f4b319f [baishuo(白硕)] fix java.lang.ClassCastException
  14. May 25, 2014
  15. May 19, 2014
    • [SPARK-1874][MLLIB] Clean up MLlib sample data · bcb9dce6
      Xiangrui Meng authored
      1. Added synthetic datasets for `MovieLensALS`, `LinearRegression`, `BinaryClassification`.
      2. Embedded instructions in the help message of those example apps.
      
      Per discussion with Matei on the JIRA page, new example data is under `data/mllib`.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #833 from mengxr/mllib-sample-data and squashes the following commits:
      
      59f0a18 [Xiangrui Meng] add sample binary classification data
      3c2f92f [Xiangrui Meng] add linear regression data
      050f1ca [Xiangrui Meng] add a sample dataset for MovieLensALS example
  16. May 17, 2014
    • [SPARK-1824] Remove <master> from Python examples · cf6cbe9f
      Andrew Or authored
      A recent PR (#552) fixed this for all Scala / Java examples. We need to do it for Python too.
      
      Note that this blocks on #799, which makes `bin/pyspark` go through Spark submit. With only the changes in this PR, the only way to run these examples is through Spark submit. Once #799 goes in, you can use `bin/pyspark` to run them too. For example,
      
      ```
      bin/pyspark examples/src/main/python/pi.py 100 --master local-cluster[4,1,512]
      ```
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #802 from andrewor14/python-examples and squashes the following commits:
      
      cf50b9f [Andrew Or] De-indent python comments (minor)
      50f80b1 [Andrew Or] Remove pyFiles from SparkContext construction
      c362f69 [Andrew Or] Update docs to use spark-submit for python applications
      7072c6a [Andrew Or] Merge branch 'master' of github.com:apache/spark into python-examples
      427a5f0 [Andrew Or] Update docs
      d32072c [Andrew Or] Remove <master> from examples + update usages
  17. May 14, 2014
    • Fixed streaming examples docs to use run-example instead of spark-submit · 68f28dab
      Tathagata Das authored
      Pretty self-explanatory
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #722 from tdas/example-fix and squashes the following commits:
      
      7839979 [Tathagata Das] Minor changes.
      0673441 [Tathagata Das] Fixed java docs of java streaming example
      e687123 [Tathagata Das] Fixed scala style errors.
      9b8d112 [Tathagata Das] Fixed streaming examples docs to use run-example instead of spark-submit.
  18. May 10, 2014
    • SPARK-1789. Multiple versions of Netty dependencies cause FlumeStreamSuite failure · 2b7bd29e
      Sean Owen authored
      TL;DR is there is a bit of JAR hell trouble with Netty, that can be mostly resolved and will resolve a test failure.
      
      I hit the error described at http://apache-spark-user-list.1001560.n3.nabble.com/SparkContext-startup-time-out-td1753.html while running FlumeStreamSuite, and have for a short while (is it just me?).
      
      velvia notes:
      "I have found a workaround.  If you add akka 2.2.4 to your dependencies, then everything works, probably because akka 2.2.4 brings in newer version of Jetty."
      
      There are at least 3 versions of Netty in play in the build:
      
      - the new Flume 1.4.0 dependency brings in io.netty:netty:3.4.0.Final, and that is the immediate problem
      - the custom version of akka 2.2.3 depends on io.netty:netty:3.6.6.
      - but, Spark Core directly uses io.netty:netty-all:4.0.17.Final
      
      The POMs try to exclude other versions of netty, but are excluding org.jboss.netty:netty, when in fact older versions of io.netty:netty (not netty-all) are also an issue.
      
      The org.jboss.netty:netty excludes are largely unnecessary. I replaced many of them with io.netty:netty exclusions until everything agreed on io.netty:netty-all:4.0.17.Final.
      
      But this didn't work, since Akka 2.2.3 doesn't work with Netty 4.x. Down-grading to 3.6.6.Final across the board made some Spark code not compile.
      
      If the build *keeps* io.netty:netty:3.6.6.Final as well, everything seems to work. Part of the reason seems to be that Netty 3.x used the old `org.jboss.netty` packages. This is less than ideal, but is no worse than the current situation.
      
      So this PR resolves the issue and improves the JAR hell, even if it leaves the existing theoretical Netty 3-vs-4 conflict:
      
      - Remove org.jboss.netty excludes where possible, for clarity; they're not needed except with Hadoop artifacts
      - Add io.netty:netty excludes where needed -- except, let akka keep its io.netty:netty
      - Change a bit of test code that actually depended on Netty 3.x, to use 4.x equivalent
      - Update SBT build accordingly
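
      In SBT terms, the exclusion pattern looks roughly like this (a sketch; the artifact coordinates are illustrative):

      ```scala
      // Exclude the transitive io.netty:netty 3.x that Flume 1.4.0 drags in,
      // so the build converges on io.netty:netty-all.
      libraryDependencies += "org.apache.flume" % "flume-ng-sdk" % "1.4.0" excludeAll(
        ExclusionRule(organization = "io.netty", name = "netty")
      )
      ```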
      
      A better change would be to update Akka far enough such that it agrees on Netty 4.x, but I don't know if that's feasible.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #723 from srowen/SPARK-1789 and squashes the following commits:
      
      43661b7 [Sean Owen] Update and add Netty excludes to prevent some JAR conflicts that cause test issues
    • SPARK-1708. Add a ClassTag on Serializer and things that depend on it · 7eefc9d2
      Matei Zaharia authored
      This pull request contains a rebased patch from @heathermiller (https://github.com/heathermiller/spark/pull/1) to add ClassTags on Serializer and types that depend on it (Broadcast and AccumulableCollection). Putting these in the public API signatures now will allow us to use Scala Pickling for serialization down the line without breaking binary compatibility.
      
      One question remaining is whether we also want them on Accumulator -- Accumulator is passed as part of a bigger Task or TaskResult object via the closure serializer so it doesn't seem super useful to add the ClassTag there. Broadcast and AccumulableCollection in contrast were being serialized directly.
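
      As a reminder of what a `ClassTag` buys in a serialization-facing signature, a small sketch (names are hypothetical):

      ```scala
      import scala.reflect.ClassTag

      // The context bound [T: ClassTag] carries T's runtime class through type
      // erasure, which a pickling-based serializer can use to rebuild typed values.
      def describe[T: ClassTag](value: T): String =
        s"$value: ${implicitly[ClassTag[T]].runtimeClass.getName}"

      println(describe(Array(1, 2, 3))) // the tag reports the runtime array class
      ```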
      
      CC @rxin, @pwendell, @heathermiller
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #700 from mateiz/spark-1708 and squashes the following commits:
      
      1a3d8b0 [Matei Zaharia] Use fake ClassTag in Java
      3b449ed [Matei Zaharia] test fix
      2209a27 [Matei Zaharia] Code style fixes
      9d48830 [Matei Zaharia] Add a ClassTag on Serializer and things that depend on it
  19. May 08, 2014
    • Fixing typo in als.py · 5c5e7d58
      Evan Sparks authored
      XtY should be Xty.
      
      Author: Evan Sparks <evan.sparks@gmail.com>
      
      Closes #696 from etrain/patch-2 and squashes the following commits:
      
      634cb8d [Evan Sparks] Fixing typo in als.py
    • SPARK-1565, update examples to be used with spark-submit script. · 44dd57fb
      Prashant Sharma authored
      Commit for initial feedback; basically, I am curious whether we should prompt the user to provide args, especially when they are mandatory. And can we skip them if they are not?
      
      Also, a few other things did not work, like
      `bin/spark-submit examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.SparkALS --arg 100 500 10 5 2`
      
      Not all the args get passed properly; maybe I have messed something up. I will try to sort it out.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #552 from ScrapCodes/SPARK-1565/update-examples and squashes the following commits:
      
      669dd23 [Prashant Sharma] Review comments
      2727e70 [Prashant Sharma] SPARK-1565, update examples to be used with spark-submit script.
  20. May 07, 2014
    • Use numpy directly for matrix multiply. · 6ed7e2cd
      Evan Sparks authored
      Using matrix multiply to compute XtX and XtY yields a 5-20x speedup depending on problem size.
      
      For example, the following takes 19s locally after this change vs. 5m21s before the change (a 16x speedup):
      `bin/pyspark examples/src/main/python/als.py local[8] 1000 1000 50 10 10`
      
      Author: Evan Sparks <evan.sparks@gmail.com>
      
      Closes #687 from etrain/patch-1 and squashes the following commits:
      
      e094dbc [Evan Sparks] Touching only diagonals on update.
      d1ab9b6 [Evan Sparks] Use numpy directly for matrix multiply.
    • SPARK-1668: Add implicit preference as an option to examples/MovieLensALS · 108c4c16
      Sandeep authored
      Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #597 from techaddict/SPARK-1668 and squashes the following commits:
      
      8b371dc [Sandeep] Second Pass on reviews by mengxr
      eca9d37 [Sandeep] based on mengxr's suggestions
      937e54c [Sandeep] Changes
      5149d40 [Sandeep] Changes based on review
      1dd7657 [Sandeep] use mean()
      42444d7 [Sandeep] Based on Suggestions by mengxr
      e3082fa [Sandeep] SPARK-1668: Add implicit preference as an option to examples/MovieLensALS Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/
    • SPARK-1544 Add support for deep decision trees. · f269b016
      Manish Amde authored
      @etrain and I came up with a PR for arbitrarily deep decision trees at the cost of multiple passes over the data at deep tree levels.
      
      To summarize:
      1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver).
      2) Using that information, we calculate two things - the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth.
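
      A rough sketch of the sizing arithmetic described above (all names are hypothetical):

      ```scala
      // Level d of a binary tree trains 2^d nodes at once, so find the deepest
      // level whose nodes fit in the per-worker budget; deeper levels are then
      // trained in fixed-size groups, one pass over the data per group.
      def nodesAtLevel(d: Int): Long = 1L << d

      def maxParallelLevel(bytesPerNode: Long, budgetBytes: Long): Int = {
        var d = 0
        while (nodesAtLevel(d + 1) * bytesPerNode <= budgetBytes) d += 1
        d
      }

      def groupSize(bytesPerNode: Long, budgetBytes: Long): Long =
        math.max(1L, budgetBytes / bytesPerNode) // nodes trainable per pass
      ```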
      
      cc: @atalwalkar, @hirakendu, @mengxr
      
      Author: Manish Amde <manish9ue@gmail.com>
      Author: manishamde <manish9ue@gmail.com>
      Author: Evan Sparks <sparks@cs.berkeley.edu>
      
      Closes #475 from manishamde/deep_tree and squashes the following commits:
      
      968ca9d [Manish Amde] merged master
      7fc9545 [Manish Amde] added docs
      ce004a1 [Manish Amde] minor formatting
      b27ad2c [Manish Amde] formatting
      426bb28 [Manish Amde] programming guide blurb
      8053fed [Manish Amde] more formatting
      5eca9e4 [Manish Amde] grammar
      4731cda [Manish Amde] formatting
      5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
      cbd9f14 [Manish Amde] modified scala.math to math
      dad9652 [Manish Amde] removed unused imports
      e0426ee [Manish Amde] renamed parameter
      718506b [Manish Amde] added unit test
      1517155 [Manish Amde] updated documentation
      9dbdabe [Manish Amde] merge from master
      719d009 [Manish Amde] updating user documentation
      fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
      0287772 [Evan Sparks] Fixing scalastyle issue.
      2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
      2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
      abc5a23 [Evan Sparks] Parameterizing max memory.
      50b143a [Manish Amde] adding support for very deep trees
  21. May 06, 2014
    • [HOTFIX] SPARK-1637: There are some Streaming examples added after the PR #571 was last updated. · fdae095d
      Sandeep authored
      This resulted in compilation errors.
      cc @mateiz -- the project is not compiling currently.
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #673 from techaddict/SPARK-1637-HOTFIX and squashes the following commits:
      
      b512f4f [Sandeep] [SPARK-1637][HOTFIX] There are some Streaming examples added after the PR #571 was last updated. This resulted in Compilation Errors.
    • SPARK-1637: Clean up examples for 1.0 · a000b5c3
      Sandeep authored
      - [x] Move all of them into subpackages of org.apache.spark.examples (right now some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
      - [x] Move Python examples into examples/src/main/python
      - [x] Update docs to reflect these changes
      
      Author: Sandeep <sandeep@techaddict.me>
      
      This patch had conflicts when merged, resolved by
      Committer: Matei Zaharia <matei@databricks.com>
      
      Closes #571 from techaddict/SPARK-1637 and squashes the following commits:
      
      47ef86c [Sandeep] Changes based on Discussions on PR, removing use of RawTextHelper from examples
      8ed2d3f [Sandeep] Docs Updated for changes, Change for java examples
      5f96121 [Sandeep] Move Python examples into examples/src/main/python
      0a8dd77 [Sandeep] Move all Scala Examples to org.apache.spark.examples (some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
  22. May 05, 2014
    • [SPARK-1594][MLLIB] Cleaning up MLlib APIs and guide · 98750a74
      Xiangrui Meng authored
      Final pass before the v1.0 release.
      
      * Remove `VectorRDDs`
      * Move `BinaryClassificationMetrics` from `evaluation.binary` to `evaluation`
      * Change the default value of `addIntercept` to false and allow adding an intercept in Ridge and Lasso.
      * Clean `DecisionTree` package doc and test suite.
      * Mark model constructors `private[spark]`
      * Rename `loadLibSVMData` to `loadLibSVMFile` and hide `LabelParser` from users.
      * Add `saveAsLibSVMFile`.
      * Add `appendBias` to `MLUtils`.
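
      A minimal sketch of the renamed loader and the new saver from the list above (assuming an existing `SparkContext` named `sc`):

      ```scala
      import org.apache.spark.mllib.util.MLUtils

      // Load LIBSVM data, append a bias term to each feature vector, and save
      // the result back out in LIBSVM format.
      val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
      val withBias = data.map(lp => lp.copy(features = MLUtils.appendBias(lp.features)))
      MLUtils.saveAsLibSVMFile(withBias, "out/sample_libsvm")
      ```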
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #524 from mengxr/mllib-cleaning and squashes the following commits:
      
      295dc8b [Xiangrui Meng] update loadLibSVMFile doc
      1977ac1 [Xiangrui Meng] fix doc of appendBias
      649fcf0 [Xiangrui Meng] rename loadLibSVMData to loadLibSVMFile; hide LabelParser from user APIs
      54b812c [Xiangrui Meng] add appendBias
      a71e7d0 [Xiangrui Meng] add saveAsLibSVMFile
      d976295 [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
      b7e5cec [Xiangrui Meng] remove some experimental annotations and make model constructors private[mllib]
      9b02b93 [Xiangrui Meng] minor code style update
      a593ddc [Xiangrui Meng] fix python tests
      fc28c18 [Xiangrui Meng] mark more classes experimental
      f6cbbff [Xiangrui Meng] fix Java tests
      0af70b0 [Xiangrui Meng] minor
      6e139ef [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
      94e6dce [Xiangrui Meng] move BinaryLabelCounter and BinaryConfusionMatrixImpl to evaluation.binary
      df34907 [Xiangrui Meng] clean DecisionTreeSuite to use LocalSparkContext
      c81807f [Xiangrui Meng] set the default value of AddIntercept to false
      03389c0 [Xiangrui Meng] allow to add intercept in Ridge and Lasso
      c66c56f [Xiangrui Meng] move tree md to package object doc
      a2695df [Xiangrui Meng] update guide for BinaryClassificationMetrics
      9194f4c [Xiangrui Meng] move BinaryClassificationMetrics one level up
      1c1a0e3 [Xiangrui Meng] remove VectorRDDs because it only contains one function that is not necessary for us to maintain
    • [SPARK-1504], [SPARK-1505], [SPARK-1558] Updated Spark Streaming guide · a975a19f
      Tathagata Das authored
      - SPARK-1558: Updated custom receiver guide to match it with the new API
      - SPARK-1504: Added deployment and monitoring subsection to streaming
      - SPARK-1505: Added migration guide for migrating from 0.9.x and below to Spark 1.0
      - Updated various Java streaming examples to use JavaReceiverInputDStream to highlight the API change.
      - Removed the requirement for cleaner ttl from streaming guide
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #652 from tdas/doc-fix and squashes the following commits:
      
      cb4f4b7 [Tathagata Das] Possible fix for flaky graceful shutdown test.
      ab71f7f [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into doc-fix
      8d6ff9b [Tathagata Das] Added migration guide to Spark Streaming.
      7d171df [Tathagata Das] Added reference to JavaReceiverInputStream in examples and streaming guide.
      49edd7c [Tathagata Das] Change java doc links to use Java docs.
      11528d7 [Tathagata Das] Updated links on index page.
      ff80970 [Tathagata Das] More updates to streaming guide.
      4dc42e9 [Tathagata Das] Added monitoring and other documentation in the streaming guide.
      14c6564 [Tathagata Das] Updated custom receiver guide.
  23. Apr 30, 2014
    • Handle the vals that are never used · 7025dda8
      WangTao authored
      In XORShiftRandom.scala, use the val "million" instead of the constant "1e6.toInt".
      Delete vals that are never used in other files.
      
      Author: WangTao <barneystinson@aliyun.com>
      
      Closes #565 from WangTaoTheTonic/master and squashes the following commits:
      
      17cacfc [WangTao] Handle the unused assignment, method parameters and symbol inspected by Intellij IDEA
      37b4090 [WangTao] Handle the vals that are never used
  24. Apr 29, 2014
    • [SPARK-1636][MLLIB] Move main methods to examples · 3f38334f
      Xiangrui Meng authored
      * `NaiveBayes` -> `SparseNaiveBayes`
      * `KMeans` -> `DenseKMeans`
      * `SVMWithSGD` and `LogisticRegressionWithSGD` -> `BinaryClassification`
      * `ALS` -> `MovieLensALS`
      * `LinearRegressionWithSGD`, `LassoWithSGD`, and `RidgeRegressionWithSGD` -> `LinearRegression`
      * `DecisionTree` -> `DecisionTreeRunner`
      
      `scopt` is used for parsing command-line parameters; it is MIT-licensed and depends only on `scala-library`. A minimal parser sketch follows the help message below.
      
      Example help message:
      
      ~~~
      BinaryClassification: an example app for binary classification.
      Usage: BinaryClassification [options] <input>
      
        --numIterations <value>
              number of iterations
        --stepSize <value>
              initial step size, default: 1.0
        --algorithm <value>
              algorithm (SVM,LR), default: LR
        --regType <value>
              regularization type (L1,L2), default: L2
        --regParam <value>
              regularization parameter, default: 0.1
        <input>
              input paths to labeled examples in LIBSVM format
      ~~~
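      
      A minimal sketch of a scopt-based parser along these lines, assuming scopt 3.x; the `Params` fields, defaults, and the run placeholder are illustrative, not the exact code in the Spark examples:
      
      ~~~
      import scopt.OptionParser

      // Illustrative parameter holder; fields and defaults mirror the help
      // message above but are not the exact ones in the example apps.
      case class Params(
          input: String = null,
          numIterations: Int = 100,
          stepSize: Double = 1.0,
          regParam: Double = 0.1)

      object BinaryClassificationSketch {
        def main(args: Array[String]): Unit = {
          val parser = new OptionParser[Params]("BinaryClassification") {
            head("BinaryClassification: an example app for binary classification.")
            opt[Int]("numIterations")
              .text("number of iterations")
              .action((x, c) => c.copy(numIterations = x))
            opt[Double]("stepSize")
              .text("initial step size, default: 1.0")
              .action((x, c) => c.copy(stepSize = x))
            opt[Double]("regParam")
              .text("regularization parameter, default: 0.1")
              .action((x, c) => c.copy(regParam = x))
            arg[String]("<input>")
              .required()
              .text("input paths to labeled examples in LIBSVM format")
              .action((x, c) => c.copy(input = x))
          }
          parser.parse(args, Params()) match {
            case Some(params) => println(s"Running with $params") // a real app would call run(params)
            case None => sys.exit(1)
          }
        }
      }
      ~~~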
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #584 from mengxr/mllib-main and squashes the following commits:
      
      7b58c60 [Xiangrui Meng] minor
      6e35d7e [Xiangrui Meng] make imports explicit and fix code style
      c6178c9 [Xiangrui Meng] update TS PCA/SVD to use new spark-submit
      6acff75 [Xiangrui Meng] use scopt for DecisionTreeRunner
      be86069 [Xiangrui Meng] use main instead of extending App
      b3edf68 [Xiangrui Meng] move DecisionTree's main method to examples
      8bfaa5a [Xiangrui Meng] change NaiveBayesParams to Params
      fe23dcb [Xiangrui Meng] remove main from KMeans and add DenseKMeans as an example
      67f4448 [Xiangrui Meng] remove main methods from linear regression algorithms and add LinearRegression example
      b066bbc [Xiangrui Meng] remove main from ALS and add MovieLensALS example
      b040f3b [Xiangrui Meng] change BinaryClassificationParams to Params
      577945b [Xiangrui Meng] remove unused imports from NB
      3d299bc [Xiangrui Meng] remove main from LR/SVM and add an example app for binary classification
      f70878e [Xiangrui Meng] remove main from NaiveBayes and add an example NaiveBayes app
      01ec2cd [Xiangrui Meng] Merge branch 'master' into mllib-main
      9420692 [Xiangrui Meng] add scopt to examples dependencies
      3f38334f
    • witgo's avatar
      Improved build configuration · 030f2c21
      witgo authored
      1. Fix SPARK-1441: Spark core fails to compile with Hadoop 0.23.x
      2. Fix SPARK-1491: the Maven hadoop-provided profile fails to build
      3. Fix inconsistent dependency versions for org.scala-lang:* and org.apache.avro:*
      4. Reformat sql/catalyst/pom.xml, sql/hive/pom.xml, and sql/core/pom.xml (four-space indentation changed to two spaces)
      
      Author: witgo <witgo@qq.com>
      
      Closes #480 from witgo/format_pom and squashes the following commits:
      
      03f652f [witgo] review commit
      b452680 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      bee920d [witgo] revert fix SPARK-1629: Spark Core missing commons-lang dependence
      7382a07 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      6902c91 [witgo] fix SPARK-1629: Spark Core missing commons-lang dependence
      0da4bc3 [witgo] merge master
      d1718ed [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      e345919 [witgo] add avro dependency to yarn-alpha
      77fad08 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      62d0862 [witgo] Fix org.scala-lang: * inconsistent versions dependency
      1a162d7 [witgo] Merge branch 'master' of https://github.com/apache/spark into format_pom
      934f24d [witgo] review commit
      cf46edc [witgo] exclude jruby
      06e7328 [witgo] Merge branch 'SparkBuild' into format_pom
      99464d2 [witgo] fix maven hadoop-provided profile fails to build
      0c6c1fc [witgo] Fix compile spark core error with hadoop 0.23.x
      6851bec [witgo] Maintain consistent SparkBuild.scala, pom.xml
      030f2c21
  25. Apr 28, 2014
    • Tathagata Das's avatar
      [SPARK-1633][Streaming] Java API unit test and example for custom streaming receiver in Java · 1d84964b
      Tathagata Das authored
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #558 from tdas/more-fixes and squashes the following commits:
      
      c0c84e6 [Tathagata Das] Removing extra println()
      d8a8cf4 [Tathagata Das] More tweaks to make unit test work in Jenkins.
      b7caa98 [Tathagata Das] More tweaks.
      d337367 [Tathagata Das] More tweaks
      22d6f2d [Tathagata Das] Merge remote-tracking branch 'apache/master' into more-fixes
      40a961b [Tathagata Das] Modified java test to reduce flakiness.
      9410ca6 [Tathagata Das] Merge remote-tracking branch 'apache/master' into more-fixes
      86d9147 [Tathagata Das] scala style fix
      2f3d7b1 [Tathagata Das] Added Scala custom receiver example.
      d677611 [Tathagata Das] Merge remote-tracking branch 'apache/master' into more-fixes
      bec3fc2 [Tathagata Das] Added license.
      51d6514 [Tathagata Das] Fixed docs on receiver.
      81aafa0 [Tathagata Das] Added Java test for Receiver API, and added JavaCustomReceiver example.
      1d84964b
  26. Apr 25, 2014
    • baishuo(白硕)'s avatar
      Update KafkaWordCount.scala · 8aaef5c7
      baishuo(白硕) authored
      Modify the required number of arguments (see the sketch below).
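      
      A sketch of the kind of argument-count check this change adjusts; the count and usage string are illustrative, not necessarily the example's exact values:
      
      ~~~
      // Fail fast when too few arguments are supplied; the required count
      // and usage text here are illustrative placeholders.
      if (args.length < 4) {
        System.err.println("Usage: KafkaWordCount <zkQuorum> <group> <topics> <numThreads>")
        System.exit(1)
      }
      ~~~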
      
      Author: baishuo(白硕) <vc_java@hotmail.com>
      
      Closes #523 from baishuo/master and squashes the following commits:
      
      0368ba9 [baishuo(白硕)] Update KafkaWordCount.scala
      8aaef5c7
  27. Apr 24, 2014
    • Mridul Muralidharan's avatar
      SPARK-1586 Windows build fixes · 968c0187
      Mridul Muralidharan authored
      Unfortunately, this is not exhaustive; in particular, Hive tests still fail due to path issues. (A sketch of the portability fixes involved follows the commit list below.)
      
      Author: Mridul Muralidharan <mridulm80@apache.org>
      
      This patch had conflicts when merged, resolved by
      Committer: Matei Zaharia <matei@databricks.com>
      
      Closes #505 from mridulm/windows_fixes and squashes the following commits:
      
      ef12283 [Mridul Muralidharan] Move to org.apache.commons.lang3 for StringEscapeUtils. Earlier version was apparently buggy
      cdae406 [Mridul Muralidharan] Remove leaked changes from > 2G fix branch
      3267f4b [Mridul Muralidharan] Fix build failures
      35b277a [Mridul Muralidharan] Fix Scalastyle failures
      bc69d14 [Mridul Muralidharan] Change from hardcoded path separator
      10c4d78 [Mridul Muralidharan] Use explicit encoding while using getBytes
      1337abd [Mridul Muralidharan] fix classpath while running in windows
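      
      A sketch of the portability idioms behind the path-separator and encoding fixes above; the values are illustrative, not the actual Spark diff:
      
      ~~~
      import java.io.File

      // Hardcoded "/" and ":" break on Windows; use the platform constants.
      val classpath = Seq("lib" + File.separator + "a.jar",
                          "lib" + File.separator + "b.jar").mkString(File.pathSeparator)

      // getBytes() without a charset uses the platform default encoding,
      // which varies across OSes; always name the encoding explicitly.
      val bytes = "spark".getBytes("UTF-8")
      ~~~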
      968c0187
    • Sandeep's avatar
      Fix Scala Style · a03ac222
      Sandeep authored
      Any comments are welcome.
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #531 from techaddict/stylefix-1 and squashes the following commits:
      
      7492730 [Sandeep] Pass 4
      98b2428 [Sandeep] fix rxin suggestions
      b5e2e6f [Sandeep] Pass 3
      05932d7 [Sandeep] fix if else styling 2
      08690e5 [Sandeep] fix if else styling
      a03ac222