  1. Jul 25, 2014
    • [SPARK-2410][SQL] Merging Hive Thrift/JDBC server · 06dc0d2c
      Cheng Lian authored
      JIRA issue:
      
      - Main: [SPARK-2410](https://issues.apache.org/jira/browse/SPARK-2410)
      - Related: [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678)
      
      Cherry picked the Hive Thrift/JDBC server from [branch-1.0-jdbc](https://github.com/apache/spark/tree/branch-1.0-jdbc).
      
      (Thanks chenghao-intel for his initial contribution of the Spark SQL CLI.)
      
      TODO
      
      - [x] Use `spark-submit` to launch the server, the CLI and beeline (see the sketch below)
      - [x] Migration guideline draft for Shark users
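      
      A hedged sketch of how these entry points are invoked once everything goes through `spark-submit` (script names as in the commits below; the Beeline JDBC URL is an assumption, matching the HiveServer2 default port):
      
      ```bash
      # All three now delegate to spark-submit under the hood:
      ./sbin/start-thriftserver.sh                     # starts HiveThriftServer2
      ./bin/spark-sql                                  # Spark SQL CLI
      ./bin/beeline -u jdbc:hive2://localhost:10000    # Beeline JDBC client
      ```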
      
      ----
      
      Hit a bug in `SparkSubmitArguments` while working on this PR: any application option that `SparkSubmitArguments` recognizes is swallowed as a `SparkSubmit` option. For example:
      
      ```bash
      $ spark-submit --class org.apache.hive.beeline.BeeLine spark-internal --help
      ```
      
      This actually shows usage information of `SparkSubmit` rather than `BeeLine`.
      
      ~~Fixed this bug here since the `spark-internal` related stuff also touches `SparkSubmitArguments` and I'd like to avoid conflict.~~
      
      **UPDATE** The bug mentioned above is now tracked by [SPARK-2678](https://issues.apache.org/jira/browse/SPARK-2678). Decided to revert the changes for this bug, since it involves more subtle considerations and is worth a separate PR.
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #1399 from liancheng/thriftserver and squashes the following commits:
      
      090beea [Cheng Lian] Revert changes related to SPARK-2678, decided to move them to another PR
      21c6cf4 [Cheng Lian] Updated Spark SQL programming guide docs
      fe0af31 [Cheng Lian] Reordered spark-submit options in spark-shell[.cmd]
      199e3fb [Cheng Lian] Disabled MIMA for hive-thriftserver
      1083e9d [Cheng Lian] Fixed failed test suites
      7db82a1 [Cheng Lian] Fixed spark-submit application options handling logic
      9cc0f06 [Cheng Lian] Starts beeline with spark-submit
      cfcf461 [Cheng Lian] Updated documents and build scripts for the newly added hive-thriftserver profile
      061880f [Cheng Lian] Addressed all comments by @pwendell
      7755062 [Cheng Lian] Adapts test suites to spark-submit settings
      40bafef [Cheng Lian] Fixed more license header issues
      e214aab [Cheng Lian] Added missing license headers
      b8905ba [Cheng Lian] Fixed minor issues in spark-sql and start-thriftserver.sh
      f975d22 [Cheng Lian] Updated docs for Hive compatibility and Shark migration guide draft
      3ad4e75 [Cheng Lian] Starts spark-sql shell with spark-submit
      a5310d1 [Cheng Lian] Make HiveThriftServer2 play well with spark-submit
      61f39f4 [Cheng Lian] Starts Hive Thrift server via spark-submit
      2c4c539 [Cheng Lian] Cherry picked the Hive Thrift server
  2. Jul 10, 2014
    • [SPARK-1776] Have Spark's SBT build read dependencies from Maven. · 628932b8
      Prashant Sharma authored
      This patch introduces the new way of working while retaining the existing ways of doing things.
      
      For example, the Maven build instruction for YARN is
      `mvn -Pyarn -Phadoop-2.2 clean package -DskipTests`;
      in sbt this becomes
      `MAVEN_PROFILES="yarn, hadoop-2.2" sbt/sbt clean assembly`
      Also supports
      `sbt/sbt -Pyarn -Phadoop-2.2 -Dhadoop.version=2.2.0 clean assembly`
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #772 from ScrapCodes/sbt-maven and squashes the following commits:
      
      a8ac951 [Prashant Sharma] Updated sbt version.
      62b09bb [Prashant Sharma] Improvements.
      fa6221d [Prashant Sharma] Excluding sql from mima
      4b8875e [Prashant Sharma] Sbt assembly no longer builds tools by default.
      72651ca [Prashant Sharma] Addresses code review comments.
      acab73d [Prashant Sharma] Revert "Small fix to run-examples script."
      ac4312c [Prashant Sharma] Revert "minor fix"
      6af91ac [Prashant Sharma] Ported oldDeps back. + fixes issues with prev commit.
      65cf06c [Prashant Sharma] Servlet API jars mess with the other servlet jars on the class path.
      446768e [Prashant Sharma] minor fix
      89b9777 [Prashant Sharma] Merge conflicts
      d0a02f2 [Prashant Sharma] Bumped up pom versions; since the build now depends on the pom, it is better updated there. + general cleanups.
      dccc8ac [Prashant Sharma] updated mima to check against 1.0
      a49c61b [Prashant Sharma] Fix for tools jar
      a2f5ae1 [Prashant Sharma] Fixes a bug in dependencies.
      cf88758 [Prashant Sharma] cleanup
      9439ea3 [Prashant Sharma] Small fix to run-examples script.
      96cea1f [Prashant Sharma] SPARK-1776 Have Spark's SBT build read dependencies from Maven.
      36efa62 [Patrick Wendell] Set project name in pom files and added eclipse/intellij plugins.
      4973dbd [Patrick Wendell] Example build using pom reader.
  3. Jul 03, 2014
    • [SPARK-2109] Setting SPARK_MEM for bin/pyspark does not work. · 731f683b
      Prashant Sharma authored
      Trivial fix.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #1050 from ScrapCodes/SPARK-2109/pyspark-script-bug and squashes the following commits:
      
      77072b9 [Prashant Sharma] Changed echos to redirect to STDERR.
      13f48a0 [Prashant Sharma] [SPARK-2109] Setting SPARK_MEM for bin/pyspark does not work.
  5. Jun 12, 2014
    • SPARK-1843: Replace assemble-deps with env variable. · 1c04652c
      Patrick Wendell authored
      (This change is actually small, I moved some logic into
      compute-classpath that was previously in spark-class).
      
      Assemble deps has existed for a while to allow developers to
      run local code with new changes quickly. When I'm developing I
      typically use a simpler approach which just prepends the Spark
      classes to the classpath before the assembly jar. This is well
      defined in the JVM and the Spark classes take precedence over those
      in the assembly.
      
      This approach is portable across both builds which is the main reason I'd
      like to switch to it. It's also a bit easier to toggle on and off quickly.
      
      The way you use this is the following:
      ```
      $ ./bin/spark-shell # Use spark with the normal assembly
      $ export SPARK_PREPEND_CLASSES=true
      $ ./bin/spark-shell # Now it's using compiled classes
      $ unset SPARK_PREPEND_CLASSES
      $ ./bin/spark-shell # Back to normal
      ```
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #877 from pwendell/assemble-deps and squashes the following commits:
      
      8a11345 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into assemble-deps
      faa3168 [Patrick Wendell] Adding a warning for compatibility
      3f151a7 [Patrick Wendell] Small fix
      bbfb73c [Patrick Wendell] Review feedback
      328e9f8 [Patrick Wendell] SPARK-1843: Replace assemble-deps with env variable.
  6. Jun 11, 2014
    • HOTFIX: A few PySpark tests were not actually run · fe78b8b6
      Andrew Or authored
      This is a hot fix for the hot fix in fb499be1. The changes in that commit did not actually cause the `doctest` module in python to be loaded for the following tests:
      - pyspark/broadcast.py
      - pyspark/accumulators.py
      - pyspark/serializers.py
      
      (@pwendell I might have told you the wrong thing)
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1053 from andrewor14/python-test-fix and squashes the following commits:
      
      d2e5401 [Andrew Or] Explain why these tests are handled differently
      0bd6fdd [Andrew Or] Fix 3 pyspark tests not being invoked
  7. Jun 10, 2014
    • HOTFIX: Fix Python tests on Jenkins. · fb499be1
      Patrick Wendell authored
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #1036 from pwendell/jenkins-test and squashes the following commits:
      
      9c99856 [Patrick Wendell] Better output during tests
      71e7b74 [Patrick Wendell] Removing incorrect python path
      74984db [Patrick Wendell] HOTFIX: Allow PySpark tests to run on Jenkins.
  8. Jun 08, 2014
    • Update run-example · e9261d08
      maji2014 authored
      The old code could only be run from the Spark home directory as "bin/run-example".
      The error "./run-example: line 55: ./bin/spark-submit: No such file or directory" appears when it is run from anywhere else, so change this.
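      
      A minimal sketch of the fix, assuming the usual pattern of resolving the Spark home directory from the script's own location rather than the current working directory (`FWDIR` is an illustrative name):
      
      ```bash
      # Resolve Spark home relative to this script, not to where it was invoked from:
      FWDIR="$(cd "$(dirname "$0")"/..; pwd)"
      exec "$FWDIR"/bin/spark-submit "$@"
      ```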
      
      Author: maji2014 <maji3@asiainfo-linkage.com>
      
      Closes #1011 from maji2014/master and squashes the following commits:
      
      2cc1af6 [maji2014] Update run-example
      
      Closes #988.
  9. May 25, 2014
    • spark-submit: add exec at the end of the script · 6e9fb632
      Colin Patrick Mccabe authored
      Add an 'exec' at the end of the spark-submit script, to avoid keeping a
      bash process hanging around while it runs.  This makes ps look a little
      bit nicer.
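      
      A sketch of what the script's final line looks like with this change; the exact class name and argument handling are assumptions:
      
      ```bash
      # exec replaces the wrapper shell with the launched process, so no idle
      # bash process lingers in ps output:
      exec "$SPARK_HOME"/bin/spark-class org.apache.spark.deploy.SparkSubmit "$@"
      ```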
      
      Author: Colin Patrick Mccabe <cmccabe@cloudera.com>
      
      Closes #858 from cmccabe/SPARK-1907 and squashes the following commits:
      
      7023b64 [Colin Patrick Mccabe] spark-submit: add exec at the end of the script
  10. May 21, 2014
    • [SPARK-1250] Fixed misleading comments in bin/pyspark, bin/spark-class · 6e337380
      Sumedh Mungee authored
      Fixed a couple of misleading comments in bin/pyspark and bin/spark-class. The comments make it seem like the script is looking for the Scala installation when in fact it is looking for Spark.
      
      Author: Sumedh Mungee <smungee@gmail.com>
      
      Closes #843 from smungee/spark-1250-fix-comments and squashes the following commits:
      
      26870f3 [Sumedh Mungee] [SPARK-1250] Fixed misleading comments in bin/pyspark and bin/spark-class
  11. May 19, 2014
    • SPARK-1879. Increase MaxPermSize since some of our builds have many classes · 5af99d76
      Matei Zaharia authored
      See https://issues.apache.org/jira/browse/SPARK-1879 -- builds with Hadoop2 and Hive ran out of PermGen space in spark-shell once those were combined with the Scala compiler.
      
      Note that users can still override it by setting their own Java options with this change. Their options will come later in the command string than the `-XX:MaxPermSize=128m` default, so they take precedence.
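      
      A sketch of the ordering, assuming options are concatenated as described (for a repeated JVM flag, the last occurrence generally wins):
      
      ```bash
      # Default first, user options appended after it:
      JAVA_OPTS="-XX:MaxPermSize=128m $SPARK_JAVA_OPTS"
      # A user setting SPARK_JAVA_OPTS="-XX:MaxPermSize=256m" thus overrides
      # the 128m default, since their flag appears later on the command line.
      ```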
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #823 from mateiz/spark-1879 and squashes the following commits:
      
      6bc0ee8 [Matei Zaharia] Increase MaxPermSize to 128m since some of our builds have lots of classes
    • [SPARK-1876] Windows fixes to deal with latest distribution layout changes · 7b70a707
      Matei Zaharia authored
      - Look for JARs in the right place
      - Launch examples the same way as on Unix
      - Load datanucleus JARs if they exist
      - Don't attempt to parse local paths as URIs in SparkSubmit, since paths with C:\ are not valid URIs
      - Also fixed POM exclusion rules for datanucleus (it wasn't properly excluding it, whereas SBT was)
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #819 from mateiz/win-fixes and squashes the following commits:
      
      d558f96 [Matei Zaharia] Fix comment
      228577b [Matei Zaharia] Review comments
      d3b71c7 [Matei Zaharia] Properly exclude datanucleus files in Maven assembly
      144af84 [Matei Zaharia] Update Windows scripts to match latest binary package layout
  12. May 18, 2014
    • Fix spark-submit path in spark-shell & pyspark · ebcd2d68
      Neville Li authored
      Author: Neville Li <neville@spotify.com>
      
      Closes #812 from nevillelyh/neville/v1.0 and squashes the following commits:
      
      0dc33ed [Neville Li] Fix spark-submit path in pyspark
      becec64 [Neville Li] Fix spark-submit path in spark-shell
  13. May 17, 2014
    • [SPARK-1808] Route bin/pyspark through Spark submit · 4b8ec6fc
      Andrew Or authored
      **Problem.** For `bin/pyspark`, there is currently no way to specify Spark configuration properties other than through `SPARK_JAVA_OPTS` in `conf/spark-env.sh`. However, this mechanism is supposedly deprecated. Instead, it needs to pick up configurations explicitly specified in `conf/spark-defaults.conf`.
      
      **Solution.** Have `bin/pyspark` invoke `bin/spark-submit`, like all of its counterparts in Scala land (i.e. `bin/spark-shell`, `bin/run-example`). This has the additional benefit of making the invocation of all the user-facing Spark scripts consistent.
      
      **Details.** `bin/pyspark` inherently handles two cases: (1) running python applications and (2) running the python shell. For (1), Spark submit already handles running python applications. For cases in which `bin/pyspark` is given a python file, we can simply pass the file directly to Spark submit and let it handle the rest.
      
      For case (2), `bin/pyspark` starts a python process as before, which launches the JVM as a sub-process. The existing code already provides a code path to do this. All we needed to change was to use `bin/spark-submit` instead of `spark-class` to launch the JVM. This requires modifications to Spark submit to handle the pyspark shell as a special case.
      
      This has been tested locally (OSX and Windows 7), on a standalone cluster, and on a YARN cluster. Running IPython also works as before, except now it takes in Spark submit arguments too.
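      
      The two cases as invocations (a sketch; `my_script.py` is a hypothetical file):
      
      ```bash
      ./bin/pyspark                 # case (2): interactive shell, JVM launched via spark-submit
      ./bin/pyspark my_script.py    # case (1): Python application, handed to spark-submit
      ```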
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #799 from andrewor14/pyspark-submit and squashes the following commits:
      
      bf37e36 [Andrew Or] Minor changes
      01066fa [Andrew Or] bin/pyspark for Windows
      c8cb3bf [Andrew Or] Handle perverse app names (with escaped quotes)
      1866f85 [Andrew Or] Windows is not cooperating
      456d844 [Andrew Or] Guard against shlex hanging if PYSPARK_SUBMIT_ARGS is not set
      7eebda8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
      b7ba0d8 [Andrew Or] Address a few comments (minor)
      06eb138 [Andrew Or] Use shlex instead of writing our own parser
      05879fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
      a823661 [Andrew Or] Fix --die-on-broken-pipe not propagated properly
      6fba412 [Andrew Or] Deal with quotes + address various comments
      fe4c8a7 [Andrew Or] Update --help for bin/pyspark
      afe47bf [Andrew Or] Fix spark shell
      f04aaa4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
      a371d26 [Andrew Or] Route bin/pyspark through Spark submit
  14. May 12, 2014
    • [SPARK-1736] Spark submit for Windows · beb9cbac
      Andrew Or authored
      Tested on Windows 7.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #745 from andrewor14/windows-submit and squashes the following commits:
      
      c0b58fb [Andrew Or] Allow spaces in parameters
      162e54d [Andrew Or] Merge branch 'master' of github.com:apache/spark into windows-submit
      91597ce [Andrew Or] Make spark-shell.cmd use spark-submit.cmd
      af6fd29 [Andrew Or] Add spark submit for Windows
  15. May 11, 2014
    • SPARK-1652: Set driver memory correctly in spark-submit. · 05c9aa9e
      Patrick Wendell authored
      The previous check didn't account for the fact that the default
      deploy mode is "client" unless otherwise specified. Also, this
      sets the more narrowly defined SPARK_DRIVER_MEMORY instead of setting
      SPARK_MEM.
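      
      A hedged sketch of the corrected check (the local variable names are assumptions):
      
      ```bash
      # "client" is the default deploy mode, so an unset value must be treated
      # the same as an explicit "client":
      if [ -z "$DEPLOY_MODE" ] || [ "$DEPLOY_MODE" = "client" ]; then
        export SPARK_DRIVER_MEMORY="$DRIVER_MEMORY"   # narrower in scope than SPARK_MEM
      fi
      ```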
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #730 from pwendell/spark-submit and squashes the following commits:
      
      430b98f [Patrick Wendell] Feedback from Aaron
      e788edf [Patrick Wendell] Changes based on Aaron's feedback
      f508146 [Patrick Wendell] SPARK-1652: Set driver memory correctly in spark-submit.
  16. May 09, 2014
    • SPARK-1565 (Addendum): Replace `run-example` with `spark-submit`. · 06b15baa
      Patrick Wendell authored
      Gives a nicely formatted message to the user when `run-example` is run to
      tell them to use `spark-submit`.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #704 from pwendell/examples and squashes the following commits:
      
      1996ee8 [Patrick Wendell] Feedback from Andrew
      3eb7803 [Patrick Wendell] Suggestions from TD
      2474668 [Patrick Wendell] SPARK-1565 (Addendum): Replace `run-example` with `spark-submit`.
  17. May 05, 2014
    • [SPARK-1681] Include datanucleus jars in Spark Hive distribution · cf0a8f02
      Andrew Or authored
      This copies the datanucleus jars over from `lib_managed` into `dist/lib`, if any. The `CLASSPATH` must also be updated to reflect this change.
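      
      A minimal sketch of the copy step, assuming the paths named above (the jar glob is an assumption):
      
      ```bash
      # Copy the datanucleus jars into the distribution, if any were fetched:
      mkdir -p dist/lib
      cp lib_managed/jars/datanucleus-*.jar dist/lib/ 2>/dev/null || true
      ```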
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #610 from andrewor14/hive-distribution and squashes the following commits:
      
      a4bc96f [Andrew Or] Rename search path in jar error check
      fa205e1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into hive-distribution
      7855f58 [Andrew Or] Have jar command respect JAVA_HOME + check for jar errors both cases
      c16bbfd [Andrew Or] Merge branch 'master' of github.com:apache/spark into hive-distribution
      32f6826 [Andrew Or] Leave the double colons
      940a1bb [Andrew Or] Add back 2>/dev/null
      58357cc [Andrew Or] Include datanucleus jars in Spark distribution built with Hive support
  18. May 04, 2014
    • SPARK-1703 Warn users if Spark is run on JRE6 but compiled with JDK7. · 0c98a8f6
      Patrick Wendell authored
      This adds some guards and good warning messages if users hit this issue. /cc @aarondav with whom I discussed parts of the design.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #627 from pwendell/jdk6 and squashes the following commits:
      
      a38a958 [Patrick Wendell] Code review feedback
      94e9f84 [Patrick Wendell] SPARK-1703 Warn users if Spark is run on JRE6 but compiled with JDK7.
    • The default version of yarn is equal to the hadoop version · fb054322
      witgo authored
      This is a part of [PR 590](https://github.com/apache/spark/pull/590)
      
      Author: witgo <witgo@qq.com>
      
      Closes #626 from witgo/yarn_version and squashes the following commits:
      
      c390631 [witgo] restore  the yarn dependency declarations
      f8a4ad8 [witgo] revert remove the dependency of avro in yarn-alpha
      2df6cf5 [witgo] review commit
      a1d876a [witgo] review commit
      20e7e3e [witgo] review commit
      c76763b [witgo] The default value of yarn.version is equal to hadoop.version
  19. May 01, 2014
    • SPARK-1691: Support quoted arguments inside of spark-submit. · 98b65593
      Patrick Wendell authored
      This is a fairly straightforward fix. The bug was reported by @vanzin and the fix was proposed by @deanwampler and myself. Please take a look!
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #609 from pwendell/quotes and squashes the following commits:
      
      8bed767 [Patrick Wendell] SPARK-1691: Support quoted arguments inside of spark-submit.
  20. Apr 30, 2014
    • SPARK-1004. PySpark on YARN · ff5be9a4
      Sandy Ryza authored
      This reopens https://github.com/apache/incubator-spark/pull/640 against the new repo
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #30 from sryza/sandy-spark-1004 and squashes the following commits:
      
      89889d4 [Sandy Ryza] Move unzipping py4j to the generate-resources phase so that it gets included in the jar the first time
      5165a02 [Sandy Ryza] Fix docs
      fd0df79 [Sandy Ryza] PySpark on YARN
  21. Apr 28, 2014
    • SPARK-1654 and SPARK-1653: Fixes in spark-submit. · 949e3931
      Patrick Wendell authored
      Deals with two issues:
      1. Spark shell didn't correctly pass quoted arguments to spark-submit.
      ```./bin/spark-shell --driver-java-options "-Dfoo=f -Dbar=b"```
      2. Spark submit used deprecated environment variables (SPARK_CLASSPATH)
         which triggered warnings. Now we use new, more narrowly scoped,
         variables.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #576 from pwendell/spark-submit and squashes the following commits:
      
      67004c9 [Patrick Wendell] SPARK-1654 and SPARK-1653: Fixes in spark-submit.
  22. Apr 25, 2014
    • SPARK-1619 Launch spark-shell with spark-submit · dc3b640a
      Patrick Wendell authored
      This simplifies the shell a bunch and passes all arguments through to spark-submit.
      
      There is a tiny incompatibility with 0.9.1: you can no longer use `-c`, only `--cores`. However, spark-submit gives a good error message in this case, I don't think many people used `-c`, and it's a trivial change for users.
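      
      For example (a sketch; the master URL is illustrative):
      
      ```bash
      # All options are now passed straight through to spark-submit:
      ./bin/spark-shell --master spark://localhost:7077 --cores 4
      ```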
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #542 from pwendell/spark-shell and squashes the following commits:
      
      9eb3e6f [Patrick Wendell] Updating Spark docs
      b552459 [Patrick Wendell] Andrew's feedback
      97720fa [Patrick Wendell] Review feedback
      aa2900b [Patrick Wendell] SPARK-1619 Launch spark-shell with spark-submit
  23. Apr 24, 2014
    • SPARK-1586 Windows build fixes · 968c0187
      Mridul Muralidharan authored
      Unfortunately, this is not exhaustive - particularly hive tests still fail due to path issues.
      
      Author: Mridul Muralidharan <mridulm80@apache.org>
      
      This patch had conflicts when merged, resolved by
      Committer: Matei Zaharia <matei@databricks.com>
      
      Closes #505 from mridulm/windows_fixes and squashes the following commits:
      
      ef12283 [Mridul Muralidharan] Move to org.apache.commons.lang3 for StringEscapeUtils. Earlier version was buggy apparently
      cdae406 [Mridul Muralidharan] Remove leaked changes from > 2G fix branch
      3267f4b [Mridul Muralidharan] Fix build failures
      35b277a [Mridul Muralidharan] Fix Scalastyle failures
      bc69d14 [Mridul Muralidharan] Change from hardcoded path separator
      10c4d78 [Mridul Muralidharan] Use explicit encoding while using getBytes
      1337abd [Mridul Muralidharan] fix classpath while running in windows
  24. Apr 23, 2014
    • SPARK-1119 and other build improvements · cd4ed293
      Patrick Wendell authored
      1. Makes assembly and examples jar naming consistent in maven/sbt.
      2. Updates make-distribution.sh to use Maven and fixes some bugs.
      3. Updates the create-release script to call make-distribution script.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #502 from pwendell/make-distribution and squashes the following commits:
      
      1a97f0d [Patrick Wendell] SPARK-1119 and other build improvements
  25. Apr 21, 2014
    • Clean up and simplify Spark configuration · fb98488f
      Patrick Wendell authored
      Over time, as we've added more deployment modes, user-facing configuration options in Spark have gotten a bit unwieldy. Going forward we'll advise all users to run `spark-submit` to launch applications. This is a WIP patch but it makes the following improvements:
      
      1. Improved `spark-env.sh.template` which was missing a lot of things users now set in that file.
      2. Removes the shipping of SPARK_CLASSPATH, SPARK_JAVA_OPTS, and SPARK_LIBRARY_PATH to the executors on the cluster. This was an ugly hack. Instead it introduces config variables spark.executor.extraJavaOpts, spark.executor.extraLibraryPath, and spark.executor.extraClassPath.
      3. Adds ability to set these same variables for the driver using `spark-submit`.
      4. Allows you to load system properties from a `spark-defaults.conf` file when running `spark-submit`. This allows setting both SparkConf options and other system properties utilized by `spark-submit` (see the example after this list).
      5. Made `SPARK_LOCAL_IP` an environment variable rather than a SparkConf property. This is more consistent with it being set on each node.
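      
      A hedged example of item 4, using the property names from items 2 and 3 exactly as written above (the application class and jar are hypothetical):
      
      ```bash
      # Seed conf/spark-defaults.conf with the new executor properties:
      {
        echo "spark.executor.extraJavaOpts   -verbose:gc"
        echo "spark.executor.extraClassPath  /opt/extra/jars"
      } > conf/spark-defaults.conf
      ./bin/spark-submit --class org.example.MyApp my-app.jar   # picks up the defaults
      ```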
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #299 from pwendell/config-cleanup and squashes the following commits:
      
      127f301 [Patrick Wendell] Improvements to testing
      a006464 [Patrick Wendell] Moving properties file template.
      b4b496c [Patrick Wendell] spark-defaults.properties -> spark-defaults.conf
      0086939 [Patrick Wendell] Minor style fixes
      af09e3e [Patrick Wendell] Mention config file in docs and clean-up docs
      b16e6a2 [Patrick Wendell] Cleanup of spark-submit script and Scala quick start guide
      af0adf7 [Patrick Wendell] Automatically add user jar
      a56b125 [Patrick Wendell] Responses to Tom's review
      d50c388 [Patrick Wendell] Merge remote-tracking branch 'apache/master' into config-cleanup
      a762901 [Patrick Wendell] Fixing test failures
      ffa00fe [Patrick Wendell] Review feedback
      fda0301 [Patrick Wendell] Note
      308f1f6 [Patrick Wendell] Properly escape quotes and other clean-up for YARN
      e83cd8f [Patrick Wendell] Changes to allow re-use of test applications
      be42f35 [Patrick Wendell] Handle case where SPARK_HOME is not set
      c2a2909 [Patrick Wendell] Test compile fixes
      4ee6f9d [Patrick Wendell] Making YARN doc changes consistent
      afc9ed8 [Patrick Wendell] Cleaning up line limits and two compile errors.
      b08893b [Patrick Wendell] Additional improvements.
      ace4ead [Patrick Wendell] Responses to review feedback.
      b72d183 [Patrick Wendell] Review feedback for spark env file
      46555c1 [Patrick Wendell] Review feedback and import clean-ups
      437aed1 [Patrick Wendell] Small fix
      761ebcd [Patrick Wendell] Library path and classpath for drivers
      7cc70e4 [Patrick Wendell] Clean up terminology inside of spark-env script
      5b0ba8e [Patrick Wendell] Don't ship executor envs
      84cc5e5 [Patrick Wendell] Small clean-up
      1f75238 [Patrick Wendell] SPARK_JAVA_OPTS --> SPARK_MASTER_OPTS for master settings
      4982331 [Patrick Wendell] Remove SPARK_LIBRARY_PATH
      6eaf7d0 [Patrick Wendell] executorJavaOpts
      0faa3b6 [Patrick Wendell] Stash of adding config options in submit script and YARN
      ac2d65e [Patrick Wendell] Change spark.local.dir -> SPARK_LOCAL_DIRS
  26. Apr 10, 2014
    • [SPARK-1276] Add a HistoryServer to render persisted UI · 79820fe8
      Andrew Or authored
      The new feature of event logging, introduced in #42, allows the user to persist the details of his/her Spark application to storage, and later replay these events to reconstruct an after-the-fact SparkUI.
      Currently, however, a persisted UI can only be rendered through the standalone Master. This greatly limits the use case of this new feature as many people also run Spark on Yarn / Mesos.
      
      This PR introduces a new entity called the HistoryServer, which, given a log directory, keeps track of all completed applications independently of a Spark Master. Unlike the Master, the HistoryServer need not be running while the application is still running. It is relatively lightweight in that it only maintains static information of applications and performs no scheduling.
      
      To quickly test it out, generate event logs with ```spark.eventLog.enabled=true``` and run ```sbin/start-history-server.sh <log-dir-path>```. Your HistoryServer awaits on port 18080.
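      
      Restated as commands (a sketch; the log directory and the exact way of setting the property are assumptions):
      
      ```bash
      SPARK_JAVA_OPTS="-Dspark.eventLog.enabled=true" ./bin/spark-shell   # generate event logs
      sbin/start-history-server.sh /tmp/spark-events                      # replay them
      # the HistoryServer then serves its UI on port 18080
      ```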
      
      Comments and feedback are most welcome.
      
      ---
      
      A few other changes introduced in this PR include refactoring the WebUI interface, which is beginning to have a lot of duplicate code now that we have added more functionality to it. Two new SparkListenerEvents have been introduced (SparkListenerApplicationStart/End) to keep track of application name and start/finish times. This PR also clarifies the semantics of the ReplayListenerBus introduced in #42.
      
      A potential TODO in the future (not part of this PR) is to render live applications in addition to just completed applications. This is useful when applications fail, a condition that our current HistoryServer does not handle unless the user manually signals application completion (by creating the APPLICATION_COMPLETION file). Handling live applications becomes significantly more challenging, however, because it is now necessary to render the same SparkUI multiple times. To avoid reading the entire log every time, which is inefficient, we must handle reading the log from where we previously left off, but this becomes fairly complicated because we must deal with the arbitrary behavior of each input stream.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #204 from andrewor14/master and squashes the following commits:
      
      7b7234c [Andrew Or] Finished -> Completed
      b158d98 [Andrew Or] Address Patrick's comments
      69d1b41 [Andrew Or] Do not block on posting SparkListenerApplicationEnd
      19d5dd0 [Andrew Or] Merge github.com:apache/spark
      f7f5bf0 [Andrew Or] Make history server's web UI port a Spark configuration
      2dfb494 [Andrew Or] Decouple checking for application completion from replaying
      d02dbaa [Andrew Or] Expose Spark version and include it in event logs
      2282300 [Andrew Or] Add documentation for the HistoryServer
      567474a [Andrew Or] Merge github.com:apache/spark
      6edf052 [Andrew Or] Merge github.com:apache/spark
      19e1fb4 [Andrew Or] Address Thomas' comments
      248cb3d [Andrew Or] Limit number of live applications + add configurability
      a3598de [Andrew Or] Do not close file system with ReplayBus + fix bind address
      bc46fc8 [Andrew Or] Merge github.com:apache/spark
      e2f4ff9 [Andrew Or] Merge github.com:apache/spark
      050419e [Andrew Or] Merge github.com:apache/spark
      81b568b [Andrew Or] Fix strange error messages...
      0670743 [Andrew Or] Decouple page rendering from loading files from disk
      1b2f391 [Andrew Or] Minor changes
      a9eae7e [Andrew Or] Merge branch 'master' of github.com:apache/spark
      d5154da [Andrew Or] Styling and comments
      5dbfbb4 [Andrew Or] Merge branch 'master' of github.com:apache/spark
      60bc6d5 [Andrew Or] First complete implementation of HistoryServer (only for finished apps)
      7584418 [Andrew Or] Report application start/end times to HistoryServer
      8aac163 [Andrew Or] Add basic application table
      c086bd5 [Andrew Or] Add HistoryServer and scripts ++ Refactor WebUI interface
  28. Apr 07, 2014
    • SPARK-1099: Introduce local[*] mode to infer number of cores · 0307db0f
      Aaron Davidson authored
      This is the default mode for running spark-shell and pyspark, intended to allow users running spark for the first time to see the performance benefits of using multiple cores, while not breaking backwards compatibility for users who use "local" mode and expect exactly 1 core.
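      
      A sketch of the two behaviors, using the `MASTER` environment variable form (an assumption about how the shell scripts of this era pick up the master):
      
      ```bash
      MASTER='local[*]' ./bin/spark-shell   # infer and use all available cores (the new default)
      MASTER=local ./bin/spark-shell        # exactly 1 core, as before
      ```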
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #182 from aarondav/110 and squashes the following commits:
      
      a88294c [Aaron Davidson] Rebased changes for new spark-shell
      a9f393e [Aaron Davidson] SPARK-1099: Introduce local[*] mode to infer number of cores
  29. Apr 06, 2014
    • SPARK-1314: Use SPARK_HIVE to determine if we include Hive in packaging · 41065584
      Aaron Davidson authored
      Previously, we based our decision about including datanucleus jars on the existence of a spark-hive-assembly jar, which was incidentally built whenever "sbt assembly" was run. This meant that a typical, previously supported pathway would start using hive jars.
      
      This patch has the following features/bug fixes:
      
      - Use of SPARK_HIVE (default false) to determine if we should include Hive in the assembly jar.
      - Analogous feature in Maven with -Phive (previously, there was no support for adding Hive to any of our jars produced by Maven); see the build commands after this list
      - assemble-deps fixed since we no longer use a different ASSEMBLY_DIR
      - avoid adding a log message in compute-classpath.sh to the classpath :)
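      
      Build commands implied by the first two items (a sketch, with the flags as described above):
      
      ```bash
      SPARK_HIVE=true sbt/sbt assembly        # sbt: SPARK_HIVE defaults to false
      mvn -Phive clean package -DskipTests    # Maven: the analogous -Phive profile
      ```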
      
      Still TODO before mergeable:
      - We need to download the datanucleus jars outside of sbt. Perhaps we can have spark-class download them if SPARK_HIVE is set, similar to how sbt downloads itself.
      - Spark SQL documentation updates.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #237 from aarondav/master and squashes the following commits:
      
      5dc4329 [Aaron Davidson] Typo fixes
      dd4f298 [Aaron Davidson] Doc update
      dd1a365 [Aaron Davidson] Eliminate need for SPARK_HIVE at runtime by d/ling datanucleus from Maven
      a9269b5 [Aaron Davidson] [WIP] Use SPARK_HIVE to determine if we include Hive in packaging
  30. Apr 04, 2014
    • SPARK-1404: Always upgrade spark-env.sh vars to environment vars · 01cf4c40
      Aaron Davidson authored
      This was broken when spark-env.sh was made idempotent: the idempotence check uses an environment variable, but the variables set in spark-env.sh may not have been exported.
      
      Tested in zsh, bash, and sh.
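      
      A hedged sketch of the upgrade mechanism: sourcing the file with auto-export enabled turns its plain assignments into environment variables (the conf path is an assumption):
      
      ```bash
      set -a                                       # auto-export every assignment below
      . "${SPARK_CONF_DIR:-./conf}/spark-env.sh"   # plain VAR=value lines become exported
      set +a
      ```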
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #310 from aarondav/SPARK-1404 and squashes the following commits:
      
      c3406a5 [Aaron Davidson] Add extra export in spark-shell
      6a0e340 [Aaron Davidson] SPARK-1404: Always upgrade spark-env.sh vars to environment vars
  31. Apr 03, 2014
    • [SPARK-1134] Fix and document passing of arguments to IPython · a599e43d
      Diana Carroll authored
      This is based on @dianacarroll's previous pull request https://github.com/apache/spark/pull/227, and @joshrosen's comments on https://github.com/apache/spark/pull/38. Since we do want to allow passing arguments to IPython, this does the following:
      * It documents that IPython can't be used with standalone jobs for now. (Later versions of IPython will deal with PYTHONSTARTUP properly and enable this, see https://github.com/ipython/ipython/pull/5226, but no released version has that fix.)
      * If you run `pyspark` with `IPYTHON=1`, your command-line arguments are passed through to IPython. This way you can do stuff like `IPYTHON=1 bin/pyspark notebook` (see the examples after this list).
      * The old `IPYTHON_OPTS` remains, but I've removed it from the documentation. This is in case people read an old tutorial that uses it.
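      
      The resulting invocation patterns (restating the behavior described above):
      
      ```bash
      IPYTHON=1 bin/pyspark             # interactive IPython shell
      IPYTHON=1 bin/pyspark notebook    # extra arguments now pass through to IPython
      ```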
      
      This is not a perfect solution and I'd also be okay with keeping things as they are today (ignoring `$@` for IPython and using IPYTHON_OPTS), and only doing the doc change. With this change though, when IPython fixes https://github.com/ipython/ipython/pull/5226, people will immediately be able to do `IPYTHON=1 bin/pyspark myscript.py` to run a standalone script and get all the benefits of running scripts in IPython (presumably better debugging and such). Without it, there will be no way to run scripts in IPython.
      
      @joshrosen you should probably take the final call on this.
      
      Author: Diana Carroll <dcarroll@cloudera.com>
      
      Closes #294 from mateiz/spark-1134 and squashes the following commits:
      
      747bb13 [Diana Carroll] SPARK-1134 bug with ipython prevents non-interactive use with spark; only call ipython if no command line arguments were supplied
  33. Mar 29, 2014
    • [SPARK-1186] : Enrich the Spark Shell to support additional arguments. · fda86d8b
      Bernardo Gomez Palacio authored
      Enrich the Spark Shell functionality to support the following options.
      
      ```
      Usage: spark-shell [OPTIONS]
      
      OPTIONS:
          -h  --help              : Print this help information.
          -c  --cores             : The maximum number of cores to be used by the Spark Shell.
          -em --executor-memory   : The memory used by each executor of the Spark Shell, the number
                                    is followed by m for megabytes or g for gigabytes, e.g. "1g".
          -dm --driver-memory     : The memory used by the Spark Shell, the number is followed
                                    by m for megabytes or g for gigabytes, e.g. "1g".
          -m  --master            : A full string that describes the Spark Master, defaults to "local"
                                    e.g. "spark://localhost:7077".
          --log-conf              : Enables logging of the supplied SparkConf as INFO at start of the
                                    Spark Context.
      
      e.g.
          spark-shell -m spark://localhost:7077 -c 4 -dm 512m -em 2g
      ```
      
      **Note**: this commit reflects the changes applied to _master_ based on [5d98cfc1].
      
      [ticket: SPARK-1186] : Enrich the Spark Shell to support additional arguments.
                              https://spark-project.atlassian.net/browse/SPARK-1186
      
      Author      : bernardo.gomezpalacio@gmail.com
      
      Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>
      
      Closes #116 from berngp/feature/enrich-spark-shell and squashes the following commits:
      
      c5f455f [Bernardo Gomez Palacio] [SPARK-1186] : Enrich the Spark Shell to support additional arguments.
    • SPARK-1126. spark-app preliminary · 16178160
      Sandy Ryza authored
      This is a starting version of the spark-app script for running compiled binaries against Spark.  It still needs tests and some polish.  The only testing I've done so far has been using it to launch jobs in yarn-standalone mode against a pseudo-distributed cluster.
      
      This leaves out the changes required for launching python scripts.  I think it might be best to save those for another JIRA/PR (while keeping to the design so that they won't require backwards-incompatible changes).
      
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #86 from sryza/sandy-spark-1126 and squashes the following commits:
      
      d428d85 [Sandy Ryza] Commenting, doc, and import fixes from Patrick's comments
      e7315c6 [Sandy Ryza] Fix failing tests
      34de899 [Sandy Ryza] Change --more-jars to --jars and fix docs
      299ddca [Sandy Ryza] Fix scalastyle
      a94c627 [Sandy Ryza] Add newline at end of SparkSubmit
      04bc4e2 [Sandy Ryza] SPARK-1126. spark-submit script
  34. Mar 27, 2014
    • SPARK-1330 removed extra echo from compute-classpath.sh · 426042ad
      Thomas Graves authored
      Remove the extra echo which prevents spark-class from working. Note that I did not update the comment above it (which is also wrong) because I'm not sure what it should do.
      
      Should hive only be included if explicitly built with sbt hive/assembly or should sbt assembly build it?
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #241 from tgravescs/SPARK-1330 and squashes the following commits:
      
      b10d708 [Thomas Graves] SPARK-1330 removed extra echo from compute-classpath.sh
  35. Mar 25, 2014
    • SPARK-1286: Make usage of spark-env.sh idempotent · 007a7334
      Aaron Davidson authored
      Various spark scripts load spark-env.sh. This can cause repeated growth of variables that are appended to (SPARK_CLASSPATH, SPARK_REPL_OPTS), and it makes the precedence order for options specified in spark-env.sh less clear.
      
      One use-case for the latter is that we want to set options from the command-line of spark-shell, but these options will be overridden by subsequent loading of spark-env.sh. If we were to load the spark-env.sh first and then set our command-line options, we could guarantee correct precedence order.
      
      Note that we use SPARK_CONF_DIR if available to support the sbin/ scripts, which always set this variable from sbin/spark-config.sh. Otherwise, we default to the ../conf/ as usual.
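      
      A minimal sketch of the idempotence guard, assuming a guard variable named `SPARK_ENV_LOADED`:
      
      ```bash
      if [ -z "$SPARK_ENV_LOADED" ]; then
        export SPARK_ENV_LOADED=1                    # environment var, so it survives re-sourcing
        conf_dir="${SPARK_CONF_DIR:-"$(dirname "$0")/../conf"}"
        [ -f "$conf_dir/spark-env.sh" ] && . "$conf_dir/spark-env.sh"
      fi
      ```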
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #184 from aarondav/idem and squashes the following commits:
      
      e291f91 [Aaron Davidson] Use "private" variables in load-spark-env.sh
      8da8360 [Aaron Davidson] Add .sh extension to load-spark-env.sh
      93a2471 [Aaron Davidson] SPARK-1286: Make usage of spark-env.sh idempotent
  36. Mar 24, 2014
    • SPARK-1094 Support MiMa for reporting binary compatibility across versions. · dc126f21
      Patrick Wendell authored
      This adds some changes on top of the initial work by @scrapcodes in #20:
      
      The goal here is to do automated checking of Spark commits to determine whether they break binary compatibility.
      
      1. Special case for inner classes of package-private objects.
      2. Made tools classes accessible when running `spark-class`.
      3. Made some declared types in MLLib more general.
      4. Various other improvements to exclude-generation script.
      5. In-code documentation.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Prashant Sharma <prashant.s@imaginea.com>
      Author: Prashant Sharma <scrapcodes@gmail.com>
      
      Closes #207 from pwendell/mima and squashes the following commits:
      
      22ae267 [Patrick Wendell] New binary changes after upmerge
      6c2030d [Patrick Wendell] Merge remote-tracking branch 'apache/master' into mima
      3666cf1 [Patrick Wendell] Minor style change
      0e0f570 [Patrick Wendell] Small fix and removing directory listings
      647c547 [Patrick Wendell] Review feedback.
      c39f3b5 [Patrick Wendell] Some enhancements to binary checking.
      4c771e0 [Prashant Sharma] Added a tool to generate mima excludes and also adapted build to pick automatically.
      b551519 [Prashant Sharma] adding a new exclude after rebasing with master
      651844c [Prashant Sharma] Support MiMa for reporting binary compatibility across versions.