  1. May 08, 2014
• [SPARK-1631] Correctly set the Yarn app name when launching the AM. · 3f779d87
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #539 from vanzin/yarn-app-name and squashes the following commits:
      
      7d1ca4f [Marcelo Vanzin] [SPARK-1631] Correctly set the Yarn app name when launching the AM.
• [SPARK-1755] Respect SparkSubmit --name on YARN · 8b784129
      Andrew Or authored
      Right now, SparkSubmit ignores the `--name` flag for both yarn-client and yarn-cluster. This is a bug.
      
      In client mode, SparkSubmit treats `--name` as a [cluster config](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L170) and does not propagate this to SparkContext.
      
      In cluster mode, SparkSubmit passes this flag to `org.apache.spark.deploy.yarn.Client`, which only uses it for the [YARN ResourceManager](https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L80), but does not propagate this to SparkContext.
      
      This PR ensures that `spark.app.name` is always set if SparkSubmit receives the `--name` flag, which is what the usage promises. This makes it possible for applications to start a SparkContext with an empty conf `val sc = new SparkContext(new SparkConf)`, and inherit the app name from SparkSubmit.
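As a minimal sketch of an app written this way (the object name is hypothetical), which now inherits the name given via `spark-submit --name "My App" ...` in both modes:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical app: starts with an empty conf and relies on SparkSubmit
// to have set spark.app.name from the --name flag.
object NameDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf())
    println("app name: " + sc.getConf.get("spark.app.name"))
    sc.stop()
  }
}
```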
      
      Tested both modes on a YARN cluster.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #699 from andrewor14/yarn-app-name and squashes the following commits:
      
      98f6a79 [Andrew Or] Fix tests
      dea932f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-app-name
      c86d9ca [Andrew Or] Respect SparkSubmit --name on YARN
• Include the sbin/spark-config.sh in spark-executor · 2fd2752e
      Bouke van der Bijl authored
This is needed because broadcast values are broken in PySpark on Mesos: it tries to import pyspark but can't, as PYTHONPATH is not set due to changes in ff5be9a4
      
      https://issues.apache.org/jira/browse/SPARK-1725
      
      Author: Bouke van der Bijl <boukevanderbijl@gmail.com>
      
      Closes #651 from bouk/include-spark-config-in-mesos-executor and squashes the following commits:
      
      b2f1295 [Bouke van der Bijl] Inline PYTHONPATH in spark-executor
      eedbbcc [Bouke van der Bijl] Include the sbin/spark-config.sh in spark-executor
• Bug fix of sparse vector conversion · 191279ce
      Funes authored
Fixed a small bug caused by an inconsistency between the index/data array sizes and the vector length.
      
      Author: Funes <tianshaocun@gmail.com>
      Author: funes <tianshaocun@gmail.com>
      
      Closes #661 from funes/bugfix and squashes the following commits:
      
      edb2b9d [funes] remove unused import
      75dced3 [Funes] update test case
      d129a66 [Funes] Add test for sparse breeze by vector builder
      64e7198 [Funes] Copy data only when necessary
      b85806c [Funes] Bug fix of sparse vector conversion
• [SPARK-1157][MLlib] Bug fix: lossHistory should exclude rejection steps, and remove miniBatch · 910a13b3
      DB Tsai authored
Getting the lossHistory from Breeze's API, which already excludes the rejection steps in line search. Also, remove the miniBatch in LBFGS, since quasi-Newton methods approximate the inverse of the Hessian; it doesn't make sense if the gradients are computed from a varying objective.
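As a rough sketch of the approach, assuming Breeze's `optimize` API and a toy objective f(x) = ||x||^2 (this is not the MLlib code):

```scala
import breeze.linalg.DenseVector
import breeze.optimize.{DiffFunction, LBFGS}

// Toy objective f(x) = ||x||^2 with gradient 2x.
val f = new DiffFunction[DenseVector[Double]] {
  def calculate(x: DenseVector[Double]) = (x.dot(x), x * 2.0)
}

val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 50, m = 10)

// Each State corresponds to an accepted step, so mapping over `value`
// yields a loss history that already excludes rejected line-search steps.
val lossHistory =
  lbfgs.iterations(f, DenseVector.fill(3)(1.0)).map(_.value).toArray
```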
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #582 from dbtsai/dbtsai-lbfgs-bug and squashes the following commits:
      
      9cc6cf9 [DB Tsai] Removed the miniBatch in LBFGS.
      1ba6a33 [DB Tsai] Formatting the code.
      d72c679 [DB Tsai] Using Breeze's states to get the loss.
• MLlib documentation fix · d38febee
      DB Tsai authored
Fixed the documentation to reflect that `loadLibSVMData` was renamed to `loadLibSVMFile`.
      
      Author: DB Tsai <dbtsai@alpinenow.com>
      
      Closes #703 from dbtsai/dbtsai-docfix and squashes the following commits:
      
      71dd508 [DB Tsai] loadLibSVMData is changed to loadLibSVMFile
• [SPARK-1754] [SQL] Add missing arithmetic DSL operations. · 322b1808
      Takuya UESHIN authored
      Add missing arithmetic DSL operations: `unary_-`, `%`.
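For context, a hedged sketch of what the DSL enables; the import path and constructed node names are approximate for this point in the code base:

```scala
// Sketch only: with the catalyst DSL implicits in scope, Scala symbols
// become attribute references and operators build expression trees.
import org.apache.spark.sql.catalyst.dsl._

val negated = -'a        // unary_-  =>  roughly UnaryMinus(a)
val modded  = 'a % 'b    // %        =>  roughly Remainder(a, b)
val negPred = !'p        // !        =>  roughly Not(p)
```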
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #689 from ueshin/issues/SPARK-1754 and squashes the following commits:
      
      a09ef69 [Takuya UESHIN] Add also missing ! (not) operation.
      f73ae2c [Takuya UESHIN] Remove redundant tests.
      5b3f087 [Takuya UESHIN] Add tests relating DSL operations.
      e09c5b8 [Takuya UESHIN] Add missing arithmetic DSL operations.
• Fixing typo in als.py · 5c5e7d58
      Evan Sparks authored
      XtY should be Xty.
      
      Author: Evan Sparks <evan.sparks@gmail.com>
      
      Closes #696 from etrain/patch-2 and squashes the following commits:
      
      634cb8d [Evan Sparks] Fixing typo in als.py
• [SPARK-1745] Move interrupted flag from TaskContext constructor (minor) · c3f8b78c
      Andrew Or authored
      It makes little sense to start a TaskContext that is interrupted. Indeed, I searched for all use cases of it and didn't find a single instance in which `interrupted` is true on construction.
      
      This was inspired by reviewing #640, which adds an additional `@volatile var completed` that is similar. These are not the most urgent changes, but I wanted to push them out before I forget.
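The refactor pattern in miniature, with hypothetical, simplified classes (not Spark's actual TaskContext):

```scala
// Before: the flag could (pointlessly) be injected at construction time.
class TaskContextBefore(val stageId: Int, var interrupted: Boolean = false)

// After: the flag always starts false and is only flipped later,
// e.g. by the scheduler when the task is cancelled.
class TaskContextAfter(val stageId: Int) {
  @volatile var interrupted: Boolean = false
}
```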
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #675 from andrewor14/task-context and squashes the following commits:
      
      9575e02 [Andrew Or] Add space
      69455d1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into task-context
      c471490 [Andrew Or] Oops, removed one flag too many. Adding it back.
      85311f8 [Andrew Or] Move interrupted flag from TaskContext constructor
• SPARK-1565, update examples to be used with spark-submit script. · 44dd57fb
      Prashant Sharma authored
Commit for initial feedback. Basically, I am curious whether we should prompt the user to provide args, especially when they are mandatory, and whether we can skip them when they are not.
      
A few other things also did not work, like
      `bin/spark-submit examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.SparkALS --arg 100 500 10 5 2`
      
Not all the args get passed properly; maybe I have messed something up. I will try to sort it out.
      
      Author: Prashant Sharma <prashant.s@imaginea.com>
      
      Closes #552 from ScrapCodes/SPARK-1565/update-examples and squashes the following commits:
      
      669dd23 [Prashant Sharma] Review comments
      2727e70 [Prashant Sharma] SPARK-1565, update examples to be used with spark-submit script.
• [SQL] Improve SparkSQL Aggregates · 19c8fb02
      Michael Armbrust authored
      * Add native min/max (was using hive before).
      * Handle nulls correctly in Avg and Sum.
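A hedged sketch of the intended semantics, with a hypothetical table and 1.0-era API calls that may differ in detail:

```scala
import org.apache.spark.sql.SQLContext

case class Person(age: Option[Int])              // None models a NULL age

val sqlContext = new SQLContext(sc)              // assumes an existing `sc`
import sqlContext.createSchemaRDD

sc.parallelize(Seq(Person(Some(10)), Person(None), Person(Some(30))))
  .registerAsTable("people")

// Native MIN/MAX, and NULL-aware AVG/SUM that skip the NULL row:
val row = sqlContext.sql(
  "SELECT MIN(age), MAX(age), AVG(age), SUM(age) FROM people").first()
// expected: 10, 30, 20.0, 40
```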
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #683 from marmbrus/aggFixes and squashes the following commits:
      
      64fe30b [Michael Armbrust] Improve SparkSQL Aggregates * Add native min/max (was using hive before). * Handle nulls correctly in Avg and Sum.
  2. May 07, 2014
• Use numpy directly for matrix multiply. · 6ed7e2cd
      Evan Sparks authored
      Using matrix multiply to compute XtX and XtY yields a 5-20x speedup depending on problem size.
      
For example, the following takes 19s locally after this change vs. 5m21s before it (a 16x speedup).
      bin/pyspark examples/src/main/python/als.py local[8] 1000 1000 50 10 10
      
      Author: Evan Sparks <evan.sparks@gmail.com>
      
      Closes #687 from etrain/patch-1 and squashes the following commits:
      
      e094dbc [Evan Sparks] Touching only diaganols on update.
      d1ab9b6 [Evan Sparks] Use numpy directly for matrix multiply.
• SPARK-1668: Add implicit preference as an option to examples/MovieLensALS · 108c4c16
      Sandeep authored
Add --implicitPrefs as a command-line option to the example app MovieLensALS under examples/.
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #597 from techaddict/SPARK-1668 and squashes the following commits:
      
      8b371dc [Sandeep] Second Pass on reviews by mengxr
      eca9d37 [Sandeep] based on mengxr's suggestions
      937e54c [Sandeep] Changes
      5149d40 [Sandeep] Changes based on review
      1dd7657 [Sandeep] use mean()
      42444d7 [Sandeep] Based on Suggestions by mengxr
      e3082fa [Sandeep] SPARK-1668: Add implicit preference as an option to examples/MovieLensALS Add --implicitPrefs as an command-line option to the example app MovieLensALS under examples/
• SPARK-1544 Add support for deep decision trees. · f269b016
      Manish Amde authored
@etrain and I came up with a PR for arbitrarily deep decision trees, at the cost of multiple passes over the data at deep tree levels.
      
      To summarize:
      1) We take a parameter that indicates the amount of memory users want to reserve for computation on each worker (and 2x that at the driver).
      2) Using that information, we calculate two things - the maximum depth to which we train as usual (which is, implicitly, the maximum number of nodes we want to train in parallel), and the size of the groups we should use in the case where we exceed this depth.
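To make the calculation in (2) concrete, here is a purely illustrative sketch; all names and sizes are hypothetical, not the PR's code:

```scala
// Memory budget per worker (hypothetical numbers).
val maxMemoryMB = 128
val bytesPerNodeStats = 1 << 16   // aggregate size per tree node

val budgetBytes = maxMemoryMB.toLong * (1 << 20)
val maxNodesPerGroup = (budgetBytes / bytesPerNodeStats).toInt

// Level d trains 2^d nodes in parallel; the deepest level that still fits
// in a single pass is floor(log2(maxNodesPerGroup)).
val maxSinglePassLevel =
  (math.log(maxNodesPerGroup.toDouble) / math.log(2)).toInt

// Past that level, split each level's nodes into groups, at the cost of
// one pass over the data per group.
def numGroups(level: Int): Int =
  math.ceil(math.pow(2, level) / maxNodesPerGroup).toInt
```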
      
      cc: @atalwalkar, @hirakendu, @mengxr
      
      Author: Manish Amde <manish9ue@gmail.com>
      Author: manishamde <manish9ue@gmail.com>
      Author: Evan Sparks <sparks@cs.berkeley.edu>
      
      Closes #475 from manishamde/deep_tree and squashes the following commits:
      
      968ca9d [Manish Amde] merged master
      7fc9545 [Manish Amde] added docs
      ce004a1 [Manish Amde] minor formatting
      b27ad2c [Manish Amde] formatting
      426bb28 [Manish Amde] programming guide blurb
      8053fed [Manish Amde] more formatting
      5eca9e4 [Manish Amde] grammar
      4731cda [Manish Amde] formatting
      5e82202 [Manish Amde] added documentation, fixed off by 1 error in max level calculation
      cbd9f14 [Manish Amde] modified scala.math to math
      dad9652 [Manish Amde] removed unused imports
      e0426ee [Manish Amde] renamed parameter
      718506b [Manish Amde] added unit test
      1517155 [Manish Amde] updated documentation
      9dbdabe [Manish Amde] merge from master
      719d009 [Manish Amde] updating user documentation
      fecf89a [manishamde] Merge pull request #6 from etrain/deep_tree
      0287772 [Evan Sparks] Fixing scalastyle issue.
      2f1e093 [Manish Amde] minor: added doc for maxMemory parameter
      2f6072c [manishamde] Merge pull request #5 from etrain/deep_tree
      abc5a23 [Evan Sparks] Parameterizing max memory.
      50b143a [Manish Amde] adding support for very deep trees
• Update GradientDescentSuite.scala · 0c19bb16
      baishuo(白硕) authored
Use a faster way to construct an array.
      
      Author: baishuo(白硕) <vc_java@hotmail.com>
      
      Closes #588 from baishuo/master and squashes the following commits:
      
      45b95fb [baishuo(白硕)] Update GradientDescentSuite.scala
      c03b61c [baishuo(白硕)] Update GradientDescentSuite.scala
      b666d27 [baishuo(白硕)] Update GradientDescentSuite.scala
• [SPARK-1743][MLLIB] add loadLibSVMFile and saveAsLibSVMFile to pyspark · 3188553f
      Xiangrui Meng authored
      Make loading/saving labeled data easier for pyspark users.
      
Also changed the type check in `SparseVector` to allow numpy integers.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #672 from mengxr/pyspark-mllib-util and squashes the following commits:
      
      2943fa7 [Xiangrui Meng] format docs
      d61668d [Xiangrui Meng] add loadLibSVMFile and saveAsLibSVMFile to pyspark
• SPARK-1569 Spark on Yarn, authentication broken by pr299 · 4bec84b6
      Thomas Graves authored
Pass the configs as java options, since the executor needs to know before it registers whether to create the connection using authentication or not. We could see about passing only the authentication configs, but for now I just had it pass them all.
      
I also updated it to use a list to construct the command, to make it the same as ClientBase and avoid any issues with spaces.
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #649 from tgravescs/SPARK-1569 and squashes the following commits:
      
      0178ab8 [Thomas Graves] add akka settings
      22a8735 [Thomas Graves] Change to only path spark.auth* configs
      8ccc1d4 [Thomas Graves] SPARK-1569 Spark on Yarn, authentication broken
• [SPARK-1688] Propagate PySpark worker stderr to driver · 52008722
      Andrew Or authored
      When at least one of the following conditions is true, PySpark cannot be loaded:
      
      1. PYTHONPATH is not set
      2. PYTHONPATH does not contain the python directory (or jar, in the case of YARN)
      3. The jar does not contain pyspark files (YARN)
      4. The jar does not contain py4j files (YARN)
      
      However, we currently throw the same random `java.io.EOFException` for all of the above cases, when trying to read from the python daemon's output. This message is super unhelpful.
      
      This PR includes the python stderr and the PYTHONPATH in the exception propagated to the driver. Now, the exception message looks something like:
      
      ```
      Error from python worker:
        : No module named pyspark
      PYTHONPATH was:
        /path/to/spark/python:/path/to/some/jar
      java.io.EOFException
        <stack trace>
      ```
      
      whereas before it was just
      
      ```
      java.io.EOFException
        <stack trace>
      ```
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #603 from andrewor14/pyspark-exception and squashes the following commits:
      
      10d65d3 [Andrew Or] Throwable -> Exception, worker -> daemon
      862d1d7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
      a5ed798 [Andrew Or] Use block string and interpolation instead of var (minor)
      cc09c45 [Andrew Or] Account for the fact that the python daemon may not have terminated yet
      444f019 [Andrew Or] Use the new RedirectThread + include system PYTHONPATH
      aab00ae [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
      0cc2402 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
      783efe2 [Andrew Or] Make python daemon stderr indentation consistent
      9524172 [Andrew Or] Avoid potential NPE / error stream contention + Move things around
      29f9688 [Andrew Or] Add back original exception type
      e92d36b [Andrew Or] Include python worker stderr in the exception propagated to the driver
      7c69360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
      cdbc185 [Andrew Or] Fix python attribute not found exception when PYTHONPATH is not set
      dcc0353 [Andrew Or] Check both python and system environment variables for PYTHONPATH
      6c09c21 [Andrew Or] Validate PYTHONPATH and PySpark modules before starting python workers
• Typo fix: fetchting -> fetching · d00981a9
      Andrew Ash authored
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #680 from ash211/patch-3 and squashes the following commits:
      
      9ce3746 [Andrew Ash] Typo fix: fetchting -> fetching
• Nicer logging for SecurityManager startup · 7f6f4a10
      Andrew Ash authored
      Happy to open a jira ticket if you'd like to track one there.
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #678 from ash211/SecurityManagerLogging and squashes the following commits:
      
      2aa0b7a [Andrew Ash] Nicer logging for SecurityManager startup
• [SQL] Fix Performance Issue in data type casting · ca431868
      Cheng Hao authored
Using a lazy val instead of a function in the class Cast improved the performance by nearly 2x in my local micro-benchmark.
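The design choice in miniature (a hypothetical class, not the real Cast):

```scala
// A def re-derives the conversion closure on every evaluation, while a
// lazy val builds it once per instance and reuses it on the hot path.
class CastLike(build: () => (Any => Any)) {
  def castViaDef: Any => Any = build()           // re-created on every call
  lazy val castViaLazyVal: Any => Any = build()  // created once, then cached
}
```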
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #679 from chenghao-intel/fix_type_casting and squashes the following commits:
      
      71b0902 [Cheng Hao] using lazy val object instead of function for data type casting
• SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions · 3308722c
      Aaron Davidson authored
      This patch includes several cleanups to PythonRDD, focused around fixing [SPARK-1579](https://issues.apache.org/jira/browse/SPARK-1579) cleanly. Listed in order of approximate importance:
      
      - The Python daemon waits for Spark to close the socket before exiting,
        in order to avoid causing spurious IOExceptions in Spark's
        `PythonRDD::WriterThread`.
      - Removes the Python Monitor Thread, which polled for task cancellations
        in order to kill the Python worker. Instead, we do this in the
        onCompleteCallback, since this is guaranteed to be called during
        cancellation.
      - Adds a "completed" variable to TaskContext to avoid the issue noted in
        [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), where onCompleteCallbacks may be execution-order dependent.
        Along with this, I removed the "context.interrupted = true" flag in
        the onCompleteCallback.
      - Extracts PythonRDD::WriterThread to its own class.
      
      Since this patch provides an alternative solution to [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), I did test it with
      
      ```
      sc.textFile("latlon.tsv").take(5)
      ```
      
      many times without error.
      
      Additionally, in order to test the unswallowed exceptions, I performed
      
      ```
      sc.textFile("s3n://<big file>").count()
      ```
      
      and cut my internet during execution. Prior to this patch, we got the "stdin writer exited early" message, which was unhelpful. Now, we get the SocketExceptions propagated through Spark to the user and get proper (though unsuccessful) task retries.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #640 from aarondav/pyspark-io and squashes the following commits:
      
      b391ff8 [Aaron Davidson] Detect "clean socket shutdowns" and stop waiting on the socket
      c0c49da [Aaron Davidson] SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions
• [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations... · 967635a2
      Kan Zhang authored
      ... that do not change schema
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #448 from kanzhang/SPARK-1460 and squashes the following commits:
      
      111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD
      91dc787 [Kan Zhang] Taking into account newly added Ordering param
      79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
• [WIP][Spark-SQL] Optimize the Constant Folding for Expression · 3eb53bd5
      Cheng Hao authored
Currently, expressions do not support "constant null" well in constant folding;
e.g. Sum(a, null) actually always produces Literal(0, NumericType) at runtime.
      
      For example:
      ```
      explain select isnull(key+null)  from src;
      == Logical Plan ==
      Project [HiveGenericUdf#isnull((key#30 + CAST(null, IntegerType))) AS c_0#28]
       MetastoreRelation default, src, None
      
      == Optimized Logical Plan ==
      Project [true AS c_0#28]
       MetastoreRelation default, src, None
      
      == Physical Plan ==
      Project [true AS c_0#28]
       HiveTableScan [], (MetastoreRelation default, src, None), None
      ```
      
I've created a new optimization rule called NullPropagation for this kind of constant folding.
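A toy sketch of the idea, using standalone types (not Catalyst's classes or the actual rule):

```scala
sealed trait Expr
case object NullLit               extends Expr
case class Lit(v: Int)            extends Expr
case class Attr(name: String)     extends Expr
case class Add(l: Expr, r: Expr)  extends Expr
case class IsNull(e: Expr)        extends Expr
case class BoolLit(b: Boolean)    extends Expr

def propagateNulls(e: Expr): Expr = e match {
  case Add(l, r) =>
    (propagateNulls(l), propagateNulls(r)) match {
      case (NullLit, _) | (_, NullLit) => NullLit   // null + x  =>  null
      case (l2, r2)                    => Add(l2, r2)
    }
  case IsNull(inner) =>
    propagateNulls(inner) match {
      case NullLit => BoolLit(true)                 // isnull(null) => true
      case other   => IsNull(other)
    }
  case other => other
}

// Mirrors the plan above: isnull(key + null) folds to the literal true.
// propagateNulls(IsNull(Add(Attr("key"), NullLit))) == BoolLit(true)
```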
      
      Author: Cheng Hao <hao.cheng@intel.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #482 from chenghao-intel/optimize_constant_folding and squashes the following commits:
      
      2f14b50 [Cheng Hao] Fix code style issues
      68b9fad [Cheng Hao] Remove the Literal pattern matching for NullPropagation
      29c8166 [Cheng Hao] Update the code for feedback of code review
      50444cc [Cheng Hao] Remove the unnecessary null checking
      80f9f18 [Cheng Hao] Update the UnitTest for aggregation constant folding
      27ea3d7 [Cheng Hao] Fix Constant Folding Bugs & Add More Unittests
      b28e03a [Cheng Hao] Merge pull request #1 from marmbrus/pr/482
      9ccefdb [Michael Armbrust] Add tests for optimized expression evaluation.
      543ef9d [Cheng Hao] fix code style issues
      9cf0396 [Cheng Hao] update code according to the code review comment
      536c005 [Cheng Hao] Add Exceptional case for constant folding
      3c045c7 [Cheng Hao] Optimize the Constant Folding by adding more rules
      2645d4f [Cheng Hao] Constant Folding(null propagation)
• SPARK-1746: Support setting SPARK_JAVA_OPTS on executors for backwards compatibility · 913a0a9c
      Patrick Wendell authored
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #676 from pwendell/worker-opts and squashes the following commits:
      
      54456c4 [Patrick Wendell] SPARK-1746: Support setting SPARK_JAVA_OPTS on executors for backwards compatibility
  3. May 06, 2014
• [HOTFIX] SPARK-1637: There are some Streaming examples added after the PR #571 was last updated. · fdae095d
      Sandeep authored
This resulted in compilation errors.
cc @mateiz: the project is not compiling currently.
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #673 from techaddict/SPARK-1637-HOTFIX and squashes the following commits:
      
      b512f4f [Sandeep] [SPARK-1637][HOTFIX] There are some Streaming examples added after the PR #571 was last updated. This resulted in Compilation Errors.
• Proposal: clarify Scala programming guide on caching ... · 48ba3b8c
      Ethan Jewett authored
... with regard to saved map output. Wording taken partially from Matei Zaharia's email to the Spark user list. http://apache-spark-user-list.1001560.n3.nabble.com/performance-improvement-on-second-operation-without-caching-td5227.html
      
      Author: Ethan Jewett <esjewett@gmail.com>
      
      Closes #668 from esjewett/Doc-update and squashes the following commits:
      
      11793ce [Ethan Jewett] Update based on feedback
      171e670 [Ethan Jewett] Clarify Scala programming guide on caching ...
• SPARK-1727. Correct small compile errors, typos, and markdown issues in (primarily) MLlib docs · 25ad8f93
      Sean Owen authored
      While play-testing the Scala and Java code examples in the MLlib docs, I noticed a number of small compile errors, and some typos. This led to finding and fixing a few similar items in other docs.
      
      Then in the course of building the site docs to check the result, I found a few small suggestions for the build instructions. I also found a few more formatting and markdown issues uncovered when I accidentally used maruku instead of kramdown.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #653 from srowen/SPARK-1727 and squashes the following commits:
      
      6e7c38a [Sean Owen] Final doc updates - one more compile error, and use of mean instead of sum and count
      8f5e847 [Sean Owen] Fix markdown syntax issues that maruku flags, even though we use kramdown (but only those that do not affect kramdown's output)
      99966a9 [Sean Owen] Update issue tracker URL in docs
      23c9ac3 [Sean Owen] Add Scala Naive Bayes example, to use existing example data file (whose format needed a tweak)
      8c81982 [Sean Owen] Fix small compile errors and typos across MLlib docs
• SPARK-1637: Clean up examples for 1.0 · a000b5c3
      Sandeep authored
      - [x] Move all of them into subpackages of org.apache.spark.examples (right now some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
      - [x] Move Python examples into examples/src/main/python
      - [x] Update docs to reflect these changes
      
      Author: Sandeep <sandeep@techaddict.me>
      
      This patch had conflicts when merged, resolved by
      Committer: Matei Zaharia <matei@databricks.com>
      
      Closes #571 from techaddict/SPARK-1637 and squashes the following commits:
      
      47ef86c [Sandeep] Changes based on Discussions on PR, removing use of RawTextHelper from examples
      8ed2d3f [Sandeep] Docs Updated for changes, Change for java examples
      5f96121 [Sandeep] Move Python examples into examples/src/main/python
      0a8dd77 [Sandeep] Move all Scala Examples to org.apache.spark.examples (some are in org.apache.spark.streaming.examples, for instance, and others are in org.apache.spark.examples.mllib)
• SPARK-1737: Warn rather than fail when Java 7+ is used to create distributions · 39b8b148
      Patrick Wendell authored
      Also moves a few lines of code around in make-distribution.sh.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #669 from pwendell/make-distribution and squashes the following commits:
      
      8bfac49 [Patrick Wendell] Small fix
      46918ec [Patrick Wendell] SPARK-1737: Warn rather than fail when Java 7+ is used to create distributions.
• [SPARK-1549] Add Python support to spark-submit · 951a5d93
      Matei Zaharia authored
      This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.
      
      This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.
      
      In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.
      
      In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #664 from mateiz/py-submit and squashes the following commits:
      
      15e9669 [Matei Zaharia] Fix some uses of path.separator property
      051278c [Matei Zaharia] Small style fixes
      0afe886 [Matei Zaharia] Add license headers
      4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
      15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
      47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
      d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
• SPARK-1734: spark-submit throws an exception: Exception in thread "main"... · ec09acdd
      witgo authored
      ... java.lang.ClassNotFoundException: org.apache.spark.broadcast.TorrentBroadcastFactory
      
      Author: witgo <witgo@qq.com>
      
      Closes #665 from witgo/SPARK-1734 and squashes the following commits:
      
      cacf238 [witgo] SPARK-1734: spark-submit throws an exception: Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.broadcast.TorrentBroadcastFactory
• [SPARK-1685] Cancel retryTimer on restart of Worker or AppClient · fbfe69de
      Mark Hamstra authored
      See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.
      
      Author: Mark Hamstra <markhamstra@gmail.com>
      
      Closes #602 from markhamstra/SPARK-1685 and squashes the following commits:
      
      11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer
      69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient
• Fix two download suggestions in the docs: · 7b978c1a
      Patrick Wendell authored
      1) On the quick start page provide a direct link to the downloads (suggested by @pbailis).
      2) On the index page, don't suggest users always have to build Spark, since many won't.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #662 from pwendell/quick-start and squashes the following commits:
      
      0622f27 [Patrick Wendell] Fix two download suggestions in the docs:
• SPARK-1474: Spark on yarn assembly doesn't include AmIpFilter · 1e829905
      Thomas Graves authored
We use org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter in Spark on YARN, but it is not included in the assembly jar.
      
I tested this on a YARN cluster by removing the yarn jars from the classpath, and Spark runs fine now.
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #406 from tgravescs/SPARK-1474 and squashes the following commits:
      
      1548bf9 [Thomas Graves] SPARK-1474: Spark on yarn assembly doesn't include org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
• Update OpenHashSet.scala · 0a5a4681
      ArcherShao authored
Fix the incorrect comment on the function addWithoutResize.
      
      Author: ArcherShao <ArcherShao@users.noreply.github.com>
      
      Closes #667 from ArcherShao/patch-3 and squashes the following commits:
      
      a607358 [ArcherShao] Update OpenHashSet.scala
• [SQL] SPARK-1732 - Support for null primitive values. · 3c64750b
      Michael Armbrust authored
      I also removed a println that I bumped into.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #658 from marmbrus/nullPrimitives and squashes the following commits:
      
      a3ec4f3 [Michael Armbrust] Remove println.
      695606b [Michael Armbrust] Support for null primatives from using scala and java reflection.
• [SPARK-1735] Add the missing special profiles to make-distribution.sh · a2262cdb
      Andrew Or authored
73b0cbcc introduced a few special profiles that are not covered in `make-distribution.sh`. This affects Hadoop versions 2.2.x, 2.3.x, and 2.4.x. Without these special profiles, a Java version error for protobuf is thrown at run time.
      
      I took the opportunity to rewrite the way we construct the maven command. Previously, the only hadoop version that triggered the `yarn-alpha` profile was 0.23.x, which was inconsistent with the [docs](https://github.com/apache/spark/blob/master/docs/building-with-maven.md). This is now generalized to hadoop versions from 0.23.x to 2.1.x.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #660 from andrewor14/hadoop-distribution and squashes the following commits:
      
      6740126 [Andrew Or] Generalize the yarn profile to hadoop versions 2.2+
      88f192d [Andrew Or] Add the required special profiles to make-distribution.sh
  4. May 05, 2014
• [SPARK-1678][SPARK-1679] In-memory compression bug fix and made compression... · 6d721c5f
      Cheng Lian authored
      [SPARK-1678][SPARK-1679] In-memory compression bug fix and made compression configurable, disabled by default
      
      In-memory compression is now configurable in `SparkConf` by the `spark.sql.inMemoryCompression.enabled` property, and is disabled by default.
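A minimal sketch of opting in, using the property name above (the app name is hypothetical):

```scala
import org.apache.spark.SparkConf

// Compression is off by default; opt in explicitly:
val conf = new SparkConf()
  .setAppName("columnar-compression-demo")
  .set("spark.sql.inMemoryCompression.enabled", "true")
```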
      
To help code review, the bug fix is in [the first commit](https://github.com/liancheng/spark/commit/d537a367edf0bf24d0b925cc58b21d805ccbc11f), and the compression configuration is in [the second one](https://github.com/liancheng/spark/commit/4ce09aa8aa820bbbbbaa0f3f084a6cff1d4e6195).
      
      Author: Cheng Lian <lian.cs.zju@gmail.com>
      
      Closes #608 from liancheng/spark-1678 and squashes the following commits:
      
      66c3a8d [Cheng Lian] Renamed in-memory compression configuration key
      f8fb3a0 [Cheng Lian] Added assertion for testing .hasNext of various decoder
      4ce09aa [Cheng Lian] Made in-memory compression configurable via SparkConf
      d537a36 [Cheng Lian] Fixed SPARK-1678
• [SPARK-1594][MLLIB] Cleaning up MLlib APIs and guide · 98750a74
      Xiangrui Meng authored
      Final pass before the v1.0 release.
      
      * Remove `VectorRDDs`
      * Move `BinaryClassificationMetrics` from `evaluation.binary` to `evaluation`
* Change the default value of `addIntercept` to false and allow adding an intercept in Ridge and Lasso.
      * Clean `DecisionTree` package doc and test suite.
      * Mark model constructors `private[spark]`
      * Rename `loadLibSVMData` to `loadLibSVMFile` and hide `LabelParser` from users.
      * Add `saveAsLibSVMFile`.
      * Add `appendBias` to `MLUtils`.
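A short sketch of the renamed and added MLUtils entry points; the input path is hypothetical and an existing SparkContext `sc` is assumed:

```scala
import org.apache.spark.mllib.util.MLUtils

// Load LabeledPoints from LIBSVM text format, then write them back out.
val data = MLUtils.loadLibSVMFile(sc, "mllib/data/sample_libsvm_data.txt")
MLUtils.saveAsLibSVMFile(data, "out/libsvm")
```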
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #524 from mengxr/mllib-cleaning and squashes the following commits:
      
      295dc8b [Xiangrui Meng] update loadLibSVMFile doc
      1977ac1 [Xiangrui Meng] fix doc of appendBias
      649fcf0 [Xiangrui Meng] rename loadLibSVMData to loadLibSVMFile; hide LabelParser from user APIs
      54b812c [Xiangrui Meng] add appendBias
      a71e7d0 [Xiangrui Meng] add saveAsLibSVMFile
      d976295 [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
      b7e5cec [Xiangrui Meng] remove some experimental annotations and make model constructors private[mllib]
      9b02b93 [Xiangrui Meng] minor code style update
      a593ddc [Xiangrui Meng] fix python tests
      fc28c18 [Xiangrui Meng] mark more classes experimental
      f6cbbff [Xiangrui Meng] fix Java tests
      0af70b0 [Xiangrui Meng] minor
      6e139ef [Xiangrui Meng] Merge branch 'master' into mllib-cleaning
      94e6dce [Xiangrui Meng] move BinaryLabelCounter and BinaryConfusionMatrixImpl to evaluation.binary
      df34907 [Xiangrui Meng] clean DecisionTreeSuite to use LocalSparkContext
      c81807f [Xiangrui Meng] set the default value of AddIntercept to false
      03389c0 [Xiangrui Meng] allow to add intercept in Ridge and Lasso
      c66c56f [Xiangrui Meng] move tree md to package object doc
      a2695df [Xiangrui Meng] update guide for BinaryClassificationMetrics
      9194f4c [Xiangrui Meng] move BinaryClassificationMetrics one level up
      1c1a0e3 [Xiangrui Meng] remove VectorRDDs because it only contains one function that is not necessary for us to maintain