  1. Apr 13, 2014
    • Xusen Yin's avatar
      [SPARK-1415] Hadoop min split for wholeTextFiles() · 037fe4d2
      Xusen Yin authored
      JIRA issue [here](https://issues.apache.org/jira/browse/SPARK-1415).
      
      The new Hadoop `InputFormat` API does not provide a `minSplits` parameter, which makes the APIs of `HadoopRDD` and `NewHadoopRDD` incompatible. This PR constructs compatible APIs.
      
      Though `minSplits` is deprecated by the new Hadoop API, we think it is better to keep the APIs compatible here.
      
      **Note** that `minSplits` in `wholeTextFiles` can only be treated as a *suggestion*; the real number of splits may not reach `minSplits` because `isSplitable()` returns `false`.
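
      A minimal usage sketch (the input path is an illustrative assumption, and the second argument is the suggested minimum number of splits described above):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      object WholeTextFilesExample {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("whole-text-files").setMaster("local[2]"))

          // Each element is a (filePath, fileContent) pair. The second argument is the
          // suggested minimum number of splits; because individual files are not
          // splittable, it is only a hint.
          val files = sc.wholeTextFiles("/tmp/input-dir", 8)

          files.map { case (path, content) => (path, content.length) }
               .collect()
               .foreach(println)

          sc.stop()
        }
      }
      ```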
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #376 from yinxusen/hadoop-min-split and squashes the following commits:
      
      76417f6 [Xusen Yin] refine comments
      c10af60 [Xusen Yin] refine comments and rewrite new class for wholeTextFile
      766d05b [Xusen Yin] refine Java API and comments
      4875755 [Xusen Yin] add minSplits for WholeTextFiles
      037fe4d2
    • Patrick Wendell's avatar
      SPARK-1480: Clean up use of classloaders · 4bc07eeb
      Patrick Wendell authored
      The Spark codebase has been a bit fast and loose when accessing classloaders, and this has caused a few bugs to surface in master.
      
      This patch defines some utility methods for accessing classloaders. This makes the intention when accessing a classloader much more explicit in the code and fixes a few cases where the wrong one was chosen.
      
      case (a) -> We want the classloader that loaded Spark
      case (b) -> We want the context class loader, or if not present, we want (a)
      
      This patch provides a better fix for SPARK-1403 (https://issues.apache.org/jira/browse/SPARK-1403) than the current workaround, which it reverts. It also fixes a previously unreported bug: the `./spark-submit` script did not work for running with a `local` master. It didn't work because the executor classloader did not properly delegate to the context class loader (if one is defined), and in local mode the context class loader is set by the `./spark-submit` script. A unit test is added for that case.
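
      A rough sketch of what such utility methods can look like (the object and method names here are illustrative assumptions, not necessarily what the patch adds):

      ```scala
      object ClassLoaderUtils {
        /** Case (a): the classloader that loaded the Spark classes. */
        def getSparkClassLoader: ClassLoader = getClass.getClassLoader

        /** Case (b): the thread's context classloader, or (a) if none is set. */
        def getContextOrSparkClassLoader: ClassLoader =
          Option(Thread.currentThread().getContextClassLoader)
            .getOrElse(getSparkClassLoader)
      }
      ```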
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #398 from pwendell/class-loaders and squashes the following commits:
      
      b4a1a58 [Patrick Wendell] Minor clean up
      14f1272 [Patrick Wendell] SPARK-1480: Clean up use of classloaders
      4bc07eeb
  2. Apr 12, 2014
    • Bharath Bhushan's avatar
      [SPARK-1403] Move the class loader creation back to where it was in 0.9.0 · ca11919e
      Bharath Bhushan authored
      [SPARK-1403] I investigated why Spark 0.9.0 loads fine on Mesos while Spark 1.0.0 fails. What I found was that in SparkEnv.scala, while creating the SparkEnv object, the current thread's classloader is null. But in 0.9.0, at the same place, it is set to org.apache.spark.repl.ExecutorClassLoader. I saw that https://github.com/apache/spark/commit/7edbea41b43e0dc11a2de156be220db8b7952d01 moved it to its current place. I moved it back and saw that 1.0.0 started working fine on Mesos.
      
      I just created a minimal patch that allows me to run spark on mesos correctly. It seems like SecurityManager's creation needs to be taken into account for a correct fix. Also moving the creation of the serializer out of SparkEnv might be a part of the right solution. PTAL.
      
      Author: Bharath Bhushan <manku.timma@outlook.com>
      
      Closes #322 from manku-timma/spark-1403 and squashes the following commits:
      
      606c2b9 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
      ec8f870 [Bharath Bhushan] revert the logger change for java 6 compatibility as PR 334 is doing it
      728beca [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
      044027d [Bharath Bhushan] fix compile error
      6f260a4 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
      b3a053f [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
      04b9662 [Bharath Bhushan] add missing line
      4803c19 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
      f3c9a14 [Bharath Bhushan] Merge remote-tracking branch 'upstream/master' into spark-1403
      42d3d6a [Bharath Bhushan] used code fragment from @ueshin to fix the problem in a better way
      89109d7 [Bharath Bhushan] move the class loader creation back to where it was in 0.9.0
      ca11919e
    • Andrew Or's avatar
      [Fix #204] Update out-dated comments · c2d160fb
      Andrew Or authored
      This PR is self-explanatory.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #381 from andrewor14/master and squashes the following commits:
      
      3e8dde2 [Andrew Or] Fix comments for #204
      c2d160fb
    • Tathagata Das's avatar
      [SPARK-1386] Web UI for Spark Streaming · 6aa08c39
      Tathagata Das authored
      When debugging Spark Streaming applications it is necessary to monitor certain metrics that are not shown in the Spark application UI. For example, what is the average processing time of batches? What is the scheduling delay? Is the system able to process data as fast as it is receiving it? How many records am I receiving through my receivers?
      
      While the StreamingListener interface introduced in 0.9 provided some of this information, it could only be accessed programmatically. A UI that shows information specific to streaming applications is necessary for easier debugging. This PR introduces such a UI. It shows various statistics related to the streaming application. Here is a screenshot of the UI running on my local machine.
      
      http://i.imgur.com/1ooDGhm.png
      
      This UI is integrated into the Spark UI running at 4040.
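
      For comparison, the pre-existing programmatic route via `StreamingListener` looks roughly like this (a hedged sketch; the exact `BatchInfo` field names are assumptions based on the 0.9-era API):

      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

      object BatchStatsListener {
        def main(args: Array[String]): Unit = {
          val ssc = new StreamingContext(
            new SparkConf().setAppName("batch-stats").setMaster("local[2]"), Seconds(1))

          // Print per-batch processing and scheduling delays as batches complete.
          ssc.addStreamingListener(new StreamingListener {
            override def onBatchCompleted(batch: StreamingListenerBatchCompleted): Unit = {
              val info = batch.batchInfo  // field names assumed from the 0.9-era listener API
              println(s"batch ${info.batchTime}: " +
                s"processing=${info.processingDelay.getOrElse(-1L)} ms, " +
                s"scheduling=${info.schedulingDelay.getOrElse(-1L)} ms")
            }
          })

          // ... set up input streams and output operations here, then:
          // ssc.start(); ssc.awaitTermination()
        }
      }
      ```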
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #290 from tdas/streaming-web-ui and squashes the following commits:
      
      fc73ca5 [Tathagata Das] Merge pull request #9 from andrewor14/ui-refactor
      642dd88 [Andrew Or] Merge SparkUISuite.scala into UISuite.scala
      eb30517 [Andrew Or] Merge github.com:apache/spark into ui-refactor
      f4f4cbe [Tathagata Das] More minor fixes.
      34bb364 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
      252c566 [Tathagata Das] Merge pull request #8 from andrewor14/ui-refactor
      e038b4b [Tathagata Das] Addressed Patrick's comments.
      125a054 [Andrew Or] Disable serving static resources with gzip
      90feb8d [Andrew Or] Address Patrick's comments
      89dae36 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
      72fe256 [Tathagata Das] Merge pull request #6 from andrewor14/ui-refactor
      2fc09c8 [Tathagata Das] Added binary check exclusions
      aa396d4 [Andrew Or] Rename tabs and pages (No more IndexPage.scala)
      f8e1053 [Tathagata Das] Added Spark and Streaming UI unit tests.
      caa5e05 [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
      585cd65 [Tathagata Das] Merge pull request #5 from andrewor14/ui-refactor
      914b8ff [Tathagata Das] Moved utils functions to UIUtils.
      548c98c [Andrew Or] Wide refactoring of WebUI, UITab, and UIPage (see commit message)
      6de06b0 [Tathagata Das] Merge remote-tracking branch 'apache/master' into streaming-web-ui
      ee6543f [Tathagata Das] Minor changes based on Andrew's comments.
      fa760fe [Tathagata Das] Fixed long line.
      1c0bcef [Tathagata Das] Refactored streaming UI into two files.
      1af239b [Tathagata Das] Changed streaming UI to attach itself as a tab with the Spark UI.
      827e81a [Tathagata Das] Merge branch 'streaming-web-ui' of github.com:tdas/spark into streaming-web-ui
      168fe86 [Tathagata Das] Merge pull request #2 from andrewor14/ui-refactor
      3e986f8 [Tathagata Das] Merge remote-tracking branch 'apache/master' into streaming-web-ui
      c78c92d [Andrew Or] Remove outdated comment
      8f7323b [Andrew Or] End of file new lines, indentation, and imports (minor)
      0d61ee8 [Andrew Or] Merge branch 'streaming-web-ui' of github.com:tdas/spark into ui-refactor
      9a48fa1 [Andrew Or] Allow adding tabs to SparkUI dynamically + add example
      61358e3 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into streaming-web-ui
      53be2c5 [Tathagata Das] Minor style updates.
      ed25dfc [Andrew Or] Generalize SparkUI header to display tabs dynamically
      a37ad4f [Andrew Or] Comments, imports and formatting (minor)
      cd000b0 [Andrew Or] Merge github.com:apache/spark into ui-refactor
      7d57444 [Andrew Or] Refactoring the UI interface to add flexibility
      aef4dd5 [Tathagata Das] Added Apache licenses.
      db27bad [Tathagata Das] Added last batch processing time to StreamingUI.
      4d86e98 [Tathagata Das] Added basic stats to the StreamingUI and refactored the UI to a Page to make it easier to transition to using SparkUI later.
      93f1c69 [Tathagata Das] Added network receiver information to the Streaming UI.
      56cc7fb [Tathagata Das] First cut implementation of Streaming UI.
      6aa08c39
    • Sean Owen's avatar
      SPARK-1057 (alternative) Remove fastutil · 165e06a7
      Sean Owen authored
      (This is for discussion at this point -- I'm not suggesting this should be committed.)
      
      This is what removing fastutil looks like. Much of it is straightforward, like using `java.io` buffered stream classes, and Guava for murmurhash3.
      
      Uses of the `FastByteArrayOutputStream` were a little trickier. In only one case though do I think the change to use `java.io` actually entails an extra array copy.
      
      The rest is using `OpenHashMap` and `OpenHashSet`.  These are now written in terms of more Scala-like operations.
      
      `OpenHashMap` is where I made three non-trivial changes to make it work, and they need review:
      
      - It is no longer private
      - The key must be a `ClassTag`
      - Unless a lot of other code changes, the key type can't enforce being a supertype of `Null`
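
      To make the `ClassTag` point concrete, here is a toy open-addressing map (an illustrative stand-in, not the real `OpenHashMap`); the generic key array can only be allocated because a `ClassTag[K]` is in scope:

      ```scala
      import scala.reflect.ClassTag

      // Illustrative only: no resizing, so capacity must exceed the number of distinct keys.
      class TinyOpenHashMap[K: ClassTag, V: ClassTag](capacity: Int = 64) {
        private val keys = new Array[K](capacity)      // allocation requires ClassTag[K]
        private val values = new Array[V](capacity)
        private val occupied = new Array[Boolean](capacity)

        // Linear probing: find the slot holding k, or the first empty slot.
        private def slot(k: K): Int = {
          var i = (k.hashCode & 0x7fffffff) % capacity
          while (occupied(i) && keys(i) != k) i = (i + 1) % capacity
          i
        }

        def update(k: K, v: V): Unit = {
          val i = slot(k)
          keys(i) = k; values(i) = v; occupied(i) = true
        }

        def get(k: K): Option[V] = {
          val i = slot(k)
          if (occupied(i)) Some(values(i)) else None
        }
      }
      ```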
      
      It all works and tests pass, and I think there is reason to believe it's OK from a speed perspective.
      
      But what about those last changes?
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #266 from srowen/SPARK-1057-alternate and squashes the following commits:
      
      2601129 [Sean Owen] Fix Map return type error not previously caught
      ec65502 [Sean Owen] Updates from matei's review
      00bc81e [Sean Owen] Remove use of fastutil and replace with use of java.io, spark.util and Guava classes
      165e06a7
  3. Apr 11, 2014
    • baishuo(白硕)'s avatar
      Update WindowedDStream.scala · aa8bb117
      baishuo(白硕) authored
      Update the exception message thrown when windowDuration is not a multiple of parent.slideDuration.
      
      Author: baishuo(白硕) <vc_java@hotmail.com>
      
      Closes #390 from baishuo/windowdstream and squashes the following commits:
      
      533c968 [baishuo(白硕)] Update WindowedDStream.scala
      aa8bb117
    • Xusen Yin's avatar
      [WIP] [SPARK-1328] Add vector statistics · fdfb45e6
      Xusen Yin authored
      With the new vector system in MLlib, we find it useful to add some new APIs to process `RDD[Vector]`. Besides, the former implementation of `computeStat` is not numerically stable: it can lose precision and may produce `NaN` in scientific computing, as noted in [SPARK-1328](https://spark-project.atlassian.net/browse/SPARK-1328).
      
      The APIs include:
      
      * rowMeans(): RDD[Double]
      * rowNorm2(): RDD[Double]
      * rowSDs(): RDD[Double]
      * colMeans(): Vector
      * colMeans(size: Int): Vector
      * colNorm2(): Vector
      * colNorm2(size: Int): Vector
      * colSDs(): Vector
      * colSDs(size: Int): Vector
      * maxOption((Vector, Vector) => Boolean): Option[Vector]
      * minOption((Vector, Vector) => Boolean): Option[Vector]
      * rowShrink(): RDD[Vector]
      * colShrink(): RDD[Vector]
      
      This is a work in progress; some more APIs will be added for `LabeledPoint`. Moreover, the implicit declaration will move from `MLUtils` to `MLContext` later.
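
      A hedged illustration of the column-wise statistics idea (not the merged API): a single-pass aggregation over the row data, with a comment on why the naive variance formula motivates the stability fix:

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}
      import org.apache.spark.mllib.linalg.Vectors

      object ColumnStatsSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("col-stats").setMaster("local[2]"))
          val data = sc.parallelize(Seq(
            Vectors.dense(1.0, 2.0, 3.0),
            Vectors.dense(4.0, 5.0, 6.0),
            Vectors.dense(7.0, 8.0, 9.0))).map(_.toArray)

          val numCols = 3
          // Single pass: count, per-column sums, and per-column sums of squares.
          val (n, sums, sumSq) = data.aggregate((0L, new Array[Double](numCols), new Array[Double](numCols)))(
            { case ((cnt, s, sq), row) =>
              var i = 0
              while (i < numCols) { s(i) += row(i); sq(i) += row(i) * row(i); i += 1 }
              (cnt + 1, s, sq)
            },
            { case ((c1, s1, q1), (c2, s2, q2)) =>
              (c1 + c2, s1.zip(s2).map(p => p._1 + p._2), q1.zip(q2).map(p => p._1 + p._2))
            })

          val means = sums.map(_ / n)
          // The textbook E[x^2] - E[x]^2 formula can lose precision and even go slightly
          // negative, which is the NaN issue SPARK-1328 replaces with a stable update.
          val stddevs = sumSq.zip(means).map { case (sq, m) => math.sqrt(math.max(sq / n - m * m, 0.0)) }

          println(s"column means:   ${means.mkString(", ")}")
          println(s"column stddevs: ${stddevs.mkString(", ")}")
          sc.stop()
        }
      }
      ```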
      
      Author: Xusen Yin <yinxusen@gmail.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #268 from yinxusen/vector-statistics and squashes the following commits:
      
      d61363f [Xusen Yin] rebase to latest master
      16ae684 [Xusen Yin] fix minor error and remove useless method
      10cf5d3 [Xusen Yin] refine some return type
      b064714 [Xusen Yin] remove computeStat in MLUtils
      cbbefdb [Xiangrui Meng] update multivariate statistical summary interface and clean tests
      4eaf28a [Xusen Yin] merge VectorRDDStatistics into RowMatrix
      48ee053 [Xusen Yin] fix minor error
      e624f93 [Xusen Yin] fix scala style error
      1fba230 [Xusen Yin] merge while loop together
      69e1f37 [Xusen Yin] remove lazy eval, and minor memory footprint
      548e9de [Xusen Yin] minor revision
      86522c4 [Xusen Yin] add comments on functions
      dc77e38 [Xusen Yin] test sparse vector RDD
      18cf072 [Xusen Yin] change def to lazy val to make sure that the computations in function be evaluated only once
      f7a3ca2 [Xusen Yin] fix the corner case of maxmin
      967d041 [Xusen Yin] full revision with Aggregator class
      138300c [Xusen Yin] add new Aggregator class
      1376ff4 [Xusen Yin] rename variables and adjust code
      4a5c38d [Xusen Yin] add scala doc, refine code and comments
      036b7a5 [Xusen Yin] fix the bug of Nan occur
      f6e8e9a [Xusen Yin] add sparse vectors test
      4cfbadf [Xusen Yin] fix bug of min max
      4e4fbd1 [Xusen Yin] separate seqop and combop out as independent functions
      a6d5a2e [Xusen Yin] rewrite for only computing non-zero elements
      3980287 [Xusen Yin] rename variables
      62a2c3e [Xusen Yin] use axpy and in-place if possible
      9a75ebd [Xusen Yin] add case class to wrap return values
      d816ac7 [Xusen Yin] remove useless APIs
      c4651bb [Xusen Yin] remove row-wise APIs and refine code
      1338ea1 [Xusen Yin] all-in-one version test passed
      cc65810 [Xusen Yin] add parallel mean and variance
      9af2e95 [Xusen Yin] refine the code style
      ad6c82d [Xusen Yin] add shrink test
      e09d5d2 [Xusen Yin] add scala docs and refine shrink method
      8ef3377 [Xusen Yin] pass all tests
      28cf060 [Xusen Yin] fix error of column means
      54b19ab [Xusen Yin] add new API to shrink RDD[Vector]
      8c6c0e1 [Xusen Yin] add basic statistics
      fdfb45e6
    • Xiangrui Meng's avatar
      [FIX] make coalesce test deterministic in RDDSuite · 7038b00b
      Xiangrui Meng authored
      Make coalesce test deterministic by setting pre-defined seeds. (Saw random failures in other PRs.)
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #387 from mengxr/fix-random and squashes the following commits:
      
      59bc16f [Xiangrui Meng] make coalesce test deterministic in RDDSuite
      7038b00b
    • Patrick Wendell's avatar
      HOTFIX: Ignore python metastore files in RAT checks. · 6a0f8e35
      Patrick Wendell authored
      This was causing some errors with pull request tests.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #393 from pwendell/hotfix and squashes the following commits:
      
      6201dd3 [Patrick Wendell] HOTFIX: Ignore python metastore files in RAT checks.
      6a0f8e35
    • Xiangrui Meng's avatar
      [SPARK-1225, 1241] [MLLIB] Add AreaUnderCurve and BinaryClassificationMetrics · f5ace8da
      Xiangrui Meng authored
      This PR implements a generic version of `AreaUnderCurve` using the `RDD.sliding` implementation from https://github.com/apache/spark/pull/136 . It also contains refactoring of https://github.com/apache/spark/pull/160 for binary classification evaluation.
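
      A hedged sketch of the underlying computation, using plain Scala collections and `sliding` rather than the `RDD.sliding` primitive from #136 (the curve points are made up):

      ```scala
      object AreaUnderCurveSketch {
        /** Trapezoidal rule over consecutive points of a curve given as (x, y) pairs. */
        def areaUnderCurve(curve: Seq[(Double, Double)]): Double =
          curve.sliding(2).collect { case Seq((x1, y1), (x2, y2)) =>
            (x2 - x1) * (y1 + y2) / 2.0
          }.sum

        def main(args: Array[String]): Unit = {
          // A toy ROC curve: (false positive rate, true positive rate), anchored at (0,0) and (1,1).
          val roc = Seq((0.0, 0.0), (0.1, 0.6), (0.3, 0.8), (1.0, 1.0))
          println(f"AUC = ${areaUnderCurve(roc)}%.3f")
        }
      }
      ```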
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #364 from mengxr/auc and squashes the following commits:
      
      a05941d [Xiangrui Meng] replace TP/FP/TN/FN by their full names
      3f42e98 [Xiangrui Meng] add (0, 0), (1, 1) to roc, and (0, 1) to pr
      fb4b6d2 [Xiangrui Meng] rename Evaluator to Metrics and add more metrics
      b1b7dab [Xiangrui Meng] fix code styles
      9dc3518 [Xiangrui Meng] add tests for BinaryClassificationEvaluator
      ca31da5 [Xiangrui Meng] remove PredictionAndResponse
      3d71525 [Xiangrui Meng] move binary evalution classes to evaluation.binary
      8f78958 [Xiangrui Meng] add PredictionAndResponse
      dda82d5 [Xiangrui Meng] add confusion matrix
      aa7e278 [Xiangrui Meng] add initial version of binary classification evaluator
      221ebce [Xiangrui Meng] add a new test to sliding
      a920865 [Xiangrui Meng] Merge branch 'sliding' into auc
      a9b250a [Xiangrui Meng] move sliding to mllib
      cab9a52 [Xiangrui Meng] use last for the last element
      db6cb30 [Xiangrui Meng] remove unnecessary toSeq
      9916202 [Xiangrui Meng] change RDD.sliding return type to RDD[Seq[T]]
      284d991 [Xiangrui Meng] change SlidedRDD to SlidingRDD
      c1c6c22 [Xiangrui Meng] add AreaUnderCurve
      65461b2 [Xiangrui Meng] Merge branch 'sliding' into auc
      5ee6001 [Xiangrui Meng] add TODO
      d2a600d [Xiangrui Meng] add sliding to rdd
      f5ace8da
    • Patrick Wendell's avatar
      Some clean up in build/docs · 98225a6e
      Patrick Wendell authored
      (a) Deleted an outdated line from the docs
      (b) Removed a workaround that is no longer necessary given the Mesos version bump.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #382 from pwendell/maven-clean and squashes the following commits:
      
      f0447fa [Patrick Wendell] Minor doc clean-up
      98225a6e
    • Thomas Graves's avatar
      SPARK-1417: Spark on Yarn - spark UI link from resourcemanager is broken · 446bb341
      Thomas Graves authored
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #344 from tgravescs/SPARK-1417 and squashes the following commits:
      
      c450b5f [Thomas Graves] fix test
      e1c1d7e [Thomas Graves] add missing $ to appUIAddress
      e982ddb [Thomas Graves] use appUIHostPort in appUIAddress
      0803ec2 [Thomas Graves] Review comment updates - remove extra newline, simplify assert in test
      658a8ec [Thomas Graves] Add a appUIHostPort routine
      0614208 [Thomas Graves] Fix test
      2a6b1b7 [Thomas Graves] SPARK-1417: Spark on Yarn - spark UI link from resourcemanager is broken
      446bb341
  4. Apr 10, 2014
    • Patrick Wendell's avatar
      SPARK-1202: Improvements to task killing in the UI. · 44f654ee
      Patrick Wendell authored
      1. Adds a separate endpoint for the killing logic that is outside of a page.
      2. Narrows the scope of the killingEnabled tracking.
      3. Some style improvements.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #386 from pwendell/kill-link and squashes the following commits:
      
      8efe02b [Patrick Wendell] Improvements to task killing in the UI.
      44f654ee
    • Harvey Feng's avatar
      Add Spark v0.9.1 to ec2 launch script and use it as the default · 7b4203ab
      Harvey Feng authored
      Mainly ported from branch-0.9.
      
      Author: Harvey Feng <hyfeng224@gmail.com>
      
      Closes #385 from harveyfeng/0.9.1-ec2 and squashes the following commits:
      
      769ac2f [Harvey Feng] Add Spark v0.9.1 to ec2 launch script and use it as the default
      7b4203ab
    • Ivan Wick's avatar
      Set spark.executor.uri from environment variable (needed by Mesos) · 5cd11d51
      Ivan Wick authored
      The Mesos backend uses this property when setting up a slave process.  It is similarly set in the Scala repl (org.apache.spark.repl.SparkILoop), but I couldn't find anything analogous for pyspark.
      
      Author: Ivan Wick <ivanwick+github@gmail.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Matei Zaharia <matei@databricks.com>
      
      Closes #311 from ivanwick/master and squashes the following commits:
      
      da0c3e4 [Ivan Wick] Set spark.executor.uri from environment variable (needed by Mesos)
      5cd11d51
    • Sundeep Narravula's avatar
      SPARK-1202 - Add a "cancel" button in the UI for stages · 2c557837
      Sundeep Narravula authored
      Author: Sundeep Narravula <sundeepn@superduel.local>
      Author: Sundeep Narravula <sundeepn@dhcpx-204-110.corp.yahoo.com>
      
      Closes #246 from sundeepn/uikilljob and squashes the following commits:
      
      5fdd0e2 [Sundeep Narravula] Fix test string
      f6fdff1 [Sundeep Narravula] Format fix; reduced line size to less than 100 chars
      d1daeb9 [Sundeep Narravula] Incorporating review comments.
      8d97923 [Sundeep Narravula] Ability to kill jobs thru the UI. This behavior can be turned on be settings the following variable: spark.ui.killEnabled=true (default=false) Adding DAGScheduler event StageCancelled and corresponding handlers. Added cancellation reason to handlers.
      2c557837
    • Michael Armbrust's avatar
      [SQL] Improve column pruning in the optimizer. · f99401a6
      Michael Armbrust authored
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #378 from marmbrus/columnPruning and squashes the following commits:
      
      779da56 [Michael Armbrust] More consistent naming.
      1a4e9ea [Michael Armbrust] More comments.
      2f4e7b9 [Michael Armbrust] Improve column pruning in the optimizer.
      f99401a6
    • Sandeep's avatar
      Remove Unnecessary Whitespace's · 930b70f0
      Sandeep authored
      These are stacked together in one commit; otherwise they show up chunk by chunk in different commits.
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #380 from techaddict/white_space and squashes the following commits:
      
      b58f294 [Sandeep] Remove Unnecessary Whitespace's
      930b70f0
    • Andrew Ash's avatar
      Update tuning.md · f0466625
      Andrew Ash authored
      http://stackoverflow.com/questions/9699071/what-is-the-javas-internal-represention-for-string-modified-utf-8-utf-16
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #384 from ash211/patch-2 and squashes the following commits:
      
      da1b0be [Andrew Ash] Update tuning.md
      f0466625
    • Patrick Wendell's avatar
      Revert "SPARK-1433: Upgrade Mesos dependency to 0.17.0" · 7b52b663
      Patrick Wendell authored
      This reverts commit 12c077d5.
      7b52b663
    • Sandeep's avatar
      SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 instead of complaining · 3bd31294
      Sandeep authored
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #356 from techaddict/1428 and squashes the following commits:
      
      3bdf5f6 [Sandeep] SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 instead of complaining
      3bd31294
    • Andrew Or's avatar
      [SPARK-1276] Add a HistoryServer to render persisted UI · 79820fe8
      Andrew Or authored
      The new feature of event logging, introduced in #42, allows the user to persist the details of his/her Spark application to storage, and later replay these events to reconstruct an after-the-fact SparkUI.
      Currently, however, a persisted UI can only be rendered through the standalone Master. This greatly limits the use case of this new feature as many people also run Spark on Yarn / Mesos.
      
      This PR introduces a new entity called the HistoryServer, which, given a log directory, keeps track of all completed applications independently of a Spark Master. Unlike the Master, the HistoryServer need not be running while the application is still running. It is relatively lightweight in that it only maintains static information about applications and performs no scheduling.
      
      To quickly test it out, generate event logs with ```spark.eventLog.enabled=true``` and run ```sbin/start-history-server.sh <log-dir-path>```. Your HistoryServer awaits on port 18080.
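
      A hedged sketch of enabling event logging from application code, using the property named above (the log directory is an illustrative assumption):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      object EventLogExample {
        def main(args: Array[String]): Unit = {
          val conf = new SparkConf()
            .setAppName("event-log-example")
            .setMaster("local[2]")
            // Persist UI events so a HistoryServer (or the standalone Master) can replay them later.
            .set("spark.eventLog.enabled", "true")
            .set("spark.eventLog.dir", "/tmp/spark-events")  // illustrative path

          val sc = new SparkContext(conf)
          sc.parallelize(1 to 1000).map(_ * 2).count()
          sc.stop()  // marks the application as completed in the event log
        }
      }
      ```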
      
      Comments and feedback are most welcome.
      
      ---
      
      A few other changes introduced in this PR include refactoring the WebUI interface, which is beginning to have a lot of duplicate code now that we have added more functionality to it. Two new SparkListenerEvents have been introduced (SparkListenerApplicationStart/End) to keep track of application name and start/finish times. This PR also clarifies the semantics of the ReplayListenerBus introduced in #42.
      
      A potential TODO in the future (not part of this PR) is to render live applications in addition to just completed applications. This is useful when applications fail, a condition that our current HistoryServer does not handle unless the user manually signals application completion (by creating the APPLICATION_COMPLETION file). Handling live applications becomes significantly more challenging, however, because it is now necessary to render the same SparkUI multiple times. To avoid reading the entire log every time, which is inefficient, we must handle reading the log from where we previously left off, but this becomes fairly complicated because we must deal with the arbitrary behavior of each input stream.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #204 from andrewor14/master and squashes the following commits:
      
      7b7234c [Andrew Or] Finished -> Completed
      b158d98 [Andrew Or] Address Patrick's comments
      69d1b41 [Andrew Or] Do not block on posting SparkListenerApplicationEnd
      19d5dd0 [Andrew Or] Merge github.com:apache/spark
      f7f5bf0 [Andrew Or] Make history server's web UI port a Spark configuration
      2dfb494 [Andrew Or] Decouple checking for application completion from replaying
      d02dbaa [Andrew Or] Expose Spark version and include it in event logs
      2282300 [Andrew Or] Add documentation for the HistoryServer
      567474a [Andrew Or] Merge github.com:apache/spark
      6edf052 [Andrew Or] Merge github.com:apache/spark
      19e1fb4 [Andrew Or] Address Thomas' comments
      248cb3d [Andrew Or] Limit number of live applications + add configurability
      a3598de [Andrew Or] Do not close file system with ReplayBus + fix bind address
      bc46fc8 [Andrew Or] Merge github.com:apache/spark
      e2f4ff9 [Andrew Or] Merge github.com:apache/spark
      050419e [Andrew Or] Merge github.com:apache/spark
      81b568b [Andrew Or] Fix strange error messages...
      0670743 [Andrew Or] Decouple page rendering from loading files from disk
      1b2f391 [Andrew Or] Minor changes
      a9eae7e [Andrew Or] Merge branch 'master' of github.com:apache/spark
      d5154da [Andrew Or] Styling and comments
      5dbfbb4 [Andrew Or] Merge branch 'master' of github.com:apache/spark
      60bc6d5 [Andrew Or] First complete implementation of HistoryServer (only for finished apps)
      7584418 [Andrew Or] Report application start/end times to HistoryServer
      8aac163 [Andrew Or] Add basic application table
      c086bd5 [Andrew Or] Add HistoryServer and scripts ++ Refactor WebUI interface
      79820fe8
    • witgo's avatar
      Fix SPARK-1413: Parquet messes up stdout and stdin when used in Spark REPL · a74fbbbc
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #325 from witgo/SPARK-1413 and squashes the following commits:
      
      e57cd8e [witgo] use scala reflection to access and call the SLF4JBridgeHandler  methods
      45c8f40 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      5e35d87 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      0d5f819 [witgo] review commit
      45e5b70 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      fa69dcf [witgo] Merge branch 'master' into SPARK-1413
      3c98dc4 [witgo] Merge branch 'master' into SPARK-1413
      38160cb [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      ba09bcd [witgo] remove set the parquet log level
      a63d574 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      5231ecd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      3feb635 [witgo] parquet logger use parent handler
      fa00d5d [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      8bb6ffd [witgo] enableLogForwarding note fix
      edd9630 [witgo]  move to
      f447f50 [witgo] merging master
      5ad52bd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      76670c1 [witgo] review commit
      70f3c64 [witgo] Fix SPARK-1413
      a74fbbbc
    • Patrick Wendell's avatar
      e6d4a74d
    • Sandeep's avatar
      SPARK-1446: Spark examples should not do a System.exit · e55cc4ba
      Sandeep authored
      Spark examples should exit nicely using the SparkContext.stop() method, rather than System.exit.
      System.exit can cause issues like those in SPARK-1407.
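
      A minimal sketch of the recommended shape (an illustrative example, not one of the actual Spark examples):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      object ExampleApp {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext(new SparkConf().setAppName("example").setMaster("local[2]"))
          try {
            val count = sc.parallelize(1 to 100).filter(_ % 2 == 0).count()
            println(s"even numbers: $count")
          } finally {
            // Shut down cleanly; calling System.exit here can skip listener/event-log
            // shutdown work and trigger problems like SPARK-1407.
            sc.stop()
          }
        }
      }
      ```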
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #370 from techaddict/1446 and squashes the following commits:
      
      e9234cf [Sandeep] SPARK-1446: Spark examples should not do a System.exit Spark examples should exit nice using SparkContext.stop() method, rather than System.exit System.exit can cause issues like in SPARK-1407
      e55cc4ba
  5. Apr 09, 2014
    • William Benton's avatar
      SPARK-729: Closures not always serialized at capture time · 8ca3b2bc
      William Benton authored
      [SPARK-729](https://spark-project.atlassian.net/browse/SPARK-729) concerns when free variables in closure arguments to transformations are captured.  Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created).  There are a few possible approaches to solving this problem and this PR will discuss some of them.  The approach I took has the advantage of being simple, obviously correct, and minimally-invasive, but it preserves something that has been bothering me about Spark's closure handling, so I'd like to discuss an alternative and get some feedback on whether or not it is worth pursuing.
      
      ## What I did
      
      The basic approach I took depends on the work I did for #143, and so this PR is based atop that.  Specifically: #143 modifies `ClosureCleaner.clean` to preemptively determine whether or not closures are serializable immediately upon closure cleaning (rather than waiting for a job involving that closure to be scheduled).  Thus non-serializable closure exceptions will be triggered by the line defining the closure rather than where the closure is used.
      
      Since the easiest way to determine whether or not a closure is serializable is to attempt to serialize it, the code in #143 is creating a serialized closure as part of `ClosureCleaner.clean`.  `clean` currently modifies its argument, and the method in `SparkContext` that wraps it returns a value (a reference to the modified-in-place argument).  This branch modifies `ClosureCleaner.clean` so that it returns a value:  if it is cleaning a serializable closure, it returns the result of deserializing its serialized argument; therefore it is returning a closure with an environment captured at cleaning time.  `SparkContext.clean` then returns the result of `ClosureCleaner.clean`, rather than a reference to its modified-in-place argument.
      
      I've added tests for this behavior (777a1bc).  The pull request as it stands, given the changes in #143, is nearly trivial.  There is some overhead from deserializing the closure, but it is minimal and the benefit of obvious operational correctness (vs. a more sophisticated but harder-to-validate transformation in `ClosureCleaner`) seems pretty important.  I think this is a fine way to solve this problem, but it's not perfect.
      
      ## What we might want to do
      
      The thing that has been bothering me about Spark's handling of closures is that it seems like we should be able to statically ensure that cleaning and serialization happen exactly once for a given closure.  If we serialize a closure in order to determine whether or not it is serializable, we should be able to hang on to the generated byte buffer and use it instead of re-serializing the closure later.  By replacing closures with instances of a sum type that encodes whether or not a closure has been cleaned or serialized, we could handle clean, to-be-cleaned, and serialized closures separately with case matches.  Here's a somewhat-concrete sketch (taken from my git stash) of what this might look like:
      
      ```scala
      package org.apache.spark.util
      
      import java.nio.ByteBuffer
      import scala.reflect.ClassManifest
      
      sealed abstract class ClosureBox[T] { def func: T }
      final case class RawClosure[T](func: T) extends ClosureBox[T] {}
      final case class CleanedClosure[T](func: T) extends ClosureBox[T] {}
      final case class SerializedClosure[T](func: T, bytebuf: ByteBuffer) extends ClosureBox[T] {}
      
      object ClosureBoxImplicits {
        implicit def closureBoxFromFunc[T <: AnyRef](fun: T) = new RawClosure[T](fun)
      }
      ```
      
      With these types declared, we'd be able to change `ClosureCleaner.clean` to take a `ClosureBox[T=>U]` (possibly generated by implicit conversion) and return a `ClosureBox[T=>U]` (either a `CleanedClosure[T=>U]` or a `SerializedClosure[T=>U]`, depending on whether or not serializability-checking was enabled) instead of a `T=>U`.  A case match could thus short-circuit cleaning or serializing closures that had already been cleaned or serialized (both in `ClosureCleaner` and in the closure serializer).  Cleaned-and-serialized closures would be represented by a boxed tuple of the original closure and a serialized copy (complete with an environment quiesced at transformation time).  Additional implicit conversions could convert from `ClosureBox` instances to the underlying function type where appropriate.  Tracking this sort of state in the type system seems like the right thing to do to me.
      
      ### Why we might not want to do that
      
      _It's pretty invasive._  Every function type used by every `RDD` subclass would have to change to reflect that they expected a `ClosureBox[T=>U]` instead of a `T=>U`.  This obscures what's going on and is not a little ugly.  Although I really like the idea of using the type system to enforce the clean-or-serialize once discipline, it might not be worth adding another layer of types (even if we could hide some of the extra boilerplate with judicious application of implicit conversions).
      
      _It statically guarantees a property whose absence is unlikely to cause any serious problems as it stands._  It appears that all closures are currently dynamically cleaned once and it's not obvious that repeated closure-cleaning is likely to be a problem in the future.  Furthermore, serializing closures is relatively cheap, so doing it once to check for serialization and once again to actually ship them across the wire doesn't seem like a big deal.
      
      Taken together, these seem like a high price to pay for statically guaranteeing that closures are operated upon only once.
      
      ## Other possibilities
      
      I felt like the serialize-and-deserialize approach was best due to its obvious simplicity.  But it would be possible to do a more sophisticated transformation within `ClosureCleaner.clean`.  It might also be possible for `clean` to modify its argument in a way so that whether or not a given closure had been cleaned would be apparent upon inspection; this would buy us some of the operational benefits of the `ClosureBox` approach but not the static cleanliness.
      
      I'm interested in any feedback or discussion on whether or not the problems with the type-based approach indeed outweigh the advantage, as well as of approaches to this issue and to closure handling in general.
      
      Author: William Benton <willb@redhat.com>
      
      Closes #189 from willb/spark-729 and squashes the following commits:
      
      f4cafa0 [William Benton] Stylistic changes and cleanups
      b3d9c86 [William Benton] Fixed style issues in tests
      9b56ce0 [William Benton] Added array-element capture test
      97e9d91 [William Benton] Split closure-serializability failure tests
      12ef6e3 [William Benton] Skip proactive closure capture for runJob
      8ee3ee7 [William Benton] Predictable closure environment capture
      12c63a7 [William Benton] Added tests for variable capture in closures
      d6e8dd6 [William Benton] Don't check serializability of DStream transforms.
      4ecf841 [William Benton] Make proactive serializability checking optional.
      d8df3db [William Benton] Adds proactive closure-serializablilty checking
      21b4b06 [William Benton] Test cases for SPARK-897.
      d5947b3 [William Benton] Ensure assertions in Graph.apply are asserted.
      8ca3b2bc
    • Xiangrui Meng's avatar
      [SPARK-1357 (fix)] remove empty line after :: DeveloperApi/Experimental :: · 0adc932a
      Xiangrui Meng authored
      Remove empty line after :: DeveloperApi/Experimental :: in comments to make the original doc show up in the preview of the generated html docs. Thanks @andrewor14 !
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #373 from mengxr/api and squashes the following commits:
      
      9c35bdc [Xiangrui Meng] remove the empty line after :: DeveloperApi/Experimental ::
      0adc932a
    • Kan Zhang's avatar
      SPARK-1407 drain event queue before stopping event logger · eb5f2b64
      Kan Zhang authored
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #366 from kanzhang/SPARK-1407 and squashes the following commits:
      
      cd0629f [Kan Zhang] code refactoring and adding test
      b073ee6 [Kan Zhang] SPARK-1407 drain event queue before stopping event logger
      eb5f2b64
    • Xiangrui Meng's avatar
      [SPARK-1357] [MLLIB] Annotate developer and experimental APIs · bde9cc11
      Xiangrui Meng authored
      Annotate developer and experimental APIs in MLlib.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #298 from mengxr/api and squashes the following commits:
      
      13390e8 [Xiangrui Meng] Merge branch 'master' into api
      dc4cbb3 [Xiangrui Meng] mark distribute matrices experimental
      6b9f8e2 [Xiangrui Meng] add Experimental annotation
      8773d0d [Xiangrui Meng] add DeveloperApi annotation
      da31733 [Xiangrui Meng] update developer and experimental tags
      555e0fe [Xiangrui Meng] Merge branch 'master' into api
      ef1a717 [Xiangrui Meng] mark some constructors private add default parameters to JavaDoc
      00ffbcc [Xiangrui Meng] update tree API annotation
      0b674fa [Xiangrui Meng] mark decision tree APIs
      86b9e34 [Xiangrui Meng] one pass over APIs of GLMs, NaiveBayes, and ALS
      f21d862 [Xiangrui Meng] Merge branch 'master' into api
      2b133d6 [Xiangrui Meng] intial annotation of developer and experimental apis
      bde9cc11
    • Patrick Wendell's avatar
      SPARK-1093: Annotate developer and experimental API's · 87bd1f9e
      Patrick Wendell authored
      This patch marks some existing classes as private[spark] and adds two types of API annotations:
      - `EXPERIMENTAL API` = experimental user-facing module
      - `DEVELOPER API - UNSTABLE` = developer-facing API that might change
      
      There is some discussion of the different mechanisms for doing this here:
      https://issues.apache.org/jira/browse/SPARK-1081
      
      I was pretty aggressive with marking things private. Keep in mind that if we want to open something up in the future we can, but we can never reduce visibility.
      
      A few notes here:
      - In the past we've been inconsistent with the visibility of the X-RDD classes. This patch marks them private whenever there is an existing function in RDD that can directly create them (e.g. CoalescedRDD and rdd.coalesce()). One trade-off here is users can't subclass them.
      - Noted that compression and serialization formats don't have to be wire compatible across versions.
      - Compression codecs and serialization formats are semi-private as users typically don't instantiate them directly.
      - Metrics sources are made private - users only interact with them through Spark's reflection
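
      A hedged sketch of how these annotations end up being applied (the annotated classes are made up for illustration; the annotation names mirror the later commits in this PR, e.g. `DeveloperApi` and `Experimental`):

      ```scala
      import org.apache.spark.annotation.{DeveloperApi, Experimental}

      /**
       * :: DeveloperApi ::
       * A developer-facing hook that may change between minor releases.
       */
      @DeveloperApi
      class CustomMetricsSource {  // illustrative class, not part of this patch
        def report(): Unit = println("metrics reported")
      }

      /**
       * :: Experimental ::
       * An experimental user-facing helper.
       */
      @Experimental
      object ApproxCounting {
        def approxDistinct(xs: Seq[Int]): Long = xs.distinct.size.toLong  // toy stand-in
      }
      ```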
      
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #274 from pwendell/private-apis and squashes the following commits:
      
      44179e4 [Patrick Wendell] Merge remote-tracking branch 'apache-github/master' into private-apis
      042c803 [Patrick Wendell] spark.annotations -> spark.annotation
      bfe7b52 [Patrick Wendell] Adding experimental for approximate counts
      8d0c873 [Patrick Wendell] Warning in SparkEnv
      99b223a [Patrick Wendell] Cleaning up annotations
      e849f64 [Patrick Wendell] Merge pull request #2 from andrewor14/annotations
      982a473 [Andrew Or] Generalize jQuery matching for non Spark-core API docs
      a01c076 [Patrick Wendell] Merge pull request #1 from andrewor14/annotations
      c1bcb41 [Andrew Or] DeveloperAPI -> DeveloperApi
      0d48908 [Andrew Or] Comments and new lines (minor)
      f3954e0 [Andrew Or] Add identifier tags in comments to work around scaladocs bug
      99192ef [Andrew Or] Dynamically add badges based on annotations
      824011b [Andrew Or] Add support for injecting arbitrary JavaScript to API docs
      037755c [Patrick Wendell] Some changes after working with andrew or
      f7d124f [Patrick Wendell] Small fixes
      c318b24 [Patrick Wendell] Use CSS styles
      e4c76b9 [Patrick Wendell] Logging
      f390b13 [Patrick Wendell] Better visibility for workaround constructors
      d6b0afd [Patrick Wendell] Small chang to existing constructor
      403ba52 [Patrick Wendell] Style fix
      870a7ba [Patrick Wendell] Work around for SI-8479
      7fb13b2 [Patrick Wendell] Changes to UnionRDD and EmptyRDD
      4a9e90c [Patrick Wendell] EXPERIMENTAL API --> EXPERIMENTAL
      c581dce [Patrick Wendell] Changes after building against Shark.
      8452309 [Patrick Wendell] Style fixes
      1ed27d2 [Patrick Wendell] Formatting and coloring of badges
      cd7a465 [Patrick Wendell] Code review feedback
      2f706f1 [Patrick Wendell] Don't use floats
      542a736 [Patrick Wendell] Small fixes
      cf23ec6 [Patrick Wendell] Marking GraphX as alpha
      d86818e [Patrick Wendell] Another naming change
      5a76ed6 [Patrick Wendell] More visiblity clean-up
      42c1f09 [Patrick Wendell] Using better labels
      9d48cbf [Patrick Wendell] Initial pass
      87bd1f9e
    • Xiangrui Meng's avatar
      [SPARK-1390] Refactoring of matrices backed by RDDs · 9689b663
      Xiangrui Meng authored
      This is to refactor the interfaces for matrices backed by RDDs. It would be better if we had a clear separation of local matrices and those backed by RDDs. Right now, we have
      
      1. `org.apache.spark.mllib.linalg.SparseMatrix`, which is a wrapper over an RDD of matrix entries, i.e., coordinate list format.
      2. `org.apache.spark.mllib.linalg.TallSkinnyDenseMatrix`, which is a wrapper over RDD[Array[Double]], i.e. row-oriented format.
      
      We will see a naming collision when we introduce a local `SparseMatrix`, and the name `TallSkinnyDenseMatrix` is no longer accurate if we switch to `RDD[Vector]` from `RDD[Array[Double]]`. It would be better to have "RDD" in the class name to suggest that operations may trigger jobs.
      
      The proposed names are (all under `org.apache.spark.mllib.linalg.rdd`):
      
      1. `RDDMatrix`: trait for matrices backed by one or more RDDs
      2. `CoordinateRDDMatrix`: wrapper of `RDD[(Long, Long, Double)]`
      3. `RowRDDMatrix`: wrapper of `RDD[Vector]` whose rows do not have special ordering
      4. `IndexedRowRDDMatrix`: wrapper of `RDD[(Long, Vector)]` whose rows are associated with indices
      
      The current code also introduces local matrices.
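
      A hedged sketch of the proposed hierarchy (purely illustrative signatures based on the list above; the eventually merged names and methods may differ):

      ```scala
      import org.apache.spark.mllib.linalg.Vector
      import org.apache.spark.rdd.RDD

      /** Trait for matrices backed by one or more RDDs. */
      trait RDDMatrix extends Serializable {
        def numRows(): Long
        def numCols(): Long
      }

      /** Coordinate-list format: one entry per stored (i, j, value). */
      class CoordinateRDDMatrix(val entries: RDD[(Long, Long, Double)]) extends RDDMatrix {
        override def numRows(): Long = entries.map(_._1).reduce(math.max) + 1
        override def numCols(): Long = entries.map(_._2).reduce(math.max) + 1
      }

      /** Row-oriented format whose rows have no special ordering. */
      class RowRDDMatrix(val rows: RDD[Vector]) extends RDDMatrix {
        override def numRows(): Long = rows.count()
        override def numCols(): Long = rows.first().size.toLong
      }

      /** Row-oriented format where each row carries its index. */
      class IndexedRowRDDMatrix(val rows: RDD[(Long, Vector)]) extends RDDMatrix {
        override def numRows(): Long = rows.map(_._1).reduce(math.max) + 1
        override def numCols(): Long = rows.first()._2.size.toLong
      }
      ```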
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #296 from mengxr/mat and squashes the following commits:
      
      24d8294 [Xiangrui Meng] fix for groupBy returning Iterable
      bfc2b26 [Xiangrui Meng] merge master
      8e4f1f5 [Xiangrui Meng] Merge branch 'master' into mat
      0135193 [Xiangrui Meng] address Reza's comments
      03cd7e1 [Xiangrui Meng] add pca/gram to IndexedRowMatrix add toBreeze to DistributedMatrix for test simplify tests
      b177ff1 [Xiangrui Meng] address Matei's comments
      be119fe [Xiangrui Meng] rename m/n to numRows/numCols for local matrix add tests for matrices
      b881506 [Xiangrui Meng] rename SparkPCA/SVD to TallSkinnyPCA/SVD
      e7d0d4a [Xiangrui Meng] move IndexedRDDMatrixRow to IndexedRowRDDMatrix
      0d1491c [Xiangrui Meng] fix test errors
      a85262a [Xiangrui Meng] rename RDDMatrixRow to IndexedRDDMatrixRow
      b8b6ac3 [Xiangrui Meng] Remove old code
      4cf679c [Xiangrui Meng] port pca to RowRDDMatrix, and add multiply and covariance
      7836e2f [Xiangrui Meng] initial refactoring of matrices backed by RDDs
      9689b663
    • Holden Karau's avatar
      Spark-939: allow user jars to take precedence over spark jars · fa0524fd
      Holden Karau authored
      I still need to do a small bit of refactoring (mostly the one Java file, which I'll switch back to a Scala file and use in both class loaders), but comments on other things I should do would be great.
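
      For context, a hedged sketch of the general child-first ("user classpath first") classloader idea; the class below is illustrative and not the patch's actual implementation:

      ```scala
      import java.net.{URL, URLClassLoader}

      /**
       * Loads classes from the given user URLs before delegating to the parent
       * (Spark) classloader, so user jars can shadow Spark's own dependencies.
       */
      class ChildFirstURLClassLoader(urls: Array[URL], parent: ClassLoader)
        extends URLClassLoader(urls, null) {

        override def loadClass(name: String, resolve: Boolean): Class[_] = {
          try {
            // Try the user-supplied jars first (the super parent is null, so this
            // only consults the bootstrap loader and the given URLs).
            super.loadClass(name, resolve)
          } catch {
            case _: ClassNotFoundException =>
              // Fall back to the Spark classloader for everything else.
              val c = parent.loadClass(name)
              if (resolve) resolveClass(c)
              c
          }
        }
      }
      ```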
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #217 from holdenk/spark-939-allow-user-jars-to-take-precedence-over-spark-jars and squashes the following commits:
      
      cf0cac9 [Holden Karau] Fix the executorclassloader
      1955232 [Holden Karau] Fix long line in TestUtils
      8f89965 [Holden Karau] Fix tests for new class name
      7546549 [Holden Karau] CR feedback, merge some of the testutils methods down, rename the classloader
      644719f [Holden Karau] User the class generator for the repl class loader tests too
      f0b7114 [Holden Karau] Fix the core/src/test/scala/org/apache/spark/executor/ExecutorURLClassLoaderSuite.scala tests
      204b199 [Holden Karau] Fix the generated classes
      9f68f10 [Holden Karau] Start rewriting the ExecutorURLClassLoaderSuite to not use the hard coded classes
      858aba2 [Holden Karau] Remove a bunch of test junk
      261aaee [Holden Karau] simplify executorurlclassloader a bit
      7a7bf5f [Holden Karau] CR feedback
      d4ae848 [Holden Karau] rewrite component into scala
      aa95083 [Holden Karau] CR feedback
      7752594 [Holden Karau] re-add https comment
      a0ef85a [Holden Karau] Fix style issues
      125ea7f [Holden Karau] Easier to just remove those files, we don't need them
      bb8d179 [Holden Karau] Fix issues with the repl class loader
      241b03d [Holden Karau] fix my rat excludes
      a343350 [Holden Karau] Update rat-excludes and remove a useless file
      d90d217 [Holden Karau] Fix fall back with custom class loader and add a test for it
      4919bf9 [Holden Karau] Fix parent calling class loader issue
      8a67302 [Holden Karau] Test are good
      9e2d236 [Holden Karau] It works comrade
      691ee00 [Holden Karau] It works ish
      dc4fe44 [Holden Karau] Does not depend on being in my home directory
      47046ff [Holden Karau] Remove bad import'
      22d83cb [Holden Karau] Add a test suite for the executor url class loader suite
      7ef4628 [Holden Karau] Clean up
      792d961 [Holden Karau] Almost works
      16aecd1 [Holden Karau] Doesn't quite work
      8d2241e [Holden Karau] Adda FakeClass for testing ClassLoader precedence options
      648b559 [Holden Karau] Both class loaders compile. Now for testing
      e1d9f71 [Holden Karau] One loader workers.
      fa0524fd
  6. Apr 08, 2014
    • Xiangrui Meng's avatar
      [SPARK-1434] [MLLIB] change labelParser from anonymous function to trait · b9e0c937
      Xiangrui Meng authored
      This is a patch to address @mateiz 's comment in https://github.com/apache/spark/pull/245
      
      MLUtils#loadLibSVMData uses an anonymous function for the label parser. Java users won't like it. So I made a trait for LabelParser and provided two implementations: binary and multiclass.
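
      A hedged sketch of the trait described above (names follow the description; the merged MLlib code may differ in detail):

      ```scala
      /** Parses the label field of a LIBSVM line into a Double. */
      trait LabelParser extends Serializable {
        def parse(labelString: String): Double
      }

      /** Maps any positive label to 1.0 and everything else to 0.0. */
      object BinaryLabelParser extends LabelParser {
        override def parse(labelString: String): Double =
          if (labelString.toDouble > 0) 1.0 else 0.0
      }

      /** Keeps the label value as-is for multiclass problems. */
      object MulticlassLabelParser extends LabelParser {
        override def parse(labelString: String): Double = labelString.toDouble
      }
      ```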
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #345 from mengxr/label-parser and squashes the following commits:
      
      ac44409 [Xiangrui Meng] use singleton objects for label parsers
      3b1a7c6 [Xiangrui Meng] add tests for label parsers
      c2e571c [Xiangrui Meng] rename LabelParser.apply to LabelParser.parse use extends for singleton
      11c94e0 [Xiangrui Meng] add return types
      7f8eb36 [Xiangrui Meng] change labelParser from annoymous function to trait
      b9e0c937
    • Holden Karau's avatar
      Spark 1271: Co-Group and Group-By should pass Iterable[X] · ce8ec545
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:
      
      f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
      77048f8 [Holden Karau] Fix merge up to master
      d3fe909 [Holden Karau] use toSeq instead
      7a092a3 [Holden Karau] switch resultitr to resultiterable
      eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
      c5075aa [Holden Karau] If guava 14 had iterables
      2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
      11e730c [Holden Karau] Fix streaming tests
      66b583d [Holden Karau] Fix the core test suite to compile
      4ed579b [Holden Karau] Refactor from iterator to iterable
      d052c07 [Holden Karau] Python tests now pass with iterator pandas
      3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
      cd1e81c [Holden Karau] Try and make pickling list iterators work
      c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
      88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
      a5ee714 [Holden Karau] oops, was checking wrong iterator
      e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
      ec8cc3e [Holden Karau] Fix test issues\!
      4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
      fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
      ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
      b692868 [Holden Karau] Revert
      7e533f7 [Holden Karau] Fix the bug
      8a5153a [Holden Karau] Revert me, but we have some stuff to debug
      b4e86a9 [Holden Karau] Add a join based on the problem in SVD
      c4510e2 [Holden Karau] Revert this but for now put things in list pandas
      b4e0b1d [Holden Karau] Fix style issues
      71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
      b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
      37888ec [Holden Karau] core/tests now pass
      249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
      6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
      fe992fe [Holden Karau] hmmm try and fix up basic operation suite
      172705c [Holden Karau] Fix Java API suite
      caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
      88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
      4991af6 [Holden Karau] Fix some tests
      be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
      687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
      ce8ec545
    • Sandeep's avatar
      SPARK-1433: Upgrade Mesos dependency to 0.17.0 · 12c077d5
      Sandeep authored
      Mesos 0.13.0 was released 6 months ago.
      Upgrade Mesos dependency to 0.17.0
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #355 from techaddict/mesos_update and squashes the following commits:
      
      f1abeee [Sandeep] SPARK-1433: Upgrade Mesos dependency to 0.17.0 Mesos 0.13.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
      12c077d5
    • Kay Ousterhout's avatar
      [SPARK-1397] Notify SparkListeners when stages fail or are cancelled. · fac6085c
      Kay Ousterhout authored
      [I wanted to post this for folks to comment but it depends on (and thus includes the changes in) a currently outstanding PR, #305.  You can look at just the second commit: https://github.com/kayousterhout/spark-1/commit/93f08baf731b9eaf5c9792a5373560526e2bccac to see just the changes relevant to this PR]
      
      Previously, when stages fail or get cancelled, the SparkListener is only notified
      indirectly through the SparkListenerJobEnd, where we sometimes pass in a single
      stage that failed.  This worked before job cancellation, because jobs would only fail
      due to a single stage failure.  However, with job cancellation, multiple running stages
      can fail when a job gets cancelled.  Right now, this is not handled correctly, which
      results in stages that get stuck in the “Running Stages” window in the UI even
      though they’re dead.
      
      This PR changes the SparkListenerStageCompleted event to a SparkListenerStageEnded
      event, and uses this event to tell SparkListeners when stages fail in addition to when
      they complete successfully.  This change is NOT publicly backward compatible for two
      reasons.  First, it changes the SparkListener interface.  We could alternately add a new event,
      SparkListenerStageFailed, and keep the existing SparkListenerStageCompleted.  However,
      this is less consistent with the listener events for tasks / jobs ending, and will result in some
      code duplication for listeners (because failed and completed stages are handled in similar
      ways).  Note that I haven’t finished updating the JSON code to correctly handle the new event
      because I’m waiting for feedback on whether this is a good or bad idea (hence the “WIP”).
      
      It is also not backwards compatible because it changes the publicly visible JobWaiter.jobFailed()
      method to no longer include a stage that caused the failure.  I think this change should definitely
      stay, because with cancellation (as described above), a failure isn’t necessarily caused by a
      single stage.
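
      A hedged sketch of the listener-side effect (the event and field names below are illustrative stand-ins, not the exact merged API):

      ```scala
      // Illustrative stand-ins, not Spark's actual listener API.
      case class StageEnded(stageId: Int, failureReason: Option[String])

      trait MiniSparkListener {
        def onStageEnded(event: StageEnded): Unit
      }

      class UIStageTracker extends MiniSparkListener {
        private val running = scala.collection.mutable.Set[Int]()

        def onStageSubmitted(stageId: Int): Unit = running += stageId

        // A single "stage ended" event covers both success and failure/cancellation,
        // so stages never get stuck in the "Running Stages" table.
        override def onStageEnded(event: StageEnded): Unit = {
          running -= event.stageId
          event.failureReason match {
            case Some(reason) => println(s"stage ${event.stageId} failed: $reason")
            case None         => println(s"stage ${event.stageId} completed")
          }
        }
      }
      ```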
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #309 from kayousterhout/stage_cancellation and squashes the following commits:
      
      5533ecd [Kay Ousterhout] Fixes in response to Mark's review
      320c7c7 [Kay Ousterhout] Notify SparkListeners when stages fail or are cancelled.
      fac6085c
    • Aaron Davidson's avatar
      SPARK-1445: compute-classpath should not print error if lib_managed not found · e25b5934
      Aaron Davidson authored
      This was added to the check for the assembly jar, but was forgotten for the datanucleus jars.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #361 from aarondav/cc and squashes the following commits:
      
      8facc16 [Aaron Davidson] SPARK-1445: compute-classpath should not print error if lib_managed not found
      e25b5934
    • Kan Zhang's avatar
      SPARK-1348 binding Master, Worker, and App Web UI to all interfaces · a8d86b08
      Kan Zhang authored
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #318 from kanzhang/SPARK-1348 and squashes the following commits:
      
      e625a5f [Kan Zhang] reverting the changes to startJettyServer()
      7a8084e [Kan Zhang] SPARK-1348 binding Master, Worker, and App Web UI to all interfaces
      a8d86b08
    • Henry Saputra's avatar
      Remove extra semicolon in import statement and unused import in ApplicationMaster · 3bc05489
      Henry Saputra authored
      Small nit cleanup to remove an extra semicolon and an unused import in Yarn's stable ApplicationMaster (it bothered me every time I saw it).
      
      Author: Henry Saputra <hsaputra@apache.org>
      
      Closes #358 from hsaputra/nitcleanup_removesemicolon_import_applicationmaster and squashes the following commits:
      
      bffb685 [Henry Saputra] Remove extra semicolon in import statement and unused import in ApplicationMaster.scala
      3bc05489