  1. Apr 11, 2014
    • SPARK-1417: Spark on Yarn - spark UI link from resourcemanager is broken · 446bb341
      Thomas Graves authored
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #344 from tgravescs/SPARK-1417 and squashes the following commits:
      
      c450b5f [Thomas Graves] fix test
      e1c1d7e [Thomas Graves] add missing $ to appUIAddress
      e982ddb [Thomas Graves] use appUIHostPort in appUIAddress
      0803ec2 [Thomas Graves] Review comment updates - remove extra newline, simplify assert in test
      658a8ec [Thomas Graves] Add a appUIHostPort routine
      0614208 [Thomas Graves] Fix test
      2a6b1b7 [Thomas Graves] SPARK-1417: Spark on Yarn - spark UI link from resourcemanager is broken
      446bb341
  2. Apr 10, 2014
    • SPARK-1202: Improvements to task killing in the UI. · 44f654ee
      Patrick Wendell authored
      1. Adds a separate endpoint for the killing logic that is outside of a page.
      2. Narrows the scope of the killingEnabled tracking.
      3. Some style improvements.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #386 from pwendell/kill-link and squashes the following commits:
      
      8efe02b [Patrick Wendell] Improvements to task killing in the UI.
      44f654ee
    • SPARK-1202 - Add a "cancel" button in the UI for stages · 2c557837
      Sundeep Narravula authored
      Author: Sundeep Narravula <sundeepn@superduel.local>
      Author: Sundeep Narravula <sundeepn@dhcpx-204-110.corp.yahoo.com>
      
      Closes #246 from sundeepn/uikilljob and squashes the following commits:
      
      5fdd0e2 [Sundeep Narravula] Fix test string
      f6fdff1 [Sundeep Narravula] Format fix; reduced line size to less than 100 chars
      d1daeb9 [Sundeep Narravula] Incorporating review comments.
      8d97923 [Sundeep Narravula] Ability to kill jobs thru the UI. This behavior can be turned on be settings the following variable: spark.ui.killEnabled=true (default=false) Adding DAGScheduler event StageCancelled and corresponding handlers. Added cancellation reason to handlers.
      2c557837
    • Remove Unnecessary Whitespace's · 930b70f0
      Sandeep authored
      These are stacked together in one commit; otherwise they show up chunk by chunk in different commits.
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #380 from techaddict/white_space and squashes the following commits:
      
      b58f294 [Sandeep] Remove Unnecessary Whitespace's
      930b70f0
    • Revert "SPARK-1433: Upgrade Mesos dependency to 0.17.0" · 7b52b663
      Patrick Wendell authored
      This reverts commit 12c077d5.
      7b52b663
    • [SPARK-1276] Add a HistoryServer to render persisted UI · 79820fe8
      Andrew Or authored
      The new feature of event logging, introduced in #42, allows the user to persist the details of his/her Spark application to storage, and later replay these events to reconstruct an after-the-fact SparkUI.
      Currently, however, a persisted UI can only be rendered through the standalone Master. This greatly limits the use case of this new feature as many people also run Spark on Yarn / Mesos.
      
      This PR introduces a new entity called the HistoryServer, which, given a log directory, keeps track of all completed applications independently of a Spark Master. Unlike the Master, the HistoryServer need not be running while the application is still running. It is relatively lightweight in that it only maintains static information about applications and performs no scheduling.
      
      To quickly test it out, generate event logs with ```spark.eventLog.enabled=true``` and run ```sbin/start-history-server.sh <log-dir-path>```. Your HistoryServer awaits on port 18080.
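
      For a quick illustration of the producer side, here is a minimal sketch (```spark.eventLog.enabled``` is the flag mentioned above; the `spark.eventLog.dir` key and the paths are assumptions):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      // Enable event logging so that a HistoryServer pointed at the same directory
      // can later replay the logs into a full SparkUI after the application finishes.
      val conf = new SparkConf()
        .setAppName("event-logging-example")
        .set("spark.eventLog.enabled", "true")
        .set("spark.eventLog.dir", "/tmp/spark-events") // assumed key for the log directory
      val sc = new SparkContext(conf)

      sc.parallelize(1 to 100).count() // do some work so the log has something to show
      sc.stop()
      ```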
      
      Comments and feedback are most welcome.
      
      ---
      
      A few other changes introduced in this PR include refactoring the WebUI interface, which is beginning to have a lot of duplicate code now that we have added more functionality to it. Two new SparkListenerEvents have been introduced (SparkListenerApplicationStart/End) to keep track of application name and start/finish times. This PR also clarifies the semantics of the ReplayListenerBus introduced in #42.
      
      A potential TODO in the future (not part of this PR) is to render live applications in addition to just completed applications. This is useful when applications fail, a condition that our current HistoryServer does not handle unless the user manually signals application completion (by creating the APPLICATION_COMPLETION file). Handling live applications becomes significantly more challenging, however, because it is now necessary to render the same SparkUI multiple times. To avoid reading the entire log every time, which is inefficient, we must handle reading the log from where we previously left off, but this becomes fairly complicated because we must deal with the arbitrary behavior of each input stream.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #204 from andrewor14/master and squashes the following commits:
      
      7b7234c [Andrew Or] Finished -> Completed
      b158d98 [Andrew Or] Address Patrick's comments
      69d1b41 [Andrew Or] Do not block on posting SparkListenerApplicationEnd
      19d5dd0 [Andrew Or] Merge github.com:apache/spark
      f7f5bf0 [Andrew Or] Make history server's web UI port a Spark configuration
      2dfb494 [Andrew Or] Decouple checking for application completion from replaying
      d02dbaa [Andrew Or] Expose Spark version and include it in event logs
      2282300 [Andrew Or] Add documentation for the HistoryServer
      567474a [Andrew Or] Merge github.com:apache/spark
      6edf052 [Andrew Or] Merge github.com:apache/spark
      19e1fb4 [Andrew Or] Address Thomas' comments
      248cb3d [Andrew Or] Limit number of live applications + add configurability
      a3598de [Andrew Or] Do not close file system with ReplayBus + fix bind address
      bc46fc8 [Andrew Or] Merge github.com:apache/spark
      e2f4ff9 [Andrew Or] Merge github.com:apache/spark
      050419e [Andrew Or] Merge github.com:apache/spark
      81b568b [Andrew Or] Fix strange error messages...
      0670743 [Andrew Or] Decouple page rendering from loading files from disk
      1b2f391 [Andrew Or] Minor changes
      a9eae7e [Andrew Or] Merge branch 'master' of github.com:apache/spark
      d5154da [Andrew Or] Styling and comments
      5dbfbb4 [Andrew Or] Merge branch 'master' of github.com:apache/spark
      60bc6d5 [Andrew Or] First complete implementation of HistoryServer (only for finished apps)
      7584418 [Andrew Or] Report application start/end times to HistoryServer
      8aac163 [Andrew Or] Add basic application table
      c086bd5 [Andrew Or] Add HistoryServer and scripts ++ Refactor WebUI interface
      79820fe8
    • Fix SPARK-1413: Parquet messes up stdout and stdin when used in Spark REPL · a74fbbbc
      witgo authored
      Author: witgo <witgo@qq.com>
      
      Closes #325 from witgo/SPARK-1413 and squashes the following commits:
      
      e57cd8e [witgo] use scala reflection to access and call the SLF4JBridgeHandler  methods
      45c8f40 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      5e35d87 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      0d5f819 [witgo] review commit
      45e5b70 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      fa69dcf [witgo] Merge branch 'master' into SPARK-1413
      3c98dc4 [witgo] Merge branch 'master' into SPARK-1413
      38160cb [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      ba09bcd [witgo] remove set the parquet log level
      a63d574 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      5231ecd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      3feb635 [witgo] parquet logger use parent handler
      fa00d5d [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      8bb6ffd [witgo] enableLogForwarding note fix
      edd9630 [witgo]  move to
      f447f50 [witgo] merging master
      5ad52bd [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1413
      76670c1 [witgo] review commit
      70f3c64 [witgo] Fix SPARK-1413
      a74fbbbc
    • Patrick Wendell · e6d4a74d
  3. Apr 09, 2014
    • SPARK-729: Closures not always serialized at capture time · 8ca3b2bc
      William Benton authored
      [SPARK-729](https://spark-project.atlassian.net/browse/SPARK-729) concerns when free variables in closure arguments to transformations are captured.  Currently, it is possible for closures to get the environment in which they are serialized (not the environment in which they are created).  There are a few possible approaches to solving this problem and this PR will discuss some of them.  The approach I took has the advantage of being simple, obviously correct, and minimally-invasive, but it preserves something that has been bothering me about Spark's closure handling, so I'd like to discuss an alternative and get some feedback on whether or not it is worth pursuing.
      
      ## What I did
      
      The basic approach I took depends on the work I did for #143, and so this PR is based atop that.  Specifically: #143 modifies `ClosureCleaner.clean` to preemptively determine whether or not closures are serializable immediately upon closure cleaning (rather than waiting for a job involving that closure to be scheduled).  Thus non-serializable closure exceptions will be triggered by the line defining the closure rather than by the line where the closure is used.
      
      Since the easiest way to determine whether or not a closure is serializable is to attempt to serialize it, the code in #143 is creating a serialized closure as part of `ClosureCleaner.clean`.  `clean` currently modifies its argument in place, and the method in `SparkContext` that wraps it returns a value (a reference to the modified-in-place argument).  This branch modifies `ClosureCleaner.clean` so that it returns a value:  if it is cleaning a serializable closure, it returns the result of deserializing its serialized argument, and is therefore returning a closure with an environment captured at cleaning time.  `SparkContext.clean` then returns the result of `ClosureCleaner.clean`, rather than a reference to its modified-in-place argument.
      
      I've added tests for this behavior (777a1bc).  The pull request as it stands, given the changes in #143, is nearly trivial.  There is some overhead from deserializing the closure, but it is minimal and the benefit of obvious operational correctness (vs. a more sophisticated but harder-to-validate transformation in `ClosureCleaner`) seems pretty important.  I think this is a fine way to solve this problem, but it's not perfect.
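
      For illustration, here is a minimal standalone sketch of the serialize-then-deserialize idea (this is not the actual `ClosureCleaner.clean` implementation, just the core trick):

      ```scala
      import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

      // Serializing the closure proves it is serializable (a NotSerializableException
      // surfaces here, at definition time), and deserializing it yields a copy whose
      // captured environment is frozen now rather than at job-submission time.
      def cleanByRoundTrip[T <: AnyRef](closure: T): T = {
        val buffer = new ByteArrayOutputStream()
        val out = new ObjectOutputStream(buffer)
        out.writeObject(closure)
        out.close()
        val in = new ObjectInputStream(new ByteArrayInputStream(buffer.toByteArray))
        in.readObject().asInstanceOf[T]
      }
      ```

      A closure that captured a mutable variable would, after this round trip, see the value the variable held at cleaning time rather than its value when the job is eventually submitted, which is exactly the capture-time semantics discussed above.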
      
      ## What we might want to do
      
      The thing that has been bothering me about Spark's handling of closures is that it seems like we should be able to statically ensure that cleaning and serialization happen exactly once for a given closure.  If we serialize a closure in order to determine whether or not it is serializable, we should be able to hang on to the generated byte buffer and use it instead of re-serializing the closure later.  By replacing closures with instances of a sum type that encodes whether or not a closure has been cleaned or serialized, we could handle clean, to-be-cleaned, and serialized closures separately with case matches.  Here's a somewhat-concrete sketch (taken from my git stash) of what this might look like:
      
      ```scala
      package org.apache.spark.util
      
      import java.nio.ByteBuffer
      import scala.reflect.ClassManifest
      
      sealed abstract class ClosureBox[T] { def func: T }
      final case class RawClosure[T](func: T) extends ClosureBox[T] {}
      final case class CleanedClosure[T](func: T) extends ClosureBox[T] {}
      final case class SerializedClosure[T](func: T, bytebuf: ByteBuffer) extends ClosureBox[T] {}
      
      object ClosureBoxImplicits {
        implicit def closureBoxFromFunc[T <: AnyRef](fun: T) = new RawClosure[T](fun)
      }
      ```
      
      With these types declared, we'd be able to change `ClosureCleaner.clean` to take a `ClosureBox[T=>U]` (possibly generated by implicit conversion) and return a `ClosureBox[T=>U]` (either a `CleanedClosure[T=>U]` or a `SerializedClosure[T=>U]`, depending on whether or not serializability-checking was enabled) instead of a `T=>U`.  A case match could thus short-circuit cleaning or serializing closures that had already been cleaned or serialized (both in `ClosureCleaner` and in the closure serializer).  Cleaned-and-serialized closures would be represented by a boxed tuple of the original closure and a serialized copy (complete with an environment quiesced at transformation time).  Additional implicit conversions could convert from `ClosureBox` instances to the underlying function type where appropriate.  Tracking this sort of state in the type system seems like the right thing to do to me.
      
      ### Why we might not want to do that
      
      _It's pretty invasive._  Every function type used by every `RDD` subclass would have to change to reflect that they expected a `ClosureBox[T=>U]` instead of a `T=>U`.  This obscures what's going on and is not a little ugly.  Although I really like the idea of using the type system to enforce the clean-or-serialize once discipline, it might not be worth adding another layer of types (even if we could hide some of the extra boilerplate with judicious application of implicit conversions).
      
      _It statically guarantees a property whose absence is unlikely to cause any serious problems as it stands._  It appears that all closures are currently dynamically cleaned once and it's not obvious that repeated closure-cleaning is likely to be a problem in the future.  Furthermore, serializing closures is relatively cheap, so doing it once to check for serialization and once again to actually ship them across the wire doesn't seem like a big deal.
      
      Taken together, these seem like a high price to pay for statically guaranteeing that closures are operated upon only once.
      
      ## Other possibilities
      
      I felt like the serialize-and-deserialize approach was best due to its obvious simplicity.  But it would be possible to do a more sophisticated transformation within `ClosureCleaner.clean`.  It might also be possible for `clean` to modify its argument in a way so that whether or not a given closure had been cleaned would be apparent upon inspection; this would buy us some of the operational benefits of the `ClosureBox` approach but not the static cleanliness.
      
      I'm interested in any feedback or discussion on whether or not the problems with the type-based approach indeed outweigh the advantage, as well as of approaches to this issue and to closure handling in general.
      
      Author: William Benton <willb@redhat.com>
      
      Closes #189 from willb/spark-729 and squashes the following commits:
      
      f4cafa0 [William Benton] Stylistic changes and cleanups
      b3d9c86 [William Benton] Fixed style issues in tests
      9b56ce0 [William Benton] Added array-element capture test
      97e9d91 [William Benton] Split closure-serializability failure tests
      12ef6e3 [William Benton] Skip proactive closure capture for runJob
      8ee3ee7 [William Benton] Predictable closure environment capture
      12c63a7 [William Benton] Added tests for variable capture in closures
      d6e8dd6 [William Benton] Don't check serializability of DStream transforms.
      4ecf841 [William Benton] Make proactive serializability checking optional.
      d8df3db [William Benton] Adds proactive closure-serializablilty checking
      21b4b06 [William Benton] Test cases for SPARK-897.
      d5947b3 [William Benton] Ensure assertions in Graph.apply are asserted.
      8ca3b2bc
    • SPARK-1407 drain event queue before stopping event logger · eb5f2b64
      Kan Zhang authored
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #366 from kanzhang/SPARK-1407 and squashes the following commits:
      
      cd0629f [Kan Zhang] code refactoring and adding test
      b073ee6 [Kan Zhang] SPARK-1407 drain event queue before stopping event logger
      eb5f2b64
    • SPARK-1093: Annotate developer and experimental API's · 87bd1f9e
      Patrick Wendell authored
      This patch marks some existing classes as private[spark] and adds two types of API annotations:
      - `EXPERIMENTAL API` = experimental user-facing module
      - `DEVELOPER API - UNSTABLE` = developer-facing API that might change
      
      There is some discussion of the different mechanisms for doing this here:
      https://issues.apache.org/jira/browse/SPARK-1081
      
      I was pretty aggressive with marking things private. Keep in mind that if we want to open something up in the future we can, but we can never reduce visibility.
      
      A few notes here:
      - In the past we've been inconsistent with the visibility of the X-RDD classes. This patch marks them private whenever there is an existing function in RDD that can directly create them (e.g. CoalescedRDD and rdd.coalesce()). One trade-off here is that users can't subclass them.
      - Noted that compression and serialization formats don't have to be wire compatible across versions.
      - Compression codecs and serialization formats are semi-private as users typically don't instantiate them directly.
      - Metrics sources are made private; users only interact with them through Spark's reflection.
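
      As a rough illustration, applying the two annotations might look like the sketch below (the classes are made up; the `org.apache.spark.annotation` package, the `DeveloperApi`/`Experimental` names, and the `:: Experimental ::` scaladoc tags follow the squashed commits listed below):

      ```scala
      import org.apache.spark.annotation.{DeveloperApi, Experimental}

      /**
       * :: Experimental ::
       * A hypothetical experimental, user-facing class.
       */
      @Experimental
      class ApproximateCounter

      /**
       * :: DeveloperApi ::
       * A hypothetical developer-facing hook that may change across releases.
       */
      @DeveloperApi
      trait MetricsSourceLike
      ```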
      
      Author: Patrick Wendell <pwendell@gmail.com>
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #274 from pwendell/private-apis and squashes the following commits:
      
      44179e4 [Patrick Wendell] Merge remote-tracking branch 'apache-github/master' into private-apis
      042c803 [Patrick Wendell] spark.annotations -> spark.annotation
      bfe7b52 [Patrick Wendell] Adding experimental for approximate counts
      8d0c873 [Patrick Wendell] Warning in SparkEnv
      99b223a [Patrick Wendell] Cleaning up annotations
      e849f64 [Patrick Wendell] Merge pull request #2 from andrewor14/annotations
      982a473 [Andrew Or] Generalize jQuery matching for non Spark-core API docs
      a01c076 [Patrick Wendell] Merge pull request #1 from andrewor14/annotations
      c1bcb41 [Andrew Or] DeveloperAPI -> DeveloperApi
      0d48908 [Andrew Or] Comments and new lines (minor)
      f3954e0 [Andrew Or] Add identifier tags in comments to work around scaladocs bug
      99192ef [Andrew Or] Dynamically add badges based on annotations
      824011b [Andrew Or] Add support for injecting arbitrary JavaScript to API docs
      037755c [Patrick Wendell] Some changes after working with andrew or
      f7d124f [Patrick Wendell] Small fixes
      c318b24 [Patrick Wendell] Use CSS styles
      e4c76b9 [Patrick Wendell] Logging
      f390b13 [Patrick Wendell] Better visibility for workaround constructors
      d6b0afd [Patrick Wendell] Small chang to existing constructor
      403ba52 [Patrick Wendell] Style fix
      870a7ba [Patrick Wendell] Work around for SI-8479
      7fb13b2 [Patrick Wendell] Changes to UnionRDD and EmptyRDD
      4a9e90c [Patrick Wendell] EXPERIMENTAL API --> EXPERIMENTAL
      c581dce [Patrick Wendell] Changes after building against Shark.
      8452309 [Patrick Wendell] Style fixes
      1ed27d2 [Patrick Wendell] Formatting and coloring of badges
      cd7a465 [Patrick Wendell] Code review feedback
      2f706f1 [Patrick Wendell] Don't use floats
      542a736 [Patrick Wendell] Small fixes
      cf23ec6 [Patrick Wendell] Marking GraphX as alpha
      d86818e [Patrick Wendell] Another naming change
      5a76ed6 [Patrick Wendell] More visiblity clean-up
      42c1f09 [Patrick Wendell] Using better labels
      9d48cbf [Patrick Wendell] Initial pass
      87bd1f9e
    • Spark-939: allow user jars to take precedence over spark jars · fa0524fd
      Holden Karau authored
      I still need to do a small bit of refactoring [mostly the one Java file, which I'll switch back to a Scala file and use in both the class loaders], but comments on other things I should do would be great.
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #217 from holdenk/spark-939-allow-user-jars-to-take-precedence-over-spark-jars and squashes the following commits:
      
      cf0cac9 [Holden Karau] Fix the executorclassloader
      1955232 [Holden Karau] Fix long line in TestUtils
      8f89965 [Holden Karau] Fix tests for new class name
      7546549 [Holden Karau] CR feedback, merge some of the testutils methods down, rename the classloader
      644719f [Holden Karau] User the class generator for the repl class loader tests too
      f0b7114 [Holden Karau] Fix the core/src/test/scala/org/apache/spark/executor/ExecutorURLClassLoaderSuite.scala tests
      204b199 [Holden Karau] Fix the generated classes
      9f68f10 [Holden Karau] Start rewriting the ExecutorURLClassLoaderSuite to not use the hard coded classes
      858aba2 [Holden Karau] Remove a bunch of test junk
      261aaee [Holden Karau] simplify executorurlclassloader a bit
      7a7bf5f [Holden Karau] CR feedback
      d4ae848 [Holden Karau] rewrite component into scala
      aa95083 [Holden Karau] CR feedback
      7752594 [Holden Karau] re-add https comment
      a0ef85a [Holden Karau] Fix style issues
      125ea7f [Holden Karau] Easier to just remove those files, we don't need them
      bb8d179 [Holden Karau] Fix issues with the repl class loader
      241b03d [Holden Karau] fix my rat excludes
      a343350 [Holden Karau] Update rat-excludes and remove a useless file
      d90d217 [Holden Karau] Fix fall back with custom class loader and add a test for it
      4919bf9 [Holden Karau] Fix parent calling class loader issue
      8a67302 [Holden Karau] Test are good
      9e2d236 [Holden Karau] It works comrade
      691ee00 [Holden Karau] It works ish
      dc4fe44 [Holden Karau] Does not depend on being in my home directory
      47046ff [Holden Karau] Remove bad import'
      22d83cb [Holden Karau] Add a test suite for the executor url class loader suite
      7ef4628 [Holden Karau] Clean up
      792d961 [Holden Karau] Almost works
      16aecd1 [Holden Karau] Doesn't quite work
      8d2241e [Holden Karau] Adda FakeClass for testing ClassLoader precedence options
      648b559 [Holden Karau] Both class loaders compile. Now for testing
      e1d9f71 [Holden Karau] One loader workers.
      fa0524fd
  4. Apr 08, 2014
    • Spark 1271: Co-Group and Group-By should pass Iterable[X] · ce8ec545
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:
      
      f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
      77048f8 [Holden Karau] Fix merge up to master
      d3fe909 [Holden Karau] use toSeq instead
      7a092a3 [Holden Karau] switch resultitr to resultiterable
      eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
      c5075aa [Holden Karau] If guava 14 had iterables
      2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
      11e730c [Holden Karau] Fix streaming tests
      66b583d [Holden Karau] Fix the core test suite to compile
      4ed579b [Holden Karau] Refactor from iterator to iterable
      d052c07 [Holden Karau] Python tests now pass with iterator pandas
      3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
      cd1e81c [Holden Karau] Try and make pickling list iterators work
      c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
      88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
      a5ee714 [Holden Karau] oops, was checking wrong iterator
      e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
      ec8cc3e [Holden Karau] Fix test issues\!
      4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
      fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
      ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
      b692868 [Holden Karau] Revert
      7e533f7 [Holden Karau] Fix the bug
      8a5153a [Holden Karau] Revert me, but we have some stuff to debug
      b4e86a9 [Holden Karau] Add a join based on the problem in SVD
      c4510e2 [Holden Karau] Revert this but for now put things in list pandas
      b4e0b1d [Holden Karau] Fix style issues
      71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
      b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
      37888ec [Holden Karau] core/tests now pass
      249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
      6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
      fe992fe [Holden Karau] hmmm try and fix up basic operation suite
      172705c [Holden Karau] Fix Java API suite
      caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
      88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
      4991af6 [Holden Karau] Fix some tests
      be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
      687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
      ce8ec545
    • SPARK-1433: Upgrade Mesos dependency to 0.17.0 · 12c077d5
      Sandeep authored
      Mesos 0.13.0 was released 6 months ago.
      Upgrade Mesos dependency to 0.17.0
      
      Author: Sandeep <sandeep@techaddict.me>
      
      Closes #355 from techaddict/mesos_update and squashes the following commits:
      
      f1abeee [Sandeep] SPARK-1433: Upgrade Mesos dependency to 0.17.0 Mesos 0.13.0 was released 6 months ago. Upgrade Mesos dependency to 0.17.0
      12c077d5
    • [SPARK-1397] Notify SparkListeners when stages fail or are cancelled. · fac6085c
      Kay Ousterhout authored
      [I wanted to post this for folks to comment but it depends on (and thus includes the changes in) a currently outstanding PR, #305.  You can look at just the second commit: https://github.com/kayousterhout/spark-1/commit/93f08baf731b9eaf5c9792a5373560526e2bccac to see just the changes relevant to this PR]
      
      Previously, when stages fail or get cancelled, the SparkListener is only notified
      indirectly through the SparkListenerJobEnd, where we sometimes pass in a single
      stage that failed.  This worked before job cancellation, because jobs would only fail
      due to a single stage failure.  However, with job cancellation, multiple running stages
      can fail when a job gets cancelled.  Right now, this is not handled correctly, which
      results in stages that get stuck in the “Running Stages” window in the UI even
      though they’re dead.
      
      This PR changes the SparkListenerStageCompleted event to a SparkListenerStageEnded
      event, and uses this event to tell SparkListeners when stages fail in addition to when
      they complete successfully.  This change is NOT publicly backward compatible for two
      reasons.  First, it changes the SparkListener interface.  We could alternately add a new event,
      SparkListenerStageFailed, and keep the existing SparkListenerStageCompleted.  However,
      this is less consistent with the listener events for tasks / jobs ending, and will result in some
      code duplication for listeners (because failed and completed stages are handled in similar
      ways).  Note that I haven’t finished updating the JSON code to correctly handle the new event
      because I’m waiting for feedback on whether this is a good or bad idea (hence the “WIP”).
      
      It is also not backwards compatible because it changes the publicly visible JobWaiter.jobFailed()
      method to no longer include a stage that caused the failure.  I think this change should definitely
      stay, because with cancellation (as described above), a failure isn’t necessarily caused by a
      single stage.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #309 from kayousterhout/stage_cancellation and squashes the following commits:
      
      5533ecd [Kay Ousterhout] Fixes in response to Mark's review
      320c7c7 [Kay Ousterhout] Notify SparkListeners when stages fail or are cancelled.
      fac6085c
    • SPARK-1348 binding Master, Worker, and App Web UI to all interfaces · a8d86b08
      Kan Zhang authored
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #318 from kanzhang/SPARK-1348 and squashes the following commits:
      
      e625a5f [Kan Zhang] reverting the changes to startJettyServer()
      7a8084e [Kan Zhang] SPARK-1348 binding Master, Worker, and App Web UI to all interfaces
      a8d86b08
    • [SPARK-1396] Properly cleanup DAGScheduler on job cancellation. · 6dc5f584
      Kay Ousterhout authored
      Previously, when jobs were cancelled, not all of the state in the
      DAGScheduler was cleaned up, leading to a slow memory leak in the
      DAGScheduler.  As we expose easier ways to cancel jobs, it's more
      important to fix these issues.
      
      This commit also fixes a second and less serious problem, which is that
      previously, when a stage failed, not all of the appropriate stages
      were cancelled.  See the "failure of stage used by two jobs" test
      for an example of this.  This just meant that extra work was done, and is
      not a correctness problem.
      
      This commit adds 3 tests.  “run shuffle with map stage failure” is
      a new test to more thoroughly test this functionality, and passes on
      both the old and new versions of the code.  “trivial job
      cancellation” fails on the old code because all state wasn’t cleaned
      up correctly when jobs were cancelled (we didn’t remove the job from
      resultStageToJob).  “failure of stage used by two jobs” fails on the
      old code because taskScheduler.cancelTasks wasn’t called for one of
      the stages (see test comments).
      
      This should be checked in before #246, which makes it easier to
      cancel stages / jobs.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #305 from kayousterhout/incremental_abort_fix and squashes the following commits:
      
      f33d844 [Kay Ousterhout] Mark review comments
      9217080 [Kay Ousterhout] Properly cleanup DAGScheduler on job cancellation.
      6dc5f584
    • [SPARK-1103] Automatic garbage collection of RDD, shuffle and broadcast data · 11eabbe1
      Tathagata Das authored
      This PR allows Spark to automatically clean up metadata and data related to persisted RDDs, shuffles and broadcast variables when the corresponding RDDs, shuffles and broadcast variables fall out of scope in the driver program. This is still a work in progress as broadcast cleanup has not been implemented.
      
      **Implementation Details**
      A new class `ContextCleaner` is responsible for cleaning all the state. It is instantiated as part of a `SparkContext`. The RDD and ShuffleDependency classes have an overridden `finalize()` function that gets called whenever their instances go out of scope. The `finalize()` function enqueues the object’s identifier (i.e. RDD ID, shuffle ID, etc.) with the `ContextCleaner`, which is a very short and cheap operation and should not significantly affect the garbage collection mechanism. The `ContextCleaner`, on a different thread, performs the cleanup, whose details are given below.
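
      As a generic illustration of that enqueue-from-`finalize()` pattern (a toy sketch, not Spark's actual `ContextCleaner`):

      ```scala
      import java.util.concurrent.LinkedBlockingQueue

      // Toy cleaner: finalize() only enqueues an identifier (cheap), and a separate
      // daemon thread performs the potentially expensive cleanup asynchronously.
      class ToyCleaner {
        private val queue = new LinkedBlockingQueue[Int]()

        private val cleanerThread = new Thread("toy-cleaner") {
          override def run(): Unit = {
            while (true) {
              val id = queue.take()                        // blocks until an ID arrives
              println(s"cleaning up state for object $id") // e.g. unpersist an RDD, drop shuffle blocks
            }
          }
        }
        cleanerThread.setDaemon(true)
        cleanerThread.start()

        def register(id: Int): Unit = queue.put(id)        // safe to call from finalize()
      }

      class TrackedResource(val id: Int, cleaner: ToyCleaner) {
        override def finalize(): Unit = cleaner.register(id)
      }
      ```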
      
      *RDD cleanup:*
      `ContextCleaner` calls `RDD.unpersist()` to clean up persisted RDDs. Regarding metadata, the DAGScheduler automatically cleans up all metadata related to an RDD after all jobs have completed. Only `SparkContext.persistentRDDs` keeps strong references to persisted RDDs. The `TimeStampedHashMap` used for that has been replaced by a `TimeStampedWeakValueHashMap` that keeps only weak references to the RDDs, allowing them to be garbage collected.
      
      *Shuffle cleanup:*
      A new BlockManager message, `RemoveShuffle(<shuffle ID>)`, asks the `BlockManagerMaster` and the currently active `BlockManager`s to delete all the disk blocks related to the shuffle ID. `ContextCleaner` cleans up shuffle data using this message and also cleans up the metadata in the `MapOutputTracker` of the driver. The `MapOutputTracker` at the workers, which caches the shuffle metadata, maintains a `BoundedHashMap` to limit the shuffle information it caches. Refetching the shuffle information from the driver is not too costly.
      
      *Broadcast cleanup:*
      To be done. [This PR](https://github.com/apache/incubator-spark/pull/543/) adds mechanism for explicit cleanup of broadcast variables. `Broadcast.finalize()` will enqueue its own ID with ContextCleaner and the PRs mechanism will be used to unpersist the Broadcast data.
      
      *Other cleanup:*
      `ShuffleMapTask` and `ResultTask` used to cache tasks with TTL-based cleanup (using `TimeStampedHashMap`), so nothing got cleaned up if the TTL was not set. They now use a `BoundedHashMap` instead to keep a limited amount of map output information. The cost of repopulating the cache if necessary is very small.
      
      **Current state of implementation**
      RDD and shuffle cleanup are implemented. Things left to be done:
      - Cleanup for broadcast variables.
      - Automatically cleaning up keys that have empty weak refs as values in `TimeStampedWeakValueHashMap`.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Andrew Or <andrewor14@gmail.com>
      Author: Roman Pastukhov <ignatich@mail.ru>
      
      Closes #126 from tdas/state-cleanup and squashes the following commits:
      
      61b8d6e [Tathagata Das] Fixed issue with Tachyon + new BlockManager methods.
      f489fdc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
      d25a86e [Tathagata Das] Fixed stupid typo.
      cff023c [Tathagata Das] Fixed issues based on Andrew's comments.
      4d05314 [Tathagata Das] Scala style fix.
      2b95b5e [Tathagata Das] Added more documentation on Broadcast implementations, specially which blocks are told about to the driver. Also, fixed Broadcast API to hide destroy functionality.
      41c9ece [Tathagata Das] Added more unit tests for BlockManager, DiskBlockManager, and ContextCleaner.
      6222697 [Tathagata Das] Fixed bug and adding unit test for removeBroadcast in BlockManagerSuite.
      104a89a [Tathagata Das] Fixed failing BroadcastSuite unit tests by introducing blocking for removeShuffle and removeBroadcast in BlockManager*
      a430f06 [Tathagata Das] Fixed compilation errors.
      b27f8e8 [Tathagata Das] Merge pull request #3 from andrewor14/cleanup
      cd72d19 [Andrew Or] Make automatic cleanup configurable (not documented)
      ada45f0 [Andrew Or] Merge branch 'state-cleanup' of github.com:tdas/spark into cleanup
      a2cc8bc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
      c5b1d98 [Andrew Or] Address Patrick's comments
      a6460d4 [Andrew Or] Merge github.com:apache/spark into cleanup
      762a4d8 [Tathagata Das] Merge pull request #1 from andrewor14/cleanup
      f0aabb1 [Andrew Or] Correct semantics for TimeStampedWeakValueHashMap + add tests
      5016375 [Andrew Or] Address TD's comments
      7ed72fb [Andrew Or] Fix style test fail + remove verbose test message regarding broadcast
      634a097 [Andrew Or] Merge branch 'state-cleanup' of github.com:tdas/spark into cleanup
      7edbc98 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into state-cleanup
      8557c12 [Andrew Or] Merge github.com:apache/spark into cleanup
      e442246 [Andrew Or] Merge github.com:apache/spark into cleanup
      88904a3 [Andrew Or] Make TimeStampedWeakValueHashMap a wrapper of TimeStampedHashMap
      fbfeec8 [Andrew Or] Add functionality to query executors for their local BlockStatuses
      34f436f [Andrew Or] Generalize BroadcastBlockId to remove BroadcastHelperBlockId
      0d17060 [Andrew Or] Import, comments, and style fixes (minor)
      c92e4d9 [Andrew Or] Merge github.com:apache/spark into cleanup
      f201a8d [Andrew Or] Test broadcast cleanup in ContextCleanerSuite + remove BoundedHashMap
      e95479c [Andrew Or] Add tests for unpersisting broadcast
      544ac86 [Andrew Or] Clean up broadcast blocks through BlockManager*
      d0edef3 [Andrew Or] Add framework for broadcast cleanup
      ba52e00 [Andrew Or] Refactor broadcast classes
      c7ccef1 [Andrew Or] Merge branch 'bc-unpersist-merge' of github.com:ignatich/incubator-spark into cleanup
      6c9dcf6 [Tathagata Das] Added missing Apache license
      d2f8b97 [Tathagata Das] Removed duplicate unpersistRDD.
      a007307 [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
      620eca3 [Tathagata Das] Changes based on PR comments.
      f2881fd [Tathagata Das] Changed ContextCleaner to use ReferenceQueue instead of finalizer
      e1fba5f [Tathagata Das] Style fix
      892b952 [Tathagata Das] Removed use of BoundedHashMap, and made BlockManagerSlaveActor cleanup shuffle metadata in MapOutputTrackerWorker.
      a7260d3 [Tathagata Das] Added try-catch in context cleaner and null value cleaning in TimeStampedWeakValueHashMap.
      e61daa0 [Tathagata Das] Modifications based on the comments on PR 126.
      ae9da88 [Tathagata Das] Removed unncessary TimeStampedHashMap from DAGScheduler, added try-catches in finalize() methods, and replaced ArrayBlockingQueue to LinkedBlockingQueue to avoid blocking in Java's finalizing thread.
      cb0a5a6 [Tathagata Das] Fixed docs and styles.
      a24fefc [Tathagata Das] Merge remote-tracking branch 'apache/master' into state-cleanup
      8512612 [Tathagata Das] Changed TimeStampedHashMap to use WrappedJavaHashMap.
      e427a9e [Tathagata Das] Added ContextCleaner to automatically clean RDDs and shuffles when they fall out of scope. Also replaced TimeStampedHashMap to BoundedHashMaps and TimeStampedWeakValueHashMap for the necessary hashmap behavior.
      80dd977 [Roman Pastukhov] Fix for Broadcast unpersist patch.
      1e752f1 [Roman Pastukhov] Added unpersist method to Broadcast.
      11eabbe1
  5. Apr 07, 2014
    • SPARK-1099: Introduce local[*] mode to infer number of cores · 0307db0f
      Aaron Davidson authored
      This is the default mode for running spark-shell and pyspark, intended to allow users running spark for the first time to see the performance benefits of using multiple cores, while not breaking backwards compatibility for users who use "local" mode and expect exactly 1 core.
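
      For illustration, a minimal sketch of the two master strings (the app name is illustrative):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      // "local[*]" uses as many worker threads as the machine reports cores;
      // plain "local" keeps the old behavior of exactly one core.
      val conf = new SparkConf().setAppName("local-star-example").setMaster("local[*]")
      val sc = new SparkContext(conf)
      println(sc.parallelize(1 to 1000).count())
      sc.stop()
      ```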
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #182 from aarondav/110 and squashes the following commits:
      
      a88294c [Aaron Davidson] Rebased changes for new spark-shell
      a9f393e [Aaron Davidson] SPARK-1099: Introduce local[*] mode to infer number of cores
      0307db0f
    • SPARK-1432: Make sure that all metadata fields are properly cleaned · a3c51c6e
      Davis Shepherd authored
      While working on spark-1337 with @pwendell, we noticed that not all of the metadata maps in JobProgressListener were being properly cleaned. This could lead to a (hypothetical) memory leak issue should a job run long enough. This patch aims to address the issue.
      
      Author: Davis Shepherd <davis@conviva.com>
      
      Closes #338 from dgshep/master and squashes the following commits:
      
      a77b65c [Davis Shepherd] In the contex of SPARK-1337: Make sure that all metadata fields are properly cleaned
      a3c51c6e
  6. Apr 06, 2014
    • SPARK-1154: Clean up app folders in worker nodes · 1440154c
      Evan Chan authored
      This is a fix for [SPARK-1154](https://issues.apache.org/jira/browse/SPARK-1154).   The issue is that worker nodes fill up with a huge number of app-* folders after some time.  This change adds a periodic cleanup task which asynchronously deletes app directories older than a configurable TTL.
      
      Two new configuration parameters have been introduced:
        spark.worker.cleanup_interval
        spark.worker.app_data_ttl
      
      This change does not include moving the downloads of application jars to a location outside of the work directory.  We will address that if we have time, but that potentially involves caching so it will come either as part of this PR or a separate PR.
      
      Author: Evan Chan <ev@ooyala.com>
      Author: Kelvin Chu <kelvinkwchu@yahoo.com>
      
      Closes #288 from velvia/SPARK-1154-cleanup-app-folders and squashes the following commits:
      
      0689995 [Evan Chan] CR from @aarondav - move config, clarify for standalone mode
      9f10d96 [Evan Chan] CR from @pwendell - rename configs and add cleanup.enabled
      f2f6027 [Evan Chan] CR from @andrewor14
      553d8c2 [Kelvin Chu] change the variable name to currentTimeMillis since it actually tracks in seconds
      8dc9cb5 [Kelvin Chu] Fixed a bug in Utils.findOldFiles() after merge.
      cb52f2b [Kelvin Chu] Change the name of findOldestFiles() to findOldFiles()
      72f7d2d [Kelvin Chu] Fix a bug of Utils.findOldestFiles(). file.lastModified is returned in milliseconds.
      ad99955 [Kelvin Chu] Add unit test for Utils.findOldestFiles()
      dc1a311 [Evan Chan] Don't recompute current time with every new file
      e3c408e [Evan Chan] Document the two new settings
      b92752b [Evan Chan] SPARK-1154: Add a periodic task to clean up app directories
      1440154c
    • SPARK-1387. Update build plugins, avoid plugin version warning, centralize versions · 856c50f5
      Sean Owen authored
      Another handful of small build changes to organize and standardize a bit, and avoid warnings:
      
      - Update Maven plugin versions for good measure
      - Since plugins need maven 3.0.4 already, require it explicitly (<3.0.4 had some bugs anyway)
      - Use variables to define versions across dependencies where they should move in lock step
      - ... and make this consistent between Maven/SBT
      
      OK, I also updated the JIRA URL while I was at it here.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #291 from srowen/SPARK-1387 and squashes the following commits:
      
      461eca1 [Sean Owen] Couldn't resist also updating JIRA location to new one
      c2d5cc5 [Sean Owen] Update plugins and Maven version; use variables consistently across Maven/SBT to define dependency versions that should stay in step.
      856c50f5
    • [SPARK-1259] Make RDD locally iterable · e258e504
      Egor Pakhomov authored
      Author: Egor Pakhomov <pahomov.egor@gmail.com>
      
      Closes #156 from epahomov/SPARK-1259 and squashes the following commits:
      
      8ec8f24 [Egor Pakhomov] Make to local iterator shorter
      34aa300 [Egor Pakhomov] Fix toLocalIterator docs
      08363ef [Egor Pakhomov] SPARK-1259 from toLocallyIterable to toLocalIterator
      6a994eb [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
      8be3dcf [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
      33ecb17 [Egor Pakhomov] SPARK-1259 Make RDD locally iterable
      e258e504
  7. Apr 05, 2014
  8. Apr 04, 2014
    • SPARK-1305: Support persisting RDD's directly to Tachyon · b50ddfde
      Haoyuan Li authored
      This moves PR #468 of apache-incubator-spark to apache-spark:
      "Adding an option to persist Spark RDD blocks into Tachyon."
      
      Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
      Author: RongGu <gurongwalker@gmail.com>
      
      Closes #158 from RongGu/master and squashes the following commits:
      
      72b7768 [Haoyuan Li] merge master
      9f7fa1b [Haoyuan Li] fix code style
      ae7834b [Haoyuan Li] minor cleanup
      a8b3ec6 [Haoyuan Li] merge master branch
      e0f4891 [Haoyuan Li] better check offheap.
      55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
      7cd4600 [RongGu] remove some logic code for tachyonstore's replication
      51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
      8adfcfa [RongGu] address arron's comment on inTachyonSize
      120e48a [RongGu] changed the root-level dir name in Tachyon
      5cc041c [Haoyuan Li] address aaron's comments
      9b97935 [Haoyuan Li] address aaron's comments
      d9a6438 [Haoyuan Li] fix for pspark
      77d2703 [Haoyuan Li] change python api.git status
      3dcace4 [Haoyuan Li] address matei's comments
      91fa09d [Haoyuan Li] address patrick's comments
      589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
      64348b2 [Haoyuan Li] update conf docs.
      ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
      619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
      be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
      49cc724 [Haoyuan Li] update docs with off_headp option
      4572f9f [RongGu] reserving the old apply function API of StorageLevel
      04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
      c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
      76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
      e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
      fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
      939e467 [Haoyuan Li] 0.4.1-thrift from maven central
      86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
      16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
      eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
      6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
      d827250 [RongGu] fix JsonProtocolSuie test failure
      716e93b [Haoyuan Li] revert the version
      ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
      2825a13 [RongGu] up-merging to the current master branch of the apache spark
      6a22c1a [Haoyuan Li] fix scalastyle
      8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
      77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
      1dcadf9 [Haoyuan Li] typo
      bf278fa [Haoyuan Li] fix python tests
      e82909c [Haoyuan Li] minor cleanup
      776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
      8859371 [Haoyuan Li] various minor fixes and clean up
      e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
      fcaeab2 [Haoyuan Li] address Aaron's comment
      e554b1e [Haoyuan Li] add python code
      47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
      dc8ef24 [Haoyuan Li] add old storelevel constructor
      e01a271 [Haoyuan Li] update tachyon 0.4.1
      8011a96 [RongGu] fix a brought-in mistake in StorageLevel
      70ca182 [RongGu] a bit change in comment
      556978b [RongGu] fix the scalastyle errors
      791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
      b50ddfde
    • Add test utility for generating Jar files with compiled classes. · 5f3c1bb5
      Patrick Wendell authored
      This was requested by a few different people and may be generally
      useful, so I'd like to contribute this and not block on a different
      PR for it to get in.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #326 from pwendell/class-loader-test-utils and squashes the following commits:
      
      ff3e88e [Patrick Wendell] Add test utility for generating Jar files with compiled classes.
      5f3c1bb5
    • SPARK-1414. Python API for SparkContext.wholeTextFiles · 60e18ce7
      Matei Zaharia authored
      Also clarified comment on each file having to fit in memory
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #327 from mateiz/py-whole-files and squashes the following commits:
      
      9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
      60e18ce7
    • [SPARK-1198] Allow pipes tasks to run in different sub-directories · 198892fe
      Thomas Graves authored
      This works as is on Linux/Mac/etc. but doesn't cover Windows. Here I use ln -sf for symlinks; I'm putting this up for comments on that. Do we perhaps want to create some classes for running shell commands (Linux vs. Windows)? Is there some other way we want to do this? I assume we are still supporting jdk1.6?
      
      Also should I update the Java API for pipes to allow this parameter?
      
      Author: Thomas Graves <tgraves@apache.org>
      
      Closes #128 from tgravescs/SPARK1198 and squashes the following commits:
      
      abc1289 [Thomas Graves] remove extra tag in pom file
      ba23fc0 [Thomas Graves] Add support for symlink on windows, remove commons-io usage
      da4b221 [Thomas Graves] Merge branch 'master' of https://github.com/tgravescs/spark into SPARK1198
      61be271 [Thomas Graves] Fix file name filter
      6b783bd [Thomas Graves] style fixes
      1ab49ca [Thomas Graves] Add support for running pipe tasks is separate directories
      198892fe
    • Don't create SparkContext in JobProgressListenerSuite. · a02b535d
      Patrick Wendell authored
      This reduces the time of the test from 11 seconds to 20 milliseconds.
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #324 from pwendell/job-test and squashes the following commits:
      
      868d9eb [Patrick Wendell] Don't create SparkContext in JobProgressListenerSuite.
      a02b535d
    • SPARK-1375. Additional spark-submit cleanup · 16b83088
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #278 from sryza/sandy-spark-1375 and squashes the following commits:
      
      5fbf1e9 [Sandy Ryza] SPARK-1375. Additional spark-submit cleanup
      16b83088
    • [SPARK-1133] Add whole text files reader in MLlib · f1fa6170
      Xusen Yin authored
      Here is a pointer to the former [PR164](https://github.com/apache/spark/pull/164).
      
      I'm adding this pull request for the JIRA issue [SPARK-1133](https://spark-project.atlassian.net/browse/SPARK-1133), which brings a new whole-text-files reader API in MLlib.
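
      For illustration, a minimal sketch of the Scala side of the reader, assuming an existing `SparkContext` named `sc` (the path is illustrative; per the squashed commits below, the API was moved into Spark core):

      ```scala
      // Read each file under a directory as a single (path, content) record instead of
      // splitting it into lines; note that each file must fit in memory.
      val files = sc.wholeTextFiles("hdfs://namenode:9000/user/data/small-files")
      files.map { case (path, content) => (path, content.length) }
        .collect()
        .foreach { case (path, length) => println(s"$path: $length characters") }
      ```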
      
      Author: Xusen Yin <yinxusen@gmail.com>
      
      Closes #252 from yinxusen/whole-files-input and squashes the following commits:
      
      7191be6 [Xusen Yin] refine comments
      0af3faf [Xusen Yin] add JavaAPI test
      01745ee [Xusen Yin] fix deletion error
      cc97dca [Xusen Yin] move whole text file API to Spark core
      d792cee [Xusen Yin] remove the typo character "+"
      6bdf2c2 [Xusen Yin] test for small local file system block size
      a1f1e7e [Xusen Yin] add two extra spaces
      28cb0fe [Xusen Yin] add whole text files reader
      f1fa6170
    • SPARK-1337: Application web UI garbage collects newest stages · ee6e9e7d
      Patrick Wendell authored
      Simple fix...
      
      Author: Patrick Wendell <pwendell@gmail.com>
      
      Closes #320 from pwendell/stage-clean-up and squashes the following commits:
      
      29be62e [Patrick Wendell] SPARK-1337: Application web UI garbage collects newest stages instead old ones
      ee6e9e7d
  9. Apr 03, 2014
  10. Apr 02, 2014
    • [SPARK-1385] Use existing code for JSON de/serialization of BlockId · de8eefa8
      Andrew Or authored
      `BlockId.scala` offers a way to reconstruct a BlockId from a string through regex matching. `util/JsonProtocol.scala` duplicates this functionality by explicitly matching on the BlockId type.
      With this PR, the de/serialization of BlockIds will go through the first (older) code path.
      
      (Most of the line changes in this PR involve changing `==` to `===` in `JsonProtocolSuite.scala`)
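
      For illustration, a minimal sketch of the round trip that the older code path relies on (assuming `BlockId.apply` is the regex-based parser in `BlockId.scala`):

      ```scala
      import org.apache.spark.storage.{BlockId, RDDBlockId}

      // A BlockId round-trips through its string name, so the JSON only needs the name.
      val original: BlockId = RDDBlockId(1, 2)
      val name = original.name   // "rdd_1_2"
      val parsed = BlockId(name) // regex-based reconstruction
      assert(parsed == original)
      ```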
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #289 from andrewor14/blockid-json and squashes the following commits:
      
      409d226 [Andrew Or] Simplify JSON de/serialization for BlockId
      de8eefa8
    • Renamed stageIdToActiveJob to jobIdToActiveJob. · 11973a7b
      Kay Ousterhout authored
      This data structure was misused and, as a result, later renamed to an incorrect name.
      
      This data structure seems to have gotten into this tangled state as a result of @henrydavidge using the stageID instead of the job ID to index into it, and of @andrewor14 later renaming the data structure to reflect this misunderstanding.
      
      This patch renames it and removes an incorrect indexing into it.  The incorrect indexing into it meant that the code added by @henrydavidge to warn when a task size is too large (added here https://github.com/apache/spark/commit/57579934f0454f258615c10e69ac2adafc5b9835) was not always executed; this commit fixes that.
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #301 from kayousterhout/fixCancellation and squashes the following commits:
      
      bd3d3a4 [Kay Ousterhout] Renamed stageIdToActiveJob to jobIdToActiveJob.
      11973a7b
  11. Apr 01, 2014
    • [SPARK-1342] Scala 2.10.4 · 764353d2
      Mark Hamstra authored
      Just a Scala version increment
      
      Author: Mark Hamstra <markhamstra@gmail.com>
      
      Closes #259 from markhamstra/scala-2.10.4 and squashes the following commits:
      
      fbec547 [Mark Hamstra] [SPARK-1342] Bumped Scala version to 2.10.4
      764353d2
    • [Hot Fix #42] Persisted RDD disappears on storage page if re-used · ada310a9
      Andrew Or authored
      If a previously persisted RDD is re-used, its information disappears from the Storage page.
      
      This is because the tasks associated with re-using the RDD do not report the RDD's blocks as updated (which is correct). On stage submit, however, we overwrite whatever information we have for that RDD with a fresh entry, even when information for the RDD already exists.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #281 from andrewor14/ui-storage-fix and squashes the following commits:
      
      408585a [Andrew Or] Fix storage UI bug
      ada310a9
  12. Mar 31, 2014
    • [SPARK-1377] Upgrade Jetty to 8.1.14v20131031 · 94fe7fd4
      Andrew Or authored
      Previous version was 7.6.8v20121106. The only difference between Jetty 7 and Jetty 8 is that the former uses Servlet API 2.5, while the latter uses Servlet API 3.0.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #280 from andrewor14/jetty-upgrade and squashes the following commits:
      
      dd57104 [Andrew Or] Merge github.com:apache/spark into jetty-upgrade
      e75fa85 [Andrew Or] Upgrade Jetty to 8.1.14v20131031
      94fe7fd4