  1. Oct 06, 2013
  2. Oct 05, 2013
  3. Oct 04, 2013
    • Fix race conditions during recovery · db6f1549
      Aaron Davidson authored
      One major change was the use of messages instead of raw functions as the
      parameter of Akka scheduled timers. Since messages are serialized, unlike
      raw functions, the behavior is easier to think about and doesn't cause
      race conditions when exceptions are thrown.
      
      Another change is to avoid using global pointers that might change without
      a lock.
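The timer change described above can be sketched as follows. This is a Python analogy, not the Scala/Akka code from the commit: `MiniActor`, `CheckForTimeouts`, and `schedule_once` are hypothetical names. The point it tries to show is that the timer thread only enqueues a message, and the actor's state is touched in one place, one message at a time.

```python
import queue
import threading


class CheckForTimeouts:
    """Hypothetical marker message that a scheduled timer sends to the actor."""


class MiniActor:
    """Toy actor: state is only modified inside receive()."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self.timeout_checks = 0

    def schedule_once(self, delay_seconds, message):
        # The timer thread only enqueues the message; it never runs
        # arbitrary code against the actor's state.
        timer = threading.Timer(delay_seconds, self.mailbox.put, args=(message,))
        timer.start()
        return timer

    def receive(self, message):
        if isinstance(message, CheckForTimeouts):
            self.timeout_checks += 1
        else:
            raise ValueError(f"unexpected message: {message!r}")

    def process_one(self, timeout=5.0):
        # Wait for one message, then handle it on the caller's thread.
        self.receive(self.mailbox.get(timeout=timeout))


actor = MiniActor()
timer = actor.schedule_once(0.01, CheckForTimeouts())
actor.process_one()
timer.join()
```

Had the timer been given a raw function instead, that function would run on the timer's thread and could race with whatever the actor was doing at that moment.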
  4. Sep 26, 2013
    • Add license notices · 42d72308
      Aaron Davidson authored
    • Standalone Scheduler fault tolerance using ZooKeeper · f549ea33
      Aaron Davidson authored
      This patch implements full distributed fault tolerance for standalone scheduler Masters.
      There is only one master Leader at a time, which is actively serving scheduling
      requests. If this Leader crashes, another master will eventually be elected, reconstruct
      the state from the first Master, and continue serving scheduling requests.
      
      Leader election is performed using the ZooKeeper leader election pattern. We try to minimize
      the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of
      retries and session monitoring on top of the ZooKeeper client.
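As a rough illustration of the leader election pattern mentioned above, here is an in-memory Python sketch. It assumes the commonly documented ZooKeeper recipe (each candidate creates an ephemeral sequential node, and the lowest live sequence number leads); the commit message does not spell out which variant the patch uses. `FakeElectionDir` is a hypothetical stand-in and does not talk to a real ZooKeeper.

```python
class FakeElectionDir:
    """In-memory stand-in for an election znode (not a real ZooKeeper client)."""

    def __init__(self):
        self._next_seq = 0
        self._nodes = {}  # sequence number -> candidate name

    def join(self, candidate):
        # Mimics creating an ephemeral sequential child node.
        seq = self._next_seq
        self._next_seq += 1
        self._nodes[seq] = candidate
        return seq

    def session_lost(self, seq):
        # Mimics the ephemeral node disappearing when its session ends.
        self._nodes.pop(seq, None)

    def leader(self):
        # The candidate holding the lowest live sequence number leads.
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]


election = FakeElectionDir()
seq_a = election.join("master-a")
seq_b = election.join("master-b")
first_leader = election.leader()
election.session_lost(seq_a)  # the current Leader crashes
second_leader = election.leader()
```

The retries and session monitoring that the patch layers on top of the client are not modeled here.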
      
      Master failover follows directly from the single-node Master recovery via the file
      system (patch 194ba4b8), save that the Master state is stored in ZooKeeper instead.
      
      Configuration:
      By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE).
      By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url
      to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled.
      By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory
      to an appropriate directory accessible by the Master, we keep the behavior from 194ba4b8.
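The three modes above can be summarized with a small Python sketch. The property names come from the commit message; the function `recovery_settings`, the example ZooKeeper hosts, and the example directory are hypothetical and only illustrate which companion setting each mode needs.

```python
def recovery_settings(props):
    """Return (mode, location) for a dict of property names to values."""
    mode = props.get("spark.deploy.recoveryMode", "NONE")  # NONE is the default
    if mode == "NONE":
        return mode, None
    required = {
        "ZOOKEEPER": "spark.deploy.zookeeper.url",
        "FILESYSTEM": "spark.deploy.recoveryDirectory",
    }
    if mode not in required:
        raise ValueError(f"unknown recovery mode: {mode}")
    key = required[mode]
    if key not in props:
        raise ValueError(f"{mode} recovery requires {key}")
    return mode, props[key]


default_mode = recovery_settings({})
zk_mode = recovery_settings({
    "spark.deploy.recoveryMode": "ZOOKEEPER",
    "spark.deploy.zookeeper.url": "zk1:2181,zk2:2181",  # hypothetical hosts
})
fs_mode = recovery_settings({
    "spark.deploy.recoveryMode": "FILESYSTEM",
    "spark.deploy.recoveryDirectory": "/var/spark/recovery",  # hypothetical path
})
```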
      
      Additionally, places where a Master could be specified by a spark:// URL can now take
      comma-delimited lists to specify backup masters. Note that this is only used for registration
      of NEW Workers and application Clients. Once a Worker or Client has registered with the
      Master Leader, it is "in the system" and will never need to register again.
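A minimal sketch of splitting such a list, in Python. The exact syntax is an assumption: this sketch supposes the `spark://` prefix appears once and is followed by comma-separated `host:port` entries, which the commit message does not state explicitly. `parse_master_urls` is a hypothetical helper.

```python
def parse_master_urls(spec):
    """Split 'spark://h1:p1,h2:p2' into one spark:// URL per master."""
    prefix = "spark://"
    if not spec.startswith(prefix):
        raise ValueError(f"not a spark:// URL: {spec!r}")
    # Drop empty entries so a trailing comma does not produce a bogus master.
    hosts = [h.strip() for h in spec[len(prefix):].split(",") if h.strip()]
    if not hosts:
        raise ValueError("no master addresses given")
    return [prefix + h for h in hosts]


masters = parse_master_urls("spark://host1:7077,host2:7077")
single = parse_master_urls("spark://host1:7077")
```

A new Worker or Client would then try to register with each entry; once registered with the Leader, the list is no longer consulted.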
      
      Forthcoming:
      Documentation, tests (! - only ad hoc testing has been performed so far)
      I do not intend for this commit to be merged until tests are added, but this patch should
      still be mostly reviewable until then.
    • Standalone Scheduler fault recovery · d5a96fec
      Aaron Davidson authored
      Implements a basic form of Standalone Scheduler fault recovery. In particular,
      this allows manual recovery from a fault by restarting the Master process
      on the same machine. This is the majority of the code necessary
      for general fault tolerance, which will first elect a leader and then recover
      the Master state.
      
      In order to enable fault recovery, the Master will persist a small amount of state related
      to the registration of Workers and Applications to disk. If the Master is started and
      sees that this state is still around, it will enter Recovery mode, during which time it
      will not schedule any new Executors on Workers (but it does accept the registration of
      new Clients and Workers).
      
      At this point, the Master attempts to reconnect to all Workers and Client applications
      that were registered at the time of failure. After confirming either the existence
      or nonexistence of all such nodes (within a certain timeout), the Master will exit
      Recovery mode and resume normal scheduling.
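The recovery flow described above can be sketched as a tiny state machine in Python. The class `RecoveringMaster`, the state labels `"RECOVERING"` and `"ALIVE"`, and the method names are hypothetical; they only illustrate the sequence of persisted state, blocked scheduling, acknowledgements, and timeout.

```python
class RecoveringMaster:
    """Toy model of a Master that restarts with persisted registrations."""

    def __init__(self, persisted_workers, persisted_apps):
        # Nodes registered before the failure whose status is not yet known.
        self.unknown = set(persisted_workers) | set(persisted_apps)
        self.alive = set()
        self.state = "RECOVERING" if self.unknown else "ALIVE"

    def can_schedule(self):
        # No new Executors are scheduled while recovering.
        return self.state == "ALIVE"

    def acknowledge(self, node_id):
        # A previously registered node answered the reconnect attempt.
        if node_id in self.unknown:
            self.unknown.discard(node_id)
            self.alive.add(node_id)
        self._maybe_finish()

    def timeout_elapsed(self):
        # Nodes that never answered are treated as gone.
        self.unknown.clear()
        self._maybe_finish()

    def _maybe_finish(self):
        if self.state == "RECOVERING" and not self.unknown:
            self.state = "ALIVE"


master = RecoveringMaster(["worker-1", "worker-2"], ["app-1"])
blocked_during_recovery = not master.can_schedule()
master.acknowledge("worker-1")
master.acknowledge("app-1")
state_before_timeout = master.state
master.timeout_elapsed()  # worker-2 never responded
state_after_timeout = master.state
survivors = sorted(master.alive)
```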
    • Merge pull request #16 from pwendell/master · 13eced72
      Reynold Xin authored
      Bug fix in master build
    • Merge pull request #14 from kayousterhout/untangle_scheduler · 70a0b993
      Reynold Xin authored
      Improved organization of scheduling packages.
      
      This commit does not change any code -- only file organization.
      Please let me know if there was some masterminded strategy behind
      the existing organization that I failed to understand!
      
      There are two components of this change:
      (1) Moving files out of the cluster package, and down
      a level to the scheduling package. These files are all used by
      the local scheduler in addition to the cluster scheduler(s), so
      should not be in the cluster package. As a result of this change,
      none of the files in the local package reference files in the
      cluster package.
      
      (2) Moving the mesos package to within the cluster package.
      The mesos scheduling code is for a cluster, and represents a
      specific case of cluster scheduling (the Mesos-related classes
      often subclass cluster scheduling classes). Thus, the most logical
      place for it seems to be within the cluster package.
      
      The one thing about the scheduling code that seems a little funny to me
      is the naming of the SchedulerBackends.  The StandaloneSchedulerBackend
      is not just for Standalone mode, but instead is used by Mesos coarse grained
      mode and Yarn, and the backend that *is* just for Standalone mode is instead called
      SparkDeploySchedulerBackend. I didn't change this because I wasn't sure if there
      was a reason for this naming that I'm just not aware of.
    • Merge pull request #670 from jey/ec2-ssh-improvements · 76677b8f
      Reynold Xin authored
      EC2 SSH improvements
    • Merge pull request #930 from holdenk/master · c514cd15
      Reynold Xin authored
      Add mapPartitionsWithIndex
    • Bug fix in master build · e2ff59af
      Patrick Wendell authored
    • Merge pull request #7 from wannabeast/memorystore-fixes · 560ee5c9
      Reynold Xin authored
      some minor fixes to MemoryStore
      
      This is a repeat of #5, moved to its own branch in my repo.
      
      This makes all updates to   on ; it skips on synchronizing the reads where it can get away with it.
    • Merge pull request #9 from rxin/limit · 6566a19b
      Patrick Wendell authored
      Smarter take/limit implementation.
  5. Sep 25, 2013
    • Improved organization of scheduling packages. · d85fe41b
      Kay Ousterhout authored
      This commit does not change any code -- only file organization.
      
      There are two components of this change:
      (1) Moving files out of the cluster package, and down
      a level to the scheduling package. These files are all used by
      the local scheduler in addition to the cluster scheduler(s), so
      should not be in the cluster package. As a result of this change,
      none of the files in the local package reference files in the
      cluster package.
      
      (2) Moving the mesos package to within the cluster package.
      The mesos scheduling code is for a cluster, and represents a
      specific case of cluster scheduling (the Mesos-related classes
      often subclass cluster scheduling classes). Thus, the most logical
      place for it is within the cluster package.
  6. Sep 24, 2013
  7. Sep 23, 2013
  8. Sep 22, 2013
  9. Sep 21, 2013
  10. Sep 20, 2013
  11. Sep 19, 2013
    • Merge pull request #938 from ilikerps/master · cd7222c3
      Patrick Wendell authored
      Fix issue with spark_ec2 seeing empty security groups
    • Fix issue with spark_ec2 seeing empty security groups · f589ce77
      Aaron Davidson authored
      Under unknown, but occasional, circumstances, reservation.groups is empty
      despite reservation.instances each having groups. This means that the
      spark_ec2 get_existing_clusters() method would fail to find any instances.
      To fix it, we simply use the instances' groups as the source of truth.
      
      Note that this is actually just a revival of PR #827, now that the issue
      has been reproduced.
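The idea of the fix can be sketched in Python without any AWS dependency. The `Group`, `Instance`, and `Reservation` tuples below are hypothetical stand-ins for the objects the EC2 client returns, `find_cluster_instances` is not the actual `get_existing_clusters()` code, and the `-master` / `-slaves` group-name suffixes are an assumption for illustration.

```python
from collections import namedtuple

# Hypothetical stand-ins for the EC2 client's objects.
Group = namedtuple("Group", "name")
Instance = namedtuple("Instance", "id groups")
Reservation = namedtuple("Reservation", "groups instances")


def find_cluster_instances(reservations, cluster_name):
    """Classify instances by their own groups, ignoring reservation.groups."""
    master_group = cluster_name + "-master"  # assumed naming scheme
    slave_group = cluster_name + "-slaves"   # assumed naming scheme
    masters, slaves = [], []
    for res in reservations:
        for inst in res.instances:
            names = [g.name for g in inst.groups]
            if master_group in names:
                masters.append(inst.id)
            elif slave_group in names:
                slaves.append(inst.id)
    return masters, slaves


# reservation.groups is empty even though each instance has groups.
reservations = [
    Reservation(
        groups=[],
        instances=[
            Instance("i-1", [Group("demo-master")]),
            Instance("i-2", [Group("demo-slaves")]),
        ],
    )
]
masters, slaves = find_cluster_instances(reservations, "demo")
```

A lookup keyed on `reservation.groups` would have found nothing for this input, which matches the failure the commit describes.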
  12. Sep 18, 2013
  13. Sep 16, 2013