Commits · e1190229e13453cdb1e7c28fdf300d1f8dd717c2 · cs525-sp18-g07 / spark

Oct 06, 2013
- Add end-to-end test for standalone scheduler fault tolerance · e1190229
  Aaron Davidson authored 11 years ago
  
  Docker files drawn mostly from Matt Masse. Some updates from Andre Schumacher.
  e1190229
Oct 05, 2013
- Address Matei's comments · 0f070279
  Aaron Davidson authored 11 years ago
  
  0f070279
Oct 04, 2013

Fix race conditions during recovery · db6f1549

Aaron Davidson authored 11 years ago

One major change was the use of messages instead of raw functions as the
parameter of Akka scheduled timers. Since messages are serialized, unlike
raw functions, the behavior is easier to think about and doesn't cause
race conditions when exceptions are thrown.

Another change is to avoid using global pointers that might change without
a lock.
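
The difference between scheduling a message and scheduling a raw function can be sketched outside Akka. The following is a minimal Python illustration, not Spark's code: it assumes an actor that drains a mailbox on a single thread, and `MiniActor`, `"tick"`, and `"boom"` are hypothetical names. The timer thread only enqueues a message, so every state change and every exception happens inside the actor loop.

```python
import queue
import threading


class MiniActor:
    """Toy single-threaded actor: all state changes happen in run()."""

    def __init__(self):
        self.mailbox = queue.Queue()
        self.ticks = 0      # state touched only by the actor loop
        self.errors = []

    def schedule(self, delay_s, message):
        # The timer thread never touches actor state; it only enqueues a message.
        timer = threading.Timer(delay_s, self.mailbox.put, args=(message,))
        timer.daemon = True
        timer.start()
        return timer

    def run(self, max_messages):
        # Messages are handled one at a time, in arrival order.
        for _ in range(max_messages):
            message = self.mailbox.get(timeout=5)
            try:
                self.receive(message)
            except Exception as exc:
                # A failing handler is isolated to this one message.
                self.errors.append(exc)

    def receive(self, message):
        if message == "tick":
            self.ticks += 1
        elif message == "boom":
            raise RuntimeError("handler failed")


actor = MiniActor()
actor.schedule(0.01, "tick")
actor.schedule(0.02, "boom")
actor.schedule(0.03, "tick")
actor.run(max_messages=3)
```

Had the timers run a function that mutated `ticks` directly, the update would execute on the timer thread and could interleave with the actor loop.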

db6f1549

Sep 26, 2013

Add license notices · 42d72308
Aaron Davidson authored 11 years ago

42d72308

Standalone Scheduler fault tolerance using ZooKeeper · f549ea33

Aaron Davidson authored 11 years ago

This patch implements full distributed fault tolerance for standalone scheduler Masters.
There is only one master Leader at a time, which is actively serving scheduling
requests. If this Leader crashes, another master will eventually be elected, reconstruct
the state from the first Master, and continue serving scheduling requests.

Leader election is performed using the ZooKeeper leader election pattern. We try to minimize
the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of
retries and session monitoring on top of the ZooKeeper client.
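
As a rough illustration of the election pattern, here is an in-memory Python sketch. It assumes the commonly documented recipe in which each candidate creates an ephemeral sequential node and the lowest live sequence number leads; the patch text does not spell out the exact variant. `FakeElectionDir` is a hypothetical stand-in and does not talk to ZooKeeper.

```python
class FakeElectionDir:
    """In-memory stand-in for a ZooKeeper election directory (hypothetical)."""

    def __init__(self):
        self._next_seq = 0
        self._nodes = {}  # sequence number -> candidate name

    def join(self, candidate):
        # Mimics creating an ephemeral sequential znode.
        seq = self._next_seq
        self._next_seq += 1
        self._nodes[seq] = candidate
        return seq

    def session_lost(self, seq):
        # Mimics the ephemeral node vanishing when its session ends.
        self._nodes.pop(seq, None)

    def leader(self):
        # The candidate holding the lowest live sequence number leads.
        if not self._nodes:
            return None
        return self._nodes[min(self._nodes)]


election = FakeElectionDir()
seq_a = election.join("master-a")
seq_b = election.join("master-b")
first_leader = election.leader()
election.session_lost(seq_a)  # the current leader crashes
second_leader = election.leader()
```

The retry and session-monitoring layer mentioned above is not modelled here.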

Master failover follows directly from the single-node Master recovery via the file
system (patch 194ba4b8), save that the Master state is stored in ZooKeeper instead.

Configuration:
By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE).
By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url
to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled.
By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory
to an appropriate directory accessible by the Master, we keep the behavior from 194ba4b8.
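
The three modes can be summarized in a small helper. This is a sketch only: it assumes the properties are passed to the Master JVM as `-D` system properties, and `recovery_java_opts`, the ZooKeeper address, and the directory path are hypothetical.

```python
def recovery_java_opts(mode="NONE", zookeeper_url=None, recovery_dir=None):
    """Build -D flags for the recovery properties named above (hypothetical helper)."""
    opts = ["-Dspark.deploy.recoveryMode=" + mode]
    if mode == "ZOOKEEPER":
        if not zookeeper_url:
            raise ValueError("ZOOKEEPER mode needs spark.deploy.zookeeper.url")
        opts.append("-Dspark.deploy.zookeeper.url=" + zookeeper_url)
    elif mode == "FILESYSTEM":
        if not recovery_dir:
            raise ValueError("FILESYSTEM mode needs spark.deploy.recoveryDirectory")
        opts.append("-Dspark.deploy.recoveryDirectory=" + recovery_dir)
    elif mode != "NONE":
        raise ValueError("unknown recovery mode: " + mode)
    return " ".join(opts)


# Example values are placeholders, not recommended settings.
zk_opts = recovery_java_opts("ZOOKEEPER", zookeeper_url="zk1:2181,zk2:2181")
fs_opts = recovery_java_opts("FILESYSTEM", recovery_dir="/var/spark/recovery")
default_opts = recovery_java_opts()
```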

Additionally, places where a Master could be specified by a spark:// url can now take
comma-delimited lists to specify backup masters. Note that this is only used for registration
of NEW Workers and application Clients. Once a Worker or Client has registered with the
Master Leader, it is "in the system" and will never need to register again.
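
One way a client might expand such a list is sketched below. It assumes the `spark://` prefix appears once and is followed by comma-separated `host:port` entries; `split_master_urls` and the host names are hypothetical.

```python
def split_master_urls(spark_url):
    """Expand 'spark://h1:p1,h2:p2' into one spark:// URL per master."""
    prefix = "spark://"
    if not spark_url.startswith(prefix):
        raise ValueError("expected a spark:// URL, got: " + spark_url)
    hosts = [h.strip() for h in spark_url[len(prefix):].split(",")]
    # Skip empty entries left by stray commas.
    return [prefix + h for h in hosts if h]


masters = split_master_urls("spark://host1:7077,host2:7077")
single = split_master_urls("spark://host1:7077")
```

A new Worker or Client would then try to register with each entry until it reaches the current Leader.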

Forthcoming:
Documentation, tests (! - only ad hoc testing has been performed so far)
I do not intend for this commit to be merged until tests are added, but this patch should
still be mostly reviewable until then.

f549ea33

Standalone Scheduler fault recovery · d5a96fec

Aaron Davidson authored 11 years ago

Implements a basic form of Standalone Scheduler fault recovery. In particular,
a fault can now be recovered from manually by restarting the Master process
on the same machine. This is the majority of the code necessary
for general fault tolerance, which will first elect a leader and then recover
the Master state.

In order to enable fault recovery, the Master will persist a small amount of state related
to the registration of Workers and Applications to disk. If the Master is started and
sees that this state is still around, it will enter Recovery mode, during which time it
will not schedule any new Executors on Workers (but it does accept the registration of
new Clients and Workers).

At this point, the Master attempts to reconnect to all Workers and Client applications
that were registered at the time of failure. After confirming either the existence
or nonexistence of all such nodes (within a certain timeout), the Master will exit
Recovery mode and resume normal scheduling.
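
The recovery flow described above can be modelled as a small state machine. The following Python sketch is illustrative only: `RecoveringMaster`, its method names, and the 60-second timeout are assumptions, and registration of new nodes during recovery is left out.

```python
import time


class RecoveringMaster:
    """Toy model of the recovery flow described above (all names hypothetical)."""

    def __init__(self, persisted_nodes, timeout_s, clock=time.monotonic):
        self.clock = clock
        self.unknown = set(persisted_nodes)  # nodes not yet heard from
        self.alive = set()
        self.state = "RECOVERING" if persisted_nodes else "ALIVE"
        self.deadline = clock() + timeout_s

    def node_responded(self, node):
        # A previously registered Worker or application confirmed it still exists.
        if node in self.unknown:
            self.unknown.discard(node)
            self.alive.add(node)
        self._maybe_finish()

    def check_timeout(self):
        # Nodes that stay silent past the deadline are treated as gone.
        if self.state == "RECOVERING" and self.clock() >= self.deadline:
            self.unknown.clear()
        self._maybe_finish()

    def can_schedule(self):
        # No new Executors are scheduled until recovery completes.
        return self.state == "ALIVE"

    def _maybe_finish(self):
        if self.state == "RECOVERING" and not self.unknown:
            self.state = "ALIVE"


now = [0.0]  # fake clock so the example does not sleep
master = RecoveringMaster({"worker-1", "worker-2", "app-1"},
                          timeout_s=60, clock=lambda: now[0])
master.node_responded("worker-1")
master.node_responded("app-1")
blocked = not master.can_schedule()  # worker-2 is still unconfirmed
now[0] = 61.0
master.check_timeout()
```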

d5a96fec

Merge pull request #16 from pwendell/master · 13eced72
Reynold Xin authored 11 years ago
```
Bug fix in master build
```
13eced72

Merge pull request #14 from kayousterhout/untangle_scheduler · 70a0b993

Reynold Xin authored 11 years ago

Improved organization of scheduling packages.

This commit does not change any code -- only file organization.
Please let me know if there was some masterminded strategy behind
the existing organization that I failed to understand!

There are two components of this change:
(1) Moving files out of the cluster package, and down
a level to the scheduling package. These files are all used by
the local scheduler in addition to the cluster scheduler(s), so
should not be in the cluster package. As a result of this change,
none of the files in the local package reference files in the
cluster package.

(2) Moving the mesos package to within the cluster package.
The mesos scheduling code is for a cluster, and represents a
specific case of cluster scheduling (the Mesos-related classes
often subclass cluster scheduling classes). Thus, the most logical
place for it seems to be within the cluster package.

The one thing about the scheduling code that seems a little funny to me
is the naming of the SchedulerBackends.  The StandaloneSchedulerBackend
is not just for Standalone mode, but instead is used by Mesos coarse grained
mode and Yarn, and the backend that *is* just for Standalone mode is instead
called SparkDeploySchedulerBackend. I didn't change this because I wasn't sure
if there was a reason for this naming that I'm just not aware of.

70a0b993

Merge pull request #670 from jey/ec2-ssh-improvements · 76677b8f
Reynold Xin authored 11 years ago
```
EC2 SSH improvements
```
76677b8f
Merge pull request #930 from holdenk/master · c514cd15
Reynold Xin authored 11 years ago
```
Add mapPartitionsWithIndex
```
c514cd15
Bug fix in master build · e2ff59af
Patrick Wendell authored 11 years ago

e2ff59af

Merge pull request #7 from wannabeast/memorystore-fixes · 560ee5c9

Reynold Xin authored 11 years ago

some minor fixes to MemoryStore

This is a repeat of #5, moved to its own branch in my repo.

This makes all updates to "currentMemory" synchronized on "entries"; it skips synchronizing the reads where it can get away with it.

560ee5c9

Merge pull request #9 from rxin/limit · 6566a19b
Patrick Wendell authored 11 years ago
```
Smarter take/limit implementation.
```
6566a19b

Sep 25, 2013

Improved organization of scheduling packages. · d85fe41b

Kay Ousterhout authored 11 years ago

This commit does not change any code -- only file organization.

There are two components of this change:
(1) Moving files out of the cluster package, and down
a level to the scheduling package. These files are all used by
the local scheduler in addition to the cluster scheduler(s), so
should not be in the cluster package. As a result of this change,
none of the files in the local package reference files in the
cluster package.

(2) Moving the mesos package to within the cluster package.
The mesos scheduling code is for a cluster, and represents a
specific case of cluster scheduling (the Mesos-related classes
often subclass cluster scheduling classes). Thus, the most logical
place for it is within the cluster package.

d85fe41b

Sep 24, 2013
- Merge remote-tracking branch 'apache-github/pr/13' into HEAD · 9d34838b
  Patrick Wendell authored 11 years ago
  
  9d34838b
- Update build version in master · 6079721f
  Patrick Wendell authored 11 years ago
  
  6079721f
Sep 23, 2013
- Fix formatting :) · 0cef6835
  Holden Karau authored 11 years ago
  
  0cef6835
- Merge remote-tracking branch 'pr/12' · 7220e8f9
  Reynold Xin authored 11 years ago
  
  Fix spacing so java.io.tmpdir doesn't run on with SPARK_JAVA_OPTS
  7220e8f9
- Fix spacing so that the java.io.tmpdir doesn't run on with SPARK_JAVA_OPTS · a314b307
  Y.CORP.YAHOO.COM\tgraves authored 11 years ago
  
  a314b307
- Merge branch 'master' of https://git-wip-us.apache.org/repos/asf/incubator-spark · 0d2e5c3e
  Reynold Xin authored 11 years ago
  
  0d2e5c3e
- Merge branch 'master' of github.com:markhamstra/incubator-spark · ff540a01
  Reynold Xin authored 11 years ago
  
  ff540a01
- Merge branch 'master' of github.com:mesos/spark · f4dc9d37
  Reynold Xin authored 11 years ago
  
  f4dc9d37
Sep 22, 2013
- Switch indent from 2 to 4 spaces · 7fe0b0ff
  Holden Karau authored 11 years ago
  
  7fe0b0ff
- Merge pull request #928 from jerryshao/fairscheduler-refactor · 834686b1
  Reynold Xin authored 11 years ago
  
  Refactor FairSchedulableBuilder
  834686b1
- Change Exception to NoSuchElementException and minor style fix · 77e9da1f
  jerryshao authored 11 years ago
  
  77e9da1f
- Remove infix style and others · 85024acd
  jerryshao authored 11 years ago
  
  85024acd
- Refactor FairSchedulableBuilder: · 5850f599
  jerryshao authored 11 years ago
  
  1. Configuration can be read from classpath if not set explicitly. 2. Add missing close handler.
  5850f599
- Merge pull request #937 from jerryshao/localProperties-fix · a2ea069a
  Reynold Xin authored 11 years ago
  
  Fix PR926 local properties issues in Spark Streaming like scenarios
  a2ea069a
- Merge pull request #941 from ilikerps/master · f06f2da2
  Reynold Xin authored 11 years ago
  
  Add "org.apache." prefix to packages in spark-class
  f06f2da2
- Merge pull request #940 from ankurdave/clear-port-properties-after-tests · 7bb12a2a
  Reynold Xin authored 11 years ago
  
  After unit tests, clear port properties unconditionally
  7bb12a2a
Sep 21, 2013
- Add barrier for local properties unit test and fix some styles · aa0c29f7
  jerryshao authored 11 years ago
  
  aa0c29f7
Sep 20, 2013

Add "org.apache." prefix to packages in spark-class · 8933f9e9
Aaron Davidson authored 11 years ago
```
Lacking this, the if/case statements never trigger on Spark 0.8.0+.
```
8933f9e9
Smarter take/limit implementation. · 42571d30
Reynold Xin authored 11 years ago

42571d30
Merge branch 'master' of github.com:mesos/spark · 119de802
Reynold Xin authored 11 years ago

119de802

Synchronize on "entries" the remaining update to "currentMemory". · 9524b943

Mike authored 11 years ago

Make "currentMemory" @volatile, so that its reads in ensureFreeSpace() are atomic and up-to-date--i.e., currentMemory can't increase while putLock is held (though it could decrease, which would only help ensureFreeSpace()).

9524b943

After unit tests, clear port properties unconditionally · 026dba6a

Ankur Dave authored 11 years ago

In MapOutputTrackerSuite, the "remote fetch" test sets spark.driver.port
and spark.hostPort, assuming that they will be cleared by
LocalSparkContext. However, the test never sets sc, so it remains null,
causing LocalSparkContext to skip clearing these properties. Subsequent
tests therefore fail with java.net.BindException: "Address already in
use".

This commit makes LocalSparkContext clear the properties even if sc is
null.
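
The shape of the fix, sketched in Python rather than the Scala test code: the context is stopped only if it exists, but the properties are cleared either way. `properties`, `FakeContext`, and `reset_context` are hypothetical stand-ins for the JVM system properties and the test helper.

```python
properties = {}  # stand-in for JVM system properties


class FakeContext:
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True


def reset_context(sc):
    """Stop sc if present, but always clear the port properties."""
    if sc is not None:
        sc.stop()
    # Cleared unconditionally: a test may set these without ever creating sc.
    properties.pop("spark.driver.port", None)
    properties.pop("spark.hostPort", None)


# Mirrors the failing case: properties are set, but sc was never assigned.
properties["spark.driver.port"] = "12345"
properties["spark.hostPort"] = "localhost:12345"
reset_context(None)
```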

026dba6a

Sep 19, 2013

Merge pull request #938 from ilikerps/master · cd7222c3
Patrick Wendell authored 11 years ago
```
Fix issue with spark_ec2 seeing empty security groups
```
cd7222c3

Fix issue with spark_ec2 seeing empty security groups · f589ce77

Aaron Davidson authored 11 years ago

Under unknown, but occasional, circumstances, reservation.groups is empty
despite reservation.instances each having groups. This means that the
spark_ec2 get_existing_clusters() method would fail to find any instances.
To fix it, we simply use the instances' groups as the source of truth.

Note that this is actually just a revival of PR #827, now that the issue
has been reproduced.
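
A minimal sketch of the approach, using plain namedtuples in place of the EC2 client objects. `Group`, `Instance`, `Reservation`, `instances_in_group`, and the group names are hypothetical; this is not the actual body of get_existing_clusters().

```python
from collections import namedtuple

# Simplified stand-ins for the objects returned by the EC2 client library.
Group = namedtuple("Group", ["name"])
Instance = namedtuple("Instance", ["id", "groups"])
Reservation = namedtuple("Reservation", ["groups", "instances"])


def instances_in_group(reservations, group_name):
    """Select instances by their own groups, ignoring reservation.groups."""
    found = []
    for res in reservations:
        for inst in res.instances:
            if group_name in [g.name for g in inst.groups]:
                found.append(inst)
    return found


# reservation.groups is empty here, mirroring the failure described above.
res = Reservation(
    groups=[],
    instances=[
        Instance("i-1", [Group("demo-master")]),
        Instance("i-2", [Group("demo-slaves")]),
    ],
)
masters = instances_in_group([res], "demo-master")
slaves = instances_in_group([res], "demo-slaves")
```

Filtering on `res.groups` instead would return nothing for this reservation.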

f589ce77

Sep 18, 2013
- Fix issue when local properties pass from parent to child thread · ffa5f8e1
  jerryshao authored 11 years ago
  
  ffa5f8e1
Sep 16, 2013
- Merge branch 'master' of github.com:mesos/spark · 3443d3fd
  Reynold Xin authored 11 years ago
  
  3443d3fd