- Jan 10, 2014
-
-
Matei Zaharia authored
Fix default TTL for metadata cleaner. It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.
-
Patrick Wendell authored
Fix a type error in comment lines.
-
Patrick Wendell authored
Add i2 instance types to Spark EC2. Using data from http://aws.amazon.com/amazon-linux-ami/instance-type-matrix/ and http://www.ec2instances.info/
-
Matei Zaharia authored
-
Patrick Wendell authored
API for automatic driver recovery for streaming programs and other bug fixes
1. Added Scala and Java API for automatically loading a checkpoint if it exists in the provided checkpoint directory (a usage sketch follows after this list). Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext. Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)` returns a JavaStreamingContext. See the RecoverableNetworkWordCount below as an example of how to use it.
2. Refactored streaming.Checkpoint*** code to fix bugs and make the DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files. It also ensures that spark.driver.* and spark.hostPort are cleared from SparkConf before being written to the checkpoint.
3. Fixed a bug in the cleanup of checkpointed RDDs created by DStream. Specifically, this fix ensures that checkpointed RDD files are not prematurely cleaned up, thus ensuring reliable recovery.
4. TimeStampedHashMap is upgraded to optionally update the timestamp on map.get(key). This allows clearing of data based on access time (i.e., clearing records that were last accessed before a threshold timestamp).
5. Added caching of file modification times in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds if there are 1000s of files. This cache is automatically cleared.
This PR is not entirely final as I may make some minor additions - a Java example, and adding StreamingContext.getOrCreate to the unit tests. Edit: Java example to be added later, unit test added.
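As a rough illustration of the Scala API in point 1, the sketch below builds a context only when no checkpoint exists; the checkpoint path, app name, and DStream setup are placeholders, not part of this change.

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Hypothetical driver skeleton using the new getOrCreate API.
object RecoverableApp {
  def main(args: Array[String]): Unit = {
    val checkpointDir = "hdfs:///tmp/recoverable-app"   // placeholder path

    // Called only when no checkpoint exists in checkpointDir.
    def createContext(): StreamingContext = {
      val ssc = new StreamingContext("local[2]", "RecoverableApp", Seconds(1))
      ssc.checkpoint(checkpointDir)
      // ... define DStreams here ...
      ssc
    }

    // Restores the context from the checkpoint if present, otherwise creates a new one.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}
```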
-
Patrick Wendell authored
External Sorting for Aggregator and CoGroupedRDDs (Revisited) (This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving) The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted. The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order. Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
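To make the spill-then-merge idea concrete, here is a minimal, illustrative sketch (not Spark's actual ExternalAppendOnlyMap): an in-memory combiner that writes out a sorted run whenever it crosses a size threshold and merges all runs on iteration.

```scala
import scala.collection.mutable

// Toy combiner: spills sorted runs once the in-memory map exceeds a size
// threshold, then merges everything back on iteration (naively, for brevity).
class SpillingCombiner[K: Ordering, V](maxInMemory: Int, mergeValue: (V, V) => V) {
  private var current = mutable.Map.empty[K, V]
  private val spills = mutable.ArrayBuffer.empty[Seq[(K, V)]]

  def insert(k: K, v: V): Unit = {
    current(k) = current.get(k).map(mergeValue(_, v)).getOrElse(v)
    if (current.size > maxInMemory) {         // memory threshold exceeded: spill
      spills += current.toSeq.sortBy(_._1)    // a sorted run, as if written to disk
      current = mutable.Map.empty[K, V]
    }
  }

  def iterator: Iterator[(K, V)] =
    (spills :+ current.toSeq.sortBy(_._1)).flatten
      .groupBy(_._1)
      .map { case (k, kvs) => k -> kvs.map(_._2).reduce(mergeValue) }
      .iterator
}
```

The real implementation additionally tracks memory in bytes rather than entry counts and streams the sorted runs back from disk during the merge.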
-
Tathagata Das authored
Conflicts: streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
-
Andrew Or authored
-
Tathagata Das authored
-
Reynold Xin authored
SPARK-961: Add a Vector.random() method. Added the method and test cases.
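A hypothetical usage sketch; the package (org.apache.spark.util.Vector), the exact signature of random(), and the range of the generated values are assumptions based only on the summary above.

```scala
import org.apache.spark.util.Vector

// Assumed usage: Vector.random(length) returns a vector of `length` elements
// drawn from a default random source.
val v = Vector.random(10)
println(v)
```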
-
Andrew Or authored
Aside from trivial formatting changes, use nulls instead of Options for DiskMapIterator, and add documentation for spark.shuffle.externalSorting and spark.shuffle.memoryFraction. Also, set spark.shuffle.memoryFraction to 0.3 and spark.storage.memoryFraction to 0.6.
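For reference, a sketch of how these documented settings would be applied through SparkConf; the values simply mirror the defaults mentioned above.

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.shuffle.externalSorting", "true")   // enable spilling of in-memory maps
  .set("spark.shuffle.memoryFraction", "0.3")     // default described above
  .set("spark.storage.memoryFraction", "0.6")     // default described above
```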
-
RongGu authored
-
Thomas Graves authored
Yarn client addJar and misc fixes. Fix the addJar functionality in yarn-client mode, add support for the other options supported in yarn-standalone mode, set the application type on YARN in Hadoop 2.x, add documentation, and change the heartbeat interval to use the same code as yarn-standalone so it doesn't take so long to get containers and exit.
-
Patrick Wendell authored
Make DEBUG-level logs consumable. Removes two things that caused issues with the debug logs: (a) Internal polling in the DAGScheduler was polluting the logs. (b) The Scala REPL logs were really noisy.
-
Shivaram Venkataraman authored
-
Tathagata Das authored
-
Patrick Wendell authored
Removes two things that caused issues with the debug logs: (a) Internal polling in the DAGScheduler was polluting the logs. (b) The Scala REPL logs were really noisy.
-
Tathagata Das authored
-
Tathagata Das authored
-
Tathagata Das authored
-
Tathagata Das authored
-
Tathagata Das authored
-
Matei Zaharia authored
It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default
-
Pillis authored
-
Matei Zaharia authored
Fix bug added when we changed AppDescription.maxCores to an Option. The Scala compiler warned about this -- we were comparing an Option against an integer now.
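The pitfall in question, roughly (illustrative, not the actual Spark code):

```scala
// After maxCores became an Option[Int], a comparison like this still compiles
// but is always false, which is what the Scala compiler warned about.
val maxCores: Option[Int] = Some(4)

val wrong = maxCores == 4             // Option compared against an Int: always false
val right = maxCores.exists(_ == 4)   // compare against the wrapped value instead
```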
-
Patrick Wendell authored
Enable shuffle consolidation by default. Bump this to being enabled for 0.9.0.
-
Patrick Wendell authored
Bump this to being enabled for 0.9.0.
-
Patrick Wendell authored
Set default logging to WARN for Spark streaming examples. This programmatically sets the log level to WARN by default for streaming tests. If the user has already specified a log4j.properties file, the user's file will take precedence over this default.
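Roughly what setting the level programmatically amounts to with log4j 1.x (a sketch, assuming the root logger is the one being adjusted):

```scala
import org.apache.log4j.{Level, Logger}

// Applied only if the user has not supplied their own log4j.properties.
Logger.getRootLogger.setLevel(Level.WARN)
```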
-
- Jan 09, 2014
-
-
Andrew Or authored
-
Andrew Or authored
This is an alternative to the existing approach, which evenly distributes the collective shuffle memory among all running tasks. In the new approach, each thread requests a chunk of memory whenever its map is about to grow multiplicatively. If there is sufficient memory in the global pool, the thread allocates it and grows its map. Otherwise, it spills. A danger with the previous approach is that a new task may quickly fill up its map before old tasks finish spilling, potentially causing an OOM. This approach prevents this scenario as it favors existing tasks over new tasks; any thread that may step over the boundary of other threads defensively backs off and starts spilling. Testing through spark-perf reveals: (1) When no spills have occurred, the performance of external sorting using this memory management approach is essentially the same as without external sorting. (2) When one or more spills have occurred, the performance of external sorting is a small multiple (3x) worse.
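A minimal sketch of the policy described above (illustrative only, not Spark's implementation): tasks ask a shared pool before growing their maps and spill when the request cannot be granted.

```scala
// Toy shared pool: a task asks for more memory before doubling its map;
// if the pool cannot grant the request, the task spills instead of growing.
class ShuffleMemoryPool(capacity: Long) {
  private var used = 0L

  def tryAcquire(bytes: Long): Boolean = synchronized {
    if (used + bytes <= capacity) { used += bytes; true } else false
  }

  def release(bytes: Long): Unit = synchronized {
    used = math.max(0L, used - bytes)
  }
}

// Inside a task, before growing the in-memory map:
//   if (pool.tryAcquire(currentMapBytes)) growMap() else spillToDisk()
```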
-
Andrew Or authored
Conflicts: core/src/main/scala/org/apache/spark/SparkEnv.scala streaming/src/test/java/org/apache/spark/streaming/JavaAPISuite.java
-
Patrick Wendell authored
-
Patrick Wendell authored
Simplify and fix pyspark script. This patch removes compatibility for IPython < 1.0 but fixes the launch script and makes it much simpler. I tested this using the three commands in the PySpark documentation page:
1. IPYTHON=1 ./pyspark
2. IPYTHON_OPTS="notebook" ./pyspark
3. IPYTHON_OPTS="notebook --pylab inline" ./pyspark
There are two changes:
- We rely on the PYTHONSTARTUP env var to start PySpark.
- Removed the quotes around $IPYTHON_OPTS... having quotes gloms them together as a single argument passed to `exec`, which seemed to cause IPython to fail (it instead expects them as multiple arguments).
-
Tathagata Das authored
-
Tathagata Das authored
-
Reynold Xin authored
Add some missing Java API methods. These are primarily for setting job groups, canceling jobs, and setting names on RDDs. Seemed like useful stuff to expose in Java.
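The Scala equivalents of the methods being exposed, as a quick sketch (the app name and group id are placeholders):

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "JobGroupExample")

sc.setJobGroup("nightly-report", "Aggregations for the nightly report")  // tag subsequent jobs
val data = sc.parallelize(1 to 1000).setName("input-range")              // name shown in the UI
// ... run jobs on `data` ...
sc.cancelJobGroup("nightly-report")   // cancel all jobs tagged with that group
```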
-
Reynold Xin authored
Bug fixes for updating the RDD block's memory and disk usage information. From the code context, we can find that the memSize and diskSize here are both always equal to the size of the block. Actually, they are never zero. Thus, the logic here is wrong for recording the block usage in BlockStatus, especially for blocks which are dropped from memory to ensure space for the new input RDD blocks. I have tested that this causes the storage metrics shown on the Storage web page to be wrong and misleading. With this patch, the metrics will be okay. Finally, Merry Christmas, guys :)
-
Patrick Wendell authored
-
Patrick Wendell authored
SPARK-998: Support Launching Driver Inside of Standalone Mode. [NOTE: I need to bring the tests up to date with new changes, so for now they will fail] This patch provides support for launching driver programs inside of a standalone cluster manager. It also supports monitoring and re-launching of driver programs, which is useful for long-running, recoverable applications such as Spark Streaming jobs. For those jobs, this patch allows a deployment mode which is resilient to the failure of any worker node, failure of a master node (provided a multi-master setup), and even failures of the application itself, provided they are recoverable on a restart. Driver information, such as the status and logs from a driver, is displayed in the UI.
There are a few small TODOs here, but the code is generally feature-complete. They are:
- Bring tests up to date and add test coverage
- Restarting on failure should be optional and maybe off by default.
- See if we can re-use akka connections to facilitate clients behind a firewall
A sensible place to start for review would be to look at the `DriverClient` class, which presents users the ability to launch their driver program. I've also added an example program (`DriverSubmissionTest`) that allows you to test this locally and play around with killing workers, etc. Most of the code is devoted to persisting driver state in the cluster manager, exposing it in the UI, and dealing correctly with various types of failures. Instructions to test locally:
- `sbt/sbt assembly/assembly examples/assembly`
- start a local version of the standalone cluster manager
```
./spark-class org.apache.spark.deploy.client.DriverClient \
  -j -Dspark.test.property=something \
  -e SPARK_TEST_KEY=SOMEVALUE \
  launch spark://10.99.1.14:7077 \
  ../path-to-examples-assembly-jar \
  org.apache.spark.examples.DriverSubmissionTest 1000 some extra options --some-option-here -X 13
```
- Go in the UI and make sure it started correctly, look at the output, etc.
- Kill workers, the driver program, masters, etc.
-
Matei Zaharia authored
The Scala compiler warned about this -- we were comparing an Option against an integer now.
-