- Jan 12, 2014
-
-
Tathagata Das authored
Moved DStream, DStreamCheckpointData and PairDStream from org.apache.spark.streaming to org.apache.spark.streaming.dstream.
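A minimal sketch of the updated import path after this move (only DStream shown; the other moved classes follow the same pattern):

```scala
// DStream now lives in the dstream subpackage rather than org.apache.spark.streaming.
import org.apache.spark.streaming.dstream.DStream
```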
-
Tathagata Das authored
-
Tathagata Das authored
-
Tathagata Das authored
-
Tathagata Das authored
Converted JobScheduler to use actors for event handling. Changed protected[streaming] to private[streaming] in StreamingContext and DStream. Added waitForStop to StreamingContext, and StreamingContextSuite.
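A hedged sketch of the blocking-shutdown pattern this adds, using the waitForStop name given in the commit message (the method may be named differently in released versions; the master string and batch interval are made up for illustration):

```scala
import org.apache.spark.streaming.{Seconds, StreamingContext}

object WaitForStopSketch {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext("local[2]", "WaitForStopSketch", Seconds(1))
    // ... define input DStreams and output operations here ...
    ssc.start()
    ssc.waitForStop()  // block the main thread until stop() is called or an error occurs
  }
}
```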
-
- Jan 11, 2014
-
-
Reynold Xin authored
Minor update for clone writables and more documentation.
-
Reynold Xin authored
Fix UI bug introduced in #244. The 'duration' field was incorrectly renamed to 'task time' in the table that lists stages.
-
Reynold Xin authored
-
Patrick Wendell authored
Revert PR 381. This PR missed a bunch of test cases that require "spark.cleaner.ttl". I think it is what is causing the test failures on Jenkins right now (though it's a bit hard to tell because the DNS for cs.berkeley.edu is down). I'm submitting this to see if it fixes Jenkins. I did try just patching the various tests, but it was taking a really long time because there are a bunch of them, so for now I'm just seeing if a revert works.
-
Patrick Wendell authored
This reverts commit 669ba4ca.
-
Patrick Wendell authored
This reverts commit 942c80b3.
-
Reynold Xin authored
Fix a small problem in ALS where configure didn't work.
-
Reynold Xin authored
-
Reynold Xin authored
We clone Hadoop keys and values by default and reuse objects only if asked to. We clone the most common Writable types directly and fall back to WritableUtils.clone otherwise. The intention is to optimize: for NullWritable no copy is needed at all, and for Long, int, and String writables, creating a new object with the value set should be faster than copying the existing object. There is another way to do this PR, where the caller specifies separately whether to clone keys and values, but I could not think of a use case for it other than one of them being a NullWritable, which is already handled, so that option seemed unnecessary.
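A minimal sketch of the cloning fallback mentioned above, using Hadoop's WritableUtils.clone on a Text value (the record strings are made up for illustration):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{Text, WritableUtils}

object CloneWritableSketch {
  def main(args: Array[String]): Unit = {
    val conf = new Configuration()
    val reused = new Text("first record")         // Hadoop record readers reuse this object
    val copy = WritableUtils.clone(reused, conf)  // generic fallback: serialize + deserialize
    reused.set("second record")                   // simulate the reader overwriting the value
    println(copy)                                 // still prints "first record"
  }
}
```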
-
Patrick Wendell authored
The 'duration' field was incorrectly renamed to 'task time' in the table that lists stages.
-
Patrick Wendell authored
Upgrade Kafka dependency to the 0.8.0 release version.
-
jerryshao authored
-
Reynold Xin authored
Change clientId to a random clientId. The client identifier should be unique across all clients connecting to the same server. A convenience method, generateClientId(), is provided to generate a random client id that satisfies this criterion; it returns a randomly generated client identifier based on the current user's login name and the system time. Because the server uses the client identifier to recognize a client when it reconnects, the client must use the same identifier between connections if durable subscriptions are to be used.
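A minimal sketch of the random-client-id pattern, assuming the Eclipse Paho MQTT client used by the streaming MQTT receiver (the broker URL is made up):

```scala
import org.eclipse.paho.client.mqttv3.{MqttClient, MqttConnectOptions}
import org.eclipse.paho.client.mqttv3.persist.MemoryPersistence

object RandomClientIdSketch {
  def main(args: Array[String]): Unit = {
    // generateClientId() builds an identifier from the user's login name and the
    // system time, so concurrent clients on the same broker do not collide.
    val clientId = MqttClient.generateClientId()
    val client = new MqttClient("tcp://localhost:1883", clientId, new MemoryPersistence())
    client.connect(new MqttConnectOptions())
    client.disconnect()
  }
}
```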
-
Reynold Xin authored
Small typo fix
-
- Jan 10, 2014
-
-
Matei Zaharia authored
Fix default TTL for metadata cleaner. It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.
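For cases where the cleaner is wanted, a minimal sketch of turning it on explicitly (the TTL value here is arbitrary; by default spark.cleaner.ttl is unset, i.e. the cleaner is off):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CleanerTtlSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("CleanerTtlSketch")
      .set("spark.cleaner.ttl", "3600")  // clean metadata older than one hour (in seconds)
    val sc = new SparkContext(conf)
    sc.parallelize(1 to 10).count()
    sc.stop()
  }
}
```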
-
Patrick Wendell authored
Fix a type error in comment lines.
-
Patrick Wendell authored
-
Patrick Wendell authored
Add i2 instance types to Spark EC2. Using data from http://aws.amazon.com/amazon-linux-ami/instance-type-matrix/ and http://www.ec2instances.info/
-
Matei Zaharia authored
-
Patrick Wendell authored
API for automatic driver recovery for streaming programs and other bug fixes.
1. Added Scala and Java APIs for automatically loading a checkpoint if one exists in the provided checkpoint directory. Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext. Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)` returns a JavaStreamingContext. See the RecoverableNetworkWordCount example for how to use it.
2. Refactored the streaming.Checkpoint*** code to fix bugs and make DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files, and it ensures that spark.driver.* and spark.hostPort are cleared from SparkConf before being written to a checkpoint.
3. Fixed a bug in the cleanup of checkpointed RDDs created by DStreams. Specifically, this fix ensures that a checkpointed RDD's files are not prematurely cleaned up, thus ensuring reliable recovery.
4. Upgraded TimeStampedHashMap to optionally update the timestamp on map.get(key). This allows clearing data based on access time (i.e., clearing records that were last accessed before a threshold timestamp).
5. Added caching of file modification times in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds if there are thousands of files. This cache is cleared automatically.
This PR is not entirely final, as I may make some minor additions: a Java example, and adding StreamingContext.getOrCreate to a unit test. Edit: Java example to be added later; unit test added.
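A minimal sketch of the Scala getOrCreate pattern described in item 1 above (the checkpoint directory, master, and batch interval are made up for illustration):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object GetOrCreateSketch {
  def createContext(): StreamingContext = {
    val conf = new SparkConf().setMaster("local[2]").setAppName("Recoverable")
    val ssc = new StreamingContext(conf, Seconds(1))
    ssc.checkpoint("/tmp/checkpoint")  // hypothetical checkpoint directory
    // ... define input DStreams and output operations here ...
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Load the checkpoint if it exists; otherwise build a fresh context.
    val ssc = StreamingContext.getOrCreate("/tmp/checkpoint", createContext _)
    ssc.start()
  }
}
```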
-
Patrick Wendell authored
External Sorting for Aggregator and CoGroupedRDDs (Revisited) (This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving) The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted. The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order. Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
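Since this entry describes the spill-to-disk mechanism in prose, here is a purely illustrative Scala sketch of the idea (class name, threshold, and the in-memory stand-in for spill files are all invented; Spark's real ExternalAppendOnlyMap tracks estimated memory usage and merges sorted streams read back from disk):

```scala
import scala.collection.mutable

// Combine values in memory; once the buffer grows past a threshold, sort it by key
// and set it aside as a "spill". The final result merges spills with the live map.
class TinySpillableMap[K: Ordering, V](maxInMemory: Int, mergeValue: (V, V) => V) {
  private val current = mutable.Map.empty[K, V]
  private val spills = mutable.ArrayBuffer.empty[Seq[(K, V)]]  // stands in for on-disk files

  def insert(key: K, value: V): Unit = {
    current(key) = current.get(key).map(mergeValue(_, value)).getOrElse(value)
    if (current.size > maxInMemory) {
      spills += current.toSeq.sortBy(_._1)  // the real map writes sorted runs to disk
      current.clear()
    }
  }

  // Combine every spilled run with the in-memory map, merging values per key.
  def result(): Map[K, V] =
    (spills.flatten ++ current).groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(mergeValue)
    }
}
```

For example, `new TinySpillableMap[String, Int](1000, _ + _)` behaves like a word-count combiner that starts spilling after 1000 distinct keys are buffered.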
-
Tathagata Das authored
Conflicts: streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
-
Andrew Or authored
-
Tathagata Das authored
-
Reynold Xin authored
SPARK-961 Add a Vector.random() method Added method and testcases
-
Andrew Or authored
Aside from trivial formatting changes, use nulls instead of Options for DiskMapIterator, and add documentation for spark.shuffle.externalSorting and spark.shuffle.memoryFraction. Also, set spark.shuffle.memoryFraction to 0.3 and spark.storage.memoryFraction to 0.6.
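A minimal sketch of the settings named above on a SparkConf (property names and values are taken from this commit message; later releases may rename them):

```scala
import org.apache.spark.SparkConf

object ShuffleSpillConfSketch {
  val conf = new SparkConf()
    .set("spark.shuffle.externalSorting", "true")  // allow shuffle maps to spill to disk
    .set("spark.shuffle.memoryFraction", "0.3")    // heap fraction for in-memory shuffle maps
    .set("spark.storage.memoryFraction", "0.6")    // heap fraction for cached RDD blocks
}
```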
-
RongGu authored
-
Thomas Graves authored
Yarn client addJar and misc fixes. Fix the addJar functionality in yarn-client mode, add support for the other options supported in yarn-standalone mode, set the application type on YARN in Hadoop 2.x, add documentation, and change the heartbeat interval to use the same code as yarn-standalone so it doesn't take so long to get containers and exit.
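A minimal sketch of the addJar call whose yarn-client behavior this fixes (the jar path is hypothetical, and a local master stands in for the actual yarn-client deployment):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object AddJarSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("AddJarSketch")
      .setMaster("local[2]")                 // in practice this would be the yarn-client master
    val sc = new SparkContext(conf)
    sc.addJar("/path/to/extra-library.jar")  // hypothetical jar to ship to executors
    sc.stop()
  }
}
```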
-
Patrick Wendell authored
Make DEBUG-level logs consumable. Removes two things that caused issues with the debug logs: (a) Internal polling in the DAGScheduler was polluting the logs. (b) The Scala REPL logs were really noisy.
-
Shivaram Venkataraman authored
-
Tathagata Das authored
-
Patrick Wendell authored
Removes two things that caused issues with the debug logs: (a) Internal polling in the DAGScheduler was polluting the logs. (b) The Scala REPL logs were really noisy.
-
Tathagata Das authored
-
Tathagata Das authored
-
Tathagata Das authored
-