  1. Feb 26, 2016
  2. Feb 25, 2016
• [SPARK-12757] Add block-level read/write locks to BlockManager · 633d63a4
      Josh Rosen authored
      ## Motivation
      
      As a pre-requisite to off-heap caching of blocks, we need a mechanism to prevent pages / blocks from being evicted while they are being read. With on-heap objects, evicting a block while it is being read merely leads to memory-accounting problems (because we assume that an evicted block is a candidate for garbage-collection, which will not be true during a read), but with off-heap memory this will lead to either data corruption or segmentation faults.
      
      ## Changes
      
      ### BlockInfoManager and reader/writer locks
      
      This patch adds block-level read/write locks to the BlockManager. It introduces a new `BlockInfoManager` component, which is contained within the `BlockManager`, holds the `BlockInfo` objects that the `BlockManager` uses for tracking block metadata, and exposes APIs for locking blocks in either shared read or exclusive write modes.
      
`BlockManager`'s `get*()` and `put*()` methods now implicitly acquire the necessary locks. After a `get()` call successfully retrieves a block, that block is locked in a shared read mode. A `put()` call will block until it acquires an exclusive write lock. If the write succeeds, the write lock will be downgraded to a shared read lock before returning to the caller. This `put()` locking behavior allows us to store a block and then immediately turn around and read it without having to worry about it having been evicted between the write and the read, which will allow us to significantly simplify `CacheManager` in the future (see #10748).
      
      See `BlockInfoManagerSuite`'s test cases for a more detailed specification of the locking semantics.
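
The put-then-downgrade protocol can be pictured with a toy lock. This is a sketch of the intended semantics only, not the patch's actual `BlockInfoManager` code:

```scala
// Toy reader/writer lock: writes are exclusive, reads are shared, and a
// finished write downgrades to a read lock so the writer can immediately read.
import scala.collection.mutable

class ToyBlockLock {
  private var writerTask: Option[Long] = None        // task attempt id holding the write lock
  private val readerTasks = mutable.Map[Long, Int]() // task attempt id -> read lock count

  def lockForWriting(taskId: Long): Unit = synchronized {
    while (writerTask.isDefined || readerTasks.nonEmpty) wait() // block until exclusive
    writerTask = Some(taskId)
  }

  def lockForReading(taskId: Long): Unit = synchronized {
    while (writerTask.isDefined) wait() // shared: many readers, no writer
    readerTasks(taskId) = readerTasks.getOrElse(taskId, 0) + 1
  }

  def downgradeLock(taskId: Long): Unit = synchronized {
    require(writerTask.contains(taskId), "caller must hold the write lock")
    writerTask = None
    readerTasks(taskId) = readerTasks.getOrElse(taskId, 0) + 1 // keep shared read access
    notifyAll()
  }

  def unlock(taskId: Long): Unit = synchronized {
    if (writerTask.contains(taskId)) writerTask = None
    else readerTasks.get(taskId).foreach { n =>
      if (n == 1) readerTasks.remove(taskId) else readerTasks(taskId) = n - 1
    }
    notifyAll()
  }
}
```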
      
      ### Auto-release of locks at the end of tasks
      
Our locking APIs support explicit release of locks (by calling `unlock()`), but it's not always possible to guarantee that locks will be released prior to the end of the task. One reason for this is our iterator interface: since our iterators don't support an explicit `close()` operation to signal that no more records will be consumed, operations like `take()` or `limit()` don't have a good means to release locks on their input iterators' blocks. Another example is broadcast variables, whose block locks can only be released at the end of the task.
      
      To address this, `BlockInfoManager` uses a pair of maps to track the set of locks acquired by each task. Lock acquisitions automatically record the current task attempt id by obtaining it from `TaskContext`. When a task finishes, code in `Executor` calls `BlockInfoManager.unlockAllLocksForTask(taskAttemptId)` to free locks.
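
As an illustration, the per-task bookkeeping might look like the following toy sketch (illustrative names; the real implementation lives in `BlockInfoManager`):

```scala
import scala.collection.mutable

class ToyLockRegistry {
  // task attempt id -> block ids whose read/write locks that task holds
  private val readLocksByTask = mutable.Map[Long, mutable.Buffer[String]]()
  private val writeLocksByTask = mutable.Map[Long, mutable.Buffer[String]]()

  def registerReadLock(taskAttemptId: Long, blockId: String): Unit = synchronized {
    readLocksByTask.getOrElseUpdate(taskAttemptId, mutable.Buffer[String]()) += blockId
  }

  def registerWriteLock(taskAttemptId: Long, blockId: String): Unit = synchronized {
    writeLocksByTask.getOrElseUpdate(taskAttemptId, mutable.Buffer[String]()) += blockId
  }

  // Called when a task finishes: everything the task still holds is released
  // in one sweep, covering the take()/limit() and broadcast cases above.
  def unlockAllLocksForTask(taskAttemptId: Long): Seq[String] = synchronized {
    (readLocksByTask.remove(taskAttemptId).getOrElse(mutable.Buffer[String]()) ++
      writeLocksByTask.remove(taskAttemptId).getOrElse(mutable.Buffer[String]())).toSeq
  }
}
```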
      
      ### Locking and the MemoryStore
      
      In order to prevent in-memory blocks from being evicted while they are being read, the `MemoryStore`'s `evictBlocksToFreeSpace()` method acquires write locks on blocks which it is considering as candidates for eviction. These lock acquisitions are non-blocking, so a block which is being read will not be evicted. By holding write locks until the eviction is performed or skipped (in case evicting the blocks would not free enough memory), we avoid a race where a new reader starts to read a block after the block has been marked as an eviction candidate but before it has been removed.
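
As a sketch of that selection logic (toy types; the real code operates on `BlockInfo` entries):

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock

// Collect eviction candidates until enough space would be freed; a block
// whose write lock cannot be taken immediately is being read, so skip it.
def selectEvictionCandidates(
    blocks: Seq[(String, Long, ReentrantReadWriteLock)], // (blockId, size, lock)
    spaceNeeded: Long): Seq[String] = {
  var freed = 0L
  val selected = Seq.newBuilder[String]
  for ((id, size, lock) <- blocks if freed < spaceNeeded) {
    // tryLock never blocks: it fails if any reader or writer holds the lock.
    if (lock.writeLock().tryLock()) {
      selected += id
      freed += size
    }
  }
  // The write locks stay held until eviction is performed or skipped,
  // closing the race with readers arriving after candidate selection.
  selected.result()
}
```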
      
      ### Locking and remote block transfer
      
This patch makes small changes to block transfer and network layer code so that locks acquired by the BlockTransferService are released as soon as block transfer messages are consumed and released by Netty. This builds on top of #11193, a bug fix related to freeing of network layer ManagedBuffers.
      
      ## FAQ
      
      - **Why not use Java's built-in [`ReadWriteLock`](https://docs.oracle.com/javase/7/docs/api/java/util/concurrent/locks/ReadWriteLock.html)?**
      
        Our locks operate on a per-task rather than per-thread level. Under certain circumstances a task may consist of multiple threads, so using `ReadWriteLock` would mean that we might call `unlock()` from a thread which didn't hold the lock in question, an operation which has undefined semantics. If we could rely on Java 8 classes, we might be able to use [`StampedLock`](https://docs.oracle.com/javase/8/docs/api/java/util/concurrent/locks/StampedLock.html) to work around this issue.
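
For example, with `ReentrantReadWriteLock` (one concrete `ReadWriteLock` implementation), a cross-thread release is rejected outright:

```scala
import java.util.concurrent.locks.ReentrantReadWriteLock

val rwLock = new ReentrantReadWriteLock()
rwLock.readLock().lock() // acquired on this thread

val other = new Thread(new Runnable {
  // A task's work may continue on a different thread, but that thread
  // cannot release the lock: this throws IllegalMonitorStateException.
  override def run(): Unit = rwLock.readLock().unlock()
})
other.start()
other.join()
```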
      
- **Why not detect "leaked" locks in tests?**
      
See above notes about `take()` and `limit()`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10705 from JoshRosen/pin-pages.
  3. Feb 22, 2016
• [MINOR][DOCS] Fix all typos in markdown files of `doc` and similar patterns in other comments · 024482bf
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
This PR fixes all typos in the markdown files under the `docs` module,
and fixes similar typos in other comments, too.
      
## How was this patch tested?

Manual tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11300 from dongjoon-hyun/minor_fix_typos.
• [SPARK-13399][STREAMING] Fix CheckpointSuite type erasure warnings · 1b144455
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
Change how CheckpointSuite retrieves the output streams so that the generic type is explicitly marked as unchecked, avoiding the warnings. This only impacts test code.
      
Alternatively, we could encode the type tag in TestOutputStreamWithPartitions and filter on the type tag as well, but this is unnecessary since multiple TestOutputStreams are not registered and the previous code was not actually checking this type.
      
## How was this patch tested?

Unit tests (`streaming/testOnly org.apache.spark.streaming.CheckpointSuite`).
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #11286 from holdenk/SPARK-13399-checkpointsuite-type-erasure.
• [SPARK-13186][STREAMING] migrate away from SynchronizedMap · 8f35d3ea
      Huaxin Gao authored
The trait `SynchronizedMap` in package `mutable` is deprecated ("Synchronization via traits is deprecated as it is inherently unreliable"). Change to `java.util.concurrent.ConcurrentHashMap` instead.
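
A before/after sketch of the migration (an illustrative map, not a specific call site from this patch):

```scala
import java.util.concurrent.ConcurrentHashMap

// Before (deprecated):
//   new mutable.HashMap[Int, String] with mutable.SynchronizedMap[Int, String]
// After: a juc ConcurrentHashMap, thread-safe by construction.
val tracked = new ConcurrentHashMap[Int, String]()
tracked.put(1, "block-1")
val value = Option(tracked.get(1)) // get returns null when the key is absent
```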
      
      Author: Huaxin Gao <huaxing@us.ibm.com>
      
      Closes #11250 from huaxingao/spark__13186.
  4. Feb 21, 2016
  5. Feb 19, 2016
  6. Feb 16, 2016
  7. Feb 13, 2016
  8. Feb 11, 2016
• [STREAMING][TEST] Fix flaky streaming.FailureSuite · 219a74a7
      Tathagata Das authored
In some corner cases, the test suite failed to shut down the SparkContext, causing cascading failures. This fix does two things (a minimal cleanup sketch follows the list):
      - Makes sure no SparkContext is active after every test
      - Makes sure StreamingContext is always shutdown (prevents leaking of StreamingContexts as well, just in case)
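
A minimal sketch (assumed harness shape, not the suite's actual code) of the per-test cleanup this enforces:

```scala
import org.apache.spark.streaming.StreamingContext

def shutdownActiveContexts(): Unit = {
  // Stop any active StreamingContext, also stopping its SparkContext so
  // nothing leaks into the next test.
  StreamingContext.getActive().foreach(_.stop(stopSparkContext = true))
}
```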
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #11166 from tdas/fix-failuresuite.
  9. Feb 09, 2016
• [SPARK-13170][STREAMING] Investigate replacing SynchronizedQueue as it is deprecated · 68ed3632
      Sean Owen authored
Replace SynchronizedQueue with synchronized access to a plain Queue.
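
A sketch of the replacement pattern (illustrative wrapper class):

```scala
import scala.collection.mutable

// Guard a plain Queue with explicit synchronization instead of mixing in
// the deprecated SynchronizedQueue trait.
class GuardedQueue[T] {
  private val queue = new mutable.Queue[T]()

  def enqueue(x: T): Unit = queue.synchronized { queue += x }
  def dequeueAll(): Seq[T] = queue.synchronized { queue.dequeueAll(_ => true) }
}
```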
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11111 from srowen/SPARK-13170.
• [SPARK-13165][STREAMING] Replace deprecated synchronizedBuffer in streaming · 159198ef
      Holden Karau authored
Building with Scala 2.11 produces the warning: "trait SynchronizedBuffer in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentLinkedQueue as an alternative." We already use ConcurrentLinkedQueue elsewhere, so let's replace it.
      
      Some notes about how behaviour is different for reviewers:
The Seq implicitly converted from a SynchronizedBuffer would continue to receive updates; however, when we do the same conversion explicitly on the ConcurrentLinkedQueue, this isn't the case. Hence some of the (internal and test) APIs were changed to pass an Iterable. `toSeq` is safe to use if there are no more updates.
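
A small sketch of the snapshot behaviour described above:

```scala
import java.util.concurrent.ConcurrentLinkedQueue
import scala.collection.JavaConverters._

val queue = new ConcurrentLinkedQueue[Int]()
queue.add(1)

val snapshot = queue.asScala.toList // strict copy of the current contents
queue.add(2)                        // not reflected in `snapshot`

assert(snapshot == List(1))
assert(queue.asScala.toList == List(1, 2))
```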
      
      Author: Holden Karau <holden@us.ibm.com>
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #11067 from holdenk/SPARK-13165-replace-deprecated-synchronizedBuffer-in-streaming.
  10. Feb 04, 2016
  11. Feb 03, 2016
  12. Feb 02, 2016
  13. Feb 01, 2016
• [SPARK-6847][CORE][STREAMING] Fix stack overflow issue when updateStateByKey... · 6075573a
      Shixiong Zhu authored
      [SPARK-6847][CORE][STREAMING] Fix stack overflow issue when updateStateByKey is followed by a checkpointed dstream
      
Add a local property that indicates whether to checkpoint all RDDs that are marked with the checkpoint flag, and enable it in Streaming.
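
A sketch of how such a property is used (the property name below is assumed from this patch; treat it as something to verify against the code):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("demo").setMaster("local[2]"))
sc.setCheckpointDir("/tmp/checkpoints")

// When set, every RDD in the lineage that was marked for checkpointing is
// checkpointed, not just the last one, which breaks the recursive chain.
sc.setLocalProperty("spark.checkpoint.checkpointAllMarkedAncestors", "true")
```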
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10934 from zsxwing/recursive-checkpoint.
  14. Jan 30, 2016
• [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version · 289373b2
      Josh Rosen authored
      This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
      
      The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
      
      After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10608 from JoshRosen/SPARK-6363.
  15. Jan 26, 2016
• [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is... · 649e9d0f
      Sean Owen authored
      [SPARK-3369][CORE][STREAMING] Java mapPartitions Iterator->Iterable is inconsistent with Scala's Iterator->Iterator
      
      Fix Java function API methods for flatMap and mapPartitions to require producing only an Iterator, not Iterable. Also fix DStream.flatMap to require a function producing TraversableOnce only, not Traversable.
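
A sketch (simplified, assumed signatures) of the consistency this enforces:

```scala
import scala.collection.JavaConverters._

// Scala API: Iterator[T] => Iterator[U]
def scalaStyle(it: Iterator[Int]): Iterator[String] = it.map(_.toString)

// Java API after this change: java.util.Iterator in, java.util.Iterator out
// (previously the Java functions returned an Iterable).
def javaStyle(it: java.util.Iterator[Int]): java.util.Iterator[String] =
  it.asScala.map(_.toString).asJava
```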
      
      CC rxin pwendell for API change; tdas since it also touches streaming.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #10413 from srowen/SPARK-3369.
  16. Jan 23, 2016
  17. Jan 22, 2016
• [SPARK-7997][CORE] Remove Akka from Spark Core and Streaming · bc1babd6
      Shixiong Zhu authored
      - Remove Akka dependency from core. Note: the streaming-akka project still uses Akka.
      - Remove HttpFileServer
      - Remove Akka configs from SparkConf and SSLOptions
- Rename `spark.akka.frameSize` to `spark.rpc.message.maxSize`. It is still worth keeping this config because the choice between `DirectTaskResult` and `IndirectTaskResult` depends on it (see the sketch after this list).
      - Update comments and docs
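
A minimal sketch of the renamed setting (value in MB, as with the old `spark.akka.frameSize`):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("demo")
  .set("spark.rpc.message.maxSize", "128") // formerly spark.akka.frameSize
```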
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10854 from zsxwing/remove-akka.
  18. Jan 20, 2016
• [SPARK-7799][SPARK-12786][STREAMING] Add "streaming-akka" project · b7d74a60
      Shixiong Zhu authored
      Include the following changes:
      
      1. Add "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream
      2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream"
      3. Update the ActorWordCount example and add the JavaActorWordCount example
      4. Make "streaming-zeromq" depend on "streaming-akka" and update the codes accordingly
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10744 from zsxwing/streaming-akka-2.
• [SPARK-12847][CORE][STREAMING] Remove StreamingListenerBus and post all... · 944fdadf
      Shixiong Zhu authored
      [SPARK-12847][CORE][STREAMING] Remove StreamingListenerBus and post all Streaming events to the same thread as Spark events
      
      Including the following changes:
      
1. Add StreamingListenerForwardingBus, which wraps Streaming events in WrappedStreamingListenerEvent and forwards events received in `onOtherEvent` to the StreamingListener
      2. Remove StreamingListenerBus
      3. Merge AsynchronousListenerBus and LiveListenerBus to the same class LiveListenerBus
      4. Add `logEvent` method to SparkListenerEvent so that EventLoggingListener can use it to ignore WrappedStreamingListenerEvents
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10779 from zsxwing/streaming-listener.
  19. Jan 18, 2016
• [SPARK-10985][CORE] Avoid passing evicted blocks throughout BlockManager · b8cb548a
      Josh Rosen authored
      This patch refactors portions of the BlockManager and CacheManager in order to avoid having to pass `evictedBlocks` lists throughout the code. It appears that these lists were only consumed by `TaskContext.taskMetrics`, so the new code now directly updates the metrics from the lower-level BlockManager methods.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10776 from JoshRosen/SPARK-10985.
  20. Jan 12, 2016
  21. Jan 11, 2016
  22. Jan 10, 2016
• [SPARK-3873][BUILD] Enable import ordering error checking. · 6439a825
      Marcelo Vanzin authored
      Turn import ordering violations into build errors, plus a few adjustments
      to account for how the checker behaves. I'm a little on the fence about
      whether the existing code is right, but it's easier to appease the checker
      than to discuss what's the more correct order here.
      
Plus a few fixes to imports that crept in since my recent cleanups.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #10612 from vanzin/SPARK-3873-enable.
  23. Jan 08, 2016
  24. Jan 07, 2016
  25. Jan 06, 2016
• [SPARK-7689] Remove TTL-based metadata cleaning in Spark 2.0 · 8e19c766
      Josh Rosen authored
      This PR removes `spark.cleaner.ttl` and the associated TTL-based metadata cleaning code.
      
Now that we have the `ContextCleaner` and a timer to trigger periodic GCs, I don't think that `spark.cleaner.ttl` is necessary anymore. The TTL-based cleaning isn't enabled by default, isn't included in our end-to-end tests, and has been a source of user confusion when it is misconfigured. If the TTL is set too low, data which is still being used may be evicted or deleted, leading to hard-to-diagnose bugs.
      
For all of these reasons, I think that we should remove this functionality in Spark 2.0. Additional benefits of doing this include marginally reduced memory usage, since we no longer need to store timestamps in hashmaps, and a handful fewer threads.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10534 from JoshRosen/remove-ttl-based-cleaning.
• [SPARK-12604][CORE] Java count(ApproxDistinct)ByKey methods return Scala Long not Java · ac56cf60
      Sean Owen authored
Change the Java countByKey and countApproxDistinctByKey return types to use Java Long, not Scala Long; update similar methods for consistency on java.lang.Long.valueOf, with no API change.
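
A sketch of the fixed return type (simplified; `JavaPairRDD.countByKey` assumed per this change):

```scala
import org.apache.spark.api.java.JavaPairRDD

// Counts from the Java API are now boxed as java.lang.Long rather than
// leaking scala.Long into Java code.
def keyCounts(pairs: JavaPairRDD[String, String]): java.util.Map[String, java.lang.Long] =
  pairs.countByKey()
```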
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #10554 from srowen/SPARK-12604.
• Revert "[SPARK-12672][STREAMING][UI] Use the uiRoot function instead of... · cbaea959
      Shixiong Zhu authored
      Revert "[SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root path to gain the streaming batch url."
      
      This reverts commit 19e4e9fe. Will merge #10618 instead.
• [SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root... · 19e4e9fe
      huangzhaowei authored
      [SPARK-12672][STREAMING][UI] Use the uiRoot function instead of default root path to gain the streaming batch url.
      
      Author: huangzhaowei <carlmartinmax@gmail.com>
      
      Closes #10617 from SaintBacchus/SPARK-12672.
  26. Jan 05, 2016