[SPARK-17491] Close serialization stream to fix wrong answer bug in putIteratorAsBytes()
## What changes were proposed in this pull request?

`MemoryStore.putIteratorAsBytes()` may silently lose values when used with `KryoSerializer` because it does not properly close the serialization stream before attempting to deserialize the already-serialized values, which may cause values buffered in Kryo's internal buffers to not be read.

This is the root cause behind a user-reported "wrong answer" bug in PySpark caching reported by bennoleslie on the Spark user mailing list in a thread titled "pyspark persist MEMORY_ONLY vs MEMORY_AND_DISK". Due to Spark 2.0's automatic use of KryoSerializer for "safe" types (such as byte arrays, primitives, etc.) this misuse of serializers manifested itself as silent data corruption rather than a StreamCorrupted error (which you might get from JavaSerializer).

The minimal fix, implemented here, is to close the serialization stream before attempting to deserialize written values. In addition, this patch adds several additional assertions / precondition checks to prevent misuse of `PartiallySerializedBlock` and `ChunkedByteBufferOutputStream`.

## How was this patch tested?

The original bug was masked by an invalid assert in the memory store test cases: the old assert compared two results record-by-record with `zip` but didn't first check that the lengths of the two collections were equal, causing missing records to go unnoticed. The updated test case reproduced this bug.

In addition, I added a new `PartiallySerializedBlockSuite` to unit test that component.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15043 from JoshRosen/partially-serialized-block-values-iterator-bugfix.
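
The two failure modes described above can be illustrated with a small standalone sketch. It is not the code from this patch: it uses plain `java.io` streams as a stand-in for Kryo's internal buffering, and the object and method names (`CloseBeforeReadSketch`, `visibleBytes`, `zipOnlyCheck`, `lengthAwareCheck`) are hypothetical, chosen only for illustration. The first method shows that bytes held in a buffering stream are not visible in the underlying sink until the stream is closed; the other two show why a `zip`-only comparison can miss dropped records.

```scala
import java.io.{BufferedOutputStream, ByteArrayOutputStream, DataOutputStream}

// Hypothetical illustration only; none of these names come from Spark.
object CloseBeforeReadSketch {

  /** Returns the number of bytes visible in the underlying sink
    * before and after the buffering stream is closed. */
  def visibleBytes(values: Seq[Int]): (Int, Int) = {
    val sink = new ByteArrayOutputStream()
    // The 8 KiB buffer stands in for a serializer's internal buffer (an assumption).
    val out = new DataOutputStream(new BufferedOutputStream(sink, 8192))
    values.foreach(v => out.writeInt(v))
    val beforeClose = sink.size() // small writes are still sitting in the buffer
    out.close()                   // closing flushes the buffer into the sink
    val afterClose = sink.size()
    (beforeClose, afterClose)
  }

  /** Compares element by element only; zip truncates to the shorter
    * collection, so missing trailing records go unnoticed. */
  def zipOnlyCheck[T](expected: Seq[T], actual: Seq[T]): Boolean =
    expected.zip(actual).forall { case (e, a) => e == a }

  /** Checks the lengths first, then compares element by element. */
  def lengthAwareCheck[T](expected: Seq[T], actual: Seq[T]): Boolean =
    expected.length == actual.length && zipOnlyCheck(expected, actual)
}
```

Reading from the sink before `close()` in this sketch would see none of the three small values written, which mirrors the kind of silent loss described above. Likewise, `zipOnlyCheck(Seq(1, 2, 3), Seq(1, 2))` passes even though a record is missing, while `lengthAwareCheck` rejects it.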
Changed files:
- core/src/main/scala/org/apache/spark/scheduler/Task.scala (1 addition, 0 deletions)
- core/src/main/scala/org/apache/spark/storage/memory/MemoryStore.scala (66 additions, 23 deletions)
- core/src/main/scala/org/apache/spark/util/ByteBufferOutputStream.scala (26 additions, 1 deletion)
- core/src/main/scala/org/apache/spark/util/io/ChunkedByteBufferOutputStream.scala (11 additions, 1 deletion)
- core/src/test/scala/org/apache/spark/storage/MemoryStoreSuite.scala (16 additions, 18 deletions)
- core/src/test/scala/org/apache/spark/storage/PartiallySerializedBlockSuite.scala (215 additions, 0 deletions)
- core/src/test/scala/org/apache/spark/storage/PartiallyUnrolledIteratorSuite.scala (1 addition, 1 deletion)
- core/src/test/scala/org/apache/spark/util/io/ChunkedByteBufferOutputStreamSuite.scala (8 additions, 0 deletions)