[SPARK-1371][WIP] Compression support for Spark SQL in-memory columnar storage
JIRA issue: [SPARK-1373](https://issues.apache.org/jira/browse/SPARK-1373)

(Although tagged as WIP, this PR is structurally complete. The only things left unimplemented are 3 more compression algorithms: `BooleanBitSet`, `IntDelta` and `LongDelta`, which are trivial to add later in this or a separate PR.)

This PR contains compression support for Spark SQL in-memory columnar storage. Main interfaces include:

* `CompressionScheme`

  Each `CompressionScheme` represents a concrete compression algorithm, which consists of an `Encoder` for compression and a `Decoder` for decompression. Algorithms implemented include:

  * `RunLengthEncoding`
  * `DictionaryEncoding`

  Algorithms to be implemented include:

  * `BooleanBitSet`
  * `IntDelta`
  * `LongDelta`

* `CompressibleColumnBuilder`

  A stackable `ColumnBuilder` trait used to build byte buffers for compressible columns. For each column, the `CompressionScheme` with the lowest compression ratio (compressed size relative to uncompressed size) is chosen according to statistical information gathered while elements are appended to the `ColumnBuilder`. However, if no `CompressionScheme` can achieve a compression ratio better than 80%, the column is left uncompressed to save CPU time.

  The memory layout of the final byte buffer is shown below:

  ```
   .--------------------------- Column type ID (4 bytes)
   |   .----------------------- Null count N (4 bytes)
   |   |   .------------------- Null positions (4 x N bytes, empty if null count is zero)
   |   |   |     .------------- Compression scheme ID (4 bytes)
   |   |   |     |   .--------- Compressed non-null elements
   V   V   V     V   V
  +---+---+-----+---+---------+
  |   |   | ... |   | ... ... |
  +---+---+-----+---+---------+
   \-----------/ \-----------/
      header         body
  ```

* `CompressibleColumnAccessor`

  A stackable `ColumnAccessor` trait used to iterate over a (possibly) compressed data column.

* `ColumnStats`

  Used to collect statistical information while loading data into an in-memory columnar table. Optimizations like partition pruning rely on this information.

  Strictly speaking, `ColumnStats`-related code is not part of the compression support. It is included in this PR to exercise and validate the row-based API design (which is used to avoid boxing/unboxing cost whenever possible).

A major refactoring change since PR #205 is:

* Refactored all getter/setter methods for primitive types in various places into `ColumnType` classes to remove duplicated code.

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #285 from liancheng/memColumnarCompression and squashes the following commits:

ed71bbd [Cheng Lian] Addressed all PR comments by @marmbrus
d3a4fa9 [Cheng Lian] Removed Ordering[T] in ColumnStats for better performance
5034453 [Cheng Lian] Bug fix, more tests, and more refactoring
c298b76 [Cheng Lian] Test suites refactored
2780d6a [Cheng Lian] [WIP] in-memory columnar compression support
211331c [Cheng Lian] WIP: in-memory columnar compression support
85cc59b [Cheng Lian] Refactored ColumnAccessors & ColumnBuilders to remove duplicate code
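
To make the scheme selection rule and the byte buffer layout from the description above more concrete, here is a minimal, self-contained sketch. It is not the code from this PR: `ColumnLayoutSketch`, `Candidate`, `chooseScheme`, `build`, and the pass-through scheme ID of 0 are all hypothetical names and values chosen for illustration, and the strict "below 0.8" comparison is an assumption about how "better than 80%" is evaluated.

```scala
import java.nio.ByteBuffer

object ColumnLayoutSketch {
  // Hypothetical stand-in for per-scheme statistics: a scheme ID plus the size
  // that scheme would produce for the column. Not the PR's CompressionScheme API.
  final case class Candidate(schemeId: Int, compressedSize: Int)

  val PassThroughId = 0     // assumed ID meaning "store the column uncompressed"
  val RatioThreshold = 0.8  // from the description: no compression at 80% or worse

  // Pick the candidate with the lowest compressed/uncompressed ratio,
  // falling back to pass-through when none beats the threshold.
  def chooseScheme(uncompressedSize: Int, candidates: Seq[Candidate]): Int = {
    val viable = candidates.filter { c =>
      uncompressedSize > 0 &&
        c.compressedSize.toDouble / uncompressedSize < RatioThreshold
    }
    if (viable.isEmpty) PassThroughId else viable.minBy(_.compressedSize).schemeId
  }

  // Write header and body in the order given by the layout diagram:
  // column type ID, null count, null positions, compression scheme ID, body.
  def build(
      columnTypeId: Int,
      nullPositions: Seq[Int],
      schemeId: Int,
      body: Array[Byte]): ByteBuffer = {
    val size = 4 + 4 + 4 * nullPositions.length + 4 + body.length
    val buffer = ByteBuffer.allocate(size)
    buffer.putInt(columnTypeId)
    buffer.putInt(nullPositions.length)
    nullPositions.foreach(p => buffer.putInt(p))
    buffer.putInt(schemeId)
    buffer.put(body)
    buffer.rewind() // reset position so the caller can read from the start
    buffer
  }
}
```

For example, `ColumnLayoutSketch.build(3, Seq(1, 4), 2, Array[Byte](7, 8))` yields a 22-byte buffer: 16 bytes of fixed-width header fields and null positions, 4 bytes of scheme ID, and 2 bytes of body. Keeping the null positions in the header is what lets an accessor skip nulls before the decoder ever sees the body.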
Changed files:
- sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnAccessor.scala (24 additions, 79 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnBuilder.scala (44 additions, 81 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnStats.scala (360 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala (83 additions, 4 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/columnar/InMemoryColumnarTableScan.scala (2 additions, 5 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/columnar/NullableColumnAccessor.scala (1 addition, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/columnar/NullableColumnBuilder.scala (13 additions, 16 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/columnar/compression/CompressibleColumnAccessor.scala (36 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/columnar/compression/CompressibleColumnBuilder.scala (95 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/columnar/compression/CompressionScheme.scala (94 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/columnar/compression/compressionSchemes.scala (288 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnStatsSuite.scala (61 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnTypeSuite.scala (89 additions, 127 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnarQuerySuite.scala (2 additions, 2 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/columnar/ColumnarTestUtils.scala (100 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/columnar/NullableColumnAccessorSuite.scala (31 additions, 12 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/columnar/NullableColumnBuilderSuite.scala (35 additions, 26 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/columnar/compression/DictionaryEncodingSuite.scala (113 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/columnar/compression/RunLengthEncodingSuite.scala (130 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/columnar/compression/TestCompressibleColumnBuilder.scala (43 additions, 0 deletions)