Commit d6a52176 authored by Sameer Agarwal's avatar Sameer Agarwal Committed by Cheng Lian

[SPARK-16668][TEST] Test parquet reader for row groups containing both dictionary and plain encoded pages

## What changes were proposed in this pull request?

This patch adds an explicit test for [SPARK-14217] by setting the parquet dictionary and page sizes such that the generated parquet file spans 3 pages within a single row group, where the first page is dictionary encoded and the remaining two are plain encoded.
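As an aside for reviewers, one way to double-check the resulting page layout is to read the footer of the generated file with parquet-mr and inspect the column chunk's encoding set. This is only a sketch, not part of the patch: the file path is a placeholder, and it assumes the `ParquetFileReader.readFooter(Configuration, Path)` API available in this era of parquet-hadoop.

```scala
import scala.collection.JavaConverters._

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.parquet.column.Encoding
import org.apache.parquet.hadoop.ParquetFileReader

// Placeholder path: point this at the single part file written by the test.
val footer = ParquetFileReader.readFooter(
  new Configuration(), new Path("/tmp/spark-16668/part-00000.parquet"))

// One row group with a single column chunk is expected. If the writer fell
// back from dictionary to plain encoding mid-chunk, both encodings appear
// in the chunk's metadata (PLAIN_DICTIONARY assumes the v1 page format).
val encodings =
  footer.getBlocks.asScala.head.getColumns.asScala.head.getEncodings.asScala
assert(encodings.contains(Encoding.PLAIN_DICTIONARY))
assert(encodings.contains(Encoding.PLAIN))
```

With the 2048-byte dictionary page cap above, the dictionary for 512 distinct strings overflows partway through the chunk, so parquet-mr falls back to plain encoding for the remaining pages, which is exactly the mixed layout the test exercises.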

## How was this patch tested?

1. ParquetEncodingSuite
2. Also manually verified that this test fails without https://github.com/apache/spark/pull/12279.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #14304 from sameeragarwal/hybrid-encoding-test.
parent 64529b18
@@ -16,6 +16,10 @@
 */
package org.apache.spark.sql.execution.datasources.parquet

import scala.collection.JavaConverters._

import org.apache.parquet.hadoop.ParquetOutputFormat

import org.apache.spark.sql.test.SharedSQLContext

// TODO: this needs a lot more testing but it's currently not easy to test with the parquet
@@ -78,4 +82,29 @@ class ParquetEncodingSuite extends ParquetCompatibilityTest with SharedSQLContext
      }}
    }
  }

test("Read row group containing both dictionary and plain encoded pages") {
withSQLConf(ParquetOutputFormat.DICTIONARY_PAGE_SIZE -> "2048",
ParquetOutputFormat.PAGE_SIZE -> "4096") {
withTempPath { dir =>
// In order to explicitly test for SPARK-14217, we set the parquet dictionary and page size
// such that the following data spans across 3 pages (within a single row group) where the
// first page is dictionary encoded and the remaining two are plain encoded.
val data = (0 until 512).flatMap(i => Seq.fill(3)(i.toString))
data.toDF("f").coalesce(1).write.parquet(dir.getCanonicalPath)
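        // coalesce(1) above wrote a single data file; grab it for the low-level reader.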
        val file = SpecificParquetRecordReaderBase.listDirectory(dir).asScala.head

        val reader = new VectorizedParquetRecordReader
        reader.initialize(file, null /* set columns to null to project all columns */)
        val column = reader.resultBatch().column(0)
        assert(reader.nextBatch())
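        // Each of the 512 values was written 3 times; all three copies must decode
        // to the same string whether they landed in the dictionary page or a plain page.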
        (0 until 512).foreach { i =>
          assert(column.getUTF8String(3 * i).toString == i.toString)
          assert(column.getUTF8String(3 * i + 1).toString == i.toString)
          assert(column.getUTF8String(3 * i + 2).toString == i.toString)
        }
        reader.close()
      }
    }
  }
}