[SPARK-9340] [SQL] Fixes converting unannotated Parquet lists
This PR is inspired by #8063 authored by dguy. In particular, the Parquet test files added here are all taken from that PR.

**The committer who merges this PR should attribute it to "Damian Guy <damian.guy@gmail.com>".**

----

SPARK-6776 and SPARK-6777 followed `parquet-avro` in implementing the backwards-compatibility rules defined in the `parquet-format` spec. However, both Spark SQL and `parquet-avro` neglected the following statement in `parquet-format`:

> This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field.

One consequence is that Parquet files generated by `parquet-protobuf` that contain unannotated repeated fields are not correctly converted to Catalyst arrays.

This PR fixes the issue by:

1. Handling unannotated repeated fields in `CatalystSchemaConverter`.
2. Converting these special repeated fields to Catalyst arrays in `CatalystRowConverter`.

   Two special converters, `RepeatedPrimitiveConverter` and `RepeatedGroupConverter`, are added. They delegate the actual conversion work to a child `elementConverter` and accumulate elements in an `ArrayBuffer`.

   Two extra methods, `start()` and `end()`, are added to `ParentContainerUpdater` so that they can be used to initialize new `ArrayBuffer`s for unannotated repeated fields and to propagate converted array values upstream.

Author: Cheng Lian <lian@databricks.com>

Closes #8070 from liancheng/spark-9340/unannotated-parquet-list and squashes the following commits:

ace6df7 [Cheng Lian] Moves ParquetProtobufCompatibilitySuite
f1c7bfd [Cheng Lian] Updates .rat-excludes
420ad2b [Cheng Lian] Fixes converting unannotated Parquet lists
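The `start()`/`end()` accumulation described in the commit message can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: the `ParentContainerUpdater` trait below is a simplified stand-in that only borrows the name, and `RepeatedFieldUpdater`, `UnannotatedListSketch`, and `readRecord` are hypothetical names introduced for illustration. The sketch assumes a reader that calls `start()` once per record, `set()` once per repeated value, and `end()` once when the record is complete.

```scala
import scala.collection.mutable.ArrayBuffer

// Simplified stand-in for the updater interface named in the commit message.
// This is NOT the real Spark trait; only the start()/end()/set() shape is kept.
trait ParentContainerUpdater {
  def start(): Unit = ()           // called before the values of a record arrive
  def end(): Unit = ()             // called after the last value of the record
  def set(value: Any): Unit = ()   // called once per converted value
}

// Hypothetical updater for one unannotated repeated field: it collects every
// value seen between start() and end() and hands the parent a single array.
final class RepeatedFieldUpdater(parent: ParentContainerUpdater)
    extends ParentContainerUpdater {

  private var buffer: ArrayBuffer[Any] = ArrayBuffer.empty[Any]

  // A fresh buffer per record, so values never leak between records.
  override def start(): Unit = { buffer = ArrayBuffer.empty[Any] }

  override def set(value: Any): Unit = { buffer += value }

  // Propagate the accumulated elements upstream as one immutable sequence.
  override def end(): Unit = parent.set(buffer.toVector)
}

object UnannotatedListSketch {
  // Simulates reading one record whose repeated field holds `values`.
  def readRecord(values: Seq[Any]): Any = {
    var result: Any = null
    val rowUpdater = new ParentContainerUpdater {
      override def set(value: Any): Unit = { result = value }
    }
    val updater = new RepeatedFieldUpdater(rowUpdater)
    updater.start()
    values.foreach(updater.set)
    updater.end()
    result
  }

  def main(args: Array[String]): Unit = {
    println(readRecord(Seq(1, 2, 3))) // prints Vector(1, 2, 3)
    println(readRecord(Seq.empty))    // prints Vector()
  }
}
```

In this sketch a record with no values for the repeated field still yields an empty array rather than a null, which matches the "required list of required elements" reading quoted above.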
Showing
- .rat-excludes: 1 addition, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala: 120 additions, 31 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala: 5 additions, 2 deletions
- sql/core/src/test/resources/nested-array-struct.parquet: 0 additions, 0 deletions
- sql/core/src/test/resources/old-repeated-int.parquet: 0 additions, 0 deletions
- sql/core/src/test/resources/old-repeated-message.parquet: 0 additions, 0 deletions
- sql/core/src/test/resources/old-repeated.parquet: 0 additions, 0 deletions
- sql/core/src/test/resources/parquet-thrift-compat.snappy.parquet: 0 additions, 0 deletions
- sql/core/src/test/resources/proto-repeated-string.parquet: 0 additions, 0 deletions
- sql/core/src/test/resources/proto-repeated-struct.parquet: 0 additions, 0 deletions
- sql/core/src/test/resources/proto-struct-with-array-many.parquet: 0 additions, 0 deletions
- sql/core/src/test/resources/proto-struct-with-array.parquet: 0 additions, 0 deletions
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetProtobufCompatibilitySuite.scala: 91 additions, 0 deletions
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetSchemaSuite.scala: 30 additions, 0 deletions