  • 071bbad5
    [SPARK-9340] [SQL] Fixes converting unannotated Parquet lists · 071bbad5
    Damian Guy authored
    This PR is inspired by #8063 authored by dguy. In particular, the Parquet test files added here are all taken from that PR.
    
    **The committer who merges this PR should attribute it to "Damian Guy <damian.guy@gmail.com>".**
    
    ----
    
    SPARK-6776 and SPARK-6777 followed `parquet-avro` in implementing the backwards-compatibility rules defined in the `parquet-format` spec. However, both Spark SQL and `parquet-avro` overlooked the following statement in `parquet-format`:
    
    > This does not affect repeated fields that are not annotated: A repeated field that is neither contained by a `LIST`- or `MAP`-annotated group nor annotated by `LIST` or `MAP` should be interpreted as a required list of required elements where the element type is the type of the field.
    
    One consequence is that Parquet files generated by `parquet-protobuf` that contain unannotated repeated fields are not correctly converted to Catalyst arrays.
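
    The quoted rule can be sketched as follows. This is only an illustration, not the code in this PR: every type here (`ParquetField`, `CatalystType`, `ArrayOf`, and so on) is a hypothetical, simplified stand-in for the real Parquet and Catalyst classes, and the sketch assumes the field is neither annotated with `LIST` or `MAP` nor contained in a group that is.

    ```scala
    object UnannotatedSchemaSketch {
      // Hypothetical stand-ins for Parquet repetition levels.
      sealed trait Repetition
      case object Required extends Repetition
      case object Optional extends Repetition
      case object Repeated extends Repetition

      // Hypothetical stand-ins for unannotated Parquet fields.
      sealed trait ParquetField { def name: String; def repetition: Repetition }
      case class Primitive(name: String, repetition: Repetition, typeName: String)
        extends ParquetField
      case class Group(name: String, repetition: Repetition, fields: Seq[ParquetField])
        extends ParquetField

      // Hypothetical stand-ins for Catalyst types.
      sealed trait CatalystType
      case class Atomic(typeName: String) extends CatalystType
      case class ArrayOf(element: CatalystType, containsNull: Boolean) extends CatalystType
      case class Struct(fields: Seq[StructField]) extends CatalystType
      case class StructField(name: String, dataType: CatalystType, nullable: Boolean)

      // The element type is the type of the field itself, ignoring its repetition.
      private def elementType(field: ParquetField): CatalystType = field match {
        case Primitive(_, _, typeName) => Atomic(typeName)
        case Group(_, _, fields)       => Struct(fields.map(convertField))
      }

      def convertField(field: ParquetField): StructField = field.repetition match {
        // Unannotated repeated field: a required list of required elements.
        case Repeated =>
          StructField(field.name, ArrayOf(elementType(field), containsNull = false), nullable = false)
        case Optional =>
          StructField(field.name, elementType(field), nullable = true)
        case Required =>
          StructField(field.name, elementType(field), nullable = false)
      }
    }
    ```

    Under these assumptions, `convertField(Primitive("values", Repeated, "int32"))` yields a non-nullable field whose type is an array of non-null `int32` elements.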
    
    This PR fixes the issue by:

    1. Handling unannotated repeated fields in `CatalystSchemaConverter`.
    2. Converting these special repeated fields to Catalyst arrays in `CatalystRowConverter`.

       Two special converters, `RepeatedPrimitiveConverter` and `RepeatedGroupConverter`, are added. They delegate the actual conversion work to a child `elementConverter` and accumulate elements in an `ArrayBuffer`.

       Two extra methods, `start()` and `end()`, are added to `ParentContainerUpdater` so that they can be used to initialize a new `ArrayBuffer` for each unannotated repeated field and to propagate the converted array value upstream.
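
    The interplay between the converter, its child `elementConverter`, and the `start()`/`end()` hooks can be sketched roughly as below. This is a simplified illustration and not the code added by this PR: the `ElementConverter` class, the `addInt` and `readRecord` helpers, and the use of `List` as the array representation are assumptions made to keep the example self-contained and independent of the Parquet and Spark APIs.

    ```scala
    import scala.collection.mutable.ArrayBuffer

    object RepeatedConverterSketch {
      // Simplified updater: start() and end() default to no-ops.
      trait ParentContainerUpdater {
        def start(): Unit = ()
        def end(): Unit = ()
        def set(value: Any): Unit
      }

      // Hypothetical stand-in for a Parquet primitive converter.
      class ElementConverter(updater: ParentContainerUpdater) {
        def addInt(value: Int): Unit = updater.set(value)
      }

      class RepeatedPrimitiveConverter(parentUpdater: ParentContainerUpdater) {
        private var currentArray: ArrayBuffer[Any] = ArrayBuffer.empty[Any]

        val updater: ParentContainerUpdater = new ParentContainerUpdater {
          // Begin a fresh buffer for the record being read.
          override def start(): Unit = { currentArray = ArrayBuffer.empty[Any] }
          // Hand the accumulated elements to the parent as one array value.
          override def end(): Unit = { parentUpdater.set(currentArray.toList) }
          // Called once per repeated value by the element converter.
          override def set(value: Any): Unit = { currentArray += value; () }
        }

        // Actual conversion work is delegated to the child converter.
        private val elementConverter = new ElementConverter(updater)

        def addInt(value: Int): Unit = elementConverter.addInt(value)
      }

      // Simulates reading one record whose repeated field holds `values`.
      def readRecord(values: Seq[Int]): Any = {
        var result: Any = null
        val rowUpdater = new ParentContainerUpdater {
          override def set(value: Any): Unit = { result = value }
        }
        val converter = new RepeatedPrimitiveConverter(rowUpdater)
        converter.updater.start()          // start of the enclosing record
        values.foreach(converter.addInt)   // zero or more repeated values
        converter.updater.end()            // end of the enclosing record
        result
      }
    }
    ```

    In this sketch a record with no values for the repeated field still produces an empty list rather than a missing value, which matches the "required list" reading of the spec.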
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8070 from liancheng/spark-9340/unannotated-parquet-list and squashes the following commits:
    
    ace6df7 [Cheng Lian] Moves ParquetProtobufCompatibilitySuite
    f1c7bfd [Cheng Lian] Updates .rat-excludes
    420ad2b [Cheng Lian] Fixes converting unannotated Parquet lists