    4ffc27ca
    [SPARK-6123] [SPARK-6775] [SPARK-6776] [SQL] Refactors Parquet read path for interoperability and backwards-compatibility
    
    This PR is a follow-up of #6617 and is part of [SPARK-6774] [2], which aims to ensure interoperability and backwards-compatibility for Spark SQL Parquet support. It fixes the read path: Spark SQL is now expected to be able to read legacy Parquet data files generated by most (if not all) common libraries/tools, such as parquet-thrift, parquet-avro, and parquet-hive. The write path still needs to be refactored to write standard Parquet LISTs and MAPs ([SPARK-8848] [4]).
    
    ### Major changes
    
    1. `CatalystConverter` class hierarchy refactoring
    
       - Replaces `CatalystConverter` trait with a much simpler `ParentContainerUpdater`.
    
         Instead of extending the original `CatalystConverter` trait, every converter class now accepts an updater, which is responsible for propagating the converted value to some parent container: for example, appending array elements to a parent array buffer, appending key-value pairs to a parent mutable map, or setting a converted value to a specific field of a parent row. The root converter doesn't have a parent and thus uses a `NoopUpdater`.
    
         This simplifies the design since converters don't need to care about details of their parent converters anymore.
    
       - Unifies `CatalystRootConverter`, `CatalystGroupConverter` and `CatalystPrimitiveRowConverter` into `CatalystRowConverter`
    
         Specifically, now all row objects are represented by `SpecificMutableRow` during conversion.
    
       - Refactors `CatalystArrayConverter`, and removes `CatalystArrayContainsNullConverter` and `CatalystNativeArrayConverter`
    
         `CatalystNativeArrayConverter` was probably designed with the intention of avoiding boxing costs. However, the way it uses Scala generics actually doesn't achieve this goal.
    
         The new `CatalystArrayConverter` handles both nullable and non-nullable array elements in a consistent way.
    
       - Implements backwards-compatibility rules in `CatalystArrayConverter`
    
         By the time Parquet records are being converted, the schema of the Parquet file should already have been verified, so we only need to care about the structure rather than the field names in the Parquet schema. Since all map objects represented in legacy systems have the same structure as the standard one (see [backwards-compatibility rules for MAP] [1]), we only need to deal with LIST (namely array) in `CatalystArrayConverter`.
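
       The updater design described in the first bullet above can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: apart from the names `ParentContainerUpdater` and `NoopUpdater`, every class, method, and signature below (`set`, `ArrayElementUpdater`, `RowFieldUpdater`, `IntConverter`, `UpdaterDemo`) is a hypothetical stand-in, and plain Scala collections replace the real row and array containers.

       ```scala
       import scala.collection.mutable

       // Hypothetical, simplified stand-in for the updater described above.
       trait ParentContainerUpdater {
         def set(value: Any): Unit
       }

       // The root converter has no parent, so its updater does nothing.
       object NoopUpdater extends ParentContainerUpdater {
         override def set(value: Any): Unit = ()
       }

       // Appends each converted element to a parent array buffer.
       final class ArrayElementUpdater(buffer: mutable.ArrayBuffer[Any])
           extends ParentContainerUpdater {
         override def set(value: Any): Unit = buffer += value
       }

       // Writes a converted value into a fixed ordinal of a parent row
       // (an Array[Any] stands in for a mutable row here).
       final class RowFieldUpdater(row: Array[Any], ordinal: Int)
           extends ParentContainerUpdater {
         override def set(value: Any): Unit = row(ordinal) = value
       }

       // A toy converter: it knows only its updater, not its parent converter.
       final class IntConverter(updater: ParentContainerUpdater) {
         def addInt(value: Int): Unit = updater.set(value)
       }

       object UpdaterDemo {
         def run(): (Seq[Any], Seq[Any]) = {
           val elements = mutable.ArrayBuffer.empty[Any]
           val row = new Array[Any](2)

           // The same converter class is reused with different parents.
           val elementConverter = new IntConverter(new ArrayElementUpdater(elements))
           val fieldConverter = new IntConverter(new RowFieldUpdater(row, 1))

           elementConverter.addInt(1)
           elementConverter.addInt(2)
           fieldConverter.addInt(42)

           (elements.toSeq, row.toSeq)
         }
       }
       ```

       In this sketch the converter never inspects its parent; swapping the updater is enough to redirect where the converted value ends up.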
    
    2. Requested columns handling
    
       When specifying requested columns in `RowReadSupport`, we used to use a Parquet `MessageType` converted from a Catalyst `StructType` that contains all requested columns. This is not preferable when taking compatibility and interoperability into consideration, because the actual Parquet file may have a different physical structure from the converted schema.
    
       In this PR, the schema for requested columns is constructed using the following method:
    
       - For a column that exists in the target Parquet file, we extract the column type by name from the full file schema, and construct a single-field `MessageType` for that column.
       - For a column that doesn't exist in the target Parquet file, we create a single-field `StructType` and convert it to a `MessageType` using `CatalystSchemaConverter`.
       - Finally, we union all single-field `MessageType`s into a full schema containing all requested fields.
    
       With this change, we also fix [SPARK-6123] [3] by validating the global schema against each individual Parquet part-file.
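
       The three steps above can be sketched as follows. This is only an illustration under simplifying assumptions: `Field`, `Schema`, `RequestedSchema`, `convert`, and `assemble` are hypothetical names, types are modelled as plain strings, and the real code operates on Parquet `MessageType` and Catalyst `StructType` instead.

       ```scala
       // Hypothetical stand-ins for a Parquet field and a Parquet schema.
       final case class Field(name: String, physicalType: String)
       final case class Schema(fields: Seq[Field])

       object RequestedSchema {
         // Stand-in for the schema converter: derives a field purely from
         // the Catalyst-side type when the file has no such column.
         def convert(name: String, catalystType: String): Field =
           Field(name, s"converted($catalystType)")

         // `requested` holds (column name, Catalyst type) pairs.
         def assemble(fileSchema: Schema, requested: Seq[(String, String)]): Schema = {
           val byName = fileSchema.fields.map(f => f.name -> f).toMap

           // One single-field schema per requested column: taken from the
           // file schema when the column exists, converted otherwise.
           val singleFieldSchemas = requested.map { case (name, catalystType) =>
             Schema(Seq(byName.getOrElse(name, convert(name, catalystType))))
           }

           // Union all single-field schemas into the full requested schema.
           Schema(singleFieldSchemas.flatMap(_.fields))
         }
       }
       ```

       For a file schema containing `a` and `b`, requesting `a` and a missing column `c` keeps the file's own physical type for `a` and falls back to the converted type only for `c`.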
    
    ### Testing
    
    This PR also adds compatibility tests for parquet-avro, parquet-thrift, and parquet-hive. Please refer to `README.md` under `sql/core/src/test` for more information about these tests. To avoid build-time code generation and the extra complexity it would add to the build system, the Java code generated from the testing Thrift schema and Avro IDL is also checked in.
    
    [1]: https://github.com/apache/incubator-parquet-format/blob/master/LogicalTypes.md#backward-compatibility-rules-1
    [2]: https://issues.apache.org/jira/browse/SPARK-6774
    [3]: https://issues.apache.org/jira/browse/SPARK-6123
    [4]: https://issues.apache.org/jira/browse/SPARK-8848
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #7231 from liancheng/spark-6776 and squashes the following commits:
    
    360fe18 [Cheng Lian] Adds ParquetHiveCompatibilitySuite
    c6fbc06 [Cheng Lian] Removes WIP file committed by mistake
    b8c1295 [Cheng Lian] Excludes the whole parquet package from MiMa
    598c3e8 [Cheng Lian] Adds extra Maven repo for hadoop-lzo, which is a transitive dependency of parquet-thrift
    926af87 [Cheng Lian] Simplifies Parquet compatibility test suites
    7946ee1 [Cheng Lian] Fixes Scala styling issues
    3d7ab36 [Cheng Lian] Fixes .rat-excludes
    a8f13bb [Cheng Lian] Using Parquet writer API to do compatibility tests
    f2208cd [Cheng Lian] Adds README.md for Thrift/Avro code generation
    1d390aa [Cheng Lian] Adds parquet-thrift compatibility test
    440f7b3 [Cheng Lian] Adds generated files to .rat-excludes
    13b9121 [Cheng Lian] Adds ParquetAvroCompatibilitySuite
    06cfe9d [Cheng Lian] Adds comments about TimestampType handling
    a099d3e [Cheng Lian] More comments
    0cc1b37 [Cheng Lian] Fixes MiMa checks
    884d3e6 [Cheng Lian] Fixes styling issue and reverts unnecessary changes
    802cbd7 [Cheng Lian] Fixes bugs related to schema merging and empty requested columns
    38fe1e7 [Cheng Lian] Adds explicit return type
    7fb21f1 [Cheng Lian] Reverts an unnecessary debugging change
    1781dff [Cheng Lian] Adds test case for SPARK-8811
    6437d4b [Cheng Lian] Assembles requested schema from Parquet file schema
    bcac49f [Cheng Lian] Removes the 16-byte restriction of decimals
    a74fb2c [Cheng Lian] More comments
    0525346 [Cheng Lian] Removes old Parquet record converters
    03c3bd9 [Cheng Lian] Refactors Parquet read path to implement backwards-compatibility rules