SPARK-1293 [SQL] Parquet support for nested types
It should be possible to import and export data stored in Parquet's columnar format that contains nested types. For example:

```java
message AddressBook {
  required binary owner;
  optional group ownerPhoneNumbers {
    repeated binary array;
  }
  optional group contacts {
    repeated group array {
      required binary name;
      optional binary phoneNumber;
    }
  }
  optional group nameToApartmentNumber {
    repeated group map {
      required binary key;
      required int32 value;
    }
  }
}
```

The example models a type (AddressBook) whose records are made of strings (owner), lists (ownerPhoneNumbers), and a table of contacts (e.g., a list of pairs, or a map whose values may be null but whose keys must not be null).

The tasks are as follows:

<h6>Implement support for converting nested Parquet types to Spark/Catalyst types:</h6>

- [x] Structs
- [x] Lists
- [x] Maps (note: currently keys need to be Strings)

<h6>Implement import (via ``parquetFile``) of nested Parquet types (first version in this PR)</h6>

- [x] Initial version

<h6>Implement export (via ``saveAsParquetFile``)</h6>

- [x] Initial version

<h6>Test support for AvroParquet, etc.</h6>

- [x] Initial testing of import of avro-generated Parquet data (simple + nested)

Example:

```scala
val data = TestSQLContext
  .parquetFile("input.dir")
  .toSchemaRDD
data.registerAsTable("data")
sql("SELECT owner, contacts[1].name, nameToApartmentNumber['John'] FROM data").collect()
```

Author: Andre Schumacher <andre.schumacher@iki.fi>
Author: Michael Armbrust <michael@databricks.com>

Closes #360 from AndreSchumacher/nested_parquet and squashes the following commits:

30708c8 [Andre Schumacher] Taking out AvroParquet test for now to remove Avro dependency
95c1367 [Andre Schumacher] Changes to ParquetRelation and its metadata
7eceb67 [Andre Schumacher] Review feedback
94eea3a [Andre Schumacher] Scalastyle
403061f [Andre Schumacher] Fixing some issues with tests and schema metadata
b8a8b9a [Andre Schumacher] More fixes to short and byte conversion
63d1b57 [Andre Schumacher] Cleaning up and Scalastyle
88e6bdb [Andre Schumacher] Attempting to fix loss of schema
37e0a0a [Andre Schumacher] Cleaning up
14c3fd8 [Andre Schumacher] Attempting to fix Spark-Parquet schema conversion
3e1456c [Michael Armbrust] WIP: Directly serialize catalyst attributes.
f7aeba3 [Michael Armbrust] [SPARK-1982] Support for ByteType and ShortType.
3104886 [Michael Armbrust] Nested Rows should be Rows, not Seqs.
3c6b25f [Andre Schumacher] Trying to reduce no-op changes wrt master
31465d6 [Andre Schumacher] Scalastyle: fixing commented out bottom
de02538 [Andre Schumacher] Cleaning up ParquetTestData
2f5a805 [Andre Schumacher] Removing stripMargin from test schemas
191bc0d [Andre Schumacher] Changing to Seq for ArrayType, refactoring SQLParser for nested field extension
cbb5793 [Andre Schumacher] Code review feedback
32229c7 [Andre Schumacher] Removing Row nested values and placing by generic types
0ae9376 [Andre Schumacher] Doc strings and simplifying ParquetConverter.scala
a6b4f05 [Andre Schumacher] Cleaning up ArrayConverter, moving classTag to NativeType, adding NativeRow
431f00f [Andre Schumacher] Fixing problems introduced during rebase
c52ff2c [Andre Schumacher] Adding native-array converter
619c397 [Andre Schumacher] Completing Map testcase
79d81d5 [Andre Schumacher] Replacing field names for array and map in WriteSupport
f466ff0 [Andre Schumacher] Added ParquetAvro tests and revised Array conversion
adc1258 [Andre Schumacher] Optimizing imports
e99cc51 [Andre Schumacher] Fixing nested WriteSupport and adding tests
1dc5ac9 [Andre Schumacher] First version of WriteSupport for nested types
d1911dc [Andre Schumacher] Simplifying ArrayType conversion
f777b4b [Andre Schumacher] Scalastyle
824500c [Andre Schumacher] Adding attribute resolution for MapType
b539fde [Andre Schumacher] First commit for MapType
a594aed [Andre Schumacher] Scalastyle
4e25fcb [Andre Schumacher] Adding resolution of complex ArrayTypes
f8f8911 [Andre Schumacher] For primitive rows fall back to more efficient converter, code reorg
6dbc9b7 [Andre Schumacher] Fixing some problems intruduced during rebase
b7fcc35 [Andre Schumacher] Documenting conversions, bugfix, wrappers of Rows
ee70125 [Andre Schumacher] fixing one problem with arrayconverter
98219cf [Andre Schumacher] added struct converter
5d80461 [Andre Schumacher] fixing one problem with nested structs and breaking up files
1b1b3d6 [Andre Schumacher] Fixing one problem with nested arrays
ddb40d2 [Andre Schumacher] Extending tests for nested Parquet data
745a42b [Andre Schumacher] Completing testcase for nested data (Addressbook(
6125c75 [Andre Schumacher] First working nested Parquet record input
4d4892a [Andre Schumacher] First commit nested Parquet read converters
aa688fe [Andre Schumacher] Adding conversion of nested Parquet schemas
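
To make the shape of the AddressBook schema more concrete, here is a minimal sketch in plain Scala that does not depend on Spark or Parquet. The case classes `Contact` and `AddressBook`, the object `AddressBookSketch`, and the sample values are all hypothetical and are not part of this change. They only mirror the struct, list, and map nesting of the schema, with `Option` standing in for `optional` fields. The two lookups correspond roughly to `contacts[1].name` and `nameToApartmentNumber['John']` in the example query, assuming array indexing is zero-based.

```scala
// Hypothetical in-memory model of the AddressBook schema (illustration only;
// none of these names exist in Spark or Parquet).
final case class Contact(name: String, phoneNumber: Option[String])

final case class AddressBook(
    owner: String,                                   // required binary owner
    ownerPhoneNumbers: Option[Seq[String]],          // optional group holding a list
    contacts: Option[Seq[Contact]],                  // optional group holding a list of structs
    nameToApartmentNumber: Option[Map[String, Int]]  // optional group holding a map
)

object AddressBookSketch {
  // Made-up sample record.
  val record: AddressBook = AddressBook(
    owner = "Alice",
    ownerPhoneNumbers = Some(Seq("555 123 4567", "555 666 1337")),
    contacts = Some(Seq(
      Contact("Bob", Some("555 987 6543")),
      Contact("John", None)
    )),
    nameToApartmentNumber = Some(Map("Bob" -> 7, "John" -> 12))
  )

  // Roughly contacts[1].name, assuming zero-based indexing.
  // `lift` returns None instead of throwing when the index is out of range.
  val secondContactName: Option[String] =
    record.contacts.flatMap(_.lift(1)).map(_.name)

  // Roughly nameToApartmentNumber['John']; a missing key yields None.
  val johnsApartment: Option[Int] =
    record.nameToApartmentNumber.flatMap(_.get("John"))

  def main(args: Array[String]): Unit = {
    println(secondContactName)
    println(johnsApartment)
  }
}
```

In this sketch, absent optional groups and missing elements surface as `None`; how the actual converters represent such cases is not described in the commit message, so that part is an assumption of the illustration.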
Changed files:
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/SqlParser.scala: 56 additions, 55 deletions
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala: 2 additions, 0 deletions
- sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/types/dataTypes.scala: 91 additions, 7 deletions
- sql/core/src/main/scala/org/apache/spark/sql/SQLContext.scala: 1 addition, 1 deletion
- sql/core/src/main/scala/org/apache/spark/sql/api/java/JavaSQLContext.scala: 3 additions, 1 deletion
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkStrategies.scala: 2 additions, 1 deletion
- sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetConverter.scala: 667 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetRelation.scala: 13 additions, 169 deletions
- sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableOperations.scala: 17 additions, 8 deletions
- sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTableSupport.scala: 216 additions, 110 deletions
- sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTestData.scala: 274 additions, 24 deletions
- sql/core/src/main/scala/org/apache/spark/sql/parquet/ParquetTypes.scala: 408 additions, 0 deletions
- sql/core/src/test/scala/org/apache/spark/sql/parquet/ParquetQuerySuite.scala: 349 additions, 7 deletions
- sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala: 3 additions, 1 deletion