Commit 4ad5153f authored by Liang-Chi Hsieh, committed by Cheng Lian

[SPARK-6037][SQL] Avoiding duplicate Parquet schema merging

`FilteringParquetRowInputFormat` manually merges Parquet schemas before computing splits. However, this merge is redundant because the schemas are already merged in `ParquetRelation2`, so we don't need to re-merge them in the `InputFormat`.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #4786 from viirya/dup_parquet_schemas_merge and squashes the following commits:

ef78a5a [Liang-Chi Hsieh] Avoiding duplicate Parquet schema merging.
parent 18f20984
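
The redundancy described in the commit message can be illustrated with a small, self-contained sketch. It does not use Spark: `Schema` here is a hypothetical stand-in (field name to type name) and `merge` is a simplified union, not Spark's `StructType.merge`. Under those assumptions, it shows that merging schemas a second time after they have already been merged yields the same result.

```scala
// Hypothetical stand-in for a Spark SQL schema: field name -> type name.
// This is NOT Spark's StructType; it only illustrates the idea.
object SchemaMergeSketch {
  type Schema = Map[String, String]

  // Merge two schemas: union of fields, failing on conflicting types.
  def merge(a: Schema, b: Schema): Schema =
    b.foldLeft(a) { case (acc, (name, tpe)) =>
      acc.get(name) match {
        case Some(existing) if existing != tpe =>
          throw new IllegalArgumentException(
            s"Conflicting types for field $name: $existing vs $tpe")
        case _ => acc.updated(name, tpe)
      }
    }

  def main(args: Array[String]): Unit = {
    // Made-up per-file schemas, as if read from three Parquet footers.
    val perFileSchemas: Seq[Schema] = Seq(
      Map("id" -> "int"),
      Map("id" -> "int", "name" -> "string"),
      Map("score" -> "double"))

    // First merge, done once up front.
    val merged = perFileSchemas.reduce((a, b) => merge(a, b))

    // Merging again over the same inputs plus the merged result changes nothing.
    val remerged = (perFileSchemas :+ merged).reduce((a, b) => merge(a, b))
    assert(remerged == merged)
    println(merged)
  }
}
```

Because the second pass cannot add information, repeating it only adds work, which is the motivation the commit message gives for removing it.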
@@ -434,22 +434,13 @@ private[parquet] class FilteringParquetRowInputFormat
       return splits
     }
 
-    Option(globalMetaData.getKeyValueMetaData.get(RowReadSupport.SPARK_METADATA_KEY)).foreach {
-      schemas =>
-        val mergedSchema = schemas
-          .map(DataType.fromJson(_).asInstanceOf[StructType])
-          .reduce(_ merge _)
-          .json
-
-        val mergedMetadata = globalMetaData
-          .getKeyValueMetaData
-          .updated(RowReadSupport.SPARK_METADATA_KEY, setAsJavaSet(Set(mergedSchema)))
-
-        globalMetaData = new GlobalMetaData(
-          globalMetaData.getSchema,
-          mergedMetadata,
-          globalMetaData.getCreatedBy)
-    }
+    val metadata = configuration.get(RowWriteSupport.SPARK_ROW_SCHEMA)
+    val mergedMetadata = globalMetaData
+      .getKeyValueMetaData
+      .updated(RowReadSupport.SPARK_METADATA_KEY, setAsJavaSet(Set(metadata)))
+
+    globalMetaData = new GlobalMetaData(globalMetaData.getSchema,
+      mergedMetadata, globalMetaData.getCreatedBy)
 
     val readContext = getReadSupport(configuration).init(
       new InitContext(configuration,
......
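
As a rough illustration of what the added lines do, the sketch below replaces the set of per-file schema strings stored under one metadata key with a single, already-merged value. It is a minimal sketch using only `java.util` collections. The key name, the placeholder schema strings, and the helper `withSingleValue` are all hypothetical and are not part of Spark or Parquet.

```scala
import java.util.{Collections, HashMap => JHashMap, HashSet => JHashSet}
import java.util.{Map => JMap, Set => JSet}

object MetadataOverrideSketch {
  // Hypothetical key for this sketch; the diff uses RowReadSupport.SPARK_METADATA_KEY.
  val SchemaKey = "example.schema.key"

  // Return a copy of `keyValues` in which `key` maps to exactly one value.
  // The input map is left untouched.
  def withSingleValue(
      keyValues: JMap[String, JSet[String]],
      key: String,
      value: String): JMap[String, JSet[String]] = {
    val copy = new JHashMap[String, JSet[String]](keyValues)
    copy.put(key, Collections.singleton(value))
    copy
  }

  def main(args: Array[String]): Unit = {
    // Placeholder strings standing in for two different per-file schemas.
    val schemas = new JHashSet[String]()
    schemas.add("schema-from-file-1")
    schemas.add("schema-from-file-2")

    val perFile = new JHashMap[String, JSet[String]]()
    perFile.put(SchemaKey, schemas)

    // Stand-in for the schema that was merged once, earlier in the pipeline.
    val alreadyMerged = "schema-merged-upstream"
    val updated = withSingleValue(perFile, SchemaKey, alreadyMerged)

    assert(updated.get(SchemaKey).size() == 1)
    assert(updated.get(SchemaKey).contains(alreadyMerged))
    assert(perFile.get(SchemaKey).size() == 2)
    println(updated)
  }
}
```

The sketch copies the map instead of mutating it, which mirrors how the diff builds a new `mergedMetadata` value and then constructs a fresh `GlobalMetaData` from it.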