[SPARK-16371][SQL] Do not push down filters incorrectly when inner name and outer name are the same in Parquet

## What changes were proposed in this pull request?

Currently, if there is a schema as below:

```
root
 |-- _1: struct (nullable = true)
 |    |-- _1: integer (nullable = true)
```

and if we execute the code below:

```scala
df.filter("_1 IS NOT NULL").count()
```

this pushes down a filter although the filter is being applied to a `StructType` (if my understanding is correct, Spark does not push down filters for those). The reason is that `ParquetFilters.getFieldMap` produces the entries below:

```
(_1,StructType(StructField(_1,IntegerType,true)))
(_1,IntegerType)
```

which then collapse into a `Map`:

```
(_1,IntegerType)
```

Now, because of `....lift(dataTypeOf(name)).map(_(name, value))`, a filter is pushed down for `_1`, which Parquet thinks is `IntegerType` although it is actually `StructType`. As a result, Parquet's filter2 API produces incorrect results; for example, the code below:

```
df.filter("_1 IS NOT NULL").count()
```

always produces 0. This PR prevents this by not including nested fields in the field map.

## How was this patch tested?

Unit test in `ParquetFilterSuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14067 from HyukjinKwon/SPARK-16371.
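For context, a minimal standalone reproduction of the scenario described above might look like the sketch below. The object name, session setup, and output path are illustrative only and are not part of the PR:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical reproduction sketch for SPARK-16371.
object Spark16371Repro {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("SPARK-16371 repro")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Builds the schema from the description above:
    // root
    //  |-- _1: struct (nullable = true)
    //  |    |-- _1: integer (nullable = true)
    val df = Seq(Tuple1(Tuple1(1)), Tuple1(Tuple1(2))).toDF()

    val path = "/tmp/spark-16371-repro"  // illustrative path
    df.write.mode("overwrite").parquet(path)

    // Before the fix, the IS NOT NULL filter on the outer struct column `_1`
    // could be pushed down as if `_1` were the inner IntegerType field,
    // yielding 0 instead of the expected 2.
    val count = spark.read.parquet(path).filter("_1 IS NOT NULL").count()
    println(s"count = $count")

    spark.stop()
  }
}
```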
Showing 3 changed files:
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala (1 addition, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala (4 additions, 1 deletion; the idea is sketched below)
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilterSuite.scala (14 additions, 0 deletions)
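For illustration only, here is a minimal sketch of the idea behind the `ParquetFilters.scala` change, assuming the fix simply stops flattening nested struct fields into the lookup map used for pushdown. The helper names below are hypothetical and are not the actual methods in the file:

```scala
import org.apache.spark.sql.types.{DataType, StructType}

// Sketch only: build the name -> type map from top-level fields, without
// recursing into nested structs, so a filter on a struct column `_1` can no
// longer be mistaken for a filter on its inner `_1: integer` field.
object NestedFieldPushdownSketch {
  def topLevelFieldMap(schema: StructType): Map[String, DataType] =
    schema.fields.map(f => f.name -> f.dataType).toMap

  // A source filter is only translated into a Parquet predicate when its
  // column resolves to a supported (non-struct) top-level type in this map.
  def isPushable(fieldMap: Map[String, DataType], column: String): Boolean =
    fieldMap.get(column).exists(dt => !dt.isInstanceOf[StructType])
}
```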