    [SPARK-9407] [SQL] Relaxes Parquet ValidTypeMap to allow ENUM predicates to be pushed down · 3ecb3794
    Cheng Lian authored
    This PR adds a hacky workaround for PARQUET-201, and should be removed once we upgrade to parquet-mr 1.8.1 or higher versions.
    
    In Parquet, not all column types can be used for filter push-down optimization.  The set of valid column types is controlled by `ValidTypeMap`.  Unfortunately, in parquet-mr 1.7.0 and prior versions, this restriction is too strict and doesn't allow predicates on `BINARY (ENUM)` columns to be pushed down, even though `BINARY (ENUM)` is commonly seen in Parquet files written by libraries like `parquet-avro`.
    
    This restriction is problematic for Spark SQL, because Spark SQL doesn't have a type that maps to Parquet `BINARY (ENUM)` directly, and always converts `BINARY (ENUM)` to Catalyst `StringType`.  Thus, a predicate over a `BINARY (ENUM)` column is recognized as one over a string field and gets pushed down by the query optimizer.  Such predicates are actually perfectly legal, except that they fail the `ValidTypeMap` check on the parquet-mr side.
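
    For concreteness, here is a minimal spark-shell sketch of the scenario described above (the file path and the `suit` column name are made up for illustration):

    ```scala
    // `suit` was written by parquet-avro as BINARY (ENUM), but Spark SQL exposes it as a
    // plain StringType column, so the comparison below looks like an ordinary string predicate.
    val df = sqlContext.read.parquet("/tmp/cards.parquet")

    // With spark.sql.parquet.filterPushdown enabled, this predicate is handed to parquet-mr
    // as a push-down filter.  Before this workaround, parquet-mr 1.7.0's ValidTypeMap check
    // rejected it, because the underlying Parquet type is BINARY (ENUM) rather than
    // BINARY (UTF8).
    df.filter(df("suit") === "SPADES").show()
    ```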
    
    The workaround added here relaxes `ValidTypeMap` to include `BINARY (ENUM)`.  I also took the chance to simplify `ParquetCompatibilityTest` a little while adding the regression test.
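
    For reference, the sketch below illustrates one way such a relaxation can be done, by registering `BINARY (ENUM)` with `ValidTypeMap` through reflection.  It is only a sketch: the private `FullTypeDescriptor` nested class and the private static `add` method are assumptions about parquet-mr 1.7.0's internals, and the actual change in this PR may differ.

    ```scala
    import org.apache.parquet.filter2.predicate.ValidTypeMap
    import org.apache.parquet.io.api.Binary
    import org.apache.parquet.schema.OriginalType
    import org.apache.parquet.schema.PrimitiveType.PrimitiveTypeName

    def relaxParquetValidTypeMap(): Unit = {
      // Instantiate the (assumed) private FullTypeDescriptor nested class for BINARY (ENUM).
      val descriptorClass =
        Class.forName(classOf[ValidTypeMap].getName + "$FullTypeDescriptor")
      val ctor = descriptorClass.getDeclaredConstructor(
        classOf[PrimitiveTypeName], classOf[OriginalType])
      ctor.setAccessible(true)
      val enumDescriptor =
        ctor.newInstance(PrimitiveTypeName.BINARY, OriginalType.ENUM).asInstanceOf[AnyRef]

      // Register Binary as a valid value type for BINARY (ENUM) columns via the (assumed)
      // private static `add` method, so push-down filters on such columns pass the check.
      val addMethod = classOf[ValidTypeMap].getDeclaredMethods.find(_.getName == "add").get
      addMethod.setAccessible(true)
      addMethod.invoke(null, classOf[Binary], enumDescriptor)
    }
    ```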
    
    Author: Cheng Lian <lian@databricks.com>
    
    Closes #8107 from liancheng/spark-9407/parquet-enum-filter-push-down.