    [SPARK-3968][SQL] Use parquet-mr filter2 api · 2e35e242
    Yash Datta authored
    The parquet-mr project has introduced a new filter API (https://github.com/apache/incubator-parquet-mr/pull/4), along with several fixes. It can also eliminate entire row groups based on statistics such as min/max.
    We can leverage that to further improve the performance of queries with filters.
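The row-group elimination described above can be sketched in a few lines. This is a minimal illustration in plain Python, not the parquet-mr or Spark implementation: `RowGroupStats`, `can_drop_for_gt`, and `can_drop_for_eq` are hypothetical names, and the statistics are assumed to be simple integer min/max values per row group.

```python
from dataclasses import dataclass


@dataclass
class RowGroupStats:
    """Hypothetical stand-in for the min/max statistics of one column in one row group."""
    min_value: int
    max_value: int


def can_drop_for_gt(stats: RowGroupStats, threshold: int) -> bool:
    """True if no row in the group can satisfy `col > threshold`."""
    return stats.max_value <= threshold


def can_drop_for_eq(stats: RowGroupStats, value: int) -> bool:
    """True if no row in the group can satisfy `col == value`."""
    return value < stats.min_value or value > stats.max_value


# Three row groups covering disjoint value ranges (illustrative data only).
row_groups = [RowGroupStats(0, 99), RowGroupStats(100, 199), RowGroupStats(200, 299)]

# For the filter `col > 150`, only groups whose max exceeds 150 need to be read.
kept_gt = [i for i, s in enumerate(row_groups) if not can_drop_for_gt(s, 150)]

# For the filter `col == 250`, only the group whose range contains 250 is read.
kept_eq = [i for i, s in enumerate(row_groups) if not can_drop_for_eq(s, 250)]
```

Skipped row groups are never decoded, which is where the performance gain for filtered queries would come from.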
    The filter2 API also introduces the ability to create custom filters. We can create a custom filter for the optimized IN clause (InSet), so that elimination happens in the ParquetRecordReader itself.
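As a rough sketch of what such a custom InSet filter could look like, the plain-Python class below pairs a record-level check with a row-group-level check. It is loosely modeled on the idea of a user-defined predicate; the class name `InSetFilter` and its methods `keep` and `can_drop` are assumptions made for illustration and are not taken from this change.

```python
class InSetFilter:
    """Hypothetical custom filter for `col IN (v1, v2, ...)`."""

    def __init__(self, values):
        # A frozenset gives constant-time membership checks per record.
        self.values = frozenset(values)

    def keep(self, value) -> bool:
        """Record-level check: keep the record only if its value is in the set."""
        return value in self.values

    def can_drop(self, min_value, max_value) -> bool:
        """Row-group-level check: drop the group if no set member lies in [min, max]."""
        return not any(min_value <= v <= max_value for v in self.values)


in_set = InSetFilter({3, 42, 250})

# Record-level filtering, as it might happen inside the record reader.
records = [1, 3, 7, 42, 100]
survivors = [r for r in records if in_set.keep(r)]

# Row-group-level pruning using min/max statistics.
drop_middle_group = in_set.can_drop(50, 200)   # no set member in [50, 200]
drop_first_group = in_set.can_drop(0, 10)      # 3 lies in [0, 10]
```

Evaluating the set membership inside the reader means non-matching records are discarded before they are materialized as rows.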
    
    Author: Yash Datta <Yash.Datta@guavus.com>
    
    Closes #2841 from saucam/master and squashes the following commits:
    
    8282ba0 [Yash Datta] SPARK-3968: fix scala code style and add some more tests for filtering on optional columns
    515df1c [Yash Datta] SPARK-3968: Add a test case for filter pushdown on optional column
    5f4530e [Yash Datta] SPARK-3968: Fix scala code style
    f304667 [Yash Datta] SPARK-3968: Using task metadata strategy for row group filtering
    ec53e92 [Yash Datta] SPARK-3968: No push down should result in case we are unable to create a record filter
    48163c3 [Yash Datta] SPARK-3968: Code cleanup
    cc7b596 [Yash Datta] SPARK-3968: 1. Fix RowGroupFiltering not working 2. Use the serialization/deserialization from Parquet library for filter pushdown
    caed851 [Yash Datta] Revert "SPARK-3968: Not pushing the filters in case of OPTIONAL columns" since filtering on optional columns is now supported in filter2 api
    49703c9 [Yash Datta] SPARK-3968: Not pushing the filters in case of OPTIONAL columns
    9d09741 [Yash Datta] SPARK-3968: Change parquet filter pushdown to use filter2 api of parquet-mr