[SPARK-14217] [SQL] Fix bug if parquet data has columns that use dictionary encoding for some of the data

## What changes were proposed in this pull request?

This PR is based on #12017.

Currently, Parquet data in which a column is dictionary encoded for only some of its values produces batches where some values are dictionary encoded and some are not. The non-dictionary-encoded values cause us to remove the dictionary from the batch, so the first values return garbage.

This patch fixes the issue by first decoding the dictionary for the values that are already dictionary encoded before switching. A similar thing is done for the reverse case, where the initial values are not dictionary encoded.

## How was this patch tested?

This is difficult to test, but it was replicated on a test cluster using a large tpcds data set.

Author: Nong Li <nong@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #12279 from davies/fix_dict.
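
The idea behind the fix can be illustrated with a minimal sketch. This is not the code from this patch and does not use Spark's `ColumnVector` API: `SketchColumn` and all of its methods are hypothetical names invented for illustration, and the column is reduced to plain `int` arrays. It shows a batch whose first page is dictionary encoded and whose second page is plain; the dictionary ids are materialized into values before the dictionary is dropped, so the earlier rows stay readable. The reverse case is sketched by decoding incoming dictionary ids eagerly.

```java
// Hypothetical, simplified stand-in for a column vector (NOT Spark's ColumnVector).
final class SketchColumn {
    private final int[] values;        // plain (decoded) value per row
    private final int[] dictionaryIds; // dictionary id per row, used only while a dictionary is set
    private int[] dictionary;          // null when the rows hold plain values

    SketchColumn(int capacity) {
        values = new int[capacity];
        dictionaryIds = new int[capacity];
    }

    void setDictionary(int[] dict) { dictionary = dict; }
    void putDictionaryId(int row, int id) { dictionaryIds[row] = id; }
    void putInt(int row, int value) { values[row] = value; }

    // Reads go through the dictionary while one is attached.
    int getInt(int row) {
        return dictionary != null ? dictionary[dictionaryIds[row]] : values[row];
    }

    // Materialize rows [0, rowCount) through the dictionary, then drop it.
    // Dropping the dictionary WITHOUT this step is what makes early rows garbage.
    void decodeDictionaryIds(int rowCount) {
        if (dictionary == null) return;
        for (int i = 0; i < rowCount; i++) {
            values[i] = dictionary[dictionaryIds[i]];
        }
        dictionary = null;
    }

    // Case 1: dictionary-encoded page followed by a plain page.
    static SketchColumn readDictionaryThenPlain() {
        SketchColumn col = new SketchColumn(4);
        col.setDictionary(new int[] {100, 200});
        col.putDictionaryId(0, 1);
        col.putDictionaryId(1, 0);
        col.decodeDictionaryIds(2); // decode existing rows before switching
        col.putInt(2, 7);
        col.putInt(3, 8);
        return col;
    }

    // Case 2 (reverse): plain page followed by a dictionary-encoded page.
    // The column already holds plain values, so new ids are decoded eagerly.
    static SketchColumn readPlainThenDictionary() {
        SketchColumn col = new SketchColumn(4);
        col.putInt(0, 7);
        col.putInt(1, 8);
        int[] pageDictionary = {100, 200};
        int[] pageIds = {1, 0};
        for (int i = 0; i < pageIds.length; i++) {
            col.putInt(2 + i, pageDictionary[pageIds[i]]);
        }
        return col;
    }

    public static void main(String[] args) {
        SketchColumn a = readDictionaryThenPlain();
        SketchColumn b = readPlainThenDictionary();
        for (int i = 0; i < 4; i++) {
            System.out.println(a.getInt(i) + " " + b.getInt(i));
        }
    }
}
```

In this sketch, skipping the `decodeDictionaryIds(2)` call and simply clearing the dictionary would make rows 0 and 1 read from the untouched `values` array, which mirrors the garbage values described above.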
Changed files:
- sql/core/src/main/java/org/apache/spark/sql/execution/datasources/parquet/VectorizedColumnReader.java (66 additions, 54 deletions)
- sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ColumnVector.java (12 additions, 0 deletions)