[SPARK-10811] [SQL] Eliminates unnecessary byte array copying
When reading Parquet string and binary-backed decimal values, Parquet's `Binary.getBytes` always returns a copied byte array, which is unnecessary. Since the underlying `Binary` implementation there is guaranteed to be `ByteArraySliceBackedBinary`, and Parquet itself never reuses the underlying byte arrays, we can use `Binary.toByteBuffer.array()` to steal the underlying byte arrays without copying them. This brings performance benefits when scanning Parquet string and binary-backed decimal columns. Note that this trick doesn't cover binary-backed decimals with precision greater than 18.

In my micro-benchmark, this brings a ~15% performance boost when scanning the TPC-DS `store_sales` table (scale factor 15).

Another minor optimization in this PR is that `Decimal.toJavaBigDecimal` now constructs a Java `BigDecimal` directly, without constructing a Scala `BigDecimal` first. This brings another ~5% performance gain.

Author: Cheng Lian <lian@databricks.com>

Closes #8907 from liancheng/spark-10811/eliminate-array-copying.
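A minimal sketch of the two ideas, not the actual PR code: it assumes parquet-mr's `org.apache.parquet.io.api.Binary` is on the classpath, and the object and method names (`CopyAvoidanceSketch`, `stringWithoutCopy`, etc.) are hypothetical.

```scala
import java.nio.ByteBuffer
import java.nio.charset.StandardCharsets

import org.apache.parquet.io.api.Binary

// Hypothetical helpers illustrating both optimizations described above.
object CopyAvoidanceSketch {
  // Baseline: Binary.getBytes always allocates and copies a fresh byte array.
  def stringWithCopy(value: Binary): String =
    new String(value.getBytes, StandardCharsets.UTF_8)

  // Optimization: Binary.toByteBuffer wraps the existing backing array for
  // array-backed Binary values, so the bytes can be read without copying.
  def stringWithoutCopy(value: Binary): String = {
    val buffer: ByteBuffer = value.toByteBuffer
    if (buffer.hasArray) {
      // position() and remaining() locate the slice inside the shared array.
      new String(
        buffer.array(),
        buffer.arrayOffset() + buffer.position(),
        buffer.remaining(),
        StandardCharsets.UTF_8)
    } else {
      // Fall back to copying when the buffer is not array-backed.
      stringWithCopy(value)
    }
  }

  // Second optimization: build a java.math.BigDecimal directly from the
  // unscaled long value and scale, rather than via scala.math.BigDecimal.
  def toJavaBigDecimal(unscaledValue: Long, scale: Int): java.math.BigDecimal =
    java.math.BigDecimal.valueOf(unscaledValue, scale)
}
```

The `hasArray` guard keeps the sketch safe for non-array-backed buffers; the converter changed in the PR instead relies on Parquet's guarantee that these `Binary` values are array-backed.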
Showing 3 changed files:
- sql/catalyst/src/main/scala/org/apache/spark/sql/types/Decimal.scala (7 additions, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystRowConverter.scala (19 additions, 7 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/CatalystSchemaConverter.scala (2 additions, 2 deletions)