Skip to content
Snippets Groups Projects
Commit d60b09bb authored by Zongheng Yang's avatar Zongheng Yang Committed by Michael Armbrust
Browse files

[SPARK-2443][SQL] Fix slow read from partitioned tables

This fix obtains a comparable performance boost as [PR #1390](https://github.com/apache/spark/pull/1390) by moving an array update and deserializer initialization out of a potentially very long loop. Suggested by yhuai. The below results are updated for this fix.

## Benchmarks
Generated a local text file with 10M rows of simple key-value pairs. The data is loaded as a table through Hive. Results are obtained on my local machine using hive/console.

Without the fix:

Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.52s end-to-end (1.64s Spark job) | 36.6s (28.3s)
Stablized runs | 1.21s (1.18s) | 27.6s (27.5s)

With this fix:

Type | Non-partitioned | Partitioned (1 part)
------------ | ------------ | -------------
First run | 9.57s (1.46s) | 11.0s (1.69s)
Stablized runs | 1.13s (1.10s) | 1.23s (1.19s)

Author: Zongheng Yang <zongheng.y@gmail.com>

Closes #1408 from concretevitamin/slow-read-2 and squashes the following commits:

d86e437 [Zongheng Yang] Move update & initialization out of potentially long loop.
parent 38ccd6eb
No related branches found
No related tags found
No related merge requests found
......@@ -164,13 +164,17 @@ class HadoopTableReader(@transient _tableDesc: TableDesc, @transient sc: HiveCon
hivePartitionRDD.mapPartitions { iter =>
val hconf = broadcastedHiveConf.value.value
val rowWithPartArr = new Array[Object](2)
// The update and deserializer initialization are intentionally
// kept out of the below iter.map loop to save performance.
rowWithPartArr.update(1, partValues)
val deserializer = localDeserializer.newInstance()
deserializer.initialize(hconf, partProps)
// Map each tuple to a row object
iter.map { value =>
val deserializer = localDeserializer.newInstance()
deserializer.initialize(hconf, partProps)
val deserializedRow = deserializer.deserialize(value)
rowWithPartArr.update(0, deserializedRow)
rowWithPartArr.update(1, partValues)
rowWithPartArr.asInstanceOf[Object]
}
}
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment