Commit a6e2bd31, authored by Nong Li, committed by Davies Liu

[SPARK-13255] [SQL] Update vectorized reader to directly return ColumnarBatch instead of InternalRows.

## What changes were proposed in this pull request?

Currently, the Parquet reader returns rows one at a time, which is bad for performance. This patch
updates the reader to return ColumnarBatches directly. This is only enabled with whole-stage
codegen, which is currently the only operator able to consume ColumnarBatches (instead
of rows). The current implementation is a bit of a hack to get this working, and we should
refactor these low-level interfaces further to make this work better.
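To illustrate the difference between the two access patterns, here is a minimal conceptual sketch in Python. It is not Spark's API and not the code in this patch: `COLUMNS`, `read_rows`, `read_batches`, and the batch size are all hypothetical names and values chosen for illustration. The row-at-a-time reader builds one object per row, while the batch reader hands the consumer whole column slices that it can process in a tight loop.

```python
from typing import Dict, Iterator, List

# Hypothetical toy table stored column-wise (an assumption for illustration).
COLUMNS: Dict[str, List[float]] = {
    "id": [float(i) for i in range(10)],
    "price": [i * 1.5 for i in range(10)],
}


def _num_rows(columns: Dict[str, List[float]]) -> int:
    """Number of rows, taken from the length of the first column."""
    return len(next(iter(columns.values())))


def read_rows(columns: Dict[str, List[float]]) -> Iterator[Dict[str, float]]:
    """Row-at-a-time reader: materializes one dict per row."""
    for i in range(_num_rows(columns)):
        yield {name: values[i] for name, values in columns.items()}


def read_batches(
    columns: Dict[str, List[float]], batch_size: int
) -> Iterator[Dict[str, List[float]]]:
    """Batch reader: yields a slice of every column per batch."""
    for start in range(0, _num_rows(columns), batch_size):
        yield {
            name: values[start:start + batch_size]
            for name, values in columns.items()
        }


def sum_price_rows(columns: Dict[str, List[float]]) -> float:
    """Consumer that pulls one row at a time."""
    return sum(row["price"] for row in read_rows(columns))


def sum_price_batches(columns: Dict[str, List[float]], batch_size: int = 4) -> float:
    """Consumer that processes each column slice in one pass."""
    return sum(sum(batch["price"]) for batch in read_batches(columns, batch_size))
```

Both consumers compute the same result; the batch version simply avoids creating a per-row object and touches only the column it needs.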

## How was this patch tested?

```
Results:
TPCDS:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
---------------------------------------------------------------------------------
q55 (before)                             8897 / 9265         12.9          77.2
q55                                      5486 / 5753         21.0          47.6
```
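For a quick read of the table above, the speedup can be derived from the reported numbers. The snippet below is only arithmetic on the figures in the table; the function name `speedup` is a hypothetical helper. It gives roughly a 1.6x improvement on q55 for both the best and average times.

```python
def speedup(before: float, after: float) -> float:
    """Ratio of the old measurement to the new one (higher means faster now)."""
    return before / after


# Figures copied from the q55 rows of the results table.
best_speedup = speedup(8897, 5486)      # best time in ms, before vs. after
avg_speedup = speedup(9265, 5753)       # average time in ms, before vs. after
per_row_speedup = speedup(77.2, 47.6)   # ns per row, before vs. after

print(f"best: {best_speedup:.2f}x, avg: {avg_speedup:.2f}x, per row: {per_row_speedup:.2f}x")
```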

Author: Nong Li <nong@databricks.com>

Closes #11435 from nongli/spark-13255.
parent 5f42c28b
Changes: 284 additions and 45 deletions