python/pyspark/java_gateway.py · fa28d6f803e2391c61d3663f9e66a63ff43dd3b4 · cs525-sp18-g07 / spark

[SPARK-20791][PYSPARK] Use Arrow to create Spark DataFrame from Pandas · 209b9361

Bryan Cutler authored 7 years ago

## What changes were proposed in this pull request?

This change uses Arrow to optimize the creation of a Spark DataFrame from a Pandas DataFrame. The input Pandas DataFrame is sliced according to the default parallelism. The optimization is enabled with the existing conf "spark.sql.execution.arrow.enabled", which is disabled by default.
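
The slicing step can be illustrated with a minimal sketch. It is not the code from this patch: the helper name `slice_by_parallelism` is hypothetical, and a plain Python list stands in for the Pandas DataFrame so that the example has no dependencies. It assumes that "sliced according to the default parallelism" means splitting the rows into at most `parallelism` contiguous chunks of near-equal size.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def slice_by_parallelism(rows: Sequence[T], parallelism: int) -> List[Sequence[T]]:
    """Split rows into at most `parallelism` contiguous slices (hypothetical helper)."""
    if parallelism < 1:
        raise ValueError("parallelism must be at least 1")
    # Ceiling division gives the slice length; max() avoids a zero step for empty input.
    step = max(1, -(-len(rows) // parallelism))
    return [rows[start:start + step] for start in range(0, len(rows), step)]


if __name__ == "__main__":
    # Ten rows with a parallelism of 4 yield slices of length 3, 3, 3 and 1.
    print(slice_by_parallelism(list(range(10)), 4))
```

In a real session the optimization would be switched on with `spark.conf.set("spark.sql.execution.arrow.enabled", "true")` before calling `spark.createDataFrame(pandas_df)`; with the conf left at its default, the non-Arrow path is used.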

## How was this patch tested?

Added a new unit test that creates a DataFrame with and without the optimization enabled, then compares the results.

Author: Bryan Cutler <cutlerb@gmail.com>
Author: Takuya UESHIN <ueshin@databricks.com>

Closes #19459 from BryanCutler/arrow-createDataFrame-from_pandas-SPARK-20791.

java_gateway.py 5.78 KiB