[SPARK-20451] Filter out nested mapType datatypes from sort order in randomSplit
## What changes were proposed in this pull request?

In `randomSplit`, the underlying dataset may not guarantee the same ordering of rows within its constituent partitions each time a split is materialized, which could result in overlapping splits.

To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapTypes` cannot be sorted, this patch explicitly prunes them out from the sort order. Additionally, if the resulting sort order is empty, this patch then materializes the dataset to guarantee determinism.

## How was this patch tested?

Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test for dataframes with mapTypes and nested mapTypes.

Author: Sameer Agarwal <sameerag@cs.berkeley.edu>

Closes #17751 from sameeragarwal/randomsplit2.

(cherry picked from commit 31345fde)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
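
The pruning rule described above can be illustrated with a minimal, dependency-free sketch. It is not the code from this patch: `DType`, `Column`, `SortOrderSketch`, and the string returned by `plan` are hypothetical stand-ins for Spark's data types and logical plan, chosen only to show how a map nested inside an array or struct excludes a column from the sort order, and how an empty sort order falls back to materializing the dataset.

```scala
// Toy stand-ins for Spark SQL data types (hypothetical, not Spark's classes).
sealed trait DType
case object IntT extends DType
case object StringT extends DType
final case class ArrayT(element: DType) extends DType
final case class MapT(key: DType, value: DType) extends DType
final case class StructT(fields: List[(String, DType)]) extends DType

// A named column in a toy schema (hypothetical).
final case class Column(name: String, dtype: DType)

object SortOrderSketch {
  // A type is treated as orderable only if no map appears anywhere inside it.
  def isOrderable(t: DType): Boolean = t match {
    case MapT(_, _)      => false
    case ArrayT(e)       => isOrderable(e)
    case StructT(fields) => fields.forall { case (_, ft) => isOrderable(ft) }
    case _               => true // IntT, StringT
  }

  // Columns that can take part in the per-partition sort.
  def sortColumns(schema: List[Column]): List[Column] =
    schema.filter(c => isOrderable(c.dtype))

  // Assumed decision rule: sort when possible, otherwise materialize.
  def plan(schema: List[Column]): String = {
    val cols = sortColumns(schema)
    if (cols.nonEmpty) s"sort by ${cols.map(_.name).mkString(", ")}"
    else "materialize (cache) the dataset"
  }
}
```

With a schema of an integer `id` column plus a map column, `plan` keeps only `id` in the sort order; with a schema made up solely of map (or map-containing) columns, it reports the materialization fallback instead.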
Showing
- sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala (13 additions, 5 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/DataFrameStatSuite.scala (28 additions, 15 deletions)