-
- Downloads
[SPARK-11009] [SQL] fix wrong result of Window function in cluster mode
Currently, All windows function could generate wrong result in cluster sometimes. The root cause is that AttributeReference is called in executor, then id of it may not be unique than others created in driver. Here is the script that could reproduce the problem (run in local cluster): ``` from pyspark import SparkContext, HiveContext from pyspark.sql.window import Window from pyspark.sql.functions import rowNumber sqlContext = HiveContext(SparkContext()) sqlContext.setConf("spark.sql.shuffle.partitions", "3") df = sqlContext.range(1<<20) df2 = df.select((df.id % 1000).alias("A"), (df.id / 1000).alias('B')) ws = Window.partitionBy(df2.A).orderBy(df2.B) df3 = df2.select("client", "date", rowNumber().over(ws).alias("rn")).filter("rn < 0") assert df3.count() == 0 ``` Author: Davies Liu <davies@databricks.com> Author: Yin Huai <yhuai@databricks.com> Closes #9050 from davies/wrong_window.
Showing
- sql/core/src/main/scala/org/apache/spark/sql/execution/Window.scala 10 additions, 10 deletions...rc/main/scala/org/apache/spark/sql/execution/Window.scala
- sql/hive/src/test/scala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala 41 additions, 0 deletions...cala/org/apache/spark/sql/hive/HiveSparkSubmitSuite.scala
Please register or sign in to comment