Commit a885d5bb authored by buzhihuojie's avatar buzhihuojie Committed by Reynold Xin

[SPARK-17895] Improve doc for rangeBetween and rowsBetween

## What changes were proposed in this pull request?

Copied description for row and range based frame boundary from https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L56

Added examples to show different behavior of rangeBetween and rowsBetween when involving duplicate values.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
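
For illustration, the behavioral difference described above can be reproduced end to end with a small self-contained snippet. This is a sketch only: it assumes a local SparkSession bound to a `spark` value (as in spark-shell) and mirrors the data used in the added Scaladoc examples below.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("frame-docs").getOrCreate()
import spark.implicits._

val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
  .toDF("id", "category")

// Row-based frame: positional, so the two id = 1 rows in category "a"
// get different sums (current row plus the single following row).
df.withColumn("sum",
  sum($"id").over(Window.partitionBy($"category").orderBy($"id").rowsBetween(0, 1)))
  .show()

// Range-based frame: value-based, so both id = 1 rows in category "a"
// get the same sum (every row with id in [1, 2] is included).
df.withColumn("sum",
  sum($"id").over(Window.partitionBy($"category").orderBy($"id").rangeBetween(0, 1)))
  .show()
```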

Author: buzhihuojie <ren.weiluo@gmail.com>

Closes #15727 from david-weiluo-ren/improveDocForRangeAndRowsBetween.

(cherry picked from commit 742e0fea)
Signed-off-by: Reynold Xin <rxin@databricks.com>
parent 9be06912
@@ -121,6 +121,32 @@ object Window {
 * and [[Window.currentRow]] to specify special boundary values, rather than using integral
 * values directly.
 *
 * A row based boundary is based on the position of the row within the partition.
 * An offset indicates the number of rows above or below the current row at which the frame
 * for the current row starts or ends. For instance, given a row based sliding frame with a
 * lower bound offset of -1 and an upper bound offset of +2, the frame for the row with
 * index 5 would range from index 4 to index 7.
*
 * {{{
 *   import org.apache.spark.sql.expressions.Window
 *   import org.apache.spark.sql.functions._
 *   import spark.implicits._  // assumes a SparkSession named `spark` in scope, as in spark-shell
 *   val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
 *     .toDF("id", "category")
 *   df.withColumn("sum",
 *       sum('id) over Window.partitionBy('category).orderBy('id).rowsBetween(0, 1))
 *     .show()
*
* +---+--------+---+
* | id|category|sum|
* +---+--------+---+
* | 1| b| 3|
* | 2| b| 5|
* | 3| b| 3|
* | 1| a| 2|
* | 1| a| 3|
* | 2| a| 2|
* +---+--------+---+
* }}}
*
 * @param start boundary start, inclusive. The frame is unbounded if this is
 *              the minimum long value ([[Window.unboundedPreceding]]).
 * @param end boundary end, inclusive. The frame is unbounded if this is the
@@ -144,6 +170,35 @@ object Window {
 * and [[Window.currentRow]] to specify special boundary values, rather than using integral
 * values directly.
 *
 * A range based boundary is based on the actual value of the ORDER BY
 * expression(s). An offset is used to alter the value of the ORDER BY expression: for
 * instance, if the current ORDER BY expression has a value of 10 and the lower bound offset
 * is -3, the resulting lower bound for the current row will be 10 - 3 = 7. This however puts
 * a number of constraints on the ORDER BY expressions: there can be only one expression, and
 * this expression must have a numerical data type. An exception can be made when the offset
 * is 0, because no value modification is needed; in this case multiple and non-numeric
 * ORDER BY expressions are allowed.
*
 * {{{
 *   import org.apache.spark.sql.expressions.Window
 *   import org.apache.spark.sql.functions._
 *   import spark.implicits._  // assumes a SparkSession named `spark` in scope, as in spark-shell
 *   val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
 *     .toDF("id", "category")
 *   df.withColumn("sum",
 *       sum('id) over Window.partitionBy('category).orderBy('id).rangeBetween(0, 1))
 *     .show()
*
* +---+--------+---+
* | id|category|sum|
* +---+--------+---+
* | 1| b| 3|
* | 2| b| 5|
* | 3| b| 3|
* | 1| a| 4|
* | 1| a| 4|
* | 2| a| 2|
* +---+--------+---+
* }}}
*
 * @param start boundary start, inclusive. The frame is unbounded if this is
 *              the minimum long value ([[Window.unboundedPreceding]]).
 * @param end boundary end, inclusive. The frame is unbounded if this is the
...
@@ -89,6 +89,32 @@ class WindowSpec private[sql](
 * and [[Window.currentRow]] to specify special boundary values, rather than using integral
 * values directly.
 *
 * A row based boundary is based on the position of the row within the partition.
 * An offset indicates the number of rows above or below the current row at which the frame
 * for the current row starts or ends. For instance, given a row based sliding frame with a
 * lower bound offset of -1 and an upper bound offset of +2, the frame for the row with
 * index 5 would range from index 4 to index 7.
*
 * {{{
 *   import org.apache.spark.sql.expressions.Window
 *   import org.apache.spark.sql.functions._
 *   import spark.implicits._  // assumes a SparkSession named `spark` in scope, as in spark-shell
 *   val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
 *     .toDF("id", "category")
 *   df.withColumn("sum",
 *       sum('id) over Window.partitionBy('category).orderBy('id).rowsBetween(0, 1))
 *     .show()
*
* +---+--------+---+
* | id|category|sum|
* +---+--------+---+
* | 1| b| 3|
* | 2| b| 5|
* | 3| b| 3|
* | 1| a| 2|
* | 1| a| 3|
* | 2| a| 2|
* +---+--------+---+
* }}}
*
 * @param start boundary start, inclusive. The frame is unbounded if this is
 *              the minimum long value ([[Window.unboundedPreceding]]).
 * @param end boundary end, inclusive. The frame is unbounded if this is the
@@ -111,6 +137,35 @@ class WindowSpec private[sql](
 * and [[Window.currentRow]] to specify special boundary values, rather than using integral
 * values directly.
 *
 * A range based boundary is based on the actual value of the ORDER BY
 * expression(s). An offset is used to alter the value of the ORDER BY expression: for
 * instance, if the current ORDER BY expression has a value of 10 and the lower bound offset
 * is -3, the resulting lower bound for the current row will be 10 - 3 = 7. This however puts
 * a number of constraints on the ORDER BY expressions: there can be only one expression, and
 * this expression must have a numerical data type. An exception can be made when the offset
 * is 0, because no value modification is needed; in this case multiple and non-numeric
 * ORDER BY expressions are allowed.
*
 * {{{
 *   import org.apache.spark.sql.expressions.Window
 *   import org.apache.spark.sql.functions._
 *   import spark.implicits._  // assumes a SparkSession named `spark` in scope, as in spark-shell
 *   val df = Seq((1, "a"), (1, "a"), (2, "a"), (1, "b"), (2, "b"), (3, "b"))
 *     .toDF("id", "category")
 *   df.withColumn("sum",
 *       sum('id) over Window.partitionBy('category).orderBy('id).rangeBetween(0, 1))
 *     .show()
*
* +---+--------+---+
* | id|category|sum|
* +---+--------+---+
* | 1| b| 3|
* | 2| b| 5|
* | 3| b| 3|
* | 1| a| 4|
* | 1| a| 4|
* | 2| a| 2|
* +---+--------+---+
* }}}
*
 * @param start boundary start, inclusive. The frame is unbounded if this is
 *              the minimum long value ([[Window.unboundedPreceding]]).
 * @param end boundary end, inclusive. The frame is unbounded if this is the
...
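
The @param text above refers to the named boundary constants; as a usage note, an unbounded (cumulative) frame is typically written with those constants rather than raw Long values. A hedged sketch, reusing df and the imports from the snippet at the top of this page:

```scala
// Cumulative sum per category: the frame spans from the start of the
// partition up to and including the current row.
df.withColumn("running_sum",
  sum($"id").over(
    Window.partitionBy($"category").orderBy($"id")
      .rowsBetween(Window.unboundedPreceding, Window.currentRow)))
  .show()
```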