Skip to content
Snippets Groups Projects
Commit 5d00a7bc authored by Tathagata Das's avatar Tathagata Das
Browse files

[SPARK-16256][DOCS] Fix window operation diagram

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #14001 from tdas/SPARK-16256-2.
parent c6226334
No related branches found
No related tags found
No related merge requests found
docs/img/structured-streaming-late-data.png

136 KiB | W: | H:

docs/img/structured-streaming-late-data.png

135 KiB | W: | H:

docs/img/structured-streaming-late-data.png
docs/img/structured-streaming-late-data.png
docs/img/structured-streaming-late-data.png
docs/img/structured-streaming-late-data.png
  • 2-up
  • Swipe
  • Onion skin
docs/img/structured-streaming-window.png

126 KiB | W: | H:

docs/img/structured-streaming-window.png

130 KiB | W: | H:

docs/img/structured-streaming-window.png
docs/img/structured-streaming-window.png
docs/img/structured-streaming-window.png
docs/img/structured-streaming-window.png
  • 2-up
  • Swipe
  • Onion skin
No preview for this file type
......@@ -620,7 +620,7 @@ df.groupBy("type").count()
### Window Operations on Event Time
Aggregations over a sliding event-time window are straightforward with Structured Streaming. The key idea to understand about window-based aggregations are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into. Let's understand this with an illustration.
Imagine the quick example is modified and the stream contains lines along with the time when the line was generated. Instead of running word counts, we want to count words within 10 minute windows, updating every 5 minutes. That is, word counts in words received between 10 minute windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, etc. Note that 12:00 - 12:10 means data that arrived after 12:00 but before 12:10. Now, consider a word that was received at 12:07. This word should increment the counts corresponding to two windows 12:00 - 12:10 and 12:05 - 12:15. So the counts will be indexed by both, the grouping key (i.e. the word) and the window (can be calculated from the event-time).
Imagine our quick example is modified and the stream now contains lines along with the time when the line was generated. Instead of running word counts, we want to count words within 10 minute windows, updating every 5 minutes. That is, word counts in words received between 10 minute windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, etc. Note that 12:00 - 12:10 means data that arrived after 12:00 but before 12:10. Now, consider a word that was received at 12:07. This word should increment the counts corresponding to two windows 12:00 - 12:10 and 12:05 - 12:15. So the counts will be indexed by both, the grouping key (i.e. the word) and the window (can be calculated from the event-time).
The result tables would look something like the following.
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment