diff --git a/docs/img/structured-streaming-late-data.png b/docs/img/structured-streaming-late-data.png index 5276b4786826236eb9ce043c9373870350aea7ff..2283f6782f380bcb352f294635aaf4165e1611dc 100644 Binary files a/docs/img/structured-streaming-late-data.png and b/docs/img/structured-streaming-late-data.png differ diff --git a/docs/img/structured-streaming-window.png b/docs/img/structured-streaming-window.png index be9d3fbf8ba813b9883b2e38ee3b42536ddd1646..c1842b1ca4fe09af0ad2acae60251f712c7720b6 100644 Binary files a/docs/img/structured-streaming-window.png and b/docs/img/structured-streaming-window.png differ diff --git a/docs/img/structured-streaming.pptx b/docs/img/structured-streaming.pptx index c278323554da8ca3cfca0e3aa7eef3f6225997a6..6aad2ed33e9248341150ab77cfee7d678891efa2 100644 Binary files a/docs/img/structured-streaming.pptx and b/docs/img/structured-streaming.pptx differ diff --git a/docs/structured-streaming-programming-guide.md b/docs/structured-streaming-programming-guide.md index 593256603f92aa1c713495309528d70678ba81d3..79493968db27479085c7bad21013b32b1231137c 100644 --- a/docs/structured-streaming-programming-guide.md +++ b/docs/structured-streaming-programming-guide.md @@ -620,7 +620,7 @@ df.groupBy("type").count() ### Window Operations on Event Time Aggregations over a sliding event-time window are straightforward with Structured Streaming. The key idea to understand about window-based aggregations are very similar to grouped aggregations. In a grouped aggregation, aggregate values (e.g. counts) are maintained for each unique value in the user-specified grouping column. In case of window-based aggregations, aggregate values are maintained for each window the event-time of a row falls into. Let's understand this with an illustration. -Imagine the quick example is modified and the stream contains lines along with the time when the line was generated. Instead of running word counts, we want to count words within 10 minute windows, updating every 5 minutes. That is, word counts in words received between 10 minute windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, etc. Note that 12:00 - 12:10 means data that arrived after 12:00 but before 12:10. Now, consider a word that was received at 12:07. This word should increment the counts corresponding to two windows 12:00 - 12:10 and 12:05 - 12:15. So the counts will be indexed by both, the grouping key (i.e. the word) and the window (can be calculated from the event-time). +Imagine our quick example is modified and the stream now contains lines along with the time when the line was generated. Instead of running word counts, we want to count words within 10 minute windows, updating every 5 minutes. That is, word counts in words received between 10 minute windows 12:00 - 12:10, 12:05 - 12:15, 12:10 - 12:20, etc. Note that 12:00 - 12:10 means data that arrived after 12:00 but before 12:10. Now, consider a word that was received at 12:07. This word should increment the counts corresponding to two windows 12:00 - 12:10 and 12:05 - 12:15. So the counts will be indexed by both, the grouping key (i.e. the word) and the window (can be calculated from the event-time). The result tables would look something like the following.