[SPARK-18826][SS] Add 'latestFirst' option to FileStreamSource
## What changes were proposed in this pull request?

When starting a stream with a large backfill and `maxFilesPerTrigger` set, users often want to process the most recent files first. This keeps latency low for recent data while historical data is backfilled gradually.

This PR adds a new option, `latestFirst`, to control this behavior. When it is true, `FileStreamSource` sorts the files by modification time from latest to oldest and takes the first `maxFilesPerTrigger` files as a new batch.

## How was this patch tested?

The added test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16251 from zsxwing/newest-first.

(cherry picked from commit 68a6dc97)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
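The selection rule described in the commit message (sort by modification time, newest first, then take `maxFilesPerTrigger` files) can be illustrated with a minimal, dependency-free sketch. This is not the code from `FileStreamSource`; `FileEntry`, `LatestFirstSketch`, and `nextBatch` are hypothetical names introduced only for illustration, and the oldest-first ordering used when `latestFirst` is false is an assumption.

```scala
// Hypothetical stand-in for a file discovered by the source.
// This is NOT Spark's internal type; it exists only for this sketch.
final case class FileEntry(path: String, modifiedTime: Long)

object LatestFirstSketch {
  // Choose the files that would form the next batch.
  // Assumption: without latestFirst, files are ordered oldest first.
  def nextBatch(
      unprocessed: Seq[FileEntry],
      maxFilesPerTrigger: Option[Int],
      latestFirst: Boolean): Seq[FileEntry] = {
    val ordered =
      if (latestFirst) {
        // Newest modification time first.
        unprocessed.sortBy(_.modifiedTime)(Ordering[Long].reverse)
      } else {
        // Oldest modification time first.
        unprocessed.sortBy(_.modifiedTime)
      }
    // With no limit, the whole ordered list is one batch.
    maxFilesPerTrigger.fold(ordered)(n => ordered.take(n))
  }
}
```

With a limit of two files per trigger and `latestFirst = true`, the two most recently modified files form the first batch, and older files wait for later triggers.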
Showing 3 changed files
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala (14 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala (10 additions, 1 deletion)
- sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala (47 additions, 0 deletions)