[SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files from being processed multiple times

Because of a corner case, a file already selected for batch t can get considered again for batch t+2. This refactoring fixes it by remembering all the files selected in the last 1 minute, so that this corner case does not arise. It also uses the SparkContext's Hadoop configuration to access the file system API for listing directories.

pwendell Please take a look. I still have not run long-running integration tests, so I cannot say for sure whether this has indeed solved the issue. You could do a first pass on this in the meantime.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #3419 from tdas/filestream-fix2 and squashes the following commits:

- c19dd8a [Tathagata Das] Addressed PR comments.
- 513b608 [Tathagata Das] Updated docs.
- d364faf [Tathagata Das] Added the current time condition back
- 5526222 [Tathagata Das] Removed unnecessary imports.
- 38bb736 [Tathagata Das] Fix long line.
- 203bbc7 [Tathagata Das] Un-ignore tests.
- eaef4e1 [Tathagata Das] Fixed SPARK-4519
- 9dbd40a [Tathagata Das] Refactored FileInputDStream to remember last few batches.
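The core idea described above — remember which files were selected during a recent time window so a file picked for batch t cannot be re-selected for batch t+2 — can be sketched as follows. This is a simplified illustration, not the actual `FileInputDStream` code; the class name `RecentFileTracker` and its methods are hypothetical:

```scala
import scala.collection.mutable

// Hypothetical sketch of the "remember recently selected files" technique
// described in the commit message. Files selected within the remember window
// are tracked so the same file is never selected for two different batches.
class RecentFileTracker(rememberDurationMs: Long) {
  // Maps a file path to the time at which it was selected for a batch.
  private val selected = mutable.Map[String, Long]()

  /** Returns only the files not already selected, and records them. */
  def filterNew(files: Seq[String], currentTimeMs: Long): Seq[String] = {
    // Forget entries older than the remember window; a file can be
    // selected again only after it falls out of the window.
    val expired = selected.collect {
      case (f, t) if currentTimeMs - t >= rememberDurationMs => f
    }
    selected --= expired
    val fresh = files.filterNot(selected.contains)
    fresh.foreach(f => selected(f) = currentTimeMs)
    fresh
  }
}
```

With a one-minute window (matching the "last 1 minute" mentioned above), a file listed again one batch interval later is filtered out, while a genuinely new file passes through.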
Showing 4 changed files:
- streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala (1 addition, 1 deletion)
- streaming/src/main/scala/org/apache/spark/streaming/dstream/FileInputDStream.scala (186 additions, 105 deletions)
- streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala (1 addition, 1 deletion)
- streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala (57 additions, 49 deletions)