[SPARK-17165][SQL] FileStreamSource should not track the list of seen files indefinitely
## What changes were proposed in this pull request?

Before this change, FileStreamSource used an in-memory hash set to track the files processed by the engine. The set could grow indefinitely, leading to OOM or overflow of the hash set.

This patch introduces a new user-defined option called "maxFileAge", which defaults to 24 hours. If a file is older than this age, FileStreamSource purges it from the in-memory map used to track the files that have been processed.

## How was this patch tested?

Added unit tests for the underlying utility, and also added an end-to-end test to validate the purge in FileStreamSourceSuite. Also verified that the new test cases fail when the timeout is set to a very large number.

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14728 from petermaxlee/SPARK-17165.
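To make the age-based purge concrete, here is a minimal sketch of a seen-files tracker. It is an illustration only, not the code added by this patch: the class name `SeenFiles`, its methods, and the choice to measure age against the newest timestamp seen so far are all assumptions made for the example.

```scala
import scala.collection.mutable

// Hypothetical, simplified tracker; NOT the class introduced by the patch.
// Assumption: "age" is measured relative to the newest timestamp seen so far.
final class SeenFiles(maxAgeMs: Long) {
  require(maxAgeMs >= 0, "maxAgeMs must be non-negative")

  // path -> file modification timestamp in milliseconds
  private val seen = mutable.HashMap.empty[String, Long]
  // Largest timestamp observed so far
  private var latestTimestamp: Long = 0L
  // Entries with a timestamp below this value were dropped by the last purge
  private var lastPurgeThreshold: Long = 0L

  // Record a file as processed.
  def add(path: String, timestamp: Long): Unit = {
    seen.put(path, timestamp)
    if (timestamp > latestTimestamp) latestTimestamp = timestamp
  }

  // A file counts as new only if it is not tracked and is not older than
  // the last purge threshold; otherwise purged files would look new again.
  def isNewFile(path: String, timestamp: Long): Boolean =
    timestamp >= lastPurgeThreshold && !seen.contains(path)

  // Drop entries older than maxAgeMs and return how many were removed.
  def purge(): Int = {
    lastPurgeThreshold = latestTimestamp - maxAgeMs
    val stale = seen.collect { case (p, ts) if ts < lastPurgeThreshold => p }.toList
    seen --= stale
    stale.size
  }

  def size: Int = seen.size
}
```

In this sketch, memory stays bounded by the number of files that fall within the age window, at the cost of ignoring any file whose timestamp is older than the threshold.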
Changed files:
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamOptions.scala (54 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala (117 additions, 32 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala (1 addition, 1 deletion)
- sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/FileStreamSourceSuite.scala (76 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala (37 additions, 3 deletions)