[SPARK-19633][SS] FileSource read from FileSink
## What changes were proposed in this pull request?

Right now the file source always uses `InMemoryFileIndex` to scan files from a given path.

But when reading the outputs of another streaming query, the file source should use `MetadataFileIndex` to list files from the sink log. This patch adds that support.

## `MetadataFileIndex` or `InMemoryFileIndex`

```scala
spark
  .readStream
  .format(...)
  .load("/some/path") // for a non-glob path:
                      //   - use `MetadataFileIndex` when `/some/path/_spark_meta` exists
                      //   - fall back to `InMemoryFileIndex` otherwise
```

```scala
spark
  .readStream
  .format(...)
  .load("/some/path/*/*") // for a glob path: always use `InMemoryFileIndex`
```

## How was this patch tested?

Two newly added tests.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16987 from lw-lin/source-read-from-sink.
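The selection rule described in the commit message can be sketched independently of Spark. This is a minimal illustration, not the code from the patch: `IndexChoice`, `IndexSelection`, `choose`, `isGlobPath`, and `localHasMetadataDir` are hypothetical names, the set of glob characters is an assumption, and the metadata directory name is copied as written in the commit message.

```scala
import java.nio.file.{Files, Paths}

// Hypothetical names for illustration only; these are not identifiers from the patch.
sealed trait IndexChoice
case object UseMetadataFileIndex extends IndexChoice
case object UseInMemoryFileIndex extends IndexChoice

object IndexSelection {
  // Directory name as written in the commit message (assumed, not verified here).
  val MetadataDirName = "_spark_meta"

  // Assumption: a path is treated as a glob if it contains any of these characters.
  private val GlobChars = "{}[]*?\\"

  def isGlobPath(path: String): Boolean =
    path.exists(c => GlobChars.indexOf(c.toInt) >= 0)

  // The existence check is passed in so the rule can be exercised without a file system.
  def choose(path: String, hasMetadataDir: String => Boolean): IndexChoice =
    if (!isGlobPath(path) && hasMetadataDir(path)) UseMetadataFileIndex
    else UseInMemoryFileIndex

  // One possible existence check, for a local (non-HDFS) path.
  def localHasMetadataDir(path: String): Boolean =
    Files.isDirectory(Paths.get(path, MetadataDirName))
}
```

For example, `IndexSelection.choose("/some/path", IndexSelection.localHasMetadataDir)` picks the metadata-based index only when the sink log directory is present, while any glob path short-circuits to the in-memory index without touching the file system.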
Showing
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (3 additions, 23 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala (26 additions, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala (58 additions, 5 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala (106 additions, 11 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/streaming/StreamTest.scala (7 additions, 1 deletion)