[SPARK-14078] Streaming Parquet Based FileSink
This PR adds a new `Sink` implementation that writes out Parquet files. To handle partial failures correctly while maintaining exactly-once semantics, the files for each batch are written to a unique directory and then atomically appended to a metadata log. When a Parquet-based `DataSource` is initialized for reading, we first check for this log directory and, when it is present, use it instead of file listing.

Unit tests are added, as well as a stress test that checks the answer after non-deterministic injected failures.

Author: Michael Armbrust <michael@databricks.com>

Closes #11897 from marmbrus/fileSink.
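The write-then-commit protocol described above can be sketched roughly as follows. This is a minimal illustration using plain `java.nio` on a local filesystem, not the code from this PR: the class name `SketchFileSink`, the `_metadata` directory name, the `batch-<id>-<uuid>` layout, and the text-file "data" are all assumptions made for the example, and it assumes a single writer per output directory.

```scala
import java.nio.charset.StandardCharsets.UTF_8
import java.nio.file.{Files, Path, StandardCopyOption}
import java.util.UUID

// Hypothetical, simplified stand-in for a file sink; NOT Spark's FileStreamSink.
class SketchFileSink(root: Path) {
  // Assumed layout: <root>/_metadata/<batchId> lists the files committed by that batch.
  private val logDir: Path = Files.createDirectories(root.resolve("_metadata"))

  def addBatch(batchId: Long, rows: Seq[String]): Unit = {
    val entry = logDir.resolve(batchId.toString)
    // A batch that is already in the log was committed earlier; replaying it is a no-op.
    if (!Files.exists(entry)) {
      // 1. Write the data into a directory unique to this attempt, so a failed
      //    attempt can never clobber or mix with another attempt's files.
      val attemptDir =
        Files.createDirectories(root.resolve(s"batch-$batchId-${UUID.randomUUID()}"))
      val dataFile = attemptDir.resolve("part-00000")
      Files.write(dataFile, rows.mkString("\n").getBytes(UTF_8))

      // 2. Commit: write the file list to a temporary file, then rename it into
      //    the log in one atomic step. Readers see the whole batch or none of it.
      val tmp = Files.createTempFile(root, ".commit-", ".tmp")
      Files.write(tmp, root.relativize(dataFile).toString.getBytes(UTF_8))
      Files.move(tmp, entry, StandardCopyOption.ATOMIC_MOVE)
    }
  }
}

object SketchFileSink {
  // Reader side: when the log directory exists, trust it instead of listing files.
  def committedFiles(root: Path): Seq[Path] = {
    val logDir = root.resolve("_metadata")
    if (Files.isDirectory(logDir)) {
      logDir.toFile.listFiles().toSeq
        .sortBy(_.getName.toLong) // log entries are named by batch id
        .flatMap { e =>
          new String(Files.readAllBytes(e.toPath), UTF_8)
            .split("\n").filter(_.nonEmpty).map(s => root.resolve(s))
        }
    } else {
      // No log present: fall back to a plain (non-recursive) listing of root.
      Option(root.toFile.listFiles()).toSeq.flatten.filter(_.isFile).map(_.toPath)
    }
  }
}
```

Files left behind by an attempt that failed before step 2 stay on disk but never appear in the log, so a reader that consults the log does not see them. That is why the reader prefers the log over file listing.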
Showing 14 changed files
- sql/core/src/main/scala/org/apache/spark/sql/ContinuousQuery.scala 9 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/ContinuousQueryException.scala 3 additions, 3 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala 60 additions, 4 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/CompositeOffset.scala 3 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSink.scala 81 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/FileStreamSource.scala 10 additions, 4 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLog.scala 4 additions, 3 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/LongOffset.scala 2 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetadataLog.scala 1 addition, 1 deletion
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala 18 additions, 0 deletions
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamFileCatalog.scala 59 additions, 0 deletions
- sql/core/src/test/scala/org/apache/spark/sql/execution/streaming/HDFSMetadataLogSuite.scala 2 additions, 0 deletions
- sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSinkSuite.scala 49 additions, 0 deletions
- sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStressSuite.scala 129 additions, 0 deletions