[SPARK-15103][SQL] Refactored FileCatalog class to allow StreamFileCatalog to infer partitioning
## What changes were proposed in this pull request?

File Stream Sink writes the list of written files to a metadata log, and StreamFileCatalog reads that list of files for processing. However, StreamFileCatalog does not infer partitioning the way HDFSFileCatalog does. This PR enables that by refactoring HDFSFileCatalog to create an abstract class, PartitioningAwareFileCatalog, which has all the functionality needed to infer partitions from a list of leaf files.

- HDFSFileCatalog has been renamed to ListingFileCatalog. It extends PartitioningAwareFileCatalog by providing a list of leaf files obtained from recursive directory scanning.
- StreamFileCatalog has been renamed to MetadataLogFileCatalog. It extends PartitioningAwareFileCatalog by providing a list of leaf files obtained from the metadata log.
- The above two classes have been moved into their own files, as they are not interfaces that should be in fileSourceInterfaces.scala.

## How was this patch tested?

- FileStreamSinkSuite was updated to check that partitioning gets inferred and that, on reading, the partitions get pruned correctly based on the query.
- Other unit tests are unchanged and pass as expected.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12879 from tdas/SPARK-15103.
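The refactoring described above can be pictured with a minimal, self-contained sketch. It is not the code from this patch: every name ending in `Sketch`, the in-memory `Dir`/`Leaf` tree, and the `prune` helper are hypothetical stand-ins, and partition inference is reduced to parsing Hive-style `column=value` directory segments. The point is only the shape of the design, where the base class owns partition inference and each subclass supplies the leaf files from a different source.

```scala
// Simplified illustration only. All names here are hypothetical, not Spark's classes.

// Shared base: infers partition values from whatever leaf files a subclass supplies.
abstract class PartitioningAwareCatalogSketch {
  // Subclasses decide where the list of leaf files comes from.
  def leafFiles: Seq[String]

  // Parse "column=value" directory segments out of one file path
  // (the final segment is the file name, so it is dropped first).
  private def partitionValues(path: String): Map[String, String] =
    path.split("/").dropRight(1)
      .map(_.split("=", 2))
      .collect { case Array(k, v) if k.nonEmpty => k -> v }
      .toMap

  // Inferred partition values for every leaf file.
  def partitions: Map[String, Map[String, String]] =
    leafFiles.map(f => f -> partitionValues(f)).toMap

  // Toy partition pruning: keep only files whose partition column has the given value.
  def prune(column: String, value: String): Seq[String] =
    leafFiles.filter(f => partitionValues(f).get(column).contains(value))
}

// Tiny in-memory stand-in for a file system tree.
sealed trait Node { def name: String }
final case class Dir(name: String, children: Seq[Node]) extends Node
final case class Leaf(name: String) extends Node

// Analogue of the listing-based catalog: leaf files come from a recursive scan.
class ListingCatalogSketch(root: Dir) extends PartitioningAwareCatalogSketch {
  private def walk(prefix: String, node: Node): Seq[String] = node match {
    case Leaf(n)      => Seq(s"$prefix/$n")
    case Dir(n, kids) => kids.flatMap(walk(s"$prefix/$n", _))
  }
  val leafFiles: Seq[String] = walk("", root)
}

// Analogue of the metadata-log-based catalog: leaf files come from a recorded list.
class MetadataLogCatalogSketch(loggedFiles: Seq[String])
    extends PartitioningAwareCatalogSketch {
  val leafFiles: Seq[String] = loggedFiles
}
```

Because both subclasses only override `leafFiles`, a catalog built from a logged list of paths gets the same partition inference and pruning behaviour as one built from a directory scan, which is the effect the PR description attributes to the new abstract class.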
Changed files:
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (4 additions, 4 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ListingFileCatalog.scala (127 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileCatalog.scala (155 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala (7 additions, 208 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MetadataLogFileCatalog.scala (59 additions, 0 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSinkSuite.scala (56 additions, 8 deletions)
- sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala (2 additions, 7 deletions)