[SPARK-14369] [SQL] Locality support for FileScanRDD
(This PR is a rebased version of PR #12153.)

## What changes were proposed in this pull request?

This PR adds preliminary locality support for `FileFormat` data sources by overriding `FileScanRDD.preferredLocations()`. The strategy can be divided into two parts:

1. Block location lookup

   Unlike `HadoopRDD` or `NewHadoopRDD`, `FileScanRDD` doesn't have access to the underlying `InputFormat` or `InputSplit`, and thus can't rely on `InputSplit.getLocations()` to gather locality information. Instead, this PR queries block locations using `FileSystem.getBlockLocations()` after listing all `FileStatus`es in `HDFSFileCatalog`, and converts all `FileStatus`es into `LocatedFileStatus`es.

   Note that although S3/S3A/S3N file systems don't provide valid locality information, their `getLocatedStatus()` implementations don't actually issue remote calls either. So there's no need to special-case these file systems.

2. Selecting preferred locations

   For each `FilePartition`, we pick the top 3 locations that contain the most data to be retrieved. This isn't necessarily the best algorithm out there. Further improvements may be brought up in follow-up PRs.

## How was this patch tested?

Tested by overriding the default `FileSystem` implementation for `file:///` with a mocked one, which returns mocked block locations.

Author: Cheng Lian <lian@databricks.com>

Closes #12527 from liancheng/spark-14369-locality-rebased.
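
The "selecting preferred locations" step described in the commit message can be illustrated with a minimal, self-contained sketch. This is not the code from the patch: `Block`, `FileSlice`, `LocalitySketch`, and `preferredHosts` are hypothetical names invented here, and the tie-breaking by host name is an assumption added only to make the result deterministic. The sketch assumes each file slice in a partition already carries the block locations that cover it, sums the bytes each host would serve, and keeps the top 3 hosts.

```scala
// Hypothetical, simplified model of a partition's contents (not Spark's classes).
// A Block is a byte range of a file together with the hosts that store it.
final case class Block(offset: Long, length: Long, hosts: Seq[String])
// A FileSlice is the byte range of one file that a partition will read.
final case class FileSlice(start: Long, length: Long, blocks: Seq[Block])

object LocalitySketch {
  // Bytes shared by [aStart, aStart + aLen) and [bStart, bStart + bLen).
  def overlap(aStart: Long, aLen: Long, bStart: Long, bLen: Long): Long =
    math.max(0L, math.min(aStart + aLen, bStart + bLen) - math.max(aStart, bStart))

  // Returns up to topN hosts, ordered by how many bytes of the partition they hold.
  def preferredHosts(slices: Seq[FileSlice], topN: Int = 3): Seq[String] = {
    // One (host, bytes) pair per host of every block that overlaps a slice.
    val bytesPerHost = for {
      slice <- slices
      block <- slice.blocks
      bytes = overlap(slice.start, slice.length, block.offset, block.length)
      if bytes > 0
      host <- block.hosts
    } yield host -> bytes

    bytesPerHost
      .groupBy(_._1)
      .map { case (host, pairs) => host -> pairs.map(_._2).sum }
      .toSeq
      // Most bytes first; ties broken by host name (an assumption of this sketch).
      .sortBy { case (host, bytes) => (-bytes, host) }
      .take(topN)
      .map(_._1)
  }
}
```

Calling `LocalitySketch.preferredHosts(slices)` on the slices of one partition yields at most three host names, which is the shape of value an RDD's `preferredLocations` is expected to return. Only the bytes that actually overlap the slice are counted, so a host holding a block that the partition reads only partially is weighted accordingly.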
Showing
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileScanRDD.scala (23 additions, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategy.scala (48 additions, 5 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/fileSourceInterfaces.scala (63 additions, 21 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/FileSourceStrategySuite.scala (97 additions, 9 deletions)
- sql/core/src/test/scala/org/apache/spark/sql/test/SQLTestUtils.scala (21 additions, 0 deletions)
- sql/hive/src/test/scala/org/apache/spark/sql/sources/hadoopFsRelationSuites.scala (39 additions, 1 deletion)