-
- Downloads
[SPARK-18726][SQL] resolveRelation for FileFormat DataSource don't need to listFiles twice
## What changes were proposed in this pull request? Currently when we resolveRelation for a `FileFormat DataSource` without providing user schema, it will execute `listFiles` twice in `InMemoryFileIndex` during `resolveRelation`. This PR add a `FileStatusCache` for DataSource, this can avoid listFiles twice. But there is a bug in `InMemoryFileIndex` see: [SPARK-19748](https://github.com/apache/spark/pull/17079) [SPARK-19761](https://github.com/apache/spark/pull/17093), so this pr should be after SPARK-19748/ SPARK-19761. ## How was this patch tested? unit test added Author: windpiger <songjun@outlook.com> Closes #17081 from windpiger/resolveDataSourceScanFilesTwice.
Showing
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala 9 additions, 4 deletions...g/apache/spark/sql/execution/datasources/DataSource.scala
- sql/hive/src/test/scala/org/apache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala 11 additions, 0 deletions...pache/spark/sql/hive/PartitionedTablePerfStatsSuite.scala
Loading
Please register or sign in to comment