-
- Downloads
[SPARK-15678] Add support to REFRESH data source paths
## What changes were proposed in this pull request? Spark currently incorrectly continues to use cached data even if the underlying data is overwritten. Current behavior: ```scala val dir = "/tmp/test" sqlContext.range(1000).write.mode("overwrite").parquet(dir) val df = sqlContext.read.parquet(dir).cache() df.count() // outputs 1000 sqlContext.range(10).write.mode("overwrite").parquet(dir) sqlContext.read.parquet(dir).count() // outputs 1000 <---- We are still using the cached dataset ``` This patch fixes this bug by adding support for `REFRESH path` that invalidates and refreshes all the cached data (and the associated metadata) for any dataframe that contains the given data source path. Expected behavior: ```scala val dir = "/tmp/test" sqlContext.range(1000).write.mode("overwrite").parquet(dir) val df = sqlContext.read.parquet(dir).cache() df.count() // outputs 1000 sqlContext.range(10).write.mode("overwrite").parquet(dir) spark.catalog.refreshResource(dir) sqlContext.read.parquet(dir).count() // outputs 10 <---- We are not using the cached dataset ``` ## How was this patch tested? Unit tests for overwrites and appends in `ParquetQuerySuite` and `CachedTableSuite`. Author: Sameer Agarwal <sameer@databricks.com> Closes #13566 from sameeragarwal/refresh-path-2.
Showing
- sql/catalyst/src/main/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4 1 addition, 0 deletions...in/antlr4/org/apache/spark/sql/catalyst/parser/SqlBase.g4
- sql/core/src/main/scala/org/apache/spark/sql/catalog/Catalog.scala 7 additions, 0 deletions...src/main/scala/org/apache/spark/sql/catalog/Catalog.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/CacheManager.scala 50 additions, 1 deletion...n/scala/org/apache/spark/sql/execution/CacheManager.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/SparkSqlParser.scala 8 additions, 1 deletion...scala/org/apache/spark/sql/execution/SparkSqlParser.scala
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/ddl.scala 9 additions, 0 deletions...cala/org/apache/spark/sql/execution/datasources/ddl.scala
- sql/core/src/main/scala/org/apache/spark/sql/internal/CatalogImpl.scala 10 additions, 0 deletions...ain/scala/org/apache/spark/sql/internal/CatalogImpl.scala
- sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetQuerySuite.scala 28 additions, 0 deletions...sql/execution/datasources/parquet/ParquetQuerySuite.scala
- sql/hive/src/test/scala/org/apache/spark/sql/hive/CachedTableSuite.scala 45 additions, 0 deletions...st/scala/org/apache/spark/sql/hive/CachedTableSuite.scala
Loading
Please register or sign in to comment