-
- Downloads
[SPARK-15458][SQL][STREAMING] Disable schema inference for streaming datasets on file streams
## What changes were proposed in this pull request? If the user relies on the schema to be inferred in file streams can break easily for multiple reasons - accidentally running on a directory which has no data - schema changing underneath - on restart, the query will infer schema again, and may unexpectedly infer incorrect schema, as the file in the directory may be different at the time of the restart. To avoid these complicated scenarios, for Spark 2.0, we are going to disable schema inferencing by default with a config, so that user is forced to consider explicitly what is the schema it wants, rather than the system trying to infer it and run into weird corner cases. In this PR, I introduce a SQLConf that determines whether schema inference for file streams is allowed or not. It is disabled by default. ## How was this patch tested? Updated unit tests that test error behavior with and without schema inference enabled. Author: Tathagata Das <tathagata.das1565@gmail.com> Closes #13238 from tdas/SPARK-15458.
Showing
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala 11 additions, 0 deletions...g/apache/spark/sql/execution/datasources/DataSource.scala
- sql/core/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala 7 additions, 0 deletions...rc/main/scala/org/apache/spark/sql/internal/SQLConf.scala
- sql/core/src/test/scala/org/apache/spark/sql/streaming/FileStreamSourceSuite.scala 148 additions, 91 deletions...rg/apache/spark/sql/streaming/FileStreamSourceSuite.scala
Loading
Please register or sign in to comment