[SPARK-17666] Ensure that RecordReaders are closed by data source file scans
## What changes were proposed in this pull request?

This patch addresses a potential cause of resource leaks in data source file scans. As reported in [SPARK-17666](https://issues.apache.org/jira/browse/SPARK-17666), tasks which do not fully consume their input may leak file handles or network connections (e.g. S3 connections). Spark's `NewHadoopRDD` uses a TaskContext callback to [close its record readers](https://github.com/apache/spark/blame/master/core/src/main/scala/org/apache/spark/rdd/NewHadoopRDD.scala#L208), but the new data source file scans only close record readers once their iterators are fully consumed.

This patch modifies `RecordReaderIterator` and `HadoopFileLinesReader` to add `close()` methods, and modifies all six implementations of `FileFormat.buildReader()` to register TaskContext task completion callbacks to guarantee that cleanup is eventually performed.

## How was this patch tested?

Tested manually for now.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #15245 from JoshRosen/SPARK-17666-close-recordreader.
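The close-on-task-completion pattern described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this patch: `SketchTaskContext`, `SketchReader`, and `ClosingIterator` are hypothetical stand-ins for Spark's `TaskContext`, a Hadoop record reader, and `RecordReaderIterator`, and the callback signature is simplified. It assumes that `close()` should be idempotent, since it can be reached both when the iterator is exhausted and when the task completes.

```scala
import java.io.Closeable
import scala.collection.mutable.ArrayBuffer

// Hypothetical stand-in for Spark's TaskContext: it only models
// registering completion callbacks and firing them when the task ends.
final class SketchTaskContext {
  private val listeners = ArrayBuffer.empty[() => Unit]
  def addTaskCompletionListener(f: () => Unit): Unit = listeners += f
  def markTaskCompleted(): Unit = listeners.foreach(f => f())
}

// Hypothetical record reader that counts how often it has been closed.
final class SketchReader(records: Seq[String]) extends Closeable {
  private var pos = 0
  var closeCount = 0
  def nextRecord(): Option[String] =
    if (pos < records.length) { pos += 1; Some(records(pos - 1)) } else None
  override def close(): Unit = closeCount += 1
}

// Iterator that closes its reader when exhausted and also exposes an
// idempotent close() so a task completion callback can release it early.
final class ClosingIterator(private var reader: SketchReader)
    extends Iterator[String] with Closeable {
  private var buffered: Option[String] = None

  override def hasNext: Boolean = {
    if (buffered.isEmpty && reader != null) {
      buffered = reader.nextRecord()
      if (buffered.isEmpty) close() // input exhausted: release right away
    }
    buffered.nonEmpty
  }

  override def next(): String = {
    if (!hasNext) throw new NoSuchElementException("end of stream")
    val record = buffered.get
    buffered = None
    record
  }

  // Dropping the reference after the first call makes later calls no-ops.
  override def close(): Unit = if (reader != null) {
    try reader.close() finally reader = null
  }
}

object CloseOnCompletionSketch {
  def main(args: Array[String]): Unit = {
    val reader = new SketchReader(Seq("a", "b", "c"))
    val context = new SketchTaskContext
    val iter = new ClosingIterator(reader)
    // Register cleanup up front, before any record is consumed.
    context.addTaskCompletionListener(() => iter.close())

    iter.next()                 // the task reads one record, then stops early
    context.markTaskCompleted() // the callback closes the reader anyway
    println(s"reader closed ${reader.closeCount} time(s)")
  }
}
```

Registering the callback before the first record is read means the reader is released even when a task stops early, for example under a `LIMIT`, while the null check keeps the close-on-exhaustion path and the callback path from closing the reader twice.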
Changed files:
- mllib/src/main/scala/org/apache/spark/ml/source/libsvm/LibSVMRelation.scala (5 additions, 2 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFileLinesReader.scala (5 additions, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/RecordReaderIterator.scala (19 additions, 2 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/csv/CSVFileFormat.scala (4 additions, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala (4 additions, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala (2 additions, 1 deletion)
- sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/text/TextFileFormat.scala (2 additions, 0 deletions)
- sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcFileFormat.scala (5 additions, 1 deletion)