Commit 0ee5419b authored by Tathagata Das, committed by Michael Armbrust

[SPARK-14970][SQL] Prevent DataSource from enumerating all files in a directory if there is a user-specified schema

## What changes were proposed in this pull request?
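The FileCatalog object is created even if the user specifies a schema, which means the files in the directory are enumerated even though that is not necessary. For large directories this is very slow. Users typically specify a schema precisely to avoid that cost on large directories, so the eager enumeration largely defeats the purpose. This patch moves the path globbing and FileCatalog creation inside `userSpecifiedSchema.orElse { ... }`, so they run only when the schema actually has to be inferred.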
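
The effect of moving the listing into the `orElse` block can be sketched in isolation. This is a minimal illustration, not Spark code: `LazySchemaSketch`, `listFiles`, `inferSchema`, `schemaEager`, and `schemaLazy` are hypothetical stand-ins, and schemas are modelled as plain strings. It relies only on the fact that `Option.orElse` takes its argument by name, so the block is evaluated only when the option is empty.

```scala
// Hypothetical stand-ins for illustration; none of these names come from Spark.
object LazySchemaSketch {
  // Counts how many times the (expensive) directory listing runs.
  var listings = 0

  // Stand-in for building a file catalog, which enumerates every file.
  def listFiles(dir: String): Seq[String] = {
    listings += 1
    Seq(s"$dir/part-00000", s"$dir/part-00001")
  }

  // Stand-in for schema inference, which needs the file listing.
  def inferSchema(files: Seq[String]): Option[String] =
    Some(s"inferred from ${files.size} files")

  // Before: the listing runs unconditionally, even when a schema is supplied.
  def schemaEager(userSchema: Option[String], dir: String): Option[String] = {
    val files = listFiles(dir)
    userSchema.orElse(inferSchema(files))
  }

  // After: the listing sits inside the by-name argument of orElse,
  // so it is skipped whenever userSchema is defined.
  def schemaLazy(userSchema: Option[String], dir: String): Option[String] =
    userSchema.orElse {
      val files = listFiles(dir)
      inferSchema(files)
    }
}
```

With a user-supplied schema, `schemaEager` still increments `listings`, while `schemaLazy` leaves it untouched; with `None`, both perform one listing.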

## How was this patch tested?
This is hard to test with a unit test.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12748 from tdas/SPARK-14970.
parent d5ab42ce
@@ -127,17 +127,16 @@ case class DataSource(
   }
   private def inferFileFormatSchema(format: FileFormat): StructType = {
-    val caseInsensitiveOptions = new CaseInsensitiveMap(options)
-    val allPaths = caseInsensitiveOptions.get("path")
-    val globbedPaths = allPaths.toSeq.flatMap { path =>
-      val hdfsPath = new Path(path)
-      val fs = hdfsPath.getFileSystem(sparkSession.sessionState.newHadoopConf())
-      val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
-      SparkHadoopUtil.get.globPathIfNecessary(qualified)
-    }.toArray
-    val fileCatalog: FileCatalog = new HDFSFileCatalog(sparkSession, options, globbedPaths, None)
     userSpecifiedSchema.orElse {
+      val caseInsensitiveOptions = new CaseInsensitiveMap(options)
+      val allPaths = caseInsensitiveOptions.get("path")
+      val globbedPaths = allPaths.toSeq.flatMap { path =>
+        val hdfsPath = new Path(path)
+        val fs = hdfsPath.getFileSystem(sparkSession.sessionState.newHadoopConf())
+        val qualified = hdfsPath.makeQualified(fs.getUri, fs.getWorkingDirectory)
+        SparkHadoopUtil.get.globPathIfNecessary(qualified)
+      }.toArray
+      val fileCatalog: FileCatalog = new HDFSFileCatalog(sparkSession, options, globbedPaths, None)
       format.inferSchema(
         sparkSession,
         caseInsensitiveOptions,
......