[SPARK-7713] [SQL] Use shared broadcast hadoop conf for partitioned table scan.
https://issues.apache.org/jira/browse/SPARK-7713

I tested the performance with the following code:

```scala
import sqlContext._
import sqlContext.implicits._

(1 to 5000).foreach { i =>
  val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b")
  df.save(s"/tmp/partitioned/i=$i")
}

sqlContext.sql("""
CREATE TEMPORARY TABLE partitionedParquet
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/tmp/partitioned'
)""")

table("partitionedParquet").explain(true)
```

On master, `explain` takes 40s on my laptop. With this PR, it takes 14s.

Author: Yin Huai <yhuai@databricks.com>

Closes #6252 from yhuai/broadcastHadoopConf and squashes the following commits:

6fa73df [Yin Huai] Address comments of Josh and Andrew.
807fbf9 [Yin Huai] Make the new buildScan and SqlNewHadoopRDD private sql.
e393555 [Yin Huai] Cheng's comments.
2eb53bb [Yin Huai] Use a shared broadcast Hadoop Configuration for partitioned HadoopFsRelations.
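The core idea of the change is to broadcast a single shared Hadoop `Configuration` once, instead of serializing a separate copy for every partition of a partitioned `HadoopFsRelation`. Below is a minimal sketch of that pattern, not the code from this PR; it assumes the Spark 1.x-era API, where `org.apache.spark.SerializableWritable` is available as a developer API:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.spark.{SerializableWritable, SparkConf, SparkContext}

object BroadcastConfSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("broadcast-conf").setMaster("local[*]"))

    // Build the Hadoop Configuration once on the driver.
    val hadoopConf = new Configuration(sc.hadoopConfiguration)

    // Configuration is not Serializable, so wrap it in SerializableWritable
    // and broadcast it once, rather than shipping a fresh copy per partition.
    val broadcastedConf = sc.broadcast(new SerializableWritable(hadoopConf))

    // Every task reads the one shared copy from the broadcast variable.
    sc.parallelize(1 to 4, 4).foreach { _ =>
      val conf: Configuration = broadcastedConf.value.value
      // ... use `conf` to create InputFormats, open files, etc.
    }

    sc.stop()
  }
}
```

With thousands of partitions, this turns per-partition `Configuration` serialization into a single broadcast, which is what shrinks the planning time measured above.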
Showing 4 changed files with 387 additions and 48 deletions:
- sql/core/src/main/scala/org/apache/spark/sql/parquet/newParquet.scala (70 additions, 43 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/sources/DataSourceStrategy.scala (16 additions, 3 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/sources/SqlNewHadoopRDD.scala (268 additions, 0 deletions)
- sql/core/src/main/scala/org/apache/spark/sql/sources/interfaces.scala (33 additions, 2 deletions)