Skip to content
Snippets Groups Projects
Commit 77bcceb9 authored by Cheng Lian's avatar Cheng Lian
Browse files

[SPARK-6748] [SQL] Makes QueryPlan.schema a lazy val

`DataFrame.collect()` calls `SparkPlan.executeCollect()`, which consists of a single line:

```scala
execute().map(ScalaReflection.convertRowToScala(_, schema)).collect()
```

The problem is that, `QueryPlan.schema` is a function. And since 1.3.0, `convertRowToScala` starts returning a `GenericRowWithSchema`. Thus, every `GenericRowWithSchema` instance holds a separate copy of the schema object. Also, YJP profiling result of the following simple micro benchmark (executed in Spark shell) shows that constructing the schema object takes up to ~35% CPU time.

```scala
sc.parallelize(1 to 10000000).
  map(i => (i, s"val_$i")).
  toDF("key", "value").
  saveAsParquetFile("file:///tmp/src.parquet")

// Profiling started from this line
sqlContext.parquetFile("file:///tmp/src.parquet").collect()
```

<!-- Reviewable:start -->
[<img src="https://reviewable.io/review_button.png" height=40 alt="Review on Reviewable"/>](https://reviewable.io/reviews/apache/spark/5398)
<!-- Reviewable:end -->

Author: Cheng Lian <lian@databricks.com>

Closes #5398 from liancheng/spark-6748 and squashes the following commits:

3159469 [Cheng Lian] Makes QueryPlan.schema a lazy val
parent fc957dc7
No related branches found
No related tags found
No related merge requests found
......@@ -150,7 +150,7 @@ abstract class QueryPlan[PlanType <: TreeNode[PlanType]] extends TreeNode[PlanTy
}.toSeq
}
def schema: StructType = StructType.fromAttributes(output)
lazy val schema: StructType = StructType.fromAttributes(output)
/** Returns the output schema in the tree format. */
def schemaString: String = schema.treeString
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment