Commit fc29b896 authored by Davies Liu's avatar Davies Liu Committed by Reynold Xin

[SPARK-15392][SQL] fix default value of size estimation of logical plan

## What changes were proposed in this pull request?

We currently use `autoBroadcastJoinThreshold + 1L` as the default value for size estimation. That is not good in 2.0, because the size is now calculated from the size of the schema, so the estimate can fall below `autoBroadcastJoinThreshold` if you have a SELECT on top of a DataFrame created from an RDD.

This PR changes the default value to `Long.MaxValue`.
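To see why the old default was fragile, here is a minimal sketch in plain Scala (no Spark dependency). The rescaling formula and the 10 MB threshold are illustrative assumptions standing in for Spark's statistics propagation, not the actual implementation:

```scala
object DefaultSizeSketch {
  // Spark's default autoBroadcastJoinThreshold is 10 MB.
  val autoBroadcastJoinThreshold: Long = 10L * 1024 * 1024

  // Old default estimate: just above the threshold, so an unestimated
  // plan was never broadcast -- as long as the estimate was used as-is.
  val oldDefault: Long = autoBroadcastJoinThreshold + 1L

  // Hypothetical model of how a projection rescales its child's estimate
  // by the ratio of output row width to child row width (illustrative only).
  def projectEstimate(childSize: Long, childRowBytes: Long, outputRowBytes: Long): Long =
    (childSize.toDouble / childRowBytes * outputRowBytes).toLong

  def main(args: Array[String]): Unit = {
    // A SELECT keeping one of two 8-byte columns roughly halves the estimate...
    val rescaled = projectEstimate(oldDefault, childRowBytes = 16, outputRowBytes = 8)
    // ...which drops below the threshold: the plan would wrongly be broadcast.
    println(rescaled <= autoBroadcastJoinThreshold) // true

    // With Long.MaxValue as the default, even a halved estimate stays
    // far above any sane broadcast threshold.
    val rescaledNew = projectEstimate(Long.MaxValue, 16, 8)
    println(rescaledNew > autoBroadcastJoinThreshold) // true
  }
}
```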

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #13179 from davies/fix_default_size.
parent cc6a47dd
@@ -605,8 +605,7 @@ private[sql] class SQLConf extends Serializable with CatalystConf with Logging {
   def enableRadixSort: Boolean = getConf(RADIX_SORT_ENABLED)

-  def defaultSizeInBytes: Long =
-    getConf(DEFAULT_SIZE_IN_BYTES, autoBroadcastJoinThreshold + 1L)
+  def defaultSizeInBytes: Long = getConf(DEFAULT_SIZE_IN_BYTES, Long.MaxValue)

   def isParquetBinaryAsString: Boolean = getConf(PARQUET_BINARY_AS_STRING)
@@ -1476,4 +1476,13 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
       getMessage()
     assert(e1.startsWith("Path does not exist"))
   }
+
+  test("SPARK-15392: DataFrame created from RDD should not be broadcasted") {
+    val rdd = sparkContext.range(1, 100).map(i => Row(i, i))
+    val df = spark.createDataFrame(rdd, new StructType().add("a", LongType).add("b", LongType))
+    assert(df.queryExecution.analyzed.statistics.sizeInBytes >
+      spark.wrapped.conf.autoBroadcastJoinThreshold)
+    assert(df.selectExpr("a").queryExecution.analyzed.statistics.sizeInBytes >
+      spark.wrapped.conf.autoBroadcastJoinThreshold)
+  }
 }