Skip to content
Snippets Groups Projects
Commit f92f334c authored by Jason White's avatar Jason White Committed by Davies Liu
Browse files

[SPARK-11437] [PYSPARK] Don't .take when converting RDD to DataFrame with provided schema

When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.

Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.

marmbrus We chatted about this at SparkSummitEU. davies you made a similar change for the infer-schema path in https://github.com/apache/spark/pull/6606

Author: Jason White <jason.white@shopify.com>

Closes #9392 from JasonMWhite/createDataFrame_without_take.
parent 71d1c907
No related branches found
No related tags found
No related merge requests found
......@@ -318,13 +318,7 @@ class SQLContext(object):
struct.names[i] = name
schema = struct
elif isinstance(schema, StructType):
# take the first few rows to verify schema
rows = rdd.take(10)
for row in rows:
_verify_type(row, schema)
else:
elif not isinstance(schema, StructType):
raise TypeError("schema should be StructType or list or None, but got: %s" % schema)
# convert python objects to sql data
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment