    [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling · 24544fbc
    Davies Liu authored
    This patch tries to infer the schema of an RDD whose first row contains an empty value (None, [], {}). It examines the first 100 rows and merges their types into one schema, also merging the fields of StructType together. If the schema still contains NullType after that, it shows a warning telling the user to retry with sampling.
    
    If a sampling ratio is given, the schema is inferred from all the rows that remain after sampling.
    
    Also adds samplingRatio to jsonFile() and jsonRDD().
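
The merging behaviour described above can be sketched in plain Python. This is only an illustration and not the code from this patch: the helpers `infer_type`, `merge_type`, `contains_null` and `infer_schema`, the string/dict/list type descriptions, and the `seed` parameter are all hypothetical, and rows are assumed to be plain dicts rather than an RDD.

```python
import random
import warnings

NULL = "null"  # stand-in for NullType in this sketch


def infer_type(value):
    """Describe a Python value as a simple type (hypothetical, not PySpark's types)."""
    if value is None:
        return NULL
    if isinstance(value, dict):
        # A dict is treated as a struct; an empty dict carries no type information.
        return {k: infer_type(v) for k, v in value.items()} if value else NULL
    if isinstance(value, list):
        # An empty list has an unknown element type for now.
        return ["array", infer_type(value[0]) if value else NULL]
    return type(value).__name__


def merge_type(a, b):
    """Merge two type descriptions; NULL gives way to any concrete type."""
    if a == NULL:
        return b
    if b == NULL:
        return a
    if isinstance(a, dict) and isinstance(b, dict):
        merged = dict(a)
        for name, t in b.items():
            # Fields seen in either struct are kept; shared fields are merged.
            merged[name] = merge_type(merged[name], t) if name in merged else t
        return merged
    if isinstance(a, list) and isinstance(b, list):
        return ["array", merge_type(a[1], b[1])]
    if a != b:
        raise TypeError("cannot merge %r with %r" % (a, b))
    return a


def contains_null(t):
    """Return True if any part of the type description is still NULL."""
    if t == NULL:
        return True
    if isinstance(t, dict):
        return any(contains_null(v) for v in t.values())
    if isinstance(t, list):
        return contains_null(t[1])
    return False


def infer_schema(rows, sampling_ratio=None, max_rows=100, seed=0):
    """Infer from the first max_rows rows, or from a random sample if a ratio is given."""
    if sampling_ratio is None:
        candidates = rows[:max_rows]
    else:
        rng = random.Random(seed)
        candidates = [r for r in rows if rng.random() < sampling_ratio]
    schema = NULL
    for row in candidates:
        schema = merge_type(schema, infer_type(row))
    if sampling_ratio is None and contains_null(schema):
        warnings.warn("some types could not be inferred; try again with sampling")
    return schema


rows = [
    {"name": None, "tags": []},
    {"name": "a", "tags": ["x"]},
    {"name": "b", "tags": [], "age": 3},
]
schema = infer_schema(rows)
```

With these three rows the empty values in the first row are filled in by the later rows, and the `age` field that only appears in the third row is merged into the struct.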
    
    Author: Davies Liu <davies.liu@gmail.com>
    Author: Davies Liu <davies@databricks.com>
    
    Closes #2716 from davies/infer and squashes the following commits:
    
    e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
    34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
    567dc60 [Davies Liu] update docs
    9767b27 [Davies Liu] Merge branch 'master' into infer
    e48d7fb [Davies Liu] fix tests
    29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
    ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
    540d1d5 [Davies Liu] merge fields for StructType
    f93fd84 [Davies Liu] add more tests
    3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD