    [SPARK-2010] Support for nested data in PySpark SQL
    Kan Zhang authored
    JIRA issue https://issues.apache.org/jira/browse/SPARK-2010
    
    This PR adds support for nested collection types in PySpark SQL, including
    array, dict, list, set, and tuple. For example:
    
    ```
    >>> from array import array
    >>> from pyspark.sql import SQLContext
    >>> sqlCtx = SQLContext(sc)
    >>> rdd = sc.parallelize([
    ...         {"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
    ...         {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}])
    >>> srdd = sqlCtx.inferSchema(rdd)
    >>> srdd.collect() == [{"f1" : array('i', [1, 2]), "f2" : {"row1" : 1.0}},
    ...                    {"f1" : array('i', [2, 3]), "f2" : {"row2" : 2.0}}]
    True
    >>> rdd = sc.parallelize([
    ...         {"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
    ...         {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}])
    >>> srdd = sqlCtx.inferSchema(rdd)
    >>> srdd.collect() == \
    ... [{"f1" : [[1, 2], [2, 3]], "f2" : set([1, 2]), "f3" : (1, 2)},
    ...  {"f1" : [[2, 3], [3, 4]], "f2" : set([2, 3]), "f3" : (2, 3)}]
    True
    ```
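    Conceptually, inference walks each value recursively: dicts map to map types, and
    sequence-like collections (array, list, set, tuple) map to array types whose element
    type is inferred from an element. The sketch below illustrates that idea in plain
    Python; the function name `infer_type` and the string schema notation are
    illustrative only, not Spark's actual implementation or API.

    ```python
    import array

    def infer_type(value):
        """Illustrative sketch: return a string schema for a Python value.

        Not Spark's real inference; names and output format are made up
        to show how nested collections could be classified recursively.
        """
        if isinstance(value, dict):
            # dicts become map types, keyed on inferred key/value types
            k, v = next(iter(value.items()))
            return "map<%s,%s>" % (infer_type(k), infer_type(v))
        if isinstance(value, (list, tuple, set, array.array)):
            # all sequence-like collections become array types
            elem = next(iter(value))
            return "array<%s>" % infer_type(elem)
        return type(value).__name__

    print(infer_type({"row1": 1.0}))     # map<str,float>
    print(infer_type([[1, 2], [2, 3]]))  # array<array<int>>
    ```

    Note that this sketch infers the element type from a single element; handling
    empty or heterogeneous collections is one of the details a real implementation
    has to address.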
    
    Author: Kan Zhang <kzhang@apache.org>
    
    Closes #1041 from kanzhang/SPARK-2010 and squashes the following commits:
    
    1b2891d [Kan Zhang] [SPARK-2010] minor doc change and adding a TODO
    504f27e [Kan Zhang] [SPARK-2010] Support for nested data in PySpark SQL