    SPARK-1374: PySpark API for SparkSQL · c99bcb7f
    Ahir Reddy authored
    An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports SQL queries.
    
    ```
    from pyspark.sql import SQLContext  # classes were moved into sql.py (commits f2312c7 and 3ef074a)
    sqlCtx = SQLContext(sc)
    rdd = sc.parallelize([{"field1": 1, "field2": "row1"}, {"field1": 2, "field2": "row2"}, {"field1": 3, "field2": "row3"}])
    srdd = sqlCtx.applySchema(rdd)
    sqlCtx.registerRDDAsTable(srdd, "table1")
    srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
    srdd2.collect()
    ```
    The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```
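
To make the input constraint concrete (dictionaries with string keys and primitive values), here is a minimal pure-Python sketch of how a schema could be derived from such rows. It is not Spark's implementation: the helper name `infer_schema` and the `PRIMITIVES` tuple are hypothetical, and Python 2's `long` is omitted so the sketch runs on Python 3.

```python
# Hypothetical illustration only; this is NOT how PySpark builds a SchemaRDD.
# Primitive value types accepted by this sketch (Python 2's `long` is left out).
PRIMITIVES = (bool, int, float, str)


def infer_schema(rows):
    """Map each field name to the type name of its values.

    Raises TypeError for non-string keys, non-primitive values,
    or a field whose type differs between rows.
    """
    schema = {}
    for row in rows:
        for key, value in row.items():
            if not isinstance(key, str):
                raise TypeError("field names must be strings: %r" % (key,))
            if not isinstance(value, PRIMITIVES):
                raise TypeError("unsupported value for %r: %r" % (key, value))
            type_name = type(value).__name__
            # setdefault records the first type seen and returns it afterwards
            if schema.setdefault(key, type_name) != type_name:
                raise TypeError("conflicting types for field %r" % key)
    return schema


rows = [
    {"field1": 1, "field2": "row1"},
    {"field1": 2, "field2": "row2"},
    {"field1": 3, "field2": "row3"},
]
schema = infer_schema(rows)
print(schema)
```

Rows holding nested structures such as lists or dictionaries are rejected by this sketch, mirroring the primitive-values-only restriction described above.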
    
    Author: Ahir Reddy <ahirreddy@gmail.com>
    Author: Michael Armbrust <michael@databricks.com>
    
    Closes #363 from ahirreddy/pysql and squashes the following commits:
    
    0294497 [Ahir Reddy] Updated log4j properties to suppress Hive Warns
    307d6e0 [Ahir Reddy] Style fix
    6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies
    3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
    29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD
    f2312c7 [Ahir Reddy] Moved everything into sql.py
    a19afe4 [Ahir Reddy] Doc fixes
    6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL
    521ff6d [Ahir Reddy] Trying to get spark to build with hive
    ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
    ded03e7 [Ahir Reddy] Added doc test for HiveContext
    22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
    e4da06c [Ahir Reddy] Display message if hive is not built into spark
    227a0be [Michael Armbrust] Update API links. Fix Hive example.
    58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api.  Minor fixes.
    4285340 [Michael Armbrust] Fix building of Hive API Docs.
    38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs.
    337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
    40491c9 [Ahir Reddy] PR Changes + Method Visibility
    1836944 [Michael Armbrust] Fix comments.
    e00980f [Michael Armbrust] First draft of python sql programming guide.
    b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test
    f98a422 [Ahir Reddy] HiveContexts
    79621cf [Ahir Reddy] cleaning up cruft
    b406ba0 [Ahir Reddy] doctest formatting
    20936a5 [Ahir Reddy] Added tests and documentation
    e4d21b4 [Ahir Reddy] Added pyrolite dependency
    79f739d [Ahir Reddy] added more tests
    7515ba0 [Ahir Reddy] added more tests :)
    d26ec5e [Ahir Reddy] added test
    e9f5b8d [Ahir Reddy] adding tests
    906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python
    251f99d [Ahir Reddy] for now only allow dictionaries as input
    09b9980 [Ahir Reddy] made jrdd explicitly lazy
    c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
    725c91e [Ahir Reddy] awesome row objects
    55d1c76 [Ahir Reddy] return row objects
    4fe1319 [Ahir Reddy] output dictionaries correctly
    be079de [Ahir Reddy] returning dictionaries works
    cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
    e948bd9 [Ahir Reddy] yippie
    4886052 [Ahir Reddy] even better
    c0fb1c6 [Ahir Reddy] more working
    043ca85 [Ahir Reddy] working
    5496f9f [Ahir Reddy] doesn't crash
    b8b904b [Ahir Reddy] Added schema rdd class
    67ba875 [Ahir Reddy] java to python, and python to java
    bcc0f23 [Ahir Reddy] Java to python
    ab6025d [Ahir Reddy] compiling