  1. Apr 15, 2014
    • SPARK-1374: PySpark API for SparkSQL · c99bcb7f
      Ahir Reddy authored
      An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports SQL queries.
      
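      ```
      # SQLContext lives in pyspark/sql.py (see squashed commits 3ef074a and f2312c7).
      from pyspark.sql import SQLContext

      # `sc` is an existing SparkContext, e.g. the one created by the pyspark shell.
      sqlCtx = SQLContext(sc)
      rdd = sc.parallelize([{"field1": 1, "field2": "row1"}, {"field1": 2, "field2": "row2"}, {"field1": 3, "field2": "row3"}])
      # Convert the RDD of dictionaries into a SchemaRDD and expose it to SQL as "table1".
      srdd = sqlCtx.applySchema(rdd)
      sqlCtx.registerRDDAsTable(srdd, "table1")
      srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 AS f2 FROM table1")
      srdd2.collect()
      ```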
      The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```
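
For readers without a Spark build at hand, the effect of the query can be mimicked in plain Python. This is only a sketch of the expected result, not how SparkSQL evaluates the query; the helper name `select_as` is hypothetical and not part of PySpark.

```python
# Plain-Python stand-in for "SELECT field1 AS f1, field2 AS f2 FROM table1".
# No Spark is involved; select_as is a hypothetical helper for illustration.
rows = [
    {"field1": 1, "field2": "row1"},
    {"field1": 2, "field2": "row2"},
    {"field1": 3, "field2": "row3"},
]


def select_as(rows, aliases):
    """Project each dict onto the aliased columns, like SELECT col AS alias."""
    return [{alias: row[col] for col, alias in aliases.items()} for row in rows]


result = select_as(rows, {"field1": "f1", "field2": "f2"})
print(result)
```

Each input dictionary maps to one output dictionary whose keys are the aliases from the query, which is the shape the commit message describes for `srdd2.collect()`.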
      
      Author: Ahir Reddy <ahirreddy@gmail.com>
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #363 from ahirreddy/pysql and squashes the following commits:
      
      0294497 [Ahir Reddy] Updated log4j properties to suppress Hive Warns
      307d6e0 [Ahir Reddy] Style fix
      6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies
      3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
      29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD
      f2312c7 [Ahir Reddy] Moved everything into sql.py
      a19afe4 [Ahir Reddy] Doc fixes
      6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL
      521ff6d [Ahir Reddy] Trying to get spark to build with hive
      ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
      ded03e7 [Ahir Reddy] Added doc test for HiveContext
      22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
      e4da06c [Ahir Reddy] Display message if hive is not built into spark
      227a0be [Michael Armbrust] Update API links. Fix Hive example.
      58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api.  Minor fixes.
      4285340 [Michael Armbrust] Fix building of Hive API Docs.
      38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs.
      337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
      40491c9 [Ahir Reddy] PR Changes + Method Visibility
      1836944 [Michael Armbrust] Fix comments.
      e00980f [Michael Armbrust] First draft of python sql programming guide.
      b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test
      f98a422 [Ahir Reddy] HiveContexts
      79621cf [Ahir Reddy] cleaning up cruft
      b406ba0 [Ahir Reddy] doctest formatting
      20936a5 [Ahir Reddy] Added tests and documentation
      e4d21b4 [Ahir Reddy] Added pyrolite dependency
      79f739d [Ahir Reddy] added more tests
      7515ba0 [Ahir Reddy] added more tests :)
      d26ec5e [Ahir Reddy] added test
      e9f5b8d [Ahir Reddy] adding tests
      906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python
      251f99d [Ahir Reddy] for now only allow dictionaries as input
      09b9980 [Ahir Reddy] made jrdd explicitly lazy
      c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
      725c91e [Ahir Reddy] awesome row objects
      55d1c76 [Ahir Reddy] return row objects
      4fe1319 [Ahir Reddy] output dictionaries correctly
      be079de [Ahir Reddy] returning dictionaries works
      cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
      e948bd9 [Ahir Reddy] yippie
      4886052 [Ahir Reddy] even better
      c0fb1c6 [Ahir Reddy] more working
      043ca85 [Ahir Reddy] working
      5496f9f [Ahir Reddy] doesn't crash
      b8b904b [Ahir Reddy] Added schema rdd class
      67ba875 [Ahir Reddy] java to python, and python to java
      bcc0f23 [Ahir Reddy] Java to python
      ab6025d [Ahir Reddy] compiling
  2. Mar 09, 2014
    • SPARK-929: Fully deprecate usage of SPARK_MEM · 52834d76
      Aaron Davidson authored
      (Continued from old repo, prior discussion at https://github.com/apache/incubator-spark/pull/615)
      
      This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables:
      SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY
      
      The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public.
      
      SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the "spark.executor.memory" property.
      
      SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory used by jobs launched by spark-class, without possibly affecting executor memory.
      
      Other memory considerations:
      - The repl's memory can be set through the "--drivermem" command-line option, which really just sets SPARK_DRIVER_MEMORY.
      - run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overridden in all cases by spark-class).
      
      This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however.
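
As a rough sketch of the convention described in this message, the snippet below builds the command line and environment for a spark-class launch from Python. The helper name `spark_class_invocation`, the `./bin/spark-class` path, the main class `org.example.MyApp`, its arguments, and the `2g` value are all assumptions made for illustration; only the variable names SPARK_MEM and SPARK_DRIVER_MEMORY and the "main class first" argument order come from the text above.

```python
import os


def spark_class_invocation(main_class, driver_memory, args=(), base_env=None):
    """Return (cmd, env) for a hypothetical spark-class launch.

    The script path and the memory string format are assumptions.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Drop the deprecated variable so it cannot leak into the launch.
    env.pop("SPARK_MEM", None)
    # Driver/job memory is set on its own, leaving executor memory untouched.
    env["SPARK_DRIVER_MEMORY"] = driver_memory
    # The first argument to spark-class is the main class name, not Java options.
    cmd = ["./bin/spark-class", main_class] + list(args)
    return cmd, env


cmd, env = spark_class_invocation(
    "org.example.MyApp",  # hypothetical main class
    "2g",                 # assumed memory string
    ["--input", "data.txt"],
    base_env={"SPARK_MEM": "4g", "PATH": "/usr/bin"},
)
print(cmd)
print(sorted(env))
```

The pair could then be handed to `subprocess.call(cmd, env=env)`. Executor memory is deliberately absent here, since the message says it should be configured through the "spark.executor.memory" property instead.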
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #99 from aarondav/sparkmem and squashes the following commits:
      
      9df4c68 [Aaron Davidson] SPARK-929: Fully deprecate usage of SPARK_MEM
  3. Jan 03, 2014
  4. Dec 29, 2013
  5. Dec 24, 2013
  6. Dec 19, 2013
  7. Sep 26, 2013
  8. Sep 22, 2013
  9. Sep 01, 2013
  10. Aug 29, 2013
    • Change build and run instructions to use assemblies · 53cd50c0
      Matei Zaharia authored
      This commit makes Spark invocation saner by using an assembly JAR to
      find all of Spark's dependencies instead of adding all the JARs in
      lib_managed. It also packages the examples into an assembly and uses
      that as SPARK_EXAMPLES_JAR. Finally, it replaces the old "run" script
      with two better-named scripts: "run-examples" for examples, and
      "spark-class" for Spark internal classes (e.g. REPL, master, etc). This
      is also designed to minimize the confusion people have in trying to use
      "run" to run their own classes; it's not meant to do that, but now at
      least if they look at it, they can modify run-examples to do a decent
      job for them.
      
      As part of this, Bagel's examples are also now properly moved to the
      examples package instead of bagel.
  11. Aug 28, 2013
  12. Jul 16, 2013
  13. Jan 01, 2013
  14. Dec 29, 2012
  15. Dec 28, 2012
  16. Oct 19, 2012
  17. Aug 21, 2012
  18. Aug 19, 2012