  1. Aug 06, 2014
    • [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically · d614967b
      Nicholas Chammas authored
      As described in [SPARK-2627](https://issues.apache.org/jira/browse/SPARK-2627), we'd like Python code to automatically be checked for PEP 8 compliance by Jenkins. This pull request aims to do that.
      
      Notes:
      * We may need to install [`pep8`](https://pypi.python.org/pypi/pep8) on the build server.
      * I'm expecting tests to fail now that PEP 8 compliance is being checked as part of the build. I'm fine with cleaning up any remaining PEP 8 violations as part of this pull request.
      * I did not understand why the RAT and scalastyle reports are saved to text files. I did the same for the PEP 8 check, but only so that the console output style can match those for the RAT and scalastyle checks. The PEP 8 report is removed right after the check is complete.
      * Updates to the ["Contributing to Spark"](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) guide will be submitted elsewhere, as I don't believe that text is part of the Spark repo.
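As an illustration of the kind of rule the checker enforces, here is a minimal sketch of a single pep8 rule, the 79-character line limit (reported as E501). `check_line_lengths` is a hypothetical stand-in for illustration, not the actual tool or build script:

```python
MAX_LINE_LENGTH = 79  # PEP 8's default limit, reported by pep8 as E501

def check_line_lengths(source):
    """Return pep8-style violation strings for overlong lines."""
    violations = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if len(line) > MAX_LINE_LENGTH:
            violations.append(
                "%d:%d: E501 line too long (%d > %d characters)"
                % (lineno, MAX_LINE_LENGTH + 1, len(line), MAX_LINE_LENGTH))
    return violations
```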
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1744 from nchammas/master and squashes the following commits:
      
      274b238 [Nicholas Chammas] [SPARK-2627] [PySpark] minor indentation changes
      983d963 [nchammas] Merge pull request #5 from apache/master
      1db5314 [nchammas] Merge pull request #4 from apache/master
      0e0245f [Nicholas Chammas] [SPARK-2627] undo erroneous whitespace fixes
      bf30942 [Nicholas Chammas] [SPARK-2627] PEP8: comment spacing
      6db9a44 [nchammas] Merge pull request #3 from apache/master
      7b4750e [Nicholas Chammas] merge upstream changes
      91b7584 [Nicholas Chammas] [SPARK-2627] undo unnecessary line breaks
      44e3e56 [Nicholas Chammas] [SPARK-2627] use tox.ini to exclude files
      b09fae2 [Nicholas Chammas] don't wrap comments unnecessarily
      bfb9f9f [Nicholas Chammas] [SPARK-2627] keep up with the PEP 8 fixes
      9da347f [nchammas] Merge pull request #2 from apache/master
      aa5b4b5 [Nicholas Chammas] [SPARK-2627] follow Spark bash style for if blocks
      d0a83b9 [Nicholas Chammas] [SPARK-2627] check that pep8 downloaded fine
      dffb5dd [Nicholas Chammas] [SPARK-2627] download pep8 at runtime
      a1ce7ae [Nicholas Chammas] [SPARK-2627] space out test report sections
      21da538 [Nicholas Chammas] [SPARK-2627] it's PEP 8, not PEP8
      6f4900b [Nicholas Chammas] [SPARK-2627] more misc PEP 8 fixes
      fe57ed0 [Nicholas Chammas] removing merge conflict backups
      9c01d4c [nchammas] Merge pull request #1 from apache/master
      9a66cb0 [Nicholas Chammas] resolving merge conflicts
      a31ccc4 [Nicholas Chammas] [SPARK-2627] miscellaneous PEP 8 fixes
      beaa9ac [Nicholas Chammas] [SPARK-2627] fail check on non-zero status
      723ed39 [Nicholas Chammas] always delete the report file
      0541ebb [Nicholas Chammas] [SPARK-2627] call Python linter from run-tests
      12440fa [Nicholas Chammas] [SPARK-2627] add Scala linter
      61c07b9 [Nicholas Chammas] [SPARK-2627] add Python linter
      75ad552 [Nicholas Chammas] make check output style consistent
    • [SPARK-2875] [PySpark] [SQL] handle null in schemaRDD() · 48789117
      Davies Liu authored
Handle nulls in SchemaRDD when converting it into Python.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1802 from davies/json and squashes the following commits:
      
      88e6b1f [Davies Liu] handle null in schemaRDD()
  2. Aug 05, 2014
  3. Aug 04, 2014
    • [SPARK-1687] [PySpark] fix unit tests related to pickable namedtuple · 9fd82dbb
      Davies Liu authored
The serializer module is imported multiple times during doctests, so it's better to make _hijack_namedtuple() safe to call multiple times.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1771 from davies/fix and squashes the following commits:
      
      1a9e336 [Davies Liu] fix unit tests
    • [SPARK-1687] [PySpark] pickable namedtuple · 59f84a95
      Davies Liu authored
Add a hook to replace the original namedtuple with a picklable one, so that namedtuple can be used in RDDs.

PS: pyspark should be imported BEFORE "from collections import namedtuple"
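A rough sketch of the kind of hook described, with illustrative helper names (`_restore`, `_hack_namedtuple`); this is not the actual PySpark implementation. The trick is to give each generated class a `__reduce__` that pickles instances by class name, field names, and values, so even namedtuples defined in local scopes survive a pickle round trip:

```python
import collections

_old_namedtuple = collections.namedtuple

def _restore(name, fields, values):
    """Rebuild the class by name and fields when unpickling."""
    return _old_namedtuple(name, fields)(*values)

def _hack_namedtuple(cls):
    """Pickle instances as (class name, fields, values) via __reduce__."""
    name, fields = cls.__name__, cls._fields

    def __reduce__(self):
        return (_restore, (name, fields, tuple(self)))

    cls.__reduce__ = __reduce__
    return cls

def namedtuple(*args, **kwargs):
    return _hack_namedtuple(_old_namedtuple(*args, **kwargs))

# the replacement must be in place BEFORE user code imports namedtuple
collections.namedtuple = namedtuple
```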
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1623 from davies/namedtuple and squashes the following commits:
      
      045dad8 [Davies Liu] remove unrelated code changes
      4132f32 [Davies Liu] address comment
      55b1c1a [Davies Liu] fix tests
      61f86eb [Davies Liu] replace all the reference of namedtuple to new hacked one
      98df6c6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
      f7b1bde [Davies Liu] add hack for CloudPickleSerializer
      0c5c849 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
      21991e6 [Davies Liu] hack namedtuple in __main__ module, make it picklable.
      93b03b8 [Davies Liu] pickable namedtuple
  4. Aug 03, 2014
    • [SPARK-1740] [PySpark] kill the python worker · 55349f9f
      Davies Liu authored
Kill only the Python worker related to cancelled tasks.

The daemon starts a background thread to monitor all the open sockets for all workers. If a socket is closed by the JVM, this thread kills the corresponding worker.

When a task is cancelled, the socket to its worker is closed, and the daemon then kills that worker.
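A rough sketch of the monitoring idea (hypothetical `monitor_worker`, POSIX-only; not the actual daemon.py code). A zero-byte read on a readable socket is the standard signal that the peer closed the connection:

```python
import os
import select
import signal
import threading

def monitor_worker(sock, pid):
    """Kill worker `pid` once the JVM closes its end of `sock`."""
    def run():
        while True:
            ready, _, _ = select.select([sock], [], [], 1.0)
            # A zero-byte recv on a readable socket means the peer closed it.
            if ready and sock.recv(4096) == b"":
                os.kill(pid, signal.SIGKILL)
                return
    t = threading.Thread(target=run)
    t.daemon = True
    t.start()
    return t
```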
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1643 from davies/kill and squashes the following commits:
      
      8ffe9f3 [Davies Liu] kill worker by deamon, because runtime.exec() is too heavy
      46ca150 [Davies Liu] address comment
      acd751c [Davies Liu] kill the worker when task is canceled
    • [SPARK-2784][SQL] Deprecate hql() method in favor of a config option, 'spark.sql.dialect' · 236dfac6
      Michael Armbrust authored
Many users have reported being confused by the distinction between the `sql` and `hql` methods.  Specifically, many users think that `sql(...)` cannot be used to read Hive tables.  In this PR I introduce a new configuration option, `spark.sql.dialect`, that picks which dialect will be used for parsing.  For SQLContext this must be set to `sql`.  In `HiveContext` it defaults to `hiveql` but can also be set to `sql`.
      
      The `hql` and `hiveql` methods continue to act the same but are now marked as deprecated.
      
      **This is a possibly breaking change for some users unless they set the dialect manually, though this is unlikely.**
      
For example: `hiveContext.sql("SELECT 1")` will now throw a parsing exception by default.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1746 from marmbrus/sqlLanguageConf and squashes the following commits:
      
      ad375cc [Michael Armbrust] Merge remote-tracking branch 'apache/master' into sqlLanguageConf
      20c43f8 [Michael Armbrust] override function instead of just setting the value
      7e4ae93 [Michael Armbrust] Deprecate hql() method in favor of a config option, 'spark.sql.dialect'
  5. Aug 02, 2014
    • [SPARK-2739][SQL] Rename registerAsTable to registerTempTable · 1a804373
      Michael Armbrust authored
      There have been user complaints that the difference between `registerAsTable` and `saveAsTable` is too subtle.  This PR addresses this by renaming `registerAsTable` to `registerTempTable`, which more clearly reflects what is happening.  `registerAsTable` remains, but will cause a deprecation warning.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1743 from marmbrus/registerTempTable and squashes the following commits:
      
      d031348 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
      4dff086 [Michael Armbrust] Fix .java files too
      89a2f12 [Michael Armbrust] Merge remote-tracking branch 'apache/master' into registerTempTable
      0b7b71e [Michael Armbrust] Rename registerAsTable to registerTempTable
    • [SPARK-2797] [SQL] SchemaRDDs don't support unpersist() · d210022e
      Yin Huai authored
      The cause is explained in https://issues.apache.org/jira/browse/SPARK-2797.
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1745 from yhuai/SPARK-2797 and squashes the following commits:
      
      7b1627d [Yin Huai] The unpersist method of the Scala RDD cannot be called without the input parameter (blocking) from PySpark.
    • [SPARK-2097][SQL] UDF Support · 158ad0bb
      Michael Armbrust authored
      This patch adds the ability to register lambda functions written in Python, Java or Scala as UDFs for use in SQL or HiveQL.
      
      Scala:
      ```scala
      registerFunction("strLenScala", (_: String).length)
      sql("SELECT strLenScala('test')")
      ```
      Python:
      ```python
      sqlCtx.registerFunction("strLenPython", lambda x: len(x), IntegerType())
      sqlCtx.sql("SELECT strLenPython('test')")
      ```
      Java:
      ```java
      sqlContext.registerFunction("stringLengthJava", new UDF1<String, Integer>() {
  @Override
        public Integer call(String str) throws Exception {
          return str.length();
        }
      }, DataType.IntegerType);
      
      sqlContext.sql("SELECT stringLengthJava('test')");
      ```
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1063 from marmbrus/udfs and squashes the following commits:
      
      9eda0fe [Michael Armbrust] newline
      747c05e [Michael Armbrust] Add some scala UDF tests.
      d92727d [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
      005d684 [Michael Armbrust] Fix naming and formatting.
      d14dac8 [Michael Armbrust] Fix last line of autogened java files.
      8135c48 [Michael Armbrust] Move UDF unit tests to pyspark.
      40b0ffd [Michael Armbrust] Merge remote-tracking branch 'apache/master' into udfs
      6a36890 [Michael Armbrust] Switch logging so that SQLContext can be serializable.
      7a83101 [Michael Armbrust] Drop toString
      795fd15 [Michael Armbrust] Try to avoid capturing SQLContext.
      e54fb45 [Michael Armbrust] Docs and tests.
      437cbe3 [Michael Armbrust] Update use of dataTypes, fix some python tests, address review comments.
      01517d6 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
      8e6c932 [Michael Armbrust] WIP
      3f96a52 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into udfs
      6237c8d [Michael Armbrust] WIP
      2766f0b [Michael Armbrust] Move udfs support to SQL from hive. Add support for Java UDFs.
      0f7d50c [Michael Armbrust] Draft of native Spark SQL UDFs for Scala and Python.
    • [SPARK-2478] [mllib] DecisionTree Python API · 3f67382e
      Joseph K. Bradley authored
      Added experimental Python API for Decision Trees.
      
      API:
      * class DecisionTreeModel
      ** predict() for single examples and RDDs, taking both feature vectors and LabeledPoints
      ** numNodes()
      ** depth()
      ** __str__()
      * class DecisionTree
      ** trainClassifier()
      ** trainRegressor()
      ** train()
      
      Examples and testing:
      * Added example testing classification and regression with batch prediction: examples/src/main/python/mllib/tree.py
      * Have also tested example usage in doc of python/pyspark/mllib/tree.py which tests single-example prediction with dense and sparse vectors
      
      Also: Small bug fix in python/pyspark/mllib/_common.py: In _linear_predictor_typecheck, changed check for RDD to use isinstance() instead of type() in order to catch RDD subclasses.
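The isinstance() change matters because an exact type() check fails for subclasses; a minimal illustration with stand-in classes (not the real RDD hierarchy):

```python
class RDD(object):
    pass

class PipelinedRDD(RDD):  # stand-in for an RDD subclass
    pass

rdd = PipelinedRDD()
assert type(rdd) is not RDD   # exact-type check misses subclasses
assert isinstance(rdd, RDD)   # isinstance() accepts them
```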
      
      CC mengxr manishamde
      
      Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
      
      Closes #1727 from jkbradley/decisiontree-python-new and squashes the following commits:
      
      3744488 [Joseph K. Bradley] Renamed test tree.py to decision_tree_runner.py Small updates based on github review.
      6b86a9d [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      affceb9 [Joseph K. Bradley] * Fixed bug in doc tests in pyspark/mllib/util.py caused by change in loadLibSVMFile behavior.  (It used to threshold labels at 0 to make them 0/1, but it now leaves them as they are.) * Fixed small bug in loadLibSVMFile: If a data file had no features, then loadLibSVMFile would create a single all-zero feature.
      67a29bc [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      cf46ad7 [Joseph K. Bradley] Python DecisionTreeModel * predict(empty RDD) returns an empty RDD instead of an error. * Removed support for calling predict() on LabeledPoint and RDD[LabeledPoint] * predict() does not cache serialized RDD any more.
      aa29873 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      bf21be4 [Joseph K. Bradley] removed old run() func from DecisionTree
      fa10ea7 [Joseph K. Bradley] Small style update
      7968692 [Joseph K. Bradley] small braces typo fix
      e34c263 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      4801b40 [Joseph K. Bradley] Small style update to DecisionTreeSuite
      db0eab2 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix2' into decisiontree-python-new
      6873fa9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      225822f [Joseph K. Bradley] Bug: In DecisionTree, the method sequentialBinSearchForOrderedCategoricalFeatureInClassification() indexed bins from 0 to (math.pow(2, featureCategories.toInt - 1) - 1). This upper bound is the bound for unordered categorical features, not ordered ones. The upper bound should be the arity (i.e., max value) of the feature.
      93953f1 [Joseph K. Bradley] Likely done with Python API.
      6df89a9 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      4562c08 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      665ba78 [Joseph K. Bradley] Small updates towards Python DecisionTree API
      188cb0d [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      6622247 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      b8fac57 [Joseph K. Bradley] Finished Python DecisionTree API and example but need to test a bit more.
      2b20c61 [Joseph K. Bradley] Small doc and style updates
      1b29c13 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      584449a [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      dab0b67 [Joseph K. Bradley] Added documentation for DecisionTree internals
      8bb8aa0 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      978cfcf [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      6eed482 [Joseph K. Bradley] In DecisionTree: Changed from using procedural syntax for functions returning Unit to explicitly writing Unit return type.
      376dca2 [Joseph K. Bradley] Updated meaning of maxDepth by 1 to fit scikit-learn and rpart. * In code, replaced usages of maxDepth <-- maxDepth + 1 * In params, replace settings of maxDepth <-- maxDepth - 1
      e06e423 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      bab3f19 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      59750f8 [Joseph K. Bradley] * Updated Strategy to check numClassesForClassification only if algo=Classification. * Updates based on comments: ** DecisionTreeRunner *** Made dataFormat arg default to libsvm ** Small cleanups ** tree.Node: Made recursive helper methods private, and renamed them.
      52e17c5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      f5a036c [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      da50db7 [Joseph K. Bradley] Added one more test to DecisionTreeSuite: stump with 2 continuous variables for binary classification.  Caused problems in past, but fixed now.
      8e227ea [Joseph K. Bradley] Changed Strategy so it only requires numClassesForClassification >= 2 for classification
      cd1d933 [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      8ea8750 [Joseph K. Bradley] Bug fix: Off-by-1 when finding thresholds for splits for continuous features.
      8a758db [Joseph K. Bradley] Merge branch 'decisiontree-bugfix' into decisiontree-python-new
      5fe44ed [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-python-new
      2283df8 [Joseph K. Bradley] 2 bug fixes.
      73fbea2 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into decisiontree-bugfix
      5f920a1 [Joseph K. Bradley] Demonstration of bug before submitting fix: Updated DecisionTreeSuite so that 3 tests fail.  Will describe bug in next commit.
      f825352 [Joseph K. Bradley] Wrote Python API and example for DecisionTree.  Also added toString, depth, and numNodes methods to DecisionTreeModel.
    • [SPARK-2454] Do not ship spark home to Workers · 148af608
      Andrew Or authored
      When standalone Workers launch executors, they inherit the Spark home set by the driver. This means if the worker machines do not share the same directory structure as the driver node, the Workers will attempt to run scripts (e.g. bin/compute-classpath.sh) that do not exist locally and fail. This is a common scenario if the driver is launched from outside of the cluster.
      
      The solution is to simply not pass the driver's Spark home to the Workers. This PR further makes an attempt to avoid overloading the usages of `spark.home`, which is now only used for setting executor Spark home on Mesos and in python.
      
      This is based on top of #1392 and originally reported by YanTangZhai. Tested on standalone cluster.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1734 from andrewor14/spark-home-reprise and squashes the following commits:
      
      f71f391 [Andrew Or] Revert changes in python
      1c2532c [Andrew Or] Merge branch 'master' of github.com:apache/spark into spark-home-reprise
      188fc5d [Andrew Or] Avoid using spark.home where possible
      09272b7 [Andrew Or] Always use Worker's working directory as spark home
    • StatCounter on NumPy arrays [PYSPARK][SPARK-2012] · 4bc3bb29
      Jeremy Freeman authored
      These changes allow StatCounters to work properly on NumPy arrays, to fix the issue reported here  (https://issues.apache.org/jira/browse/SPARK-2012).
      
If NumPy is installed, the NumPy functions ``maximum``, ``minimum``, and ``sqrt``, which work on arrays, are used to merge statistics. If not, we fall back on scalar operators, so StatCounter works on arrays when NumPy is available and still works without it.
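A sketch of the import-fallback pattern with a toy tracker (`MinMaxTracker` is illustrative, not the actual StatCounter code). The same `minimum`/`maximum` names work elementwise on arrays when NumPy is present and on scalars otherwise:

```python
try:
    # NumPy's elementwise functions handle arrays and scalars alike
    from numpy import maximum, minimum
except ImportError:
    # scalar fallbacks keep things working without NumPy
    maximum, minimum = max, min

class MinMaxTracker(object):
    """Toy stand-in for the min/max part of StatCounter's merge."""
    def __init__(self, value):
        self.minValue = value
        self.maxValue = value

    def merge(self, value):
        self.minValue = minimum(self.minValue, value)
        self.maxValue = maximum(self.maxValue, value)
        return self
```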
      
      New unit tests added, along with a check for NumPy in the tests.
      
      Author: Jeremy Freeman <the.freeman.lab@gmail.com>
      
      Closes #1725 from freeman-lab/numpy-max-statcounter and squashes the following commits:
      
      fe973b1 [Jeremy Freeman] Avoid duplicate array import in tests
      7f0e397 [Jeremy Freeman] Refactored check for numpy
      8e764dd [Jeremy Freeman] Explicit numpy imports
      875414c [Jeremy Freeman] Fixed indents
      1c8a832 [Jeremy Freeman] Unit tests for StatCounter with NumPy arrays
      176a127 [Jeremy Freeman] Use numpy arrays in StatCounter
  6. Aug 01, 2014
    • [SPARK-2550][MLLIB][APACHE SPARK] Support regularization and intercept in pyspark's linear methods. · c2811892
      Michael Giannakopoulos authored
      Related to issue: [SPARK-2550](https://issues.apache.org/jira/browse/SPARK-2550?jql=project%20%3D%20SPARK%20AND%20resolution%20%3D%20Unresolved%20AND%20priority%20%3D%20Major%20ORDER%20BY%20key%20DESC).
      
      Author: Michael Giannakopoulos <miccagiann@gmail.com>
      
      Closes #1624 from miccagiann/new-branch and squashes the following commits:
      
      c02e5f5 [Michael Giannakopoulos] Merge cleanly with upstream/master.
      8dcb888 [Michael Giannakopoulos] Putting the if/else if statements in brackets.
      fed8eaa [Michael Giannakopoulos] Adding a space in the message related to the IllegalArgumentException.
      44e6ff0 [Michael Giannakopoulos] Adding a blank line before python class LinearRegressionWithSGD.
      8eba9c5 [Michael Giannakopoulos] Change function signatures. Exception is thrown from the scala component and not from the python one.
      638be47 [Michael Giannakopoulos] Modified code to comply with code standards.
      ec50ee9 [Michael Giannakopoulos] Shorten the if-elif-else statement in regression.py file
      b962744 [Michael Giannakopoulos] Replaced the enum classes, with strings-keywords for defining the values of 'regType' parameter.
      78853ec [Michael Giannakopoulos] Providing intercept and regualizer functionallity for linear methods in only one function.
      3ac8874 [Michael Giannakopoulos] Added support for regularizer and intercection parameters for linear regression method.
    • [SPARK-2764] Simplify daemon.py process structure · e8e0fd69
      Josh Rosen authored
Currently, daemon.py forks a pool of numProcessors subprocesses, and those processes fork themselves again to create the actual Python worker processes that handle data.
      
      I think that this extra layer of indirection is unnecessary and adds a lot of complexity.  This commit attempts to remove this middle layer of subprocesses by launching the workers directly from daemon.py.
      
      See https://github.com/mesos/spark/pull/563 for the original PR that added daemon.py, where I raise some issues with the current design.
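A rough sketch of launching workers directly via fork, with no intermediate pool layer (hypothetical `launch_worker`, POSIX-only; not the actual daemon.py code):

```python
import os

def launch_worker(task):
    """Fork a worker directly from the daemon, no intermediate pool layer."""
    pid = os.fork()
    if pid == 0:
        # Child: do the work and exit immediately, never returning
        # to the daemon's accept loop.
        os._exit(task())
    return pid  # parent: remember the pid so the worker can be reaped
```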
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1680 from JoshRosen/pyspark-daemon and squashes the following commits:
      
      5abbcb9 [Josh Rosen] Replace magic number: 4 -> EINTR
      5495dff [Josh Rosen] Throw IllegalStateException if worker launch fails.
      b79254d [Josh Rosen] Detect failed fork() calls; improve error logging.
      282c2c4 [Josh Rosen] Remove daemon.py exit logging, since it caused problems:
      8554536 [Josh Rosen] Fix daemon’s shutdown(); log shutdown reason.
      4e0fab8 [Josh Rosen] Remove shared-memory exit_flag; don't die on worker death.
      e9892b4 [Josh Rosen] [WIP] [SPARK-2764] Simplify daemon.py process structure.
    • [SPARK-2010] [PySpark] [SQL] support nested structure in SchemaRDD · 880eabec
      Davies Liu authored
Convert each Row in a JavaSchemaRDD into an Array[Any], unpickle them as tuples in Python, then convert them into namedtuples, so users can access fields just like attributes.

This lets nested structures be accessed as objects; it also reduces the size of serialized data and improves performance.
      
      root
       |-- field1: integer (nullable = true)
       |-- field2: string (nullable = true)
       |-- field3: struct (nullable = true)
       |    |-- field4: integer (nullable = true)
       |    |-- field5: array (nullable = true)
       |    |    |-- element: integer (containsNull = false)
       |-- field6: array (nullable = true)
       |    |-- element: struct (containsNull = false)
       |    |    |-- field7: string (nullable = true)
      
      Then we can access them by row.field3.field5[0]  or row.field6[5].field7
      
It also infers the schema in Python, converting Row/dict/namedtuple/objects into tuples before serialization, then calling applySchema in the JVM. During inferSchema(), a top-level dict in a row becomes a StructType, but any nested dictionary becomes a MapType.
      
You can use pyspark.sql.Row to convert an unnamed structure into a Row object, so the RDD's schema can be inferred. For example:

ctx.inferSchema(rdd.map(lambda x: Row(a=x[0], b=x[1])))
      
      Or you could use Row to create a class just like namedtuple, for example:
      
      Person = Row("name", "age")
      ctx.inferSchema(rdd.map(lambda x: Person(*x)))
      
Also, you can call applySchema to apply a schema to an RDD of tuples/lists and turn it into a SchemaRDD. The `schema` should be a StructType; see the API docs for details.

schema = StructType([StructField("name", StringType, True),
                     StructField("age", IntegerType, True)])
ctx.applySchema(rdd, schema)
      
      PS: In order to use namedtuple to inferSchema, you should make namedtuple picklable.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1598 from davies/nested and squashes the following commits:
      
      f1d15b6 [Davies Liu] verify schema with the first few rows
      8852aaf [Davies Liu] check type of schema
      abe9e6e [Davies Liu] address comments
      61b2292 [Davies Liu] add @deprecated to pythonToJavaMap
      1e5b801 [Davies Liu] improve cache of classes
      51aa135 [Davies Liu] use Row to infer schema
      e9c0d5c [Davies Liu] remove string typed schema
      353a3f2 [Davies Liu] fix code style
      63de8f8 [Davies Liu] fix typo
      c79ca67 [Davies Liu] fix serialization of nested data
      6b258b5 [Davies Liu] fix pep8
      9d8447c [Davies Liu] apply schema provided by string of names
      f5df97f [Davies Liu] refactor, address comments
      9d9af55 [Davies Liu] use arrry to applySchema and infer schema in Python
      84679b3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into nested
      0eaaf56 [Davies Liu] fix doc tests
      b3559b4 [Davies Liu] use generated Row instead of namedtuple
      c4ddc30 [Davies Liu] fix conflict between name of fields and variables
      7f6f251 [Davies Liu] address all comments
      d69d397 [Davies Liu] refactor
      2cc2d45 [Davies Liu] refactor
      182fb46 [Davies Liu] refactor
      bc6e9e1 [Davies Liu] switch to new Schema API
      547bf3e [Davies Liu] Merge branch 'master' into nested
      a435b5a [Davies Liu] add docs and code refactor
      2c8debc [Davies Liu] Merge branch 'master' into nested
      644665a [Davies Liu] use tuple and namedtuple for schemardd
    • [SPARK-2786][mllib] Python correlations · d88e6956
      Doris Xin authored
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1713 from dorx/pythonCorrelation and squashes the following commits:
      
      5f1e60c [Doris Xin] reviewer comments.
      46ff6eb [Doris Xin] reviewer comments.
      ad44085 [Doris Xin] style fix
      e69d446 [Doris Xin] fixed missed conflicts.
      eb5bf56 [Doris Xin] merge master
      cc9f725 [Doris Xin] units passed.
      9141a63 [Doris Xin] WIP2
      d199f1f [Doris Xin] Moved correlation names into a public object
      cd163d6 [Doris Xin] WIP
  7. Jul 31, 2014
    • [SPARK-2724] Python version of RandomRDDGenerators · d8430148
      Doris Xin authored
A Python version of RandomRDDGenerators, but without support for randomRDD and randomVectorRDD, which take an arbitrary DistributionGenerator.
      
      `randomRDD.py` is named to avoid collision with the built-in Python `random` package.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1628 from dorx/pythonRDD and squashes the following commits:
      
      55c6de8 [Doris Xin] review comments. all python units passed.
      f831d9b [Doris Xin] moved default args logic into PythonMLLibAPI
      2d73917 [Doris Xin] fix for linalg.py
      8663e6a [Doris Xin] reverting back to a single python file for random
      f47c481 [Doris Xin] docs update
      687aac0 [Doris Xin] add RandomRDDGenerators.py to run-tests
      4338f40 [Doris Xin] renamed randomRDD to rand and import as random
      29d205e [Doris Xin] created mllib.random package
      bd2df13 [Doris Xin] typos
      07ddff2 [Doris Xin] units passed.
      23b2ecd [Doris Xin] WIP
    • SPARK-2282: Reuse Socket for sending accumulator updates to Pyspark · ef4ff00f
      Aaron Davidson authored
Prior to this change, every PySpark task completion opened a new socket to the accumulator server, passed its updates through, and then quit. I'm not entirely sure why PySpark always sends accumulator updates, but regardless this causes a very rapid buildup of ephemeral TCP connections that remain in the TIME_WAIT state for around a minute before being cleaned up.
      
      Rather than trying to allow these sockets to be cleaned up faster, this patch simply reuses the connection between tasks completions (since they're fed updates in a single-threaded manner by the DAGScheduler anyway).
      
      The only tricky part here was making sure that the AccumulatorServer was able to shutdown in a timely manner (i.e., stop polling for new data), and this was accomplished via minor feats of magic.
      
      I have confirmed that this patch eliminates the buildup of ephemeral sockets due to the accumulator updates. However, I did note that there were still significant sockets being created against the PySpark daemon port, but my machine was not able to create enough sockets fast enough to fail. This may not be the last time we've seen this issue, though.
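A rough sketch of a server loop that reuses one long-lived connection for many updates (hypothetical names and wire format, not the actual accumulator protocol). The zero-byte read doubles as the shutdown signal, so the server stops polling as soon as the peer closes:

```python
import struct

def _recv_exact(conn, n):
    """Read exactly n bytes, or b'' if the peer closed the connection."""
    buf = b""
    while len(buf) < n:
        chunk = conn.recv(n - len(buf))
        if not chunk:
            return b""
        buf += chunk
    return buf

def serve_updates(conn, totals):
    """Apply (accumulator id, delta) updates arriving over one reused socket."""
    while True:
        msg = _recv_exact(conn, 8)
        if not msg:   # peer closed: stop polling in a timely manner
            break
        aid, delta = struct.unpack(">ii", msg)
        totals[aid] = totals.get(aid, 0) + delta
```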
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1503 from aarondav/accum and squashes the following commits:
      
      b3e12f7 [Aaron Davidson] SPARK-2282: Reuse Socket for sending accumulator updates to Pyspark
    • [SPARK-2397][SQL] Deprecate LocalHiveContext · 72cfb139
      Michael Armbrust authored
      LocalHiveContext is redundant with HiveContext.  The only difference is it creates `./metastore` instead of `./metastore_db`.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #1641 from marmbrus/localHiveContext and squashes the following commits:
      
      e5ec497 [Michael Armbrust] Add deprecation version
      626e056 [Michael Armbrust] Don't remove from imports yet
      905cc5f [Michael Armbrust] Merge remote-tracking branch 'apache/master' into localHiveContext
      1c2727e [Michael Armbrust] Deprecate LocalHiveContext
  8. Jul 30, 2014
    • SPARK-2341 [MLLIB] loadLibSVMFile doesn't handle regression datasets · e9b275b7
      Sean Owen authored
      Per discussion at https://issues.apache.org/jira/browse/SPARK-2341 , this is a look at deprecating the multiclass parameter. Thoughts welcome of course.
      
      Author: Sean Owen <srowen@gmail.com>
      
      Closes #1663 from srowen/SPARK-2341 and squashes the following commits:
      
      8a3abd7 [Sean Owen] Suppress MIMA error for removed package private classes
      18a8c8e [Sean Owen] Updates from review
      83d0092 [Sean Owen] Deprecated methods with multiclass, and instead always parse target as a double (ie. multiclass = true)
      e9b275b7
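The essence of the change is that the target is always parsed as a double, so regression labels work the same way as class labels. A minimal standalone sketch of LibSVM line parsing (not MLlib's actual implementation):

```python
def parse_libsvm_line(line):
    """Parse one LibSVM line: 'label idx1:v1 idx2:v2 ...'.

    The label is always parsed as a double, so a regression target
    like 2.5 works as well as a class label like 1.
    """
    parts = line.split()
    label = float(parts[0])
    indices, values = [], []
    for item in parts[1:]:
        i, v = item.split(":")
        indices.append(int(i) - 1)  # LibSVM feature indices are 1-based
        values.append(float(v))
    return label, indices, values
```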
    • Kan Zhang's avatar
      [SPARK-2024] Add saveAsSequenceFile to PySpark · 94d1f46f
      Kan Zhang authored
      JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024
      
      This PR is a followup to #455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats.
      
      * Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs.
      
      * Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types.
      
      * No out-of-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to ```Object[]``` and they get pickled to Python tuples.
      
      * Added HBase and Cassandra output examples to show how custom output formats and converters can be used.
      
      cc MLnick mateiz ahirreddy pwendell
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1338 from kanzhang/SPARK-2024 and squashes the following commits:
      
      c01e3ef [Kan Zhang] [SPARK-2024] code formatting
      6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD
      d998ad6 [Kan Zhang] [SPARK-2024] refactoring to get method params below 10
      57a7a5e [Kan Zhang] [SPARK-2024] correcting typo
      75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD
      0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests
      9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests
      0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases
      7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark
      94d1f46f
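The default-plus-custom converter lookup described above can be sketched in plain Python. This is a toy registry for illustration only; the PR's real converters live on the JVM side and produce actual Hadoop Writable objects, and these names are placeholders.

```python
# Toy registry illustrating default converters with user overrides
# (placeholder names; the real converters are JVM-side).
CONVERTERS = {
    int: lambda x: ("IntWritable", x),
    float: lambda x: ("DoubleWritable", x),
    str: lambda x: ("Text", x),
}

def to_writable(obj, custom=None):
    # a user-supplied converter for this type wins over the default one
    conv = (custom or {}).get(type(obj)) or CONVERTERS.get(type(obj))
    if conv is None:
        raise TypeError("no converter registered for %r" % type(obj))
    return conv(obj)

default_result = to_writable(3)
custom_result = to_writable("hi", {str: lambda s: ("BytesWritable", s.encode())})
```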
    • Naftali Harris's avatar
      Avoid numerical instability · e3d85b7e
      Naftali Harris authored
      This avoids basically doing 1 - 1, for example:
      
      ```python
      >>> from math import exp
      >>> margin = -40
      >>> 1 - 1 / (1 + exp(margin))
      0.0
      >>> exp(margin) / (1 + exp(margin))
      4.248354255291589e-18
      >>>
      ```
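The stable alternative branches on the sign of the margin so `exp` is only ever called on a non-positive value; a standalone sketch of the technique (the function name is illustrative, not the commit's actual code):

```python
from math import exp

def prob_negative(margin):
    """Stable evaluation of 1 - 1/(1 + exp(margin)).

    Algebraically, 1 - 1/(1 + exp(m)) == exp(m)/(1 + exp(m)) == 1/(1 + exp(-m)).
    The branch avoids the catastrophic cancellation shown above for very
    negative margins, and keeps exp() from overflowing for very positive ones.
    """
    if margin <= 0:
        e = exp(margin)
        return e / (1.0 + e)
    return 1.0 / (1.0 + exp(-margin))
```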
      
      Author: Naftali Harris <naftaliharris@gmail.com>
      
      Closes #1652 from naftaliharris/patch-2 and squashes the following commits:
      
      0d55a9f [Naftali Harris] Avoid numerical instability
      e3d85b7e
    • Yin Huai's avatar
      [SPARK-2179][SQL] Public API for DataTypes and Schema · 7003c163
      Yin Huai authored
      The current PR contains the following changes:
      * Expose `DataType`s in the sql package (internal details are private to sql).
      * Users can create Rows.
      * Introduce `applySchema` to create a `SchemaRDD` by applying a `schema: StructType` to an `RDD[Row]`.
      * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
      * `ScalaReflection.typeOfObject` provides a way to infer the Catalyst data type based on an object. Also, we can compose `typeOfObject` with custom logic to form a new function to infer the data type (for different use cases).
      * `JsonRDD` has been refactored to use changes introduced by this PR.
      * Add a field `containsNull` to `ArrayType`. So, we can explicitly mark if an `ArrayType` can contain null values. The default value of `containsNull` is `false`.
      
      New APIs are introduced in the sql package object and SQLContext. You can find the scaladoc at
      [sql package object](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.package) and [SQLContext](http://yhuai.github.io/site/api/scala/index.html#org.apache.spark.sql.SQLContext).
      
      An example of using `applySchema` is shown below.
      ```scala
      import org.apache.spark.sql._
      val sqlContext = new org.apache.spark.sql.SQLContext(sc)
      
      val schema =
        StructType(
          StructField("name", StringType, false) ::
          StructField("age", IntegerType, true) :: Nil)
      
      val people = sc.textFile("examples/src/main/resources/people.txt").map(_.split(",")).map(p => Row(p(0), p(1).trim.toInt))
      val peopleSchemaRDD = sqlContext.applySchema(people, schema)
      peopleSchemaRDD.printSchema
      // root
      // |-- name: string (nullable = false)
      // |-- age: integer (nullable = true)
      
      peopleSchemaRDD.registerAsTable("people")
      sqlContext.sql("select name from people").collect.foreach(println)
      ```
      
      I will add new contents to the SQL programming guide later.
      
      JIRA: https://issues.apache.org/jira/browse/SPARK-2179
      
      Author: Yin Huai <huai@cse.ohio-state.edu>
      
      Closes #1346 from yhuai/dataTypeAndSchema and squashes the following commits:
      
      1d45977 [Yin Huai] Clean up.
      a6e08b4 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      c712fbf [Yin Huai] Converts types of values based on defined schema.
      4ceeb66 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      e5f8df5 [Yin Huai] Scaladoc.
      122d1e7 [Yin Huai] Address comments.
      03bfd95 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      2476ed0 [Yin Huai] Minor updates.
      ab71f21 [Yin Huai] Format.
      fc2bed1 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      bd40a33 [Yin Huai] Address comments.
      991f860 [Yin Huai] Move "asJavaDataType" and "asScalaDataType" to DataTypeConversions.scala.
      1cb35fe [Yin Huai] Add "valueContainsNull" to MapType.
      3edb3ae [Yin Huai] Python doc.
      692c0b9 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      1d93395 [Yin Huai] Python APIs.
      246da96 [Yin Huai] Add java data type APIs to javadoc index.
      1db9531 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      d48fc7b [Yin Huai] Minor updates.
      33c4fec [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      b9f3071 [Yin Huai] Java API for applySchema.
      1c9f33c [Yin Huai] Java APIs for DataTypes and Row.
      624765c [Yin Huai] Tests for applySchema.
      aa92e84 [Yin Huai] Update data type tests.
      8da1a17 [Yin Huai] Add Row.fromSeq.
      9c99bc0 [Yin Huai] Several minor updates.
      1d9c13a [Yin Huai] Update applySchema API.
      85e9b51 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      e495e4e [Yin Huai] More comments.
      42d47a3 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      c3f4a02 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      2e58dbd [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      b8b7db4 [Yin Huai] 1. Move sql package object and package-info to sql-core. 2. Minor updates on APIs. 3. Update scala doc.
      68525a2 [Yin Huai] Update JSON unit test.
      3209108 [Yin Huai] Add unit tests.
      dcaf22f [Yin Huai] Add a field containsNull to ArrayType to indicate if an array can contain null values or not. If an ArrayType is constructed by "ArrayType(elementType)" (the existing constructor), the value of containsNull is false.
      9168b83 [Yin Huai] Update comments.
      fc649d7 [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      eca7d04 [Yin Huai] Add two apply methods which will be used to extract StructField(s) from a StructType.
      949d6bb [Yin Huai] When creating a SchemaRDD for a JSON dataset, users can apply an existing schema.
      7a6a7e5 [Yin Huai] Fix bug introduced by the change made on SQLContext.inferSchema.
      43a45e1 [Yin Huai] Remove sql.util.package introduced in a previous commit.
      0266761 [Yin Huai] Format
      03eec4c [Yin Huai] Merge remote-tracking branch 'upstream/master' into dataTypeAndSchema
      90460ac [Yin Huai] Infer the Catalyst data type from an object and cast a data value to the expected type.
      3fa0df5 [Yin Huai] Provide easier ways to construct a StructType.
      16be3e5 [Yin Huai] This commit contains three changes: * Expose `DataType`s in the sql package (internal details are private to sql). * Introduce `createSchemaRDD` to create a `SchemaRDD` from an `RDD` with a provided schema (represented by a `StructType`) and a provided function to construct `Row`, * Add a function `simpleString` to every `DataType`. Also, the schema represented by a `StructType` can be visualized by `printSchema`.
      7003c163
  9. Jul 29, 2014
    • Josh Rosen's avatar
      [SPARK-2305] [PySpark] Update Py4J to version 0.8.2.1 · 22649b6c
      Josh Rosen authored
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1626 from JoshRosen/SPARK-2305 and squashes the following commits:
      
      03fb283 [Josh Rosen] Update Py4J to version 0.8.2.1.
      22649b6c
    • Davies Liu's avatar
      [SPARK-2674] [SQL] [PySpark] support datetime type for SchemaRDD · f0d880e2
      Davies Liu authored
      Datetime and time values in Python will be converted into java.util.Calendar after serialization, and then into java.sql.Timestamp during inferSchema().
      
      In javaToPython(), Timestamp will be converted into Calendar, then be converted into datetime in Python after pickling.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1601 from davies/date and squashes the following commits:
      
      f0599b0 [Davies Liu] remove tests for sets and tuple in sql, fix list of list
      c9d607a [Davies Liu] convert datetype for runtime
      709d40d [Davies Liu] remove brackets
      96db384 [Davies Liu] support datetime type for SchemaRDD
      f0d880e2
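Under the hood this kind of bridge reduces to carrying an epoch-based timestamp across the Python/JVM boundary. A sketch of the round trip in plain Python, at millisecond precision (illustrative only, not the PR's actual conversion code):

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def to_epoch_millis(dt):
    """Naive datetime -> epoch milliseconds, using exact integer math."""
    delta = dt - EPOCH
    return (delta.days * 86400 + delta.seconds) * 1000 + delta.microseconds // 1000

def from_epoch_millis(ms):
    """Epoch milliseconds -> naive datetime."""
    return EPOCH + timedelta(milliseconds=ms)

dt = datetime(2014, 7, 29, 12, 30, 45, 123000)
restored = from_epoch_millis(to_epoch_millis(dt))
```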
    • Davies Liu's avatar
      [SPARK-791] [PySpark] fix pickle itemgetter with cloudpickle · 92ef0262
      Davies Liu authored
      Fix the problem with pickling operator.itemgetter when it is constructed with multiple indices.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1627 from davies/itemgetter and squashes the following commits:
      
      aabd7fa [Davies Liu] fix pickle itemgetter with cloudpickle
      92ef0262
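For context, cloudpickle serializes an `itemgetter` by probing it: it calls the getter on a dummy object whose `__getitem__` records each requested index. With multiple indices, the probe must return itself so every lookup is captured on the same recorder. A self-contained sketch of that idea using the standard library's `copyreg` hook (not cloudpickle's actual code):

```python
import copyreg
import pickle
from operator import itemgetter

class _Probe:
    """Records every index an itemgetter asks for; returning self lets a
    multi-index getter record all of its indices on one recorder."""
    def __init__(self):
        self.items = []

    def __getitem__(self, item):
        self.items.append(item)
        return self

def _reduce_itemgetter(obj):
    probe = _Probe()
    obj(probe)  # e.g. itemgetter(2, 0) performs probe[2] then probe[0]
    return itemgetter, tuple(probe.items)

copyreg.pickle(itemgetter, _reduce_itemgetter)

get = itemgetter(2, 0)
restored = pickle.loads(pickle.dumps(get))
```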
    • Davies Liu's avatar
      [SPARK-2580] [PySpark] keep silent in worker if JVM close the socket · ccd5ab5f
      Davies Liu authored
      During rdd.take(n), the JVM will close the socket once it has received enough data, and the Python worker should keep silent in this case.
      
      At the same time, the worker should not print the traceback to stderr if it has already sent the traceback to the JVM successfully.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1625 from davies/error and squashes the following commits:
      
      4fbcc6d [Davies Liu] disable log4j during testing when exception is expected.
      cc14202 [Davies Liu] keep silent in worker if JVM close the socket
      ccd5ab5f
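The worker-side behavior amounts to treating an early hang-up by the peer as a normal outcome rather than an error. A hypothetical sketch (the `write_results` helper and the stub classes are illustrative, not PySpark's worker code):

```python
import errno

def write_results(outfile, results):
    """Write results to the JVM socket; stay silent if the peer hung up early."""
    try:
        for r in results:
            outfile.write(r)
        return True
    except IOError as e:
        if e.errno in (errno.EPIPE, errno.ECONNRESET):
            # the JVM closed the socket after take(n) got enough data;
            # swallow the error instead of spamming stderr
            return False
        raise

class ClosedPipe:
    """Stub standing in for a socket whose peer has disconnected."""
    def write(self, data):
        raise IOError(errno.EPIPE, "Broken pipe")

class Sink:
    """Stub standing in for a healthy socket."""
    def __init__(self):
        self.data = []
    def write(self, data):
        self.data.append(data)

ok_closed = write_results(ClosedPipe(), [b"row"])
ok_open = write_results(Sink(), [b"row"])
```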
  10. Jul 28, 2014
    • Josh Rosen's avatar
      [SPARK-1550] [PySpark] Allow SparkContext creation after failed attempts · a7d145e9
      Josh Rosen authored
      This addresses a PySpark issue where a failed attempt to construct SparkContext would prevent any future SparkContext creation.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1606 from JoshRosen/SPARK-1550 and squashes the following commits:
      
      ec7fadc [Josh Rosen] [SPARK-1550] [PySpark] Allow SparkContext creation after failed attempts
      a7d145e9
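The underlying pattern: if constructing the singleton fails, the "active context" guard must be released so a later attempt can succeed. A toy illustration of that guard logic (not PySpark's actual SparkContext code):

```python
class Context:
    """Toy illustration: a failed construction must not block future attempts."""
    _active = None

    def __init__(self, fail=False):
        if Context._active is not None:
            raise ValueError("an active context already exists")
        Context._active = self  # claim the singleton slot...
        try:
            if fail:
                raise RuntimeError("simulated startup failure")
        except Exception:
            Context._active = None  # ...and release it if startup fails
            raise

try:
    Context(fail=True)
except RuntimeError:
    pass

ctx = Context()  # succeeds: the failed attempt did not poison the singleton
```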
  11. Jul 27, 2014
    • Doris Xin's avatar
      [SPARK-2679] [MLLib] Ser/De for Double · 3a69c72e
      Doris Xin authored
      Added a set of serializer/deserializer for Double in _common.py and PythonMLLibAPI in MLLib.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1581 from dorx/doubleSerDe and squashes the following commits:
      
      86a85b3 [Doris Xin] Merge branch 'master' into doubleSerDe
      2bfe7a4 [Doris Xin] Removed magic byte
      ad4d0d9 [Doris Xin] removed a space in unit
      a9020bc [Doris Xin] units passed
      7dad9af [Doris Xin] WIP
      3a69c72e
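Serializing a double for the JVM typically means 8 bytes of big-endian IEEE 754, which Python's `struct` module handles directly. A sketch of the idea only; the PR's actual wire format may differ (the commits above note a magic byte was removed):

```python
import struct

def serialize_double(x):
    """8-byte big-endian IEEE 754 double, the byte order the JVM expects."""
    return struct.pack(">d", x)

def deserialize_double(b):
    return struct.unpack(">d", b)[0]
```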
  12. Jul 26, 2014
    • Josh Rosen's avatar
      [SPARK-2601] [PySpark] Fix Py4J error when transforming pickleFiles · ba46bbed
      Josh Rosen authored
      Similar to SPARK-1034, the problem was that Py4J didn’t cope well with the fake ClassTags used in the Java API.  It doesn’t look like there’s any reason why PythonRDD needs to take a ClassTag, since it just ignores the type of the previous RDD, so I removed the type parameter and we no longer pass ClassTags from Python.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1605 from JoshRosen/spark-2601 and squashes the following commits:
      
      b68e118 [Josh Rosen] Fix Py4J error when transforming pickleFiles [SPARK-2601]
      ba46bbed
    • Davies Liu's avatar
      [SPARK-2652] [PySpark] Tuning some default configs for PySpark · 75663b57
      Davies Liu authored
      Add several default configs for PySpark, related to serialization in the JVM:
      
      spark.serializer = org.apache.spark.serializer.KryoSerializer
      spark.serializer.objectStreamReset = 100
      spark.rdd.compress = True
      
      This will help to reduce the memory usage during RDD.partitionBy()
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1568 from davies/conf and squashes the following commits:
      
      cd316f1 [Davies Liu] remove duplicated line
      f71a355 [Davies Liu] rebase to master, add spark.rdd.compress = True
      8f63f45 [Davies Liu] Merge branch 'master' into conf
      8bc9f08 [Davies Liu] fix unittest
      c04a83d [Davies Liu] some default configs for PySpark
      75663b57
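Defaults like these are typically layered under user-supplied settings, so explicit configuration always wins. A plain-Python sketch of that layering (illustrative only, not PySpark's SparkConf implementation; key names taken from the commit above):

```python
# Illustrative layering of per-application defaults under user settings.
DEFAULTS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.serializer.objectStreamReset": "100",
    "spark.rdd.compress": "True",
}

def with_defaults(user_conf):
    merged = dict(DEFAULTS)
    merged.update(user_conf)  # explicit user settings override the defaults
    return merged

conf = with_defaults({"spark.rdd.compress": "False"})
```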
    • Josh Rosen's avatar
      [SPARK-1458] [PySpark] Expose sc.version in Java and PySpark · cf3e9fd8
      Josh Rosen authored
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1596 from JoshRosen/spark-1458 and squashes the following commits:
      
      fdbb0bf [Josh Rosen] Add SparkContext.version to Python & Java [SPARK-1458]
      cf3e9fd8
  13. Jul 25, 2014
    • Doris Xin's avatar
      [SPARK-2656] Python version of stratified sampling · 2f75a4a3
      Doris Xin authored
      Exact sample size is not supported for now.
      
      Author: Doris Xin <doris.s.xin@gmail.com>
      
      Closes #1554 from dorx/pystratified and squashes the following commits:
      
      4ba927a [Doris Xin] use rel diff (+- 50%) instead of abs diff (+- 50)
      bdc3f8b [Doris Xin] updated unit to check sample holistically
      7713c7b [Doris Xin] Python version of stratified sampling
      2f75a4a3
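"Exact sample size not supported" means each stratum is sampled with an independent Bernoulli trial at its requested fraction, so the resulting sizes are only approximate. A minimal sketch of per-stratum sampling (illustrative, not PySpark's sampleByKey):

```python
import random

def sample_by_key(pairs, fractions, seed=42):
    """Per-stratum Bernoulli sampling: each (key, value) pair is kept with
    probability fractions[key], so sample sizes are approximate."""
    rng = random.Random(seed)
    return [(k, v) for k, v in pairs if rng.random() < fractions[k]]

data = [("a", i) for i in range(10)] + [("b", i) for i in range(10)]
sampled = sample_by_key(data, {"a": 1.0, "b": 0.0})
```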
    • Davies Liu's avatar
      [SPARK-2538] [PySpark] Hash based disk spilling aggregation · 14174abd
      Davies Liu authored
      During aggregation in the Python worker, if the memory usage is above spark.executor.memory, it will fall back to disk-spilling aggregation.
      
      It will split the aggregation into multiple passes; in each pass, it partitions the aggregated data by hash and dumps it to disk. After all the data has been aggregated, it merges all the passes together (partition by partition).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1460 from davies/spill and squashes the following commits:
      
      cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
      37d71f7 [Davies Liu] balance the partitions
      902f036 [Davies Liu] add shuffle.py into run-tests
      dcf03a9 [Davies Liu] fix memory_info() of psutil
      67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
      f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
      e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
      400be01 [Davies Liu] address all the comments
      6178844 [Davies Liu] refactor and improve docs
      fdd0a49 [Davies Liu] add long doc string for ExternalMerger
      1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
      e6cc7f9 [Davies Liu] Merge branch 'master' into spill
      3652583 [Davies Liu] address comments
      e78a0a0 [Davies Liu] fix style
      24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
      57ee7ef [Davies Liu] update docs
      286aaff [Davies Liu] let spilled aggregation in Python configurable
      e9a40f6 [Davies Liu] recursive merger
      6edbd1f [Davies Liu] Hash based disk spilling aggregation
      14174abd
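The spill-then-merge scheme described above can be sketched in miniature: when memory fills, partition the in-memory map by key hash and dump each bucket to disk; at the end, merge the bucket files, combining values that share a key. This toy version omits the recursive merging and memory tracking of the real ExternalMerger:

```python
# Toy sketch of hash-partitioned disk spilling (far simpler than PySpark's
# ExternalMerger; names and structure are illustrative only).
import os
import pickle
import tempfile
from collections import defaultdict

NUM_PARTITIONS = 4

def spill(agg, spill_dir, pass_no):
    """Partition the in-memory map by key hash and dump each bucket to disk."""
    buckets = defaultdict(dict)
    for k, v in agg.items():
        buckets[hash(k) % NUM_PARTITIONS][k] = v
    paths = []
    for p, bucket in buckets.items():
        path = os.path.join(spill_dir, "pass%d-part%d.pkl" % (pass_no, p))
        with open(path, "wb") as f:
            pickle.dump(bucket, f)
        paths.append(path)
    agg.clear()  # free the memory that triggered the spill
    return paths

def merge_spills(paths, combine):
    """Merge all spilled buckets, combining values that share a key."""
    merged = {}
    for path in paths:
        with open(path, "rb") as f:
            for k, v in pickle.load(f).items():
                merged[k] = combine(merged[k], v) if k in merged else v
    return merged

spill_dir = tempfile.mkdtemp()
paths = spill({"a": 1, "b": 2}, spill_dir, 0)
paths += spill({"a": 3}, spill_dir, 1)
totals = merge_spills(paths, lambda x, y: x + y)
```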
  14. Jul 24, 2014
  15. Jul 22, 2014
    • Nicholas Chammas's avatar
      [SPARK-2470] PEP8 fixes to PySpark · 5d16d5bb
      Nicholas Chammas authored
      This pull request aims to resolve all outstanding PEP8 violations in PySpark.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1505 from nchammas/master and squashes the following commits:
      
      98171af [Nicholas Chammas] [SPARK-2470] revert PEP 8 fixes to cloudpickle
      cba7768 [Nicholas Chammas] [SPARK-2470] wrap expression list in parentheses
      e178dbe [Nicholas Chammas] [SPARK-2470] style - change position of line break
      9127d2b [Nicholas Chammas] [SPARK-2470] wrap expression lists in parentheses
      22132a4 [Nicholas Chammas] [SPARK-2470] wrap conditionals in parentheses
      24639bc [Nicholas Chammas] [SPARK-2470] fix whitespace for doctest
      7d557b7 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to tests.py
      8f8e4c0 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to storagelevel.py
      b3b96cf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to statcounter.py
      d644477 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to worker.py
      aa3a7b6 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to sql.py
      1916859 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to shell.py
      95d1d95 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to serializers.py
      a0fec2e [Nicholas Chammas] [SPARK-2470] PEP8 fixes to mllib
      c85e1e5 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to join.py
      d14f2f1 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to __init__.py
      81fcb20 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to resultiterable.py
      1bde265 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to java_gateway.py
      7fc849c [Nicholas Chammas] [SPARK-2470] PEP8 fixes to daemon.py
      ca2d28b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to context.py
      f4e0039 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to conf.py
      a6d5e4b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to cloudpickle.py
      f0a7ebf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to rddsampler.py
      4dd148f [nchammas] Merge pull request #5 from apache/master
      f7e4581 [Nicholas Chammas] unrelated pep8 fix
      a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
      de7292a [nchammas] Merge pull request #4 from apache/master
      2e4fe00 [nchammas] Merge pull request #3 from apache/master
      89fde08 [nchammas] Merge pull request #2 from apache/master
      69f6e22 [Nicholas Chammas] PEP8 fixes
      2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
      6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
      69da6cf [nchammas] Merge pull request #1 from apache/master
      5d16d5bb
  16. Jul 21, 2014
    • Davies Liu's avatar
      [SPARK-2494] [PySpark] make hash of None consistent across machines · 872538c6
      Davies Liu authored
      In CPython, the hash of None differs across machines, which can cause wrong results during shuffle. This PR fixes that.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1371 from davies/hash_of_none and squashes the following commits:
      
      d01745f [Davies Liu] add comments, remove outdated unit tests
      5467141 [Davies Liu] disable hijack of hash, use it only for partitionBy()
      b7118aa [Davies Liu] use __builtin__ instead of __builtins__
      839e417 [Davies Liu] hijack hash to make hash of None consistent across machines
      872538c6
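The issue is that CPython historically derived `hash(None)` from the object's memory address, which differs per process and per machine. A simplified sketch of a portable hash wrapper for partitioning (illustrative; PySpark's actual `portable_hash` differs in detail):

```python
def portable_hash(x):
    """Machine-independent hash for partitioning (simplified sketch)."""
    if x is None:
        # CPython historically hashed None by memory address, so the value
        # differed between machines; pin it to a constant instead
        return 0
    if isinstance(x, tuple):
        # combine element hashes so tuples containing None are also stable
        h = 0x345678
        for item in x:
            h ^= portable_hash(item)
        return h
    return hash(x)
```

Note that in Python 3, string hashes are also randomized per process unless PYTHONHASHSEED is fixed, a related portability concern PySpark later had to handle as well.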
  17. Jul 20, 2014