Skip to content
Snippets Groups Projects
  1. Mar 15, 2017
    • hyukjinkwon's avatar
      [SPARK-19872] [PYTHON] Use the correct deserializer for RDD construction for coalesce/repartition · 7387126f
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to use the correct deserializer, `BatchedSerializer` for RDD construction for coalesce/repartition when the shuffle is enabled. Currently, it is passing `UTF8Deserializer` as is not `BatchedSerializer` from the copied one.
      
      with the file, `text.txt` below:
      
      ```
      a
      b
      
      d
      e
      f
      g
      h
      i
      j
      k
      l
      
      ```
      
      - Before
      
      ```python
      >>> sc.textFile('text.txt').repartition(1).collect()
      ```
      
      ```
      UTF8Deserializer(True)
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/rdd.py", line 811, in collect
          return list(_load_from_socket(port, self._jrdd_deserializer))
        File ".../spark/python/pyspark/serializers.py", line 549, in load_stream
          yield self.loads(stream)
        File ".../spark/python/pyspark/serializers.py", line 544, in loads
          return s.decode("utf-8") if self.use_unicode else s
        File "/System/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/encodings/utf_8.py", line 16, in decode
          return codecs.utf_8_decode(input, errors, True)
      UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
      ```
      
      - After
      
      ```python
      >>> sc.textFile('text.txt').repartition(1).collect()
      ```
      
      ```
      [u'a', u'b', u'', u'd', u'e', u'f', u'g', u'h', u'i', u'j', u'k', u'l', u'']
      ```
      
      ## How was this patch tested?
      
      Unit test in `python/pyspark/tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17282 from HyukjinKwon/SPARK-19872.
      7387126f
  2. Mar 03, 2017
    • Bryan Cutler's avatar
      [SPARK-19348][PYTHON] PySpark keyword_only decorator is not thread-safe · 44281ca8
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      The `keyword_only` decorator in PySpark is not thread-safe.  It writes kwargs to a static class variable in the decorator, which is then retrieved later in the class method as `_input_kwargs`.  If multiple threads are constructing the same class with different kwargs, it becomes a race condition to read from the static class variable before it's overwritten.  See [SPARK-19348](https://issues.apache.org/jira/browse/SPARK-19348) for reproduction code.
      
      This change will write the kwargs to a member variable so that multiple threads can operate on separate instances without the race condition.  It does not protect against multiple threads operating on a single instance, but that is better left to the user to synchronize.
      
      ## How was this patch tested?
      Added new unit tests for using the keyword_only decorator and a regression test that verifies `_input_kwargs` can be overwritten from different class instances.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #16782 from BryanCutler/pyspark-keyword_only-threadsafe-SPARK-19348.
      44281ca8
  3. Feb 28, 2017
  4. Jan 25, 2017
    • Marcelo Vanzin's avatar
      [SPARK-19307][PYSPARK] Make sure user conf is propagated to SparkContext. · 92afaa93
      Marcelo Vanzin authored
      The code was failing to propagate the user conf in the case where the
      JVM was already initialized, which happens when a user submits a
      python script via spark-submit.
      
      Tested with new unit test and by running a python script in a real cluster.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16682 from vanzin/SPARK-19307.
      92afaa93
  5. Dec 20, 2016
    • Holden Karau's avatar
      [SPARK-18576][PYTHON] Add basic TaskContext information to PySpark · 047a9d92
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Adds basic TaskContext information to PySpark.
      
      ## How was this patch tested?
      
      New unit tests to `tests.py` & existing unit tests.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #16211 from holdenk/SPARK-18576-pyspark-taskcontext.
      047a9d92
    • Liang-Chi Hsieh's avatar
      [SPARK-18281] [SQL] [PYSPARK] Remove timeout for reading data through socket for local iterator · 95c95b71
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      There is a timeout failure when using `rdd.toLocalIterator()` or `df.toLocalIterator()` for a PySpark RDD and DataFrame:
      
          df = spark.createDataFrame([[1],[2],[3]])
          it = df.toLocalIterator()
          row = next(it)
      
          df2 = df.repartition(1000)  # create many empty partitions which increase materialization time so causing timeout
          it2 = df2.toLocalIterator()
          row = next(it2)
      
      The cause of this issue is, we open a socket to serve the data from JVM side. We set timeout for connection and reading through the socket in Python side. In Python we use a generator to read the data, so we only begin to connect the socket once we start to ask data from it. If we don't consume it immediately, there is connection timeout.
      
      In the other side, the materialization time for RDD partitions is unpredictable. So we can't set a timeout for reading data through the socket. Otherwise, it is very possibly to fail.
      
      ## How was this patch tested?
      
      Added tests into PySpark.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16263 from viirya/fix-pyspark-localiterator.
      95c95b71
  6. Dec 08, 2016
    • Andrew Ray's avatar
      [SPARK-16589] [PYTHON] Chained cartesian produces incorrect number of records · 3c68944b
      Andrew Ray authored
      ## What changes were proposed in this pull request?
      
      Fixes a bug in the python implementation of rdd cartesian product related to batching that showed up in repeated cartesian products with seemingly random results. The root cause being multiple iterators pulling from the same stream in the wrong order because of logic that ignored batching.
      
      `CartesianDeserializer` and `PairDeserializer` were changed to implement `_load_stream_without_unbatching` and borrow the one line implementation of `load_stream` from `BatchedSerializer`. The default implementation of `_load_stream_without_unbatching` was changed to give consistent results (always an iterable) so that it could be used without additional checks.
      
      `PairDeserializer` no longer extends `CartesianDeserializer` as it was not really proper. If wanted a new common super class could be added.
      
      Both `CartesianDeserializer` and `PairDeserializer` now only extend `Serializer` (which has no `dump_stream` implementation) since they are only meant for *de*serialization.
      
      ## How was this patch tested?
      
      Additional unit tests (sourced from #14248) plus one for testing a cartesian with zip.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #16121 from aray/fix-cartesian.
      3c68944b
  7. Nov 21, 2016
    • Gabriel Huang's avatar
      [SPARK-18361][PYSPARK] Expose RDD localCheckpoint in PySpark · 70176871
      Gabriel Huang authored
      ## What changes were proposed in this pull request?
      
      Expose RDD's localCheckpoint() and associated functions in PySpark.
      
      ## How was this patch tested?
      
      I added a UnitTest in python/pyspark/tests.py which passes.
      
      I certify that this is my original work, and I license it to the project under the project's open source license.
      
      Gabriel HUANG
      Developer at Cardabel (http://cardabel.com/)
      
      Author: Gabriel Huang <gabi.xiaohuang@gmail.com>
      
      Closes #15811 from gabrielhuang/pyspark-localcheckpoint.
      70176871
  8. Oct 11, 2016
    • Liang-Chi Hsieh's avatar
      [SPARK-17817][PYSPARK] PySpark RDD Repartitioning Results in Highly Skewed Partition Sizes · 07508bd0
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      Quoted from JIRA description:
      
      Calling repartition on a PySpark RDD to increase the number of partitions results in highly skewed partition sizes, with most having 0 rows. The repartition method should evenly spread out the rows across the partitions, and this behavior is correctly seen on the Scala side.
      
      Please reference the following code for a reproducible example of this issue:
      
          num_partitions = 20000
          a = sc.parallelize(range(int(1e6)), 2)  # start with 2 even partitions
          l = a.repartition(num_partitions).glom().map(len).collect()  # get length of each partition
          min(l), max(l), sum(l)/len(l), len(l)  # skewed!
      
      In Scala's `repartition` code, we will distribute elements evenly across output partitions. However, the RDD from Python is serialized as a single binary data, so the distribution fails. We need to convert the RDD in Python to java object before repartitioning.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #15389 from viirya/pyspark-rdd-repartition.
      07508bd0
  9. Sep 21, 2016
    • Yanbo Liang's avatar
      [SPARK-17585][PYSPARK][CORE] PySpark SparkContext.addFile supports adding files recursively · d3b88697
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      Users would like to add a directory as dependency in some cases, they can use ```SparkContext.addFile``` with argument ```recursive=true``` to recursively add all files under the directory by using Scala. But Python users can only add file not directory, we should also make it supported.
      
      ## How was this patch tested?
      Unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15140 from yanboliang/spark-17585.
      d3b88697
  10. Jun 28, 2016
    • Yin Huai's avatar
      [SPARK-16224] [SQL] [PYSPARK] SparkSession builder's configs need to be set to... · 0923c4f5
      Yin Huai authored
      [SPARK-16224] [SQL] [PYSPARK] SparkSession builder's configs need to be set to the existing Scala SparkContext's SparkConf
      
      ## What changes were proposed in this pull request?
      When we create a SparkSession at the Python side, it is possible that a SparkContext has been created. For this case, we need to set configs of the SparkSession builder to the Scala SparkContext's SparkConf (we need to do so because conf changes on a active Python SparkContext will not be propagated to the JVM side). Otherwise, we may create a wrong SparkSession (e.g. Hive support is not enabled even if enableHiveSupport is called).
      
      ## How was this patch tested?
      New tests and manual tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #13931 from yhuai/SPARK-16224.
      0923c4f5
  11. May 24, 2016
  12. Apr 09, 2016
    • Holden Karau's avatar
      [SPARK-13687][PYTHON] Cleanup PySpark parallelize temporary files · 00288ea2
      Holden Karau authored
      ## What changes were proposed in this pull request?
      
      Eagerly cleanup PySpark's temporary parallelize cleanup files rather than waiting for shut down.
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #12233 from holdenk/SPARK-13687-cleanup-pyspark-temporary-files.
      00288ea2
  13. Apr 06, 2016
    • Davies Liu's avatar
      [SPARK-14418][PYSPARK] fix unpersist of Broadcast in Python · 90ca1844
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, Broaccast.unpersist() will remove the file of broadcast, which should be the behavior of destroy().
      
      This PR added destroy() for Broadcast in Python, to match the sematics in Scala.
      
      ## How was this patch tested?
      
      Added regression tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12189 from davies/py_unpersist.
      90ca1844
  14. Apr 04, 2016
    • Yong Tang's avatar
      [SPARK-14368][PYSPARK] Support python.spark.worker.memory with upper-case unit. · 7db56244
      Yong Tang authored
      ## What changes were proposed in this pull request?
      
      This fix tries to address the issue in PySpark where `spark.python.worker.memory`
      could only be configured with a lower case unit (`k`, `m`, `g`, `t`). This fix
      allows the upper case unit (`K`, `M`, `G`, `T`) to be used as well. This is to
      conform to the JVM memory string as is specified in the documentation .
      
      ## How was this patch tested?
      
      This fix adds additional test to cover the changes.
      
      Author: Yong Tang <yong.tang.github@outlook.com>
      
      Closes #12163 from yongtang/SPARK-14368.
      7db56244
  15. Mar 06, 2016
    • Shixiong Zhu's avatar
      [SPARK-13697] [PYSPARK] Fix the missing module name of TransformFunctionSerializer.loads · ee913e6e
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      Set the function's module name to `__main__` if it's missing in `TransformFunctionSerializer.loads`.
      
      ## How was this patch tested?
      
      Manually test in the shell.
      
      Before this patch:
      ```
      >>> from pyspark.streaming import StreamingContext
      >>> from pyspark.streaming.util import TransformFunction
      >>> ssc = StreamingContext(sc, 1)
      >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
      >>> func.rdd_wrapper(lambda x: x)
      TransformFunction(<function <lambda> at 0x106ac8b18>)
      >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
      >>> func2 = ssc._transformerSerializer.loads(bytes)
      >>> print(func2.func.__module__)
      None
      >>> print(func2.rdd_wrap_func.__module__)
      None
      >>>
      ```
      After this patch:
      ```
      >>> from pyspark.streaming import StreamingContext
      >>> from pyspark.streaming.util import TransformFunction
      >>> ssc = StreamingContext(sc, 1)
      >>> func = TransformFunction(sc, lambda x: x, sc.serializer)
      >>> func.rdd_wrapper(lambda x: x)
      TransformFunction(<function <lambda> at 0x108bf1b90>)
      >>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
      >>> func2 = ssc._transformerSerializer.loads(bytes)
      >>> print(func2.func.__module__)
      __main__
      >>> print(func2.rdd_wrap_func.__module__)
      __main__
      >>>
      ```
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11535 from zsxwing/loads-module.
      ee913e6e
  16. Jan 20, 2016
  17. Jan 19, 2016
  18. Oct 22, 2015
  19. Oct 19, 2015
  20. Sep 29, 2015
  21. Sep 19, 2015
    • Josh Rosen's avatar
      [SPARK-10710] Remove ability to disable spilling in core and SQL · 2117eea7
      Josh Rosen authored
      It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
      
      This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
      2117eea7
  22. Sep 14, 2015
  23. Jul 22, 2015
    • Matei Zaharia's avatar
      [SPARK-9244] Increase some memory defaults · fe26584a
      Matei Zaharia authored
      There are a few memory limits that people hit often and that we could
      make higher, especially now that memory sizes have grown.
      
      - spark.akka.frameSize: This defaults at 10 but is often hit for map
        output statuses in large shuffles. This memory is not fully allocated
        up-front, so we can just make this larger and still not affect jobs
        that never sent a status that large. We increase it to 128.
      
      - spark.executor.memory: Defaults at 512m, which is really small. We
        increase it to 1g.
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #7586 from mateiz/configs and squashes the following commits:
      
      ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
      fe26584a
  24. Jul 19, 2015
    • Nicholas Hwang's avatar
      [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions())... · a803ac3e
      Nicholas Hwang authored
      [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions.fold()
      
      I'm relatively new to Spark and functional programming, so forgive me if this pull request is just a result of my misunderstanding of how Spark should be used.
      
      Currently, if one happens to use a mutable object as `zeroValue` for `RDD.aggregate()`, possibly unexpected behavior can occur.
      
      This is because pyspark's current implementation of `RDD.aggregate()` does not serialize or make a copy of `zeroValue` before handing it off to `RDD.mapPartitions(...).fold(...)`. This results in a single reference to `zeroValue` being used for both `RDD.mapPartitions()` and `RDD.fold()` on each partition. This can result in strange accumulator values being fed into each partition's call to `RDD.fold()`, as the `zeroValue` may have been changed in-place during the `RDD.mapPartitions()` call.
      
      As an illustrative example, submit the following to `spark-submit`:
      ```
      from pyspark import SparkConf, SparkContext
      import collections
      
      def updateCounter(acc, val):
          print 'update acc:', acc
          print 'update val:', val
          acc[val] += 1
          return acc
      
      def comboCounter(acc1, acc2):
          print 'combo acc1:', acc1
          print 'combo acc2:', acc2
          acc1.update(acc2)
          return acc1
      
      def main():
          conf = SparkConf().setMaster("local").setAppName("Aggregate with Counter")
          sc = SparkContext(conf = conf)
      
          print '======= AGGREGATING with ONE PARTITION ======='
          print sc.parallelize(range(1,10), 1).aggregate(collections.Counter(), updateCounter, comboCounter)
      
          print '======= AGGREGATING with TWO PARTITIONS ======='
          print sc.parallelize(range(1,10), 2).aggregate(collections.Counter(), updateCounter, comboCounter)
      
      if __name__ == "__main__":
          main()
      ```
      
      One probably expects this to output the following:
      ```
      Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1})
      ```
      
      But it instead outputs this (regardless of the number of partitions):
      ```
      Counter({1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2})
      ```
      
      This is because (I believe) `zeroValue` gets passed correctly to each partition, but after `RDD.mapPartitions()` completes, the `zeroValue` object has been updated and is then passed to `RDD.fold()`, which results in all items being double-counted within each partition before being finally reduced at the calling node.
      
      I realize that this type of calculation is typically done by `RDD.mapPartitions(...).reduceByKey(...)`, but hopefully this illustrates some potentially confusing behavior. I also noticed that other `RDD` methods use this `deepcopy` approach to creating unique copies of `zeroValue` (i.e., `RDD.aggregateByKey()` and `RDD.foldByKey()`), and that the Scala implementations do seem to serialize the `zeroValue` object appropriately to prevent this type of behavior.
      
      Author: Nicholas Hwang <moogling@gmail.com>
      
      Closes #7378 from njhwang/master and squashes the following commits:
      
      659bb27 [Nicholas Hwang] Fixed RDD.aggregate() to perform a reduce operation on collected mapPartitions results, similar to how fold currently is implemented. This prevents an initial combOp being performed on each partition with zeroValue (which leads to unexpected behavior if zeroValue is a mutable object) before being combOp'ed with other partition results.
      8d8d694 [Nicholas Hwang] Changed dict construction to be compatible with Python 2.6 (cannot use list comprehensions to make dicts)
      56eb2ab [Nicholas Hwang] Fixed whitespace after colon to conform with PEP8
      391de4a [Nicholas Hwang] Removed used of collections.Counter from RDD tests for Python 2.6 compatibility; used defaultdict(int) instead. Merged treeAggregate test with mutable zero value into aggregate test to reduce code duplication.
      2fa4e4b [Nicholas Hwang] Merge branch 'master' of https://github.com/njhwang/spark
      ba528bd [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3. Also replaced some parallelizations of ranges with xranges, per the documentation's recommendations of preferring xrange over range.
      7820391 [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3.
      90d1544 [Nicholas Hwang] Made sure RDD.aggregate() makes a deepcopy of zeroValue for all partitions; this ensures that the mapPartitions call works with unique copies of zeroValue in each partition, and prevents a single reference to zeroValue being used for both map and fold calls on each partition (resulting in possibly unexpected behavior).
      a803ac3e
  25. Jul 15, 2015
    • MechCoder's avatar
      [SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark · 20bb10f8
      MechCoder authored
      This adds Pylint checks to PySpark.
      
      For now this lazy installs using easy_install to /dev/pylint (similar to the pep8 script).
      We still need to figure out what rules to be allowed.
      
      Author: MechCoder <manojkumarsivaraj334@gmail.com>
      
      Closes #7241 from MechCoder/pylint and squashes the following commits:
      
      2fc7291 [MechCoder] Remove pylint test fail
      6d883a2 [MechCoder] Silence warnings and make pylint tests fail to check if it works in jenkins
      f3a5e17 [MechCoder] undefined-variable
      ca8b749 [MechCoder] Minor changes
      71629f8 [MechCoder] remove trailing whitespace
      8498ff9 [MechCoder] Remove blacklisted arguments and pointless statements check
      1dbd094 [MechCoder] Disable all checks for now
      8b8aa8a [MechCoder] Add pylint configuration file
      7871bb1 [MechCoder] [SPARK-8706] [PySpark] [Project infra] Add pylint checks to PySpark
      20bb10f8
  26. Jul 10, 2015
    • Scott Taylor's avatar
      [SPARK-7735] [PYSPARK] Raise Exception on non-zero exit from pipe commands · 6e1c7e27
      Scott Taylor authored
      This will allow problems with piped commands to be detected.
      This will also allow tasks to be retried where errors are rare (such as network problems in piped commands).
      
      Author: Scott Taylor <github@megatron.me.uk>
      
      Closes #6262 from megatron-me-uk/patch-2 and squashes the following commits:
      
      04ae1d5 [Scott Taylor] Remove spurious empty line
      98fa101 [Scott Taylor] fix blank line style error
      574b564 [Scott Taylor] Merge pull request #2 from megatron-me-uk/patch-4
      0c1e762 [Scott Taylor] Update rdd pipe method for checkCode
      ab9a2e1 [Scott Taylor] Update rdd pipe tests for checkCode
      eb4801c [Scott Taylor] fix fail_condition
      b0ac3a4 [Scott Taylor] Merge pull request #1 from megatron-me-uk/megatron-me-uk-patch-1
      a307d13 [Scott Taylor] update rdd tests to test pipe modes
      34fcdc3 [Scott Taylor] add optional argument 'mode' for rdd.pipe
      a0c0161 [Scott Taylor] fix generator issue
      8a9ef9c [Scott Taylor] make check_return_code an iterator
      0486ae3 [Scott Taylor] style fixes
      8ed89a6 [Scott Taylor] Chain generators to prevent potential deadlock
      4153b02 [Scott Taylor] fix list.sort returns None
      491d3fc [Scott Taylor] Pass a function handle to assertRaises
      3344a21 [Scott Taylor] wrap assertRaises with QuietTest
      3ab8c7a [Scott Taylor] remove whitespace for style
      cc1a73d [Scott Taylor] fix style issues in pipe test
      8db4073 [Scott Taylor] Add a test for rdd pipe functions
      1b3dc4e [Scott Taylor] fix missing space around operator style
      0974f98 [Scott Taylor] add space between words in multiline string
      45f4977 [Scott Taylor] fix line too long style error
      5745d85 [Scott Taylor] Remove space to fix style
      f552d49 [Scott Taylor] Catch non-zero exit from pipe commands
      6e1c7e27
  27. Jun 27, 2015
    • Josh Rosen's avatar
      [SPARK-8583] [SPARK-5482] [BUILD] Refactor python/run-tests to integrate with... · 40648c56
      Josh Rosen authored
      [SPARK-8583] [SPARK-5482] [BUILD] Refactor python/run-tests to integrate with dev/run-tests module system
      
      This patch refactors the `python/run-tests` script:
      
      - It's now written in Python instead of Bash.
      - The descriptions of the tests to run are now stored in `dev/run-tests`'s modules.  This allows the pull request builder to skip Python tests suites that were not affected by the pull request's changes.  For example, we can now skip the PySpark Streaming test cases when only SQL files are changed.
      - `python/run-tests` now supports command-line flags to make it easier to run individual test suites (this addresses SPARK-5482):
      
        ```
      Usage: run-tests [options]
      
      Options:
        -h, --help            show this help message and exit
        --python-executables=PYTHON_EXECUTABLES
                              A comma-separated list of Python executables to test
                              against (default: python2.6,python3.4,pypy)
        --modules=MODULES     A comma-separated list of Python modules to test
                              (default: pyspark-core,pyspark-ml,pyspark-mllib
                              ,pyspark-sql,pyspark-streaming)
         ```
      - `dev/run-tests` has been split into multiple files: the module definitions and test utility functions are now stored inside of a `dev/sparktestsupport` Python module, allowing them to be re-used from the Python test runner script.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #6967 from JoshRosen/run-tests-python-modules and squashes the following commits:
      
      f578d6d [Josh Rosen] Fix print for Python 2.x
      8233d61 [Josh Rosen] Add python/run-tests.py to Python lint checks
      34c98d2 [Josh Rosen] Fix universal_newlines for Python 3
      8f65ed0 [Josh Rosen] Fix handling of  module in python/run-tests
      37aff00 [Josh Rosen] Python 3 fix
      27a389f [Josh Rosen] Skip MLLib tests for PyPy
      c364ccf [Josh Rosen] Use which() to convert PYSPARK_PYTHON to an absolute path before shelling out to run tests
      568a3fd [Josh Rosen] Fix hashbang
      3b852ae [Josh Rosen] Fall back to PYSPARK_PYTHON when sys.executable is None (fixes a test)
      f53db55 [Josh Rosen] Remove python2 flag, since the test runner script also works fine under Python 3
      9c80469 [Josh Rosen] Fix passing of PYSPARK_PYTHON
      d33e525 [Josh Rosen] Merge remote-tracking branch 'origin/master' into run-tests-python-modules
      4f8902c [Josh Rosen] Python lint fixes.
      8f3244c [Josh Rosen] Use universal_newlines to fix dev/run-tests doctest failures on Python 3.
      f542ac5 [Josh Rosen] Fix lint check for Python 3
      fff4d09 [Josh Rosen] Add dev/sparktestsupport to pep8 checks
      2efd594 [Josh Rosen] Update dev/run-tests to use new Python test runner flags
      b2ab027 [Josh Rosen] Add command-line options for running individual suites in python/run-tests
      caeb040 [Josh Rosen] Fixes to PySpark test module definitions
      d6a77d3 [Josh Rosen] Fix the tests of dev/run-tests
      def2d8a [Josh Rosen] Two minor fixes
      aec0b8f [Josh Rosen] Actually get the Kafka stuff to run properly
      04015b9 [Josh Rosen] First attempt at getting PySpark Kafka test to work in new runner script
      4c97136 [Josh Rosen] PYTHONPATH fixes
      dcc9c09 [Josh Rosen] Fix time division
      32660fc [Josh Rosen] Initial cut at Python test runner refactoring
      311c6a9 [Josh Rosen] Move shell utility functions to own module.
      1bdeb87 [Josh Rosen] Move module definitions to separate file.
      40648c56
  28. Jun 18, 2015
    • Davies Liu's avatar
      [SPARK-8202] [PYSPARK] fix infinite loop during external sort in PySpark · 9b200272
      Davies Liu authored
      The batch size during external sort will grow up to max 10000, then shrink down to zero, causing infinite loop.
      Given the assumption that the items usually have similar size, so we don't need to adjust the batch size after first spill.
      
      cc JoshRosen rxin angelini
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6714 from davies/batch_size and squashes the following commits:
      
      b170dfb [Davies Liu] update test
      b9be832 [Davies Liu] Merge branch 'batch_size' of github.com:davies/spark into batch_size
      6ade745 [Davies Liu] update test
      5c21777 [Davies Liu] Update shuffle.py
      e746aec [Davies Liu] fix batch size during sort
      9b200272
  29. Jun 17, 2015
  30. May 21, 2015
    • Holden Karau's avatar
      [SPARK-7711] Add a startTime property to match the corresponding one in Scala · 6b18cdc1
      Holden Karau authored
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #6275 from holdenk/SPARK-771-startTime-is-missing-from-pyspark and squashes the following commits:
      
      06662dc [Holden Karau] add mising blank line for style checks
      7a87410 [Holden Karau] add back missing newline
      7a7876b [Holden Karau] Add a startTime property to match the corresponding one in the Scala SparkContext
      6b18cdc1
  31. May 18, 2015
    • Daoyuan Wang's avatar
      [SPARK-7150] SparkContext.range() and SQLContext.range() · c2437de1
      Daoyuan Wang authored
      This PR is based on #6081, thanks adrian-wang.
      
      Closes #6081
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6230 from davies/range and squashes the following commits:
      
      d3ce5fe [Davies Liu] add tests
      789eda5 [Davies Liu] add range() in Python
      4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
      cbf5200 [Daoyuan Wang] let's add python support in a separate PR
      f45e3b2 [Daoyuan Wang] remove redundant toLong
      617da76 [Daoyuan Wang] fix safe marge for corner cases
      867c417 [Daoyuan Wang] fix
      13dbe84 [Daoyuan Wang] update
      bd998ba [Daoyuan Wang] update comments
      d3a0c1b [Daoyuan Wang] add range api()
      c2437de1
    • Davies Liu's avatar
      [SPARK-6216] [PYSPARK] check python version of worker with driver · 32fbd297
      Davies Liu authored
      This PR revert #5404, change to pass the version of python in driver into JVM, check it in worker before deserializing closure, then it can works with different major version of Python.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6203 from davies/py_version and squashes the following commits:
      
      b8fb76e [Davies Liu] fix test
      6ce5096 [Davies Liu] use string for version
      47c6278 [Davies Liu] check python version of worker with driver
      32fbd297
  32. May 09, 2015
    • Vinod K C's avatar
      [SPARK-7438] [SPARK CORE] Fixed validation of relativeSD in countApproxDistinct · dda6d9f4
      Vinod K C authored
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #5974 from vinodkc/fix_countApproxDistinct_Validation and squashes the following commits:
      
      3a3d59c [Vinod K C] Reverted removal of validation relativeSD<0.000017
      799976e [Vinod K C] Removed testcase to assert IAE when relativeSD>3.7
      8ddbfae [Vinod K C] Remove blank line
      b1b00a3 [Vinod K C] Removed relativeSD validation from python API,RDD.scala will do validation
      122d378 [Vinod K C] Fixed validation of relativeSD in  countApproxDistinct
      dda6d9f4
  33. Apr 21, 2015
    • Reynold Xin's avatar
      [SPARK-6953] [PySpark] speed up python tests · 3134c3fe
      Reynold Xin authored
      This PR try to speed up some python tests:
      
      ```
      tests.py                       144s -> 103s      -41s
      mllib/classification.py         24s -> 17s        -7s
      mllib/regression.py             27s -> 15s       -12s
      mllib/tree.py                   27s -> 13s       -14s
      mllib/tests.py                  64s -> 31s       -33s
      streaming/tests.py             185s -> 84s      -101s
      ```
      Considering python3, the total saving will be 558s (almost 10 minutes) (core, and streaming run three times, mllib runs twice).
      
      During testing, it will show used time for each test file:
      ```
      Run core tests ...
      Running test: pyspark/rdd.py ... ok (22s)
      Running test: pyspark/context.py ... ok (16s)
      Running test: pyspark/conf.py ... ok (4s)
      Running test: pyspark/broadcast.py ... ok (4s)
      Running test: pyspark/accumulators.py ... ok (4s)
      Running test: pyspark/serializers.py ... ok (6s)
      Running test: pyspark/profiler.py ... ok (5s)
      Running test: pyspark/shuffle.py ... ok (1s)
      Running test: pyspark/tests.py ... ok (103s)   144s
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5605 from rxin/python-tests-speed and squashes the following commits:
      
      d08542d [Reynold Xin] Merge pull request #14 from mengxr/SPARK-6953
      89321ee [Xiangrui Meng] fix seed in tests
      3ad2387 [Reynold Xin] Merge pull request #5427 from davies/python_tests
      3134c3fe
  34. Apr 16, 2015
    • Davies Liu's avatar
      [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR update PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with python3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
      04e44b37
  35. Apr 15, 2015
    • Davies Liu's avatar
      [SPARK-6886] [PySpark] fix big closure with shuffle · f11288d5
      Davies Liu authored
      Currently, the created broadcast object will have same life cycle as RDD in Python. For multistage jobs, an PythonRDD will be created in JVM and the RDD in Python may be GCed, then the broadcast will be destroyed in JVM before the PythonRDD.
      
      This PR change to use PythonRDD to track the lifecycle of the broadcast object. It also have a refactor about getNumPartitions() to avoid unnecessary creation of PythonRDD, which could be heavy.
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5496 from davies/big_closure and squashes the following commits:
      
      9a0ea4c [Davies Liu] fix big closure with shuffle
      f11288d5
  36. Apr 10, 2015
    • Davies Liu's avatar
      [SPARK-6216] [PySpark] check the python version in worker · 4740d6a1
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5404 from davies/check_version and squashes the following commits:
      
      e559248 [Davies Liu] add tests
      ec33b5f [Davies Liu] check the python version in worker
      4740d6a1
    • Milan Straka's avatar
      [SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey. · 0375134f
      Milan Straka authored
      The samples should always be sorted in ascending order, because bisect.bisect_left is used on it. The reverse order of the result is already achieved in rangePartitioner by reversing the found index.
      
      The current implementation also work, but always uses only two partitions -- the first one and the last one (because the bisect_left return returns either "beginning" or "end" for a descending sequence).
      
      Author: Milan Straka <fox@ucw.cz>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4761 from foxik/fix-descending-sort and squashes the following commits:
      
      95896b5 [Milan Straka] Add regression test for SPARK-5969.
      5757490 [Milan Straka] Fix descending pyspark.rdd.sortByKey.
      0375134f
Loading