  1. Jul 22, 2015
    • [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled · b217230f
      Josh Rosen authored
      Spark has an option called spark.localExecution.enabled; according to the docs:
      
      > Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.
      
      This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration or stack traces including the runLocally method). As a step toward scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.
      
      This pull request simply brings #7484 up to date.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #7585 from rxin/remove-local-exec and squashes the following commits:
      
      84bd10e [Reynold Xin] Python fix.
      1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
      eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
      b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
      8975d96 [Josh Rosen] Remove local execution tests.
      ffa8c9b [Josh Rosen] Remove documentation for configuration
  2. Jul 19, 2015
    • [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions())... · a803ac3e
      Nicholas Hwang authored
      [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions.fold()
      
      I'm relatively new to Spark and functional programming, so forgive me if this pull request is just a result of my misunderstanding of how Spark should be used.
      
      Currently, if one happens to use a mutable object as `zeroValue` for `RDD.aggregate()`, unexpected behavior can occur.
      
      This is because pyspark's current implementation of `RDD.aggregate()` does not serialize or make a copy of `zeroValue` before handing it off to `RDD.mapPartitions(...).fold(...)`. This results in a single reference to `zeroValue` being used for both `RDD.mapPartitions()` and `RDD.fold()` on each partition. This can result in strange accumulator values being fed into each partition's call to `RDD.fold()`, as the `zeroValue` may have been changed in-place during the `RDD.mapPartitions()` call.
      
      As an illustrative example, submit the following to `spark-submit`:
      ```
      from pyspark import SparkConf, SparkContext
      import collections
      
      def updateCounter(acc, val):
          print('update acc:', acc)
          print('update val:', val)
          acc[val] += 1
          return acc
      
      def comboCounter(acc1, acc2):
          print('combo acc1:', acc1)
          print('combo acc2:', acc2)
          acc1.update(acc2)
          return acc1
      
      def main():
          conf = SparkConf().setMaster("local").setAppName("Aggregate with Counter")
          sc = SparkContext(conf=conf)
      
          print('======= AGGREGATING with ONE PARTITION =======')
          print(sc.parallelize(range(1, 10), 1).aggregate(collections.Counter(), updateCounter, comboCounter))
      
          print('======= AGGREGATING with TWO PARTITIONS =======')
          print(sc.parallelize(range(1, 10), 2).aggregate(collections.Counter(), updateCounter, comboCounter))
      
      if __name__ == "__main__":
          main()
      ```
      
      One probably expects this to output the following:
      ```
      Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1})
      ```
      
      But it instead outputs this (regardless of the number of partitions):
      ```
      Counter({1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2})
      ```
      
      This is because (I believe) `zeroValue` gets passed correctly to each partition, but after `RDD.mapPartitions()` completes, the `zeroValue` object has been updated and is then passed to `RDD.fold()`, which results in all items being double-counted within each partition before being finally reduced at the calling node.
      
      I realize that this type of calculation is typically done by `RDD.mapPartitions(...).reduceByKey(...)`, but hopefully this illustrates some potentially confusing behavior. I also noticed that other `RDD` methods use this `deepcopy` approach to create unique copies of `zeroValue` (namely `RDD.aggregateByKey()` and `RDD.foldByKey()`), and that the Scala implementations do seem to serialize the `zeroValue` object appropriately to prevent this type of behavior.
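
      For reference, here is a minimal sketch of the safer pattern this PR moves toward (names are illustrative, not the actual rdd.py code; it assumes a running SparkContext): give each partition its own deep copy of `zeroValue` and combine the collected per-partition results with a plain reduce.
      ```
      import copy
      from functools import reduce
      
      def safe_aggregate(rdd, zeroValue, seqOp, combOp):
          def aggregate_partition(iterator):
              # Each partition mutates its own deep copy of zeroValue, so
              # in-place changes cannot leak into other partitions or into
              # the final combine step.
              acc = copy.deepcopy(zeroValue)
              for item in iterator:
                  acc = seqOp(acc, item)
              yield acc
          # Combine the per-partition results on the driver with a reduce,
          # instead of folding them into the original zeroValue reference.
          parts = rdd.mapPartitions(aggregate_partition).collect()
          return reduce(combOp, parts, copy.deepcopy(zeroValue))
      ```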
      
      Author: Nicholas Hwang <moogling@gmail.com>
      
      Closes #7378 from njhwang/master and squashes the following commits:
      
      659bb27 [Nicholas Hwang] Fixed RDD.aggregate() to perform a reduce operation on collected mapPartitions results, similar to how fold currently is implemented. This prevents an initial combOp being performed on each partition with zeroValue (which leads to unexpected behavior if zeroValue is a mutable object) before being combOp'ed with other partition results.
      8d8d694 [Nicholas Hwang] Changed dict construction to be compatible with Python 2.6 (cannot use list comprehensions to make dicts)
      56eb2ab [Nicholas Hwang] Fixed whitespace after colon to conform with PEP8
      391de4a [Nicholas Hwang] Removed used of collections.Counter from RDD tests for Python 2.6 compatibility; used defaultdict(int) instead. Merged treeAggregate test with mutable zero value into aggregate test to reduce code duplication.
      2fa4e4b [Nicholas Hwang] Merge branch 'master' of https://github.com/njhwang/spark
      ba528bd [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3. Also replaced some parallelizations of ranges with xranges, per the documentation's recommendations of preferring xrange over range.
      7820391 [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3.
      90d1544 [Nicholas Hwang] Made sure RDD.aggregate() makes a deepcopy of zeroValue for all partitions; this ensures that the mapPartitions call works with unique copies of zeroValue in each partition, and prevents a single reference to zeroValue being used for both map and fold calls on each partition (resulting in possibly unexpected behavior).
  3. Jul 10, 2015
    • [SPARK-7735] [PYSPARK] Raise Exception on non-zero exit from pipe commands · 6e1c7e27
      Scott Taylor authored
      This will allow problems with piped commands to be detected.
      This will also allow tasks to be retried where errors are rare (such as network problems in piped commands).
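
      For illustration, usage along these lines (a sketch assuming a running SparkContext `sc`; the `checkCode` keyword is taken from the commit log below):
      ```
      rdd = sc.parallelize(["1", "2", "3"])
      print(rdd.pipe("cat").collect())  # ['1', '2', '3']
      
      # With checking enabled, a command that exits non-zero surfaces as a
      # Python exception, so the task fails visibly and can be retried.
      rdd.pipe("false", checkCode=True).collect()  # raises an exception
      ```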
      
      Author: Scott Taylor <github@megatron.me.uk>
      
      Closes #6262 from megatron-me-uk/patch-2 and squashes the following commits:
      
      04ae1d5 [Scott Taylor] Remove spurious empty line
      98fa101 [Scott Taylor] fix blank line style error
      574b564 [Scott Taylor] Merge pull request #2 from megatron-me-uk/patch-4
      0c1e762 [Scott Taylor] Update rdd pipe method for checkCode
      ab9a2e1 [Scott Taylor] Update rdd pipe tests for checkCode
      eb4801c [Scott Taylor] fix fail_condition
      b0ac3a4 [Scott Taylor] Merge pull request #1 from megatron-me-uk/megatron-me-uk-patch-1
      a307d13 [Scott Taylor] update rdd tests to test pipe modes
      34fcdc3 [Scott Taylor] add optional argument 'mode' for rdd.pipe
      a0c0161 [Scott Taylor] fix generator issue
      8a9ef9c [Scott Taylor] make check_return_code an iterator
      0486ae3 [Scott Taylor] style fixes
      8ed89a6 [Scott Taylor] Chain generators to prevent potential deadlock
      4153b02 [Scott Taylor] fix list.sort returns None
      491d3fc [Scott Taylor] Pass a function handle to assertRaises
      3344a21 [Scott Taylor] wrap assertRaises with QuietTest
      3ab8c7a [Scott Taylor] remove whitespace for style
      cc1a73d [Scott Taylor] fix style issues in pipe test
      8db4073 [Scott Taylor] Add a test for rdd pipe functions
      1b3dc4e [Scott Taylor] fix missing space around operator style
      0974f98 [Scott Taylor] add space between words in multiline string
      45f4977 [Scott Taylor] fix line too long style error
      5745d85 [Scott Taylor] Remove space to fix style
      f552d49 [Scott Taylor] Catch non-zero exit from pipe commands
  4. Jun 30, 2015
    • [SPARK-8738] [SQL] [PYSPARK] capture SQL AnalysisException in Python API · 58ee2a2e
      Davies Liu authored
      Capture AnalysisException in SQL, hide the long Java stack trace, and show only the error message.
      
      cc rxin
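
      A sketch of the resulting user experience, assuming a running SparkContext `sc` and that the exception is exposed to Python as `pyspark.sql.utils.AnalysisException`:
      ```
      from pyspark.sql import SQLContext
      from pyspark.sql.utils import AnalysisException
      
      sqlContext = SQLContext(sc)
      try:
          sqlContext.sql("SELECT nonexistent_column FROM some_table")
      except AnalysisException as e:
          # Only the concise analysis error, not the long Java stack trace.
          print("Query failed:", e)
      ```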
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7135 from davies/ananylis and squashes the following commits:
      
      dad7ae7 [Davies Liu] add comment
      ec0c0e8 [Davies Liu] Update utils.py
      cdd7edd [Davies Liu] add doc
      7b044c2 [Davies Liu] fix python 3
      f84d3bd [Davies Liu] capture SQL AnalysisException in Python API
  5. Jun 29, 2015
    • [SPARK-7810] [PYSPARK] solve python rdd socket connection problem · ecd3aacf
      Ai He authored
      Method "_load_from_socket" in rdd.py cannot load data from jvm socket when ipv6 is used. The current method only works well with ipv4. New modification should work around both two protocols.
      
      Author: Ai He <ai.he@ussuning.com>
      Author: AiHe <ai.he@ussuning.com>
      
      Closes #6338 from AiHe/pyspark-networking-issue and squashes the following commits:
      
      d4fc9c4 [Ai He] handle code review 2
      e75c5c8 [Ai He] handle code review
      5644953 [AiHe] solve python rdd socket connection problem to jvm
  6. Jun 23, 2015
    • [SPARK-8541] [PYSPARK] test the absolute error in approx doctests · f0dcbe8a
      Scott Taylor authored
      A minor change but one which is (presumably) visible on the public api docs webpage.
      
      Author: Scott Taylor <github@megatron.me.uk>
      
      Closes #6942 from megatron-me-uk/patch-3 and squashes the following commits:
      
      fbed000 [Scott Taylor] test the absolute error in approx doctests
  7. May 21, 2015
    • [SPARK-6416] [DOCS] RDD.fold() requires the operator to be commutative · 6e534026
      Sean Owen authored
      Document current limitation of rdd.fold.
      
      This does not resolve SPARK-6416 but just documents the issue.
      CC JoshRosen
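
      A minimal sketch of the pitfall, assuming a running SparkContext `sc` (subtraction is neither associative nor commutative, so the distributed result can differ from a local fold):
      ```
      from functools import reduce
      
      op = lambda a, b: a - b
      nums = [1, 2, 3, 4]
      
      print(reduce(op, nums, 0))                  # local fold: -10
      
      # fold() applies op within each partition and then across the
      # per-partition results, so the grouping -- and the answer -- differs.
      print(sc.parallelize(nums, 2).fold(0, op))  # not -10
      ```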
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #6231 from srowen/SPARK-6416 and squashes the following commits:
      
      9fef39f [Sean Owen] Add comment to other languages; reword to highlight the difference from non-distributed collections and to not suggest it is a bug that is to be fixed
      da40d84 [Sean Owen] Document current limitation of rdd.fold.
  8. May 18, 2015
    • [SPARK-6216] [PYSPARK] check python version of worker with driver · 32fbd297
      Davies Liu authored
      This PR reverts #5404; instead, it passes the driver's Python version into the JVM and checks it in the worker before deserializing the closure, so PySpark can work across different major versions of Python.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6203 from davies/py_version and squashes the following commits:
      
      b8fb76e [Davies Liu] fix test
      6ce5096 [Davies Liu] use string for version
      47c6278 [Davies Liu] check python version of worker with driver
  9. May 09, 2015
    • [SPARK-7438] [SPARK CORE] Fixed validation of relativeSD in countApproxDistinct · dda6d9f4
      Vinod K C authored
      Author: Vinod K C <vinod.kc@huawei.com>
      
      Closes #5974 from vinodkc/fix_countApproxDistinct_Validation and squashes the following commits:
      
      3a3d59c [Vinod K C] Reverted removal of validation relativeSD<0.000017
      799976e [Vinod K C] Removed testcase to assert IAE when relativeSD>3.7
      8ddbfae [Vinod K C] Remove blank line
      b1b00a3 [Vinod K C] Removed relativeSD validation from python API,RDD.scala will do validation
      122d378 [Vinod K C] Fixed validation of relativeSD in  countApproxDistinct
  10. Apr 21, 2015
    • [SPARK-6949] [SQL] [PySpark] Support Date/Timestamp in Column expression · ab9128fb
      Davies Liu authored
      This PR enables auto_convert in JavaGateway, so we can register a converter for a given type, for example date and datetime.
      
      There are two bugs related to auto_convert, see [1] and [2]; we work around them in this PR.
      
      [1]  https://github.com/bartdag/py4j/issues/160
      [2] https://github.com/bartdag/py4j/issues/161
      
      cc rxin JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5570 from davies/py4j_date and squashes the following commits:
      
      eb4fa53 [Davies Liu] fix tests in python 3
      d17d634 [Davies Liu] rollback changes in mllib
      2e7566d [Davies Liu] convert tuple into ArrayList
      ceb3779 [Davies Liu] Update rdd.py
      3c373f3 [Davies Liu] support date and datetime by auto_convert
      cb094ff [Davies Liu] enable auto convert
  11. Apr 16, 2015
    • [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickling arrays from Pyrolite is broken in Python 3, so those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with Python 3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
  12. Apr 15, 2015
    • [SPARK-6886] [PySpark] fix big closure with shuffle · f11288d5
      Davies Liu authored
      Currently, the created broadcast object has the same life cycle as the RDD in Python. For multi-stage jobs, a PythonRDD is created in the JVM while the RDD in Python may be GCed, so the broadcast can be destroyed in the JVM before the PythonRDD is done with it.
      
      This PR changes the code to use the PythonRDD to track the lifecycle of the broadcast object. It also refactors getNumPartitions() to avoid unnecessary creation of PythonRDD, which could be heavy.
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5496 from davies/big_closure and squashes the following commits:
      
      9a0ea4c [Davies Liu] fix big closure with shuffle
  13. Apr 10, 2015
    • [SPARK-6216] [PySpark] check the python version in worker · 4740d6a1
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5404 from davies/check_version and squashes the following commits:
      
      e559248 [Davies Liu] add tests
      ec33b5f [Davies Liu] check the python version in worker
    • [SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey. · 0375134f
      Milan Straka authored
      The samples should always be sorted in ascending order, because bisect.bisect_left is used on them. The descending order of the result is already achieved in rangePartitioner by reversing the found index.
      
      The current implementation also works, but always uses only two partitions -- the first one and the last one -- because bisect_left returns either "beginning" or "end" for a descending sequence.
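
      A toy sketch of the approach described above (names are illustrative, not the actual rdd.py code):
      ```
      import bisect
      
      samples = [10, 20, 30]  # boundary samples, kept in ascending order
      
      def range_partition(key, ascending=True):
          # bisect_left assumes an ascending sequence; for a descending
          # sort we keep the samples ascending and mirror the found index.
          p = bisect.bisect_left(samples, key)
          return p if ascending else len(samples) - p
      ```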
      
      Author: Milan Straka <fox@ucw.cz>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4761 from foxik/fix-descending-sort and squashes the following commits:
      
      95896b5 [Milan Straka] Add regression test for SPARK-5969.
      5757490 [Milan Straka] Fix descending pyspark.rdd.sortByKey.
  14. Apr 09, 2015
    • [SPARK-3074] [PySpark] support groupByKey() with single huge key · b5c51c8d
      Davies Liu authored
      This patch changes groupByKey() to use an external-sort-based approach, so it can support a single huge key.
      
      For example, it can group a dataset containing one hot key with 40 million values (strings), using 500 MB of memory for the Python worker, finishing in about 2 minutes (the hash-based approach would need 6 GB of memory).
      
      During groupByKey(), it does an in-memory groupBy first. If the dataset cannot fit in memory, the data is partitioned by hash. If one partition still cannot fit in memory, it switches to a sort-based groupBy().
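
      A toy sketch of the sort-based idea (the real implementation spills sorted runs to disk and merges them): once sorted, each key's values can be streamed group by group instead of materializing one giant hash table.
      ```
      from itertools import groupby
      
      pairs = [("hot", i) for i in range(10)] + [("rare", 0)]
      pairs.sort(key=lambda kv: kv[0])            # an external sort in the real code
      for key, group in groupby(pairs, key=lambda kv: kv[0]):
          values = (v for _, v in group)          # an iterator, not an in-memory list
          print(key, sum(1 for _ in values))
      ```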
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #1977 from davies/groupby and squashes the following commits:
      
      af3713a [Davies Liu] make sure it's iterator
      67772dd [Davies Liu] fix tests
      e78c15c [Davies Liu] address comments
      0b0fde8 [Davies Liu] address comments
      0dcf320 [Davies Liu] address comments, rollback changes in ResultIterable
      e3b8eab [Davies Liu] fix narrow dependency
      2a1857a [Davies Liu] typo
      d2f053b [Davies Liu] add repr for FlattedValuesSerializer
      c6a2f8d [Davies Liu] address comments
      9e2df24 [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      2b9c261 [Davies Liu] fix typo in comments
      70aadcd [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      a14b4bd [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      ab5515b [Davies Liu] Merge branch 'master' into groupby
      651f891 [Davies Liu] simplify GroupByKey
      1578f2e [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      1f69f93 [Davies Liu] fix tests
      0d3395f [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      341f1e0 [Davies Liu] add comments, refactor
      47918b8 [Davies Liu] remove unused code
      6540948 [Davies Liu] address comments:
      17f4ec6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      4d4bc86 [Davies Liu] bugfix
      8ef965e [Davies Liu] Merge branch 'master' into groupby
      fbc504a [Davies Liu] Merge branch 'master' into groupby
      779ed03 [Davies Liu] fix merge conflict
      2c1d05b [Davies Liu] refactor, minor turning
      b48cda5 [Davies Liu] Merge branch 'master' into groupby
      85138e6 [Davies Liu] Merge branch 'master' into groupby
      acd8e1b [Davies Liu] fix memory when groupByKey().count()
      905b233 [Davies Liu] Merge branch 'sort' into groupby
      1f075ed [Davies Liu] Merge branch 'master' into sort
      4b07d39 [Davies Liu] compress the data while spilling
      0a081c6 [Davies Liu] Merge branch 'master' into groupby
      f157fe7 [Davies Liu] Merge branch 'sort' into groupby
      eb53ca6 [Davies Liu] Merge branch 'master' into sort
      b2dc3bf [Davies Liu] Merge branch 'sort' into groupby
      644abaf [Davies Liu] add license in LICENSE
      19f7873 [Davies Liu] improve tests
      11ba318 [Davies Liu] typo
      085aef8 [Davies Liu] Merge branch 'master' into groupby
      3ee58e5 [Davies Liu] switch to sort based groupBy, based on size of data
      1ea0669 [Davies Liu] choose sort based groupByKey() automatically
      b40bae7 [Davies Liu] bugfix
      efa23df [Davies Liu] refactor, add spark.shuffle.sort=False
      250be4e [Davies Liu] flatten the combined values when dumping into disks
      d05060d [Davies Liu] group the same key before shuffle, reduce the comparison during sorting
      083d842 [Davies Liu] sorted based groupByKey()
      55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
  15. Apr 02, 2015
    • [SPARK-6667] [PySpark] remove setReuseAddress · 0cce5451
      Davies Liu authored
      The address reuse on the server side caused the server to fail to acknowledge incoming connections, so remove it.
      
      This PR retries once after a timeout; it also adds a timeout on the client side.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5324 from davies/collect_hang and squashes the following commits:
      
      e5a51a2 [Davies Liu] remove setReuseAddress
      7977c2f [Davies Liu] do retry on client side
      b838f35 [Davies Liu] retry after timeout
  16. Mar 20, 2015
    • [SPARK-6370][core] Documentation: Improve all 3 docs for RDD.sample · 28bcb9e9
      mbonaci authored
      The docs for the `sample` method were insufficient, now less so.
      
      Author: mbonaci <mbonaci@gmail.com>
      
      Closes #5097 from mbonaci/master and squashes the following commits:
      
      a6a9d97 [mbonaci] [SPARK-6370][core] Documentation: Improve all 3 docs for RDD.sample method
  17. Mar 09, 2015
    • [SPARK-6194] [SPARK-677] [PySpark] fix memory leak in collect() · 8767565c
      Davies Liu authored
      Because of a circular reference between JavaObject and JavaMember, a Java object cannot be released until Python GC kicks in, causing a memory leak in collect() that may consume lots of memory in the JVM.
      
      This PR changes the way we send collected data back into Python from a local file to a socket, which avoids any disk IO during collect and avoids any Python referrers to the Java object.
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4923 from davies/fix_collect and squashes the following commits:
      
      d730286 [Davies Liu] address comments
      24c92a4 [Davies Liu] fix style
      ba54614 [Davies Liu] use socket to transfer data from JVM
      9517c8f [Davies Liu] fix memory leak in collect()
  18. Feb 25, 2015
    • [SPARK-5944] [PySpark] fix version in Python API docs · f3f4c87b
      Davies Liu authored
      use RELEASE_VERSION when building the Python API docs
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4731 from davies/api_version and squashes the following commits:
      
      c9744c9 [Davies Liu] Update create-release.sh
      08cbc3f [Davies Liu] fix python docs
  19. Feb 17, 2015
    • [SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark · c3d2b90b
      Davies Liu authored
      Currently, PySpark does not support a narrow dependency during cogroup/join when the two RDDs share a partitioner; an unnecessary extra shuffle stage comes in.
      
      The Python implementation of cogroup/join is different from the Scala one: it depends on union() and partitionBy(). This patch tries to use PartitionerAwareUnionRDD() in union() when all the RDDs have the same partitioner. It also fixes `reservePartitioner` in all the map() and mapPartitions() calls, so partitionBy() can skip the unnecessary shuffle stage.
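
      A sketch of the case this enables, assuming a running SparkContext `sc`: when both sides already share a partitioner, the join can reuse that partitioning (a narrow dependency) instead of adding another shuffle stage.
      ```
      a = sc.parallelize([(1, "a"), (2, "b")]).partitionBy(4)
      b = sc.parallelize([(1, "x"), (2, "y")]).partitionBy(4)
      # Co-partitioned inputs: with this patch the join should not
      # introduce an extra shuffle stage.
      print(a.join(b).collect())
      ```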
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4629 from davies/narrow and squashes the following commits:
      
      dffe34e [Davies Liu] improve test, check number of stages for join/cogroup
      1ed3ba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into narrow
      4d29932 [Davies Liu] address comment
      cc28d97 [Davies Liu] add unit tests
      940245e [Davies Liu] address comments
      ff5a0a6 [Davies Liu] skip the partitionBy() on Python side
      eb26c62 [Davies Liu] narrow dependency in PySpark
  20. Feb 04, 2015
    • [SPARK-5577] Python udf for DataFrame · dc101b0e
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4351 from davies/python_udf and squashes the following commits:
      
      d250692 [Davies Liu] fix conflict
      34234d4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python_udf
      440f769 [Davies Liu] address comments
      f0a3121 [Davies Liu] track life cycle of broadcast
      f99b2e1 [Davies Liu] address comments
      462b334 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python_udf
      7bccc3b [Davies Liu] python udf
      58dee20 [Davies Liu] clean up
  21. Jan 28, 2015
    • [SPARK-5430] move treeReduce and treeAggregate from mllib to core · 4ee79c71
      Xiangrui Meng authored
      We have seen many use cases of `treeAggregate`/`treeReduce` outside the ML domain. Maybe it is time to move them to Core. pwendell
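
      For reference, a sketch of the Python API this adds, assuming a running SparkContext `sc` (the tree depth controls how many levels of partial merges happen on the executors before the driver combines anything):
      ```
      rdd = sc.parallelize(range(1000), 100)
      total = rdd.treeAggregate(0,
                                lambda acc, x: acc + x,  # within a partition
                                lambda a, b: a + b,      # merging partials
                                depth=2)
      print(total)  # 499500
      ```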
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #4228 from mengxr/SPARK-5430 and squashes the following commits:
      
      20ad40d [Xiangrui Meng] exclude tree* from mima
      e89a43e [Xiangrui Meng] fix compile and update java doc
      3ae1a4b [Xiangrui Meng] add treeReduce/treeAggregate to Python
      6f948c5 [Xiangrui Meng] add treeReduce/treeAggregate to JavaRDDLike
      d600b6c [Xiangrui Meng] move treeReduce and treeAggregate to core
    • [SPARK-4387][PySpark] Refactoring python profiling code to make it extensible · 3bead67d
      Yandu Oppacher authored
      This PR is based on #3255, fixing conflicts and code style.
      
      Closes #3255.
      
      Author: Yandu Oppacher <yandu.oppacher@jadedpixel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3901 from davies/refactor-python-profile-code and squashes the following commits:
      
      b4a9306 [Davies Liu] fix tests
      4b79ce8 [Davies Liu] add docstring for profiler_cls
      2700e47 [Davies Liu] use BasicProfiler as default
      349e341 [Davies Liu] more refactor
      6a5d4df [Davies Liu] refactor and fix tests
      31bf6b6 [Davies Liu] fix code style
      0864b5d [Yandu Oppacher] Remove unused method
      76a6c37 [Yandu Oppacher] Added a profile collector to accumulate the profilers per stage
      9eefc36 [Yandu Oppacher] Fix doc
      9ace076 [Yandu Oppacher] Refactor of profiler, and moved tests around
      8739aff [Yandu Oppacher] Code review fixes
      9bda3ec [Yandu Oppacher] Refactor profiler code
    • [SPARK-5440][pyspark] Add toLocalIterator to pyspark rdd · 456c11f1
      Michael Nazario authored
      Since Java and Scala can both iterate over partitions via the "toLocalIterator" function, Python should have that same ability.
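
      A sketch of the added API, assuming a running SparkContext `sc`:
      ```
      rdd = sc.parallelize(range(10), 3)
      # Streams results back one partition at a time, so the driver never
      # has to hold the whole RDD in memory the way collect() does.
      for x in rdd.toLocalIterator():
          print(x)
      ```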
      
      Author: Michael Nazario <mnazario@palantir.com>
      
      Closes #4237 from mnazario/feature/toLocalIterator and squashes the following commits:
      
      1c58526 [Michael Nazario] Fix documentation off by one error
      0cdc8f8 [Michael Nazario] Add toLocalIterator to PySpark
    • SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs · 406f6d30
      Sandy Ryza authored
      Author: Sandy Ryza <sandy@cloudera.com>
      
      Closes #4251 from sryza/sandy-spark-5458 and squashes the following commits:
      
      460827a [Sandy Ryza] Python too
      d2dc160 [Sandy Ryza] SPARK-5458. Refer to aggregateByKey instead of combineByKey in docs
  22. Jan 23, 2015
    • [SPARK-5063] More helpful error messages for several invalid operations · cef1f092
      Josh Rosen authored
      This patch adds more helpful error messages for invalid programs that define nested RDDs, broadcast RDDs, perform actions inside of transformations (e.g. calling `count()` from inside of `map()`), and call certain methods on stopped SparkContexts.  Currently, these invalid programs lead to confusing NullPointerExceptions at runtime and have been a major source of questions on the mailing list and StackOverflow.
      
      In a few cases, I chose to log warnings instead of throwing exceptions in order to avoid any chance that this patch breaks programs that worked "by accident" in earlier Spark releases (e.g. programs that define nested RDDs but never run any jobs with them).
      
      In SparkContext, the new `assertNotStopped()` method is used to check whether methods are being invoked on a stopped SparkContext.  In some cases, user programs will not crash in spite of calling methods on stopped SparkContexts, so I've only added `assertNotStopped()` calls to methods that always throw exceptions when called on stopped contexts (e.g. by dereferencing a null `dagScheduler` pointer).
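
      A sketch of the kinds of invalid PySpark programs in question, assuming a running SparkContext `sc` (each of these is expected to fail; previously with a confusing NullPointerException, now with a descriptive error):
      ```
      rdd = sc.parallelize(range(10))
      
      # An action invoked from inside a transformation (a nested RDD use).
      rdd.map(lambda x: rdd.count()).collect()
      
      # Broadcasting an RDD directly instead of its collected data.
      sc.broadcast(rdd)
      
      # Calling methods on a stopped SparkContext.
      sc.stop()
      sc.parallelize([1, 2, 3])
      ```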
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3884 from JoshRosen/SPARK-5063 and squashes the following commits:
      
      a38774b [Josh Rosen] Fix spelling typo
      a943e00 [Josh Rosen] Convert two exceptions into warnings in order to avoid breaking user programs in some edge-cases.
      2d0d7f7 [Josh Rosen] Fix test to reflect 1.2.1 compatibility
      3f0ea0c [Josh Rosen] Revert two unintentional formatting changes
      8e5da69 [Josh Rosen] Remove assertNotStopped() calls for methods that were sometimes safe to call on stopped SC's in Spark 1.2
      8cff41a [Josh Rosen] IllegalStateException fix
      6ef68d0 [Josh Rosen] Fix Python line length issues.
      9f6a0b8 [Josh Rosen] Add improved error messages to PySpark.
      13afd0f [Josh Rosen] SparkException -> IllegalStateException
      8d404f3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-5063
      b39e041 [Josh Rosen] Fix BroadcastSuite test which broadcasted an RDD
      99cc09f [Josh Rosen] Guard against calling methods on stopped SparkContexts.
      34833e8 [Josh Rosen] Add more descriptive error message.
      57cc8a1 [Josh Rosen] Add error message when directly broadcasting RDD.
      15b2e6b [Josh Rosen] [SPARK-5063] Useful error messages for nested RDDs and actions inside of transformations
  23. Jan 20, 2015
    • SPARK-5270 [CORE] Provide isEmpty() function in RDD API · 306ff187
      Sean Owen authored
      Pretty minor, but submitted for consideration -- this would at least help people make this check in the most efficient way I know.
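
      A minimal sketch of that efficient check (mirroring what the commits below describe: look at no more than one element):
      ```
      def is_empty(rdd):
          # Empty iff there are no partitions, or taking one element yields nothing.
          return rdd.getNumPartitions() == 0 or len(rdd.take(1)) == 0
      ```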
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4074 from srowen/SPARK-5270 and squashes the following commits:
      
      66885b8 [Sean Owen] Add note that JavaRDDLike should not be implemented by user code
      2e9b490 [Sean Owen] More tests, and Mima-exclude the new isEmpty method in JavaRDDLike
      28395ff [Sean Owen] Add isEmpty to Java, Python
      7dd04b7 [Sean Owen] Add efficient RDD.isEmpty()
  24. Dec 16, 2014
    • [SPARK-4841] fix zip with textFile() · c246b95d
      Davies Liu authored
      UTF8Deserializer cannot be used in BatchedSerializer, so always use PickleSerializer() when changing batchSize in zip().
      
      Also, if two RDDs already have the same batch size, they do not need to be re-serialized.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3706 from davies/fix_4841 and squashes the following commits:
      
      20ce3a3 [Davies Liu] fix bug in _reserialize()
      e3ebf7c [Davies Liu] add comment
      379d2c8 [Davies Liu] fix zip with textFile()
  25. Nov 20, 2014
    • [SPARK-4477] [PySpark] remove numpy from RDDSampler · d39f2e9c
      Davies Liu authored
      In RDDSampler, numpy was used to try to gain better performance for poisson(), but the number of calls to random() is only (1 + fraction) * N in the pure Python implementation of poisson(), so there is not much performance gain from numpy.
      
      numpy is not a dependency of pyspark, so it may introduce problems, such as numpy being installed on the master but not on the slaves, as reported in SPARK-927.
      
      It also complicates the code a lot, so we should remove numpy from RDDSampler.
      
      I also did some benchmark to verify that:
      ```
      >>> from pyspark.mllib.random import RandomRDDs
      >>> rdd = RandomRDDs.uniformRDD(sc, 1 << 20, 1).cache()
      >>> rdd.count()  # cache it
      >>> rdd.sample(True, 0.9).count()    # measure this line
      ```
      the results:
      
      | withReplacement | random | numpy.random |
      | --------------- | ------ | ------------ |
      | True            | 1.5 s  | 1.4 s        |
      | False           | 0.6 s  | 0.8 s        |
      
      closes #2313
      
      Note: this patch includes some commits that are not mirrored to GitHub; it will be OK after the mirror catches up.
      
      Author: Davies Liu <davies@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3351 from davies/numpy and squashes the following commits:
      
      5c438d7 [Davies Liu] fix comment
      c5b9252 [Davies Liu] Merge pull request #1 from mengxr/SPARK-4477
      98eb31b [Xiangrui Meng] make poisson sampling slightly faster
      ee17d78 [Davies Liu] remove = for float
      13f7b05 [Davies Liu] Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into numpy
      f583023 [Davies Liu] fix tests
      51649f5 [Davies Liu] remove numpy in RDDSampler
      78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
      f5fdf63 [Davies Liu] fix bug with int in weights
      4dfa2cd [Davies Liu] refactor
      f866bcf [Davies Liu] remove unneeded change
      c7a2007 [Davies Liu] switch to python implementation
      95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
      0d9b256 [Davies Liu] refactor
      1715ee3 [Davies Liu] address comments
      41fce54 [Davies Liu] randomSplit()
  26. Nov 18, 2014
    • [SPARK-4327] [PySpark] Python API for RDD.randomSplit() · 7f22fa81
      Davies Liu authored
      ```
      pyspark.RDD.randomSplit(self, weights, seed=None)
          Randomly splits this RDD with the provided weights.
      
          :param weights: weights for splits, will be normalized if they don't sum to 1
          :param seed: random seed
          :return: split RDDs in a list
      
          >>> rdd = sc.parallelize(range(10), 1)
          >>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
          >>> rdd1.collect()
          [3, 6]
          >>> rdd2.collect()
          [0, 5, 7]
          >>> rdd3.collect()
          [1, 2, 4, 8, 9]
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3193 from davies/randomSplit and squashes the following commits:
      
      78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no performance gain
      f5fdf63 [Davies Liu] fix bug with int in weights
      4dfa2cd [Davies Liu] refactor
      f866bcf [Davies Liu] remove unneeded change
      c7a2007 [Davies Liu] switch to python implementation
      95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into randomSplit
      0d9b256 [Davies Liu] refactor
      1715ee3 [Davies Liu] address comments
      41fce54 [Davies Liu] randomSplit()
  27. Nov 07, 2014
    • [SPARK-4304] [PySpark] Fix sort on empty RDD · 77791097
      Davies Liu authored
      This PR fixes sortBy()/sortByKey() on an empty RDD.
      
      This should be backported into 1.1/1.2.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3162 from davies/fix_sort and squashes the following commits:
      
      84f64b7 [Davies Liu] add tests
      52995b5 [Davies Liu] fix sortByKey() on empty RDD
  28. Nov 04, 2014
    • [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. · e4f42631
      Davies Liu authored
      This PR simplifies the serializer: always use a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.
      
      Author: Davies Liu <davies@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2920 from davies/fix_autobatch and squashes the following commits:
      
      e544ef9 [Davies Liu] revert unrelated change
      6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      1d557fc [Davies Liu] fix tests
      8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      76abdce [Davies Liu] clean up
      53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      b4292ce [Davies Liu] fix bug in master
      d79744c [Davies Liu] recover hive tests
      be37ece [Davies Liu] refactor
      eb3938d [Davies Liu] refactor serializer in scala
      8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
  29. Nov 03, 2014
    • [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample · 3cca1962
      Xiangrui Meng authored
      The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1.
      
      ~~~
      In [14]: import random
      
      In [15]: r1 = random.Random(10)
      
      In [16]: r1.randint(0, 1)
      Out[16]: 1
      
      In [17]: r1.random()
      Out[17]: 0.4288890546751146
      
      In [18]: r1.random()
      Out[18]: 0.5780913011344704
      
      In [19]: r2 = random.Random(10)
      
      In [20]: r2.randint(0, 1)
      Out[20]: 1
      
      In [21]: r2.randint(0, 1)
      Out[21]: 0
      
      In [22]: r2.random()
      Out[22]: 0.5780913011344704
      ~~~
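
      One common way to decorrelate the per-partition streams (illustrative; not necessarily the exact scheme this commit adopts) is to derive each partition's seed from a hash of the global seed and the split index, rather than from `seed + split`:
      ```
      import random
      
      def partition_rng(seed, split):
          # Hashing mixes the bits, so partitions i and i+1 no longer get
          # random sequences that are simple offsets of each other.
          return random.Random(hash((seed, split)))
      ```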
      
      Note: The new tests are not for this bug fix.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:
      
      869ae4b [Xiangrui Meng] move tests tests.py
      c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for rdd.sample
  30. Oct 31, 2014
    • [SPARK-4150][PySpark] return self in rdd.setName · f1e7361f
      Xiangrui Meng authored
      Then we can do `rdd.setName('abc').cache().count()`.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3011 from mengxr/rdd-setname and squashes the following commits:
      
      10d0d60 [Xiangrui Meng] update test
      4ac3bbd [Xiangrui Meng] return self in rdd.setName
  31. Oct 13, 2014
    • [Spark] RDD take() method: overestimate too much · 49bbdcb6
      yingjieMiao authored
      In the comment (Line 1083), it says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."
      
      `(1.5 * num * partsScanned / buf.size).toInt` is a guess at the total number of partitions needed. In every iteration, we should therefore use only the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`. The existing implementation grows `partsScanned` exponentially (roughly `x_{n+1} >= (1.5 + 1) * x_n`).
      
      This could be a performance problem (unless this is the intended behavior).
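
      A toy sketch of the point being made (hypothetical names, not the actual Scala code): treat the interpolated value as a total, so each round adds only the increment instead of multiplying the partitions already scanned.
      ```
      def next_parts_scanned(num, parts_scanned, buf_size, cap=4):
          estimated_total = int(1.5 * num * parts_scanned / max(buf_size, 1))
          increment = max(estimated_total - parts_scanned, 1)
          # Capping the increment keeps one bad estimate from scanning
          # far more partitions than necessary.
          return parts_scanned + min(increment, cap * parts_scanned)
      ```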
      
      Author: yingjieMiao <yingjie@42go.com>
      
      Closes #2648 from yingjieMiao/rdd_take and squashes the following commits:
      
      d758218 [yingjieMiao] scala style fix
      a8e74bb [yingjieMiao] python style fix
      4b6e777 [yingjieMiao] infix operator style fix
      4391d3b [yingjieMiao] typo fix.
      692f4e6 [yingjieMiao] cap numPartsToTry
      c4483dc [yingjieMiao] style fix
      1d2c410 [yingjieMiao] also change in rdd.py and AsyncRDD
      d31ff7e [yingjieMiao] handle the edge case after 1 iteration
      a2aa36b [yingjieMiao] RDD take method: overestimate too much
  32. Oct 11, 2014
    • [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings · 7a3f589e
      cocoatomo authored
      The Sphinx documents contain corrupted ReST formatting and produce some warnings.
      
      The purpose of this issue is same as https://issues.apache.org/jira/browse/SPARK-3773.
      
      commit: 0e8203f4
      
      output
      ```
      $ cd ./python/docs
      $ make clean html
      rm -rf _build/*
      sphinx-build -b html -d _build/doctrees   . _build/html
      Making output directory...
      Running Sphinx v1.2.3
      loading pickled environment... not yet created
      building [html]: targets for 4 source files that are out of date
      updating environment: 4 added, 0 changed, 0 removed
      reading sources... [100%] pyspark.sql
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.findSynonyms:4: WARNING: Field list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.transform:3: WARNING: Field list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/sql.py:docstring of pyspark.sql:4: WARNING: Bullet list ends without a blank line; unexpected unindent.
      looking for now-outdated files... none found
      pickling environment... done
      checking consistency... done
      preparing documents... done
      writing output... [100%] pyspark.sql
      writing additional files... (12 module code pages) _modules/index search
      copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
      done
      copying extra files... done
      dumping search index... done
      dumping object inventory... done
      build succeeded, 4 warnings.
      
      Build finished. The HTML pages are in _build/html.
      ```
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2766 from cocoatomo/issues/3909-sphinx-build-warnings and squashes the following commits:
      
      2c7faa8 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings