  1. Apr 16, 2015
    • Davies Liu's avatar
      [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickling an array from Pyrolite is broken in Python 3, so those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with Python 3.
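
      The squashed commits mention fixing xrange, division, and imap, i.e. the usual 2/3 compatibility shims. A minimal, hedged sketch of that pattern (illustrative, not the PR's actual code):

      ```python
      import sys

      if sys.version_info[0] >= 3:
          xrange = range                 # xrange was removed in Python 3
          imap = map                     # map is already lazy in Python 3
          from functools import reduce   # reduce moved out of the builtins
      else:
          from itertools import imap     # keep a lazy map on Python 2
      ```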
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
      04e44b37
  2. Apr 09, 2015
    • Davies Liu's avatar
      [SPARK-3074] [PySpark] support groupByKey() with single huge key · b5c51c8d
      Davies Liu authored
      This patch changes groupByKey() to use an external-sort-based approach, so it can handle a single huge key.
      
      For example, it can group a dataset containing one hot key with 40 million string values, using 500 MB of memory for the Python worker, and finish in about 2 minutes (a hash-based approach would need 6 GB of memory).
      
      During groupByKey(), it does an in-memory groupBy first. If the dataset cannot fit in memory, the data is partitioned by hash. If one partition still cannot fit in memory, it switches to a sort-based groupBy().
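
      A hedged, much-simplified sketch of that strategy (the `fits_in_memory` and `spill_by_hash` helpers are hypothetical stand-ins for the memory tracking and disk spilling done by the real merger):

      ```python
      import itertools
      import operator

      def group_by_key(items, fits_in_memory, spill_by_hash):
          """Sketch of external grouping over (key, value) pairs."""
          if fits_in_memory(items):
              groups = {}
              for k, v in items:                       # plain in-memory groupBy
                  groups.setdefault(k, []).append(v)
              for kv in groups.items():
                  yield kv
              return

          # Too big: partition by hash and spill to disk, then do a sort-based
          # groupBy within each (much smaller) partition.
          for part in spill_by_hash(items):
              part = sorted(part, key=operator.itemgetter(0))
              for k, kvs in itertools.groupby(part, key=operator.itemgetter(0)):
                  yield k, [v for _, v in kvs]
      ```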
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #1977 from davies/groupby and squashes the following commits:
      
      af3713a [Davies Liu] make sure it's iterator
      67772dd [Davies Liu] fix tests
      e78c15c [Davies Liu] address comments
      0b0fde8 [Davies Liu] address comments
      0dcf320 [Davies Liu] address comments, rollback changes in ResultIterable
      e3b8eab [Davies Liu] fix narrow dependency
      2a1857a [Davies Liu] typo
      d2f053b [Davies Liu] add repr for FlattedValuesSerializer
      c6a2f8d [Davies Liu] address comments
      9e2df24 [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      2b9c261 [Davies Liu] fix typo in comments
      70aadcd [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      a14b4bd [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      ab5515b [Davies Liu] Merge branch 'master' into groupby
      651f891 [Davies Liu] simplify GroupByKey
      1578f2e [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      1f69f93 [Davies Liu] fix tests
      0d3395f [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      341f1e0 [Davies Liu] add comments, refactor
      47918b8 [Davies Liu] remove unused code
      6540948 [Davies Liu] address comments:
      17f4ec6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      4d4bc86 [Davies Liu] bugfix
      8ef965e [Davies Liu] Merge branch 'master' into groupby
      fbc504a [Davies Liu] Merge branch 'master' into groupby
      779ed03 [Davies Liu] fix merge conflict
      2c1d05b [Davies Liu] refactor, minor turning
      b48cda5 [Davies Liu] Merge branch 'master' into groupby
      85138e6 [Davies Liu] Merge branch 'master' into groupby
      acd8e1b [Davies Liu] fix memory when groupByKey().count()
      905b233 [Davies Liu] Merge branch 'sort' into groupby
      1f075ed [Davies Liu] Merge branch 'master' into sort
      4b07d39 [Davies Liu] compress the data while spilling
      0a081c6 [Davies Liu] Merge branch 'master' into groupby
      f157fe7 [Davies Liu] Merge branch 'sort' into groupby
      eb53ca6 [Davies Liu] Merge branch 'master' into sort
      b2dc3bf [Davies Liu] Merge branch 'sort' into groupby
      644abaf [Davies Liu] add license in LICENSE
      19f7873 [Davies Liu] improve tests
      11ba318 [Davies Liu] typo
      085aef8 [Davies Liu] Merge branch 'master' into groupby
      3ee58e5 [Davies Liu] switch to sort based groupBy, based on size of data
      1ea0669 [Davies Liu] choose sort based groupByKey() automatically
      b40bae7 [Davies Liu] bugfix
      efa23df [Davies Liu] refactor, add spark.shuffle.sort=False
      250be4e [Davies Liu] flatten the combined values when dumping into disks
      d05060d [Davies Liu] group the same key before shuffle, reduce the comparison during sorting
      083d842 [Davies Liu] sorted based groupByKey()
      55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
      b5c51c8d
  3. Feb 02, 2015
    • Davies Liu's avatar
      [SPARK-5154] [PySpark] [Streaming] Kafka streaming support in Python · 0561c454
      Davies Liu authored
      This PR brings a Python API for the Spark Streaming Kafka data source.
      
      ```
          class KafkaUtils(__builtin__.object)
           |  Static methods defined here:
           |
           |  createStream(ssc, zkQuorum, groupId, topics, storageLevel=StorageLevel(True, True, False, False, 2), keyDecoder=<function utf8_decoder>, valueDecoder=<function utf8_decoder>)
           |      Create an input stream that pulls messages from a Kafka Broker.
           |
           |      :param ssc:  StreamingContext object
           |      :param zkQuorum:  Zookeeper quorum (hostname:port,hostname:port,..).
           |      :param groupId:  The group id for this consumer.
           |      :param topics:  Dict of (topic_name -> numPartitions) to consume.
           |                      Each partition is consumed in its own thread.
           |      :param storageLevel:  RDD storage level.
           |      :param keyDecoder:  A function used to decode key
           |      :param valueDecoder:  A function used to decode value
           |      :return: A DStream object
      ```
      Run the example:
      
      ```
      bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
      ```
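
      A minimal word-count sketch of how the new API is used from Python (the ZooKeeper address, group id, and topic name are placeholders):

      ```python
      from pyspark import SparkContext
      from pyspark.streaming import StreamingContext
      from pyspark.streaming.kafka import KafkaUtils

      sc = SparkContext(appName="KafkaWordCount")
      ssc = StreamingContext(sc, 1)                      # 1-second batches

      # the topics dict maps topic name -> number of partitions to consume
      lines = KafkaUtils.createStream(ssc, "localhost:2181", "my-group", {"test": 1})
      counts = lines.map(lambda kv: kv[1]) \
                    .flatMap(lambda line: line.split(" ")) \
                    .map(lambda word: (word, 1)) \
                    .reduceByKey(lambda a, b: a + b)
      counts.pprint()

      ssc.start()
      ssc.awaitTermination()
      ```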
      
      Author: Davies Liu <davies@databricks.com>
      Author: Tathagata Das <tdas@databricks.com>
      
      Closes #3715 from davies/kafka and squashes the following commits:
      
      d93bfe0 [Davies Liu] Update make-distribution.sh
      4280d04 [Davies Liu] address comments
      e6d0427 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
      f257071 [Davies Liu] add tests for null in RDD
      23b039a [Davies Liu] address comments
      9af51c4 [Davies Liu] Merge branch 'kafka' of github.com:davies/spark into kafka
      a74da87 [Davies Liu] address comments
      dc1eed0 [Davies Liu] Update kafka_wordcount.py
      31e2317 [Davies Liu] Update kafka_wordcount.py
      370ba61 [Davies Liu] Update kafka.py
      97386b3 [Davies Liu] address comment
      2c567a5 [Davies Liu] update logging and comment
      33730d1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
      adeeb38 [Davies Liu] Merge pull request #3 from tdas/kafka-python-api
      aea8953 [Tathagata Das] Kafka-assembly for Python API
      eea16a7 [Davies Liu] refactor
      f6ce899 [Davies Liu] add example and fix bugs
      98c8d17 [Davies Liu] fix python style
      5697a01 [Davies Liu] bypass decoder in scala
      048dbe6 [Davies Liu] fix python style
      75d485e [Davies Liu] add mqtt
      07923c4 [Davies Liu] support kafka in Python
      0561c454
  4. Jan 15, 2015
    • Davies Liu's avatar
      [SPARK-5224] [PySpark] improve performance of parallelize list/ndarray · 3c8650c1
      Davies Liu authored
      The default batchSize was changed to 0 (batching based on the size of the objects), but parallelize() still used BatchedSerializer with batchSize=1; this PR uses batchSize=1024 for parallelize() by default.
      
      Also, BatchedSerializer did not work well with list and numpy.ndarray; this PR improves BatchedSerializer by using __len__ and __getslice__.
      
      Here is the benchmark for parallelizing 1 million ints as a list or an ndarray:

      type          | before | after | improvement
      ------------- | ------ | ----- | -----------
      list          | 11.7 s | 0.8 s | 14x
      numpy.ndarray | 32 s   | 0.7 s | 40x
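
      A hedged sketch of the batching idea: when the input supports __len__ and __getslice__ (a list or numpy.ndarray), slicing produces each batch directly instead of appending element by element. This is illustrative, not the exact BatchedSerializer code:

      ```python
      def batched(objects, batch_size):
          """Yield batches of `batch_size` items from `objects`."""
          if hasattr(objects, "__len__") and hasattr(objects, "__getslice__"):
              # list / numpy.ndarray: cheap slices instead of per-item appends
              for i in range(0, len(objects), batch_size):
                  yield objects[i:i + batch_size]
          else:
              batch = []
              for obj in objects:             # generic iterator fallback
                  batch.append(obj)
                  if len(batch) == batch_size:
                      yield batch
                      batch = []
              if batch:
                  yield batch
      ```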
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4024 from davies/opt_numpy and squashes the following commits:
      
      7618c7c [Davies Liu] improve performance of parallelize list/ndarray
      3c8650c1
  5. Dec 16, 2014
    • Davies Liu's avatar
      [SPARK-4841] fix zip with textFile() · c246b95d
      Davies Liu authored
      UTF8Deserializer cannot be used in BatchedSerializer, so always use PickleSerializer() when changing batchSize in zip().
      
      Also, if two RDDs already have the same batch size, they do not need to be re-serialized any more.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3706 from davies/fix_4841 and squashes the following commits:
      
      20ce3a3 [Davies Liu] fix bug in _reserialize()
      e3ebf7c [Davies Liu] add comment
      379d2c8 [Davies Liu] fix zip with textFile()
      c246b95d
  6. Nov 24, 2014
    • Davies Liu's avatar
      [SPARK-4548] [SPARK-4517] improve performance of python broadcast · 6cf50768
      Davies Liu authored
      Re-implement the Python broadcast using files:

      1) Serialize the Python object using cPickle and write it to disk.
      2) Create a wrapper in the JVM for the dumped file; it reads the data from the file during serialization.
      3) Use TorrentBroadcast or HttpBroadcast to transfer the (compressed) data to executors.
      4) During deserialization, write the data to disk.
      5) Pass the path to the Python worker, which reads the data from disk and unpickles it into a Python object lazily, on first access.

      It fixes the performance regression introduced in #2659 and has performance similar to 1.1, but supports objects larger than 2 GB and improves memory efficiency (only one compressed copy in the driver and each executor).
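
      A hedged sketch of the Python side of this flow (dump with cPickle, then unpickle lazily from the path handed to the worker); the helper names are illustrative, not the PR's actual code:

      ```python
      import cPickle
      import gc
      import tempfile

      def dump_broadcast(value):
          """Driver side: pickle the value to a temp file and return its path."""
          f = tempfile.NamedTemporaryFile(delete=False, suffix=".broadcast")
          cPickle.dump(value, f, protocol=2)
          f.close()
          return f.name

      def load_broadcast(path):
          """Worker side: unpickle from the file shipped by the JVM, on first access."""
          with open(path, "rb") as f:
              gc.disable()             # the squashed commits note unpickling is faster with GC off
              try:
                  return cPickle.load(f)
              finally:
                  gc.enable()
      ```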
      
      Testing with a 500 MB broadcast and 4 tasks (excluding the benefit from worker reuse in 1.2):
      
      name                     | 1.1   | 1.2 with this patch | improvement
      ------------------------ | ----- | ------------------- | -----------
      python-broadcast-w-bytes | 25.20 | 9.33                | 170.13%
      python-broadcast-w-set   | 4.13  | 4.50                | -8.35%
      
      Testing with 100 tasks (16 CPUs):
      
      name                     | 1.1   | 1.2 with this patch | improvement
      ------------------------ | ----- | ------------------- | -----------
      python-broadcast-w-bytes | 38.16 | 8.40                | 353.98%
      python-broadcast-w-set   | 23.29 | 9.59                | 142.80%
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3417 from davies/pybroadcast and squashes the following commits:
      
      50a58e0 [Davies Liu] address comments
      b98de1d [Davies Liu] disable gc while unpickle
      e5ee6b9 [Davies Liu] support large string
      09303b8 [Davies Liu] read all data into memory
      dde02dd [Davies Liu] improve performance of python broadcast
      6cf50768
  7. Nov 18, 2014
    • Davies Liu's avatar
      [SPARK-3721] [PySpark] broadcast objects larger than 2G · 4a377aff
      Davies Liu authored
      This patch brings support for broadcasting objects larger than 2 GB.
      
      pickle, zlib, FrameSerializer and Array[Byte] all cannot handle objects larger than 2 GB, so this patch introduces LargeObjectSerializer to serialize broadcast objects: the object is serialized and compressed into small chunks. It also changes the type Broadcast[Array[Byte]] into Broadcast[Array[Array[Byte]]].
      
      Testing broadcasts of objects larger than 2 GB is slow and memory hungry, so this was tested manually; it could be added to SparkPerf.
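
      A hedged sketch of the chunking idea behind LargeObjectSerializer, written in Python for illustration (the real serializer lives on the Scala side): split the payload into small pieces so no single byte array approaches the 2 GB limit, compressing each piece independently.

      ```python
      import zlib

      def to_chunks(data, chunk_size=1 << 20):
          """Split a byte string into compressed chunks, keeping each chunk small."""
          return [zlib.compress(data[i:i + chunk_size])
                  for i in range(0, len(data), chunk_size)]

      def from_chunks(chunks):
          """Decompress and reassemble the original byte string."""
          return b"".join(zlib.decompress(c) for c in chunks)
      ```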
      
      Author: Davies Liu <davies@databricks.com>
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2659 from davies/huge and squashes the following commits:
      
      7b57a14 [Davies Liu] add more tests for broadcast
      28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      a2f6a02 [Davies Liu] bug fix
      4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      5875c73 [Davies Liu] address comments
      10a349b [Davies Liu] address comments
      0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      6182c8f [Davies Liu] Merge branch 'master' into huge
      d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      2514848 [Davies Liu] address comments
      fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      1c2d928 [Davies Liu] fix scala style
      091b107 [Davies Liu] broadcast objects larger than 2G
      4a377aff
  8. Nov 04, 2014
    • Davies Liu's avatar
      [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. · e4f42631
      Davies Liu authored
      This PR simplifies the serializers: a batched serializer (AutoBatchedSerializer by default) is always used, even when the batch size is 1.
      
      Author: Davies Liu <davies@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2920 from davies/fix_autobatch and squashes the following commits:
      
      e544ef9 [Davies Liu] revert unrelated change
      6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      1d557fc [Davies Liu] fix tests
      8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      76abdce [Davies Liu] clean up
      53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      b4292ce [Davies Liu] fix bug in master
      d79744c [Davies Liu] recover hive tests
      be37ece [Davies Liu] refactor
      eb3938d [Davies Liu] refactor serializer in scala
      8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
      e4f42631
  9. Oct 23, 2014
    • Davies Liu's avatar
      [SPARK-3993] [PySpark] fix bug while reuse worker after take() · e595c8d0
      Davies Liu authored
      After take(), there may be garbage left in the socket, so the next task assigned to this worker could hang because of corrupted data.
      
      We should make sure the socket is clean before reusing it: write END_OF_STREAM at the end, and check for it after reading all results from Python.
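
      A hedged sketch of the sentinel check (the real protocol uses special length codes between PythonRDD and worker.py; the constant and function names here are illustrative):

      ```python
      import struct

      END_OF_STREAM = -4          # illustrative sentinel value

      def write_end_of_stream(outfile):
          """Worker side: mark that all results for this task have been written."""
          outfile.write(struct.pack("!i", END_OF_STREAM))
          outfile.flush()

      def socket_is_clean(infile):
          """Reader side: after consuming all results, the next int must be the
          sentinel; otherwise the socket is dirty and the worker must not be reused."""
          flag = struct.unpack("!i", infile.read(4))[0]
          return flag == END_OF_STREAM
      ```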
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2838 from davies/fix_reuse and squashes the following commits:
      
      8872914 [Davies Liu] fix tests
      660875b [Davies Liu] fix bug while reuse worker after take()
      e595c8d0
  10. Oct 12, 2014
    • giwa's avatar
      [SPARK-2377] Python API for Streaming · 69c67aba
      giwa authored
      This patch brings the Python API for Spark Streaming.
      
      This patch is based on work from @giwa
      
      Author: giwa <ugw.gi.world@gmail.com>
      Author: Ken Takagiwa <ken@Kens-MacBook-Pro.local>
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Ken Takagiwa <ken@kens-mbp.gateway.sonic.net>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Ken <ugw.gi.world@gmail.com>
      Author: Ken Takagiwa <ugw.gi.world@gmail.com>
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2538 from davies/streaming and squashes the following commits:
      
      64561e4 [Davies Liu] fix tests
      331ecce [Davies Liu] fix example
      3e2492b [Davies Liu] change updateStateByKey() to easy API
      182be73 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      02d0575 [Davies Liu] add wrapper for foreachRDD()
      bebeb4a [Davies Liu] address all comments
      6db00da [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      8380064 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      52c535b [Davies Liu] remove fix for sum()
      e108ec1 [Davies Liu]  address comments
      37fe06f [Davies Liu] use random port for callback server
      d05871e [Davies Liu] remove reuse of PythonRDD
      be5e5ff [Davies Liu] merge branch of env, make tests stable.
      8071541 [Davies Liu] Merge branch 'env' into streaming
      c7bbbce [Davies Liu] fix sphinx docs
      6bb9d91 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      4d0ea8b [Davies Liu] clear reference of SparkEnv after stop
      54bd92b [Davies Liu] improve tests
      c2b31cb [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      7a88f9f [Davies Liu] rollback RDD.setContext(), use textFileStream() to test checkpointing
      bd8a4c2 [Davies Liu] fix scala style
      7797c70 [Davies Liu] refactor
      ff88bec [Davies Liu] rename RDDFunction to TransformFunction
      d328aca [Davies Liu] fix serializer in queueStream
      6f0da2f [Davies Liu] recover from checkpoint
      fa7261b [Davies Liu] refactor
      a13ff34 [Davies Liu] address comments
      8466916 [Davies Liu] support checkpoint
      9a16bd1 [Davies Liu] change number of partitions during tests
      b98d63f [Davies Liu] change private[spark] to private[python]
      eed6e2a [Davies Liu] rollback not needed changes
      e00136b [Davies Liu] address comments
      069a94c [Davies Liu] fix the number of partitions during window()
      338580a [Davies Liu] change _first(), _take(), _collect() as private API
      19797f9 [Davies Liu] clean up
      6ebceca [Davies Liu] add more tests
      c40c52d [Davies Liu] change first(), take(n) to has the same behavior as RDD
      98ac6c2 [Davies Liu] support ssc.transform()
      b983f0f [Davies Liu] address comments
      847f9b9 [Davies Liu] add more docs, add first(), take()
      e059ca2 [Davies Liu] move check of window into Python
      fce0ef5 [Davies Liu] rafactor of foreachRDD()
      7001b51 [Davies Liu] refactor of queueStream()
      26ea396 [Davies Liu] refactor
      74df565 [Davies Liu] fix print and docs
      b32774c [Davies Liu] move java_import into streaming
      604323f [Davies Liu] enable streaming tests
      c499ba0 [Davies Liu] remove Time and Duration
      3f0fb4b [Davies Liu] refactor fix tests
      c28f520 [Davies Liu] support updateStateByKey
      d357b70 [Davies Liu] support windowed dstream
      bd13026 [Davies Liu] fix examples
      eec401e [Davies Liu] refactor, combine TransformedRDD, fix reuse PythonRDD, fix union
      9a57685 [Davies Liu] fix python style
      bd27874 [Davies Liu] fix scala style
      7339be0 [Davies Liu] delete tests
      7f53086 [Davies Liu] support transform(), refactor and cleanup
      df098fc [Davies Liu] Merge branch 'master' into giwa
      550dfd9 [giwa] WIP fixing 1.1 merge
      5cdb6fa [giwa] changed for SCCallSiteSync
      e685853 [giwa] meged with rebased 1.1 branch
      2d32a74 [giwa] added some StreamingContextTestSuite
      4a59e1e [giwa] WIP:added more test for StreamingContext
      8ffdbf1 [giwa] added atexit to handle callback server
      d5f5fcb [giwa] added comment for StreamingContext.sparkContext
      63c881a [giwa] added StreamingContext.sparkContext
      d39f102 [giwa] added StreamingContext.remember
      d542743 [giwa] clean up code
      2fdf0de [Matthew Farrellee] Fix scalastyle errors
      c0a06bc [giwa] delete not implemented functions
      f385976 [giwa] delete inproper comments
      b0f2015 [giwa] added comment in dstream._test_output
      bebb3f3 [giwa] remove the last brank line
      fbed8da [giwa] revert pom.xml
      8ed93af [giwa] fixed explanaiton
      066ba90 [giwa] revert pom.xml
      fa4af88 [giwa] remove duplicated import
      6ae3caa [giwa] revert pom.xml
      7dc7391 [giwa] fixed typo
      62dc7a3 [giwa] clean up exmples
      f04882c [giwa] clen up examples
      b171ec3 [giwa] fixed pep8 violation
      f198d14 [giwa] clean up code
      3166d31 [giwa] clean up
      c00e091 [giwa] change test case not to use awaitTermination
      e80647e [giwa] adopted the latest compression way of python command
      58e41ff [giwa] merge with master
      455e5af [giwa] removed wasted print in DStream
      af336b7 [giwa] add comments
      ddd4ee1 [giwa] added TODO coments
      99ce042 [giwa] added saveAsTextFiles and saveAsPickledFiles
      2a06cdb [giwa] remove waste duplicated code
      c5ecfc1 [giwa] basic function test cases are passed
      8dcda84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      795b2cd [giwa] broke something
      1e126bf [giwa] WIP: solved partitioned and None is not recognized
      f67cf57 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      953deb0 [giwa] edited the comment to add more precise description
      af610d3 [giwa] removed unnesessary changes
      c1d546e [giwa] fixed PEP-008 violation
      99410be [giwa] delete waste file
      b3b0362 [giwa] added basic operation test cases
      9cde7c9 [giwa] WIP added test case
      bd3ba53 [giwa] WIP
      5c04a5f [giwa] WIP: added PythonTestInputStream
      019ef38 [giwa] WIP
      1934726 [giwa] update comment
      376e3ac [giwa] WIP
      932372a [giwa] clean up dstream.py
      0b09cff [giwa] added stop in StreamingContext
      92e333e [giwa] implemented reduce and count function in Dstream
      1b83354 [giwa] Removed the waste line
      88f7506 [Ken Takagiwa] Kill py4j callback server properly
      54b5358 [Ken Takagiwa] tried to restart callback server
      4f07163 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      fe02547 [Ken Takagiwa] remove waste file
      2ad7bd3 [Ken Takagiwa] clean up codes
      6197a11 [Ken Takagiwa] clean up code
      eb4bf48 [Ken Takagiwa] fix map function
      98c2a00 [Ken Takagiwa] added count operation but this implementation need double check
      58591d2 [Ken Takagiwa] reduceByKey is working
      0df7111 [Ken Takagiwa] delete old file
      f485b1d [Ken Takagiwa] fied input of socketTextDStream
      dd6de81 [Ken Takagiwa] initial commit for socketTextStream
      247fd74 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      4bcb318 [Ken Takagiwa] implementing transform function in Python
      38adf95 [Ken Takagiwa] added reducedByKey not working yet
      66fcfff [Ken Takagiwa] modify dstream.py to fix indent error
      41886c2 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      0b99bec [Ken] initial commit for pySparkStreaming
      c214199 [giwa] added testcase for combineByKey
      5625bdc [giwa] added gorupByKey testcase
      10ab87b [giwa] added sparkContext as input parameter in StreamingContext
      10b5b04 [giwa] removed wasted print in DStream
      e54f986 [giwa] add comments
      16aa64f [giwa] added TODO coments
      74535d4 [giwa] added saveAsTextFiles and saveAsPickledFiles
      f76c182 [giwa] remove waste duplicated code
      18c8723 [giwa] modified streaming test case to add coment
      13fb44c [giwa] basic function test cases are passed
      3000b2b [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      ff14070 [giwa] broke something
      bcdec33 [giwa] WIP: solved partitioned and None is not recognized
      270a9e1 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      bb10956 [giwa] edited the comment to add more precise description
      253a863 [giwa] removed unnesessary changes
      3d37822 [giwa] fixed PEP-008 violation
      f21cab3 [giwa] delete waste file
      878bad7 [giwa] added basic operation test cases
      ce2acd2 [giwa] WIP added test case
      9ad6855 [giwa] WIP
      1df77f5 [giwa] WIP: added PythonTestInputStream
      1523b66 [giwa] WIP
      8a0fbbc [giwa] update comment
      fe648e3 [giwa] WIP
      29c2bc5 [giwa] initial commit for testcase
      4d40d63 [giwa] clean up dstream.py
      c462bb3 [giwa] added stop in StreamingContext
      d2c01ba [giwa] clean up examples
      3c45cd2 [giwa] implemented reduce and count function in Dstream
      b349649 [giwa] Removed the waste line
      3b498e1 [Ken Takagiwa] Kill py4j callback server properly
      84a9668 [Ken Takagiwa] tried to restart callback server
      9ab8952 [Tathagata Das] Added extra line.
      05e991b [Tathagata Das] Added missing file
      b1d2a30 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      678e854 [Ken Takagiwa] remove waste file
      0a8bbbb [Ken Takagiwa] clean up codes
      bab31c1 [Ken Takagiwa] clean up code
      72b9738 [Ken Takagiwa] fix map function
      d3ee86a [Ken Takagiwa] added count operation but this implementation need double check
      15feea9 [Ken Takagiwa] edit python sparkstreaming example
      6f98e50 [Ken Takagiwa] reduceByKey is working
      c455c8d [Ken Takagiwa] added reducedByKey not working yet
      dc6995d [Ken Takagiwa] delete old file
      b31446a [Ken Takagiwa] fixed typo of network_workdcount.py
      ccfd214 [Ken Takagiwa] added doctest for pyspark.streaming.duration
      0d1b954 [Ken Takagiwa] fied input of socketTextDStream
      f746109 [Ken Takagiwa] initial commit for socketTextStream
      bb7ccf3 [Ken Takagiwa] remove unused import in python
      224fc5e [Ken Takagiwa] add empty line
      d2099d8 [Ken Takagiwa] sorted the import following Spark coding convention
      5bac7ec [Ken Takagiwa] revert streaming/pom.xml
      e1df940 [Ken Takagiwa] revert pom.xml
      494cae5 [Ken Takagiwa] remove not implemented DStream functions in python
      17a74c6 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      1a0f065 [Ken Takagiwa] implementing transform function in Python
      d7b4d6f [Ken Takagiwa] added reducedByKey not working yet
      87438e2 [Ken Takagiwa] modify dstream.py to fix indent error
      b406252 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      454981d [Ken] initial commit for pySparkStreaming
      150b94c [giwa] added some StreamingContextTestSuite
      f7bc8f9 [giwa] WIP:added more test for StreamingContext
      ee50c5a [giwa] added atexit to handle callback server
      fdc9125 [giwa] added comment for StreamingContext.sparkContext
      f5bfb70 [giwa] added StreamingContext.sparkContext
      da09768 [giwa] added StreamingContext.remember
      d68b568 [giwa] clean up code
      4afa390 [giwa] clean up code
      1fd6bc7 [Ken Takagiwa] Merge pull request #2 from mattf/giwa-master
      d9d59fe [Matthew Farrellee] Fix scalastyle errors
      67473a9 [giwa] delete not implemented functions
      c97377c [giwa] delete inproper comments
      2ea769e [giwa] added comment in dstream._test_output
      3b27bd4 [giwa] remove the last brank line
      acfcaeb [giwa] revert pom.xml
      93f7637 [giwa] fixed explanaiton
      50fd6f9 [giwa] revert pom.xml
      4f82c89 [giwa] remove duplicated import
      9d1de23 [giwa] revert pom.xml
      7339df2 [giwa] fixed typo
      9c85e48 [giwa] clean up exmples
      24f95db [giwa] clen up examples
      0d30109 [giwa] fixed pep8 violation
      b7dab85 [giwa] improve test case
      583e66d [giwa] move tests for streaming inside streaming directory
      1d84142 [giwa] remove unimplement test
      f0ea311 [giwa] clean up code
      171edeb [giwa] clean up
      4dedd2d [giwa] change test case not to use awaitTermination
      268a6a5 [giwa] Changed awaitTermination not to call awaitTermincation in Scala. Just use time.sleep instread
      09a28bf [giwa] improve testcases
      58150f5 [giwa] Changed the test case to focus the test operation
      199e37f [giwa] adopted the latest compression way of python command
      185fdbf [giwa] merge with master
      f1798c4 [giwa] merge with master
      e70f706 [giwa] added testcase for combineByKey
      e162822 [giwa] added gorupByKey testcase
      97742fe [giwa] added sparkContext as input parameter in StreamingContext
      14d4c0e [giwa] removed wasted print in DStream
      6d8190a [giwa] add comments
      4aa99e4 [giwa] added TODO coments
      e9fab72 [giwa] added saveAsTextFiles and saveAsPickledFiles
      94f2b65 [giwa] remove waste duplicated code
      580fbc2 [giwa] modified streaming test case to add coment
      99e4bb3 [giwa] basic function test cases are passed
      7051a84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      35933e1 [giwa] broke something
      9767712 [giwa] WIP: solved partitioned and None is not recognized
      4f2d7e6 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      33c0f94d [giwa] edited the comment to add more precise description
      774f18d [giwa] removed unnesessary changes
      3a671cc [giwa] remove export PYSPARK_PYTHON in spark submit
      8efa266 [giwa] fixed PEP-008 violation
      fa75d71 [giwa] delete waste file
      7f96294 [giwa] added basic operation test cases
      3dda31a [giwa] WIP added test case
      1f68b78 [giwa] WIP
      c05922c [giwa] WIP: added PythonTestInputStream
      1fd12ae [giwa] WIP
      c880a33 [giwa] update comment
      5d22c92 [giwa] WIP
      ea4b06b [giwa] initial commit for testcase
      5a9b525 [giwa] clean up dstream.py
      79c5809 [giwa] added stop in StreamingContext
      189dcea [giwa] clean up examples
      b8d7d24 [giwa] implemented reduce and count function in Dstream
      b6468e6 [giwa] Removed the waste line
      b47b5fd [Ken Takagiwa] Kill py4j callback server properly
      19ddcdd [Ken Takagiwa] tried to restart callback server
      c9fc124 [Tathagata Das] Added extra line.
      4caae3f [Tathagata Das] Added missing file
      4eff053 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      5e822d4 [Ken Takagiwa] remove waste file
      aeaf8a5 [Ken Takagiwa] clean up codes
      9fa249b [Ken Takagiwa] clean up code
      05459c6 [Ken Takagiwa] fix map function
      a9f4ecb [Ken Takagiwa] added count operation but this implementation need double check
      d1ee6ca [Ken Takagiwa] edit python sparkstreaming example
      0b8b7d0 [Ken Takagiwa] reduceByKey is working
      d25d5cf [Ken Takagiwa] added reducedByKey not working yet
      7f7c5d1 [Ken Takagiwa] delete old file
      967dc26 [Ken Takagiwa] fixed typo of network_workdcount.py
      57fb740 [Ken Takagiwa] added doctest for pyspark.streaming.duration
      4b69fb1 [Ken Takagiwa] fied input of socketTextDStream
      02f618a [Ken Takagiwa] initial commit for socketTextStream
      4ce4058 [Ken Takagiwa] remove unused import in python
      856d98e [Ken Takagiwa] add empty line
      490e338 [Ken Takagiwa] sorted the import following Spark coding convention
      5594bd4 [Ken Takagiwa] revert pom.xml
      2adca84 [Ken Takagiwa] remove not implemented DStream functions in python
      e551e13 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      3758175 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      c5518b4 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      dcf243f [Ken Takagiwa] implementing transform function in Python
      9af03f4 [Ken Takagiwa] added reducedByKey not working yet
      6e0d9c7 [Ken Takagiwa] modify dstream.py to fix indent error
      e497b9b [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      5c3a683 [Ken] initial commit for pySparkStreaming
      665bfdb [giwa] added testcase for combineByKey
      a3d2379 [giwa] added gorupByKey testcase
      636090a [giwa] added sparkContext as input parameter in StreamingContext
      e7ebb08 [giwa] removed wasted print in DStream
      d8b593b [giwa] add comments
      ea9c873 [giwa] added TODO coments
      89ae38a [giwa] added saveAsTextFiles and saveAsPickledFiles
      e3033fc [giwa] remove waste duplicated code
      a14c7e1 [giwa] modified streaming test case to add coment
      536def4 [giwa] basic function test cases are passed
      2112638 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      080541a [giwa] broke something
      0704b86 [giwa] WIP: solved partitioned and None is not recognized
      90a6484 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      a65f302 [giwa] edited the comment to add more precise description
      bdde697 [giwa] removed unnesessary changes
      e8c7bfc [giwa] remove export PYSPARK_PYTHON in spark submit
      3334169 [giwa] fixed PEP-008 violation
      db0a303 [giwa] delete waste file
      2cfd3a0 [giwa] added basic operation test cases
      90ae568 [giwa] WIP added test case
      a120d07 [giwa] WIP
      f671cdb [giwa] WIP: added PythonTestInputStream
      56fae45 [giwa] WIP
      e35e101 [giwa] Merge branch 'master' into testcase
      ba5112d [giwa] update comment
      28aa56d [giwa] WIP
      fb08559 [giwa] initial commit for testcase
      a613b85 [giwa] clean up dstream.py
      c40c0ef [giwa] added stop in StreamingContext
      31e4260 [giwa] clean up examples
      d2127d6 [giwa] implemented reduce and count function in Dstream
      48f7746 [giwa] Removed the waste line
      0f83eaa [Ken Takagiwa] delete py4j 0.8.1
      1679808 [Ken Takagiwa] Kill py4j callback server properly
      f96cd4e [Ken Takagiwa] tried to restart callback server
      fe86198 [Ken Takagiwa] add py4j 0.8.2.1 but server is not launched
      1064fe0 [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
      28c6620 [Ken Takagiwa] Implemented DStream.foreachRDD in the Python API using Py4J callback server
      85b0fe1 [Ken Takagiwa] Merge pull request #1 from tdas/python-foreach
      54e2e8c [Tathagata Das] Added extra line.
      e185338 [Tathagata Das] Added missing file
      a778d4b [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      cc2092b [Ken Takagiwa] remove waste file
      d042ac6 [Ken Takagiwa] clean up codes
      84a021f [Ken Takagiwa] clean up code
      bd20e17 [Ken Takagiwa] fix map function
      d01a125 [Ken Takagiwa] added count operation but this implementation need double check
      7d05109 [Ken Takagiwa] merge with remote branch
      ae464e0 [Ken Takagiwa] edit python sparkstreaming example
      04af046 [Ken Takagiwa] reduceByKey is working
      3b6d7b0 [Ken Takagiwa] implementing transform function in Python
      571d52d [Ken Takagiwa] added reducedByKey not working yet
      5720979 [Ken Takagiwa] delete old file
      e604fcb [Ken Takagiwa] fixed typo of network_workdcount.py
      4b7c08b [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
      ce7d426 [Ken Takagiwa] added doctest for pyspark.streaming.duration
      a8c9fd5 [Ken Takagiwa] fixed for socketTextStream
      a61fa9e [Ken Takagiwa] fied input of socketTextDStream
      1e84f41 [Ken Takagiwa] initial commit for socketTextStream
      6d012f7 [Ken Takagiwa] remove unused import in python
      25d30d5 [Ken Takagiwa] add empty line
      6e0a64a [Ken Takagiwa] sorted the import following Spark coding convention
      fa4a7fc [Ken Takagiwa] revert streaming/pom.xml
      8f8202b [Ken Takagiwa] revert streaming pom.xml
      c9d79dd [Ken Takagiwa] revert pom.xml
      57e3e52 [Ken Takagiwa] remove not implemented DStream functions in python
      0a516f5 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      a7a0b5c [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      72bfc66 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      69e9cd3 [Ken Takagiwa] implementing transform function in Python
      94a0787 [Ken Takagiwa] added reducedByKey not working yet
      88068cf [Ken Takagiwa] modify dstream.py to fix indent error
      1367be5 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      eb2b3ba [Ken] Merge remote-tracking branch 'upstream/master'
      d8e51f9 [Ken] initial commit for pySparkStreaming
      69c67aba
  11. Oct 10, 2014
    • Davies Liu's avatar
      [SPARK-3886] [PySpark] use AutoBatchedSerializer by default · 72f36ee5
      Davies Liu authored
      Use AutoBatchedSerializer by default, which chooses the proper batch size based on the size of the serialized objects, keeping the size of a serialized batch in the range [64k, 640k].
      
      In the JVM, the serializer also tracks the objects in a batch to detect duplicated objects; a larger batch may cause OOM in the JVM.
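
      A hedged sketch of how an auto-batched serializer can adapt its batch size toward that window (length framing and the exact growth policy of the real AutoBatchedSerializer are omitted):

      ```python
      import cPickle
      from itertools import islice

      def dump_auto_batched(iterator, stream, best_size=1 << 16):
          """Pickle items in adaptively sized batches, targeting ~64 KB per batch."""
          batch = 1
          iterator = iter(iterator)
          while True:
              items = list(islice(iterator, batch))
              if not items:
                  break
              data = cPickle.dumps(items, protocol=2)
              stream.write(data)
              if len(data) < best_size:
                  batch *= 2                      # serialized batch is small: grow
              elif len(data) > best_size * 10 and batch > 1:
                  batch //= 2                     # serialized batch is too big: shrink
      ```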
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2740 from davies/batchsize and squashes the following commits:
      
      52cdb88 [Davies Liu] update docs
      185f2b9 [Davies Liu] use AutoBatchedSerializer by default
      72f36ee5
  12. Oct 06, 2014
    • Sandy Ryza's avatar
      [SPARK-2461] [PySpark] Add a toString method to GeneralizedLinearModel · 20ea54cc
      Sandy Ryza authored
      Add a toString method to GeneralizedLinearModel, and change `__str__` to `__repr__` for some classes to provide a better message in repr.
      
      This PR is based on #1388, thanks to sryza!
      
      closes #1388
      
      Author: Sandy Ryza <sandy@cloudera.com>
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2625 from davies/string and squashes the following commits:
      
      3544aad [Davies Liu] fix LinearModel
      0bcd642 [Davies Liu] Merge branch 'sandy-spark-2461' of github.com:sryza/spark
      1ce5c2d [Sandy Ryza] __repr__ back to __str__ in a couple places
      aa9e962 [Sandy Ryza] Switch __str__ to __repr__
      a0c5041 [Sandy Ryza] Add labels back in
      1aa17f5 [Sandy Ryza] Match existing conventions
      fac1bc4 [Sandy Ryza] Fix PEP8 error
      f7b58ed [Sandy Ryza] SPARK-2461. Add a toString method to GeneralizedLinearModel
      20ea54cc
  13. Sep 19, 2014
    • Davies Liu's avatar
      [SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib · fce5e251
      Davies Liu authored
      Currently, data between the JVM and Python is serialized case by case, manually; this cannot scale to support the many APIs in MLlib.
      
      This patch addresses the problem by serializing the data using the pickle protocol, with the Pyrolite library handling serialization/deserialization in the JVM. The pickle protocol can easily be extended to support custom classes.
      
      All the modules are refactored to use this protocol.
      
      Known issues: there will be some performance regression (both CPU and memory, since the serialized data is larger).
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2378 from davies/pickle_mllib and squashes the following commits:
      
      dffbba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into pickle_mllib
      810f97f [Davies Liu] fix equal of matrix
      032cd62 [Davies Liu] add more type check and conversion for user_product
      bd738ab [Davies Liu] address comments
      e431377 [Davies Liu] fix cache of rdd, refactor
      19d0967 [Davies Liu] refactor Picklers
      2511e76 [Davies Liu] cleanup
      1fccf1a [Davies Liu] address comments
      a2cc855 [Davies Liu] fix tests
      9ceff73 [Davies Liu] test size of serialized Rating
      44e0551 [Davies Liu] fix cache
      a379a81 [Davies Liu] fix pickle array in python2.7
      df625c7 [Davies Liu] Merge commit '154d141' into pickle_mllib
      154d141 [Davies Liu] fix autobatchedpickler
      44736d7 [Davies Liu] speed up pickling array in Python 2.7
      e1d1bfc [Davies Liu] refactor
      708dc02 [Davies Liu] fix tests
      9dcfb63 [Davies Liu] fix style
      88034f0 [Davies Liu] rafactor, address comments
      46a501e [Davies Liu] choose batch size automatically
      df19464 [Davies Liu] memorize the module and class name during pickleing
      f3506c5 [Davies Liu] Merge branch 'master' into pickle_mllib
      722dd96 [Davies Liu] cleanup _common.py
      0ee1525 [Davies Liu] remove outdated tests
      b02e34f [Davies Liu] remove _common.py
      84c721d [Davies Liu] Merge branch 'master' into pickle_mllib
      4d7963e [Davies Liu] remove muanlly serialization
      6d26b03 [Davies Liu] fix tests
      c383544 [Davies Liu] classification
      f2a0856 [Davies Liu] mllib/regression
      d9f691f [Davies Liu] mllib/util
      cccb8b1 [Davies Liu] mllib/tree
      8fe166a [Davies Liu] Merge branch 'pickle' into pickle_mllib
      aa2287e [Davies Liu] random
      f1544c4 [Davies Liu] refactor clustering
      52d1350 [Davies Liu] use new protocol in mllib/stat
      b30ef35 [Davies Liu] use pickle to serialize data for mllib/recommendation
      f44f771 [Davies Liu] enable tests about array
      3908f5c [Davies Liu] Merge branch 'master' into pickle
      c77c87b [Davies Liu] cleanup debugging code
      60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
      fce5e251
  14. Sep 16, 2014
    • Davies Liu's avatar
      [SPARK-3430] [PySpark] [Doc] generate PySpark API docs using Sphinx · ec1adecb
      Davies Liu authored
      Using Sphinx to generate API docs for PySpark.
      
      requirement: Sphinx
      
      ```
      $ cd python/docs/
      $ make html
      ```
      
      The generated API docs will be located at python/docs/_build/html/index.html
      
      It can co-exist with the docs generated by Epydoc.
      
      This is the first working version; after merging it in, we can continue to improve it and eventually replace Epydoc.
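
      For reference, the core of a Sphinx setup for autogenerated API docs looks roughly like the snippet below (a generic conf.py sketch, not the exact file added by this PR):

      ```python
      # python/docs/conf.py (sketch)
      import os
      import sys

      # make the pyspark package importable for autodoc
      sys.path.insert(0, os.path.abspath(".."))

      extensions = ["sphinx.ext.autodoc", "sphinx.ext.viewcode"]
      project = "PySpark"
      master_doc = "index"
      html_theme = "default"
      ```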
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2292 from davies/sphinx and squashes the following commits:
      
      425a3b1 [Davies Liu] cleanup
      1573298 [Davies Liu] move docs to python/docs/
      5fe3903 [Davies Liu] Merge branch 'master' into sphinx
      9468ab0 [Davies Liu] fix makefile
      b408f38 [Davies Liu] address all comments
      e2ccb1b [Davies Liu] update name and version
      9081ead [Davies Liu] generate PySpark API docs using Sphinx
      ec1adecb
  15. Sep 13, 2014
    • Davies Liu's avatar
      [SPARK-3030] [PySpark] Reuse Python worker · 2aea0da8
      Davies Liu authored
      Reuse Python workers to avoid the overhead of forking a Python process for each task. It also tracks the broadcasts for each worker, to avoid sending repeated broadcasts.
      
      This reduces the time for a dummy task from 22 ms to 13 ms (-40%). It can help reduce the latency for Spark Streaming.
      
      For a job with a broadcast variable (43 MB after compression):
      ```
          b = sc.broadcast(set(range(30000000)))
          print sc.parallelize(range(24000), 100).filter(lambda x: x in b.value).count()
      ```
      It finishes in 281 s without worker reuse, and in 65 s with worker reuse (4 CPUs). Reusing the worker saves about 9 seconds of transferring and deserializing the broadcast for each task.
      
      It's enabled by default and can be disabled with `spark.python.worker.reuse = false`.
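
      For example, worker reuse can be turned off through SparkConf (a short usage sketch):

      ```python
      from pyspark import SparkConf, SparkContext

      conf = SparkConf().set("spark.python.worker.reuse", "false")
      sc = SparkContext(conf=conf)   # each task now forks a fresh Python worker
      ```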
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2259 from davies/reuse-worker and squashes the following commits:
      
      f11f617 [Davies Liu] Merge branch 'master' into reuse-worker
      3939f20 [Davies Liu] fix bug in serializer in mllib
      cf1c55e [Davies Liu] address comments
      3133a60 [Davies Liu] fix accumulator with reused worker
      760ab1f [Davies Liu] do not reuse worker if there are any exceptions
      7abb224 [Davies Liu] refactor: sychronized with itself
      ac3206e [Davies Liu] renaming
      8911f44 [Davies Liu] synchronized getWorkerBroadcasts()
      6325fc1 [Davies Liu] bugfix: bid >= 0
      e0131a2 [Davies Liu] fix name of config
      583716e [Davies Liu] only reuse completed and not interrupted worker
      ace2917 [Davies Liu] kill python worker after timeout
      6123d0f [Davies Liu] track broadcasts for each worker
      8d2f08c [Davies Liu] reuse python worker
      2aea0da8
  16. Sep 12, 2014
    • Davies Liu's avatar
      [SPARK-3094] [PySpark] compatible with PyPy · 71af030b
      Davies Liu authored
      After this patch, we can run PySpark on PyPy (tested with PyPy 2.3.1 on Mac OS 10.9), for example:
      
      ```
      PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
      ```
      
      The performance speedup depends on the workload (from 20% to 3000%). Here are some benchmarks:
      
       Job | CPython 2.7 | PyPy 2.3.1  | Speed up
       ------- | ------------ | ------------- | -------
       Word Count | 41s   | 15s  | 2.7x
       Sort | 46s |  44s | 1.05x
       Stats | 174s | 3.6s | 48x
      
      Here is the code used for benchmark:
      
      ```python
      rdd = sc.textFile("text")
      def wordcount():
          rdd.flatMap(lambda x:x.split('/'))\
              .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
      def sort():
          rdd.sortBy(lambda x:x, 1).count()
      def stats():
          sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
      ```
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2144 from davies/pypy and squashes the following commits:
      
      9aed6c5 [Davies Liu] use protocol 2 in CloudPickle
      4bc1f04 [Davies Liu] refactor
      b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way
      3ca2351 [Davies Liu] Merge branch 'master' into pypy
      fae8b19 [Davies Liu] improve attrgetter, add tests
      591f830 [Davies Liu] try to run tests with PyPy in run-tests
      c8d62ba [Davies Liu] cleanup
      f651fd0 [Davies Liu] fix tests using array with PyPy
      1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways
      3c1dbfe [Davies Liu] Merge branch 'master' into pypy
      42fb5fa [Davies Liu] Merge branch 'master' into pypy
      cb2d724 [Davies Liu] fix tests
      9986692 [Davies Liu] Merge branch 'master' into pypy
      25b4ca7 [Davies Liu] support PyPy
      71af030b
  17. Sep 11, 2014
    • Davies Liu's avatar
      [SPARK-3047] [PySpark] add an option to use str in textFileRDD · 1ef656ea
      Davies Liu authored
      str is much more efficient than unicode (both CPU and memory), so it's better to use str in textFileRDD. In order to keep compatibility, unicode is used by default (this may change in the future).
      
      use_unicode=True:
      
      daviesliudm:~/work/spark$ time python wc.py
      (u'./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)
      
      real	2m8.298s
      user	0m0.185s
      sys	0m0.064s
      
      use_unicode=False
      
      daviesliudm:~/work/spark$ time python wc.py
      ('./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)
      
      real	1m26.402s
      user	0m0.182s
      sys	0m0.062s
      
      We can see a 32% improvement!
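
      Usage is a single keyword argument on textFile() (a sketch assuming an existing SparkContext `sc`; the path is a placeholder):

      ```python
      # default: lines are unicode objects
      lines = sc.textFile("hdfs:///data/logs")

      # opt in to plain str for lower CPU and memory usage
      raw_lines = sc.textFile("hdfs:///data/logs", use_unicode=False)
      ```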
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1951 from davies/unicode and squashes the following commits:
      
      8352d57 [Davies Liu] update version number
      a286f2f [Davies Liu] rollback loads()
      85246e5 [Davies Liu] add docs for use_unicode
      a0295e1 [Davies Liu] add an option to use str in textFile()
      1ef656ea
  18. Sep 03, 2014
    • Davies Liu's avatar
      [SPARK-3309] [PySpark] Put all public API in __all__ · 6481d274
      Davies Liu authored
      Put all public APIs in __all__ and also list them in pyspark/__init__.py, so that the documentation for the public API can be obtained with `pydoc pyspark`. It can also be used by other programs (such as Sphinx or Epydoc) to generate documents only for public APIs.
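
      A minimal illustration of the convention (a hypothetical module, not the actual pyspark sources):

      ```python
      """A hypothetical module illustrating the __all__ convention."""

      __all__ = ["SparkThing"]           # only this name is public

      class SparkThing(object):
          """Part of the public API; picked up by doc generators and star imports."""

      class _Helper(object):
          """Private by convention and excluded from __all__."""
      ```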
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2205 from davies/public and squashes the following commits:
      
      c6c5567 [Davies Liu] fix message
      f7b35be [Davies Liu] put SchemeRDD, Row in pyspark.sql module
      7e3016a [Davies Liu] add __all__ in mllib
      6281b48 [Davies Liu] fix doc for SchemaRDD
      6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py
      6481d274
  19. Aug 19, 2014
    • Davies Liu's avatar
      [SPARK-2790] [PySpark] fix zip with serializers which have different batch sizes. · d7e80c25
      Davies Liu authored
      If two RDDs have different batch sizes in their serializers, re-serialize the one with the smaller batch size, then call RDD.zip() in Spark.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1894 from davies/zip and squashes the following commits:
      
      c4652ea [Davies Liu] add more test cases
      6d05fc8 [Davies Liu] Merge branch 'master' into zip
      813b1e4 [Davies Liu] add more tests for failed cases
      a4aafda [Davies Liu] fix zip with serializers which have different batch sizes.
      d7e80c25
  20. Aug 16, 2014
    • Davies Liu's avatar
      [SPARK-1065] [PySpark] improve supporting for large broadcast · 2fc8aca0
      Davies Liu authored
      Passing large objects via Py4J is very slow (and costs a lot of memory), so pass broadcast objects via files (similar to parallelize()).
      
      Add an option to keep the object in the driver (False by default) to save memory in the driver.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1912 from davies/broadcast and squashes the following commits:
      
      e06df4a [Davies Liu] load broadcast from disk in driver automatically
      db3f232 [Davies Liu] fix serialization of accumulator
      631a827 [Davies Liu] Merge branch 'master' into broadcast
      c7baa8c [Davies Liu] compress serrialized broadcast and command
      9a7161f [Davies Liu] fix doc tests
      e93cf4b [Davies Liu] address comments: add test
      6226189 [Davies Liu] improve large broadcast
      2fc8aca0
  21. Aug 11, 2014
    • Josh Rosen's avatar
      [PySpark] [SPARK-2954] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 Fixes · db06a81f
      Josh Rosen authored
      - Modify python/run-tests to test with Python 2.6
      - Use unittest2 when running on Python 2.6.
      - Fix issue with namedtuple.
      - Skip TestOutputFormat.test_newhadoop on Python 2.6 until SPARK-2951 is fixed.
      - Fix MLlib _deserialize_double on Python 2.6.
      
      Closes #1868.  Closes #1042.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1874 from JoshRosen/python2.6 and squashes the following commits:
      
      983d259 [Josh Rosen] [SPARK-2954] Fix MLlib _deserialize_double on Python 2.6.
      5d18fd7 [Josh Rosen] [SPARK-2948] [SPARK-2910] [SPARK-2101] Python 2.6 fixes
      db06a81f
  22. Aug 06, 2014
    • Nicholas Chammas's avatar
      [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically · d614967b
      Nicholas Chammas authored
      As described in [SPARK-2627](https://issues.apache.org/jira/browse/SPARK-2627), we'd like Python code to automatically be checked for PEP 8 compliance by Jenkins. This pull request aims to do that.
      
      Notes:
      * We may need to install [`pep8`](https://pypi.python.org/pypi/pep8) on the build server.
      * I'm expecting tests to fail now that PEP 8 compliance is being checked as part of the build. I'm fine with cleaning up any remaining PEP 8 violations as part of this pull request.
      * I did not understand why the RAT and scalastyle reports are saved to text files. I did the same for the PEP 8 check, but only so that the console output style can match those for the RAT and scalastyle checks. The PEP 8 report is removed right after the check is complete.
      * Updates to the ["Contributing to Spark"](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) guide will be submitted elsewhere, as I don't believe that text is part of the Spark repo.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1744 from nchammas/master and squashes the following commits:
      
      274b238 [Nicholas Chammas] [SPARK-2627] [PySpark] minor indentation changes
      983d963 [nchammas] Merge pull request #5 from apache/master
      1db5314 [nchammas] Merge pull request #4 from apache/master
      0e0245f [Nicholas Chammas] [SPARK-2627] undo erroneous whitespace fixes
      bf30942 [Nicholas Chammas] [SPARK-2627] PEP8: comment spacing
      6db9a44 [nchammas] Merge pull request #3 from apache/master
      7b4750e [Nicholas Chammas] merge upstream changes
      91b7584 [Nicholas Chammas] [SPARK-2627] undo unnecessary line breaks
      44e3e56 [Nicholas Chammas] [SPARK-2627] use tox.ini to exclude files
      b09fae2 [Nicholas Chammas] don't wrap comments unnecessarily
      bfb9f9f [Nicholas Chammas] [SPARK-2627] keep up with the PEP 8 fixes
      9da347f [nchammas] Merge pull request #2 from apache/master
      aa5b4b5 [Nicholas Chammas] [SPARK-2627] follow Spark bash style for if blocks
      d0a83b9 [Nicholas Chammas] [SPARK-2627] check that pep8 downloaded fine
      dffb5dd [Nicholas Chammas] [SPARK-2627] download pep8 at runtime
      a1ce7ae [Nicholas Chammas] [SPARK-2627] space out test report sections
      21da538 [Nicholas Chammas] [SPARK-2627] it's PEP 8, not PEP8
      6f4900b [Nicholas Chammas] [SPARK-2627] more misc PEP 8 fixes
      fe57ed0 [Nicholas Chammas] removing merge conflict backups
      9c01d4c [nchammas] Merge pull request #1 from apache/master
      9a66cb0 [Nicholas Chammas] resolving merge conflicts
      a31ccc4 [Nicholas Chammas] [SPARK-2627] miscellaneous PEP 8 fixes
      beaa9ac [Nicholas Chammas] [SPARK-2627] fail check on non-zero status
      723ed39 [Nicholas Chammas] always delete the report file
      0541ebb [Nicholas Chammas] [SPARK-2627] call Python linter from run-tests
      12440fa [Nicholas Chammas] [SPARK-2627] add Scala linter
      61c07b9 [Nicholas Chammas] [SPARK-2627] add Python linter
      75ad552 [Nicholas Chammas] make check output style consistent
      d614967b
  23. Aug 04, 2014
    • Davies Liu's avatar
      [SPARK-1687] [PySpark] fix unit tests related to pickable namedtuple · 9fd82dbb
      Davies Liu authored
      The serializer module is imported multiple times during doctests, so it's better to make _hijack_namedtuple() safe to call multiple times.
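      
      A minimal sketch of such an idempotency guard (the inner replacement is a placeholder, not PySpark's actual pickling fix):
      
          import collections
          
          def _hijack_namedtuple():
              # Bail out if namedtuple has already been replaced, so repeated
              # imports during doctests do not wrap the function a second time.
              if getattr(collections.namedtuple, "_hijacked", False):
                  return
              original = collections.namedtuple
          
              def picklable_namedtuple(*args, **kwargs):
                  return original(*args, **kwargs)  # placeholder for the real pickling fix
          
              picklable_namedtuple._hijacked = True
              collections.namedtuple = picklable_namedtuple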
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1771 from davies/fix and squashes the following commits:
      
      1a9e336 [Davies Liu] fix unit tests
      9fd82dbb
    • Davies Liu's avatar
      [SPARK-1687] [PySpark] pickable namedtuple · 59f84a95
      Davies Liu authored
      Add a hook that replaces the original namedtuple with a picklable one, so that namedtuples can be used in RDDs.
      
      PS: pyspark should be imported BEFORE "from collections import namedtuple"
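      
      A minimal usage sketch (the local master and sample data are illustrative):
      
          import pyspark  # must be imported before namedtuple is pulled in
          from collections import namedtuple
          from pyspark import SparkContext
          
          Point = namedtuple("Point", ["x", "y"])
          
          sc = SparkContext("local[2]", "namedtuple-example")
          points = sc.parallelize([Point(1, 2), Point(3, 4)])
          print(points.map(lambda p: p.x + p.y).collect())  # [3, 7]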
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1623 from davies/namedtuple and squashes the following commits:
      
      045dad8 [Davies Liu] remove unrelated code changes
      4132f32 [Davies Liu] address comment
      55b1c1a [Davies Liu] fix tests
      61f86eb [Davies Liu] replace all the reference of namedtuple to new hacked one
      98df6c6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
      f7b1bde [Davies Liu] add hack for CloudPickleSerializer
      0c5c849 [Davies Liu] Merge branch 'master' of github.com:apache/spark into namedtuple
      21991e6 [Davies Liu] hack namedtuple in __main__ module, make it picklable.
      93b03b8 [Davies Liu] pickable namedtuple
      59f84a95
  24. Jul 25, 2014
    • Davies Liu's avatar
      [SPARK-2538] [PySpark] Hash based disk spilling aggregation · 14174abd
      Davies Liu authored
      During aggregation in the Python worker, if memory usage rises above spark.executor.memory, the worker switches to disk-spilling aggregation.
      
      It splits the aggregation into multiple stages; in each stage, it partitions the aggregated data by hash and dumps it to disk. After all the data has been aggregated, it merges the stages together, partition by partition.
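      
      A much-simplified sketch of the hash-partitioned spilling idea (illustrative only, not PySpark's ExternalMerger; the counting aggregation and directory handling are assumptions):
      
          import os
          import pickle
          from collections import defaultdict
          
          def spill_by_hash(pairs, num_partitions, spill_dir):
              # Hash-partition the partially aggregated data and dump each
              # partition to its own file, freeing memory for the next stage.
              partitions = defaultdict(dict)
              for key, value in pairs:
                  bucket = partitions[hash(key) % num_partitions]
                  bucket[key] = bucket.get(key, 0) + value
              for part, data in partitions.items():
                  with open(os.path.join(spill_dir, "part-%d" % part), "ab") as f:
                      pickle.dump(data, f)
              # Merging later reads the spill files back one partition at a
              # time, so only a single partition needs to fit in memory.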
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1460 from davies/spill and squashes the following commits:
      
      cad91bf [Davies Liu] call gc.collect() after data.clear() to release memory as much as possible.
      37d71f7 [Davies Liu] balance the partitions
      902f036 [Davies Liu] add shuffle.py into run-tests
      dcf03a9 [Davies Liu] fix memory_info() of psutil
      67e6eba [Davies Liu] comment for MAX_TOTAL_PARTITIONS
      f6bd5d6 [Davies Liu] rollback next_limit() again, the performance difference is huge:
      e74b785 [Davies Liu] fix code style and change next_limit to memory_limit
      400be01 [Davies Liu] address all the comments
      6178844 [Davies Liu] refactor and improve docs
      fdd0a49 [Davies Liu] add long doc string for ExternalMerger
      1a97ce4 [Davies Liu] limit used memory and size of objects in partitionBy()
      e6cc7f9 [Davies Liu] Merge branch 'master' into spill
      3652583 [Davies Liu] address comments
      e78a0a0 [Davies Liu] fix style
      24cec6a [Davies Liu] get local directory by SPARK_LOCAL_DIR
      57ee7ef [Davies Liu] update docs
      286aaff [Davies Liu] let spilled aggregation in Python configurable
      e9a40f6 [Davies Liu] recursive merger
      6edbd1f [Davies Liu] Hash based disk spilling aggregation
      14174abd
  25. Jul 22, 2014
    • Nicholas Chammas's avatar
      [SPARK-2470] PEP8 fixes to PySpark · 5d16d5bb
      Nicholas Chammas authored
      This pull request aims to resolve all outstanding PEP8 violations in PySpark.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1505 from nchammas/master and squashes the following commits:
      
      98171af [Nicholas Chammas] [SPARK-2470] revert PEP 8 fixes to cloudpickle
      cba7768 [Nicholas Chammas] [SPARK-2470] wrap expression list in parentheses
      e178dbe [Nicholas Chammas] [SPARK-2470] style - change position of line break
      9127d2b [Nicholas Chammas] [SPARK-2470] wrap expression lists in parentheses
      22132a4 [Nicholas Chammas] [SPARK-2470] wrap conditionals in parentheses
      24639bc [Nicholas Chammas] [SPARK-2470] fix whitespace for doctest
      7d557b7 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to tests.py
      8f8e4c0 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to storagelevel.py
      b3b96cf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to statcounter.py
      d644477 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to worker.py
      aa3a7b6 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to sql.py
      1916859 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to shell.py
      95d1d95 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to serializers.py
      a0fec2e [Nicholas Chammas] [SPARK-2470] PEP8 fixes to mllib
      c85e1e5 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to join.py
      d14f2f1 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to __init__.py
      81fcb20 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to resultiterable.py
      1bde265 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to java_gateway.py
      7fc849c [Nicholas Chammas] [SPARK-2470] PEP8 fixes to daemon.py
      ca2d28b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to context.py
      f4e0039 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to conf.py
      a6d5e4b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to cloudpickle.py
      f0a7ebf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to rddsampler.py
      4dd148f [nchammas] Merge pull request #5 from apache/master
      f7e4581 [Nicholas Chammas] unrelated pep8 fix
      a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
      de7292a [nchammas] Merge pull request #4 from apache/master
      2e4fe00 [nchammas] Merge pull request #3 from apache/master
      89fde08 [nchammas] Merge pull request #2 from apache/master
      69f6e22 [Nicholas Chammas] PEP8 fixes
      2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
      6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
      69da6cf [nchammas] Merge pull request #1 from apache/master
      5d16d5bb
  26. Apr 05, 2014
    • Matei Zaharia's avatar
      SPARK-1421. Make MLlib work on Python 2.6 · 0b855167
      Matei Zaharia authored
      The reason it wasn't working was that a bytearray was being passed to stream.write(), which is not supported in Python 2.6 but is in 2.7. (This array came from NumPy when we converted data to send it over to Java.) Now we just convert those bytearrays to strings of bytes, which preserves non-printable characters as well.
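      
      A sketch of the workaround (the helper name is hypothetical):
      
          def write_bytes(stream, buf):
              # Python 2.6's stream.write() rejects bytearray, so convert it to
              # a byte string first; bytes() is an alias for str on Python 2 and
              # the conversion preserves non-printable characters.
              if isinstance(buf, bytearray):
                  buf = bytes(buf)
              stream.write(buf)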
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #335 from mateiz/mllib-python-2.6 and squashes the following commits:
      
      f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7
      a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6
      0b855167
  27. Apr 04, 2014
    • Matei Zaharia's avatar
      SPARK-1414. Python API for SparkContext.wholeTextFiles · 60e18ce7
      Matei Zaharia authored
      Also clarified the comment that each file has to fit in memory.
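      
      A minimal usage sketch (the path is a placeholder):
      
          from pyspark import SparkContext
          
          sc = SparkContext("local[2]", "whole-text-files")
          
          # Each element is a (path, content) pair; every file's full content
          # must fit in memory, so this is intended for many small files.
          files = sc.wholeTextFiles("/path/to/small-files")
          print(files.mapValues(len).take(5))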
      
      Author: Matei Zaharia <matei@databricks.com>
      
      Closes #327 from mateiz/py-whole-files and squashes the following commits:
      
      9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
      60e18ce7
  28. Mar 10, 2014
    • Prabin Banka's avatar
      SPARK-977 Added Python RDD.zip function · e1e09e0e
      Prabin Banka authored
      This was raised earlier as part of apache/incubator-spark#486.
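      
      A minimal usage sketch of the new method (local master and sample data are illustrative):
      
          from pyspark import SparkContext
          
          sc = SparkContext("local[2]", "zip-example")
          
          xs = sc.parallelize(range(5), 2)
          ys = sc.parallelize(["a", "b", "c", "d", "e"], 2)
          
          # zip() pairs elements by position; both RDDs need the same number of
          # partitions and the same number of elements in each partition.
          print(xs.zip(ys).collect())  # [(0, 'a'), (1, 'b'), (2, 'c'), (3, 'd'), (4, 'e')]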
      
      Author: Prabin Banka <prabin.banka@imaginea.com>
      
      Closes #76 from prabinb/python-api-zip and squashes the following commits:
      
      b1a31a0 [Prabin Banka] Added Python RDD.zip function
      e1e09e0e
  29. Jan 28, 2014
    • Josh Rosen's avatar
      Switch from MUTF8 to UTF8 in PySpark serializers. · 1381fc72
      Josh Rosen authored
      This fixes SPARK-1043, a bug introduced in 0.9.0
      where PySpark couldn't serialize strings > 64kB.
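      
      Roughly, the fix amounts to framing each string with an explicit 4-byte length prefix instead of relying on writeUTF's 2-byte (64 kB) limit; a sketch with a hypothetical helper:
      
          import struct
          
          def write_utf8(stream, s):
              # Encode as UTF-8 and prefix the payload with a 4-byte big-endian
              # length, so strings are not capped at the 64 kB limit imposed by
              # Java's writeUTF / modified-UTF-8 framing.
              data = s.encode("utf-8")
              stream.write(struct.pack(">i", len(data)))
              stream.write(data)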
      
      This fix was written by @tyro89 and @bouk in #512.
      This commit squashes and rebases their pull request
      in order to fix some merge conflicts.
      1381fc72
  30. Dec 19, 2013
  31. Nov 26, 2013
  32. Nov 10, 2013
  33. Nov 03, 2013
  34. Oct 04, 2013
    • Andre Schumacher's avatar
      Fixing SPARK-602: PythonPartitioner · c84946fe
      Andre Schumacher authored
      Currently PythonPartitioner determines partition ID by hashing a
      byte-array representation of PySpark's key. This PR lets
      PythonPartitioner use the actual partition ID, which is required e.g.
      for sorting via PySpark.
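      
      For illustration only (not the committed code), computing the partition ID from the key on the Python side might look like:
      
          def add_partition_id(iterator, num_partitions):
              # Tag each record with a partition ID computed from the key itself,
              # so the JVM-side PythonPartitioner can use that ID directly instead
              # of hashing an opaque serialized byte-array representation of the key.
              for key, value in iterator:
                  yield (hash(key) % num_partitions, (key, value))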
      c84946fe
  35. Jul 16, 2013
  36. Jun 21, 2013