  1. Nov 18, 2014
    • Davies Liu's avatar
      [SPARK-3721] [PySpark] broadcast objects larger than 2G · 4a377aff
      Davies Liu authored
      This patch will bring support for broadcasting objects larger than 2G.
      
      pickle, zlib, FrameSerializer, and Array[Byte] all cannot handle objects larger than 2G, so this patch introduces a LargeObjectSerializer for broadcast objects: the object is serialized and compressed into small chunks, and the type of the broadcast changes from Broadcast[Array[Byte]] to Broadcast[Array[Array[Byte]]].
      
      Testing broadcast of objects larger than 2G is slow and memory-hungry, so it was tested manually; it could be added to SparkPerf.
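
      The chunking idea can be sketched as follows (a minimal illustration of the approach, not the actual LargeObjectSerializer; the function names and chunk size here are invented for the example):

```python
import pickle
import zlib

CHUNK_SIZE = 1 << 20  # 1 MB per chunk; the real chunk size is an implementation detail


def serialize_in_chunks(obj, chunk_size=CHUNK_SIZE):
    """Serialize and compress an object into a list of small byte chunks,
    so that no single byte array ever has to exceed the 2G limit."""
    data = pickle.dumps(obj, protocol=2)
    return [zlib.compress(data[i:i + chunk_size])
            for i in range(0, len(data), chunk_size)]


def deserialize_chunks(chunks):
    """Reassemble the chunks and unpickle the original object."""
    return pickle.loads(b"".join(zlib.decompress(c) for c in chunks))
```

      A list of many small arrays, rather than one huge array, is what motivates the Broadcast[Array[Array[Byte]]] type change above.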
      
      Author: Davies Liu <davies@databricks.com>
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2659 from davies/huge and squashes the following commits:
      
      7b57a14 [Davies Liu] add more tests for broadcast
      28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      a2f6a02 [Davies Liu] bug fix
      4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      5875c73 [Davies Liu] address comments
      10a349b [Davies Liu] address comments
      0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      6182c8f [Davies Liu] Merge branch 'master' into huge
      d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      2514848 [Davies Liu] address comments
      fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      1c2d928 [Davies Liu] fix scala style
      091b107 [Davies Liu] broadcast objects larger than 2G
      4a377aff
  2. Nov 14, 2014
    • Xiangrui Meng's avatar
      [SPARK-4398][PySpark] specialize sc.parallelize(xrange) · abd58175
      Xiangrui Meng authored
      `sc.parallelize(range(1 << 20), 1).count()` may take 15 seconds to finish, and the RDD object stores the entire list, making the task size very large. This PR adds a specialized version for xrange.
      
      JoshRosen davies
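
      The core trick can be sketched in plain Python (a local model, with Python 3's range standing in for xrange; `slice_range` is a name invented for this example): sub-ranges are described by (start, stop, step) instead of materialized lists, so each task carries only three integers.

```python
def slice_range(rng, num_slices):
    """Split a range into contiguous sub-ranges without materializing any
    elements; slicing a range returns another range object."""
    n = len(rng)
    return [rng[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]
```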
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3264 from mengxr/SPARK-4398 and squashes the following commits:
      
      8953c41 [Xiangrui Meng] follow davies' suggestion
      cbd58e3 [Xiangrui Meng] specialize sc.parallelize(xrange)
      abd58175
  3. Nov 06, 2014
    • Davies Liu's avatar
      [SPARK-4186] add binaryFiles and binaryRecords in Python · b41a39e2
      Davies Liu authored
      add binaryFiles() and binaryRecords() in Python
      ```
      binaryFiles(self, path, minPartitions=None):
          :: Developer API ::
      
          Read a directory of binary files from HDFS, a local file system
          (available on all nodes), or any Hadoop-supported file system URI
          as a byte array. Each file is read as a single record and returned
          in a key-value pair, where the key is the path of each file, the
          value is the content of each file.
      
          Note: Small files are preferred; large files are also allowed,
          but may cause poor performance.
      
      binaryRecords(self, path, recordLength):
          Load data from a flat binary file, assuming each record is a set of numbers
          with the specified numerical format (see ByteBuffer), and the number of
          bytes per record is constant.
      
          :param path: Directory to the input data files
          :param recordLength: The length at which to split the records
      ```
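
      The per-file behavior of binaryRecords() can be sketched locally (a minimal model of the record splitting, not the distributed implementation; `read_binary_records` and the little-endian int format are assumptions for the example):

```python
import struct


def read_binary_records(data, record_length, fmt="<i"):
    """Split a flat byte string into fixed-length records and decode each
    record with the given struct format (cf. ByteBuffer on the JVM side)."""
    assert len(data) % record_length == 0, "size must be a multiple of recordLength"
    return [struct.unpack(fmt, data[i:i + record_length])
            for i in range(0, len(data), record_length)]
```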
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3078 from davies/binary and squashes the following commits:
      
      cd0bdbd [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
      3aa349b [Davies Liu] add experimental notes
      24e84b6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
      5ceaa8a [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
      1900085 [Davies Liu] bugfix
      bb22442 [Davies Liu] add binaryFiles and binaryRecords in Python
      b41a39e2
  4. Nov 04, 2014
    • Davies Liu's avatar
      [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. · e4f42631
      Davies Liu authored
      This PR simplifies the serializer: it always uses a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.
      
      Author: Davies Liu <davies@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2920 from davies/fix_autobatch and squashes the following commits:
      
      e544ef9 [Davies Liu] revert unrelated change
      6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      1d557fc [Davies Liu] fix tests
      8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      76abdce [Davies Liu] clean up
      53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      b4292ce [Davies Liu] fix bug in master
      d79744c [Davies Liu] recover hive tests
      be37ece [Davies Liu] refactor
      eb3938d [Davies Liu] refactor serializer in scala
      8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
      e4f42631
  5. Oct 24, 2014
    • Davies Liu's avatar
      [SPARK-2652] [PySpark] donot use KyroSerializer as default serializer · 809c785b
      Davies Liu authored
      KryoSerializer cannot serialize a customized class unless it is registered explicitly, so using it as the default serializer in PySpark introduced some regressions in MLlib.
      
      cc mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2916 from davies/revert and squashes the following commits:
      
      43eb6d3 [Davies Liu] donot use KyroSerializer as default serializer
      809c785b
  6. Oct 16, 2014
    • Davies Liu's avatar
      [SPARK-3971] [MLLib] [PySpark] hotfix: Customized pickler should work in cluster mode · 091d32c5
      Davies Liu authored
      Customized picklers must be registered before unpickling, but on an executor there is no way to register them before the tasks run.
      
      So, we register the picklers inside the tasks themselves: duplicate javaToPython() and pythonToJava() in MLlib and call SerDe.initialize() before pickling or unpickling.
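
      The pattern amounts to an idempotent initializer called at every entry point (a sketch of the idea in Python, not the actual SerDe code; all names here are invented for illustration):

```python
_registry = {}
_initialized = False


def initialize():
    """Idempotently register custom picklers.  An executor cannot rely on
    any driver-side registration, so every (un)pickling path calls this
    first -- the same reason the patch calls SerDe.initialize() in tasks."""
    global _initialized
    if _initialized:
        return
    _registry["Vector"] = lambda payload: ("Vector", payload)  # stand-in pickler
    _initialized = True


def unpickle(kind, payload):
    initialize()  # must run before any lookup, even on a fresh worker
    return _registry[kind](payload)
```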
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2830 from davies/fix_pickle and squashes the following commits:
      
      0c85fb9 [Davies Liu] revert the privacy change
      6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
      0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
      091d32c5
  7. Oct 12, 2014
    • giwa's avatar
      [SPARK-2377] Python API for Streaming · 69c67aba
      giwa authored
      This patch brings a Python API for Streaming.
      
      This patch is based on work from @giwa
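
      The semantics of one of the stateful operations added here, updateStateByKey(), can be modeled in plain Python (a local sketch of a single micro-batch step, not the actual DStream API; `update_state_by_key` is a name invented for the example):

```python
def update_state_by_key(state, batch, update_func):
    """One micro-batch step: group the batch's (key, value) pairs, then
    fold each key's new values into that key's previous state."""
    grouped = {}
    for k, v in batch:
        grouped.setdefault(k, []).append(v)
    new_state = dict(state)
    for k, values in grouped.items():
        new_state[k] = update_func(values, state.get(k))
    return new_state


# a running word count across two micro-batches
counts = {}
for batch in ([("a", 1), ("b", 1)], [("a", 1)]):
    counts = update_state_by_key(counts, batch,
                                 lambda vs, prev: sum(vs) + (prev or 0))
```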
      
      Author: giwa <ugw.gi.world@gmail.com>
      Author: Ken Takagiwa <ken@Kens-MacBook-Pro.local>
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Ken Takagiwa <ken@kens-mbp.gateway.sonic.net>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Ken <ugw.gi.world@gmail.com>
      Author: Ken Takagiwa <ugw.gi.world@gmail.com>
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2538 from davies/streaming and squashes the following commits:
      
      64561e4 [Davies Liu] fix tests
      331ecce [Davies Liu] fix example
      3e2492b [Davies Liu] change updateStateByKey() to easy API
      182be73 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      02d0575 [Davies Liu] add wrapper for foreachRDD()
      bebeb4a [Davies Liu] address all comments
      6db00da [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      8380064 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      52c535b [Davies Liu] remove fix for sum()
      e108ec1 [Davies Liu]  address comments
      37fe06f [Davies Liu] use random port for callback server
      d05871e [Davies Liu] remove reuse of PythonRDD
      be5e5ff [Davies Liu] merge branch of env, make tests stable.
      8071541 [Davies Liu] Merge branch 'env' into streaming
      c7bbbce [Davies Liu] fix sphinx docs
      6bb9d91 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      4d0ea8b [Davies Liu] clear reference of SparkEnv after stop
      54bd92b [Davies Liu] improve tests
      c2b31cb [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
      7a88f9f [Davies Liu] rollback RDD.setContext(), use textFileStream() to test checkpointing
      bd8a4c2 [Davies Liu] fix scala style
      7797c70 [Davies Liu] refactor
      ff88bec [Davies Liu] rename RDDFunction to TransformFunction
      d328aca [Davies Liu] fix serializer in queueStream
      6f0da2f [Davies Liu] recover from checkpoint
      fa7261b [Davies Liu] refactor
      a13ff34 [Davies Liu] address comments
      8466916 [Davies Liu] support checkpoint
      9a16bd1 [Davies Liu] change number of partitions during tests
      b98d63f [Davies Liu] change private[spark] to private[python]
      eed6e2a [Davies Liu] rollback not needed changes
      e00136b [Davies Liu] address comments
      069a94c [Davies Liu] fix the number of partitions during window()
      338580a [Davies Liu] change _first(), _take(), _collect() as private API
      19797f9 [Davies Liu] clean up
      6ebceca [Davies Liu] add more tests
      c40c52d [Davies Liu] change first(), take(n) to has the same behavior as RDD
      98ac6c2 [Davies Liu] support ssc.transform()
      b983f0f [Davies Liu] address comments
      847f9b9 [Davies Liu] add more docs, add first(), take()
      e059ca2 [Davies Liu] move check of window into Python
      fce0ef5 [Davies Liu] rafactor of foreachRDD()
      7001b51 [Davies Liu] refactor of queueStream()
      26ea396 [Davies Liu] refactor
      74df565 [Davies Liu] fix print and docs
      b32774c [Davies Liu] move java_import into streaming
      604323f [Davies Liu] enable streaming tests
      c499ba0 [Davies Liu] remove Time and Duration
      3f0fb4b [Davies Liu] refactor fix tests
      c28f520 [Davies Liu] support updateStateByKey
      d357b70 [Davies Liu] support windowed dstream
      bd13026 [Davies Liu] fix examples
      eec401e [Davies Liu] refactor, combine TransformedRDD, fix reuse PythonRDD, fix union
      9a57685 [Davies Liu] fix python style
      bd27874 [Davies Liu] fix scala style
      7339be0 [Davies Liu] delete tests
      7f53086 [Davies Liu] support transform(), refactor and cleanup
      df098fc [Davies Liu] Merge branch 'master' into giwa
      550dfd9 [giwa] WIP fixing 1.1 merge
      5cdb6fa [giwa] changed for SCCallSiteSync
      e685853 [giwa] meged with rebased 1.1 branch
      2d32a74 [giwa] added some StreamingContextTestSuite
      4a59e1e [giwa] WIP:added more test for StreamingContext
      8ffdbf1 [giwa] added atexit to handle callback server
      d5f5fcb [giwa] added comment for StreamingContext.sparkContext
      63c881a [giwa] added StreamingContext.sparkContext
      d39f102 [giwa] added StreamingContext.remember
      d542743 [giwa] clean up code
      2fdf0de [Matthew Farrellee] Fix scalastyle errors
      c0a06bc [giwa] delete not implemented functions
      f385976 [giwa] delete inproper comments
      b0f2015 [giwa] added comment in dstream._test_output
      bebb3f3 [giwa] remove the last brank line
      fbed8da [giwa] revert pom.xml
      8ed93af [giwa] fixed explanaiton
      066ba90 [giwa] revert pom.xml
      fa4af88 [giwa] remove duplicated import
      6ae3caa [giwa] revert pom.xml
      7dc7391 [giwa] fixed typo
      62dc7a3 [giwa] clean up exmples
      f04882c [giwa] clen up examples
      b171ec3 [giwa] fixed pep8 violation
      f198d14 [giwa] clean up code
      3166d31 [giwa] clean up
      c00e091 [giwa] change test case not to use awaitTermination
      e80647e [giwa] adopted the latest compression way of python command
      58e41ff [giwa] merge with master
      455e5af [giwa] removed wasted print in DStream
      af336b7 [giwa] add comments
      ddd4ee1 [giwa] added TODO coments
      99ce042 [giwa] added saveAsTextFiles and saveAsPickledFiles
      2a06cdb [giwa] remove waste duplicated code
      c5ecfc1 [giwa] basic function test cases are passed
      8dcda84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      795b2cd [giwa] broke something
      1e126bf [giwa] WIP: solved partitioned and None is not recognized
      f67cf57 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      953deb0 [giwa] edited the comment to add more precise description
      af610d3 [giwa] removed unnesessary changes
      c1d546e [giwa] fixed PEP-008 violation
      99410be [giwa] delete waste file
      b3b0362 [giwa] added basic operation test cases
      9cde7c9 [giwa] WIP added test case
      bd3ba53 [giwa] WIP
      5c04a5f [giwa] WIP: added PythonTestInputStream
      019ef38 [giwa] WIP
      1934726 [giwa] update comment
      376e3ac [giwa] WIP
      932372a [giwa] clean up dstream.py
      0b09cff [giwa] added stop in StreamingContext
      92e333e [giwa] implemented reduce and count function in Dstream
      1b83354 [giwa] Removed the waste line
      88f7506 [Ken Takagiwa] Kill py4j callback server properly
      54b5358 [Ken Takagiwa] tried to restart callback server
      4f07163 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      fe02547 [Ken Takagiwa] remove waste file
      2ad7bd3 [Ken Takagiwa] clean up codes
      6197a11 [Ken Takagiwa] clean up code
      eb4bf48 [Ken Takagiwa] fix map function
      98c2a00 [Ken Takagiwa] added count operation but this implementation need double check
      58591d2 [Ken Takagiwa] reduceByKey is working
      0df7111 [Ken Takagiwa] delete old file
      f485b1d [Ken Takagiwa] fied input of socketTextDStream
      dd6de81 [Ken Takagiwa] initial commit for socketTextStream
      247fd74 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      4bcb318 [Ken Takagiwa] implementing transform function in Python
      38adf95 [Ken Takagiwa] added reducedByKey not working yet
      66fcfff [Ken Takagiwa] modify dstream.py to fix indent error
      41886c2 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      0b99bec [Ken] initial commit for pySparkStreaming
      c214199 [giwa] added testcase for combineByKey
      5625bdc [giwa] added gorupByKey testcase
      10ab87b [giwa] added sparkContext as input parameter in StreamingContext
      10b5b04 [giwa] removed wasted print in DStream
      e54f986 [giwa] add comments
      16aa64f [giwa] added TODO coments
      74535d4 [giwa] added saveAsTextFiles and saveAsPickledFiles
      f76c182 [giwa] remove waste duplicated code
      18c8723 [giwa] modified streaming test case to add coment
      13fb44c [giwa] basic function test cases are passed
      3000b2b [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      ff14070 [giwa] broke something
      bcdec33 [giwa] WIP: solved partitioned and None is not recognized
      270a9e1 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      bb10956 [giwa] edited the comment to add more precise description
      253a863 [giwa] removed unnesessary changes
      3d37822 [giwa] fixed PEP-008 violation
      f21cab3 [giwa] delete waste file
      878bad7 [giwa] added basic operation test cases
      ce2acd2 [giwa] WIP added test case
      9ad6855 [giwa] WIP
      1df77f5 [giwa] WIP: added PythonTestInputStream
      1523b66 [giwa] WIP
      8a0fbbc [giwa] update comment
      fe648e3 [giwa] WIP
      29c2bc5 [giwa] initial commit for testcase
      4d40d63 [giwa] clean up dstream.py
      c462bb3 [giwa] added stop in StreamingContext
      d2c01ba [giwa] clean up examples
      3c45cd2 [giwa] implemented reduce and count function in Dstream
      b349649 [giwa] Removed the waste line
      3b498e1 [Ken Takagiwa] Kill py4j callback server properly
      84a9668 [Ken Takagiwa] tried to restart callback server
      9ab8952 [Tathagata Das] Added extra line.
      05e991b [Tathagata Das] Added missing file
      b1d2a30 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      678e854 [Ken Takagiwa] remove waste file
      0a8bbbb [Ken Takagiwa] clean up codes
      bab31c1 [Ken Takagiwa] clean up code
      72b9738 [Ken Takagiwa] fix map function
      d3ee86a [Ken Takagiwa] added count operation but this implementation need double check
      15feea9 [Ken Takagiwa] edit python sparkstreaming example
      6f98e50 [Ken Takagiwa] reduceByKey is working
      c455c8d [Ken Takagiwa] added reducedByKey not working yet
      dc6995d [Ken Takagiwa] delete old file
      b31446a [Ken Takagiwa] fixed typo of network_workdcount.py
      ccfd214 [Ken Takagiwa] added doctest for pyspark.streaming.duration
      0d1b954 [Ken Takagiwa] fied input of socketTextDStream
      f746109 [Ken Takagiwa] initial commit for socketTextStream
      bb7ccf3 [Ken Takagiwa] remove unused import in python
      224fc5e [Ken Takagiwa] add empty line
      d2099d8 [Ken Takagiwa] sorted the import following Spark coding convention
      5bac7ec [Ken Takagiwa] revert streaming/pom.xml
      e1df940 [Ken Takagiwa] revert pom.xml
      494cae5 [Ken Takagiwa] remove not implemented DStream functions in python
      17a74c6 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      1a0f065 [Ken Takagiwa] implementing transform function in Python
      d7b4d6f [Ken Takagiwa] added reducedByKey not working yet
      87438e2 [Ken Takagiwa] modify dstream.py to fix indent error
      b406252 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      454981d [Ken] initial commit for pySparkStreaming
      150b94c [giwa] added some StreamingContextTestSuite
      f7bc8f9 [giwa] WIP:added more test for StreamingContext
      ee50c5a [giwa] added atexit to handle callback server
      fdc9125 [giwa] added comment for StreamingContext.sparkContext
      f5bfb70 [giwa] added StreamingContext.sparkContext
      da09768 [giwa] added StreamingContext.remember
      d68b568 [giwa] clean up code
      4afa390 [giwa] clean up code
      1fd6bc7 [Ken Takagiwa] Merge pull request #2 from mattf/giwa-master
      d9d59fe [Matthew Farrellee] Fix scalastyle errors
      67473a9 [giwa] delete not implemented functions
      c97377c [giwa] delete inproper comments
      2ea769e [giwa] added comment in dstream._test_output
      3b27bd4 [giwa] remove the last brank line
      acfcaeb [giwa] revert pom.xml
      93f7637 [giwa] fixed explanaiton
      50fd6f9 [giwa] revert pom.xml
      4f82c89 [giwa] remove duplicated import
      9d1de23 [giwa] revert pom.xml
      7339df2 [giwa] fixed typo
      9c85e48 [giwa] clean up exmples
      24f95db [giwa] clen up examples
      0d30109 [giwa] fixed pep8 violation
      b7dab85 [giwa] improve test case
      583e66d [giwa] move tests for streaming inside streaming directory
      1d84142 [giwa] remove unimplement test
      f0ea311 [giwa] clean up code
      171edeb [giwa] clean up
      4dedd2d [giwa] change test case not to use awaitTermination
      268a6a5 [giwa] Changed awaitTermination not to call awaitTermincation in Scala. Just use time.sleep instread
      09a28bf [giwa] improve testcases
      58150f5 [giwa] Changed the test case to focus the test operation
      199e37f [giwa] adopted the latest compression way of python command
      185fdbf [giwa] merge with master
      f1798c4 [giwa] merge with master
      e70f706 [giwa] added testcase for combineByKey
      e162822 [giwa] added gorupByKey testcase
      97742fe [giwa] added sparkContext as input parameter in StreamingContext
      14d4c0e [giwa] removed wasted print in DStream
      6d8190a [giwa] add comments
      4aa99e4 [giwa] added TODO coments
      e9fab72 [giwa] added saveAsTextFiles and saveAsPickledFiles
      94f2b65 [giwa] remove waste duplicated code
      580fbc2 [giwa] modified streaming test case to add coment
      99e4bb3 [giwa] basic function test cases are passed
      7051a84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      35933e1 [giwa] broke something
      9767712 [giwa] WIP: solved partitioned and None is not recognized
      4f2d7e6 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      33c0f94d [giwa] edited the comment to add more precise description
      774f18d [giwa] removed unnesessary changes
      3a671cc [giwa] remove export PYSPARK_PYTHON in spark submit
      8efa266 [giwa] fixed PEP-008 violation
      fa75d71 [giwa] delete waste file
      7f96294 [giwa] added basic operation test cases
      3dda31a [giwa] WIP added test case
      1f68b78 [giwa] WIP
      c05922c [giwa] WIP: added PythonTestInputStream
      1fd12ae [giwa] WIP
      c880a33 [giwa] update comment
      5d22c92 [giwa] WIP
      ea4b06b [giwa] initial commit for testcase
      5a9b525 [giwa] clean up dstream.py
      79c5809 [giwa] added stop in StreamingContext
      189dcea [giwa] clean up examples
      b8d7d24 [giwa] implemented reduce and count function in Dstream
      b6468e6 [giwa] Removed the waste line
      b47b5fd [Ken Takagiwa] Kill py4j callback server properly
      19ddcdd [Ken Takagiwa] tried to restart callback server
      c9fc124 [Tathagata Das] Added extra line.
      4caae3f [Tathagata Das] Added missing file
      4eff053 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      5e822d4 [Ken Takagiwa] remove waste file
      aeaf8a5 [Ken Takagiwa] clean up codes
      9fa249b [Ken Takagiwa] clean up code
      05459c6 [Ken Takagiwa] fix map function
      a9f4ecb [Ken Takagiwa] added count operation but this implementation need double check
      d1ee6ca [Ken Takagiwa] edit python sparkstreaming example
      0b8b7d0 [Ken Takagiwa] reduceByKey is working
      d25d5cf [Ken Takagiwa] added reducedByKey not working yet
      7f7c5d1 [Ken Takagiwa] delete old file
      967dc26 [Ken Takagiwa] fixed typo of network_workdcount.py
      57fb740 [Ken Takagiwa] added doctest for pyspark.streaming.duration
      4b69fb1 [Ken Takagiwa] fied input of socketTextDStream
      02f618a [Ken Takagiwa] initial commit for socketTextStream
      4ce4058 [Ken Takagiwa] remove unused import in python
      856d98e [Ken Takagiwa] add empty line
      490e338 [Ken Takagiwa] sorted the import following Spark coding convention
      5594bd4 [Ken Takagiwa] revert pom.xml
      2adca84 [Ken Takagiwa] remove not implemented DStream functions in python
      e551e13 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      3758175 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      c5518b4 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      dcf243f [Ken Takagiwa] implementing transform function in Python
      9af03f4 [Ken Takagiwa] added reducedByKey not working yet
      6e0d9c7 [Ken Takagiwa] modify dstream.py to fix indent error
      e497b9b [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      5c3a683 [Ken] initial commit for pySparkStreaming
      665bfdb [giwa] added testcase for combineByKey
      a3d2379 [giwa] added gorupByKey testcase
      636090a [giwa] added sparkContext as input parameter in StreamingContext
      e7ebb08 [giwa] removed wasted print in DStream
      d8b593b [giwa] add comments
      ea9c873 [giwa] added TODO coments
      89ae38a [giwa] added saveAsTextFiles and saveAsPickledFiles
      e3033fc [giwa] remove waste duplicated code
      a14c7e1 [giwa] modified streaming test case to add coment
      536def4 [giwa] basic function test cases are passed
      2112638 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
      080541a [giwa] broke something
      0704b86 [giwa] WIP: solved partitioned and None is not recognized
      90a6484 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
      a65f302 [giwa] edited the comment to add more precise description
      bdde697 [giwa] removed unnesessary changes
      e8c7bfc [giwa] remove export PYSPARK_PYTHON in spark submit
      3334169 [giwa] fixed PEP-008 violation
      db0a303 [giwa] delete waste file
      2cfd3a0 [giwa] added basic operation test cases
      90ae568 [giwa] WIP added test case
      a120d07 [giwa] WIP
      f671cdb [giwa] WIP: added PythonTestInputStream
      56fae45 [giwa] WIP
      e35e101 [giwa] Merge branch 'master' into testcase
      ba5112d [giwa] update comment
      28aa56d [giwa] WIP
      fb08559 [giwa] initial commit for testcase
      a613b85 [giwa] clean up dstream.py
      c40c0ef [giwa] added stop in StreamingContext
      31e4260 [giwa] clean up examples
      d2127d6 [giwa] implemented reduce and count function in Dstream
      48f7746 [giwa] Removed the waste line
      0f83eaa [Ken Takagiwa] delete py4j 0.8.1
      1679808 [Ken Takagiwa] Kill py4j callback server properly
      f96cd4e [Ken Takagiwa] tried to restart callback server
      fe86198 [Ken Takagiwa] add py4j 0.8.2.1 but server is not launched
      1064fe0 [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
      28c6620 [Ken Takagiwa] Implemented DStream.foreachRDD in the Python API using Py4J callback server
      85b0fe1 [Ken Takagiwa] Merge pull request #1 from tdas/python-foreach
      54e2e8c [Tathagata Das] Added extra line.
      e185338 [Tathagata Das] Added missing file
      a778d4b [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
      cc2092b [Ken Takagiwa] remove waste file
      d042ac6 [Ken Takagiwa] clean up codes
      84a021f [Ken Takagiwa] clean up code
      bd20e17 [Ken Takagiwa] fix map function
      d01a125 [Ken Takagiwa] added count operation but this implementation need double check
      7d05109 [Ken Takagiwa] merge with remote branch
      ae464e0 [Ken Takagiwa] edit python sparkstreaming example
      04af046 [Ken Takagiwa] reduceByKey is working
      3b6d7b0 [Ken Takagiwa] implementing transform function in Python
      571d52d [Ken Takagiwa] added reducedByKey not working yet
      5720979 [Ken Takagiwa] delete old file
      e604fcb [Ken Takagiwa] fixed typo of network_workdcount.py
      4b7c08b [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
      ce7d426 [Ken Takagiwa] added doctest for pyspark.streaming.duration
      a8c9fd5 [Ken Takagiwa] fixed for socketTextStream
      a61fa9e [Ken Takagiwa] fied input of socketTextDStream
      1e84f41 [Ken Takagiwa] initial commit for socketTextStream
      6d012f7 [Ken Takagiwa] remove unused import in python
      25d30d5 [Ken Takagiwa] add empty line
      6e0a64a [Ken Takagiwa] sorted the import following Spark coding convention
      fa4a7fc [Ken Takagiwa] revert streaming/pom.xml
      8f8202b [Ken Takagiwa] revert streaming pom.xml
      c9d79dd [Ken Takagiwa] revert pom.xml
      57e3e52 [Ken Takagiwa] remove not implemented DStream functions in python
      0a516f5 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      a7a0b5c [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
      72bfc66 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
      69e9cd3 [Ken Takagiwa] implementing transform function in Python
      94a0787 [Ken Takagiwa] added reducedByKey not working yet
      88068cf [Ken Takagiwa] modify dstream.py to fix indent error
      1367be5 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
      eb2b3ba [Ken] Merge remote-tracking branch 'upstream/master'
      d8e51f9 [Ken] initial commit for pySparkStreaming
      69c67aba
  8. Oct 10, 2014
    • Davies Liu's avatar
      [SPARK-3886] [PySpark] use AutoBatchedSerializer by default · 72f36ee5
      Davies Liu authored
      Use AutoBatchedSerializer by default; it chooses the proper batch size based on the size of the serialized objects, keeping each serialized batch in the range [64k, 640k].
      
      On the JVM side, the serializer also tracks the objects in a batch to detect duplicated objects, so a larger batch may cause an OOM in the JVM.
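
      The adaptive sizing can be sketched roughly as follows (a simplified model of the idea, assuming pickle; the real AutoBatchedSerializer tracks serialized sizes in a similar spirit but differs in detail):

```python
import pickle


def auto_batched(iterator, best_size=1 << 16):
    """Yield pickled batches whose serialized size targets roughly 64 KB:
    start with one element, double the batch while batches come out small,
    and shrink it if a batch comes out far too large."""
    batch = 1
    items = list(iterator)
    while items:
        chunk, items = items[:batch], items[batch:]
        data = pickle.dumps(chunk, protocol=2)
        yield data
        if len(data) < best_size:
            batch *= 2            # batches are small; amortize per-batch overhead
        elif len(data) > best_size * 10 and batch > 1:
            batch //= 2           # avoid huge batches that could OOM the JVM side
```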
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2740 from davies/batchsize and squashes the following commits:
      
      52cdb88 [Davies Liu] update docs
      185f2b9 [Davies Liu] use AutoBatchedSerializer by default
      72f36ee5
  9. Oct 07, 2014
  10. Oct 06, 2014
    • cocoatomo's avatar
      [SPARK-3773][PySpark][Doc] Sphinx build warning · 2300eb58
      cocoatomo authored
      When building the Sphinx documents for PySpark, we get 12 warnings.
      Most of them are caused by docstrings in broken reST format.
      
      To reproduce this issue, run the following commands at commit 6e27cb63:
      
      ```bash
      $ cd ./python/docs
      $ make clean html
      ...
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.SparkContext.sequenceFile:4: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.RDD.saveAsSequenceFile:4: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:14: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:14: ERROR: Unexpected indentation.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent.
      /Users/<user>/MyRepos/Scala/spark/python/docs/pyspark.mllib.rst:50: WARNING: missing attribute mentioned in :members: or __all__: module pyspark.mllib.regression, attribute RidgeRegressionModelLinearRegressionWithSGD
      /Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.DecisionTreeModel.predict:3: ERROR: Unexpected indentation.
      ...
      checking consistency... /Users/<user>/MyRepos/Scala/spark/python/docs/modules.rst:: WARNING: document isn't included in any toctree
      ...
      copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
      ...
      build succeeded, 12 warnings.
      ```
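
      Most "Unexpected indentation" errors of this kind come from docstrings that open an indented block without the blank line reST requires. An illustrative before/after (not the actual diff; the function and parameter names are invented for the example):

```python
def train_bad(data):
    """Train a model.
    :param data: the training set,
        an RDD of LabeledPoint
    """
    # Sphinx reports "ERROR: Unexpected indentation" for a docstring like
    # this: the field list starts without a blank line after the summary,
    # and the continuation line is indented deeper than reST expects.


def train_good(data):
    """Train a model.

    :param data: the training set,
      an RDD of LabeledPoint
    """
    # A blank line after the summary and consistent continuation
    # indentation keep the field list well-formed, so no warning.
```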
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2653 from cocoatomo/issues/3773-sphinx-build-warnings and squashes the following commits:
      
      6f65661 [cocoatomo] [SPARK-3773][PySpark][Doc] Sphinx build warning
      2300eb58
  11. Sep 30, 2014
    • Davies Liu's avatar
      [SPARK-3478] [PySpark] Profile the Python tasks · c5414b68
      Davies Liu authored
      This patch adds profiling support for PySpark; it will show the profiling results
      before the driver exits. Here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Also, users can show the profile results manually via `sc.show_profiles()` or dump them to disk
      via `sc.dump_profiles(path)`, for example:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default and can be enabled by setting "spark.python.profile=true".
      
      Also, users can have the results dumped to disk automatically for future analysis, by setting "spark.python.profile.dump=path_to_dump"
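The worker-side mechanics can be approximated with the standard library alone. A minimal sketch, assuming a hypothetical `run_with_profile` helper (not the patch's actual code), using the same sort order as the reports above:

```python
import cProfile
import io
import pstats

def run_with_profile(func, *args):
    """Run a task function under cProfile and return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    buf = io.StringIO()
    # same ordering the patch's reports use: internal time, then cumulative time
    pstats.Stats(profiler, stream=buf).sort_stats("tottime", "cumtime").print_stats(5)
    return result, buf.getvalue()

result, report = run_with_profile(sum, range(1000))
print(result)  # 499500
```

The real patch aggregates such per-task stats by RDD id before printing them on the driver.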
      
      This is a bugfix of #2351. cc JoshRosen
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2556 from davies/profiler and squashes the following commits:
      
      e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      858e74c [Davies Liu] compatitable with python 2.6
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
      c5414b68
  12. Sep 26, 2014
    • Josh Rosen's avatar
      Revert "[SPARK-3478] [PySpark] Profile the Python tasks" · f872e4fb
      Josh Rosen authored
      This reverts commit 1aa549ba.
      f872e4fb
    • Davies Liu's avatar
      [SPARK-3478] [PySpark] Profile the Python tasks · 1aa549ba
      Davies Liu authored
      This patch adds profiling support for PySpark; it will show the profiling results
      before the driver exits. Here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Also, users can show the profile results manually via `sc.show_profiles()` or dump them to disk
      via `sc.dump_profiles(path)`, for example:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default and can be enabled by setting "spark.python.profile=true".
      
      Also, users can have the results dumped to disk automatically for future analysis, by setting "spark.python.profile.dump=path_to_dump"
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2351 from davies/profiler and squashes the following commits:
      
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
      1aa549ba
  13. Sep 24, 2014
    • Davies Liu's avatar
      [SPARK-3634] [PySpark] User's module should take precedence over system modules · c854b9fc
      Davies Liu authored
      Python modules added through addPyFile should take precedence over system modules.
      
      This patch puts the path for user-added modules at the front of sys.path (just after '').
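The precedence trick can be sketched with plain `sys.path` manipulation; `prepend_user_path` is a hypothetical name for illustration, not the patch's actual helper:

```python
import sys

def prepend_user_path(path):
    """Put a user-added module path just after the first sys.path entry,
    so user modules shadow system modules of the same name."""
    if path in sys.path:
        sys.path.remove(path)
    # index 0 is the script directory (or '' in the REPL); insert right after it
    sys.path.insert(1, path)

prepend_user_path("/tmp/user_modules")
print(sys.path[1])  # /tmp/user_modules
```

Because `import` scans `sys.path` in order, anything under the inserted directory wins over same-named system modules.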
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2492 from davies/path and squashes the following commits:
      
      4a2af78 [Davies Liu] fix tests
      f7ff4da [Davies Liu] ad license header
      6b0002f [Davies Liu] add tests
      c16c392 [Davies Liu] put addPyFile in front of sys.path
      c854b9fc
  14. Sep 19, 2014
    • Davies Liu's avatar
      [SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib · fce5e251
      Davies Liu authored
      Currently, we serialize the data between the JVM and Python case by case, manually; this cannot scale to support so many APIs in MLlib.
      
      This patch tries to address the problem by serializing the data using the pickle protocol, with the Pyrolite library handling serialization/deserialization in the JVM. The pickle protocol can easily be extended to support customized classes.
      
      All the modules are refactored to use this protocol.
      
      Known issues: There will be some performance regression (both CPU and memory; the serialized data is larger)
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2378 from davies/pickle_mllib and squashes the following commits:
      
      dffbba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into pickle_mllib
      810f97f [Davies Liu] fix equal of matrix
      032cd62 [Davies Liu] add more type check and conversion for user_product
      bd738ab [Davies Liu] address comments
      e431377 [Davies Liu] fix cache of rdd, refactor
      19d0967 [Davies Liu] refactor Picklers
      2511e76 [Davies Liu] cleanup
      1fccf1a [Davies Liu] address comments
      a2cc855 [Davies Liu] fix tests
      9ceff73 [Davies Liu] test size of serialized Rating
      44e0551 [Davies Liu] fix cache
      a379a81 [Davies Liu] fix pickle array in python2.7
      df625c7 [Davies Liu] Merge commit '154d141' into pickle_mllib
      154d141 [Davies Liu] fix autobatchedpickler
      44736d7 [Davies Liu] speed up pickling array in Python 2.7
      e1d1bfc [Davies Liu] refactor
      708dc02 [Davies Liu] fix tests
      9dcfb63 [Davies Liu] fix style
      88034f0 [Davies Liu] rafactor, address comments
      46a501e [Davies Liu] choose batch size automatically
      df19464 [Davies Liu] memorize the module and class name during pickleing
      f3506c5 [Davies Liu] Merge branch 'master' into pickle_mllib
      722dd96 [Davies Liu] cleanup _common.py
      0ee1525 [Davies Liu] remove outdated tests
      b02e34f [Davies Liu] remove _common.py
      84c721d [Davies Liu] Merge branch 'master' into pickle_mllib
      4d7963e [Davies Liu] remove muanlly serialization
      6d26b03 [Davies Liu] fix tests
      c383544 [Davies Liu] classification
      f2a0856 [Davies Liu] mllib/regression
      d9f691f [Davies Liu] mllib/util
      cccb8b1 [Davies Liu] mllib/tree
      8fe166a [Davies Liu] Merge branch 'pickle' into pickle_mllib
      aa2287e [Davies Liu] random
      f1544c4 [Davies Liu] refactor clustering
      52d1350 [Davies Liu] use new protocol in mllib/stat
      b30ef35 [Davies Liu] use pickle to serialize data for mllib/recommendation
      f44f771 [Davies Liu] enable tests about array
      3908f5c [Davies Liu] Merge branch 'master' into pickle
      c77c87b [Davies Liu] cleanup debugging code
      60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
      fce5e251
  15. Sep 16, 2014
    • Davies Liu's avatar
      [SPARK-3430] [PySpark] [Doc] generate PySpark API docs using Sphinx · ec1adecb
      Davies Liu authored
      Using Sphinx to generate API docs for PySpark.
      
      requirement: Sphinx
      
      ```
      $ cd python/docs/
      $ make html
      ```
      
      The generated API docs will be located at python/docs/_build/html/index.html
      
      It can co-exist with the docs generated by Epydoc.
      
      This is the first working version; after it is merged in, we can continue to improve it and eventually replace Epydoc.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2292 from davies/sphinx and squashes the following commits:
      
      425a3b1 [Davies Liu] cleanup
      1573298 [Davies Liu] move docs to python/docs/
      5fe3903 [Davies Liu] Merge branch 'master' into sphinx
      9468ab0 [Davies Liu] fix makefile
      b408f38 [Davies Liu] address all comments
      e2ccb1b [Davies Liu] update name and version
      9081ead [Davies Liu] generate PySpark API docs using Sphinx
      ec1adecb
  16. Sep 15, 2014
    • Aaron Staple's avatar
      [SPARK-1087] Move python traceback utilities into new traceback_utils.py file. · 60050f42
      Aaron Staple authored
      Also made some cosmetic cleanups.
      
      Author: Aaron Staple <aaron.staple@gmail.com>
      
      Closes #2385 from staple/SPARK-1087 and squashes the following commits:
      
      7b3bb13 [Aaron Staple] Address review comments, cosmetic cleanups.
      10ba6e1 [Aaron Staple] [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.
      60050f42
    • Davies Liu's avatar
      [SPARK-2951] [PySpark] support unpickle array.array for Python 2.6 · da33acb8
      Davies Liu authored
      Pyrolite cannot unpickle array.array objects pickled by Python 2.6; this patch fixes it by extending Pyrolite.
      
      There is a bug in Pyrolite when unpickling arrays of float/double; this patch works around it by reversing the endianness for float/double. The workaround should be removed after Pyrolite releases a fix for this issue.
      
      I have sent a PR to Pyrolite to fix it:  https://github.com/irmen/Pyrolite/pull/11
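The byte-order reversal is the same operation the standard library exposes as `array.byteswap()`; a small illustrative sketch (not the workaround's actual code):

```python
import array

def byteswap_doubles(values):
    """Build an array of doubles and flip the byte order of each element
    in place, as the endianness workaround does."""
    a = array.array('d', values)
    a.byteswap()
    return a

a = byteswap_doubles([1.0, 2.0, 3.0])
a.byteswap()  # swapping twice restores the original values
print(list(a))  # [1.0, 2.0, 3.0]
```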
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2365 from davies/pickle and squashes the following commits:
      
      f44f771 [Davies Liu] enable tests about array
      3908f5c [Davies Liu] Merge branch 'master' into pickle
      c77c87b [Davies Liu] cleanup debugging code
      60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
      da33acb8
  17. Sep 11, 2014
    • Davies Liu's avatar
      [SPARK-3047] [PySpark] add an option to use str in textFileRDD · 1ef656ea
      Davies Liu authored
      str is much more efficient than unicode (both CPU and memory), so it's better to use str in textFileRDD. To keep compatibility, unicode is used by default (this may change in the future).
      
      use_unicode=True:
      
      daviesliudm:~/work/spark$ time python wc.py
      (u'./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)
      
      real	2m8.298s
      user	0m0.185s
      sys	0m0.064s
      
      use_unicode=False
      
      daviesliudm:~/work/spark$ time python wc.py
      ('./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)
      
      real	1m26.402s
      user	0m0.182s
      sys	0m0.062s
      
      We can see a 32% improvement!
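The memory side of the gap is visible with the standard library alone. In Python 3 terms, `bytes` plays the role of Python 2's `str` and `str` that of `unicode` (an illustrative sketch on CPython, not a benchmark of the patch):

```python
import sys

raw = b"ExplainCommand$.java"   # Python 2 str     -> Python 3 bytes
text = "ExplainCommand$.java"   # Python 2 unicode -> Python 3 str

# the decoded text object carries extra per-object overhead on CPython
print(sys.getsizeof(raw) < sys.getsizeof(text))  # True on CPython
```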
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1951 from davies/unicode and squashes the following commits:
      
      8352d57 [Davies Liu] update version number
      a286f2f [Davies Liu] rollback loads()
      85246e5 [Davies Liu] add docs for use_unicode
      a0295e1 [Davies Liu] add an option to use str in textFile()
      1ef656ea
  18. Sep 09, 2014
    • Matthew Farrellee's avatar
      [SPARK-3458] enable python "with" statements for SparkContext · 25b5b867
      Matthew Farrellee authored
      allow for best practice code,
      
      ```
      try:
        sc = SparkContext()
        app(sc)
      finally:
        sc.stop()
      ```
      
      to be written using a "with" statement,
      
      ```
      with SparkContext() as sc:
        app(sc)
      ```
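Under the hood this is the standard context-manager protocol; a minimal sketch with a hypothetical `MiniContext` class standing in for `SparkContext`:

```python
class MiniContext:
    """Sketch of the context-manager protocol behind `with SparkContext()`."""
    def __init__(self):
        self.stopped = False

    def stop(self):
        self.stopped = True

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.stop()   # runs even if the body raised, like the try/finally above
        return False  # do not suppress exceptions

with MiniContext() as ctx:
    pass
print(ctx.stopped)  # True
```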
      
      Author: Matthew Farrellee <matt@redhat.com>
      
      Closes #2335 from mattf/SPARK-3458 and squashes the following commits:
      
      5b4e37c [Matthew Farrellee] [SPARK-3458] enable python "with" statements for SparkContext
      25b5b867
  19. Sep 03, 2014
    • Davies Liu's avatar
      [SPARK-3309] [PySpark] Put all public API in __all__ · 6481d274
      Davies Liu authored
      Put all public APIs in __all__ and also in pyspark.__init__.py, so that we can get documentation for all public APIs via `pydoc pyspark`. This can also be used by other programs (such as Sphinx or Epydoc) to generate documents only for public APIs.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2205 from davies/public and squashes the following commits:
      
      c6c5567 [Davies Liu] fix message
      f7b35be [Davies Liu] put SchemeRDD, Row in pyspark.sql module
      7e3016a [Davies Liu] add __all__ in mllib
      6281b48 [Davies Liu] fix doc for SchemaRDD
      6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py
      6481d274
  20. Aug 30, 2014
    • Holden Karau's avatar
      SPARK-3318: Documentation update in addFile on how to use SparkFiles.get · ba78383b
      Holden Karau authored
      Rather than specifying the path to SparkFiles we need to use the filename.
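The intended call shape can be sketched as a path join against the job's files root; `spark_files_get` and the directory below are hypothetical names for illustration, not PySpark's implementation:

```python
import os.path

def spark_files_get(root_dir, filename):
    """Resolve the *filename* of a file added via addFile() against the
    per-job files root directory -- callers pass the name, not the path."""
    return os.path.join(root_dir, filename)

# pass only the base name of the added file, e.g. "data.txt", not "/home/me/data.txt"
print(spark_files_get("/tmp/spark-files", "data.txt"))  # /tmp/spark-files/data.txt
```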
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #2210 from holdenk/SPARK-3318-documentation-for-addfiles-should-say-to-use-file-not-path and squashes the following commits:
      
      a25d27a [Holden Karau] Update the JavaSparkContext addFile method to be clear about using fileName with SparkFiles as well
      0ebcb05 [Holden Karau] Documentation update in addFile on how to use SparkFiles.get to specify filename rather than path
      ba78383b
  21. Aug 29, 2014
  22. Aug 16, 2014
    • Davies Liu's avatar
      [SPARK-1065] [PySpark] improve supporting for large broadcast · 2fc8aca0
      Davies Liu authored
      Passing large objects through py4j is very slow (and memory hungry), so broadcast objects are passed via files (similar to parallelize()).
      
      Add an option to keep the object in the driver (False by default) to save driver memory.
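The file-based handoff can be sketched with the standard library; the helper names below are hypothetical, not the patch's actual code:

```python
import os
import pickle
import tempfile

def dump_broadcast(obj):
    """Write a broadcast value to a temp file instead of pushing it through
    py4j; only the (short) file path crosses the Python/JVM boundary."""
    fd, path = tempfile.mkstemp(suffix=".broadcast")
    with os.fdopen(fd, "wb") as f:
        pickle.dump(obj, f, pickle.HIGHEST_PROTOCOL)
    return path

def load_broadcast(path):
    with open(path, "rb") as f:
        return pickle.load(f)

path = dump_broadcast({"weights": [0.1, 0.2, 0.3]})
print(load_broadcast(path))  # {'weights': [0.1, 0.2, 0.3]}
os.remove(path)
```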
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1912 from davies/broadcast and squashes the following commits:
      
      e06df4a [Davies Liu] load broadcast from disk in driver automatically
      db3f232 [Davies Liu] fix serialization of accumulator
      631a827 [Davies Liu] Merge branch 'master' into broadcast
      c7baa8c [Davies Liu] compress serrialized broadcast and command
      9a7161f [Davies Liu] fix doc tests
      e93cf4b [Davies Liu] address comments: add test
      6226189 [Davies Liu] improve large broadcast
      2fc8aca0
    • iAmGhost's avatar
      [SPARK-3035] Wrong example with SparkContext.addFile · 379e7585
      iAmGhost authored
      https://issues.apache.org/jira/browse/SPARK-3035
      
      fix for wrong document.
      
      Author: iAmGhost <kdh7807@gmail.com>
      
      Closes #1942 from iAmGhost/master and squashes the following commits:
      
      487528a [iAmGhost] [SPARK-3035] Wrong example with SparkContext.addFile fix for wrong document.
      379e7585
  23. Aug 06, 2014
    • Nicholas Chammas's avatar
      [SPARK-2627] [PySpark] have the build enforce PEP 8 automatically · d614967b
      Nicholas Chammas authored
      As described in [SPARK-2627](https://issues.apache.org/jira/browse/SPARK-2627), we'd like Python code to automatically be checked for PEP 8 compliance by Jenkins. This pull request aims to do that.
      
      Notes:
      * We may need to install [`pep8`](https://pypi.python.org/pypi/pep8) on the build server.
      * I'm expecting tests to fail now that PEP 8 compliance is being checked as part of the build. I'm fine with cleaning up any remaining PEP 8 violations as part of this pull request.
      * I did not understand why the RAT and scalastyle reports are saved to text files. I did the same for the PEP 8 check, but only so that the console output style can match those for the RAT and scalastyle checks. The PEP 8 report is removed right after the check is complete.
      * Updates to the ["Contributing to Spark"](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark) guide will be submitted elsewhere, as I don't believe that text is part of the Spark repo.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1744 from nchammas/master and squashes the following commits:
      
      274b238 [Nicholas Chammas] [SPARK-2627] [PySpark] minor indentation changes
      983d963 [nchammas] Merge pull request #5 from apache/master
      1db5314 [nchammas] Merge pull request #4 from apache/master
      0e0245f [Nicholas Chammas] [SPARK-2627] undo erroneous whitespace fixes
      bf30942 [Nicholas Chammas] [SPARK-2627] PEP8: comment spacing
      6db9a44 [nchammas] Merge pull request #3 from apache/master
      7b4750e [Nicholas Chammas] merge upstream changes
      91b7584 [Nicholas Chammas] [SPARK-2627] undo unnecessary line breaks
      44e3e56 [Nicholas Chammas] [SPARK-2627] use tox.ini to exclude files
      b09fae2 [Nicholas Chammas] don't wrap comments unnecessarily
      bfb9f9f [Nicholas Chammas] [SPARK-2627] keep up with the PEP 8 fixes
      9da347f [nchammas] Merge pull request #2 from apache/master
      aa5b4b5 [Nicholas Chammas] [SPARK-2627] follow Spark bash style for if blocks
      d0a83b9 [Nicholas Chammas] [SPARK-2627] check that pep8 downloaded fine
      dffb5dd [Nicholas Chammas] [SPARK-2627] download pep8 at runtime
      a1ce7ae [Nicholas Chammas] [SPARK-2627] space out test report sections
      21da538 [Nicholas Chammas] [SPARK-2627] it's PEP 8, not PEP8
      6f4900b [Nicholas Chammas] [SPARK-2627] more misc PEP 8 fixes
      fe57ed0 [Nicholas Chammas] removing merge conflict backups
      9c01d4c [nchammas] Merge pull request #1 from apache/master
      9a66cb0 [Nicholas Chammas] resolving merge conflicts
      a31ccc4 [Nicholas Chammas] [SPARK-2627] miscellaneous PEP 8 fixes
      beaa9ac [Nicholas Chammas] [SPARK-2627] fail check on non-zero status
      723ed39 [Nicholas Chammas] always delete the report file
      0541ebb [Nicholas Chammas] [SPARK-2627] call Python linter from run-tests
      12440fa [Nicholas Chammas] [SPARK-2627] add Scala linter
      61c07b9 [Nicholas Chammas] [SPARK-2627] add Python linter
      75ad552 [Nicholas Chammas] make check output style consistent
      d614967b
  24. Aug 02, 2014
    • Andrew Or's avatar
      [SPARK-2454] Do not ship spark home to Workers · 148af608
      Andrew Or authored
      When standalone Workers launch executors, they inherit the Spark home set by the driver. This means if the worker machines do not share the same directory structure as the driver node, the Workers will attempt to run scripts (e.g. bin/compute-classpath.sh) that do not exist locally and fail. This is a common scenario if the driver is launched from outside of the cluster.
      
      The solution is to simply not pass the driver's Spark home to the Workers. This PR further makes an attempt to avoid overloading the usages of `spark.home`, which is now only used for setting executor Spark home on Mesos and in python.
      
      This is based on top of #1392 and originally reported by YanTangZhai. Tested on standalone cluster.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #1734 from andrewor14/spark-home-reprise and squashes the following commits:
      
      f71f391 [Andrew Or] Revert changes in python
      1c2532c [Andrew Or] Merge branch 'master' of github.com:apache/spark into spark-home-reprise
      188fc5d [Andrew Or] Avoid using spark.home where possible
      09272b7 [Andrew Or] Always use Worker's working directory as spark home
      148af608
  25. Jul 30, 2014
    • Kan Zhang's avatar
      [SPARK-2024] Add saveAsSequenceFile to PySpark · 94d1f46f
      Kan Zhang authored
      JIRA issue: https://issues.apache.org/jira/browse/SPARK-2024
      
      This PR is a followup to #455 and adds capabilities for saving PySpark RDDs using SequenceFile or any Hadoop OutputFormats.
      
      * Added RDD methods ```saveAsSequenceFile```, ```saveAsHadoopFile``` and ```saveAsHadoopDataset```, for both old and new MapReduce APIs.
      
      * Default converter for converting common data types to Writables. Users may specify custom converters to convert to desired data types.
      
      * No out-of-the-box support for reading/writing arrays, since ArrayWritable itself doesn't have a no-arg constructor for creating an empty instance upon reading. Users need to provide ArrayWritable subtypes. Custom converters for converting arrays to suitable ArrayWritable subtypes are also needed when writing. When reading, the default converter will convert any custom ArrayWritable subtypes to ```Object[]``` and they get pickled to Python tuples.
      
      * Added HBase and Cassandra output examples to show how custom output formats and converters can be used.
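A default type-to-Writable mapping like the one described above can be sketched as a dispatch function (the mapping below is illustrative, not the PR's actual converter):

```python
def default_writable_for(value):
    """Pick a Hadoop Writable class name for a common Python value."""
    # bool must be checked before int: in Python, bool is a subclass of int
    if isinstance(value, bool):
        return "BooleanWritable"
    if isinstance(value, int):
        return "IntWritable"
    if isinstance(value, float):
        return "DoubleWritable"
    if isinstance(value, str):
        return "Text"
    raise TypeError("no default converter for %r" % type(value))

print(default_writable_for(42))  # IntWritable
```

Users supplying custom converters would replace this dispatch with their own mapping to `ArrayWritable` subtypes and the like.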
      
      cc MLnick mateiz ahirreddy pwendell
      
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #1338 from kanzhang/SPARK-2024 and squashes the following commits:
      
      c01e3ef [Kan Zhang] [SPARK-2024] code formatting
      6591e37 [Kan Zhang] [SPARK-2024] renaming pickled -> pickledRDD
      d998ad6 [Kan Zhang] [SPARK-2024] refectoring to get method params below 10
      57a7a5e [Kan Zhang] [SPARK-2024] correcting typo
      75ca5bd [Kan Zhang] [SPARK-2024] Better type checking for batch serialized RDD
      0bdec55 [Kan Zhang] [SPARK-2024] Refactoring newly added tests
      9f39ff4 [Kan Zhang] [SPARK-2024] Adding 2 saveAsHadoopDataset tests
      0c134f3 [Kan Zhang] [SPARK-2024] Test refactoring and adding couple unbatched cases
      7a176df [Kan Zhang] [SPARK-2024] Add saveAsSequenceFile to PySpark
      94d1f46f
  26. Jul 28, 2014
    • Josh Rosen's avatar
      [SPARK-1550] [PySpark] Allow SparkContext creation after failed attempts · a7d145e9
      Josh Rosen authored
      This addresses a PySpark issue where a failed attempt to construct SparkContext would prevent any future SparkContext creation.
      
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1606 from JoshRosen/SPARK-1550 and squashes the following commits:
      
      ec7fadc [Josh Rosen] [SPARK-1550] [PySpark] Allow SparkContext creation after failed attempts
      a7d145e9
  27. Jul 26, 2014
    • Davies Liu's avatar
      [SPARK-2652] [PySpark] Turning some default configs for PySpark · 75663b57
      Davies Liu authored
      Add several default configs for PySpark, related to serialization in JVM.
      
      spark.serializer = org.apache.spark.serializer.KryoSerializer
      spark.serializer.objectStreamReset = 100
      spark.rdd.compress = True
      
      This will help to reduce the memory usage during RDD.partitionBy()
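The defaults-merging behavior can be sketched as a plain dict merge, with the key names taken from the commit message (the helper is hypothetical, not Spark's actual config code):

```python
# defaults named in this commit; applied only where the user has not set a value
DEFAULTS = {
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
    "spark.serializer.objectStreamReset": "100",
    "spark.rdd.compress": "true",
}

def with_defaults(user_conf):
    """Overlay user settings on the defaults; user settings win."""
    merged = dict(DEFAULTS)
    merged.update(user_conf)
    return merged

print(with_defaults({"spark.rdd.compress": "false"})["spark.rdd.compress"])  # false
```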
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1568 from davies/conf and squashes the following commits:
      
      cd316f1 [Davies Liu] remove duplicated line
      f71a355 [Davies Liu] rebase to master, add spark.rdd.compress = True
      8f63f45 [Davies Liu] Merge branch 'master' into conf
      8bc9f08 [Davies Liu] fix unittest
      c04a83d [Davies Liu] some default configs for PySpark
      75663b57
    • Josh Rosen's avatar
      [SPARK-1458] [PySpark] Expose sc.version in Java and PySpark · cf3e9fd8
      Josh Rosen authored
      Author: Josh Rosen <joshrosen@apache.org>
      
      Closes #1596 from JoshRosen/spark-1458 and squashes the following commits:
      
      fdbb0bf [Josh Rosen] Add SparkContext.version to Python & Java [SPARK-1458]
      cf3e9fd8
  28. Jul 24, 2014
  29. Jul 22, 2014
    • Nicholas Chammas's avatar
      [SPARK-2470] PEP8 fixes to PySpark · 5d16d5bb
      Nicholas Chammas authored
      This pull request aims to resolve all outstanding PEP8 violations in PySpark.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      Author: nchammas <nicholas.chammas@gmail.com>
      
      Closes #1505 from nchammas/master and squashes the following commits:
      
      98171af [Nicholas Chammas] [SPARK-2470] revert PEP 8 fixes to cloudpickle
      cba7768 [Nicholas Chammas] [SPARK-2470] wrap expression list in parentheses
      e178dbe [Nicholas Chammas] [SPARK-2470] style - change position of line break
      9127d2b [Nicholas Chammas] [SPARK-2470] wrap expression lists in parentheses
      22132a4 [Nicholas Chammas] [SPARK-2470] wrap conditionals in parentheses
      24639bc [Nicholas Chammas] [SPARK-2470] fix whitespace for doctest
      7d557b7 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to tests.py
      8f8e4c0 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to storagelevel.py
      b3b96cf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to statcounter.py
      d644477 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to worker.py
      aa3a7b6 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to sql.py
      1916859 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to shell.py
      95d1d95 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to serializers.py
      a0fec2e [Nicholas Chammas] [SPARK-2470] PEP8 fixes to mllib
      c85e1e5 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to join.py
      d14f2f1 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to __init__.py
      81fcb20 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to resultiterable.py
      1bde265 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to java_gateway.py
      7fc849c [Nicholas Chammas] [SPARK-2470] PEP8 fixes to daemon.py
      ca2d28b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to context.py
      f4e0039 [Nicholas Chammas] [SPARK-2470] PEP8 fixes to conf.py
      a6d5e4b [Nicholas Chammas] [SPARK-2470] PEP8 fixes to cloudpickle.py
      f0a7ebf [Nicholas Chammas] [SPARK-2470] PEP8 fixes to rddsampler.py
      4dd148f [nchammas] Merge pull request #5 from apache/master
      f7e4581 [Nicholas Chammas] unrelated pep8 fix
      a36eed0 [Nicholas Chammas] name ec2 instances and security groups consistently
      de7292a [nchammas] Merge pull request #4 from apache/master
      2e4fe00 [nchammas] Merge pull request #3 from apache/master
      89fde08 [nchammas] Merge pull request #2 from apache/master
      69f6e22 [Nicholas Chammas] PEP8 fixes
      2627247 [Nicholas Chammas] broke up lines before they hit 100 chars
      6544b7e [Nicholas Chammas] [SPARK-2065] give launched instances names
      69da6cf [nchammas] Merge pull request #1 from apache/master
      5d16d5bb
  30. Jun 20, 2014
    • Anant's avatar
      [SPARK-2061] Made splits deprecated in JavaRDDLike · 010c460d
      Anant authored
      The jira for the issue can be found at: https://issues.apache.org/jira/browse/SPARK-2061
      Most of Spark has moved over to consistently using `partitions` instead of `splits`. We should do likewise and add a `partitions` method to JavaRDDLike and have `splits` just call that. We should also go through all cases where other APIs (e.g. Python) call `splits` and change those to use the newer API.
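
      The delegation pattern described above can be sketched in Python (illustrative only — the actual change is to Spark's Scala `JavaRDDLike`; the class and method bodies here are stand-ins, not Spark code):
      ```python
      import warnings

      class RDDLikeSketch:
          """Illustrative stand-in for JavaRDDLike; not Spark's actual Scala code."""

          def __init__(self, parts):
              self._parts = parts

          def partitions(self):
              # The preferred accessor going forward.
              return self._parts

          def splits(self):
              # Deprecated alias: warn, then delegate to partitions().
              warnings.warn("splits is deprecated; use partitions instead",
                            DeprecationWarning)
              return self.partitions()
      ```
      Existing callers of `splits` keep working but see a deprecation warning steering them to `partitions`.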
      
      Author: Anant <anant.asty@gmail.com>
      
      Closes #1062 from anantasty/SPARK-2061 and squashes the following commits:
      
      b83ce6b [Anant] Fixed syntax issue
      21f9210 [Anant] Fixed version number in deprecation string
      9315b76 [Anant] made related changes to use partitions in python api
      8c62dd1 [Anant] Made splits deprecated in JavaRDDLike
      010c460d
  31. Jun 10, 2014
    • Nick Pentreath's avatar
      SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats · f971d6cb
      Nick Pentreath authored
      So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it.
      
      This adds initial support for reading Hadoop ```SequenceFile```s, as well as arbitrary Hadoop ```InputFormat```s, in PySpark.
      
      # Overview
      The basics are as follows:
      1. ```PythonRDD``` object contains the relevant methods, that are in turn invoked by ```SparkContext``` in PySpark
      2. The SequenceFile or InputFormat is read on the Scala side and converted from ```Writable``` instances to the relevant Scala classes (in the case of primitives)
      3. Pyrolite is used to serialize Java objects. If this fails, the fallback is ```toString```
      4. ```PickleSerializer``` on the Python side deserializes.
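
      Step 4 can be illustrated with plain `pickle` (a sketch only — in the real pipeline the bytes are produced by Pyrolite on the JVM side, and PySpark's `PickleSerializer` adds stream framing around the same loads/dumps idea):
      ```python
      import pickle

      # Records shaped like what might arrive from the JVM after Pyrolite
      # pickles (Text, DoubleWritable)-style pairs:
      records = [("key1", 1.0), ("key2", 2.0)]

      # Serialize and deserialize; the Python side only needs to unpickle
      # whatever the JVM side pickled.
      payload = pickle.dumps(records)
      restored = pickle.loads(payload)
      ```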
      
      This works "out the box" for simple ```Writable```s:
      * ```Text```
      * ```IntWritable```, ```DoubleWritable```, ```FloatWritable```
      * ```NullWritable```
      * ```BooleanWritable```
      * ```BytesWritable```
      * ```MapWritable```
      
      It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans conventions (i.e. with fields, a no-arg constructor, and getters/setters). (Perhaps in future some sugar for case classes and reflection could be added).
      
      I've tested it out with ```ESInputFormat``` as an example and it works very nicely:
      ```python
      conf = {"es.resource" : "index/type" }
      rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
      rdd.first()
      ```
      
      I suspect for things like HBase/Cassandra it will be a bit trickier to get it to work out the box.
      
      # Some things still outstanding:
      1. ~~Requires ```msgpack-python``` and will fail without it. As originally discussed with Josh, add a ```as_strings``` argument that defaults to ```False```, that can be used if ```msgpack-python``` is not available~~
      2. ~~I see from https://github.com/apache/spark/pull/363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the ```msgpack```-based SerDe here to use Pyrolite wouldn't be too hard~~
      3. ~~Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or ```java.util.Map``` that can be easily serialized)~~
      4. Support ```saveAsSequenceFile``` and ```saveAsHadoopFile``` etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR
      
      Author: Nick Pentreath <nick.pentreath@gmail.com>
      
      Closes #455 from MLnick/pyspark-inputformats and squashes the following commits:
      
      268df7e [Nick Pentreath] Documentation changes per @pwendell comments
      761269b [Nick Pentreath] Address @pwendell comments, simplify default writable conversions and remove registry.
      4c972d8 [Nick Pentreath] Add license headers
      d150431 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      cde6af9 [Nick Pentreath] Parameterize converter trait
      5ebacfa [Nick Pentreath] Update docs for PySpark input formats
      a985492 [Nick Pentreath] Move Converter examples to own package
      365d0be [Nick Pentreath] Make classes private[python]. Add docs and @Experimental annotation to Converter interface.
      eeb8205 [Nick Pentreath] Fix path relative to SPARK_HOME in tests
      1eaa08b [Nick Pentreath] HBase -> Cassandra app name oversight
      3f90c3e [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      2c18513 [Nick Pentreath] Add examples for reading HBase and Cassandra InputFormats from Python
      b65606f [Nick Pentreath] Add converter interface
      5757f6e [Nick Pentreath] Default key/value classes for sequenceFile are None
      085b55f [Nick Pentreath] Move input format tests to tests.py and clean up docs
      43eb728 [Nick Pentreath] PySpark InputFormats docs into programming guide
      94beedc [Nick Pentreath] Clean up args in PythonRDD. Set key/value converter defaults to None for PySpark context.py methods
      1a4a1d6 [Nick Pentreath] Address @mateiz style comments
      01e0813 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      15a7d07 [Nick Pentreath] Remove default args for key/value classes. Arg names to camelCase
      9fe6bd5 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      84fe8e3 [Nick Pentreath] Python programming guide space formatting
      d0f52b6 [Nick Pentreath] Python programming guide
      7caa73a [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      93ef995 [Nick Pentreath] Add back context.py changes
      9ef1896 [Nick Pentreath] Recover earlier changes lost in previous merge for serializers.py
      077ecb2 [Nick Pentreath] Recover earlier changes lost in previous merge for context.py
      5af4770 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      35b8e3a [Nick Pentreath] Another fix for test ordering
      bef3afb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      e001b94 [Nick Pentreath] Fix test failures due to ordering
      78978d9 [Nick Pentreath] Add doc for SequenceFile and InputFormat support to Python programming guide
      64eb051 [Nick Pentreath] Scalastyle fix
      e7552fa [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      44f2857 [Nick Pentreath] Remove msgpack dependency and switch serialization to Pyrolite, plus some clean up and refactoring
      c0ebfb6 [Nick Pentreath] Change sequencefile test data generator to easily be called from PySpark tests
      1d7c17c [Nick Pentreath] Amend tests to auto-generate sequencefile data in temp dir
      17a656b [Nick Pentreath] remove binary sequencefile for tests
      f60959e [Nick Pentreath] Remove msgpack dependency and serializer from PySpark
      450e0a2 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      31a2fff [Nick Pentreath] Scalastyle fixes
      fc5099e [Nick Pentreath] Add Apache license headers
      4e08983 [Nick Pentreath] Clean up docs for PySpark context methods
      b20ec7e [Nick Pentreath] Clean up merge duplicate dependencies
      951c117 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      f6aac55 [Nick Pentreath] Bring back msgpack
      9d2256e [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      1bbbfb0 [Nick Pentreath] Clean up SparkBuild from merge
      a67dfad [Nick Pentreath] Clean up Msgpack serialization and registering
      7237263 [Nick Pentreath] Add back msgpack serializer and hadoop file code lost during merging
      25da1ca [Nick Pentreath] Add generator for nulls, bools, bytes and maps
      65360d5 [Nick Pentreath] Adding test SequenceFiles
      0c612e5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      d72bf18 [Nick Pentreath] msgpack
      dd57922 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      e67212a [Nick Pentreath] Add back msgpack dependency
      f2d76a0 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      41856a5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
      97ef708 [Nick Pentreath] Remove old writeToStream
      2beeedb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      795a763 [Nick Pentreath] Change name to WriteInputFormatTestDataGenerator. Cleanup some var names. Use SPARK_HOME in path for writing test sequencefile data.
      174f520 [Nick Pentreath] Add back graphx settings
      703ee65 [Nick Pentreath] Add back msgpack
      619c0fa [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      1c8efbc [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      eb40036 [Nick Pentreath] Remove unused comment lines
      4d7ef2e [Nick Pentreath] Fix indentation
      f1d73e3 [Nick Pentreath] mergeConfs returns a copy rather than mutating one of the input arguments
      0f5cd84 [Nick Pentreath] Remove unused pair UTF8 class. Add comments to msgpack deserializer
      4294cbb [Nick Pentreath] Add old Hadoop api methods. Clean up and expand comments. Clean up argument names
      818a1e6 [Nick Pentreath] Add sequencefile and Hadoop InputFormat support to PythonRDD
      4e7c9e3 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
      c304cc8 [Nick Pentreath] Adding supporting sequencefiles for tests. Cleaning up
      4b0a43f [Nick Pentreath] Refactoring utils into own objects. Cleaning up old commented-out code
      d86325f [Nick Pentreath] Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop InputFormat
      f971d6cb
  32. Jun 03, 2014
    • Kan Zhang's avatar
      [SPARK-1161] Add saveAsPickleFile and SparkContext.pickleFile in Python · 21e40ed8
      Kan Zhang authored
      Author: Kan Zhang <kzhang@apache.org>
      
      Closes #755 from kanzhang/SPARK-1161 and squashes the following commits:
      
      24ed8a2 [Kan Zhang] [SPARK-1161] Fixing doc tests
      44e0615 [Kan Zhang] [SPARK-1161] Adding an optional batchSize with default value 10
      d929429 [Kan Zhang] [SPARK-1161] Add saveAsObjectFile and SparkContext.objectFile in Python
      21e40ed8
  33. May 31, 2014
    • Aaron Davidson's avatar
      SPARK-1839: PySpark RDD#take() shouldn't always read from driver · 9909efc1
      Aaron Davidson authored
      This patch simply ports over the Scala implementation of RDD#take(), which reads the first partition at the driver, then decides how many more partitions it needs to read and will possibly start a real job if it's more than 1. (Note that SparkContext#runJob(allowLocal=true) only runs the job locally if there's 1 partition selected and no parent stages.)
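
      The driver-side loop described above looks roughly like this in Python (a sketch of the scan-and-grow strategy only, not Spark's actual implementation; the 4x growth factor and the list-of-lists stand-in for partitions are illustrative):
      ```python
      def take_sketch(partitions, num):
          # partitions: a list of lists, as if each inner list were one RDD
          # partition already collected to the driver.
          items = []
          parts_scanned = 0
          total_parts = len(partitions)
          while len(items) < num and parts_scanned < total_parts:
              # The first pass reads a single partition; later passes grow the
              # number of partitions scanned when earlier passes fell short.
              num_parts = 1
              if parts_scanned > 0:
                  num_parts = min(parts_scanned * 4, total_parts - parts_scanned)
              for part in partitions[parts_scanned:parts_scanned + num_parts]:
                  items.extend(part[:num - len(items)])
              parts_scanned += num_parts
          return items
      ```
      The common case of `take(n)` with enough rows in the first partition thus never launches a multi-partition job.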
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #922 from aarondav/take and squashes the following commits:
      
      fa06df9 [Aaron Davidson] SPARK-1839: PySpark RDD#take() shouldn't always read from driver
      9909efc1
  34. May 29, 2014
    • Jyotiska NK's avatar
      Added doctest and method description in context.py · 9cff1dd2
      Jyotiska NK authored
      Added doctest for method textFile and description for methods _initialize_context and _ensure_initialized in context.py
      
      Author: Jyotiska NK <jyotiska123@gmail.com>
      
      Closes #187 from jyotiska/pyspark_context and squashes the following commits:
      
      356f945 [Jyotiska NK] Added doctest for textFile method in context.py
      5b23686 [Jyotiska NK] Updated context.py with method descriptions
      9cff1dd2
  35. May 24, 2014
    • Andrew Or's avatar
      [SPARK-1900 / 1918] PySpark on YARN is broken · 5081a0a9
      Andrew Or authored
      If I run the following on a YARN cluster
      ```
      bin/spark-submit sheep.py --master yarn-client
      ```
      it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
      ```
      bin/spark-submit file:/path/to/sheep.py --master yarn-client
      ```
      However, this also fails. This time it is because Python does not understand URI schemes.
      
      This PR fixes this by automatically resolving all paths passed as command line argument to `spark-submit` properly. This has the added benefit of keeping file and jar paths consistent across different cluster modes. For python, we strip the URI scheme before we actually try to run it.
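
      The resolution and stripping steps can be sketched in Python (assumed behavior based on the description above; the actual logic lives in Spark's Scala `Utils`, and the function names here are illustrative — note this naive sketch ignores Windows drive-letter paths, which need special handling):
      ```python
      import os
      from urllib.parse import urlparse

      def resolve_uri(path):
          # A path with no scheme is assumed to be a local file and gets the
          # "file:" prefix; anything that already has a scheme (hdfs:, file:,
          # ...) is passed through unchanged.
          if urlparse(path).scheme:
              return path
          return "file:" + os.path.abspath(path)

      def strip_scheme(uri):
          # Python cannot execute "file:/path/to/sheep.py" directly, so the
          # scheme is stripped before the primary resource is handed to the
          # interpreter.
          parsed = urlparse(uri)
          return parsed.path if parsed.scheme else uri
      ```
      With this, `bin/spark-submit sheep.py` and `bin/spark-submit file:/path/to/sheep.py` both end up pointing at the same local file regardless of cluster mode.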
      
      Much of the code is originally written by @mengxr. Tested on YARN cluster. More tests pending.
      
      Author: Andrew Or <andrewor14@gmail.com>
      
      Closes #853 from andrewor14/submit-paths and squashes the following commits:
      
      0bb097a [Andrew Or] Format path correctly before adding it to PYTHONPATH
      323b45c [Andrew Or] Include --py-files on PYTHONPATH for pyspark shell
      3c36587 [Andrew Or] Improve error messages (minor)
      854aa6a [Andrew Or] Guard against NPE if user gives pathological paths
      6638a6b [Andrew Or] Fix spark-shell jar paths after #849 went in
      3bb0359 [Andrew Or] Update more comments (minor)
      2a1f8a0 [Andrew Or] Update comments (minor)
      6af2c77 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
      a68c4d1 [Andrew Or] Handle Windows python file path correctly
      427a250 [Andrew Or] Resolve paths properly for Windows
      a591a4a [Andrew Or] Update tests for resolving URIs
      6c8621c [Andrew Or] Move resolveURIs to Utils
      db8255e [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
      f542dce [Andrew Or] Fix outdated tests
      691c4ce [Andrew Or] Ignore special primary resource names
      5342ac7 [Andrew Or] Add missing space in error message
      02f77f3 [Andrew Or] Resolve command line arguments to spark-submit properly
      5081a0a9
  36. May 21, 2014