  1. Apr 21, 2015
    • [SPARK-6953] [PySpark] speed up python tests · 3134c3fe
      Reynold Xin authored
      This PR tries to speed up some Python tests:
      
      ```
      tests.py                       144s -> 103s      -41s
      mllib/classification.py         24s -> 17s        -7s
      mllib/regression.py             27s -> 15s       -12s
      mllib/tree.py                   27s -> 13s       -14s
      mllib/tests.py                  64s -> 31s       -33s
      streaming/tests.py             185s -> 84s      -101s
      ```
      Counting Python 3, the total saving is 558s (almost 10 minutes): core and streaming run three times, and mllib runs twice.
      
      During testing, the time spent on each test file is now shown:
      ```
      Run core tests ...
      Running test: pyspark/rdd.py ... ok (22s)
      Running test: pyspark/context.py ... ok (16s)
      Running test: pyspark/conf.py ... ok (4s)
      Running test: pyspark/broadcast.py ... ok (4s)
      Running test: pyspark/accumulators.py ... ok (4s)
      Running test: pyspark/serializers.py ... ok (6s)
      Running test: pyspark/profiler.py ... ok (5s)
      Running test: pyspark/shuffle.py ... ok (1s)
      Running test: pyspark/tests.py ... ok (103s)   144s
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5605 from rxin/python-tests-speed and squashes the following commits:
      
      d08542d [Reynold Xin] Merge pull request #14 from mengxr/SPARK-6953
      89321ee [Xiangrui Meng] fix seed in tests
      3ad2387 [Reynold Xin] Merge pull request #5427 from davies/python_tests
  2. Apr 16, 2015
    • [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickling an array from Pyrolite is broken in Python 3, so those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with Python 3.
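
      The port leans on small compatibility shims rather than dropping Python 2. A minimal sketch of the pattern (the shims here are illustrative, not the exact ones in this PR):

      ```python
      import sys

      if sys.version_info[0] >= 3:
          xrange = range                  # Python 3 removed xrange
          from io import StringIO
      else:
          from StringIO import StringIO   # Python 2 location

      def half(n):
          return n // 2                   # explicit floor division behaves the same on 2 and 3
      ```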
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
  3. Apr 15, 2015
    • [SPARK-6886] [PySpark] fix big closure with shuffle · f11288d5
      Davies Liu authored
      Currently, the created broadcast object has the same life cycle as the RDD in Python. For multi-stage jobs, a PythonRDD is created in the JVM while the RDD in Python may already be garbage-collected, so the broadcast can be destroyed in the JVM before the PythonRDD.

      This PR changes the code to use the PythonRDD to track the lifecycle of the broadcast object. It also refactors getNumPartitions() to avoid the unnecessary creation of a PythonRDD, which can be heavy.
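      A rough reproduction sketch (sizes and names are arbitrary, `sc` is assumed): a large closure is auto-broadcast, and the shuffle adds a second stage that still needs that broadcast after the Python-side RDD may have been collected:

      ```python
      data = list(range(200000))           # big enough to be broadcast automatically
      rdd = sc.parallelize(range(100), 4)
      # map + reduceByKey creates a multi-stage job; the broadcast backing
      # `data` must survive until the JVM-side PythonRDD is done with it
      print(rdd.map(lambda x: (x % 10, data[x])).reduceByKey(lambda a, b: a).count())
      ```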
      
      cc JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5496 from davies/big_closure and squashes the following commits:
      
      9a0ea4c [Davies Liu] fix big closure with shuffle
  4. Apr 10, 2015
    • [SPARK-6216] [PySpark] check the python version in worker · 4740d6a1
      Davies Liu authored
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5404 from davies/check_version and squashes the following commits:
      
      e559248 [Davies Liu] add tests
      ec33b5f [Davies Liu] check the python version in worker
    • [SPARK-5969][PySpark] Fix descending pyspark.rdd.sortByKey. · 0375134f
      Milan Straka authored
      The samples should always be sorted in ascending order, because bisect.bisect_left is used on them. The reverse order of the result is already achieved in rangePartitioner by reversing the found index.

      The current implementation also works, but it always uses only two partitions -- the first one and the last one -- because bisect_left returns either "beginning" or "end" for a descending sequence.
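      The failure mode is easy to see with plain bisect (a standalone sketch, not code from the patch):

      ```python
      import bisect

      samples = [30, 20, 10]   # descending: violates bisect's sorted-input precondition
      print([bisect.bisect_left(samples, v) for v in (5, 15, 25, 35)])
      # [0, 0, 3, 3] -- every key lands in the first or last partition

      samples.sort()           # ascending, as the fix requires
      print([bisect.bisect_left(samples, v) for v in (5, 15, 25, 35)])
      # [0, 1, 2, 3] -- keys spread across all partitions
      ```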
      
      Author: Milan Straka <fox@ucw.cz>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4761 from foxik/fix-descending-sort and squashes the following commits:
      
      95896b5 [Milan Straka] Add regression test for SPARK-5969.
      5757490 [Milan Straka] Fix descending pyspark.rdd.sortByKey.
  5. Apr 09, 2015
    • [SPARK-3074] [PySpark] support groupByKey() with single huge key · b5c51c8d
      Davies Liu authored
      This patch changes groupByKey() to use an external-sort-based approach, so it can support a single huge key.

      For example, it can group a dataset that includes one hot key with 40 million values (strings), using 500 MB of memory for the Python worker, finishing in about 2 minutes (the hash-based approach would need 6 GB of memory).

      During groupByKey(), it does an in-memory groupBy first. If the dataset cannot fit in memory, the data is partitioned by hash. If one partition still cannot fit in memory, it switches to a sort-based groupBy().
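      The core idea of sort-based grouping, in miniature (illustrative only): once pairs are sorted by key, equal keys are adjacent and can be streamed, so no single key's values ever need a hash table:

      ```python
      from itertools import groupby
      from operator import itemgetter

      pairs = [("hot", 1), ("cold", 2), ("hot", 3), ("hot", 4)]
      pairs.sort(key=itemgetter(0))          # an external sort in the real implementation
      for key, group in groupby(pairs, key=itemgetter(0)):
          print(key, [v for _, v in group])  # values arrive as a stream per key
      ```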
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #1977 from davies/groupby and squashes the following commits:
      
      af3713a [Davies Liu] make sure it's iterator
      67772dd [Davies Liu] fix tests
      e78c15c [Davies Liu] address comments
      0b0fde8 [Davies Liu] address comments
      0dcf320 [Davies Liu] address comments, rollback changes in ResultIterable
      e3b8eab [Davies Liu] fix narrow dependency
      2a1857a [Davies Liu] typo
      d2f053b [Davies Liu] add repr for FlattedValuesSerializer
      c6a2f8d [Davies Liu] address comments
      9e2df24 [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      2b9c261 [Davies Liu] fix typo in comments
      70aadcd [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      a14b4bd [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      ab5515b [Davies Liu] Merge branch 'master' into groupby
      651f891 [Davies Liu] simplify GroupByKey
      1578f2e [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      1f69f93 [Davies Liu] fix tests
      0d3395f [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      341f1e0 [Davies Liu] add comments, refactor
      47918b8 [Davies Liu] remove unused code
      6540948 [Davies Liu] address comments:
      17f4ec6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into groupby
      4d4bc86 [Davies Liu] bugfix
      8ef965e [Davies Liu] Merge branch 'master' into groupby
      fbc504a [Davies Liu] Merge branch 'master' into groupby
      779ed03 [Davies Liu] fix merge conflict
      2c1d05b [Davies Liu] refactor, minor turning
      b48cda5 [Davies Liu] Merge branch 'master' into groupby
      85138e6 [Davies Liu] Merge branch 'master' into groupby
      acd8e1b [Davies Liu] fix memory when groupByKey().count()
      905b233 [Davies Liu] Merge branch 'sort' into groupby
      1f075ed [Davies Liu] Merge branch 'master' into sort
      4b07d39 [Davies Liu] compress the data while spilling
      0a081c6 [Davies Liu] Merge branch 'master' into groupby
      f157fe7 [Davies Liu] Merge branch 'sort' into groupby
      eb53ca6 [Davies Liu] Merge branch 'master' into sort
      b2dc3bf [Davies Liu] Merge branch 'sort' into groupby
      644abaf [Davies Liu] add license in LICENSE
      19f7873 [Davies Liu] improve tests
      11ba318 [Davies Liu] typo
      085aef8 [Davies Liu] Merge branch 'master' into groupby
      3ee58e5 [Davies Liu] switch to sort based groupBy, based on size of data
      1ea0669 [Davies Liu] choose sort based groupByKey() automatically
      b40bae7 [Davies Liu] bugfix
      efa23df [Davies Liu] refactor, add spark.shuffle.sort=False
      250be4e [Davies Liu] flatten the combined values when dumping into disks
      d05060d [Davies Liu] group the same key before shuffle, reduce the comparison during sorting
      083d842 [Davies Liu] sorted based groupByKey()
      55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
  6. Mar 12, 2015
    • [SPARK-6294] fix hang when call take() in JVM on PythonRDD · 712679a7
      Davies Liu authored
      Thread.interrupt() cannot terminate the thread in some cases, so we should not wait for the writerThread of PythonRDD.

      This PR also ignores some exceptions during cleanup.
      
      cc JoshRosen mengxr
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4987 from davies/fix_take and squashes the following commits:
      
      4488f1a [Davies Liu] fix hang when call take() in JVM on PythonRDD
  7. Feb 24, 2015
  8. Feb 17, 2015
    • [SPARK-5811] Added documentation for maven coordinates and added Spark Packages support · ae6cfb3a
      Burak Yavuz authored
      Documentation for maven coordinates + Spark Packages support. Added pyspark tests for `--packages`.
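      Typical usage looks like this (the coordinate is a made-up example; any groupId:artifactId:version resolvable from Maven Central or the Spark Packages repository should work):

      ```
      bin/spark-submit --packages com.example:my-spark-lib_2.10:0.1.0 my_app.py
      ```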
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4662 from brkyvz/SPARK-5811 and squashes the following commits:
      
      56ccccd [Burak Yavuz] fixed broken test
      64cb8ee [Burak Yavuz] passed pep8 on local
      c07b81e [Burak Yavuz] fixed pep8
      a8bd6b7 [Burak Yavuz] submit PR
      4ef4046 [Burak Yavuz] ready for PR
      8fb02e5 [Burak Yavuz] merged master
      25c9b9f [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into python-jar
      560d13b [Burak Yavuz] before PR
      17d3f76 [Davies Liu] support .jar as python package
      a3eb717 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into SPARK-5811
      c60156d [Burak Yavuz] [SPARK-5811] Added documentation for maven coordinates
    • [SPARK-5785] [PySpark] narrow dependency for cogroup/join in PySpark · c3d2b90b
      Davies Liu authored
      Currently, PySpark does not support a narrow dependency during cogroup/join when the two RDDs have the same partitioner, so an unnecessary extra shuffle stage comes in.

      The Python implementation of cogroup/join is different from the Scala one: it depends on union() and partitionBy(). This patch uses PartitionerAwareUnionRDD() in union() when all the RDDs have the same partitioner. It also fixes `preservesPartitioning` in all the map() and mapPartitions() calls, so partitionBy() can skip the unnecessary shuffle stage.
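      A sketch of the case this enables (illustrative; `sc` is assumed, and exact stage counts depend on the job):

      ```python
      left = sc.parallelize([(i, i) for i in range(100)]).partitionBy(8)
      right = sc.parallelize([(i, -i) for i in range(100)]).partitionBy(8)
      # both sides share the same partitioner, so the join should now be a
      # narrow dependency instead of triggering another shuffle
      joined = left.join(right)
      assert joined.getNumPartitions() == 8
      ```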
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4629 from davies/narrow and squashes the following commits:
      
      dffe34e [Davies Liu] improve test, check number of stages for join/cogroup
      1ed3ba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into narrow
      4d29932 [Davies Liu] address comment
      cc28d97 [Davies Liu] add unit tests
      940245e [Davies Liu] address comments
      ff5a0a6 [Davies Liu] skip the partitionBy() on Python side
      eb26c62 [Davies Liu] narrow dependency in PySpark
    • [SPARK-4172] [PySpark] Progress API in Python · 445a755b
      Davies Liu authored
      This patch brings the pull-based progress API into Python, along with an example in Python.
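      A sketch of polling the API from the driver while a job runs (method names as added by this patch, to the best of my reading; treat the snippet as illustrative, with `sc` assumed):

      ```python
      tracker = sc.statusTracker()
      for job_id in tracker.getActiveJobsIds():
          info = tracker.getJobInfo(job_id)
          if info:                         # the job may have finished meanwhile
              print(job_id, info.status)   # e.g. RUNNING, SUCCEEDED
      ```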
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3027 from davies/progress_api and squashes the following commits:
      
      b1ba984 [Davies Liu] fix style
      d3b9253 [Davies Liu] add tests, mute the exception after stop
      4297327 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
      969fa9d [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
      25590c9 [Davies Liu] update with Java API
      360de2d [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
      c0f1021 [Davies Liu] Merge branch 'master' of github.com:apache/spark into progress_api
      023afb3 [Davies Liu] add Python API and example for progress API
  9. Feb 03, 2015
    • [SPARK-5554] [SQL] [PySpark] add more tests for DataFrame Python API · 068c0e2e
      Davies Liu authored
      Add more tests and docs for DataFrame Python API, improve test coverage, fix bugs.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4331 from davies/fix_df and squashes the following commits:
      
      dd9919f [Davies Liu] fix tests
      467332c [Davies Liu] support string in cast()
      83c92fe [Davies Liu] address comments
      c052f6f [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_df
      8dd19a9 [Davies Liu] fix tests in python 2.6
      35ccb9f [Davies Liu] fix build
      78ebcfa [Davies Liu] add sql_test.py in run_tests
      9ab78b4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_df
      6040ba7 [Davies Liu] fix docs
      3ab2661 [Davies Liu] add more tests for DataFrame
  10. Feb 02, 2015
    • [SPARK-5154] [PySpark] [Streaming] Kafka streaming support in Python · 0561c454
      Davies Liu authored
      This PR brings in the Python API for the Spark Streaming Kafka data source.
      
      ```
          class KafkaUtils(__builtin__.object)
           |  Static methods defined here:
           |
        |  createStream(ssc, zkQuorum, groupId, topics, storageLevel=StorageLevel(True, True, False, False, 2), keyDecoder=<function utf8_decoder>, valueDecoder=<function utf8_decoder>)
           |      Create an input stream that pulls messages from a Kafka Broker.
           |
           |      :param ssc:  StreamingContext object
           |      :param zkQuorum:  Zookeeper quorum (hostname:port,hostname:port,..).
           |      :param groupId:  The group id for this consumer.
           |      :param topics:  Dict of (topic_name -> numPartitions) to consume.
           |                      Each partition is consumed in its own thread.
           |      :param storageLevel:  RDD storage level.
           |      :param keyDecoder:  A function used to decode key
           |      :param valueDecoder:  A function used to decode value
           |      :return: A DStream object
      ```
      Run the example:
      
      ```
      bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
      ```
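      A minimal consumer sketch using the signature above (host, group, and topic are placeholders; `sc` is assumed):

      ```python
      from pyspark.streaming import StreamingContext
      from pyspark.streaming.kafka import KafkaUtils

      ssc = StreamingContext(sc, 1)        # 1-second batches
      stream = KafkaUtils.createStream(ssc, "localhost:2181", "wordcount-group", {"test": 1})
      lines = stream.map(lambda kv: kv[1])  # (key, value) pairs; UTF-8 decoded by default
      lines.pprint()
      ```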
      
      Author: Davies Liu <davies@databricks.com>
      Author: Tathagata Das <tdas@databricks.com>
      
      Closes #3715 from davies/kafka and squashes the following commits:
      
      d93bfe0 [Davies Liu] Update make-distribution.sh
      4280d04 [Davies Liu] address comments
      e6d0427 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
      f257071 [Davies Liu] add tests for null in RDD
      23b039a [Davies Liu] address comments
      9af51c4 [Davies Liu] Merge branch 'kafka' of github.com:davies/spark into kafka
      a74da87 [Davies Liu] address comments
      dc1eed0 [Davies Liu] Update kafka_wordcount.py
      31e2317 [Davies Liu] Update kafka_wordcount.py
      370ba61 [Davies Liu] Update kafka.py
      97386b3 [Davies Liu] address comment
      2c567a5 [Davies Liu] update logging and comment
      33730d1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
      adeeb38 [Davies Liu] Merge pull request #3 from tdas/kafka-python-api
      aea8953 [Tathagata Das] Kafka-assembly for Python API
      eea16a7 [Davies Liu] refactor
      f6ce899 [Davies Liu] add example and fix bugs
      98c8d17 [Davies Liu] fix python style
      5697a01 [Davies Liu] bypass decoder in scala
      048dbe6 [Davies Liu] fix python style
      75d485e [Davies Liu] add mqtt
      07923c4 [Davies Liu] support kafka in Python
    • [SQL] Improve DataFrame API error reporting · 554403fd
      Reynold Xin authored
      1. Throw UnsupportedOperationException if a Column is not computable.
      2. Perform eager analysis on DataFrame so we can catch errors when they happen (not when an action is run).
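      In practice the change moves failures earlier (a sketch; `df` is assumed to exist):

      ```python
      try:
          df.select("no_such_column")   # fails here, at analysis time
      except Exception as e:
          print("caught during analysis, not at an action:", e)
      ```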
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4296 from rxin/col-computability and squashes the following commits:
      
      6527b86 [Reynold Xin] Merge pull request #8 from davies/col-computability
      fd92bc7 [Reynold Xin] Merge branch 'master' into col-computability
      f79034c [Davies Liu] fix python tests
      5afe1ff [Reynold Xin] Fix scala test.
      17f6bae [Reynold Xin] Various fixes.
      b932e86 [Reynold Xin] Added eager analysis for error reporting.
      e6f00b8 [Reynold Xin] [SQL][API] ComputableColumn vs IncomputableColumn
  11. Jan 29, 2015
    • [SPARK-5464] Fix help() for Python DataFrame instances · 0bb15f22
      Josh Rosen authored
      This fixes an exception that prevented users from calling `help()` on Python DataFrame instances.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #4278 from JoshRosen/SPARK-5464-python-dataframe-help-command and squashes the following commits:
      
      08f95f7 [Josh Rosen] Fix exception when calling help() on Python DataFrame instances
  12. Jan 28, 2015
    • [SPARK-4387][PySpark] Refactoring python profiling code to make it extensible · 3bead67d
      Yandu Oppacher authored
      This PR is based on #3255, fixing conflicts and code style.
      
      Closes #3255.
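      After the refactor, a custom profiler can be plugged in via `profiler_cls` (a minimal sketch, assuming the `BasicProfiler` export and the constructor argument landed as described here):

      ```python
      from pyspark import SparkConf, SparkContext, BasicProfiler

      class PrintingProfiler(BasicProfiler):
          # hypothetical subclass: only customizes how results are reported
          def show(self, id):
              print("profile of RDD<id=%s>" % id)

      conf = SparkConf().set("spark.python.profile", "true")  # profiling stays opt-in
      sc = SparkContext("local", "prof-demo", conf=conf, profiler_cls=PrintingProfiler)
      ```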
      
      Author: Yandu Oppacher <yandu.oppacher@jadedpixel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3901 from davies/refactor-python-profile-code and squashes the following commits:
      
      b4a9306 [Davies Liu] fix tests
      4b79ce8 [Davies Liu] add docstring for profiler_cls
      2700e47 [Davies Liu] use BasicProfiler as default
      349e341 [Davies Liu] more refactor
      6a5d4df [Davies Liu] refactor and fix tests
      31bf6b6 [Davies Liu] fix code style
      0864b5d [Yandu Oppacher] Remove unused method
      76a6c37 [Yandu Oppacher] Added a profile collector to accumulate the profilers per stage
      9eefc36 [Yandu Oppacher] Fix doc
      9ace076 [Yandu Oppacher] Refactor of profiler, and moved tests around
      8739aff [Yandu Oppacher] Code review fixes
      9bda3ec [Yandu Oppacher] Refactor profiler code
    • [SPARK-5361] Multiple Java RDD <-> Python RDD conversions not working correctly · 453d7999
      Winston Chen authored
      This was found while reading an RDD via `sc.newAPIHadoopRDD` and writing it back using `rdd.saveAsNewAPIHadoopFile` in PySpark.
      
      It turns out that whenever there are multiple RDD conversions from JavaRDD to PythonRDD and back to JavaRDD, the exception below occurs:
      
      ```
      15/01/16 10:28:31 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 7)
      java.lang.ClassCastException: [Ljava.lang.Object; cannot be cast to java.util.ArrayList
      	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:157)
      	at org.apache.spark.api.python.SerDeUtil$$anonfun$pythonToJava$1$$anonfun$apply$1.apply(SerDeUtil.scala:153)
      	at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
      	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:308)
      ```
      
      The test case code below reproduces it:
      
      ```
      from pyspark.rdd import RDD
      
      dl = [
          (u'2', {u'director': u'David Lean'}),
          (u'7', {u'director': u'Andrew Dominik'})
      ]
      
      dl_rdd = sc.parallelize(dl)
      tmp = dl_rdd._to_java_object_rdd()
      tmp2 = sc._jvm.SerDe.javaToPython(tmp)
      t = RDD(tmp2, sc)
      t.count()
      
      tmp = t._to_java_object_rdd()
      tmp2 = sc._jvm.SerDe.javaToPython(tmp)
      t = RDD(tmp2, sc)
      t.count() # it blows up here during the 2nd time of conversion
      ```
      
      Author: Winston Chen <wchen@quid.com>
      
      Closes #4146 from wingchen/master and squashes the following commits:
      
      903df7d [Winston Chen] SPARK-5361, update to toSeq based on the PR
      5d90a83 [Winston Chen] SPARK-5361, make python pretty, so to pass PEP 8 checks
      126be6b [Winston Chen] SPARK-5361, add in test case
      4cf1187 [Winston Chen] SPARK-5361, add in test case
      9f1a097 [Winston Chen] add in tuple handling while converting form python RDD back to JavaRDD
  13. Jan 27, 2015
    • [SPARK-5097][SQL] DataFrame · 119f45d6
      Reynold Xin authored
      This pull request redesigns the existing Spark SQL DSL, which already provides data-frame-like functionality; a short Python sketch of the resulting API follows the TODO list.
      
      TODOs:
      With the exception of Python support, other tasks can be done in separate, follow-up PRs.
      - [ ] Audit of the API
      - [ ] Documentation
      - [ ] More test cases to cover the new API
      - [x] Python support
      - [ ] Type alias SchemaRDD
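      A minimal sketch of the Python API as it lands here (file path and column names are assumed, as is an existing `sqlContext`):

      ```python
      df = sqlContext.jsonFile("examples/src/main/resources/people.json")
      df.filter(df.age > 21).groupBy("age").count().show()
      ```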
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4173 from rxin/df1 and squashes the following commits:
      
      0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1
      23b4427 [Reynold Xin] Mima.
      828f70d [Reynold Xin] Merge pull request #7 from davies/df
      257b9e6 [Davies Liu] add repartition
      6bf2b73 [Davies Liu] fix collect with UDT and tests
      e971078 [Reynold Xin] Missing quotes.
      b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now.
      a728bf2 [Reynold Xin] Example rename.
      e8aa3d3 [Reynold Xin] groupby -> groupBy.
      9662c9e [Davies Liu] improve DataFrame Python API
      4ae51ea [Davies Liu] python API for dataframe
      1e5e454 [Reynold Xin] Fixed a bug with symbol conversion.
      2ca74db [Reynold Xin] Couple minor fixes.
      ea98ea1 [Reynold Xin] Documentation & literal expressions.
      2b22684 [Reynold Xin] Got rid of IntelliJ problems.
      02bbfbc [Reynold Xin] Tightening imports.
      ffbce66 [Reynold Xin] Fixed compilation error.
      59b6d8b [Reynold Xin] Style violation.
      b85edfb [Reynold Xin] ALS.
      8c37f0a [Reynold Xin] Made MLlib and examples compile
      6d53134 [Reynold Xin] Hive module.
      d35efd5 [Reynold Xin] Fixed compilation error.
      ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite.
      66d5ef1 [Reynold Xin] SQLContext minor patch.
      c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!
  14. Dec 16, 2014
    • [SPARK-4866] support StructType as key in MapType · ec5c4279
      Davies Liu authored
      This PR adds support for using StructType (and other hashable types) as the key in MapType.
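      Since Python Rows are tuples (and therefore hashable), a struct-keyed map is just a dict keyed by Rows (a tiny sketch; the schema name in the comment is assumed):

      ```python
      from pyspark.sql import Row

      # a value like this can now round-trip with a schema along the lines of
      # MapType(StructType(...), DoubleType())
      m = {Row(a=1, b="x"): 1.0, Row(a=2, b="y"): 2.0}
      ```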
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3714 from davies/fix_struct_in_map and squashes the following commits:
      
      68585d7 [Davies Liu] fix primitive types in MapType
      9601534 [Davies Liu] support StructType as key in MapType
    • [SPARK-4841] fix zip with textFile() · c246b95d
      Davies Liu authored
      UTF8Deserializer cannot be used in BatchedSerializer, so PickleSerializer() is always used when changing batchSize in zip().

      Also, if two RDDs already have the same batch size, they no longer need to be re-serialized.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3706 from davies/fix_4841 and squashes the following commits:
      
      20ce3a3 [Davies Liu] fix bug in _reserialize()
      e3ebf7c [Davies Liu] add comment
      379d2c8 [Davies Liu] fix zip with textFile()
  15. Nov 24, 2014
    • [SPARK-4548] [SPARK-4517] improve performance of python broadcast · 6cf50768
      Davies Liu authored
      Re-implement the Python broadcast using files:

      1) Serialize the Python object using cPickle and write it to disk.
      2) Create a wrapper in the JVM (for the dumped file); it reads the data from the file during serialization.
      3) Use TorrentBroadcast or HttpBroadcast to transfer the (compressed) data to the executors.
      4) During deserialization, write the data to disk.
      5) Pass the path to the Python worker, which reads the data from disk and unpickles it into a Python object on first access.

      It fixes the performance regression introduced in #2659: performance is similar to 1.1, but objects larger than 2 GB are supported, and memory efficiency improves (only one compressed copy in the driver and each executor).
      
      Testing with a 500 MB broadcast and 4 tasks (excluding the benefit from worker reuse in 1.2):

      | name | 1.1 | 1.2 with this patch | improvement |
      |------|-----|---------------------|-------------|
      | python-broadcast-w-bytes | 25.20 | 9.33 | 170.13% |
      | python-broadcast-w-set | 4.13 | 4.50 | -8.35% |
      
      Testing with 100 tasks (16 CPUs):
      
      | name | 1.1 | 1.2 with this patch | improvement |
      |------|-----|---------------------|-------------|
      | python-broadcast-w-bytes | 38.16 | 8.40 | 353.98% |
      | python-broadcast-w-set | 23.29 | 9.59 | 142.80% |
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3417 from davies/pybroadcast and squashes the following commits:
      
      50a58e0 [Davies Liu] address comments
      b98de1d [Davies Liu] disable gc while unpickle
      e5ee6b9 [Davies Liu] support large string
      09303b8 [Davies Liu] read all data into memory
      dde02dd [Davies Liu] improve performance of python broadcast
    • [SPARK-4578] fix asDict() with nested Row() · 050616b4
      Davies Liu authored
      The Row object is created on the fly when a field is accessed, so fields should be accessed via getattr() in asDict().
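      The flavor of the fix (a sketch; the original bug surfaced on rows produced by SQL, where nested rows are materialized lazily):

      ```python
      from pyspark.sql import Row

      row = Row(a=1, b=Row(c=2))
      print(row.asDict())   # {'a': 1, 'b': Row(c=2)} -- a nested Row no longer breaks it
      ```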
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3434 from davies/fix_asDict and squashes the following commits:
      
      b20f1e7 [Davies Liu] fix asDict() with nested Row()
  16. Nov 18, 2014
    • [SPARK-3721] [PySpark] broadcast objects larger than 2G · 4a377aff
      Davies Liu authored
      This patch brings support for broadcasting objects larger than 2 GB.

      pickle, zlib, FrameSerializer, and Array[Byte] all cannot handle objects larger than 2 GB, so this patch introduces LargeObjectSerializer to serialize broadcast objects: the object is serialized and compressed into small chunks. It also changes the type Broadcast[Array[Byte]] into Broadcast[Array[Array[Byte]]].

      Testing broadcasts of objects larger than 2 GB is slow and memory-hungry, so this was tested manually; it could be added to SparkPerf.
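      The chunking idea in isolation (an illustrative Python sketch, not the serializer itself):

      ```python
      def chunk(serialized, chunk_size=1 << 20):
          # split one big byte string into bounded pieces so that no single
          # byte array on the JVM side has to approach the 2 GB limit
          return [serialized[i:i + chunk_size]
                  for i in range(0, len(serialized), chunk_size)]
      ```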
      
      Author: Davies Liu <davies@databricks.com>
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2659 from davies/huge and squashes the following commits:
      
      7b57a14 [Davies Liu] add more tests for broadcast
      28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      a2f6a02 [Davies Liu] bug fix
      4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      5875c73 [Davies Liu] address comments
      10a349b [Davies Liu] address comments
      0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      6182c8f [Davies Liu] Merge branch 'master' into huge
      d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      2514848 [Davies Liu] address comments
      fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into huge
      1c2d928 [Davies Liu] fix scala style
      091b107 [Davies Liu] broadcast objects larger than 2G
  17. Nov 07, 2014
    • [SPARK-4304] [PySpark] Fix sort on empty RDD · 77791097
      Davies Liu authored
      This PR fixes sortBy()/sortByKey() on an empty RDD.

      This should be backported into 1.1/1.2.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3162 from davies/fix_sort and squashes the following commits:
      
      84f64b7 [Davies Liu] add tests
      52995b5 [Davies Liu] fix sortByKey() on empty RDD
  18. Nov 06, 2014
    • [SPARK-4186] add binaryFiles and binaryRecords in Python · b41a39e2
      Davies Liu authored
      add binaryFiles() and binaryRecords() in Python
      ```
      binaryFiles(self, path, minPartitions=None):
          :: Developer API ::
      
          Read a directory of binary files from HDFS, a local file system
          (available on all nodes), or any Hadoop-supported file system URI
          as a byte array. Each file is read as a single record and returned
          in a key-value pair, where the key is the path of each file, the
          value is the content of each file.
      
          Note: Small files are preferred; large files are also allowed, but
          may cause poor performance.
      
      binaryRecords(self, path, recordLength):
          Load data from a flat binary file, assuming each record is a set of numbers
          with the specified numerical format (see ByteBuffer), and the number of
          bytes per record is constant.
      
          :param path: Directory to the input data files
          :param recordLength: The length at which to split the records
      ```
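      For example, fixed-width records can be decoded with `struct` (the path and record layout here are assumed; `sc` is assumed):

      ```python
      import struct

      records = sc.binaryRecords("hdfs:///data/fixed-width.bin", 16)
      ints = records.map(lambda rec: struct.unpack("<4i", rec))  # four little-endian int32s per record
      ```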
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #3078 from davies/binary and squashes the following commits:
      
      cd0bdbd [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
      3aa349b [Davies Liu] add experimental notes
      24e84b6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
      5ceaa8a [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
      1900085 [Davies Liu] bugfix
      bb22442 [Davies Liu] add binaryFiles and binaryRecords in Python
  19. Nov 04, 2014
    • [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default. · e4f42631
      Davies Liu authored
      This PR simplifies the serializer: always use a batched serializer (AutoBatchedSerializer by default), even when the batch size is 1.
      
      Author: Davies Liu <davies@databricks.com>
      
      This patch had conflicts when merged, resolved by
      Committer: Josh Rosen <joshrosen@databricks.com>
      
      Closes #2920 from davies/fix_autobatch and squashes the following commits:
      
      e544ef9 [Davies Liu] revert unrelated change
      6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      1d557fc [Davies Liu] fix tests
      8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      76abdce [Davies Liu] clean up
      53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
      b4292ce [Davies Liu] fix bug in master
      d79744c [Davies Liu] recover hive tests
      be37ece [Davies Liu] refactor
      eb3938d [Davies Liu] refactor serializer in scala
      8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
  20. Nov 03, 2014
    • [SPARK-4192][SQL] Internal API for Python UDT · 04450d11
      Xiangrui Meng authored
      Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL-recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A follow-up PR will add VectorUDT in MLlib for both Scala and Python.
      
      marmbrus jkbradley davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:
      
      acff637 [Xiangrui Meng] merge master
      dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
      2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
      7c4a6a9 [Xiangrui Meng] address comments
      75223db [Xiangrui Meng] minor update
      f740379 [Xiangrui Meng] remove UDT from default imports
      e98d9d0 [Xiangrui Meng] fix py style
      4e84fce [Xiangrui Meng] remove local hive tests and add more tests
      39f19e0 [Xiangrui Meng] add tests
      b7f666d [Xiangrui Meng] add Python UDT
    • [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling · 24544fbc
      Davies Liu authored
      This patch tries to infer the schema for an RDD that has empty values (None, [], {}) in the first row. It tries the first 100 rows and merges the types into a schema, also merging the fields of StructTypes together. If there is still a NullType in the schema, it shows a warning telling the user to try sampling.

      If a sampling ratio is provided, the schema is inferred from all the rows after sampling.

      Also, samplingRatio is added for jsonFile() and jsonRDD().
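      Usage sketch (`sqlContext` and `json_rdd` are assumed to exist; the parameter is the one added by this PR):

      ```python
      # infer the schema from a 30% sample instead of only the first rows
      srdd = sqlContext.jsonRDD(json_rdd, samplingRatio=0.3)
      ```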
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2716 from davies/infer and squashes the following commits:
      
      e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      567dc60 [Davies Liu] update docs
      9767b27 [Davies Liu] Merge branch 'master' into infer
      e48d7fb [Davies Liu] fix tests
      29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
      ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
      540d1d5 [Davies Liu] merge fields for StructType
      f93fd84 [Davies Liu] add more tests
      3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD
    • [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample · 3cca1962
      Xiangrui Meng authored
      The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1.
      
      ~~~
      In [14]: import random
      
      In [15]: r1 = random.Random(10)
      
      In [16]: r1.randint(0, 1)
      Out[16]: 1
      
      In [17]: r1.random()
      Out[17]: 0.4288890546751146
      
      In [18]: r1.random()
      Out[18]: 0.5780913011344704
      
      In [19]: r2 = random.Random(10)
      
      In [20]: r2.randint(0, 1)
      Out[20]: 1
      
      In [21]: r2.randint(0, 1)
      Out[21]: 0
      
      In [22]: r2.random()
      Out[22]: 0.5780913011344704
      ~~~
      
      Note: The new tests are not for this bug fix.
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:
      
      869ae4b [Xiangrui Meng] move tests tests.py
      c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for rdd.sample
  21. Oct 28, 2014
    • [SPARK-4133] [SQL] [PySpark] type conversion for python udf · 8c0bfd08
      Davies Liu authored
      A Python UDF can be called on ArrayType/MapType/PrimitiveType values, and the returnType can also be ArrayType/MapType/PrimitiveType.

      A StructType acts as a tuple (without attributes). If the returnType is a StructType, the returned value should also be a tuple.
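      For example, registering a UDF that returns an array (a sketch using the 1.x-era module layout, where the type classes lived in `pyspark.sql`; `sc` is assumed):

      ```python
      from pyspark.sql import SQLContext, ArrayType, IntegerType

      sqlContext = SQLContext(sc)
      sqlContext.registerFunction("to_range", lambda n: list(range(n)),
                                  ArrayType(IntegerType()))
      ```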
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2973 from davies/udf_array and squashes the following commits:
      
      306956e [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array
      2c00e43 [Davies Liu] fix merge
      11395fa [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array
      9df50a2 [Davies Liu] address comments
      79afb4e [Davies Liu] type conversionfor python udf
  22. Oct 24, 2014
    • [SPARK-4051] [SQL] [PySpark] Convert Row into dictionary · d60a9d44
      Davies Liu authored
      Added a method to Row to turn a row into a dict:
      
      ```
      >>> row = Row(a=1)
      >>> row.asDict()
      {'a': 1}
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2896 from davies/dict and squashes the following commits:
      
      8d97366 [Davies Liu] convert Row into dict
  23. Oct 23, 2014
    • [SPARK-3993] [PySpark] fix bug while reuse worker after take() · e595c8d0
      Davies Liu authored
      After take(), some garbage may be left in the socket, and the next task assigned to this worker will hang because of the corrupted data.

      We should make sure the socket is clean before reusing it: write END_OF_STREAM at the end, and check for it after reading out all the results from Python.
      
      Author: Davies Liu <davies.liu@gmail.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #2838 from davies/fix_reuse and squashes the following commits:
      
      8872914 [Davies Liu] fix tests
      660875b [Davies Liu] fix bug while reuse worker after take()
  24. Oct 22, 2014
    • Fix for sampling error in NumPy v1.9 [SPARK-3995][PYSPARK] · 97cf19f6
      freeman authored
      Change maximum value for default seed during RDD sampling so that it is strictly less than 2 ** 32. This prevents a bug in the most recent version of NumPy, which cannot accept random seeds above this bound.
      
      Adds an extra test that uses the default seed (instead of setting it manually, as in the docstrings).
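      The bound in question, checked standalone (independent of Spark):

      ```python
      from numpy.random import RandomState

      RandomState(2 ** 32 - 1)   # fine: NumPy requires seeds < 2 ** 32
      RandomState(2 ** 32)       # ValueError on NumPy >= 1.9
      ```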
      
      mengxr
      
      Author: freeman <the.freeman.lab@gmail.com>
      
      Closes #2889 from freeman-lab/pyspark-sampling and squashes the following commits:
      
      dc385ef [freeman] Change maximum value for default seed
  25. Oct 17, 2014
    • [SPARK-3855][SQL] Preserve the result attribute of python UDFs through transformations · adcb7d33
      Michael Armbrust authored
      In the current implementation it was possible for the reference to change after analysis.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #2717 from marmbrus/pythonUdfResults and squashes the following commits:
      
      da14879 [Michael Armbrust] Fix test
      6343bcb [Michael Armbrust] add test
      9533286 [Michael Armbrust] Correctly preserve the result attribute of python UDFs though transformations
  26. Oct 11, 2014
    • [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6... · 81015a2b
      cocoatomo authored
      [SPARK-3867][PySpark] ./python/run-tests failed when run with Python 2.6 and unittest2 is not installed
      
      ./python/run-tests searches for a Python 2.6 executable on the PATH and uses it if available.
      When using Python 2.6, it imports the unittest2 module, which is not part of the Python 2.6 standard library, so it fails with an ImportError.
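      The usual guard for this, roughly what the fix amounts to (a sketch):

      ```python
      import sys

      if sys.version_info[:2] <= (2, 6):
          try:
              import unittest2 as unittest   # backport of 2.7 unittest features
          except ImportError:
              sys.stderr.write("please install unittest2 to run the tests on Python 2.6\n")
              sys.exit(1)
      else:
          import unittest
      ```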
      
      Author: cocoatomo <cocoatomo77@gmail.com>
      
      Closes #2759 from cocoatomo/issues/3867-unittest2-import-error and squashes the following commits:
      
      f068eb5 [cocoatomo] [SPARK-3867] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
  27. Oct 06, 2014
    • [SPARK-3786] [PySpark] speedup tests · 4f01265f
      Davies Liu authored
      This patch speeds up the PySpark tests: it reuses the SparkContext in tests.py and mllib/tests.py to cut the overhead of creating a SparkContext, and removes some test cases that did not make sense. It also improves the performance of some cases, such as MergerTests and SortTests.
      
      before this patch:
      
      real	21m27.320s
      user	4m42.967s
      sys	0m17.343s
      
      after this patch:
      
      real	9m47.541s
      user	2m12.947s
      sys	0m14.543s
      
      It almost cut the time by half.
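      The main trick is a class-level SparkContext shared across a test class (a sketch of the pattern, not the exact test code):

      ```python
      import unittest
      from pyspark import SparkContext

      class ReusedPySparkTestCase(unittest.TestCase):
          @classmethod
          def setUpClass(cls):
              # one context for every test in the class, instead of one per test
              cls.sc = SparkContext("local[4]", cls.__name__)

          @classmethod
          def tearDownClass(cls):
              cls.sc.stop()
      ```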
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2646 from davies/tests and squashes the following commits:
      
      c54de60 [Davies Liu] revert change about memory limit
      6a2a4b0 [Davies Liu] refactor of tests, speedup 100%
  28. Oct 01, 2014
    • [SPARK-3749] [PySpark] fix bugs in broadcast large closure of RDD · abf588f4
      Davies Liu authored
      1. The broadcast is triggered unexpectedly.
      2. A file descriptor is leaked in the JVM (also leaked in parallelize()).
      3. The broadcast is not unpersisted in the JVM after the RDD is no longer used.
      
      cc JoshRosen , sorry for these stupid bugs.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2603 from davies/fix_broadcast and squashes the following commits:
      
      080a743 [Davies Liu] fix bugs in broadcast large closure of RDD
  29. Sep 30, 2014
    • [SPARK-3478] [PySpark] Profile the Python tasks · c5414b68
      Davies Liu authored
      This patch adds profiling support for PySpark. It shows the profiling results before the driver exits; here is one example:
      
      ```
      ============================================================
      Profile of RDD<id=3>
      ============================================================
               5146507 function calls (5146487 primitive calls) in 71.094 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
             20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
             20    0.017    0.001    0.017    0.001 {cPickle.dumps}
           1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
             20    0.001    0.000    0.001    0.000 {reduce}
             21    0.001    0.000    0.001    0.000 {cPickle.loads}
             20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
             41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
             40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
             62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
             20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
             20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
          40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
             41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
             40    0.000    0.000   71.072    1.777 rdd.py:304(func)
             20    0.000    0.000   71.094    3.555 worker.py:82(process)
      ```
      
      Also, users can show the profile results manually with `sc.show_profiles()` or dump them to disk with `sc.dump_profiles(path)`, for example:
      
      ```python
      >>> sc._conf.set("spark.python.profile", "true")
      >>> rdd = sc.parallelize(range(100)).map(str)
      >>> rdd.count()
      100
      >>> sc.show_profiles()
      ============================================================
      Profile of RDD<id=1>
      ============================================================
               284 function calls (276 primitive calls) in 0.001 seconds
      
         Ordered by: internal time, cumulative time
      
         ncalls  tottime  percall  cumtime  percall filename:lineno(function)
              4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
              4    0.000    0.000    0.000    0.000 {reduce}
           12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
              4    0.000    0.000    0.000    0.000 {cPickle.loads}
              4    0.000    0.000    0.000    0.000 {cPickle.dumps}
            104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
              8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
             12    0.000    0.000    0.000    0.000 rdd.py:303(func)
      ```
      Profiling is disabled by default; it can be enabled with "spark.python.profile=true".
      
      Users can also dump the results to disk automatically for later analysis with "spark.python.profile.dump=path_to_dump".
      
      This is a bugfix of #2351. cc JoshRosen
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2556 from davies/profiler and squashes the following commits:
      
      e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      858e74c [Davies Liu] compatitable with python 2.6
      7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
      2b0daf2 [Davies Liu] fix docs
      7a56c24 [Davies Liu] bugfix
      cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
      fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
      09d02c3 [Davies Liu] Merge branch 'master' into profiler
      c23865c [Davies Liu] Merge branch 'master' into profiler
      15d6f18 [Davies Liu] add docs for two configs
      dadee1a [Davies Liu] add docs string and clear profiles after show or dump
      4f8309d [Davies Liu] address comment, add tests
      0a5b6eb [Davies Liu] fix Python UDF
      4b20494 [Davies Liu] add profile for python
  30. Sep 27, 2014
    • [SPARK-3681] [SQL] [PySpark] fix serialization of List and Map in SchemaRDD · 0d8cdf0e
      Davies Liu authored
      Currently, the schema of an object in ArrayType or MapType is attached lazily; this performs better but introduces issues during serialization or when accessing nested objects.

      This patch applies the schema to objects of ArrayType or MapType immediately when they are accessed. This is a little slower, but much more robust.
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #2526 from davies/nested and squashes the following commits:
      
      2399ae5 [Davies Liu] fix serialization of List and Map in SchemaRDD
  31. Sep 26, 2014