Skip to content
Snippets Groups Projects
  1. Dec 11, 2015
    • Yanbo Liang's avatar
      [SPARK-11978][ML] Move dataset_example.py to examples/ml and rename to dataframe_example.py · a0ff6d16
      Yanbo Liang authored
      Since ```Dataset``` has a new meaning in Spark 1.6, we should rename it to avoid confusion.
      #9873 finished the work of Scala example, here we focus on the Python one.
      Move dataset_example.py to ```examples/ml``` and rename to ```dataframe_example.py```.
      BTW, fix minor missing issues of #9873.
      cc mengxr
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #9957 from yanboliang/SPARK-11978.
      a0ff6d16
  2. Apr 16, 2015
    • Davies Liu's avatar
      [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR update PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickle array from Pyrolite is broken in Python 3, those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with python3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] adddress comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview  I'm not super sure if this a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstsanceType with 3.4 `object`....could be horribly wrong depending on how types.InstanceType is used elsewhere in the package--see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
      04e44b37
  3. Apr 08, 2015
    • Davies Liu's avatar
      [SPARK-6781] [SQL] use sqlContext in python shell · 6ada4f6f
      Davies Liu authored
      Use `sqlContext` in PySpark shell, make it consistent with SQL programming guide. `sqlCtx` is also kept for compatibility.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5425 from davies/sqlCtx and squashes the following commits:
      
      af67340 [Davies Liu] sqlCtx -> sqlContext
      15a278f [Davies Liu] use sqlContext in python shell
      6ada4f6f
  4. Jan 27, 2015
    • Reynold Xin's avatar
      [SPARK-5097][SQL] DataFrame · 119f45d6
      Reynold Xin authored
      This pull request redesigns the existing Spark SQL dsl, which already provides data frame like functionalities.
      
      TODOs:
      With the exception of Python support, other tasks can be done in separate, follow-up PRs.
      - [ ] Audit of the API
      - [ ] Documentation
      - [ ] More test cases to cover the new API
      - [x] Python support
      - [ ] Type alias SchemaRDD
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4173 from rxin/df1 and squashes the following commits:
      
      0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1
      23b4427 [Reynold Xin] Mima.
      828f70d [Reynold Xin] Merge pull request #7 from davies/df
      257b9e6 [Davies Liu] add repartition
      6bf2b73 [Davies Liu] fix collect with UDT and tests
      e971078 [Reynold Xin] Missing quotes.
      b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now.
      a728bf2 [Reynold Xin] Example rename.
      e8aa3d3 [Reynold Xin] groupby -> groupBy.
      9662c9e [Davies Liu] improve DataFrame Python API
      4ae51ea [Davies Liu] python API for dataframe
      1e5e454 [Reynold Xin] Fixed a bug with symbol conversion.
      2ca74db [Reynold Xin] Couple minor fixes.
      ea98ea1 [Reynold Xin] Documentation & literal expressions.
      2b22684 [Reynold Xin] Got rid of IntelliJ problems.
      02bbfbc [Reynold Xin] Tightening imports.
      ffbce66 [Reynold Xin] Fixed compilation error.
      59b6d8b [Reynold Xin] Style violation.
      b85edfb [Reynold Xin] ALS.
      8c37f0a [Reynold Xin] Made MLlib and examples compile
      6d53134 [Reynold Xin] Hive module.
      d35efd5 [Reynold Xin] Fixed compilation error.
      ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite.
      66d5ef1 [Reynold Xin] SQLContext minor patch.
      c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!
      119f45d6
  5. Nov 04, 2014
    • Xiangrui Meng's avatar
      [SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD · 1a9c6cdd
      Xiangrui Meng authored
      Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.
      
      ~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~
      
      marmbrus jkbradley
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:
      
      3a0b6e5 [Xiangrui Meng] organize imports
      236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples
      1a9c6cdd
Loading