  1. Jul 20, 2015
  2. Jul 19, 2015
  3. Jul 10, 2015
  4. Jul 09, 2015
    • [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [PYSPARK] Refactor of serialization for Python DataFrame · c9e2ef52
      Davies Liu authored
      
      This PR fixes the long-standing serialization issue between Python RDDs and DataFrames. It switches to a customized Pickler for InternalRow to enable customized unpickling (type conversion, especially for UDTs), so UDTs are now supported in UDFs. cc mengxr
      
      There is no generated `Row` anymore.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7301 from davies/sql_ser and squashes the following commits:
      
      81bef71 [Davies Liu] address comments
      e9217bd [Davies Liu] add regression tests
      db34167 [Davies Liu] Refactor of serialization for Python DataFrame
      c9e2ef52
  5. Jul 08, 2015
    • [SPARK-8450] [SQL] [PYSPARK] cleanup type converter for Python DataFrame · 74d8d3d9
      Davies Liu authored
      This PR fixes the converter for Python DataFrame, especially for DecimalType
      
      Closes #7106
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7131 from davies/decimal_python and squashes the following commits:
      
      4d3c234 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      20531d6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      7d73168 [Davies Liu] fix conflict
      6cdd86a [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
      7104e97 [Davies Liu] improve type infer
      9cd5a21 [Davies Liu] run python tests with SPARK_PREPEND_CLASSES
      829a05b [Davies Liu] fix UDT in python
      c99e8c5 [Davies Liu] fix mima
      c46814a [Davies Liu] convert decimal for Python DataFrames
      74d8d3d9
  6. Jul 01, 2015
    • [SPARK-8766] support non-ascii character in column names · f958f27e
      Davies Liu authored
      Use UTF-8 to encode column names in Python 2; otherwise encoding may fail with the default encoding ('ascii').
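
A minimal Python 3 sketch of the failure mode described above, not the PR's code: encoding a non-ASCII name with the `ascii` codec fails, while UTF-8 succeeds. The column name and the helper `try_encode` are made up for illustration.

```python
# Hypothetical column name containing non-ASCII characters.
col_name = u"数量"


def try_encode(name, encoding):
    """Return the encoded bytes, or None if the codec cannot represent the name."""
    try:
        return name.encode(encoding)
    except UnicodeEncodeError:
        return None


ascii_result = try_encode(col_name, "ascii")  # fails: characters outside ASCII
utf8_result = try_encode(col_name, "utf-8")   # succeeds and round-trips

print(ascii_result is None)                       # True
print(utf8_result.decode("utf-8") == col_name)    # True
```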
      
      This PR also fixes a bug that occurs when a Java exception has no error message.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7165 from davies/non_ascii and squashes the following commits:
      
      02cb61a [Davies Liu] fix tests
      3b09d31 [Davies Liu] add encoding in header
      867754a [Davies Liu] support non-ascii character in column names
      f958f27e
  7. Jun 30, 2015
    • [SPARK-8738] [SQL] [PYSPARK] capture SQL AnalysisException in Python API · 58ee2a2e
      Davies Liu authored
      Capture the SQL AnalysisException in the Python API: hide the long Java stack trace and show only the error message.
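
The pattern can be sketched in plain Python 3 as a decorator that turns a bridge-level error into a short Python exception. This is an illustration under assumptions, not the PR's implementation: `FakeJavaError`, `capture_analysis_error`, and `run_query` are hypothetical stand-ins, and no real JVM is involved.

```python
import functools


class FakeJavaError(Exception):
    """Stand-in for an error raised by the JVM bridge (hypothetical)."""

    def __init__(self, java_class, message, stack_trace):
        super().__init__("%s: %s\n%s" % (java_class, message, stack_trace))
        self.java_class = java_class
        self.message = message


class AnalysisException(Exception):
    """Python-side exception that carries only the error message."""


def capture_analysis_error(func):
    """Re-raise JVM analysis errors as a concise Python exception."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except FakeJavaError as e:
            if e.java_class.endswith("AnalysisException"):
                # Drop the Java stack trace; keep only the message.
                raise AnalysisException(e.message) from None
            raise  # any other JVM error passes through unchanged
    return wrapper


@capture_analysis_error
def run_query(sql):
    # Pretend the JVM rejected the query during analysis.
    raise FakeJavaError(
        "org.apache.spark.sql.AnalysisException",
        "cannot resolve 'missing_col'",
        "\tat (long Java stack trace)",
    )
```

Calling `run_query("select missing_col from t")` then raises `AnalysisException` with just the message text.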
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #7135 from davies/ananylis and squashes the following commits:
      
      dad7ae7 [Davies Liu] add comment
      ec0c0e8 [Davies Liu] Update utils.py
      cdd7edd [Davies Liu] add doc
      7b044c2 [Davies Liu] fix python 3
      f84d3bd [Davies Liu] capture SQL AnalysisException in Python API
      58ee2a2e
  8. Jun 29, 2015
    • [SPARK-8056][SQL] Design an easier way to construct schema for both Scala and Python · f6fc254e
      Ilya Ganelin authored
      I've added functionality to create new StructType similar to how we add parameters to a new SparkContext.
      
      I've also added tests for this type of creation.
      
      Author: Ilya Ganelin <ilya.ganelin@capitalone.com>
      
      Closes #6686 from ilganeli/SPARK-8056B and squashes the following commits:
      
      27c1de1 [Ilya Ganelin] Rename
      467d836 [Ilya Ganelin] Removed from_string in favor of _parse_Datatype_json_value
      5fef5a4 [Ilya Ganelin] Updates for type parsing
      4085489 [Ilya Ganelin] Style errors
      3670cf5 [Ilya Ganelin] added string to DataType conversion
      8109e00 [Ilya Ganelin] Fixed error in tests
      41ab686 [Ilya Ganelin] Fixed style errors
      e7ba7e0 [Ilya Ganelin] Moved some python tests to tests.py. Added cleaner handling of null data type and added test for correctness of input format
      15868fa [Ilya Ganelin] Fixed python errors
      b79b992 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-8056B
      a3369fc [Ilya Ganelin] Fixing space errors
      e240040 [Ilya Ganelin] Style
      bab7823 [Ilya Ganelin] Constructor error
      73d4677 [Ilya Ganelin] Style
      4ed00d9 [Ilya Ganelin] Fixed default arg
      67df57a [Ilya Ganelin] Removed Foo
      04cbf0c [Ilya Ganelin] Added comments for single object
      0484d7a [Ilya Ganelin] Restored second method
      6aeb740 [Ilya Ganelin] Style
      689e54d [Ilya Ganelin] Style
      f497e9e [Ilya Ganelin] Got rid of old code
      e3c7a88 [Ilya Ganelin] Fixed doctest failure
      a62ccde [Ilya Ganelin] Style
      966ac06 [Ilya Ganelin] style checks
      dabb7e6 [Ilya Ganelin] Added Python tests
      a3f4152 [Ilya Ganelin] added python bindings and better comments
      e6e536c [Ilya Ganelin] Added extra space
      7529a2e [Ilya Ganelin] Fixed formatting
      d388f86 [Ilya Ganelin] Fixed small bug
      c4e3bf5 [Ilya Ganelin] Reverted to using parse. Updated parse to support long
      d7634b6 [Ilya Ganelin] Reverted to fromString to properly support types
      22c39d5 [Ilya Ganelin] replaced FromString with DataTypeParser.parse. Replaced empty constructor initializing a null to have it instead create a new array to allow appends to it.
      faca398 [Ilya Ganelin] [SPARK-8056] Replaced default argument usage. Updated usage and code for DataType.fromString
      1acf76e [Ilya Ganelin] Scala style
      e31c674 [Ilya Ganelin] Fixed bug in test
      8dc0795 [Ilya Ganelin] Added tests for creation of StructType object with new methods
      fdf7e9f [Ilya Ganelin] [SPARK-8056] Created add methods to facilitate building new StructType objects.
      f6fc254e
    • [SPARK-8355] [SQL] Python DataFrameReader/Writer should mirror Scala · ac2e17b0
      Cheolsoo Park authored
      I compared the PySpark DataFrameReader/Writer against the Scala ones. The `option` function is missing in both the reader and the writer, but the rest seems to match.
      
      I added `option` to the reader and writer and updated the `pyspark-sql` test.
      
      Author: Cheolsoo Park <cheolsoop@netflix.com>
      
      Closes #7078 from piaozhexiu/SPARK-8355 and squashes the following commits:
      
      c63d419 [Cheolsoo Park] Fix version
      524e0aa [Cheolsoo Park] Add option function to df reader and writer
      ac2e17b0
  9. Jun 23, 2015
    • [SPARK-8573] [SPARK-8568] [SQL] [PYSPARK] raise Exception if column is used in boolean expression · 7fb5ae50
      Davies Liu authored
      A common mistake is to put a Column in a boolean expression (together with `and` or `or`), which does not work as expected. We should raise an exception in that case and suggest using `&` and `|` instead.
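
A toy sketch of the idea, assuming nothing about the actual PySpark code: `and`/`or` ask an object for its truth value, so a column-like class can refuse that conversion and point users at the overloadable `&` and `|` operators. The class `Col` and its error message are hypothetical.

```python
class Col(object):
    """Toy stand-in for a DataFrame column expression (hypothetical)."""

    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        return Col("(%s AND %s)" % (self.expr, other.expr))

    def __or__(self, other):
        return Col("(%s OR %s)" % (self.expr, other.expr))

    def __bool__(self):
        # `and`, `or`, `not` and `if` all call this; an unevaluated
        # expression has no single truth value, so refuse loudly.
        raise ValueError("Cannot convert column into bool: "
                         "use '&' for 'and', '|' for 'or'.")

    __nonzero__ = __bool__  # Python 2 spelling of the same hook


a, b = Col("age > 3"), Col("name = 'Bob'")
combined = a & b  # builds a new expression instead of evaluating truthiness
print(combined.expr)  # (age > 3 AND name = 'Bob')
```

Writing `a and b` with these objects raises `ValueError` instead of silently returning one of the operands.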
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6961 from davies/column_bool and squashes the following commits:
      
      9f19beb [Davies Liu] update message
      af74bd6 [Davies Liu] fix tests
      07dff84 [Davies Liu] address comments, fix tests
      f70c08e [Davies Liu] raise Exception if column is used in boolean expression
      7fb5ae50
  10. Jun 22, 2015
    • [SPARK-8532] [SQL] In Python's DataFrameWriter, save/saveAsTable/json/parquet/jdbc always override mode · 5ab9fcfb
      Yin Huai authored
      
      https://issues.apache.org/jira/browse/SPARK-8532
      
      This PR has two changes. First, it fixes the bug that save actions (i.e. `save/saveAsTable/json/parquet/jdbc`) always override mode. Second, it adds input argument `partitionBy` to `save/saveAsTable/parquet`.
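
The first change can be illustrated with a toy writer, loosely following the squashed commit notes below (use `None` as the default so that an explicitly set mode is not overridden). The `Writer` class and its method names are hypothetical, not the PySpark API.

```python
class Writer(object):
    """Toy writer (hypothetical) whose underlying default mode is 'error'."""

    def __init__(self):
        self._mode = "error"

    def mode(self, save_mode):
        self._mode = save_mode
        return self

    def save_buggy(self, path, mode="error"):
        # Always calls mode(), so an earlier .mode("append") is lost.
        self.mode(mode)
        return (path, self._mode)

    def save_fixed(self, path, mode=None):
        # Only override when the caller passed a mode explicitly.
        if mode is not None:
            self.mode(mode)
        return (path, self._mode)


buggy = Writer().mode("append").save_buggy("/tmp/out")
fixed = Writer().mode("append").save_fixed("/tmp/out")
print(buggy)  # ('/tmp/out', 'error')
print(fixed)  # ('/tmp/out', 'append')
```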
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #6937 from yhuai/SPARK-8532 and squashes the following commits:
      
      f972d5d [Yin Huai] davies's comment.
      d37abd2 [Yin Huai] style.
      d21290a [Yin Huai] Python doc.
      889eb25 [Yin Huai] Minor refactoring and add partitionBy to save, saveAsTable, and parquet.
      7fbc24b [Yin Huai] Use None instead of "error" as the default value of mode since JVM-side already uses "error" as the default value.
      d696dff [Yin Huai] Python style.
      88eb6c4 [Yin Huai] If mode is "error", do not call mode method.
      c40c461 [Yin Huai] Regression test.
      5ab9fcfb
  11. Jun 11, 2015
    • [SPARK-6411] [SQL] [PySpark] support date/datetime with timezone in Python · 424b0075
      Davies Liu authored
      Spark SQL does not support timezones, and Pyrolite does not support them well. This patch converts datetime objects into POSIX timestamps (which are unambiguous with respect to timezone), the representation used by SQL. A datetime object without a timezone is treated as local time.
      
      The timezone in an RDD is lost after one round trip: every datetime coming back from SQL is in local time.
      
      Because of Pyrolite, datetimes from SQL have only 1 millisecond precision.
      
      This PR also drops the timezone in dates, converting them to the number of days since the epoch (as used in SQL).
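
A standard-library sketch of the two conversions described above (Python 3). The helper names `to_posix_seconds` and `to_days_since_epoch` are made up for this example and are not taken from the patch.

```python
import calendar
import time
from datetime import date, datetime, timedelta, timezone


def to_posix_seconds(dt):
    """Aware datetimes go through UTC; naive ones are treated as local time."""
    if dt.tzinfo is not None:
        return calendar.timegm(dt.utctimetuple())
    return int(time.mktime(dt.timetuple()))


def to_days_since_epoch(d):
    """Dates carry no timezone: count days since 1970-01-01."""
    return d.toordinal() - date(1970, 1, 1).toordinal()


# 08:00 at UTC+8 on Jan 2, 1970 is midnight UTC on the same day.
aware = datetime(1970, 1, 2, 8, 0, 0, tzinfo=timezone(timedelta(hours=8)))
print(to_posix_seconds(aware))                  # 86400
print(to_days_since_epoch(date(1970, 1, 11)))   # 10
```

The result for a naive datetime depends on the machine's local timezone, which is why no expected value is shown for that branch.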
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6250 from davies/tzone and squashes the following commits:
      
      44d8497 [Davies Liu] add timezone support for DateType
      99d9d9c [Davies Liu] use int for timestamp
      10aa7ca [Davies Liu] Merge branch 'master' of github.com:apache/spark into tzone
      6a29aa4 [Davies Liu] support datetime with timezone
      424b0075
  12. Jun 03, 2015
    • [SPARK-7980] [SQL] Support SQLContext.range(end) · d053a31b
      animesh authored
      1. range() overloaded in SQLContext.scala
      2. range() modified in python sql context.py
      3. Tests added accordingly in DataFrameSuite.scala and python sql tests.py
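
One way to support a single-argument form in Python is an optional `end` parameter, sketched below. The helper `normalize_range` is hypothetical and only shows the argument handling, not how the range is materialized.

```python
def normalize_range(start, end=None, step=1):
    """Return (start, end, step); a single argument is interpreted as `end`."""
    if end is None:
        start, end = 0, start
    return start, end, step


print(normalize_range(5))          # (0, 5, 1)
print(normalize_range(2, 10, 2))   # (2, 10, 2)
```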
      
      Author: animesh <animesh@apache.spark>
      
      Closes #6609 from animeshbaranawal/SPARK-7980 and squashes the following commits:
      
      935899c [animesh] SPARK-7980:python+scala changes
      d053a31b
    • [SPARK-8060] Improve DataFrame Python test coverage and documentation. · ce320cb2
      Reynold Xin authored
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6601 from rxin/python-read-write-test-and-doc and squashes the following commits:
      
      baa8ad5 [Reynold Xin] Code review feedback.
      f081d47 [Reynold Xin] More documentation updates.
      c9902fa [Reynold Xin] [SPARK-8060] Improve DataFrame Python reader/writer interface doc and testing.
      ce320cb2
  13. May 31, 2015
  14. May 23, 2015
    • [SPARK-7322, SPARK-7836, SPARK-7822][SQL] DataFrame window function related updates · efe3bfdf
      Davies Liu authored
      1. ntile should take an integer as parameter.
      2. Added Python API (based on #6364)
      3. Update documentation of various DataFrame Python functions.
      
      Author: Davies Liu <davies@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6374 from rxin/window-final and squashes the following commits:
      
      69004c7 [Reynold Xin] Style fix.
      288cea9 [Reynold Xin] Update documentation.
      7cb8985 [Reynold Xin] Merge pull request #6364 from davies/window
      66092b4 [Davies Liu] update docs
      ed73cb4 [Reynold Xin] [SPARK-7322][SQL] Improve DataFrame window function documentation.
      ef55132 [Davies Liu] Merge branch 'master' of github.com:apache/spark into window4
      8936ade [Davies Liu] fix maxint in python 3
      2649358 [Davies Liu] update docs
      778e2c0 [Davies Liu] SPARK-7836 and SPARK-7822: Python API of window functions
      efe3bfdf
  15. May 19, 2015
    • [SPARK-7738] [SQL] [PySpark] add reader and writer API in Python · 4de74d26
      Davies Liu authored
      cc rxin, please take a quick look, I'm working on tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6238 from davies/readwrite and squashes the following commits:
      
      c7200eb [Davies Liu] update tests
      9cbf01b [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
      f0c5a04 [Davies Liu] use sqlContext.read.load
      5f68bc8 [Davies Liu] update tests
      6437e9a [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
      bcc6668 [Davies Liu] add reader and writer API in Python
      4de74d26
  16. May 18, 2015
    • [SPARK-7150] SparkContext.range() and SQLContext.range() · c2437de1
      Daoyuan Wang authored
      This PR is based on #6081, thanks adrian-wang.
      
      Closes #6081
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6230 from davies/range and squashes the following commits:
      
      d3ce5fe [Davies Liu] add tests
      789eda5 [Davies Liu] add range() in Python
      4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
      cbf5200 [Daoyuan Wang] let's add python support in a separate PR
      f45e3b2 [Daoyuan Wang] remove redundant toLong
      617da76 [Daoyuan Wang] fix safe marge for corner cases
      867c417 [Daoyuan Wang] fix
      13dbe84 [Daoyuan Wang] update
      bd998ba [Daoyuan Wang] update comments
      d3a0c1b [Daoyuan Wang] add range api()
      c2437de1
  17. May 14, 2015
    • [SPARK-7548] [SQL] Add explode function for DataFrames · 6d0633e3
      Michael Armbrust authored
      Add an `explode` function for DataFrames and modify the analyzer so that single table generating functions (TGFs) can be present in a select clause along with other expressions. There are currently the following restrictions:
       - only top level TGFs are allowed (i.e. no `select(explode('list) + 1)`)
       - only one may be present in a single select to avoid potentially confusing implicit Cartesian products.
      
      TODO:
       - [ ] Python
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #6107 from marmbrus/explodeFunction and squashes the following commits:
      
      7ee2c87 [Michael Armbrust] whitespace
      6f80ba3 [Michael Armbrust] Update dataframe.py
      c176c89 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
      81b5da3 [Michael Armbrust] style
      d3faa05 [Michael Armbrust] fix self join case
      f9e1e3e [Michael Armbrust] fix python, add since
      4f0d0a9 [Michael Armbrust] Merge remote-tracking branch 'origin/master' into explodeFunction
      e710fe4 [Michael Armbrust] add java and python
      52ca0dc [Michael Armbrust] [SPARK-7548][SQL] Add explode function for dataframes.
      6d0633e3
  18. May 12, 2015
    • [SPARK-6876] [PySpark] [SQL] add DataFrame na.replace in pyspark · d86ce845
      Daoyuan Wang authored
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #6003 from adrian-wang/pynareplace and squashes the following commits:
      
      672efba [Daoyuan Wang] remove py2.7 feature
      4a148f7 [Daoyuan Wang] to_replace support dict, value support single value, and add full tests
      9e232e7 [Daoyuan Wang] rename scala map
      af0268a [Daoyuan Wang] remove na
      63ac579 [Daoyuan Wang] add na.replace in pyspark
      d86ce845
  19. May 08, 2015
    • [SPARK-7133] [SQL] Implement struct, array, and map field accessor · 2d05f325
      Wenchen Fan authored
      This is the first step: generalize UnresolvedGetField to support all of map, struct, and array.
      TODO: add `apply` in Scala and `__getitem__` in Python, and unify the `getItem` and `getField` methods into one single API (or should we keep them for compatibility?).
      
      Author: Wenchen Fan <cloud0fan@outlook.com>
      
      Closes #5744 from cloud-fan/generalize and squashes the following commits:
      
      715c589 [Wenchen Fan] address comments
      7ea5b31 [Wenchen Fan] fix python test
      4f0833a [Wenchen Fan] add python test
      f515d69 [Wenchen Fan] add apply method and test cases
      8df6199 [Wenchen Fan] fix python test
      239730c [Wenchen Fan] fix test compile
      2a70526 [Wenchen Fan] use _bin_op in dataframe.py
      6bf72bc [Wenchen Fan] address comments
      3f880c3 [Wenchen Fan] add java doc
      ab35ab5 [Wenchen Fan] fix python test
      b5961a9 [Wenchen Fan] fix style
      c9d85f5 [Wenchen Fan] generalize UnresolvedGetField to support all map, struct, and array
      2d05f325
  20. May 07, 2015
    • [SPARK-7295][SQL] bitwise operations for DataFrame DSL · fa8fddff
      Shiti authored
      Author: Shiti <ssaxena.ece@gmail.com>
      
      Closes #5867 from Shiti/spark-7295 and squashes the following commits:
      
      71a9913 [Shiti] implementation for bitwise and,or, not and xor on Column with tests and docs
      fa8fddff
  21. May 06, 2015
    • [SPARK-7358][SQL] Move DataFrame mathfunctions into functions · ba2b5661
      Burak Yavuz authored
      After a discussion on the user mailing list, it was decided to put all UDFs under `o.a.s.sql.functions`.
      
      cc rxin
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5923 from brkyvz/move-math-funcs and squashes the following commits:
      
      a8dc3f7 [Burak Yavuz] address comments
      cf7a7bb [Burak Yavuz] [SPARK-7358] Move DataFrame mathfunctions into functions
      ba2b5661
  22. May 05, 2015
    • [SPARK-7294][SQL] ADD BETWEEN · 735bc3d0
      云峤 authored
      Author: 云峤 <chensong.cs@alibaba-inc.com>
      Author: kaka1992 <kaka_1992@163.com>
      
      Closes #5839 from kaka1992/master and squashes the following commits:
      
      b15360d [kaka1992] Fix python unit test in sql/test. =_= I forget to commit this file last time.
      f928816 [kaka1992] Fix python style in sql/test.
      d2e7f72 [kaka1992] Fix python style in sql/test.
      c54d904 [kaka1992] Fix empty map bug.
      7e64d1e [云峤] Update
      7b9b858 [云峤] undo
      f080f8d [云峤] update pep8
      76f0c51 [云峤] Merge remote-tracking branch 'remotes/upstream/master'
      7d62368 [云峤] [SPARK-7294] ADD BETWEEN
      baf839b [云峤] [SPARK-7294] ADD BETWEEN
      d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
      735bc3d0
  23. May 04, 2015
    • [SPARK-7243][SQL] Contingency Tables for DataFrames · 80554111
      Burak Yavuz authored
      Computes a pair-wise frequency table of the given columns. Also known as cross-tabulation.
      cc mengxr rxin
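
To make "pair-wise frequency table" concrete, here is a small pure-Python sketch over a list of dicts. It only illustrates what a cross-tabulation computes; the function `crosstab` and the sample rows are made up and say nothing about how Spark implements it.

```python
from collections import Counter


def crosstab(rows, col1, col2):
    """Count co-occurrences of (row[col1], row[col2]) as a nested dict."""
    counts = Counter((r[col1], r[col2]) for r in rows)
    row_keys = sorted({k[0] for k in counts})
    col_keys = sorted({k[1] for k in counts})
    # Fill in 0 for combinations that never occur so every row has every column.
    return {rk: {ck: counts.get((rk, ck), 0) for ck in col_keys}
            for rk in row_keys}


rows = [
    {"dept": "a", "level": 1},
    {"dept": "a", "level": 2},
    {"dept": "b", "level": 1},
    {"dept": "a", "level": 1},
]
table = crosstab(rows, "dept", "level")
print(table)  # {'a': {1: 2, 2: 1}, 'b': {1: 1, 2: 0}}
```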
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5842 from brkyvz/df-cont and squashes the following commits:
      
      a07c01e [Burak Yavuz] addressed comments v4.1
      ae9e01d [Burak Yavuz] fix test
      9106585 [Burak Yavuz] addressed comments v4.0
      bced829 [Burak Yavuz] fix merge conflicts
      a63ad00 [Burak Yavuz] addressed comments v3.0
      a0cad97 [Burak Yavuz] addressed comments v3.0
      6805df8 [Burak Yavuz] addressed comments and fixed test
      939b7c4 [Burak Yavuz] lint python
      7f098bc [Burak Yavuz] add crosstab pyTest
      fd53b00 [Burak Yavuz] added python support for crosstab
      27a5a81 [Burak Yavuz] implemented crosstab
      80554111
  24. May 03, 2015
    • [SPARK-7241] Pearson correlation for DataFrames · 9646018b
      Burak Yavuz authored
      Submitting this PR from a phone, excuse the brevity.
      Adds Pearson correlation to DataFrames, reusing the covariance calculation code.
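
The reuse works because Pearson correlation is the covariance of the two columns divided by the square root of the product of each column's variance, and a variance is just a column's covariance with itself. A plain-Python sketch of that relationship (the two-pass `sample_cov` here is for illustration only and is not the Spark code):

```python
import math


def sample_cov(xs, ys):
    """Two-pass sample covariance, kept simple for illustration."""
    n = float(len(xs))
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    return sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)


def pearson(xs, ys):
    """corr(x, y) = cov(x, y) / sqrt(cov(x, x) * cov(y, y))."""
    return sample_cov(xs, ys) / math.sqrt(sample_cov(xs, xs) * sample_cov(ys, ys))


r_up = pearson([1, 2, 3, 4], [2, 4, 6, 8])   # perfectly increasing relation
r_down = pearson([1, 2, 3], [3, 2, 1])       # perfectly decreasing relation
```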
      
      cc mengxr rxin
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5858 from brkyvz/df-corr and squashes the following commits:
      
      285b838 [Burak Yavuz] addressed comments v2.0
      d10babb [Burak Yavuz] addressed comments v0.2
      4b74b24 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-corr
      4fe693b [Burak Yavuz] addressed comments v0.1
      a682d06 [Burak Yavuz] ready for PR
      9646018b
  25. May 02, 2015
    • [SPARK-7242] added python api for freqItems in DataFrames · 2e0f3579
      Burak Yavuz authored
      The Python API for DataFrame's freqItems, plus changes addressing your comments from the previous PR.
      rxin
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5859 from brkyvz/df-freq-py2 and squashes the following commits:
      
      f9aa9ce [Burak Yavuz] addressed comments v0.1
      4b25056 [Burak Yavuz] added python api for freqItems
      2e0f3579
  26. May 01, 2015
    • [SPARK-7240][SQL] Single pass covariance calculation for dataframes · 4dc8d744
      Burak Yavuz authored
      Added the calculation of covariance between two columns to DataFrames.
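
For readers unfamiliar with single-pass covariance, the sketch below uses a standard online co-moment update and divides by n - 1 for the sample covariance. It is an assumption that this resembles the PR; the actual Spark implementation may use a different update rule.

```python
def single_pass_sample_cov(pairs):
    """Sample covariance of (x, y) pairs, reading the data exactly once."""
    n = 0
    mean_x = mean_y = 0.0
    co_moment = 0.0
    for x, y in pairs:
        n += 1
        dx = x - mean_x                  # deviation from the OLD mean of x
        mean_x += dx / n
        mean_y += (y - mean_y) / n
        co_moment += dx * (y - mean_y)   # uses the UPDATED mean of y
    if n < 2:
        raise ValueError("need at least two pairs")
    return co_moment / (n - 1)


cov = single_pass_sample_cov([(1, 2), (2, 4), (3, 6), (4, 8)])
```

For this input the accumulated co-moment is 10, so `cov` is 10 / 3.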
      
      cc mengxr rxin
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5825 from brkyvz/df-cov and squashes the following commits:
      
      cb18046 [Burak Yavuz] changed to sample covariance
      f2e862b [Burak Yavuz] fixed failed test
      51e39b8 [Burak Yavuz] moved implementation
      0c6a759 [Burak Yavuz] addressed math comments
      8456eca [Burak Yavuz] fix pyStyle3
      aa2ad29 [Burak Yavuz] fix pyStyle2
      4e97a50 [Burak Yavuz] Merge branch 'master' of github.com:apache/spark into df-cov
      e3b0b85 [Burak Yavuz] addressed comments v0.1
      a7115f1 [Burak Yavuz] fix python style
      7dc6dbc [Burak Yavuz] reorder imports
      408cb77 [Burak Yavuz] initial commit
      4dc8d744
  27. Apr 30, 2015
    • [SPARK-7248] implemented random number generators for DataFrames · b5347a46
      Burak Yavuz authored
      Adds the functions `rand` (Uniform Dist) and `randn` (Normal Dist.) as expressions to DataFrames.
      
      cc mengxr rxin
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5819 from brkyvz/df-rng and squashes the following commits:
      
      50d69d4 [Burak Yavuz] add seed for test that failed
      4234c3a [Burak Yavuz] fix Rand expression
      13cad5c [Burak Yavuz] couple fixes
      7d53953 [Burak Yavuz] waiting for hive tests
      b453716 [Burak Yavuz] move radn with seed down
      03637f0 [Burak Yavuz] fix broken hive func
      c5909eb [Burak Yavuz] deleted old implementation of Rand
      6d43895 [Burak Yavuz] implemented random generators
      b5347a46
  28. Apr 29, 2015
    • [SPARK-7188] added python support for math DataFrame functions · fe917f5e
      Burak Yavuz authored
      Adds support for the math functions for DataFrames in PySpark.
      
      rxin I love Davies.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #5750 from brkyvz/python-math-udfs and squashes the following commits:
      
      7c4f563 [Burak Yavuz] removed is_math
      3c4adde [Burak Yavuz] cleanup imports
      d5dca3f [Burak Yavuz] moved math functions to mathfunctions
      25e6534 [Burak Yavuz] addressed comments v2.0
      d3f7e0f [Burak Yavuz] addressed comments and added tests
      7b7d7c4 [Burak Yavuz] remove tests for removed methods
      33c2c15 [Burak Yavuz] fixed python style
      3ee0c05 [Burak Yavuz] added python functions
      fe917f5e
  29. Apr 21, 2015
    • [SPARK-6953] [PySpark] speed up python tests · 3134c3fe
      Reynold Xin authored
      This PR tries to speed up some Python tests:
      
      ```
      tests.py                       144s -> 103s      -41s
      mllib/classification.py         24s -> 17s        -7s
      mllib/regression.py             27s -> 15s       -12s
      mllib/tree.py                   27s -> 13s       -14s
      mllib/tests.py                  64s -> 31s       -33s
      streaming/tests.py             185s -> 84s      -101s
      ```
      Counting Python 3, the total saving will be 558s (almost 10 minutes), since core and streaming run three times and mllib runs twice.
      
      During testing, the time used by each test file is shown:
      ```
      Run core tests ...
      Running test: pyspark/rdd.py ... ok (22s)
      Running test: pyspark/context.py ... ok (16s)
      Running test: pyspark/conf.py ... ok (4s)
      Running test: pyspark/broadcast.py ... ok (4s)
      Running test: pyspark/accumulators.py ... ok (4s)
      Running test: pyspark/serializers.py ... ok (6s)
      Running test: pyspark/profiler.py ... ok (5s)
      Running test: pyspark/shuffle.py ... ok (1s)
      Running test: pyspark/tests.py ... ok (103s)   144s
      ```
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #5605 from rxin/python-tests-speed and squashes the following commits:
      
      d08542d [Reynold Xin] Merge pull request #14 from mengxr/SPARK-6953
      89321ee [Xiangrui Meng] fix seed in tests
      3ad2387 [Reynold Xin] Merge pull request #5427 from davies/python_tests
      3134c3fe
    • [SPARK-6949] [SQL] [PySpark] Support Date/Timestamp in Column expression · ab9128fb
      Davies Liu authored
      This PR enables auto_convert in JavaGateway, so that we can register a converter for a given type, for example date and datetime.
      
      There are two bugs related to auto_convert, see [1] and [2]; this PR works around them.
      
      [1]  https://github.com/bartdag/py4j/issues/160
      [2] https://github.com/bartdag/py4j/issues/161
      
      cc rxin JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5570 from davies/py4j_date and squashes the following commits:
      
      eb4fa53 [Davies Liu] fix tests in python 3
      d17d634 [Davies Liu] rollback changes in mllib
      2e7566d [Davies Liu] convert tuple into ArrayList
      ceb3779 [Davies Liu] Update rdd.py
      3c373f3 [Davies Liu] support date and datetime by auto_convert
      cb094ff [Davies Liu] enable auto convert
      ab9128fb
  30. Apr 17, 2015
    • [SPARK-6957] [SPARK-6958] [SQL] improve API compatibility to pandas · c84d9169
      Davies Liu authored
      ```
      select(['cola', 'colb'])
      
      groupby(['colA', 'colB'])
      groupby([df.colA, df.colB])
      
      df.sort('A', ascending=True)
      df.sort(['A', 'B'], ascending=True)
      df.sort(['A', 'B'], ascending=[1, 0])
      ```
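
The `ascending` list form pairs each sort column with its own direction. A plain-Python sketch of those semantics on a list of dicts (the helper `sort_rows` is hypothetical and unrelated to how Spark sorts):

```python
from operator import itemgetter


def sort_rows(rows, cols, ascending=True):
    """Sort dict rows by several columns, each with its own direction."""
    if isinstance(cols, str):
        cols = [cols]
    if isinstance(ascending, (bool, int)):
        ascending = [ascending] * len(cols)
    result = list(rows)
    # Python's sort is stable, so sort by the least significant key first
    # and by the most significant key last.
    for col, asc in reversed(list(zip(cols, ascending))):
        result.sort(key=itemgetter(col), reverse=not asc)
    return result


rows = [{"A": 1, "B": 1}, {"A": 2, "B": 5}, {"A": 1, "B": 3}]
ordered = sort_rows(rows, ["A", "B"], ascending=[1, 0])
print(ordered)  # [{'A': 1, 'B': 3}, {'A': 1, 'B': 1}, {'A': 2, 'B': 5}]
```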
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5544 from davies/compatibility and squashes the following commits:
      
      4944058 [Davies Liu] add docstrings
      adb2816 [Davies Liu] Merge branch 'master' of github.com:apache/spark into compatibility
      bcbbcab [Davies Liu] support ascending as list
      8dabdf0 [Davies Liu] improve API compatibility to pandas
      c84d9169
  31. Apr 16, 2015
    • [SPARK-6911] [SQL] improve accessor for nested types · 6183b5e2
      Davies Liu authored
      Support accessing columns by index in Python:
      ```
      >>> df[df[0] > 3].collect()
      [Row(age=5, name=u'Bob')]
      ```
      
      Access items in ArrayType or MapType
      ```
      >>> df.select(df.l.getItem(0), df.d.getItem("key")).show()
      >>> df.select(df.l[0], df.d["key"]).show()
      ```
      
      Access field in StructType
      ```
      >>> df.select(df.r.getField("b")).show()
      >>> df.select(df.r.a).show()
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #5513 from davies/access and squashes the following commits:
      
      e04d5a0 [Davies Liu] Update run-tests-jenkins
      7ada9eb [Davies Liu] update timeout
      d125ac4 [Davies Liu] check column name, improve scala tests
      6b62540 [Davies Liu] fix test
      db15b42 [Davies Liu] Merge branch 'master' of github.com:apache/spark into access
      6c32e79 [Davies Liu] add scala tests
      11f1df3 [Davies Liu] improve accessor for nested types
      6183b5e2
    • [SPARK-4897] [PySpark] Python 3 support · 04e44b37
      Davies Liu authored
      This PR updates PySpark to support Python 3 (tested with 3.4).
      
      Known issue: unpickling arrays from Pyrolite is broken in Python 3, so those tests are skipped.
      
      TODO: ec2/spark-ec2.py is not fully tested with Python 3.
      
      Author: Davies Liu <davies@databricks.com>
      Author: twneale <twneale@gmail.com>
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #5173 from davies/python3 and squashes the following commits:
      
      d7d6323 [Davies Liu] fix tests
      6c52a98 [Davies Liu] fix mllib test
      99e334f [Davies Liu] update timeout
      b716610 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      cafd5ec [Davies Liu] address comments from @mengxr
      bf225d7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      179fc8d [Davies Liu] tuning flaky tests
      8c8b957 [Davies Liu] fix ResourceWarning in Python 3
      5c57c95 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      4006829 [Davies Liu] fix test
      2fc0066 [Davies Liu] add python3 path
      71535e9 [Davies Liu] fix xrange and divide
      5a55ab4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      125f12c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ed498c8 [Davies Liu] fix compatibility with python 3
      820e649 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      e8ce8c9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ad7c374 [Davies Liu] fix mllib test and warning
      ef1fc2f [Davies Liu] fix tests
      4eee14a [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      20112ff [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      59bb492 [Davies Liu] fix tests
      1da268c [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      ca0fdd3 [Davies Liu] fix code style
      9563a15 [Davies Liu] add imap back for python 2
      0b1ec04 [Davies Liu] make python examples work with Python 3
      d2fd566 [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      a716d34 [Davies Liu] test with python 3.4
      f1700e8 [Davies Liu] fix test in python3
      671b1db [Davies Liu] fix test in python3
      692ff47 [Davies Liu] fix flaky test
      7b9699f [Davies Liu] invalidate import cache for Python 3.3+
      9c58497 [Davies Liu] fix kill worker
      309bfbf [Davies Liu] keep compatibility
      5707476 [Davies Liu] cleanup, fix hash of string in 3.3+
      8662d5b [Davies Liu] Merge branch 'master' of github.com:apache/spark into python3
      f53e1f0 [Davies Liu] fix tests
      70b6b73 [Davies Liu] compile ec2/spark_ec2.py in python 3
      a39167e [Davies Liu] support customize class in __main__
      814c77b [Davies Liu] run unittests with python 3
      7f4476e [Davies Liu] mllib tests passed
      d737924 [Davies Liu] pass ml tests
      375ea17 [Davies Liu] SQL tests pass
      6cc42a9 [Davies Liu] rename
      431a8de [Davies Liu] streaming tests pass
      78901a7 [Davies Liu] fix hash of serializer in Python 3
      24b2f2e [Davies Liu] pass all RDD tests
      35f48fe [Davies Liu] run future again
      1eebac2 [Davies Liu] fix conflict in ec2/spark_ec2.py
      6e3c21d [Davies Liu] make cloudpickle work with Python3
      2fb2db3 [Josh Rosen] Guard more changes behind sys.version; still doesn't run
      1aa5e8f [twneale] Turned out `pickle.DictionaryType is dict` == True, so swapped it out
      7354371 [twneale] buffer --> memoryview. I'm not super sure if this is a valid change, but the 2.7 docs recommend using memoryview over buffer where possible, so hoping it'll work.
      b69ccdf [twneale] Uses the pure python pickle._Pickler instead of c-extension _pickle.Pickler. It appears pyspark 2.7 uses the pure python pickler as well, so this shouldn't degrade pickling performance (?).
      f40d925 [twneale] xrange --> range
      e104215 [twneale] Replaces 2.7 types.InstanceType with 3.4 `object`... could be horribly wrong depending on how types.InstanceType is used elsewhere in the package; see http://bugs.python.org/issue8206
      79de9d0 [twneale] Replaces python2.7 `file` with 3.4 _io.TextIOWrapper
      2adb42d [Josh Rosen] Fix up some import differences between Python 2 and 3
      854be27 [Josh Rosen] Run `futurize` on Python code:
      7c5b4ce [Josh Rosen] Remove Python 3 check in shell.py.
      04e44b37
  32. Apr 01, 2015
    • [SPARK-6553] [pyspark] Support functools.partial as UDF · 757b2e91
      ksonj authored
      
      Use `f.__repr__()` instead of `f.__name__` when instantiating `UserDefinedFunction`s, so `functools.partial`s may be used.
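A minimal sketch of why `f.__name__` fails for some callables, in plain Python without Spark. The helper `udf_name` is hypothetical and only mirrors the idea described above; it is not the actual `UserDefinedFunction` code.

```python
import functools


def add(x, y):
    return x + y


class Adder:
    """A callable object: its instances have no __name__ attribute."""

    def __call__(self, x):
        return x + 1


add_one = functools.partial(add, 1)

# Plain functions carry __name__; partials and callable instances do not.
assert hasattr(add, "__name__")
assert not hasattr(add_one, "__name__")
assert not hasattr(Adder(), "__name__")


def udf_name(f):
    """Hypothetical helper: derive a display name that works for any callable."""
    return f.__repr__()


for f in (add, add_one, Adder()):
    print(udf_name(f))
```

Every Python object has a `__repr__`, so this lookup cannot raise `AttributeError`, at the cost of a less tidy name.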
      
      Author: ksonj <kson@siberie.de>
      
      Closes #5206 from ksonj/partials and squashes the following commits:
      
      ea66f3d [ksonj] Inserted blank lines for PEP8 compliance
      d81b02b [ksonj] added tests for udf with partial function and callable object
      2c76100 [ksonj] Makes UDFs work with all types of callables
      b814a12 [ksonj] support functools.partial as udf
      
      (cherry picked from commit 98f72dfc)
      Signed-off-by: Josh Rosen <joshrosen@databricks.com>
      757b2e91
  33. Mar 30, 2015
    • [SPARK-6119][SQL] DataFrame support for missing data handling · b8ff2bc6
      Reynold Xin authored
      This pull request adds variants of DataFrame.na.drop and DataFrame.na.fill to the Scala/Java API, and DataFrame.fillna and DataFrame.dropna to the Python API.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #5274 from rxin/df-missing-value and squashes the following commits:
      
      4ee1b98 [Reynold Xin] Improve error reporting in Python.
      33a330c [Reynold Xin] Remove replace for now.
      bc4fdbb [Reynold Xin] Added documentation for replace.
      d56f5a5 [Reynold Xin] Added replace for Scala/Java.
      2385d00 [Reynold Xin] Feedback from Xiangrui on "how".
      914a374 [Reynold Xin] fill with map.
      185c67e [Reynold Xin] Allow specifying column subsets in fill.
      749eb47 [Reynold Xin] fillna
      249b94e [Reynold Xin] Removing undefined functions.
      6a73c68 [Reynold Xin] Missing file.
      67d7003 [Reynold Xin] [SPARK-6119][SQL] DataFrame.na.drop (Scala/Java) and DataFrame.dropna (Python)
      b8ff2bc6
  34. Feb 27, 2015
    • [SPARK-6055] [PySpark] fix incorrect __eq__ of DataType · e0e64ba4
      Davies Liu authored
      The `__eq__` of DataType is incorrect, so the class cache is not used correctly (a created class cannot be found by its dataType). As a result, many classes are created (saved in _cached_cls) and never released.
      
      Also, all instances of the same DataType have the same hash code, so a dict can end up holding many objects with the same hash code. This behaves like a hash attack and makes access to the dict very slow (depending on the implementation of CPython).
      
      This PR also improves the performance of inferSchema (by avoiding unnecessary conversion of objects).
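The cache problem can be sketched in plain Python. The classes `BrokenType` and `FixedType` and the function `build_cache` are hypothetical stand-ins, not Spark's actual DataType code; they only show how a value-based `__eq__` with a matching `__hash__` lets a dict-based cache find existing entries.

```python
class BrokenType:
    """Uses the default identity-based __eq__, so equal-looking keys never match."""

    def __init__(self, name):
        self.name = name


class FixedType:
    """Compares by value and hashes consistently with __eq__."""

    def __init__(self, name):
        self.name = name

    def __eq__(self, other):
        return isinstance(other, self.__class__) and self.__dict__ == other.__dict__

    def __ne__(self, other):
        return not self.__eq__(other)

    def __hash__(self):
        return hash((self.__class__.__name__, self.name))


def build_cache(type_cls, n):
    """Look up n equivalent keys, creating a cache entry on every miss."""
    cache = {}
    for _ in range(n):
        key = type_cls("int")
        if key not in cache:
            cache[key] = object()  # stand-in for a generated class
    return cache


broken_cache = build_cache(BrokenType, 100)
fixed_cache = build_cache(FixedType, 100)
print(len(broken_cache), len(fixed_cache))
```

With the identity-based comparison every lookup misses and the cache grows with each call; with value-based equality the second and later lookups hit the single existing entry.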
      
      cc pwendell  JoshRosen
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4808 from davies/leak and squashes the following commits:
      
      6a322a4 [Davies Liu] tests refactor
      3da44fc [Davies Liu] fix __eq__ of Singleton
      534ac90 [Davies Liu] add more checks
      46999dc [Davies Liu] fix tests
      d9ae973 [Davies Liu] fix memory leak in sql
      e0e64ba4
  35. Feb 24, 2015
    • [SPARK-5994] [SQL] Python DataFrame documentation fixes · d641fbb3
      Davies Liu authored
      select empty should NOT be the same as select. make sure selectExpr is behaving the same.
      join param documentation
      link to source doesn't work in jekyll generated file
      cross reference of columns (i.e. enabling linking)
      show(): move df example before df.show()
      move tests in SQLContext out of docstring otherwise doc is too long
      Column.desc and .asc don't have any documentation
      in documentation, sort functions.*
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4756 from davies/df_docs and squashes the following commits:
      
      f30502c [Davies Liu] fix doc
      32f0d46 [Davies Liu] fix DataFrame docs
      d641fbb3