  1. Apr 29, 2017
    • [SPARK-20442][PYTHON][DOCS] Fill up documentations for functions in Column API in PySpark · d228cd0b
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to fill up the documentation with examples for `bitwiseOR`, `bitwiseAND`, `bitwiseXOR`, `contains`, `asc` and `desc` in the `Column` API.
      
      Also, this PR fixes minor typos in the documentation and matches some of the contents between Scala doc and Python doc.
      
      Lastly, this PR suggests using `spark` rather than `sc` in the doctests in `Column` for the Python documentation.
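
      For illustration, the documented operations behave roughly like this (a sketch, not the exact doctests added; it assumes a running `SparkSession` named `spark`):

      ```python
      from pyspark.sql import Row, SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([Row(a=170, b=75, name="Alice"), Row(a=1, b=2, name="Bob")])

      # bitwise operations between two integer columns
      df.select(df.a.bitwiseOR(df.b), df.a.bitwiseAND(df.b), df.a.bitwiseXOR(df.b)).show()

      # substring containment and sort-order expressions
      df.filter(df.name.contains("Ali")).show()
      df.orderBy(df.name.desc(), df.a.asc()).show()
      ```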
      
      ## How was this patch tested?
      
      Doc tests were added and manually tested with the commands below:
      
      `./python/run-tests.py --module pyspark-sql`
      `./python/run-tests.py --module pyspark-sql --python-executable python3`
      `./dev/lint-python`
      
      Output was checked via `make html` under `./python/docs`. Screenshots are left as review comments on the relevant code.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17737 from HyukjinKwon/SPARK-20442.
  2. Apr 22, 2017
    • [SPARK-20132][DOCS] Add documentation for column string functions · 8765bc17
      Michael Patterson authored
      ## What changes were proposed in this pull request?
      Add docstrings to column.py for the Column functions `rlike`, `like`, `startswith`, and `endswith`. Pass these docstrings through `_bin_op`.
      
      There may be a better place to put the docstrings. I put them immediately above the Column class.
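
      For illustration, the documented functions behave roughly as follows (a sketch, not the exact doctests):

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([("Alice",), ("Bob",)], ["name"])

      df.filter(df.name.like("Al%")).show()       # SQL LIKE pattern
      df.filter(df.name.rlike("^A")).show()       # regular expression match
      df.filter(df.name.startswith("Al")).show()  # string prefix test
      df.filter(df.name.endswith("ce")).show()    # string suffix test
      ```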
      
      ## How was this patch tested?
      
      I ran `make html` on my local computer to rebuild the documentation and verified that the HTML pages displayed the docstrings correctly. I tried running `dev-tests`, and the formatting tests passed. However, my mvn build didn't work, I think due to issues on my machine.
      
      These docstrings are my original work and are freely licensed.
      
      davies has done the most recent work reorganizing `_bin_op`.
      
      Author: Michael Patterson <map222@gmail.com>
      
      Closes #17469 from map222/patterson-documentation.
  3. Mar 05, 2017
    • [SPARK-19701][SQL][PYTHON] Throws a correct exception for 'in' operator against column · 224e0e78
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to remove an incorrect implementation of the `in` operator that has never actually been executed (at least since Spark 1.5.2), and to throw a correct exception rather than the misleading complaint that the column cannot be converted into a bool. I tested the code below in 1.5.2, 1.6.3, 2.1.0 and the master branch:
      
      **1.5.2**
      
      ```python
      >>> df = sqlContext.createDataFrame([[1]])
      >>> 1 in df._1
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-1.5.2-bin-hadoop2.6/python/pyspark/sql/column.py", line 418, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **1.6.3**
      
      ```python
      >>> 1 in sqlContext.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-1.6.3-bin-hadoop2.6/python/pyspark/sql/column.py", line 447, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **2.1.0**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/column.py", line 426, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **Current Master**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 452, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      **After**
      
      ```python
      >>> 1 in spark.range(1).id
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 184, in __contains__
          raise ValueError("Cannot apply 'in' operator against a column: please use 'contains' "
      ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.
      ```
      
      In more detail:
      
      It seems the implementation intended to support this
      
      ```python
      1 in df.column
      ```
      
      However, currently, it throws an exception as below:
      
      ```python
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/column.py", line 426, in __nonzero__
          raise ValueError("Cannot convert column into bool: please use '&' for 'and', '|' for 'or', "
      ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.
      ```
      
      What happens here is as below:
      
      ```python
      class Column(object):
          def __contains__(self, item):
              print "I am contains"
              return Column()
          def __nonzero__(self):
              raise Exception("I am nonzero.")
      
      >>> 1 in Column()
      I am contains
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File "<stdin>", line 6, in __nonzero__
      Exception: I am nonzero.
      ```
      
      It seems `__contains__` is called first, and then `__nonzero__` or `__bool__` is called on the returned `Column()` to coerce the result into a bool (or an int, to be specific).
      
      It seems `__nonzero__` (for Python 2) and `__bool__` (for Python 3) force the return value of `__contains__` into a bool, unlike other operators. There are a few references about this:
      
      https://bugs.python.org/issue16011
      http://stackoverflow.com/questions/12244074/python-source-code-for-built-in-in-operator/12244378#12244378
      http://stackoverflow.com/questions/38542543/functionality-of-python-in-vs-contains/38542777
      
      It seems we can't override `__nonzero__` or `__bool__` as a workaround to make this work, because they force the return type to be a bool:
      
      ```python
      class Column(object):
          def __contains__(self, item):
              print "I am contains"
              return Column()
          def __nonzero__(self):
              return "a"
      
      >>> 1 in Column()
      I am contains
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
      TypeError: __nonzero__ should return bool or int, returned str
      ```
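
      For reference, the alternatives suggested by the new error message look roughly like this (a sketch assuming a `SparkSession` named `spark`):

      ```python
      from pyspark.sql import SparkSession
      from pyspark.sql.functions import array_contains

      spark = SparkSession.builder.getOrCreate()

      # for a string column, use Column.contains
      df = spark.createDataFrame([("Alice",)], ["name"])
      df.filter(df.name.contains("li")).show()

      # for an array column, use the array_contains function
      df2 = spark.createDataFrame([([1, 2, 3],)], ["values"])
      df2.filter(array_contains(df2["values"], 1)).show()
      ```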
      
      ## How was this patch tested?
      
      Added unit tests in `tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17160 from HyukjinKwon/SPARK-19701.
  4. Feb 23, 2017
    • [SPARK-19706][PYSPARK] add Column.contains in pyspark · 4fa4cf1d
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      To be consistent with the Scala API, we should also add `contains` to `Column` in PySpark.
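
      A minimal usage sketch:

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([("Tom",), ("Alice",)], ["name"])
      df.filter(df.name.contains("o")).show()  # keeps rows whose name contains 'o'
      ```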
      
      ## How was this patch tested?
      
      updated unit test
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #17036 from cloud-fan/pyspark.
  5. Feb 14, 2017
    • [SPARK-18541][PYTHON] Add metadata parameter to pyspark.sql.Column.alias() · 7b64f7aa
      Sheamus K. Parkes authored
      ## What changes were proposed in this pull request?
      
      Add a `metadata` keyword parameter to `pyspark.sql.Column.alias()` to allow users to mix in metadata while manipulating `DataFrame`s in `pyspark`. Without this, I believe it was necessary to pass back through `SparkSession.createDataFrame` each time a user wanted to manipulate `StructField.metadata` in `pyspark`.
      
      This pull request also improves consistency between the Scala and Python APIs (i.e. I did not add any functionality that was not already in the Scala API).
      
      Discussed ahead of time on JIRA with marmbrus
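
      A minimal sketch of the new parameter (the metadata key is illustrative):

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(1,)], ["a"])

      # attach metadata while aliasing; it surfaces in the resulting StructField
      renamed = df.select(df.a.alias("b", metadata={"comment": "example"}))
      print(renamed.schema.fields[0].metadata)  # {'comment': 'example'}
      ```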
      
      ## How was this patch tested?
      
      Added unit tests (and doc tests).  Ran the pertinent tests manually.
      
      Author: Sheamus K. Parkes <shea.parkes@milliman.com>
      
      Closes #16094 from shea-parkes/pyspark-column-alias-metadata.
  6. Feb 13, 2017
    • [SPARK-19429][PYTHON][SQL] Support slice arguments in Column.__getitem__ · e02ac303
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add support for `slice` arguments in `Column.__getitem__`.
      - Remove obsolete `__getslice__` bindings.
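
      For illustration, a sketch of the new behavior, assuming the slice maps to `substr` with one-based positions:

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([("Alice",)], ["name"])
      df.select(df.name[1:3].alias("col")).show()  # equivalent to df.name.substr(1, 3) -> 'Ali'
      ```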
      
      ## How was this patch tested?
      
      Existing unit tests, additional tests covering `[]` with `slice`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16771 from zero323/SPARK-19429.
  7. Jan 30, 2017
    • [SPARK-19403][PYTHON][SQL] Correct pyspark.sql.column.__all__ list. · 06fbc355
      zero323 authored
      ## What changes were proposed in this pull request?
      
      This removes from the `__all__` list class names that are not defined (visible) in `pyspark.sql.column`.
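
      Schematically, the fix keeps `__all__` limited to names the module actually defines, e.g.:

      ```python
      # pyspark/sql/column.py (schematic): export only what is defined here,
      # so `from pyspark.sql.column import *` cannot reference missing classes
      __all__ = ["Column"]
      ```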
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16742 from zero323/SPARK-19403.
  9. May 23, 2016
    • [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code · a15ca553
      WeichenXu authored
      ## What changes were proposed in this pull request?
      
      Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.
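
      The builder pattern in question (the `master` and `appName` values here are illustrative):

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder \
          .master("local[4]") \
          .appName("python-tests") \
          .getOrCreate()
      ```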
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
  10. May 11, 2016
    • [SPARK-15278] [SQL] Remove experimental tag from Python DataFrame · 40ba87f7
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Earlier we removed the experimental tag for Scala/Java DataFrames, but hadn't done so for Python. This patch removes the experimental tag for Python and declares the API stable.
      
      ## How was this patch tested?
      N/A.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13062 from rxin/SPARK-15278.
  11. May 04, 2016
    • [SPARK-14896][SQL] Deprecate HiveContext in python · fa79d346
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      See title.
      
      ## How was this patch tested?
      
      PySpark tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12917 from andrewor14/deprecate-hive-context-python.
  12. Mar 23, 2016
    • [SPARK-14088][SQL] Some Dataset API touch-up · 926a93e5
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL.
      2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups.
      3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users.
      4. Remove "subtract" function since it is just an alias for "except".
      
      ## How was this patch tested?
      All changes should be covered by existing tests. Also added a couple of test cases to cover "name".
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11908 from rxin/SPARK-14088.
  13. Mar 14, 2016
    • [SPARK-10380][SQL] Fix confusing documentation examples for astype/drop_duplicates. · 8e0b0306
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We have seen users getting confused by the documentation for astype and drop_duplicates, because the examples in them do not use these functions (but do use their aliases). This patch simply removes all examples for these functions and says that they are aliases.
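
      For illustration, the two aliases in question (a minimal sketch):

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.range(3)

      df.select(df.id.astype("string")).show()  # astype is an alias for cast
      df.drop_duplicates().show()               # drop_duplicates is an alias for dropDuplicates
      ```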
      
      ## How was this patch tested?
      Existing PySpark unit tests.
      
      Closes #11543.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11698 from rxin/SPARK-10380.
  14. Feb 21, 2016
    • [SPARK-12799] Simplify various string output for expressions · d9efe63e
      Cheng Lian authored
      This PR introduces several major changes:
      
      1. Replacing `Expression.prettyString` with `Expression.sql`
      
         The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
      
      1. Using SQL-like representation as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)
      
         Before, we were using `prettyString` as column names when possible, and sometimes the resulting column names can be weird.  Here are several examples:
      
         Expression         | `prettyString` | `sql`      | Note
         ------------------ | -------------- | ---------- | ---------------
         `a && b`           | `a && b`       | `a AND b`  |
         `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct
      
      1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
      
         `NonSQLExpression.sql` may return an arbitrary user-facing string representation of the expression.
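
      For illustration, the effect is visible from Python in the generated column names (a sketch; the exact name string shown is an assumption):

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(True, False)], ["a", "b"])
      print(df.select(df.a & df.b).columns)  # e.g. ['(a AND b)'] rather than a prettyString form
      ```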
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
  15. Jan 13, 2016
    • [SPARK-12791][SQL] Simplify CaseWhen by breaking "branches" into "conditions" and "values" · cbbcd8e4
      Reynold Xin authored
      This pull request rewrites the CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field.
      
      Prior to this pull request, each even position in "branches" represented the condition for a branch and each odd position its value. Using them was pretty confusing, requiring a lot of sliding-window or grouped(2) calls.
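
      Schematically (a sketch with placeholder values, shown in Python for brevity):

      ```python
      cond1, val1, cond2, val2, default = "c1", "v1", "c2", "v2", "else"

      # before: one flat list; even positions are conditions, odd positions are values
      branches_old = [cond1, val1, cond2, val2]  # plus an optional trailing else value

      # after: explicit (condition, value) pairs and a separate optional else value
      branches_new = [(cond1, val1), (cond2, val2)]
      else_value = default  # may be absent
      ```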
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10734 from rxin/simplify-case.
  20. Sep 02, 2015
    • [SPARK-10417] [SQL] Iterating through Column results in infinite loop · 6cd98c18
      0x0FFF authored
      The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable in Python. In fact it has `__getitem__` to address the case when the column might be a list or dict, so that you can access a certain element of it in the DataFrame API. The ability to iterate over it is just a side effect that might cause confusion for people getting familiar with Spark DataFrames (as you might iterate this way over a Pandas DataFrame, for instance).
      
      Issue reproduction:
      ```python
      df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
      for i in df["name"]: print i
      ```
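
      The fix, roughly, makes iteration fail fast instead of looping forever (a sketch of the approach):

      ```python
      class Column(object):
          # ... existing methods ...
          def __iter__(self):
              # without this, Python falls back to __getitem__ with 0, 1, 2, ...
              # and never stops, since every index yields another Column
              raise TypeError("Column is not iterable")
      ```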
      
      Author: 0x0FFF <programmerag@gmail.com>
      
      Closes #8574 from 0x0FFF/SPARK-10417.
  23. Jun 23, 2015
    • [SPARK-8573] [SPARK-8568] [SQL] [PYSPARK] raise Exception if column is used in booelan expression · 7fb5ae50
      Davies Liu authored
      It's a common mistake for users to put a Column in a boolean expression (together with `and` or `or`), which does not work as expected. We should raise an exception in that case and suggest using `&` and `|` instead.
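
      For illustration, the mistake and the suggested fix side by side (a minimal sketch):

      ```python
      from pyspark.sql import SparkSession

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([(2, 1)], ["a", "b"])

      # raises ValueError: Python's 'and' calls bool() on the Column
      # df.filter(df.a > 1 and df.b < 3)

      # works: '&' and '|' build column expressions (note the parentheses)
      df.filter((df.a > 1) & (df.b < 3)).show()
      ```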
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6961 from davies/column_bool and squashes the following commits:
      
      9f19beb [Davies Liu] update message
      af74bd6 [Davies Liu] fix tests
      07dff84 [Davies Liu] address comments, fix tests
      f70c08e [Davies Liu] raise Exception if column is used in booelan expression
  25. May 23, 2015
    • [SPARK-7322, SPARK-7836, SPARK-7822][SQL] DataFrame window function related updates · efe3bfdf
      Davies Liu authored
      1. ntile should take an integer as a parameter.
      2. Added Python API (based on #6364)
      3. Update documentation of various DataFrame Python functions.
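
      A sketch of the Python window-function API in question (data and column names are illustrative):

      ```python
      from pyspark.sql import SparkSession, Window
      from pyspark.sql.functions import ntile, row_number

      spark = SparkSession.builder.getOrCreate()
      df = spark.createDataFrame([("a", 1), ("a", 2), ("b", 3)], ["key", "value"])

      w = Window.partitionBy("key").orderBy("value")
      df.select("key", "value",
                ntile(2).over(w).alias("bucket"),   # ntile takes an integer
                row_number().over(w).alias("rn")).show()
      ```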
      
      Author: Davies Liu <davies@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6374 from rxin/window-final and squashes the following commits:
      
      69004c7 [Reynold Xin] Style fix.
      288cea9 [Reynold Xin] Update documentaiton.
      7cb8985 [Reynold Xin] Merge pull request #6364 from davies/window
      66092b4 [Davies Liu] update docs
      ed73cb4 [Reynold Xin] [SPARK-7322][SQL] Improve DataFrame window function documentation.
      ef55132 [Davies Liu] Merge branch 'master' of github.com:apache/spark into window4
      8936ade [Davies Liu] fix maxint in python 3
      2649358 [Davies Liu] update docs
      778e2c0 [Davies Liu] SPARK-7836 and SPARK-7822: Python API of window functions
  26. May 21, 2015
    • [SPARK-7394][SQL] Add Pandas style cast (astype) · 699906e5
      kaka1992 authored
      Author: kaka1992 <kaka_1992@163.com>
      
      Closes #6313 from kaka1992/astype and squashes the following commits:
      
      73dfd0b [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
      ad8feb2 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
      4f328b7 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
    • [SPARK-7606] [SQL] [PySpark] add version to Python SQL API docs · 8ddcb25b
      Davies Liu authored
      Add version info for public Python SQL API.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6295 from davies/versions and squashes the following commits:
      
      cfd91e6 [Davies Liu] add more version for DataFrame API
      600834d [Davies Liu] add version to SQL API docs
  27. May 15, 2015
    • [SPARK-7543] [SQL] [PySpark] split dataframe.py into multiple files · d7b69946
      Davies Liu authored
      dataframe.py is split into column.py, group.py and dataframe.py:
      ```
         360 column.py
        1223 dataframe.py
         183 group.py
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6201 from davies/split_df and squashes the following commits:
      
      fc8f5ab [Davies Liu] split dataframe.py into multiple files