  1. Aug 25, 2016
  2. May 23, 2016
    • WeichenXu's avatar
      [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with... · a15ca553
      WeichenXu authored
      [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code
      
      ## What changes were proposed in this pull request?
      
      Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
      a15ca553
  3. May 11, 2016
    • Reynold Xin's avatar
      [SPARK-15278] [SQL] Remove experimental tag from Python DataFrame · 40ba87f7
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Earlier we removed the experimental tag for Scala/Java DataFrames, but had not done so for Python. This patch removes the experimental tag for Python DataFrames and declares them stable.
      
      ## How was this patch tested?
      N/A.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13062 from rxin/SPARK-15278.
      40ba87f7
  4. May 04, 2016
    • Andrew Or's avatar
      [SPARK-14896][SQL] Deprecate HiveContext in python · fa79d346
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      See title.
      
      ## How was this patch tested?
      
      PySpark tests.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #12917 from andrewor14/deprecate-hive-context-python.
      fa79d346
  5. Mar 23, 2016
    • Reynold Xin's avatar
      [SPARK-14088][SQL] Some Dataset API touch-up · 926a93e5
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      1. Deprecate unionAll. It is confusing to have both "union" and "unionAll" when the two do the same thing in Spark but differ in SQL.
      2. Rename reduce in KeyValueGroupedDataset to reduceGroups so that it is more consistent with the rest of the functions in KeyValueGroupedDataset. This also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because "reduce" could mean reducing a Dataset or just reducing groups.
      3. Add a "name" function, which is a more natural way to name columns than "as" for non-SQL users.
      4. Remove the "subtract" function, since it is just an alias for "except".
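The SQL distinction mentioned in item 1 can be seen in any SQL engine. The sketch below uses Python's standard `sqlite3` module rather than Spark, and the table names `t1` and `t2` are made up for illustration: `UNION` removes duplicate rows, while `UNION ALL` keeps them.

```python
import sqlite3

# In-memory database with two small, overlapping tables (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (x INTEGER)")
conn.execute("CREATE TABLE t2 (x INTEGER)")
conn.executemany("INSERT INTO t1 VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO t2 VALUES (?)", [(2,), (3,)])

# UNION deduplicates the combined rows.
union_rows = sorted(
    row[0] for row in conn.execute("SELECT x FROM t1 UNION SELECT x FROM t2")
)

# UNION ALL keeps every row from both inputs, including the repeated 2.
union_all_rows = sorted(
    row[0] for row in conn.execute("SELECT x FROM t1 UNION ALL SELECT x FROM t2")
)

print(union_rows)
print(union_all_rows)
```

In plain SQL the two operators therefore return different results for overlapping inputs, which is why having two Dataset methods that behave identically was considered confusing.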
      
      ## How was this patch tested?
      All changes should be covered by existing tests. Also added a couple of test cases to cover "name".
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11908 from rxin/SPARK-14088.
      926a93e5
  6. Mar 14, 2016
    • Reynold Xin's avatar
      [SPARK-10380][SQL] Fix confusing documentation examples for astype/drop_duplicates. · 8e0b0306
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      We have seen users getting confused by the documentation for astype and drop_duplicates, because the examples in them do not use these functions (but do use their aliases). This patch simply removes all examples for these functions and says that they are aliases.
      
      ## How was this patch tested?
      Existing PySpark unit tests.
      
      Closes #11543.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11698 from rxin/SPARK-10380.
      8e0b0306
  7. Feb 21, 2016
    • Cheng Lian's avatar
      [SPARK-12799] Simplify various string output for expressions · d9efe63e
      Cheng Lian authored
      This PR introduces several major changes:
      
      1. Replacing `Expression.prettyString` with `Expression.sql`
      
         The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
      
      1. Using a SQL-like representation as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)

         Before, we were using `prettyString` as column names when possible, and the resulting column names could sometimes be odd.  Here are several examples:
      
         Expression         | `prettyString` | `sql`      | Note
         ------------------ | -------------- | ---------- | ---------------
         `a && b`           | `a && b`       | `a AND b`  |
         `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct
      
      1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
      
         `NonSQLExpression.sql` may return an arbitrary user-facing string representation of the expression.
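The two string forms in the table above can be mimicked with a toy expression tree. This is only a Python sketch of the idea, not Catalyst code: the classes `Attr`, `And`, and `GetField` and the method name `pretty` are hypothetical stand-ins.

```python
class Attr:
    """A bare attribute reference such as `a`."""

    def __init__(self, name):
        self.name = name

    def pretty(self):
        return self.name

    def sql(self):
        return self.name


class And:
    """Logical conjunction of two expressions."""

    def __init__(self, left, right):
        self.left = left
        self.right = right

    def pretty(self):
        # Developer-facing form, mirrors the Scala operator.
        return "{} && {}".format(self.left.pretty(), self.right.pretty())

    def sql(self):
        # User-facing form, mirrors SQL syntax.
        return "{} AND {}".format(self.left.sql(), self.right.sql())


class GetField:
    """Field access on a struct-typed expression."""

    def __init__(self, child, field):
        self.child = child
        self.field = field

    def pretty(self):
        return "{}[{}]".format(self.child.pretty(), self.field)

    def sql(self):
        return "{}.{}".format(self.child.sql(), self.field)


conj = And(Attr("a"), Attr("b"))
field = GetField(Attr("a"), "f")
print(conj.pretty(), "|", conj.sql())
print(field.pretty(), "|", field.sql())
```

Keeping the two renderings as separate methods lets the debugging form change freely without affecting the column names users see.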
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
      d9efe63e
  8. Jan 13, 2016
    • Reynold Xin's avatar
      [SPARK-12791][SQL] Simplify CaseWhen by breaking "branches" into "conditions" and "values" · cbbcd8e4
      Reynold Xin authored
      This pull request rewrites the CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field.
      
      Prior to this pull request, each even position in "branches" represented the condition for a branch, and each odd position represented the corresponding value. Using them was confusing and required many sliding-window or grouped(2) calls.
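The difference between the two layouts can be sketched in Python (the actual change is in Scala). The helper name `split_branches` is hypothetical, and the sketch assumes that an odd-length flat list carries the else value as its last element.

```python
def split_branches(branches):
    """Convert a flat [cond1, value1, cond2, value2, ..., else?] list
    into (list of (condition, value) pairs, else_value or None).

    Assumption: an odd-length list ends with the else value.
    """
    has_else = len(branches) % 2 == 1
    body = branches[:-1] if has_else else branches
    # Even positions are conditions, odd positions are values.
    pairs = list(zip(body[0::2], body[1::2]))
    else_value = branches[-1] if has_else else None
    return pairs, else_value


flat = ["a > 1", "'x'", "a > 0", "'y'", "'z'"]
pairs, else_value = split_branches(flat)
print(pairs)
print(else_value)
```

With the pairs and the else value held separately, callers no longer need index arithmetic to tell conditions from values.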
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #10734 from rxin/simplify-case.
      cbbcd8e4
  9. Jan 04, 2016
  10. Nov 23, 2015
  11. Sep 11, 2015
  12. Sep 08, 2015
  13. Sep 02, 2015
    • 0x0FFF's avatar
      [SPARK-10417] [SQL] Iterating through Column results in infinite loop · 6cd98c18
      0x0FFF authored
      The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable in Python. The method exists so that elements of a list or dict column can be accessed through the DataFrame API. The ability to iterate over a Column is just a side effect, and it can confuse people who are getting familiar with Spark DataFrames (they might iterate this way over a pandas DataFrame, for instance).
      
      Issue reproduction:
      ```
      df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
      for i in df["name"]:  # never terminates: __getitem__ never raises IndexError
          print(i)
      ```
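The mechanism can be reproduced without Spark. In the sketch below, `Indexable` and `NotIterable` are hypothetical classes: the first shows how Python's legacy sequence protocol makes any object with `__getitem__` iterable, and the second shows one way to block that (defining `__iter__` to raise `TypeError`), which is not necessarily how this patch does it.

```python
import itertools


class Indexable:
    """Accepts any key, similar in spirit to a column accessor."""

    def __getitem__(self, key):
        # Never raises IndexError, so legacy iteration would never stop.
        return "item[{}]".format(key)


class NotIterable(Indexable):
    """Same accessor, but iteration is rejected explicitly."""

    def __iter__(self):
        raise TypeError("object is not iterable")


# iter() falls back to calling __getitem__(0), __getitem__(1), ...
# islice limits the otherwise endless loop to three items.
first_three = list(itertools.islice(iter(Indexable()), 3))
print(first_three)

try:
    iter(NotIterable())
    blocked = False
except TypeError:
    blocked = True
print(blocked)
```

Item access such as `NotIterable()["k"]` keeps working in this sketch; only iteration is rejected.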
      
      Author: 0x0FFF <programmerag@gmail.com>
      
      Closes #8574 from 0x0FFF/SPARK-10417.
      6cd98c18
  14. Aug 25, 2015
  15. Aug 06, 2015
  16. Jun 23, 2015
    • Davies Liu's avatar
      [SPARK-8573] [SPARK-8568] [SQL] [PYSPARK] raise Exception if column is used in booelan expression · 7fb5ae50
      Davies Liu authored
      It's a common mistake for users to put a Column in a boolean expression (together with `and` or `or`), which does not work as expected. We should raise an exception in that case and suggest using `&` and `|` instead.
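The behaviour described above can be sketched in plain Python. The `Column` class here is a hypothetical stand-in, not the PySpark class, and the error message is illustrative: `and` and `or` call `bool()` on their left operand and cannot be overloaded, so the object refuses conversion to bool and offers `&` and `|` instead.

```python
class Column:
    """Hypothetical stand-in for a lazily evaluated column expression."""

    def __init__(self, expr):
        self.expr = expr

    def __and__(self, other):
        # `a & b` builds a new expression instead of evaluating anything.
        return Column("({} AND {})".format(self.expr, other.expr))

    def __or__(self, other):
        return Column("({} OR {})".format(self.expr, other.expr))

    def __bool__(self):
        # `a and b` / `a or b` end up here, because they need a truth value.
        raise ValueError(
            "Cannot convert column into bool: use '&' for 'and', '|' for 'or'."
        )

    __nonzero__ = __bool__  # Python 2 spelling of the same hook


a = Column("age > 3")
b = Column("name = 'x'")
combined = a & b
print(combined.expr)
```

Failing loudly is preferable here because silently treating the object as truthy would make `a and b` evaluate to `b` alone, dropping the first condition.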
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6961 from davies/column_bool and squashes the following commits:
      
      9f19beb [Davies Liu] update message
      af74bd6 [Davies Liu] fix tests
      07dff84 [Davies Liu] address comments, fix tests
      f70c08e [Davies Liu] raise Exception if column is used in booelan expression
      7fb5ae50
  17. Jun 02, 2015
  18. May 23, 2015
    • Davies Liu's avatar
      [SPARK-7322, SPARK-7836, SPARK-7822][SQL] DataFrame window function related updates · efe3bfdf
      Davies Liu authored
      1. ntile should take an integer as parameter.
      2. Added Python API (based on #6364)
      3. Update documentation of various DataFrame Python functions.
      
      Author: Davies Liu <davies@databricks.com>
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #6374 from rxin/window-final and squashes the following commits:
      
      69004c7 [Reynold Xin] Style fix.
      288cea9 [Reynold Xin] Update documentaiton.
      7cb8985 [Reynold Xin] Merge pull request #6364 from davies/window
      66092b4 [Davies Liu] update docs
      ed73cb4 [Reynold Xin] [SPARK-7322][SQL] Improve DataFrame window function documentation.
      ef55132 [Davies Liu] Merge branch 'master' of github.com:apache/spark into window4
      8936ade [Davies Liu] fix maxint in python 3
      2649358 [Davies Liu] update docs
      778e2c0 [Davies Liu] SPARK-7836 and SPARK-7822: Python API of window functions
      efe3bfdf
  19. May 21, 2015
    • kaka1992's avatar
      [SPARK-7394][SQL] Add Pandas style cast (astype) · 699906e5
      kaka1992 authored
      Author: kaka1992 <kaka_1992@163.com>
      
      Closes #6313 from kaka1992/astype and squashes the following commits:
      
      73dfd0b [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
      ad8feb2 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
      4f328b7 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
      699906e5
    • Davies Liu's avatar
      [SPARK-7606] [SQL] [PySpark] add version to Python SQL API docs · 8ddcb25b
      Davies Liu authored
      Add version info for public Python SQL API.
      
      cc rxin
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6295 from davies/versions and squashes the following commits:
      
      cfd91e6 [Davies Liu] add more version for DataFrame API
      600834d [Davies Liu] add version to SQL API docs
      8ddcb25b
  20. May 15, 2015
    • Davies Liu's avatar
      [SPARK-7543] [SQL] [PySpark] split dataframe.py into multiple files · d7b69946
      Davies Liu authored
      dataframe.py is split into column.py, group.py and dataframe.py (line counts after the split):
      ```
         360 column.py
        1223 dataframe.py
         183 group.py
      ```
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #6201 from davies/split_df and squashes the following commits:
      
      fc8f5ab [Davies Liu] split dataframe.py into multiple files
      d7b69946