  1. Jul 08, 2017
    • [SPARK-20456][DOCS] Add examples for functions collection for pyspark · f5f02d21
      Michael Patterson authored
      ## What changes were proposed in this pull request?
      
      This adds documentation to many functions in pyspark.sql.functions.py:
      `upper`, `lower`, `reverse`, `unix_timestamp`, `from_unixtime`, `rand`, `randn`, `collect_list`, `collect_set`, `lit`
      Adds units to the trigonometric functions.
      Renames columns in datetime examples to be more informative.
      Adds links between some functions.
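
      For illustration, a minimal sketch (assuming an active SparkSession named `spark`) exercising a few of the functions whose docstrings this PR touches:

      ```python
      from pyspark.sql import functions as F

      df = spark.createDataFrame([("Alice", "2017-07-08 10:00:00")], ["name", "ts"])
      df.select(
          F.upper("name").alias("name_upper"),      # 'ALICE'
          F.unix_timestamp("ts").alias("ts_unix"),  # seconds since the epoch
          F.lit(1).alias("one"),                    # a literal column
      ).show()
      ```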
      
      ## How was this patch tested?
      
      `./dev/lint-python`
      `python python/pyspark/sql/functions.py`
      `./python/run-tests.py --module pyspark-sql`
      
      Author: Michael Patterson <map222@gmail.com>
      
      Closes #17865 from map222/spark-20456.
      f5f02d21
  2. Jul 07, 2017
    • [SPARK-21327][SQL][PYSPARK] ArrayConstructor should handle an array of typecode 'l' as long rather than int in Python 2 · 53c2eb59
      Takuya UESHIN authored
      
      ## What changes were proposed in this pull request?
      
      Currently `ArrayConstructor` handles an array of typecode `'l'` as `int` when converting a Python object into a Java object in Python 2, so if a value is larger than `Integer.MAX_VALUE` or smaller than `Integer.MIN_VALUE`, an overflow occurs.
      
      ```python
      from pyspark.sql import Row  # needed for the Row used below
      import array

      data = [Row(longarray=array.array('l', [-9223372036854775808, 0, 9223372036854775807]))]
      df = spark.createDataFrame(data)
      df.show(truncate=False)
      ```
      
      ```
      +----------+
      |longarray |
      +----------+
      |[0, 0, -1]|
      +----------+
      ```
      
      This should be:
      
      ```
      +----------------------------------------------+
      |longarray                                     |
      +----------------------------------------------+
      |[-9223372036854775808, 0, 9223372036854775807]|
      +----------------------------------------------+
      ```
      
      ## How was this patch tested?
      
      Added a test and existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18553 from ueshin/issues/SPARK-21327.
      53c2eb59
  3. Jul 05, 2017
    • [SPARK-21278][PYSPARK] Upgrade to Py4J 0.10.6 · c8d0aba1
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR aims to bump Py4J in order to fix the following float/double precision bug.
      Py4J 0.10.5 fixed it (https://github.com/bartdag/py4j/issues/272), and the latest Py4J release is 0.10.6.
      
      **BEFORE**
      ```
      >>> df = spark.range(1)
      >>> df.select(df['id'] + 17.133574204226083).show()
      +--------------------+
      |(id + 17.1335742042)|
      +--------------------+
      |       17.1335742042|
      +--------------------+
      ```
      
      **AFTER**
      ```
      >>> df = spark.range(1)
      >>> df.select(df['id'] + 17.133574204226083).show()
      +-------------------------+
      |(id + 17.133574204226083)|
      +-------------------------+
      |       17.133574204226083|
      +-------------------------+
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #18546 from dongjoon-hyun/SPARK-21278.
      c8d0aba1
    • [SPARK-19439][PYSPARK][SQL] PySpark's registerJavaFunction Should Support UDAFs · 742da086
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      Support registering Java UDAFs in PySpark so that users can use Java UDAFs from PySpark. Besides that, I also add an API in `UDFRegistration`.
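
      A minimal usage sketch, assuming the new API is `spark.udf.registerJavaUDAF` and that the (hypothetical) Java UDAF class below is on the JVM classpath:

      ```python
      # Hypothetical Java UDAF class; its jar must be on Spark's classpath.
      spark.udf.registerJavaUDAF("javaAvg", "com.example.MyAverageUDAF")

      df = spark.createDataFrame([(1, 10.0), (1, 20.0), (2, 3.0)], ["id", "value"])
      df.createOrReplaceTempView("t")
      spark.sql("SELECT id, javaAvg(value) AS avg_value FROM t GROUP BY id").show()
      ```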
      
      ## How was this patch tested?
      
      Unit test is added
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #17222 from zjffdu/SPARK-19439.
      742da086
    • [SPARK-21310][ML][PYSPARK] Expose offset in PySpark · 4852b7d4
      actuaryzhang authored
      ## What changes were proposed in this pull request?
      Add offset to PySpark in GLM as in #16699.
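
      A minimal sketch (assuming an active SparkSession) of the newly exposed offset column parameter, `offsetCol` as in the Scala API:

      ```python
      from pyspark.ml.linalg import Vectors
      from pyspark.ml.regression import GeneralizedLinearRegression

      df = spark.createDataFrame(
          [(1.0, 0.0, Vectors.dense(1.0)),
           (2.0, 0.5, Vectors.dense(2.0)),
           (3.0, 1.0, Vectors.dense(3.0)),
           (4.0, 0.0, Vectors.dense(4.0))],
          ["label", "offset", "features"])

      glr = GeneralizedLinearRegression(family="poisson", link="log", offsetCol="offset")
      model = glr.fit(df)
      print(model.coefficients)
      ```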
      
      ## How was this patch tested?
      Python test
      
      Author: actuaryzhang <actuaryzhang10@gmail.com>
      
      Closes #18534 from actuaryzhang/pythonOffset.
      4852b7d4
  4. Jul 04, 2017
  5. Jul 03, 2017
    • [SPARK-21264][PYTHON] Call cross join path in join without 'on' and with 'how' · a848d552
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, PySpark throws an NPE when the `on` columns are missing but a join type is specified in `join`, as below:
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "false")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      
      ```
      Traceback (most recent call last):
      ...
      py4j.protocol.Py4JJavaError: An error occurred while calling o66.join.
      : java.lang.NullPointerException
      	at org.apache.spark.sql.Dataset.join(Dataset.scala:931)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      ...
      ```
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "true")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      
      ```
      ...
      py4j.protocol.Py4JJavaError: An error occurred while calling o84.join.
      : java.lang.NullPointerException
      	at org.apache.spark.sql.Dataset.join(Dataset.scala:931)
      	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
      ...
      ```
      
      This PR suggests following Scala's behaviour, as shown below:
      
      ```scala
      scala> spark.conf.set("spark.sql.crossJoin.enabled", "false")
      
      scala> spark.range(1).join(spark.range(1), Seq.empty[String], "inner").show()
      ```
      
      ```
      org.apache.spark.sql.AnalysisException: Detected cartesian product for INNER join between logical plans
      Range (0, 1, step=1, splits=Some(8))
      and
      Range (0, 1, step=1, splits=Some(8))
      Join condition is missing or trivial.
      Use the CROSS JOIN syntax to allow cartesian products between these relations.;
      ...
      ```
      
      ```scala
      scala> spark.conf.set("spark.sql.crossJoin.enabled", "true")
      
      scala> spark.range(1).join(spark.range(1), Seq.empty[String], "inner").show()
      ```
      ```
      +---+---+
      | id| id|
      +---+---+
      |  0|  0|
      +---+---+
      ```
      
      **After**
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "false")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      
      ```
      Traceback (most recent call last):
      ...
      pyspark.sql.utils.AnalysisException: u'Detected cartesian product for INNER join between logical plans\nRange (0, 1, step=1, splits=Some(8))\nand\nRange (0, 1, step=1, splits=Some(8))\nJoin condition is missing or trivial.\nUse the CROSS JOIN syntax to allow cartesian products between these relations.;'
      ```
      
      ```python
      spark.conf.set("spark.sql.crossJoin.enabled", "true")
      spark.range(1).join(spark.range(1), how="inner").show()
      ```
      ```
      +---+---+
      | id| id|
      +---+---+
      |  0|  0|
      +---+---+
      ```
      
      ## How was this patch tested?
      
      Added tests in `python/pyspark/sql/tests.py`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18484 from HyukjinKwon/SPARK-21264.
      a848d552
  6. Jul 02, 2017
    • [SPARK-19852][PYSPARK][ML] Python StringIndexer supports 'keep' to handle invalid data · c19680be
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      This PR maintains API parity with the changes made in SPARK-17498, which added a new option
      'keep' to StringIndexer for handling unseen labels or NULL values, by exposing it in PySpark.

      Note: This is an updated version of #17237; the primary author of this PR is VinceShieh.
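
      A minimal sketch (assuming an active SparkSession) of the new behaviour, where unseen labels get an extra index instead of raising an error:

      ```python
      from pyspark.ml.feature import StringIndexer

      train = spark.createDataFrame([(0, "a"), (1, "b")], ["id", "category"])
      test = spark.createDataFrame([(2, "a"), (3, "c")], ["id", "category"])  # 'c' was unseen

      indexer = StringIndexer(inputCol="category", outputCol="categoryIndex",
                              handleInvalid="keep")
      model = indexer.fit(train)
      model.transform(test).show()  # 'c' maps to the extra index rather than failing
      ```
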
      ## How was this patch tested?
      Unit tests.
      
      Author: VinceShieh <vincent.xie@intel.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #18453 from yanboliang/spark-19852.
      c19680be
  7. Jul 01, 2017
    • [SPARK-18518][ML] HasSolver supports override · e0b047ea
      Ruifeng Zheng authored
      ## What changes were proposed in this pull request?
      1. Make params support non-final fields via a `finalFields` option.
      2. Generate `HasSolver` with `finalFields = false`.
      3. Override `solver` in LiR and GLR, and make MLPC inherit `HasSolver` (see the PySpark sketch below).
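
      For reference, a small PySpark sketch of the `solver` params each algorithm exposes after this change (a sketch only; the values in the comments are the documented ones):

      ```python
      from pyspark.ml.classification import MultilayerPerceptronClassifier
      from pyspark.ml.regression import GeneralizedLinearRegression, LinearRegression

      lir = LinearRegression(solver="l-bfgs")           # LiR: "auto", "normal", "l-bfgs"
      glr = GeneralizedLinearRegression(solver="irls")  # GLR: "irls"
      mlpc = MultilayerPerceptronClassifier(            # MLPC inherits HasSolver
          solver="l-bfgs", layers=[4, 5, 3])
      ```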
      
      ## How was this patch tested?
      existing tests
      
      Author: Ruifeng Zheng <ruifengz@foxmail.com>
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #16028 from zhengruifeng/param_non_final.
      e0b047ea
  8. Jun 28, 2017
  9. Jun 23, 2017
    • [SPARK-20431][SS][FOLLOWUP] Specify a schema by using a DDL-formatted string in DataStreamReader · 7525ce98
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR supports a DDL-formatted string in `DataStreamReader.schema`.
      This lets users easily define a schema without importing the type classes.
      
      For example,
      
      ```scala
      scala> spark.readStream.schema("col0 INT, col1 DOUBLE").load("/tmp/abc").printSchema()
      root
       |-- col0: integer (nullable = true)
       |-- col1: double (nullable = true)
      ```
      
      ## How was this patch tested?
      
      Added tests in `DataStreamReaderWriterSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18373 from HyukjinKwon/SPARK-20431.
      7525ce98
    • [SPARK-21193][PYTHON] Specify Pandas version in setup.py · 5dca10b8
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It looks like we missed specifying the minimum Pandas version in setup.py. Given my tests below, it should be Pandas 0.13.0, so this PR proposes to set it to 0.13.0.
      
      Running the codes below:
      
      ```python
      from pyspark.sql.types import *
      
      schema = StructType().add("a", IntegerType()).add("b", StringType())\
                           .add("c", BooleanType()).add("d", FloatType())
      data = [
          (1, "foo", True, 3.0,), (2, "foo", True, 5.0),
          (3, "bar", False, -1.0), (4, "bar", False, 6.0),
      ]
      spark.createDataFrame(data, schema).toPandas().dtypes
      ```
      
      prints ...
      
      **With Pandas 0.13.0** - released, 2014-01
      
      ```
      a      int32
      b     object
      c       bool
      d    float32
      dtype: object
      ```
      
      **With Pandas 0.12.0** - released, 2013-06
      
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
          pdf[f] = pdf[f].astype(t, copy=False)
      TypeError: astype() got an unexpected keyword argument 'copy'
      ```
      
      without `copy`
      
      ```
      a      int32
      b     object
      c       bool
      d    float32
      dtype: object
      ```
      
      **With Pandas 0.11.0** - released, 2013-03
      
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
          pdf[f] = pdf[f].astype(t, copy=False)
      TypeError: astype() got an unexpected keyword argument 'copy'
      ```
      
      without `copy`
      
      ```
      a      int32
      b     object
      c       bool
      d    float32
      dtype: object
      ```
      
      **With Pandas 0.10.0** - released, 2012-12
      
      ```
      Traceback (most recent call last):
        File "<stdin>", line 1, in <module>
        File ".../spark/python/pyspark/sql/dataframe.py", line 1734, in toPandas
          pdf[f] = pdf[f].astype(t, copy=False)
      TypeError: astype() got an unexpected keyword argument 'copy'
      ```
      
      without `copy`
      
      ```
      a      int64  # <- this should be 'int32'
      b     object
      c       bool
      d    float64  # <- this should be 'float32'
      ```
      
      ## How was this patch tested?
      
      Manually tested with Pandas from 0.10.0 to 0.13.0.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18403 from HyukjinKwon/SPARK-21193.
      5dca10b8
  10. Jun 22, 2017
    • [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas · e4469760
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame.  All non-complex data types are currently supported, otherwise an `UnsupportedOperation` exception is thrown.
      
      Additions to Spark include a Scala package-private method `Dataset.toArrowPayloadBytes` that converts data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served, and a package-private class/object `ArrowConverters` that provides data type mappings and conversion routines.  In Python, a public method `DataFrame.collectAsArrow` is added to collect Arrow payloads, and an optional flag in `toPandas(useArrow=False)` enables using Arrow (the old conversion is used by default).
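
      A usage sketch based on the description above (the `useArrow` flag is as described in this PR and assumes pyarrow is installed):

      ```python
      df = spark.range(1 << 20).selectExpr("id", "id * 2 AS doubled")

      pdf_default = df.toPandas()             # existing row-based conversion path
      pdf_arrow = df.toPandas(useArrow=True)  # Arrow-based path described in this PR
      ```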
      
      ## How was this patch tested?
      Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types.  The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data.  This ensures that the schema and data have been converted correctly.

      Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow.  A roundtrip test ensures the pandas DataFrame produced by pyspark is equal to one made directly with pandas.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: Li Jin <ice.xelloss@gmail.com>
      Author: Li Jin <li.jin@twosigma.com>
      Author: Wes McKinney <wes.mckinney@twosigma.com>
      
      Closes #15821 from BryanCutler/wip-toPandas_with_arrow-SPARK-13534.
      e4469760
    • [SPARK-21163][SQL] DataFrame.toPandas should respect the data type · 67c75021
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently we convert a Spark DataFrame to a Pandas DataFrame via `pd.DataFrame.from_records`. It infers the data types from the data and doesn't respect the Spark DataFrame schema. This PR fixes that.
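
      A small illustration of the intent (assuming an active SparkSession): after this change the Pandas dtypes follow the Spark schema instead of being re-inferred from the data.

      ```python
      from pyspark.sql.types import FloatType, IntegerType, StructType

      schema = StructType().add("a", IntegerType()).add("b", FloatType())
      pdf = spark.createDataFrame([(1, 2.0), (3, 4.0)], schema).toPandas()
      print(pdf.dtypes)  # expected: a -> int32, b -> float32, matching the Spark schema
      ```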
      
      ## How was this patch tested?
      
      a new regression test
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Wenchen Fan <cloud0fan@gmail.com>
      
      Closes #18378 from cloud-fan/to_pandas.
      67c75021
  11. Jun 21, 2017
    • [SPARK-20830][PYSPARK][SQL] Add posexplode and posexplode_outer · 215281d8
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add Python wrappers for `o.a.s.sql.functions.explode_outer` and `o.a.s.sql.functions.posexplode_outer`.
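
      A minimal sketch (assuming an active SparkSession) of the new wrappers:

      ```python
      from pyspark.sql import Row
      from pyspark.sql.functions import explode_outer, posexplode_outer

      df = spark.createDataFrame([Row(id=1, xs=[1, 2]), Row(id=2, xs=None)])
      df.select("id", explode_outer("xs")).show()     # the null array still yields a row
      df.select("id", posexplode_outer("xs")).show()  # also adds a position column
      ```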
      
      ## How was this patch tested?
      
      Unit tests, doctests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #18049 from zero323/SPARK-20830.
      215281d8
    • [SPARK-21125][PYTHON] Extend setJobDescription to PySpark and JavaSpark APIs · ba78514d
      sjarvie authored
      ## What changes were proposed in this pull request?
      
      Extend setJobDescription to PySpark and JavaSpark APIs
      
      SPARK-21125
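
      A minimal sketch of the new PySpark call; the description shows up in the Spark UI for jobs triggered afterwards:

      ```python
      sc = spark.sparkContext
      sc.setJobDescription("Count a small range for the UI demo")
      spark.range(1000).count()
      ```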
      
      ## How was this patch tested?
      
      Testing was done by running a local Spark shell and checking the job description in the built UI. I had originally added a unit test, but the PySpark context cannot easily access the Scala SparkContext's private variable holding the job description key, so I omitted the test given the simplicity of this addition.
      
      Also ran the existing tests.
      
      # Misc
      
      This contribution is my original work and I license the work to the project under the project's open source license.
      
      Author: sjarvie <sjarvie@uber.com>
      
      Closes #18332 from sjarvie/add_python_set_job_description.
      ba78514d
  12. Jun 20, 2017
    • [SPARK-20929][ML] LinearSVC should use its own threshold param · cc67bd57
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      LinearSVC should use its own threshold param, rather than the shared one, since it applies to rawPrediction instead of probability.  This PR changes the param in the Scala, Python and R APIs.
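
      A minimal PySpark sketch (assuming an active SparkSession) of setting the LinearSVC-specific `threshold`, which is applied to rawPrediction:

      ```python
      from pyspark.ml.classification import LinearSVC
      from pyspark.ml.linalg import Vectors

      df = spark.createDataFrame(
          [(0.0, Vectors.dense(0.0, 1.0)), (1.0, Vectors.dense(1.0, 0.0))],
          ["label", "features"])

      svc = LinearSVC(maxIter=5, regParam=0.1, threshold=0.25)  # threshold on rawPrediction
      model = svc.fit(df)
      model.transform(df).select("rawPrediction", "prediction").show(truncate=False)
      ```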
      
      ## How was this patch tested?
      
      New unit test to make sure the threshold can be set to any Double value.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #18151 from jkbradley/ml-2.2-linearsvc-cleanup.
      cc67bd57
  13. Jun 19, 2017
  14. Jun 15, 2017
    • [SPARK-20980][SQL] Rename `wholeFile` to `multiLine` for both CSV and JSON · 20514281
      Xiao Li authored
      ### What changes were proposed in this pull request?
      The current option name `wholeFile` is misleading for CSV users: it does not mean one record per file, since a single file can contain multiple records. Thus, we should rename it; the proposed name is `multiLine`.
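
      A usage sketch of the renamed option (the file paths are hypothetical):

      ```python
      # multiLine lets a single logical record span multiple physical lines.
      df_json = spark.read.option("multiLine", "true").json("/tmp/records.json")
      df_csv = spark.read.option("multiLine", "true").csv("/tmp/records.csv", header=True)
      ```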
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #18202 from gatorsmile/renameCVSOption.
      20514281
  15. Jun 09, 2017
    • [SPARK-21042][SQL] Document Dataset.union is resolution by position · b78e3849
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Document that Dataset.union resolves columns by position, not by name, since this has been a confusing point for many users.
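
      A small illustration (assuming an active SparkSession) of why this trips people up:

      ```python
      df1 = spark.createDataFrame([(1, "a")], ["id", "name"])
      df2 = spark.createDataFrame([("b", 2)], ["name", "id"])  # same names, different order

      # union matches columns by position, not by name, so 'id' and 'name' get mixed here.
      df1.union(df2).show()
      ```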
      
      ## How was this patch tested?
      N/A - doc only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18256 from rxin/SPARK-21042.
      b78e3849
  16. Jun 03, 2017
    • [SPARK-19732][SQL][PYSPARK] Add fill functions for nulls in bool fields of datasets · 6cbc61d1
      Ruben Berenguel Montoro authored
      ## What changes were proposed in this pull request?
      
      Allow fill/replace of NAs with booleans, both in Python and Scala
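
      A minimal PySpark sketch (assuming an active SparkSession) of filling null booleans:

      ```python
      df = spark.createDataFrame([(1, True), (2, None)], ["id", "flag"])

      df.fillna(False).show()            # null booleans become False
      df.na.fill({"flag": True}).show()  # or per column via a dict
      ```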
      
      ## How was this patch tested?
      
      Unit tests, doctests
      
      This PR is original work from me and I license this work to the Spark project
      
      Author: Ruben Berenguel Montoro <ruben@mostlymaths.net>
      Author: Ruben Berenguel <ruben@mostlymaths.net>
      
      Closes #18164 from rberenguel/SPARK-19732-fillna-bools.
      6cbc61d1
  17. May 31, 2017
    • [SPARK-19236][SQL][FOLLOW-UP] Added createOrReplaceGlobalTempView method · de934e67
      gatorsmile authored
      ### What changes were proposed in this pull request?
      This PR does the following tasks:
      - Added `since` annotations
      - Added the Python API
      - Added test cases
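
      A minimal sketch of the Python API added here (assuming an active SparkSession):

      ```python
      df = spark.range(3)
      df.createOrReplaceGlobalTempView("people")

      # Global temp views live in the global_temp database and are shared across sessions.
      spark.sql("SELECT * FROM global_temp.people").show()
      spark.newSession().sql("SELECT * FROM global_temp.people").show()
      ```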
      
      ### How was this patch tested?
      Added test cases to both Scala and Python
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18147 from gatorsmile/createOrReplaceGlobalTempView.
      de934e67
  18. May 30, 2017
  19. May 26, 2017
    • [SPARK-20844] Remove experimental from Structured Streaming APIs · d935e0a9
      Michael Armbrust authored
      Now that Structured Streaming has been out for several Spark releases and has large production use cases, the `Experimental` label is no longer appropriate. I've left `InterfaceStability.Evolving`, however, as I think we may make a few changes to the pluggable Source & Sink API in Spark 2.3.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #18065 from marmbrus/streamingGA.
      d935e0a9
  20. May 25, 2017
  21. May 24, 2017
    • [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel · bc66a77b
      Bago Amirbekian authored
      ## What changes were proposed in this pull request?
      
      Fixed a TypeError with Python 3 and NumPy 1.12.1. NumPy's `reshape` no longer accepts floats as arguments as of 1.12. Also, since Python 3 uses float division for `/`, we should be using `//` to ensure that `_dataWithBiasSize` doesn't get set to a float.
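
      A minimal sketch of the underlying issue, independent of Spark: NumPy >= 1.12 rejects float dimensions, and Python 3's `/` produces floats.

      ```python
      import numpy as np

      n, k = 12, 3
      # np.arange(n).reshape(n / k, k)   # TypeError on NumPy >= 1.12: n / k is a float
      arr = np.arange(n).reshape(n // k, k)  # floor division keeps the shape integral
      print(arr.shape)  # (4, 3)
      ```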
      
      ## How was this patch tested?
      
      Existing tests run using python3 and numpy 1.12.
      
      Author: Bago Amirbekian <bago@databricks.com>
      
      Closes #18081 from MrBago/BF-py3floatbug.
      bc66a77b
    • [SPARK-20631][FOLLOW-UP] Fix incorrect tests. · 1816eb3b
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Fix incorrect tests for `_check_thresholds`.
      - Move test to `ParamTests`.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #18085 from zero323/SPARK-20631-FOLLOW-UP.
      1816eb3b
    • [SPARK-20764][ML][PYSPARK][FOLLOWUP] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version · 9afcf127
      Peng authored
      
      ## What changes were proposed in this pull request?
      Add test cases for PR-18062
      
      ## How was this patch tested?
      The existing UT
      
      Author: Peng <peng.meng@intel.com>
      
      Closes #18068 from mpjlu/moreTest.
      9afcf127
  22. May 23, 2017
    • [SPARK-20861][ML][PYTHON] Delegate looping over paramMaps to estimators · 9434280c
      Bago Amirbekian authored
      Changes:
      
      pyspark.ml Estimators can take either a list of param maps or a dict of params. This change allows the CrossValidator and TrainValidationSplit Estimators to pass lists of param maps through to the underlying estimators so that those estimators can handle parallelization when appropriate (e.g. distributed hyperparameter tuning).
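
      A sketch of the calling pattern this enables (assuming an active SparkSession; any parallelization happens inside the estimator):

      ```python
      from pyspark.ml.classification import LogisticRegression
      from pyspark.ml.linalg import Vectors
      from pyspark.ml.tuning import ParamGridBuilder

      train_df = spark.createDataFrame(
          [(0.0, Vectors.dense(0.0, 1.0)), (1.0, Vectors.dense(1.0, 0.0))],
          ["label", "features"])

      lr = LogisticRegression(maxIter=5)
      param_maps = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1]).build()

      # Passing the whole list to fit() returns one model per param map; after this change
      # CrossValidator/TrainValidationSplit hand the list over like this instead of looping.
      models = lr.fit(train_df, params=param_maps)
      print(len(models))  # 2
      ```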
      
      Testing:
      
      Existing unit tests.
      
      Author: Bago Amirbekian <bago@databricks.com>
      
      Closes #18077 from MrBago/delegate_params.
      9434280c
  23. May 22, 2017
    • [SPARK-20764][ML][PYSPARK] Fix visibility discrepancy with numInstances and degreesOfFreedom in LR and GLR - Python version · cfca0113
      Peng authored
      
      ## What changes were proposed in this pull request?
      
      SPARK-20097 exposed degreesOfFreedom in LinearRegressionSummary and numInstances in GeneralizedLinearRegressionSummary. The Python API should be updated to reflect these changes.
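
      A short sketch of the attributes now visible from Python (assuming an active SparkSession):

      ```python
      from pyspark.ml.linalg import Vectors
      from pyspark.ml.regression import GeneralizedLinearRegression, LinearRegression

      df = spark.createDataFrame(
          [(1.0, Vectors.dense(1.0)), (2.0, Vectors.dense(2.0)), (3.0, Vectors.dense(3.0))],
          ["label", "features"])

      lr_model = LinearRegression().fit(df)
      print(lr_model.summary.degreesOfFreedom)  # from LinearRegressionSummary

      glr_model = GeneralizedLinearRegression().fit(df)
      print(glr_model.summary.numInstances)     # from GeneralizedLinearRegressionSummary
      ```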
      
      ## How was this patch tested?
      The existing UT
      
      Author: Peng <peng.meng@intel.com>
      
      Closes #18062 from mpjlu/spark-20764.
      cfca0113
  24. May 21, 2017
  25. May 15, 2017
  26. May 12, 2017
    • [SPARK-20639][SQL] Add single argument support for to_timestamp in SQL with documentation improvement · 720708cc
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
      This PR proposes three things as below:
      
      - Use casting rules to a timestamp in `to_timestamp` by default (it was `yyyy-MM-dd HH:mm:ss`).
      
      - Support a single argument for `to_timestamp`, similar to the APIs in other languages.
      
        For example, the one below works
      
        ```
        import org.apache.spark.sql.functions._
        Seq("2016-12-31 00:12:00.00").toDF("a").select(to_timestamp(col("a"))).show()
        ```
      
        prints
      
        ```
        +----------------------------------------+
        |to_timestamp(`a`, 'yyyy-MM-dd HH:mm:ss')|
        +----------------------------------------+
        |                     2016-12-31 00:12:00|
        +----------------------------------------+
        ```
      
        whereas this does not work in SQL.
      
        **Before**
      
        ```
        spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
        Error in query: Invalid number of arguments for function to_timestamp; line 1 pos 7
        ```
      
        **After**
      
        ```
        spark-sql> SELECT to_timestamp('2016-12-31 00:12:00');
        2016-12-31 00:12:00
        ```
      
      - Related documentation improvements for SQL function descriptions and other API descriptions.
      
        **Before**
      
        ```
        spark-sql> DESCRIBE FUNCTION extended to_date;
        ...
        Usage: to_date(date_str, fmt) - Parses the `left` expression with the `fmt` expression. Returns null with invalid input.
        Extended Usage:
            Examples:
              > SELECT to_date('2016-12-31', 'yyyy-MM-dd');
               2016-12-31
        ```
      
        ```
        spark-sql> DESCRIBE FUNCTION extended to_timestamp;
        ...
        Usage: to_timestamp(timestamp, fmt) - Parses the `left` expression with the `format` expression to a timestamp. Returns null with invalid input.
        Extended Usage:
            Examples:
              > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
               2016-12-31 00:00:00.0
        ```
      
        **After**
      
        ```
        spark-sql> DESCRIBE FUNCTION extended to_date;
        ...
        Usage:
            to_date(date_str[, fmt]) - Parses the `date_str` expression with the `fmt` expression to
              a date. Returns null with invalid input. By default, it follows casting rules to a date if
              the `fmt` is omitted.
      
        Extended Usage:
            Examples:
              > SELECT to_date('2009-07-30 04:17:52');
               2009-07-30
              > SELECT to_date('2016-12-31', 'yyyy-MM-dd');
               2016-12-31
        ```
      
        ```
        spark-sql> DESCRIBE FUNCTION extended to_timestamp;
        ...
         Usage:
            to_timestamp(timestamp[, fmt]) - Parses the `timestamp` expression with the `fmt` expression to
              a timestamp. Returns null with invalid input. By default, it follows casting rules to
              a timestamp if the `fmt` is omitted.
      
        Extended Usage:
            Examples:
              > SELECT to_timestamp('2016-12-31 00:12:00');
               2016-12-31 00:12:00
              > SELECT to_timestamp('2016-12-31', 'yyyy-MM-dd');
               2016-12-31 00:00:00
        ```
      
      ## How was this patch tested?
      
      Added tests in `datetime.sql`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17901 from HyukjinKwon/to_timestamp_arg.
      720708cc
  27. May 11, 2017
  28. May 10, 2017
    • [SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ repeated arg. · 8ddbc431
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      There's a latent corner-case bug in PySpark UDF evaluation: executing a `BatchPythonEvaluation` with a single multi-argument UDF in which _at least one argument value is repeated_ will crash at execution with a confusing error.
      
      This problem was introduced in #12057: the code there has a fast path for handling a "batch UDF evaluation consisting of a single Python UDF", but that branch incorrectly assumes that a single UDF won't have repeated arguments and therefore skips the code for unpacking arguments from the input row (whose schema may not necessarily match the UDF inputs due to de-duplication of repeated arguments which occurred in the JVM before sending UDF inputs to Python).
      
      The fix here is simply to remove this special-casing: it turns out that the code in the "multiple UDFs" branch just so happens to work for the single-UDF case because Python treats `(x)` as equivalent to `x`, not as a single-argument tuple.
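
      A minimal reproduction sketch of the case that used to crash: a single UDF invoked with the same column for more than one argument.

      ```python
      from pyspark.sql.functions import udf
      from pyspark.sql.types import LongType

      add = udf(lambda a, b: a + b, LongType())
      df = spark.range(3)

      # One multi-argument UDF with a repeated argument: this previously hit the broken fast path.
      df.select(add(df["id"], df["id"]).alias("doubled")).show()
      ```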
      
      ## How was this patch tested?
      
      New regression test in `pyspark.python.sql.tests` module (tested and confirmed that it fails before my fix).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #17927 from JoshRosen/SPARK-20685.
      8ddbc431