  1. May 07, 2017
    • [SPARK-18777][PYTHON][SQL] Return UDF from udf.register · 63d90e7d
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Move udf wrapping code from `functions.udf` to `functions.UserDefinedFunction`.
      - Return wrapped udf from `catalog.registerFunction` and dependent methods (see the sketch below).
      - Update docstrings in `catalog.registerFunction` and `SQLContext.registerFunction`.
      - Unit tests.
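      
      As a rough sketch of the registration behavior described above (assuming an active `SparkSession` bound to `spark` and a build that includes this change; the function name is illustrative):
      
      ```python
      from pyspark.sql.types import IntegerType
      
      # After this change, registerFunction returns the wrapped UDF, so the same
      # function can be used both from SQL and from the DataFrame API.
      plus_one = spark.catalog.registerFunction("plus_one", lambda x: x + 1, IntegerType())
      
      spark.sql("SELECT plus_one(41)").show()
      spark.range(3).select(plus_one("id")).show()
      ```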
      
      ## How was this patch tested?
      
      - Existing unit tests and doctests.
      - Additional tests covering new feature.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #17831 from zero323/SPARK-18777.
      63d90e7d
  2. Mar 26, 2017
  3. Mar 20, 2017
    • [SPARK-19849][SQL] Support ArrayType in to_json to produce JSON array · 0cdcf911
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support an array of struct type in `to_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      
      val df = Seq(Tuple1(Tuple1(1) :: Nil)).toDF("a")
      df.select(to_json($"a").as("json")).show()
      ```
      
      ```
      +----------+
      |      json|
      +----------+
      |[{"_1":1}]|
      +----------+
      ```
      
      Currently, it throws an exception as below (a newline manually inserted for readability):
      
      ```
      org.apache.spark.sql.AnalysisException: cannot resolve 'structtojson(`array`)' due to data type
      mismatch: structtojson requires that the expression is a struct expression.;;
      ```
      
      This allows the roundtrip with `from_json` as below:
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      
      val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
      val df = Seq("""[{"a":1}, {"a":2}]""").toDF("json").select(from_json($"json", schema).as("array"))
      df.show()
      
      // Read back.
      df.select(to_json($"array").as("json")).show()
      ```
      
      ```
      +----------+
      |     array|
      +----------+
      |[[1], [2]]|
      +----------+
      
      +-----------------+
      |             json|
      +-----------------+
      |[{"a":1},{"a":2}]|
      +-----------------+
      ```
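      
      A rough PySpark equivalent of the roundtrip above (assuming an active `SparkSession` bound to `spark` and a build that includes this change together with SPARK-19595):
      
      ```python
      from pyspark.sql.functions import col, from_json, to_json
      from pyspark.sql.types import ArrayType, IntegerType, StructField, StructType
      
      schema = ArrayType(StructType([StructField("a", IntegerType())]))
      df = (spark.createDataFrame([('[{"a": 1}, {"a": 2}]',)], ["json"])
            .select(from_json(col("json"), schema).alias("array")))
      
      df.show()
      # Write the parsed array back to a JSON string.
      df.select(to_json(col("array")).alias("json")).show(truncate=False)
      ```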
      
      Also, this PR proposes to rename `StructToJson` to `StructsToJson` and `JsonToStruct` to `JsonToStructs`.
      
      ## How was this patch tested?
      
      Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite` for Scala, doctest for Python and test in `test_sparkSQL.R` for R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17192 from HyukjinKwon/SPARK-19849.
      0cdcf911
  4. Mar 05, 2017
    • [SPARK-19595][SQL] Support json array in from_json · 369a148e
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes the following two changes:
      
      **Do not allow json arrays with multiple elements and return null in `from_json` with `StructType` as the schema.**
      
      Currently, only a single row is read when the input is a JSON array, so the code below:
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      val schema = StructType(StructField("a", IntegerType) :: Nil)
      Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("struct").select(from_json(col("struct"), schema)).show()
      ```
      prints
      
      ```
      +--------------------+
      |jsontostruct(struct)|
      +--------------------+
      |                 [1]|
      +--------------------+
      ```
      
      This PR simply suggests returning `null` when the schema is `StructType` and the input is a JSON array with multiple elements:
      
      ```
      +--------------------+
      |jsontostruct(struct)|
      +--------------------+
      |                null|
      +--------------------+
      ```
      
      **Support json arrays in `from_json` with `ArrayType` as the schema.**
      
      ```scala
      import org.apache.spark.sql.functions._
      import org.apache.spark.sql.types._
      val schema = ArrayType(StructType(StructField("a", IntegerType) :: Nil))
      Seq(("""[{"a": 1}, {"a": 2}]""")).toDF("array").select(from_json(col("array"), schema)).show()
      ```
      
      prints
      
      ```
      +-------------------+
      |jsontostruct(array)|
      +-------------------+
      |         [[1], [2]]|
      +-------------------+
      ```
      
      ## How was this patch tested?
      
      Unit test in `JsonExpressionsSuite`, `JsonFunctionsSuite`, Python doctests and manual test.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16929 from HyukjinKwon/disallow-array.
      369a148e
  5. Feb 24, 2017
    • [SPARK-19161][PYTHON][SQL] Improving UDF Docstrings · 4a5e38f5
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Replaces the `UserDefinedFunction` object returned from `udf` with a function wrapper providing docstring and argument information, as proposed in [SPARK-19161](https://issues.apache.org/jira/browse/SPARK-19161).
      
      ### Backward incompatible changes:
      
      - `pyspark.sql.functions.udf` will return a `function` instead of a `UserDefinedFunction`. To keep the public API backward compatible, we use function attributes to mimic the `UserDefinedFunction` API (`func` and `returnType` attributes); see the sketch below. This should have a minimal impact on user code.
      
        An alternative implementation could use dynamic subclassing. This would ensure full backward compatibility but is more fragile in practice.
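      
      A minimal sketch of the compatibility surface described above (the function and names here are illustrative):
      
      ```python
      from pyspark.sql.functions import udf
      from pyspark.sql.types import IntegerType
      
      def add_one(x):
          """Adds one to its argument."""
          return x + 1 if x is not None else None
      
      add_one_udf = udf(add_one, IntegerType())
      
      add_one_udf.__doc__      # docstring preserved by the wrapper
      add_one_udf.func         # the wrapped Python callable
      add_one_udf.returnType   # IntegerType()
      ```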
      
      ### Limitations:
      
      Full functionality (retained docstring and argument list) is achieved only on recent Python versions. Legacy Python versions preserve only the docstring, not the argument list. This should be an acceptable trade-off between the achieved improvements and the overall complexity.
      
      ### Possible impact on other tickets:
      
      This can affect [SPARK-18777](https://issues.apache.org/jira/browse/SPARK-18777).
      
      ## How was this patch tested?
      
      Existing unit tests to ensure backward compatibility, additional tests targeting proposed changes.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16534 from zero323/SPARK-19161.
      4a5e38f5
  6. Feb 15, 2017
    • [SPARK-19160][PYTHON][SQL] Add udf decorator · c97f4e17
      zero323 authored
      ## What changes were proposed in this pull request?
      
      This PR adds `udf` decorator syntax as proposed in [SPARK-19160](https://issues.apache.org/jira/browse/SPARK-19160).
      
      This allows users to define UDF using simplified syntax:
      
      ```python
      from pyspark.sql.decorators import udf
      from pyspark.sql.types import IntegerType
      
      @udf(IntegerType())
      def add_one(x):
          """Adds one"""
          if x is not None:
              return x + 1
      ```
      
      without the need to define the function and the udf separately.
      
      ## How was this patch tested?
      
      Existing unit tests to ensure backward compatibility and additional unit tests covering new functionality.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16533 from zero323/SPARK-19160.
      c97f4e17
  7. Feb 14, 2017
  8. Feb 13, 2017
    • [SPARK-19427][PYTHON][SQL] Support data type string as a returnType argument of UDF · ab88b241
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add support for data type string as a return type argument of `UserDefinedFunction`:
      
      ```python
      f = udf(lambda x: x, "integer")
      f.returnType
      
      ## IntegerType
      ```
      
      ## How was this patch tested?
      
      Existing unit tests, additional unit tests covering new feature.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16769 from zero323/SPARK-19427.
      ab88b241
  9. Feb 07, 2017
    • [SPARK-16609] Add to_date/to_timestamp with format functions · 7a7ce272
      anabranch authored
      ## What changes were proposed in this pull request?
      
      This pull request adds two new user facing functions:
      - `to_date` which accepts an expression and a format and returns a date.
      - `to_timestamp` which accepts an expression and a format and returns a timestamp.
      
      For example, given a date in the format `2016-21-05` (yyyy-dd-MM):
      
      ### Date Function
      *Previously*
      ```
      to_date(unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp"))
      ```
      *Current*
      ```
      to_date(lit("2016-21-05"), "yyyy-dd-MM")
      ```
      
      ### Timestamp Function
      *Previously*
      ```
      unix_timestamp(lit("2016-21-05"), "yyyy-dd-MM").cast("timestamp")
      ```
      *Current*
      ```
      to_timestamp(lit("2016-21-05"), "yyyy-dd-MM")
      ```
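      
      For illustration, a rough sketch of the Python counterparts added by this PR (assuming an active `SparkSession` bound to `spark`; the Python signatures are assumed to mirror the Scala ones):
      
      ```python
      from pyspark.sql.functions import lit, to_date, to_timestamp
      
      df = spark.range(1)
      df.select(to_date(lit("2016-21-05"), "yyyy-dd-MM").alias("d"),
                to_timestamp(lit("2016-21-05"), "yyyy-dd-MM").alias("ts")).show()
      ```
      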
      ### Tasks
      
      - [X] Add `to_date` to Scala Functions
      - [x] Add `to_date` to Python Functions
      - [x] Add `to_date` to SQL Functions
      - [X] Add `to_timestamp` to Scala Functions
      - [x] Add `to_timestamp` to Python Functions
      - [x] Add `to_timestamp` to SQL Functions
      - [x] Add function to R
      
      ## How was this patch tested?
      
      - [x] Add Functions to `DateFunctionsSuite`
      - Test new `ParseToTimestamp` Expression (*not necessary*)
      - Test new `ParseToDate` Expression (*not necessary*)
      - [x] Add test for R
      - [x] Add test for Python in test.py
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: anabranch <wac.chambers@gmail.com>
      Author: Bill Chambers <bill@databricks.com>
      Author: anabranch <bill@databricks.com>
      
      Closes #16138 from anabranch/SPARK-16609.
      7a7ce272
  10. Jan 31, 2017
  11. Jan 12, 2017
    • [SPARK-19164][PYTHON][SQL] Remove unused UserDefinedFunction._broadcast · 5db35b31
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Removes `UserDefinedFunction._broadcast` and `UserDefinedFunction.__del__` method.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16538 from zero323/SPARK-19164.
      5db35b31
  12. Jan 08, 2017
    • [SPARK-19127][DOCS] Update Rank Function Documentation · 1f6ded64
      anabranch authored
      ## What changes were proposed in this pull request?
      
      - [X] Fix inconsistencies in function reference for dense rank and dense
      - [X] Make all languages equivalent in their reference to `dense_rank` and `rank`.
      
      ## How was this patch tested?
      
      N/A for docs.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: anabranch <wac.chambers@gmail.com>
      
      Closes #16505 from anabranch/SPARK-19127.
      1f6ded64
  13. Nov 22, 2016
  14. Nov 05, 2016
  15. Nov 04, 2016
    • [SPARK-14393][SQL][DOC] update doc for python and R · a08463b1
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      minor doc update that should go to master & branch-2.1
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15747 from felixcheung/pySPARK-14393.
      a08463b1
  16. Nov 01, 2016
    • [SPARK-17764][SQL] Add `to_json` supporting to convert nested struct column to JSON string · 01dd0083
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to add a `to_json` function, the counterpart of `from_json`, in Scala, Java and Python.
      
      It'd be useful if we could convert the same column to and from JSON. Also, some data sources do not support nested types; if we are forced to save a dataframe into such a data source, this function can serve as a workaround.
      
      The usage is as below:
      
      ``` scala
      val df = Seq(Tuple1(Tuple1(1))).toDF("a")
      df.select(to_json($"a").as("json")).show()
      ```
      
      ``` bash
      +--------+
      |    json|
      +--------+
      |{"_1":1}|
      +--------+
      ```
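      
      A hedged sketch of the equivalent usage from Python, which this PR also adds (assuming an active `SparkSession` bound to `spark`):
      
      ```python
      from pyspark.sql.functions import lit, struct, to_json
      
      df = spark.range(1).select(struct(lit(1).alias("_1")).alias("a"))
      # Expected to produce the same {"_1":1} output as the Scala example above.
      df.select(to_json(df.a).alias("json")).show()
      ```
      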
      ## How was this patch tested?
      
      Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15354 from HyukjinKwon/SPARK-17764.
      01dd0083
  17. Oct 07, 2016
  18. Sep 29, 2016
    • [SPARK-17699] Support for parsing JSON string columns · fe33121a
      Michael Armbrust authored
      Spark SQL has great support for reading text files that contain JSON data. However, in many cases the JSON data is just one column amongst others. This is particularly true when reading from sources such as Kafka. This PR adds a new function, `from_json`, that converts a string column into a nested `StructType` with a user-specified schema.
      
      Example usage:
      ```scala
      val df = Seq("""{"a": 1}""").toDS()
      val schema = new StructType().add("a", IntegerType)
      
      df.select(from_json($"value", schema) as 'json) // => [json: <a: int>]
      ```
      
      This PR adds support for java, scala and python.  I leveraged our existing JSON parsing support by moving it into catalyst (so that we could define expressions using it).  I left SQL out for now, because I'm not sure how users would specify a schema.
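      
      A sketch of what the Python usage added here looks like (assuming an active `SparkSession` bound to `spark`):
      
      ```python
      from pyspark.sql.functions import from_json
      from pyspark.sql.types import IntegerType, StructType
      
      schema = StructType().add("a", IntegerType())
      df = spark.createDataFrame([('{"a": 1}',)], ["value"])
      df.select(from_json(df.value, schema).alias("json")).show()
      ```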
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #15274 from marmbrus/jsonParser.
      fe33121a
  19. Aug 25, 2016
  20. Aug 10, 2016
  21. Aug 07, 2016
    • [SPARK-16409][SQL] regexp_extract with optional groups causes NPE · 8d872520
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      regexp_extract actually returns null when it shouldn't: when the regex matches but the requested optional group does not participate in the match. This change makes it return an empty string instead, as apparently designed.
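      
      A small illustration of the fixed behavior (assuming an active `SparkSession` bound to `spark`; the regex and input are made up for this sketch):
      
      ```python
      from pyspark.sql.functions import regexp_extract
      
      df = spark.createDataFrame([("aaaac",)], ["s"])
      # Group 2 is optional and does not participate in the match; with this fix
      # the result is an empty string rather than null.
      df.select(regexp_extract(df.s, "(a+)(b)?(c)", 2).alias("g2")).show()
      ```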
      
      ## How was this patch tested?
      
      Additional unit test
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14504 from srowen/SPARK-16409.
      8d872520
  22. Jul 28, 2016
  23. Jul 06, 2016
  24. Jun 30, 2016
    • [SPARK-16289][SQL] Implement posexplode table generating function · 46395db8
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR implements the `posexplode` table generating function. Currently, the master branch raises the following exception for a `map` argument, which differs from Hive.
      
      **Before**
      ```scala
      scala> sql("select posexplode(map('a', 1, 'b', 2))").show
      org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
      ```
      
      **After**
      ```scala
      scala> sql("select posexplode(map('a', 1, 'b', 2))").show
      +---+---+-----+
      |pos|key|value|
      +---+---+-----+
      |  0|  a|    1|
      |  1|  b|    2|
      +---+---+-----+
      ```
      
      For an `array` argument, the behavior after this change is the same as before.
      ```
      scala> sql("select posexplode(array(1, 2, 3))").show
      +---+---+
      |pos|col|
      +---+---+
      |  0|  1|
      |  1|  2|
      |  2|  3|
      +---+---+
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with newly added testcases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13971 from dongjoon-hyun/SPARK-16289.
      46395db8
  25. May 27, 2016
    • [MINOR] Fix Typos 'a -> an' · 6b1a6180
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      `a` -> `an`
      
      I used a regex to find potentially erroneous lines:
      `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
      and reviewed them line by line.
      
      ## How was this patch tested?
      
      local build
      `lint-java` checking
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13317 from zhengruifeng/a_an.
      6b1a6180
  26. May 24, 2016
    • [SPARK-15397][SQL] fix string udf locate as hive · d642b273
      Daoyuan Wang authored
      ## What changes were proposed in this pull request?
      
      In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1, and `locate("aa", "aaa", 2)` yields 2, while in Spark, `locate("aa", "aaa", 0)` yields 1, `locate("aa", "aaa", 1)` yields 2, and `locate("aa", "aaa", 2)` yields 0. This results from a different understanding of the third parameter of the `locate` UDF: it is the starting index and starts from 1, so a start of 0 always returns 0.
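      
      A short PySpark illustration of the corrected semantics (assuming an active `SparkSession` bound to `spark`):
      
      ```python
      from pyspark.sql.functions import locate
      
      df = spark.createDataFrame([("aaa",)], ["s"])
      # The third argument is a 1-based start position; with this fix a start of 0
      # returns 0, matching Hive.
      df.select(locate("aa", df.s, 0).alias("p0"),
                locate("aa", df.s, 1).alias("p1"),
                locate("aa", df.s, 2).alias("p2")).show()
      ```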
      
      ## How was this patch tested?
      
      tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #13186 from adrian-wang/locate.
      d642b273
  27. May 23, 2016
    • [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code · a15ca553
      WeichenXu authored
      
      ## What changes were proposed in this pull request?
      
      Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
      a15ca553
    • [MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions · 37c617e4
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.
      
      ## How was this patch tested?
      
      It's only about docs.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13087 from dongjoon-hyun/SPARK-15282.
      37c617e4
  28. Apr 20, 2016
    • [SPARK-14639] [PYTHON] [R] Add `bround` function in Python/R. · 14869ae6
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This issue aims to expose the Scala `bround` function in the Python/R API.
      The `bround` function was implemented in SPARK-14614 by extending the current `round` function.
      We used the following semantics from Hive:
      ```java
      public static double bround(double input, int scale) {
          if (Double.isNaN(input) || Double.isInfinite(input)) {
            return input;
          }
          return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
      }
      ```
      
      After this PR, `pyspark` and `sparkR` also support `bround` function.
      
      **PySpark**
      ```python
      >>> from pyspark.sql.functions import bround
      >>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
      [Row(r=2.0)]
      ```
      
      **SparkR**
      ```r
      > df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
      > head(collect(select(df, bround(df$x, 0))))
        bround(x, 0)
      1            2
      2            4
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12509 from dongjoon-hyun/SPARK-14639.
      14869ae6
  29. Apr 05, 2016
    • [SPARK-14353] Dataset Time Window `window` API for Python, and SQL · 9ee5c257
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
      This PR adds the Python, and SQL, API for this function.
      
      With this PR, SQL, Java, and Scala will share the same APIs; users can use:
       - `window(timeColumn, windowDuration)`
       - `window(timeColumn, windowDuration, slideDuration)`
       - `window(timeColumn, windowDuration, slideDuration, startTime)`
      
      In Python, users can access all of the APIs above, but in addition they can call
       - `window(timeColumn, windowDuration, startTime=...)`
      
      that is, they can provide `startTime` without providing `slideDuration`. In this case, we will generate tumbling windows.
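      
      A sketch of this Python-only form (assuming an active `SparkSession` bound to `spark`; the sample data is made up):
      
      ```python
      from pyspark.sql.functions import count, window
      
      df = (spark.createDataFrame([("2016-03-11 09:00:07", 1)], ["ts", "v"])
            .selectExpr("CAST(ts AS timestamp) AS ts", "v"))
      
      # Tumbling 10-minute windows whose boundaries are shifted by 5 minutes.
      df.groupBy(window(df.ts, "10 minutes", startTime="5 minutes")) \
        .agg(count("*").alias("n")) \
        .show(truncate=False)
      ```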
      
      ## How was this patch tested?
      
      Unit tests + manual tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12136 from brkyvz/python-windows.
      9ee5c257
  30. Mar 31, 2016
    • [SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single batch · f0afafdc
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR supports running multiple Python UDFs within a single batch, and also improves performance.
      
      ```python
      >>> from pyspark.sql.types import IntegerType
      >>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
      >>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
      >>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
      == Parsed Logical Plan ==
      'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
      +- OneRowRelation$
      
      == Analyzed Logical Plan ==
      double(add(1, 2)): int, add(double(2), 1): int
      Project [double(add(1, 2))#14,add(double(2), 1)#15]
      +- Project [double(add(1, 2))#14,add(double(2), 1)#15]
         +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
            +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
               +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
                  +- OneRowRelation$
      
      == Optimized Logical Plan ==
      Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
         +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- OneRowRelation$
      
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      :     +- INPUT
      +- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
         +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- Scan OneRowRelation[]
      ```
      
      ## How was this patch tested?
      
      Added new tests.
      
      Using the following script to benchmark 1, 2 and 3 udfs,
      ```
      df = sqlContext.range(1, 1 << 23, 1, 4)
      double = F.udf(lambda x: x * 2, LongType())
      print df.select(double(df.id)).count()
      print df.select(double(df.id), double(df.id + 1)).count()
      print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
      ```
      Here is the results:
      
      N | Before | After  | speed up
      ---- |------------ | -------------|------
      1 | 22 s | 7 s |  3.1X
      2 | 38 s | 13 s | 2.9X
      3 | 58 s | 16 s | 3.6X
      
      This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python processes before this patch and 4 processes after it. After this patch, multiple UDFs will use less memory than before (less buffering).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12057 from davies/multi_udfs.
      f0afafdc
  31. Mar 29, 2016
    • [SPARK-14215] [SQL] [PYSPARK] Support chained Python UDFs · a7a93a11
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR brings the support for chained Python UDFs, for example
      
      ```sql
      select udf1(udf2(a))
      select udf1(udf2(a) + 3)
      select udf1(udf2(a) + udf3(b))
      ```
      
      Also, directly chained unary Python UDFs are put into a single batch of Python UDFs; others may require multiple batches.
      
      For example,
      ```python
      >>> sqlContext.sql("select double(double(1))").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [pythonUDF#10 AS double(double(1))#9]
      :     +- INPUT
      +- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
         +- Scan OneRowRelation[]
      >>> sqlContext.sql("select double(double(1) + double(2))").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
      :     +- INPUT
      +- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
         +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
            +- !BatchPythonEvaluation double(1), [pythonUDF#17]
               +- Scan OneRowRelation[]
      ```
      
      TODO: will support multiple unrelated Python UDFs in one batch (another PR).
      
      ## How was this patch tested?
      
      Added new unit tests for chained UDFs.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12014 from davies/py_udfs.
      a7a93a11
  32. Mar 25, 2016
    • [SPARK-14061][SQL] implement CreateMap · 43b15e01
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`. This PR adds the `CreateMap` expression along with its DataFrame API and Python API.
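      
      A minimal sketch of the Python side (assuming it is exposed as `pyspark.sql.functions.create_map` and an active `SparkSession` bound to `spark`):
      
      ```python
      from pyspark.sql.functions import create_map, lit
      
      df = spark.range(1)
      df.select(create_map(lit("a"), lit(1), lit("b"), lit(2)).alias("m")).show()
      ```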
      
      ## How was this patch tested?
      
      various new tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11879 from cloud-fan/create_map.
      43b15e01
  33. Mar 09, 2016
    • [MINOR] Fix typo in 'hypot' docstring · 5f7dbdba
      Tristan Reid authored
      Minor typo: the docstring for `pyspark.sql.functions.hypot` has extra characters.
      
      N/A
      
      Author: Tristan Reid <treid@netflix.com>
      
      Closes #11616 from tristanreid/master.
      5f7dbdba
  34. Mar 05, 2016
    • [SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets · adce5ee7
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      This PR is for supporting SQL generation for cube, rollup and grouping sets.
      
      For example, a query using rollup:
      ```SQL
      SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
      ```
      Original logical plan:
      ```
        Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
                  [(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
                   (key#17L % cast(5 as bigint))#47L AS _c1#45L,
                   grouping__id#46 AS _c2#44]
        +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
                   List(key#17L, value#18, null, 1)],
                  [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
           +- Project [key#17L,
                       value#18,
                       (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
              +- Subquery t1
                 +- Relation[key#17L,value#18] ParquetRelation
      ```
      Converted SQL:
      ```SQL
        SELECT count( 1) AS `cnt`,
               (`t1`.`key` % CAST(5 AS BIGINT)),
               grouping_id() AS `_c2`
        FROM `default`.`t1`
        GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
        GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
      ```
      
      #### How was this patch tested?
      
      Added eight test cases in `LogicalPlanToSQLSuite`.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #11283 from gatorsmile/groupingSetsToSQL.
      adce5ee7
  35. Mar 02, 2016
  36. Feb 24, 2016
    • [SPARK-13467] [PYSPARK] abstract python function to simplify pyspark code · a60f9128
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When we pass a Python function to the JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass around so many parameters in many places. This PR abstracts the Python function along with its context, to simplify some PySpark code and make the logic clearer.
      
      ## How was this patch tested?
      
      by existing unit tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11342 from cloud-fan/python-clean.
      a60f9128
  37. Feb 22, 2016
  38. Feb 21, 2016
    • [SPARK-12799] Simplify various string output for expressions · d9efe63e
      Cheng Lian authored
      This PR introduces several major changes:
      
      1. Replacing `Expression.prettyString` with `Expression.sql`
      
         The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
      
       1. Using SQL-like representations as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)
      
         Before, we were using `prettyString` as column names when possible, and sometimes the result column names can be weird.  Here are several examples:
      
         Expression         | `prettyString` | `sql`      | Note
         ------------------ | -------------- | ---------- | ---------------
         `a && b`           | `a && b`       | `a AND b`  |
         `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct
      
      1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
      
         `NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
      d9efe63e
  39. Feb 20, 2016