  1. Aug 25, 2016
  2. Aug 10, 2016
  3. Aug 07, 2016
    • [SPARK-16409][SQL] regexp_extract with optional groups causes NPE · 8d872520
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      regexp_extract returns null when it shouldn't: when the regex matches but the requested optional group does not. This change makes it return an empty string instead, as apparently designed.
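      A hedged illustration of the intended behavior (the pattern and input below are made up, not taken from the PR):
      ```python
      >>> # Group 1 '(a+)' matches 'aaa', but the optional group 2 '(b)?' matches nothing,
      >>> # so after this change the result should be an empty string rather than null.
      >>> sqlContext.sql("SELECT regexp_extract('aaa', '(a+)(b)?', 2)").show()
      ```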
      
      ## How was this patch tested?
      
      Additional unit test
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14504 from srowen/SPARK-16409.
      8d872520
  4. Jul 28, 2016
  5. Jul 06, 2016
  6. Jun 30, 2016
    • [SPARK-16289][SQL] Implement posexplode table generating function · 46395db8
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR implements the `posexplode` table-generating function. Currently, the master branch raises the following exception for a `map` argument, which differs from Hive.
      
      **Before**
      ```scala
      scala> sql("select posexplode(map('a', 1, 'b', 2))").show
      org.apache.spark.sql.AnalysisException: No handler for Hive UDF ... posexplode() takes an array as a parameter; line 1 pos 7
      ```
      
      **After**
      ```scala
      scala> sql("select posexplode(map('a', 1, 'b', 2))").show
      +---+---+-----+
      |pos|key|value|
      +---+---+-----+
      |  0|  a|    1|
      |  1|  b|    2|
      +---+---+-----+
      ```
      
      For an `array` argument, the behavior after this change is the same as before.
      ```scala
      scala> sql("select posexplode(array(1, 2, 3))").show
      +---+---+
      |pos|col|
      +---+---+
      |  0|  1|
      |  1|  2|
      |  2|  3|
      +---+---+
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with newly added testcases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13971 from dongjoon-hyun/SPARK-16289.
      46395db8
  7. May 27, 2016
    • [MINOR] Fix Typos 'a -> an' · 6b1a6180
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      
      `a` -> `an`
      
      I used a regex to find potential error lines:
      `grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
      and reviewed them line by line.
      
      ## How was this patch tested?
      
      Local build and `lint-java` checks.
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #13317 from zhengruifeng/a_an.
      6b1a6180
  8. May 24, 2016
    • [SPARK-15397][SQL] fix string udf locate as hive · d642b273
      Daoyuan Wang authored
      ## What changes were proposed in this pull request?
      
      In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1, and `locate("aa", "aaa", 2)` yields 2, while in Spark `locate("aa", "aaa", 0)` yields 1, `locate("aa", "aaa", 1)` yields 2, and `locate("aa", "aaa", 2)` yields 0. This results from a different understanding of the third parameter of the `locate` UDF: it is the starting index and starts from 1, so passing 0 should always return 0.
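      A hedged illustration of the Hive-compatible behavior after this change (expected output in comments):
      ```python
      >>> # The third argument is a 1-based start position, so 0 should always yield 0.
      >>> sqlContext.sql("SELECT locate('aa', 'aaa', 0), locate('aa', 'aaa', 1), locate('aa', 'aaa', 2)").show()
      >>> # expected: 0, 1, 2
      ```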
      
      ## How was this patch tested?
      
      tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #13186 from adrian-wang/locate.
      d642b273
  9. May 23, 2016
    • [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with... · a15ca553
      WeichenXu authored
      [SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code
      
      ## What changes were proposed in this pull request?
      
      Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.
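      A minimal sketch of the builder pattern referred to above (the master and appName values are illustrative only):
      ```python
      >>> from pyspark.sql import SparkSession
      >>> spark = SparkSession.builder.master("local[2]").appName("python-doctest").getOrCreate()
      >>> sc = spark.sparkContext  # replaces a directly constructed SparkContext
      ```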
      
      ## How was this patch tested?
      
      Existing test.
      
      Author: WeichenXu <WeichenXu123@outlook.com>
      
      Closes #13242 from WeichenXu123/python_doctest_update_sparksession.
      a15ca553
    • [MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions · 37c617e4
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.
      
      ## How was this patch tested?
      
      It's only about docs.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13087 from dongjoon-hyun/SPARK-15282.
      37c617e4
  10. Apr 20, 2016
    • [SPARK-14639] [PYTHON] [R] Add `bround` function in Python/R. · 14869ae6
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This issue aims to expose Scala `bround` function in Python/R API.
      `bround` function is implemented in SPARK-14614 by extending current `round` function.
      We used the following semantics from Hive.
      ```java
      public static double bround(double input, int scale) {
          if (Double.isNaN(input) || Double.isInfinite(input)) {
            return input;
          }
          return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
      }
      ```
      
      After this PR, `pyspark` and `sparkR` also support `bround` function.
      
      **PySpark**
      ```python
      >>> from pyspark.sql.functions import bround
      >>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
      [Row(r=2.0)]
      ```
      
      **SparkR**
      ```r
      > df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
      > head(collect(select(df, bround(df$x, 0))))
        bround(x, 0)
      1            2
      2            4
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests (including new testcases).
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12509 from dongjoon-hyun/SPARK-14639.
      14869ae6
  11. Apr 05, 2016
    • [SPARK-14353] Dataset Time Window `window` API for Python, and SQL · 9ee5c257
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
      This PR adds the Python and SQL APIs for this function.
      
      With this PR, SQL, Java, and Scala share the same APIs, in that users can call:
       - `window(timeColumn, windowDuration)`
       - `window(timeColumn, windowDuration, slideDuration)`
       - `window(timeColumn, windowDuration, slideDuration, startTime)`
      
      In Python, users can access all of the APIs above, and in addition they can call
       - `window(timeColumn, windowDuration, startTime=...)`

      that is, they can provide the `startTime` without providing the `slideDuration`. In this case, tumbling windows are generated.
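      A hedged usage sketch of the Python API described above (the DataFrame `df` and its timestamp column `ts` are assumed for illustration):
      ```python
      >>> from pyspark.sql.functions import window
      >>> # Tumbling 10-minute windows offset by 5 minutes; slideDuration is intentionally omitted.
      >>> df.groupBy(window(df.ts, "10 minutes", startTime="5 minutes")).count().show()
      ```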
      
      ## How was this patch tested?
      
      Unit tests + manual tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #12136 from brkyvz/python-windows.
      9ee5c257
  12. Mar 31, 2016
    • [SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single batch · f0afafdc
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR supports multiple Python UDFs within a single batch and also improves performance.
      
      ```python
      >>> from pyspark.sql.types import IntegerType
      >>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
      >>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
      >>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
      == Parsed Logical Plan ==
      'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
      +- OneRowRelation$
      
      == Analyzed Logical Plan ==
      double(add(1, 2)): int, add(double(2), 1): int
      Project [double(add(1, 2))#14,add(double(2), 1)#15]
      +- Project [double(add(1, 2))#14,add(double(2), 1)#15]
         +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
            +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
               +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
                  +- OneRowRelation$
      
      == Optimized Logical Plan ==
      Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
         +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- OneRowRelation$
      
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      :     +- INPUT
      +- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
         +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- Scan OneRowRelation[]
      ```
      
      ## How was this patch tested?
      
      Added new tests.
      
      Using the following script to benchmark 1, 2 and 3 udfs,
      ```python
      from pyspark.sql import functions as F
      from pyspark.sql.types import LongType

      df = sqlContext.range(1, 1 << 23, 1, 4)
      double = F.udf(lambda x: x * 2, LongType())
      print df.select(double(df.id)).count()
      print df.select(double(df.id), double(df.id + 1)).count()
      print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
      ```
      Here are the results:
      
      N | Before | After  | speed up
      ---- |------------ | -------------|------
      1 | 22 s | 7 s |  3.1X
      2 | 38 s | 13 s | 2.9X
      3 | 58 s | 16 s | 3.6X
      
      This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python processes before this patch and 4 processes after it. After this patch, it also uses less memory for multiple UDFs than before (less buffering).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12057 from davies/multi_udfs.
      f0afafdc
  13. Mar 29, 2016
    • [SPARK-14215] [SQL] [PYSPARK] Support chained Python UDFs · a7a93a11
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR brings the support for chained Python UDFs, for example
      
      ```sql
      select udf1(udf2(a))
      select udf1(udf2(a) + 3)
      select udf1(udf2(a) + udf3(b))
      ```
      
      Directly chained unary Python UDFs are put into a single batch of Python UDFs; others may require multiple batches.
      
      For example,
      ```python
      >>> sqlContext.sql("select double(double(1))").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [pythonUDF#10 AS double(double(1))#9]
      :     +- INPUT
      +- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
         +- Scan OneRowRelation[]
      >>> sqlContext.sql("select double(double(1) + double(2))").explain()
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
      :     +- INPUT
      +- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
         +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
            +- !BatchPythonEvaluation double(1), [pythonUDF#17]
               +- Scan OneRowRelation[]
      ```
      
      TODO: will support multiple unrelated Python UDFs in one batch (another PR).
      
      ## How was this patch tested?
      
      Added new unit tests for chained UDFs.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #12014 from davies/py_udfs.
      a7a93a11
  14. Mar 25, 2016
    • [SPARK-14061][SQL] implement CreateMap · 43b15e01
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`. This PR adds the `CreateMap` expression, along with the corresponding DataFrame and Python APIs.
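      A hedged sketch of how this might be used from the Python API (assuming the function is exposed as `create_map`; the DataFrame below is illustrative):
      ```python
      >>> from pyspark.sql.functions import create_map, lit
      >>> df = sqlContext.range(2)
      >>> # Alternating key/value expressions build a map column.
      >>> df.select(create_map(lit('a'), df.id, lit('b'), df.id + 1).alias('m')).show()
      ```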
      
      ## How was this patch tested?
      
      various new tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11879 from cloud-fan/create_map.
      43b15e01
  15. Mar 09, 2016
    • [MINOR] Fix typo in 'hypot' docstring · 5f7dbdba
      Tristan Reid authored
      Minor typo: the docstring for `pyspark.sql.functions.hypot` has extra characters.
      
      N/A
      
      Author: Tristan Reid <treid@netflix.com>
      
      Closes #11616 from tristanreid/master.
      5f7dbdba
  16. Mar 05, 2016
    • [SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets · adce5ee7
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      This PR is for supporting SQL generation for cube, rollup and grouping sets.
      
      For example, a query using rollup:
      ```SQL
      SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
      ```
      Original logical plan:
      ```
        Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
                  [(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
                   (key#17L % cast(5 as bigint))#47L AS _c1#45L,
                   grouping__id#46 AS _c2#44]
        +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
                   List(key#17L, value#18, null, 1)],
                  [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
           +- Project [key#17L,
                       value#18,
                       (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
              +- Subquery t1
                 +- Relation[key#17L,value#18] ParquetRelation
      ```
      Converted SQL:
      ```SQL
        SELECT count( 1) AS `cnt`,
               (`t1`.`key` % CAST(5 AS BIGINT)),
               grouping_id() AS `_c2`
        FROM `default`.`t1`
        GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
        GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
      ```
      
      #### How was this patch tested?
      
      Added eight test cases in `LogicalPlanToSQLSuite`.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #11283 from gatorsmile/groupingSetsToSQL.
      adce5ee7
  17. Mar 02, 2016
  18. Feb 24, 2016
    • [SPARK-13467] [PYSPARK] abstract python function to simplify pyspark code · a60f9128
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      When we pass a Python function to the JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it is annoying to pass around so many parameters in so many places. This PR abstracts the Python function together with its context, to simplify some PySpark code and make the logic clearer.
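      A rough sketch of the idea (the names below are illustrative only, not the actual Spark classes): bundle the serialized command and its execution context into one value instead of threading the parameters through every call site.
      ```python
      from collections import namedtuple

      # Illustrative only: one object carries the function payload plus its context.
      PythonFunction = namedtuple("PythonFunction", [
          "command",          # pickled function bytes
          "env_vars",
          "python_includes",
          "python_exec",
          "python_ver",
          "broadcast_vars",
          "accumulator",
      ])
      ```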
      
      ## How was this patch tested?
      
      by existing unit tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11342 from cloud-fan/python-clean.
      a60f9128
  19. Feb 22, 2016
  20. Feb 21, 2016
    • [SPARK-12799] Simplify various string output for expressions · d9efe63e
      Cheng Lian authored
      This PR introduces several major changes:
      
      1. Replacing `Expression.prettyString` with `Expression.sql`
      
         The `prettyString` method is mostly an internal, developer-facing facility for debugging purposes, and shouldn't be exposed to users.
      
      1. Using a SQL-like representation as column names for selected fields that are not named expressions (back-ticks and double quotes should be removed)
      
         Before, we were using `prettyString` for column names when possible, and sometimes the resulting column names could be weird. Here are several examples:
      
         Expression         | `prettyString` | `sql`      | Note
         ------------------ | -------------- | ---------- | ---------------
         `a && b`           | `a && b`       | `a AND b`  |
         `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct
      
      1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)
      
         `NonSQLExpression.sql` may return an arbitrary user-facing string representation of the expression.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
      d9efe63e
  21. Feb 20, 2016
  22. Feb 13, 2016
    • [SPARK-13296][SQL] Move UserDefinedFunction into sql.expressions. · 354d4c24
      Reynold Xin authored
      This pull request has the following changes:
      
      1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.
      
      2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package.
      
      3. Move everything in execution/python.scala into the newly created execution.python package.
      
      Most of the diffs are just straight copy-paste.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11181 from rxin/SPARK-13296.
      354d4c24
  23. Feb 12, 2016
  24. Feb 10, 2016
    • [SPARK-12706] [SQL] grouping() and grouping_id() · b5761d15
      Davies Liu authored
      grouping() returns whether a column is aggregated or not; grouping_id() returns the aggregation level.

      grouping()/grouping_id() can be used with window functions, but do not yet work in HAVING/ORDER BY clauses; that will be fixed by another PR.

      The GROUPING__ID/grouping_id() in Hive is wrong (according to its own docs), and we also implemented it wrongly; this PR changes that to match the behavior in most databases (and the Hive docs).
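      A hedged example of the intended semantics (the table and columns are made up for illustration):
      ```python
      >>> # grouping(dept) is 1 on rows where dept was aggregated away by the rollup, else 0;
      >>> # grouping_id() packs those bits for all grouping columns into one integer.
      >>> sqlContext.sql("""
      ...   SELECT dept, job, grouping(dept), grouping_id(), count(*)
      ...   FROM emp GROUP BY dept, job WITH ROLLUP
      ... """).show()
      ```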
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #10677 from davies/grouping.
      b5761d15
  25. Jan 31, 2016
    • [SPARK-13049] Add First/last with ignore nulls to functions.scala · 5a8b978f
      Herman van Hovell authored
      This PR adds the ability to specify the ```ignoreNulls``` option in the functions DSL, e.g.:
      ```df.select($"id", last($"value", ignoreNulls = true).over(Window.partitionBy($"id").orderBy($"other")))```
      
      This PR is somewhere between a bug fix (see the JIRA) and a new feature. I am not sure if we should backport it to 1.6.
      
      cc yhuai
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #10957 from hvanhovell/SPARK-13049.
      5a8b978f
  26. Jan 13, 2016
  27. Jan 05, 2016
    • [SPARK-12480][FOLLOW-UP] use a single column vararg for hash · 76768337
      Wenchen Fan authored
      address comments in #10435
      
      This makes the API easier to use when users programmatically generate the call to `hash`, and they will get an analysis exception if the argument list of `hash` is empty.
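      A hedged illustration from the Python side (the DataFrame and column names are made up):
      ```python
      >>> from pyspark.sql.functions import hash as hash_  # aliased to avoid shadowing the builtin
      >>> df.select(hash_(df.a, df.b)).show()   # one or more columns passed as varargs
      >>> df.select(hash_()).show()             # no columns: expected to raise AnalysisException
      ```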
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10588 from cloud-fan/hash.
      76768337
  28. Jan 04, 2016
  29. Dec 21, 2015
  30. Nov 26, 2015
    • [SPARK-11980][SPARK-10621][SQL] Fix json_tuple and add test cases for · 068b6438
      gatorsmile authored
      Added Python test cases for the functions `isnan`, `isnull`, `nanvl`, and `json_tuple`.
      
      Fixed a bug in the function `json_tuple`
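      A hedged usage sketch of the functions covered by the new tests (data and column names are made up):
      ```python
      >>> from pyspark.sql.functions import json_tuple, isnan, isnull, nanvl, lit
      >>> df = sqlContext.createDataFrame([('{"a": 1, "b": 2}', float('nan'))], ['js', 'x'])
      >>> df.select(json_tuple(df.js, 'a', 'b'), isnan(df.x), isnull(df.js), nanvl(df.x, lit(0.0))).show()
      ```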
      
      rxin, could you help me review my changes? Please let me know if anything is missing.
      
      Thank you! Have a good Thanksgiving day!
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #9977 from gatorsmile/json_tuple.
      068b6438
  31. Nov 24, 2015
  32. Nov 23, 2015
  33. Nov 10, 2015
    • [SPARK-11567] [PYTHON] Add Python API for corr Aggregate function · 32790fe7
      felixcheung authored
      like `df.agg(corr("col1", "col2"))`
      
      davies
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9536 from felixcheung/pyfunc.
      32790fe7
    • [SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to... · e0701c75
      Yin Huai authored
      [SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s
      
      https://issues.apache.org/jira/browse/SPARK-9830
      
      This PR contains the following main changes.
      * Removing `AggregateExpression1`.
      * Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
      * Removing planner rule used to plan `Aggregate`.
      * Linking `MultipleDistinctRewriter` to analyzer.
      * Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
      * Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
      * Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved).
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #9556 from yhuai/removeAgg1.
      e0701c75
  34. Nov 09, 2015
  35. Nov 03, 2015
  36. Sep 22, 2015
  37. Sep 08, 2015