Skip to content
Snippets Groups Projects
  1. Nov 03, 2016
    • Kishor Patil's avatar
      [SPARK-18099][YARN] Fail if same files added to distributed cache for --files and --archives · 098e4ca9
      Kishor Patil authored
      ## What changes were proposed in this pull request?
      
      During spark-submit, if yarn dist cache is instructed to add same file under --files and --archives, This code change ensures the spark yarn distributed cache behaviour is retained i.e. to warn and fail if same files is mentioned in both --files and --archives.
      ## How was this patch tested?
      
      Manually tested:
      1. if same jar is mentioned in --jars and --files it will continue to submit the job.
      - basically functionality [SPARK-14423] #12203 is unchanged
        1. if same file is mentioned in --files and --archives it will fail to submit the job.
      
      Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
      
      … under archives and files
      
      Author: Kishor Patil <kpatil@yahoo-inc.com>
      
      Closes #15627 from kishorvpatil/spark18099.
      098e4ca9
    • 福星's avatar
      [SPARK-18237][HIVE] hive.exec.stagingdir have no effect · 16293311
      福星 authored
      hive.exec.stagingdir have no effect in spark2.0.1,
      Hive confs in hive-site.xml will be loaded in `hadoopConf`, so we should use `hadoopConf` in `InsertIntoHiveTable` instead of `SessionState.conf`
      
      Author: 福星 <fuxing@wacai.com>
      
      Closes #15744 from ClassNotFoundExp/master.
      16293311
    • Reynold Xin's avatar
      [SPARK-18244][SQL] Rename partitionProviderIsHive -> tracksPartitionsInCatalog · b17057c0
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames partitionProviderIsHive to tracksPartitionsInCatalog, as the old name was too Hive specific.
      
      ## How was this patch tested?
      Should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15750 from rxin/SPARK-18244.
      b17057c0
    • Cheng Lian's avatar
      [SPARK-17949][SQL] A JVM object based aggregate operator · 27daf6bc
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new hash-based aggregate operator named `ObjectHashAggregateExec` that supports `TypedImperativeAggregate`, which may use arbitrary Java objects as aggregation states. Please refer to the [design doc](https://issues.apache.org/jira/secure/attachment/12834260/%5BDesign%20Doc%5D%20Support%20for%20Arbitrary%20Aggregation%20States.pdf) attached in [SPARK-17949](https://issues.apache.org/jira/browse/SPARK-17949) for more details about it.
      
      The major benefit of this operator is better performance when evaluating `TypedImperativeAggregate` functions, especially when there are relatively few distinct groups. Functions like Hive UDAFs, `collect_list`, and `collect_set` may also benefit from this after being migrated to `TypedImperativeAggregate`.
      
      The following feature flag is introduced to enable or disable the new aggregate operator:
      - Name: `spark.sql.execution.useObjectHashAggregateExec`
      - Default value: `true`
      
      We can also configure the fallback threshold using the following SQL operation:
      - Name: `spark.sql.objectHashAggregate.sortBased.fallbackThreshold`
      - Default value: 128
      
        Fallback to sort-based aggregation when more than 128 distinct groups are accumulated in the aggregation hash map. This number is intentionally made small to avoid GC problems since aggregation buffers of this operator may contain arbitrary Java objects.
      
        This may be improved by implementing size tracking for this operator, but that can be done in a separate PR.
      
      Code generation and size tracking are planned to be implemented in follow-up PRs.
      ## Benchmark results
      ### `ObjectHashAggregateExec` vs `SortAggregateExec`
      
      The first benchmark compares `ObjectHashAggregateExec` and `SortAggregateExec` by evaluating `typed_count`, a testing `TypedImperativeAggregate` version of the SQL `count` function.
      
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      
      object agg v.s. sort agg:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      sort agg w/ group by                        31251 / 31908          3.4         298.0       1.0X
      object agg w/ group by w/o fallback           6903 / 7141         15.2          65.8       4.5X
      object agg w/ group by w/ fallback          20945 / 21613          5.0         199.7       1.5X
      sort agg w/o group by                         4734 / 5463         22.1          45.2       6.6X
      object agg w/o group by w/o fallback          4310 / 4529         24.3          41.1       7.3X
      ```
      
      The next benchmark compares `ObjectHashAggregateExec` and `SortAggregateExec` by evaluating the Spark native version of `percentile_approx`.
      
      Note that `percentile_approx` is so heavy an aggregate function that the bottleneck of the benchmark is evaluating the aggregate function itself rather than the aggregate operator since I couldn't run a large scale benchmark on my laptop. That's why the results are so close and looks counter-intuitive (aggregation with grouping is even faster than that aggregation without grouping).
      
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      
      object agg v.s. sort agg:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      sort agg w/ group by                          3418 / 3530          0.6        1630.0       1.0X
      object agg w/ group by w/o fallback           3210 / 3314          0.7        1530.7       1.1X
      object agg w/ group by w/ fallback            3419 / 3511          0.6        1630.1       1.0X
      sort agg w/o group by                         4336 / 4499          0.5        2067.3       0.8X
      object agg w/o group by w/o fallback          4271 / 4372          0.5        2036.7       0.8X
      ```
      ### Hive UDAF vs Spark AF
      
      This benchmark compares the following two kinds of aggregate functions:
      - "hive udaf": Hive implementation of `percentile_approx`, without partial aggregation supports, evaluated using `SortAggregateExec`.
      - "spark af": Spark native implementation of `percentile_approx`, with partial aggregation support, evaluated using `ObjectHashAggregateExec`
      
      The performance differences are mostly due to faster implementation and partial aggregation support in the Spark native version of `percentile_approx`.
      
      This benchmark basically shows the performance differences between the worst case, where an aggregate function without partial aggregation support is evaluated using `SortAggregateExec`, and the best case, where a `TypedImperativeAggregate` with partial aggregation support is evaluated using `ObjectHashAggregateExec`.
      
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      
      hive udaf vs spark af:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      hive udaf w/o group by                        5326 / 5408          0.0       81264.2       1.0X
      spark af w/o group by                           93 /  111          0.7        1415.6      57.4X
      hive udaf w/ group by                         3804 / 3946          0.0       58050.1       1.4X
      spark af w/ group by w/o fallback               71 /   90          0.9        1085.7      74.8X
      spark af w/ group by w/ fallback                98 /  111          0.7        1501.6      54.1X
      ```
      ### Real world benchmark
      
      We also did a relatively large benchmark using a real world query involving `percentile_approx`:
      - Hive UDAF implementation, sort-based aggregation, w/o partial aggregation support
      
        24.77 minutes
      - Native implementation, sort-based aggregation, w/ partial aggregation support
      
        4.64 minutes
      - Native implementation, object hash aggregator, w/ partial aggregation support
      
        1.80 minutes
      ## How was this patch tested?
      
      New unit tests and randomized test cases are added in `ObjectAggregateFunctionSuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #15590 from liancheng/obj-hash-agg.
      27daf6bc
    • gatorsmile's avatar
      [SPARK-17981][SPARK-17957][SQL] Fix Incorrect Nullability Setting to False in FilterExec · 66a99f4a
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      When `FilterExec` contains `isNotNull`, which could be inferred and pushed down or users specified, we convert the nullability of the involved columns if the top-layer expression is null-intolerant. However, this is not correct, if the top-layer expression is not a leaf expression, it could still tolerate the null when it has null-tolerant child expressions.
      
      For example, `cast(coalesce(a#5, a#15) as double)`. Although `cast` is a null-intolerant expression, but obviously`coalesce` is null-tolerant. Thus, it could eat null.
      
      When the nullability is wrong, we could generate incorrect results in different cases. For example,
      
      ``` Scala
          val df1 = Seq((1, 2), (2, 3)).toDF("a", "b")
          val df2 = Seq((2, 5), (3, 4)).toDF("a", "c")
          val joinedDf = df1.join(df2, Seq("a"), "outer").na.fill(0)
          val df3 = Seq((3, 1)).toDF("a", "d")
          joinedDf.join(df3, "a").show
      ```
      
      The optimized plan is like
      
      ```
      Project [a#29, b#30, c#31, d#42]
      +- Join Inner, (a#29 = a#41)
         :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int) AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31]
         :  +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int))
         :     +- Join FullOuter, (a#5 = a#15)
         :        :- LocalRelation [a#5, b#6]
         :        +- LocalRelation [a#15, c#16]
         +- LocalRelation [a#41, d#42]
      ```
      
      Without the fix, it returns an empty result. With the fix, it can return a correct answer:
      
      ```
      +---+---+---+---+
      |  a|  b|  c|  d|
      +---+---+---+---+
      |  3|  0|  4|  1|
      +---+---+---+---+
      ```
      ### How was this patch tested?
      
      Added test cases to verify the nullability changes in FilterExec. Also added a test case for verifying the reported incorrect result.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15523 from gatorsmile/nullabilityFilterExec.
      66a99f4a
    • Zheng RuiFeng's avatar
      [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark GBTClassifier · 9dc9f9a5
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add missing 'subsamplingRate' of pyspark GBTClassifier
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15692 from zhengruifeng/gbt_subsamplingRate.
      9dc9f9a5
    • Reynold Xin's avatar
      [SQL] minor - internal doc improvement for InsertIntoTable. · 0ea5d5b2
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      I was reading this part of the code and was really confused by the "partition" parameter. This patch adds some documentation for it to reduce confusion in the future.
      
      I also looked around other logical plans but most of them are either already documented, or pretty self-evident to people that know Spark SQL.
      
      ## How was this patch tested?
      N/A - doc change only.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15749 from rxin/doc-improvement.
      0ea5d5b2
    • Reynold Xin's avatar
      [SPARK-18219] Move commit protocol API (internal) from sql/core to core module · 937af592
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch moves the new commit protocol API from sql/core to core module, so we can use it in the future in the RDD API.
      
      As part of this patch, I also moved the speficiation of the random uuid for the write path out of the commit protocol, and instead pass in a job id.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15731 from rxin/SPARK-18219.
      937af592
    • Daoyuan Wang's avatar
      [SPARK-17122][SQL] support drop current database · 96cc1b56
      Daoyuan Wang authored
      ## What changes were proposed in this pull request?
      
      In Spark 1.6 and earlier, we can drop the database we are using. In Spark 2.0, native implementation prevent us from dropping current database, which may break some old queries. This PR would re-enable the feature.
      ## How was this patch tested?
      
      one new unit test in `SessionCatalogSuite`.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #15011 from adrian-wang/dropcurrent.
      96cc1b56
    • Dongjoon Hyun's avatar
      [SPARK-18200][GRAPHX] Support zero as an initial capacity in OpenHashSet · d24e7364
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      [SPARK-18200](https://issues.apache.org/jira/browse/SPARK-18200) reports Apache Spark 2.x raises `java.lang.IllegalArgumentException: requirement failed: Invalid initial capacity` while running `triangleCount`. The root cause is that `VertexSet`, a type alias of `OpenHashSet`, does not allow zero as a initial size. This PR loosens the restriction to allow zero.
      
      ## How was this patch tested?
      
      Pass the Jenkins test with a new test case in `OpenHashSetSuite`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15741 from dongjoon-hyun/SPARK-18200.
      d24e7364
  2. Nov 02, 2016
    • gatorsmile's avatar
      [SPARK-18175][SQL] Improve the test case coverage of implicit type casting · 9ddec863
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      So far, we have limited test case coverage about implicit type casting. We need to draw a matrix to find all the possible casting pairs.
      - Reorged the existing test cases
      - Added all the possible type casting pairs
      - Drawed a matrix to show the implicit type casting. The table is very wide. Maybe hard to review. Thus, you also can access the same table via the link to [a google sheet](https://docs.google.com/spreadsheets/d/19PS4ikrs-Yye_mfu-rmIKYGnNe-NmOTt5DDT1fOD3pI/edit?usp=sharing).
      
      SourceType\CastToType | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | BooleanType | StringType | DateType | TimestampType | ArrayType | MapType | StructType | NullType | CalendarIntervalType | DecimalType | NumericType | IntegralType
      ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |  -----------
      **ByteType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(3, 0) | ByteType | ByteType
      **ShortType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(5, 0) | ShortType | ShortType
      **IntegerType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(10, 0) | IntegerType | IntegerType
      **LongType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(20, 0) | LongType | LongType
      **DoubleType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(30, 15) | DoubleType | IntegerType
      **FloatType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(14, 7) | FloatType | IntegerType
      **Dec(10, 2)** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(10, 2) | Dec(10, 2) | IntegerType
      **BinaryType** | X    | X    | X    | X    | X    | X    | X    | BinaryType | X    | StringType | X    | X    | X    | X    | X    | X    | X    | X    | X    | X
      **BooleanType** | X    | X    | X    | X    | X    | X    | X    | X    | BooleanType | StringType | X    | X    | X    | X    | X    | X    | X    | X    | X    | X
      **StringType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | X    | StringType | DateType | TimestampType | X    | X    | X    | X    | X    | DecimalType(38, 18) | DoubleType | X
      **DateType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | StringType | DateType | TimestampType | X    | X    | X    | X    | X    | X    | X    | X
      **TimestampType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | StringType | DateType | TimestampType | X    | X    | X    | X    | X    | X    | X    | X
      **ArrayType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | ArrayType* | X    | X    | X    | X    | X    | X    | X
      **MapType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | MapType* | X    | X    | X    | X    | X    | X
      **StructType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | StructType* | X    | X    | X    | X    | X
      **NullType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | BooleanType | StringType | DateType | TimestampType | ArrayType | MapType | StructType | NullType | CalendarIntervalType | DecimalType(38, 18) | DoubleType | IntegerType
      **CalendarIntervalType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | CalendarIntervalType | X    | X    | X
      Note: ArrayType\*, MapType\*, StructType\* are castable only when the internal child types also match; otherwise, not castable
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15691 from gatorsmile/implicitTypeCasting.
      9ddec863
    • hyukjinkwon's avatar
      [SPARK-17963][SQL][DOCUMENTATION] Add examples (extend) in each expression and... · 7eb2ca8e
      hyukjinkwon authored
      [SPARK-17963][SQL][DOCUMENTATION] Add examples (extend) in each expression and improve documentation
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to change the documentation for functions. Please refer the discussion from https://github.com/apache/spark/pull/15513
      
      The changes include
      - Re-indent the documentation
      - Add examples/arguments in `extended` where the arguments are multiple or specific format (e.g. xml/ json).
      
      For examples, the documentation was updated as below:
      ### Functions with single line usage
      
      **Before**
      - `pow`
      
        ``` sql
        Usage: pow(x1, x2) - Raise x1 to the power of x2.
        Extended Usage:
        > SELECT pow(2, 3);
         8.0
        ```
      - `current_timestamp`
      
        ``` sql
        Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
        Extended Usage:
        No example for current_timestamp.
        ```
      
      **After**
      - `pow`
      
        ``` sql
        Usage: pow(expr1, expr2) - Raises `expr1` to the power of `expr2`.
        Extended Usage:
            Examples:
              > SELECT pow(2, 3);
               8.0
        ```
      
      - `current_timestamp`
      
        ``` sql
        Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
        Extended Usage:
            No example/argument for current_timestamp.
        ```
      ### Functions with (already) multiple line usage
      
      **Before**
      - `approx_count_distinct`
      
        ``` sql
        Usage: approx_count_distinct(expr) - Returns the estimated cardinality by HyperLogLog++.
            approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated cardinality by HyperLogLog++
              with relativeSD, the maximum estimation error allowed.
      
        Extended Usage:
        No example for approx_count_distinct.
        ```
      - `percentile_approx`
      
        ``` sql
        Usage:
              percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
              column `col` at the given percentage. The value of percentage must be between 0.0
              and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
              controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
              better accuracy, `1.0/accuracy` is the relative error of the approximation.
      
              percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate
              percentile array of column `col` at the given percentage array. Each value of the
              percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is
              a positive integer literal which controls approximation accuracy at the cost of memory.
              Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of
              the approximation.
      
        Extended Usage:
        No example for percentile_approx.
        ```
      
      **After**
      - `approx_count_distinct`
      
        ``` sql
        Usage:
            approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++.
              `relativeSD` defines the maximum estimation error allowed.
      
        Extended Usage:
            No example/argument for approx_count_distinct.
        ```
      
      - `percentile_approx`
      
        ``` sql
        Usage:
            percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
              column `col` at the given percentage. The value of percentage must be between 0.0
              and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
              controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
              better accuracy, `1.0/accuracy` is the relative error of the approximation.
              When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
              In this case, returns the approximate percentile array of column `col` at the given
              percentage array.
      
        Extended Usage:
            Examples:
              > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
               [10.0,10.0,10.0]
              > SELECT percentile_approx(10.0, 0.5, 100);
               10.0
        ```
      ## How was this patch tested?
      
      Manually tested
      
      **When examples are multiple**
      
      ``` sql
      spark-sql> describe function extended reflect;
      Function: reflect
      Class: org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
      Usage: reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
      Extended Usage:
          Examples:
            > SELECT reflect('java.util.UUID', 'randomUUID');
             c33fb387-8500-4bfa-81d2-6e0e3e930df2
            > SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
             a5cf6c42-0c85-418f-af6c-3e4e5b1328f2
      ```
      
      **When `Usage` is in single line**
      
      ``` sql
      spark-sql> describe function extended min;
      Function: min
      Class: org.apache.spark.sql.catalyst.expressions.aggregate.Min
      Usage: min(expr) - Returns the minimum value of `expr`.
      Extended Usage:
          No example/argument for min.
      ```
      
      **When `Usage` is already in multiple lines**
      
      ``` sql
      spark-sql> describe function extended percentile_approx;
      Function: percentile_approx
      Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
      Usage:
          percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
            column `col` at the given percentage. The value of percentage must be between 0.0
            and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
            controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
            better accuracy, `1.0/accuracy` is the relative error of the approximation.
            When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
            In this case, returns the approximate percentile array of column `col` at the given
            percentage array.
      
      Extended Usage:
          Examples:
            > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
             [10.0,10.0,10.0]
            > SELECT percentile_approx(10.0, 0.5, 100);
             10.0
      ```
      
      **When example/argument is missing**
      
      ``` sql
      spark-sql> describe function extended rank;
      Function: rank
      Class: org.apache.spark.sql.catalyst.expressions.Rank
      Usage:
          rank() - Computes the rank of a value in a group of values. The result is one plus the number
            of rows preceding or equal to the current row in the ordering of the partition. The values
            will produce gaps in the sequence.
      
      Extended Usage:
          No example/argument for rank.
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15677 from HyukjinKwon/SPARK-17963-1.
      7eb2ca8e
    • Wenchen Fan's avatar
      [SPARK-17470][SQL] unify path for data source table and locationUri for hive serde table · 3a1bc6f4
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Due to a limitation of hive metastore(table location must be directory path, not file path), we always store `path` for data source table in storage properties, instead of the `locationUri` field. However, we should not expose this difference to `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like we store table schema of data source table in table properties.
      
      This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`, both data source table and hive serde table should use the `locationUri` field.
      
      This PR also unifies the way we handle default table location for managed table. Previously, the default table location of hive serde managed table is set by external catalog, but the one of data source table is set by command. After this PR, we follow the hive way and the default table location is always set by external catalog.
      
      For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm.
      For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`.
      
      To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options.
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15024 from cloud-fan/path.
      3a1bc6f4
    • Reynold Xin's avatar
      [SPARK-18214][SQL] Simplify RuntimeReplaceable type coercion · fd90541c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      RuntimeReplaceable is used to create aliases for expressions, but the way it deals with type coercion is pretty weird (each expression is responsible for how to handle type coercion, which does not obey the normal implicit type cast rules).
      
      This patch simplifies its handling by allowing the analyzer to traverse into the actual expression of a RuntimeReplaceable.
      
      ## How was this patch tested?
      - Correctness should be guaranteed by existing unit tests already
      - Removed SQLCompatibilityFunctionSuite and moved it sql-compatibility-functions.sql
      - Added a new test case in sql-compatibility-functions.sql for verifying explain behavior.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15723 from rxin/SPARK-18214.
      fd90541c
    • Steve Loughran's avatar
      [SPARK-17058][BUILD] Add maven snapshots-and-staging profile to build/test... · 37d95227
      Steve Loughran authored
      [SPARK-17058][BUILD] Add maven snapshots-and-staging profile to build/test against staging artifacts
      
      ## What changes were proposed in this pull request?
      
      Adds a `snapshots-and-staging profile` so that  RCs of projects like Hadoop and HBase can be used in developer-only build and test runs. There's a comment above the profile telling people not to use this in production.
      
      There's no attempt to do the same for SBT, as Ivy is different.
      ## How was this patch tested?
      
      Tested by building against the Hadoop 2.7.3 RC 1 JARs
      
      without the profile (and without any local copy of the 2.7.3 artifacts), the build failed
      
      ```
      mvn install -DskipTests -Pyarn,hadoop-2.7,hive -Dhadoop.version=2.7.3
      
      ...
      
      [INFO] ------------------------------------------------------------------------
      [INFO] Building Spark Project Launcher 2.1.0-SNAPSHOT
      [INFO] ------------------------------------------------------------------------
      Downloading: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.7.3/hadoop-client-2.7.3.pom
      [WARNING] The POM for org.apache.hadoop:hadoop-client:jar:2.7.3 is missing, no dependency information available
      Downloading: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.7.3/hadoop-client-2.7.3.jar
      [INFO] ------------------------------------------------------------------------
      [INFO] Reactor Summary:
      [INFO]
      [INFO] Spark Project Parent POM ........................... SUCCESS [  4.482 s]
      [INFO] Spark Project Tags ................................. SUCCESS [ 17.402 s]
      [INFO] Spark Project Sketch ............................... SUCCESS [ 11.252 s]
      [INFO] Spark Project Networking ........................... SUCCESS [ 13.458 s]
      [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  9.043 s]
      [INFO] Spark Project Unsafe ............................... SUCCESS [ 16.027 s]
      [INFO] Spark Project Launcher ............................. FAILURE [  1.653 s]
      [INFO] Spark Project Core ................................. SKIPPED
      ...
      ```
      
      With the profile, the build completed
      
      ```
      mvn install -DskipTests -Pyarn,hadoop-2.7,hive,snapshots-and-staging -Dhadoop.version=2.7.3
      ```
      
      Author: Steve Loughran <stevel@apache.org>
      
      Closes #14646 from steveloughran/stevel/SPARK-17058-support-asf-snapshots.
      37d95227
    • Jeff Zhang's avatar
      [SPARK-18160][CORE][YARN] spark.files & spark.jars should not be passed to driver in yarn mode · 3c24299b
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      spark.files is still passed to driver in yarn mode, so SparkContext will still handle it which cause the error in the jira desc.
      
      ## How was this patch tested?
      
      Tested manually in a 5 node cluster. As this issue only happens in multiple node cluster, so I didn't write test for it.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #15669 from zjffdu/SPARK-18160.
      3c24299b
    • Xiangrui Meng's avatar
      [SPARK-14393][SQL] values generated by non-deterministic functions shouldn't... · 02f20310
      Xiangrui Meng authored
      [SPARK-14393][SQL] values generated by non-deterministic functions shouldn't change after coalesce or union
      
      ## What changes were proposed in this pull request?
      
      When a user appended a column using a "nondeterministic" function to a DataFrame, e.g., `rand`, `randn`, and `monotonically_increasing_id`, the expected semantic is the following:
      - The value in each row should remain unchanged, as if we materialize the column immediately, regardless of later DataFrame operations.
      
      However, since we use `TaskContext.getPartitionId` to get the partition index from the current thread, the values from nondeterministic columns might change if we call `union` or `coalesce` after. `TaskContext.getPartitionId` returns the partition index of the current Spark task, which might not be the corresponding partition index of the DataFrame where we defined the column.
      
      See the unit tests below or JIRA for examples.
      
      This PR uses the partition index from `RDD.mapPartitionWithIndex` instead of `TaskContext` and fixes the partition initialization logic in whole-stage codegen, normal codegen, and codegen fallback. `initializeStatesForPartition(partitionIndex: Int)` was added to `Projection`, `Nondeterministic`, and `Predicate` (codegen) and initialized right after object creation in `mapPartitionWithIndex`. `newPredicate` now returns a `Predicate` instance rather than a function for proper initialization.
      ## How was this patch tested?
      
      Unit tests. (Actually I'm not very confident that this PR fixed all issues without introducing new ones ...)
      
      cc: rxin davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #15567 from mengxr/SPARK-14393.
      02f20310
    • buzhihuojie's avatar
      [SPARK-17895] Improve doc for rangeBetween and rowsBetween · 742e0fea
      buzhihuojie authored
      ## What changes were proposed in this pull request?
      
      Copied description for row and range based frame boundary from https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L56
      
      Added examples to show different behavior of rangeBetween and rowsBetween when involving duplicate values.
      
      Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
      
      Author: buzhihuojie <ren.weiluo@gmail.com>
      
      Closes #15727 from david-weiluo-ren/improveDocForRangeAndRowsBetween.
      742e0fea
    • Takeshi YAMAMURO's avatar
      [SPARK-17683][SQL] Support ArrayType in Literal.apply · 4af0ce2d
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      
      This pr is to add pattern-matching entries for array data in `Literal.apply`.
      ## How was this patch tested?
      
      Added tests in `LiteralExpressionSuite`.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #15257 from maropu/SPARK-17683.
      4af0ce2d
    • eyal farago's avatar
      [SPARK-16839][SQL] Simplify Struct creation code path · f151bd1a
      eyal farago authored
      ## What changes were proposed in this pull request?
      
      Simplify struct creation, especially the aspect of `CleanupAliases` which missed some aliases when handling trees created by `CreateStruct`.
      
      This PR includes:
      
      1. A failing test (create struct with nested aliases, some of the aliases survive `CleanupAliases`).
      2. A fix that transforms `CreateStruct` into a `CreateNamedStruct` constructor, effectively eliminating `CreateStruct` from all expression trees.
      3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be extracted from unresolved `NamedExpression`.
      4. A new Analyzer rule that resolves `NamePlaceHolder` into a string literal once the `NamedExpression` is resolved.
      5. `CleanupAliases` code was simplified as it no longer has to deal with `CreateStruct`'s top level columns.
      
      ## How was this patch tested?
      Running all tests-suits in package org.apache.spark.sql, especially including the analysis suite, making sure added test initially fails, after applying suggested fix rerun the entire analysis package successfully.
      
      Modified few tests that expected `CreateStruct` which is now transformed into `CreateNamedStruct`.
      
      Author: eyal farago <eyal farago>
      Author: Herman van Hovell <hvanhovell@databricks.com>
      Author: eyal farago <eyal.farago@gmail.com>
      Author: Eyal Farago <eyal.farago@actimize.com>
      Author: Hyukjin Kwon <gurwls223@gmail.com>
      Author: eyalfa <eyal.farago@gmail.com>
      
      Closes #15718 from hvanhovell/SPARK-16839-2.
      f151bd1a
    • Sean Owen's avatar
      [SPARK-18076][CORE][SQL] Fix default Locale used in DateFormat, NumberFormat to Locale.US · 9c8deef6
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix `Locale.US` for all usages of `DateFormat`, `NumberFormat`
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15610 from srowen/SPARK-18076.
      Unverified
      9c8deef6
    • Jacek Laskowski's avatar
      [SPARK-18204][WEBUI] Remove SparkUI.appUIAddress · 70a5db7b
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Removing `appUIAddress` attribute since it is no longer in use.
      ## How was this patch tested?
      
      Local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #15603 from jaceklaskowski/sparkui-fixes.
      Unverified
      70a5db7b
    • Liwei Lin's avatar
      [SPARK-18198][DOC][STREAMING] Highlight code snippets · 98ede494
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      This patch uses `{% highlight lang %}...{% endhighlight %}` to highlight code snippets in the `Structured Streaming Kafka010 integration doc` and the `Spark Streaming Kafka010 integration doc`.
      
      This patch consists of two commits:
      - the first commit fixes only the leading spaces -- this is large
      - the second commit adds the highlight instructions -- this is much simpler and easier to review
      
      ## How was this patch tested?
      
      SKIP_API=1 jekyll build
      
      ## Screenshots
      
      **Before**
      
      ![snip20161101_3](https://cloud.githubusercontent.com/assets/15843379/19894258/47746524-a087-11e6-9a2a-7bff2d428d44.png)
      
      **After**
      
      ![snip20161101_1](https://cloud.githubusercontent.com/assets/15843379/19894324/8bebcd1e-a087-11e6-835b-88c4d2979cfa.png)
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #15715 from lw-lin/doc-highlight-code-snippet.
      Unverified
      98ede494
    • Maria Rydzy's avatar
      [MINOR] Use <= for clarity in Pi examples' Monte Carlo process · bcbe4444
      Maria Rydzy authored
      ## What changes were proposed in this pull request?
      
      If my understanding is correct we should be rather looking at closed disk than the opened one.
      ## How was this patch tested?
      
      Run simple comparison, of the mean squared error of approaches with closed and opened disk.
      https://gist.github.com/mrydzy/1cf0e5c316ef9d6fbd91426b91f1969f
      The closed one performed slightly better, but the tested sample wasn't too big, so I rely mostly on the algorithm  understanding.
      
      Author: Maria Rydzy <majrydzy+gh@gmail.com>
      
      Closes #15687 from mrydzy/master.
      Unverified
      bcbe4444
    • Ryan Blue's avatar
      [SPARK-17532] Add lock debugging info to thread dumps. · 2dc04808
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      This adds information to the web UI thread dump page about the JVM locks
      held by threads and the locks that threads are blocked waiting to
      acquire. This should help find cases where lock contention is causing
      Spark applications to run slowly.
      ## How was this patch tested?
      
      Tested by applying this patch and viewing the change in the web UI.
      
      ![thread-lock-info](https://cloud.githubusercontent.com/assets/87915/18493057/6e5da870-79c3-11e6-8c20-f54c18a37544.png)
      
      Additions:
      - A "Thread Locking" column with the locks held by the thread or that are blocking the thread
      - Links from the a blocked thread to the thread holding the lock
      - Stack frames show where threads are inside `synchronized` blocks, "holding Monitor(...)"
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #15088 from rdblue/SPARK-17532-add-thread-lock-info.
      2dc04808
    • CodingCat's avatar
      [SPARK-18144][SQL] logging StreamingQueryListener$QueryStartedEvent · 85c5424d
      CodingCat authored
      ## What changes were proposed in this pull request?
      
      The PR fixes the bug that the QueryStartedEvent is not logged
      
      the postToAll() in the original code is actually calling StreamingQueryListenerBus.postToAll() which has no listener at all....we shall post by sparkListenerBus.postToAll(s) and this.postToAll() to trigger local listeners as well as the listeners registered in LiveListenerBus
      
      zsxwing
      ## How was this patch tested?
      
      The following snapshot shows that QueryStartedEvent has been logged correctly
      
      ![image](https://cloud.githubusercontent.com/assets/678008/19821553/007a7d28-9d2d-11e6-9f13-49851559cdaa.png)
      
      Author: CodingCat <zhunansjtu@gmail.com>
      
      Closes #15675 from CodingCat/SPARK-18144.
      85c5424d
    • Reynold Xin's avatar
      [SPARK-18192] Support all file formats in structured streaming · a36653c5
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch adds support for all file formats in structured streaming sinks. This is actually a very small change thanks to all the previous refactoring done using the new internal commit protocol API.
      
      ## How was this patch tested?
      Updated FileStreamSinkSuite to add test cases for json, text, and parquet.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15711 from rxin/SPARK-18192.
      a36653c5
    • Eric Liang's avatar
      [SPARK-18183][SPARK-18184] Fix INSERT [INTO|OVERWRITE] TABLE ... PARTITION for Datasource tables · abefe2ec
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      There are a couple issues with the current 2.1 behavior when inserting into Datasource tables with partitions managed by Hive.
      
      (1) OVERWRITE TABLE ... PARTITION will actually overwrite the entire table instead of just the specified partition.
      (2) INSERT|OVERWRITE does not work with partitions that have custom locations.
      
      This PR fixes both of these issues for Datasource tables managed by Hive. The behavior for legacy tables or when `manageFilesourcePartitions = false` is unchanged.
      
      There is one other issue in that INSERT OVERWRITE with dynamic partitions will overwrite the entire table instead of just the updated partitions, but this behavior is pretty complicated to implement for Datasource tables. We should address that in a future release.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15705 from ericl/sc-4942.
      abefe2ec
    • frreiss's avatar
      [SPARK-17475][STREAMING] Delete CRC files if the filesystem doesn't use checksum files · 620da3b4
      frreiss authored
      ## What changes were proposed in this pull request?
      
      When the metadata logs for various parts of Structured Streaming are stored on non-HDFS filesystems such as NFS or ext4, the HDFSMetadataLog class leaves hidden HDFS-style checksum (CRC) files in the log directory, one file per batch. This PR modifies HDFSMetadataLog so that it detects the use of a filesystem that doesn't use CRC files and removes the CRC files.
      ## How was this patch tested?
      
      Modified an existing test case in HDFSMetadataLogSuite to check whether HDFSMetadataLog correctly removes CRC files on the local POSIX filesystem.  Ran the entire regression suite.
      
      Author: frreiss <frreiss@us.ibm.com>
      
      Closes #15027 from frreiss/fred-17475.
      620da3b4
    • Michael Allman's avatar
      [SPARK-17992][SQL] Return all partitions from HiveShim when Hive throws a... · 1bbf9ff6
      Michael Allman authored
      [SPARK-17992][SQL] Return all partitions from HiveShim when Hive throws a metastore exception when attempting to fetch partitions by filter
      
      (Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-17992)
      ## What changes were proposed in this pull request?
      
      We recently added table partition pruning for partitioned Hive tables converted to using `TableFileCatalog`. When the Hive configuration option `hive.metastore.try.direct.sql` is set to `false`, Hive will throw an exception for unsupported filter expressions. For example, attempting to filter on an integer partition column will throw a `org.apache.hadoop.hive.metastore.api.MetaException`.
      
      I discovered this behavior because VideoAmp uses the CDH version of Hive with a Postgresql metastore DB. In this configuration, CDH sets `hive.metastore.try.direct.sql` to `false` by default, and queries that filter on a non-string partition column will fail.
      
      Rather than throw an exception in query planning, this patch catches this exception, logs a warning and returns all table partitions instead. Clients of this method are already expected to handle the possibility that the filters will not be honored.
      ## How was this patch tested?
      
      A unit test was added.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #15673 from mallman/spark-17992-catch_hive_partition_filter_exception.
      1bbf9ff6
    • hyukjinkwon's avatar
      [SPARK-17838][SPARKR] Check named arguments for options and use formatted R... · 1ecfafa0
      hyukjinkwon authored
      [SPARK-17838][SPARKR] Check named arguments for options and use formatted R friendly message from JVM exception message
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to
      - improve the R-friendly error messages rather than raw JVM exception one.
      
        As `read.json`, `read.text`, `read.orc`, `read.parquet` and `read.jdbc` are executed in the same  path with `read.df`, and `write.json`, `write.text`, `write.orc`, `write.parquet` and `write.jdbc` shares the same path with `write.df`, it seems it is safe to call `handledCallJMethod` to handle
        JVM messages.
      -  prevent `zero-length variable name` and prints the ignored options as an warning message.
      
      **Before**
      
      ``` r
      > read.json("path", a = 1, 2, 3, "a")
      Error in env[[name]] <- value :
        zero-length variable name
      ```
      
      ``` r
      > read.json("arbitrary_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
        ...
      
      > read.orc("arbitrary_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
        ...
      
      > read.text("arbitrary_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
        ...
      
      > read.parquet("arbitrary_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: Path does not exist: file:/...;
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$12.apply(DataSource.scala:398)
        ...
      ```
      
      ``` r
      > write.json(df, "existing_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: path file:/... already exists.;
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
      
      > write.orc(df, "existing_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: path file:/... already exists.;
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
      
      > write.text(df, "existing_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: path file:/... already exists.;
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
      
      > write.parquet(df, "existing_path")
      Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
        org.apache.spark.sql.AnalysisException: path file:/... already exists.;
        at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.run(InsertIntoHadoopFsRelationCommand.scala:68)
      ```
      
      **After**
      
      ``` r
      read.json("arbitrary_path", a = 1, 2, 3, "a")
      Unnamed arguments ignored: 2, 3, a.
      ```
      
      ``` r
      > read.json("arbitrary_path")
      Error in json : analysis error - Path does not exist: file:/...
      
      > read.orc("arbitrary_path")
      Error in orc : analysis error - Path does not exist: file:/...
      
      > read.text("arbitrary_path")
      Error in text : analysis error - Path does not exist: file:/...
      
      > read.parquet("arbitrary_path")
      Error in parquet : analysis error - Path does not exist: file:/...
      ```
      
      ``` r
      > write.json(df, "existing_path")
      Error in json : analysis error - path file:/... already exists.;
      
      > write.orc(df, "existing_path")
      Error in orc : analysis error - path file:/... already exists.;
      
      > write.text(df, "existing_path")
      Error in text : analysis error - path file:/... already exists.;
      
      > write.parquet(df, "existing_path")
      Error in parquet : analysis error - path file:/... already exists.;
      ```
      ## How was this patch tested?
      
      Unit tests in `test_utils.R` and `test_sparkSQL.R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15608 from HyukjinKwon/SPARK-17838.
      1ecfafa0
  3. Nov 01, 2016
    • Reynold Xin's avatar
      [SPARK-18216][SQL] Make Column.expr public · ad4832a9
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      Column.expr is private[sql], but it's an actually really useful field to have for debugging. We should open it up, similar to how we use QueryExecution.
      
      ## How was this patch tested?
      N/A - this is a simple visibility change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15724 from rxin/SPARK-18216.
      ad4832a9
    • Reynold Xin's avatar
      [SPARK-18025] Use commit protocol API in structured streaming · 77a98162
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch adds a new commit protocol implementation ManifestFileCommitProtocol that follows the existing streaming flow, and uses it in FileStreamSink to consolidate the write path in structured streaming with the batch mode write path.
      
      This deletes a lot of code, and would make it trivial to support other functionalities that are currently available in batch but not in streaming, including all file formats and bucketing.
      
      ## How was this patch tested?
      Should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15710 from rxin/SPARK-18025.
      77a98162
    • Joseph K. Bradley's avatar
      [SPARK-18088][ML] Various ChiSqSelector cleanups · 91c33a0c
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      - Renamed kbest to numTopFeatures
      - Renamed alpha to fpr
      - Added missing Since annotations
      - Doc cleanups
      ## How was this patch tested?
      
      Added new standardized unit tests for spark.ml.
      Improved existing unit test coverage a bit.
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #15647 from jkbradley/chisqselector-follow-ups.
      91c33a0c
    • Josh Rosen's avatar
      [SPARK-18182] Expose ReplayListenerBus.read() overload which takes string iterator · b929537b
      Josh Rosen authored
      The `ReplayListenerBus.read()` method is used when implementing a custom `ApplicationHistoryProvider`. The current interface only exposes a `read()` method which takes an `InputStream` and performs stream-to-lines conversion itself, but it would also be useful to expose an overloaded method which accepts an iterator of strings, thereby enabling events to be provided from non-`InputStream` sources.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15698 from JoshRosen/replay-listener-bus-interface.
      b929537b
    • Josh Rosen's avatar
      [SPARK-17350][SQL] Disable default use of KryoSerializer in Thrift Server · 6e629815
      Josh Rosen authored
      In SPARK-4761 / #3621 (December 2014) we enabled Kryo serialization by default in the Spark Thrift Server. However, I don't think that the original rationale for doing this still holds now that most Spark SQL serialization is now performed via encoders and our UnsafeRow format.
      
      In addition, the use of Kryo as the default serializer can introduce performance problems because the creation of new KryoSerializer instances is expensive and we haven't performed instance-reuse optimizations in several code paths (including DirectTaskResult deserialization).
      
      Given all of this, I propose to revert back to using JavaSerializer as the default serializer in the Thrift Server.
      
      /cc liancheng
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14906 from JoshRosen/disable-kryo-in-thriftserver.
      6e629815
    • hyukjinkwon's avatar
      [SPARK-17764][SQL] Add `to_json` supporting to convert nested struct column to JSON string · 01dd0083
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to add `to_json` function in contrast with `from_json` in Scala, Java and Python.
      
      It'd be useful if we can convert a same column from/to json. Also, some datasources do not support nested types. If we are forced to save a dataframe into those data sources, we might be able to work around by this function.
      
      The usage is as below:
      
      ``` scala
      val df = Seq(Tuple1(Tuple1(1))).toDF("a")
      df.select(to_json($"a").as("json")).show()
      ```
      
      ``` bash
      +--------+
      |    json|
      +--------+
      |{"_1":1}|
      +--------+
      ```
      ## How was this patch tested?
      
      Unit tests in `JsonFunctionsSuite` and `JsonExpressionsSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15354 from HyukjinKwon/SPARK-17764.
      01dd0083
    • Eric Liang's avatar
      [SPARK-18167] Disable flaky SQLQuerySuite test · cfac17ee
      Eric Liang authored
      We now know it's a persistent environmental issue that is causing this test to sometimes fail. One hypothesis is that some configuration is leaked from another suite, and depending on suite ordering this can cause this test to fail.
      
      I am planning on mining the jenkins logs to try to narrow down which suite could be causing this. For now, disable the test.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15720 from ericl/disable-flaky-test.
      cfac17ee
    • jiangxingbo's avatar
      [SPARK-18148][SQL] Misleading Error Message for Aggregation Without Window/GroupBy · d0272b43
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
      Aggregation Without Window/GroupBy expressions will fail in `checkAnalysis`, the error message is a bit misleading, we should generate a more specific error message for this case.
      
      For example,
      
      ```
      spark.read.load("/some-data")
        .withColumn("date_dt", to_date($"date"))
        .withColumn("year", year($"date_dt"))
        .withColumn("week", weekofyear($"date_dt"))
        .withColumn("user_count", count($"userId"))
        .withColumn("daily_max_in_week", max($"user_count").over(weeklyWindow))
      )
      ```
      
      creates the following output:
      
      ```
      org.apache.spark.sql.AnalysisException: expression '`randomColumn`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
      ```
      
      In the error message above, `randomColumn` doesn't appear in the query(acturally it's added by function `withColumn`), so the message is not enough for the user to address the problem.
      ## How was this patch tested?
      
      Manually test
      
      Before:
      
      ```
      scala> spark.sql("select col, count(col) from tbl")
      org.apache.spark.sql.AnalysisException: expression 'tbl.`col`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
      ```
      
      After:
      
      ```
      scala> spark.sql("select col, count(col) from tbl")
      org.apache.spark.sql.AnalysisException: grouping expressions sequence is empty, and 'tbl.`col`' is not an aggregate function. Wrap '(count(col#231L) AS count(col)#239L)' in windowing function(s) or wrap 'tbl.`col`' in first() (or first_value) if you don't care which value you get.;;
      ```
      
      Also add new test sqls in `group-by.sql`.
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #15672 from jiangxb1987/groupBy-empty.
      d0272b43
    • Ergin Seyfe's avatar
      [SPARK-18189][SQL] Fix serialization issue in KeyValueGroupedDataset · 8a538c97
      Ergin Seyfe authored
      ## What changes were proposed in this pull request?
      Likewise [DataSet.scala](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/Dataset.scala#L156) KeyValueGroupedDataset should mark the queryExecution as transient.
      
      As mentioned in the Jira ticket, without transient we saw serialization issues like
      
      ```
      Caused by: java.io.NotSerializableException: org.apache.spark.sql.execution.QueryExecution
      Serialization stack:
              - object not serializable (class: org.apache.spark.sql.execution.QueryExecution, value: ==
      ```
      
      ## How was this patch tested?
      
      Run the query which is specified in the Jira ticket before and after:
      ```
      val a = spark.createDataFrame(sc.parallelize(Seq((1,2),(3,4)))).as[(Int,Int)]
      val grouped = a.groupByKey(
      {x:(Int,Int)=>x._1}
      )
      val mappedGroups = grouped.mapGroups((k,x)=>
      {(k,1)}
      )
      val yyy = sc.broadcast(1)
      val last = mappedGroups.rdd.map(xx=>
      { val simpley = yyy.value 1 }
      )
      ```
      
      Author: Ergin Seyfe <eseyfe@fb.com>
      
      Closes #15706 from seyfe/keyvaluegrouped_serialization.
      8a538c97
Loading