Skip to content
Snippets Groups Projects
  1. Nov 05, 2016
  2. Nov 04, 2016
    • Josh Rosen's avatar
      [SPARK-18256] Improve the performance of event log replay in HistoryServer · 0e3312ee
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch significantly improves the performance of event log replay in the HistoryServer via two simple changes:
      
      - **Don't use `extractOpt`**: it turns out that `json4s`'s `extractOpt` method uses exceptions for control flow, causing huge performance bottlenecks due to the overhead of initializing exceptions. To avoid this overhead, we can simply use our own` Utils.jsonOption` method. This patch replaces all uses of `extractOpt` with `Utils.jsonOption` and adds a style checker rule to ban the use of the slow `extractOpt` method.
      - **Don't call `Utils.getFormattedClassName` for every event**: the old code called` Utils.getFormattedClassName` dozens of times per replayed event in order to match up class names in events with SparkListener event names. By simply storing the results of these calls in constants rather than recomputing them, we're able to eliminate a huge performance hotspot by removing thousands of expensive `Class.getSimpleName` calls.
      
      ## How was this patch tested?
      
      Tested by profiling the replay of a long event log using YourKit. For an event log containing 1000+ jobs, each of which had thousands of tasks, the changes in this patch cut the replay time in half:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/19980953/31154622-a1bd-11e6-9be4-21fbb9b3f9a7.png)
      
      Prior to this patch's changes, the two slowest methods in log replay were internal exceptions thrown by `Json4S` and calls to `Class.getSimpleName()`:
      
      ![image](https://cloud.githubusercontent.com/assets/50748/19981052/87416cce-a1bd-11e6-9f25-06a7cd391822.png)
      
      After this patch, these hotspots are completely eliminated.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15756 from JoshRosen/speed-up-jsonprotocol.
      0e3312ee
    • Eric Liang's avatar
      [SPARK-18167] Re-enable the non-flaky parts of SQLQuerySuite · 4cee2ce2
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      It seems the proximate cause of the test failures is that `cast(str as decimal)` in derby will raise an exception instead of returning NULL. This is a problem since Hive sometimes inserts `__HIVE_DEFAULT_PARTITION__` entries into the partition table as documented here: https://github.com/apache/hive/blob/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L1034
      
      Basically, when these special default partitions are present, partition pruning pushdown using the SQL-direct mode will fail due this cast exception. As commented on in `MetaStoreDirectSql.java` above, this is normally fine since Hive falls back to JDO pruning, however when the pruning predicate contains an unsupported operator such as `>`, that will fail as well.
      
      The only remaining question is why this behavior is nondeterministic. We know that when the test flakes, retries do not help, therefore the cause must be environmental. The current best hypothesis is that some config is different between different jenkins runs, which is why this PR prints out the Spark SQL and Hive confs for the test. The hope is that by comparing the config state for failure vs success we can isolate the root cause of the flakiness.
      
      **Update:** we could not isolate the issue. It does not seem to be due to configuration differences. As such, I'm going to enable the non-flaky parts of the test since we are fairly confident these issues only occur with Derby (which is not used in production).
      
      ## How was this patch tested?
      
      N/A
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15725 from ericl/print-confs-out.
      4cee2ce2
    • Herman van Hovell's avatar
      [SPARK-17337][SQL] Do not pushdown predicates through filters with predicate subqueries · 550cd56e
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The `PushDownPredicate` rule can create a wrong result if we try to push a filter containing a predicate subquery through a project when the subquery and the project share attributes (have the same source).
      
      The current PR fixes this by making sure that we do not push down when there is a predicate subquery that outputs the same attributes as the filters new child plan.
      
      ## How was this patch tested?
      Added a test to `SubquerySuite`. nsyca has done previous work this. I have taken test from his initial PR.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15761 from hvanhovell/SPARK-17337.
      550cd56e
    • Adam Roberts's avatar
      [SPARK-18197][CORE] Optimise AppendOnlyMap implementation · a42d738c
      Adam Roberts authored
      ## What changes were proposed in this pull request?
      This improvement works by using the fastest comparison test first and we observed a 1% throughput performance improvement on PageRank (HiBench large profile) with this change.
      
      We used tprof and before the change in AppendOnlyMap.changeValue (where the optimisation occurs) this method was being used for 8053 profiling ticks representing 0.72% of the overall application time.
      
      After this change we observed this method only occurring for 2786 ticks and for 0.25% of the overall time.
      
      ## How was this patch tested?
      Existing unit tests and for performance we used HiBench large, profiling with tprof and IBM Healthcenter.
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      
      Closes #15714 from a-roberts/patch-9.
      a42d738c
    • Reynold Xin's avatar
      Closing some stale/invalid pull requests · 14f235d5
      Reynold Xin authored
      Closes #15758
      Closes #15753
      Closes #12708
      14f235d5
    • Dongjoon Hyun's avatar
      [SPARK-18200][GRAPHX][FOLLOW-UP] Support zero as an initial capacity in OpenHashSet · 27602c33
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up PR of #15741 in order to keep `nextPowerOf2` consistent.
      
      **Before**
      ```
      nextPowerOf2(0) => 2
      nextPowerOf2(1) => 1
      nextPowerOf2(2) => 2
      nextPowerOf2(3) => 4
      nextPowerOf2(4) => 4
      nextPowerOf2(5) => 8
      ```
      
      **After**
      ```
      nextPowerOf2(0) => 1
      nextPowerOf2(1) => 1
      nextPowerOf2(2) => 2
      nextPowerOf2(3) => 4
      nextPowerOf2(4) => 4
      nextPowerOf2(5) => 8
      ```
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15754 from dongjoon-hyun/SPARK-18200-2.
      27602c33
    • Felix Cheung's avatar
      [SPARK-14393][SQL][DOC] update doc for python and R · a08463b1
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      minor doc update that should go to master & branch-2.1
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15747 from felixcheung/pySPARK-14393.
      a08463b1
  3. Nov 03, 2016
    • Herman van Hovell's avatar
      [SPARK-18259][SQL] Do not capture Throwable in QueryExecution · aa412c55
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      `QueryExecution.toString` currently captures `java.lang.Throwable`s; this is far from a best practice and can lead to confusing situation or invalid application states. This PR fixes this by only capturing `AnalysisException`s.
      
      ## How was this patch tested?
      Added a `QueryExecutionSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15760 from hvanhovell/SPARK-18259.
      aa412c55
    • Sean Owen's avatar
      [SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6... · dc4c6009
      Sean Owen authored
      [SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0
      
      ## What changes were proposed in this pull request?
      
      Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0. This does not actually implement any of the change in SPARK-18138, just peppers the documentation with notices about it.
      
      ## How was this patch tested?
      
      Doc build
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15733 from srowen/SPARK-18138.
      dc4c6009
    • Reynold Xin's avatar
      [SPARK-18257][SS] Improve error reporting for FileStressSuite · f22954ad
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch improves error reporting for FileStressSuite, when there is an error in Spark itself (not user code). This works by simply tightening the exception verification, and gets rid of the unnecessary thread for starting the stream.
      
      Also renamed the class FileStreamStressSuite to make it more obvious it is a streaming suite.
      
      ## How was this patch tested?
      This is a test only change and I manually verified error reporting by injecting some bug in the addBatch code for FileStreamSink.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15757 from rxin/SPARK-18257.
      f22954ad
    • wm624@hotmail.com's avatar
      [SPARKR][TEST] remove unnecessary suppressWarnings · e8920252
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      In test_mllib.R, there are two unnecessary suppressWarnings. This PR just removes them.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #15697 from wangmiao1981/rtest.
      e8920252
    • cody koeninger's avatar
      [SPARK-18212][SS][KAFKA] increase executor poll timeout · 67659c9a
      cody koeninger authored
      ## What changes were proposed in this pull request?
      
      Increase poll timeout to try and address flaky test
      
      ## How was this patch tested?
      
      Ran existing unit tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15737 from koeninger/SPARK-18212.
      67659c9a
    • Kishor Patil's avatar
      [SPARK-18099][YARN] Fail if same files added to distributed cache for --files and --archives · 098e4ca9
      Kishor Patil authored
      ## What changes were proposed in this pull request?
      
      During spark-submit, if yarn dist cache is instructed to add same file under --files and --archives, This code change ensures the spark yarn distributed cache behaviour is retained i.e. to warn and fail if same files is mentioned in both --files and --archives.
      ## How was this patch tested?
      
      Manually tested:
      1. if same jar is mentioned in --jars and --files it will continue to submit the job.
      - basically functionality [SPARK-14423] #12203 is unchanged
        1. if same file is mentioned in --files and --archives it will fail to submit the job.
      
      Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
      
      … under archives and files
      
      Author: Kishor Patil <kpatil@yahoo-inc.com>
      
      Closes #15627 from kishorvpatil/spark18099.
      098e4ca9
    • 福星's avatar
      [SPARK-18237][HIVE] hive.exec.stagingdir have no effect · 16293311
      福星 authored
      hive.exec.stagingdir have no effect in spark2.0.1,
      Hive confs in hive-site.xml will be loaded in `hadoopConf`, so we should use `hadoopConf` in `InsertIntoHiveTable` instead of `SessionState.conf`
      
      Author: 福星 <fuxing@wacai.com>
      
      Closes #15744 from ClassNotFoundExp/master.
      16293311
    • Reynold Xin's avatar
      [SPARK-18244][SQL] Rename partitionProviderIsHive -> tracksPartitionsInCatalog · b17057c0
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames partitionProviderIsHive to tracksPartitionsInCatalog, as the old name was too Hive specific.
      
      ## How was this patch tested?
      Should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15750 from rxin/SPARK-18244.
      b17057c0
    • Cheng Lian's avatar
      [SPARK-17949][SQL] A JVM object based aggregate operator · 27daf6bc
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR adds a new hash-based aggregate operator named `ObjectHashAggregateExec` that supports `TypedImperativeAggregate`, which may use arbitrary Java objects as aggregation states. Please refer to the [design doc](https://issues.apache.org/jira/secure/attachment/12834260/%5BDesign%20Doc%5D%20Support%20for%20Arbitrary%20Aggregation%20States.pdf) attached in [SPARK-17949](https://issues.apache.org/jira/browse/SPARK-17949) for more details about it.
      
      The major benefit of this operator is better performance when evaluating `TypedImperativeAggregate` functions, especially when there are relatively few distinct groups. Functions like Hive UDAFs, `collect_list`, and `collect_set` may also benefit from this after being migrated to `TypedImperativeAggregate`.
      
      The following feature flag is introduced to enable or disable the new aggregate operator:
      - Name: `spark.sql.execution.useObjectHashAggregateExec`
      - Default value: `true`
      
      We can also configure the fallback threshold using the following SQL operation:
      - Name: `spark.sql.objectHashAggregate.sortBased.fallbackThreshold`
      - Default value: 128
      
        Fallback to sort-based aggregation when more than 128 distinct groups are accumulated in the aggregation hash map. This number is intentionally made small to avoid GC problems since aggregation buffers of this operator may contain arbitrary Java objects.
      
        This may be improved by implementing size tracking for this operator, but that can be done in a separate PR.
      
      Code generation and size tracking are planned to be implemented in follow-up PRs.
      ## Benchmark results
      ### `ObjectHashAggregateExec` vs `SortAggregateExec`
      
      The first benchmark compares `ObjectHashAggregateExec` and `SortAggregateExec` by evaluating `typed_count`, a testing `TypedImperativeAggregate` version of the SQL `count` function.
      
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      
      object agg v.s. sort agg:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      sort agg w/ group by                        31251 / 31908          3.4         298.0       1.0X
      object agg w/ group by w/o fallback           6903 / 7141         15.2          65.8       4.5X
      object agg w/ group by w/ fallback          20945 / 21613          5.0         199.7       1.5X
      sort agg w/o group by                         4734 / 5463         22.1          45.2       6.6X
      object agg w/o group by w/o fallback          4310 / 4529         24.3          41.1       7.3X
      ```
      
      The next benchmark compares `ObjectHashAggregateExec` and `SortAggregateExec` by evaluating the Spark native version of `percentile_approx`.
      
      Note that `percentile_approx` is so heavy an aggregate function that the bottleneck of the benchmark is evaluating the aggregate function itself rather than the aggregate operator since I couldn't run a large scale benchmark on my laptop. That's why the results are so close and looks counter-intuitive (aggregation with grouping is even faster than that aggregation without grouping).
      
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      
      object agg v.s. sort agg:                Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      sort agg w/ group by                          3418 / 3530          0.6        1630.0       1.0X
      object agg w/ group by w/o fallback           3210 / 3314          0.7        1530.7       1.1X
      object agg w/ group by w/ fallback            3419 / 3511          0.6        1630.1       1.0X
      sort agg w/o group by                         4336 / 4499          0.5        2067.3       0.8X
      object agg w/o group by w/o fallback          4271 / 4372          0.5        2036.7       0.8X
      ```
      ### Hive UDAF vs Spark AF
      
      This benchmark compares the following two kinds of aggregate functions:
      - "hive udaf": Hive implementation of `percentile_approx`, without partial aggregation supports, evaluated using `SortAggregateExec`.
      - "spark af": Spark native implementation of `percentile_approx`, with partial aggregation support, evaluated using `ObjectHashAggregateExec`
      
      The performance differences are mostly due to faster implementation and partial aggregation support in the Spark native version of `percentile_approx`.
      
      This benchmark basically shows the performance differences between the worst case, where an aggregate function without partial aggregation support is evaluated using `SortAggregateExec`, and the best case, where a `TypedImperativeAggregate` with partial aggregation support is evaluated using `ObjectHashAggregateExec`.
      
      ```
      Java HotSpot(TM) 64-Bit Server VM 1.8.0_92-b14 on Mac OS X 10.10.5
      Intel(R) Core(TM) i7-4960HQ CPU  2.60GHz
      
      hive udaf vs spark af:                   Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
      ------------------------------------------------------------------------------------------------
      hive udaf w/o group by                        5326 / 5408          0.0       81264.2       1.0X
      spark af w/o group by                           93 /  111          0.7        1415.6      57.4X
      hive udaf w/ group by                         3804 / 3946          0.0       58050.1       1.4X
      spark af w/ group by w/o fallback               71 /   90          0.9        1085.7      74.8X
      spark af w/ group by w/ fallback                98 /  111          0.7        1501.6      54.1X
      ```
      ### Real world benchmark
      
      We also did a relatively large benchmark using a real world query involving `percentile_approx`:
      - Hive UDAF implementation, sort-based aggregation, w/o partial aggregation support
      
        24.77 minutes
      - Native implementation, sort-based aggregation, w/ partial aggregation support
      
        4.64 minutes
      - Native implementation, object hash aggregator, w/ partial aggregation support
      
        1.80 minutes
      ## How was this patch tested?
      
      New unit tests and randomized test cases are added in `ObjectAggregateFunctionSuite`.
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #15590 from liancheng/obj-hash-agg.
      27daf6bc
    • gatorsmile's avatar
      [SPARK-17981][SPARK-17957][SQL] Fix Incorrect Nullability Setting to False in FilterExec · 66a99f4a
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      When `FilterExec` contains `isNotNull`, which could be inferred and pushed down or users specified, we convert the nullability of the involved columns if the top-layer expression is null-intolerant. However, this is not correct, if the top-layer expression is not a leaf expression, it could still tolerate the null when it has null-tolerant child expressions.
      
      For example, `cast(coalesce(a#5, a#15) as double)`. Although `cast` is a null-intolerant expression, but obviously`coalesce` is null-tolerant. Thus, it could eat null.
      
      When the nullability is wrong, we could generate incorrect results in different cases. For example,
      
      ``` Scala
          val df1 = Seq((1, 2), (2, 3)).toDF("a", "b")
          val df2 = Seq((2, 5), (3, 4)).toDF("a", "c")
          val joinedDf = df1.join(df2, Seq("a"), "outer").na.fill(0)
          val df3 = Seq((3, 1)).toDF("a", "d")
          joinedDf.join(df3, "a").show
      ```
      
      The optimized plan is like
      
      ```
      Project [a#29, b#30, c#31, d#42]
      +- Join Inner, (a#29 = a#41)
         :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int) AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31]
         :  +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int))
         :     +- Join FullOuter, (a#5 = a#15)
         :        :- LocalRelation [a#5, b#6]
         :        +- LocalRelation [a#15, c#16]
         +- LocalRelation [a#41, d#42]
      ```
      
      Without the fix, it returns an empty result. With the fix, it can return a correct answer:
      
      ```
      +---+---+---+---+
      |  a|  b|  c|  d|
      +---+---+---+---+
      |  3|  0|  4|  1|
      +---+---+---+---+
      ```
      ### How was this patch tested?
      
      Added test cases to verify the nullability changes in FilterExec. Also added a test case for verifying the reported incorrect result.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15523 from gatorsmile/nullabilityFilterExec.
      66a99f4a
    • Zheng RuiFeng's avatar
      [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark GBTClassifier · 9dc9f9a5
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Add missing 'subsamplingRate' of pyspark GBTClassifier
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15692 from zhengruifeng/gbt_subsamplingRate.
      9dc9f9a5
    • Reynold Xin's avatar
      [SQL] minor - internal doc improvement for InsertIntoTable. · 0ea5d5b2
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      I was reading this part of the code and was really confused by the "partition" parameter. This patch adds some documentation for it to reduce confusion in the future.
      
      I also looked around other logical plans but most of them are either already documented, or pretty self-evident to people that know Spark SQL.
      
      ## How was this patch tested?
      N/A - doc change only.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15749 from rxin/doc-improvement.
      0ea5d5b2
    • Reynold Xin's avatar
      [SPARK-18219] Move commit protocol API (internal) from sql/core to core module · 937af592
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch moves the new commit protocol API from sql/core to core module, so we can use it in the future in the RDD API.
      
      As part of this patch, I also moved the speficiation of the random uuid for the write path out of the commit protocol, and instead pass in a job id.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15731 from rxin/SPARK-18219.
      937af592
    • Daoyuan Wang's avatar
      [SPARK-17122][SQL] support drop current database · 96cc1b56
      Daoyuan Wang authored
      ## What changes were proposed in this pull request?
      
      In Spark 1.6 and earlier, we can drop the database we are using. In Spark 2.0, native implementation prevent us from dropping current database, which may break some old queries. This PR would re-enable the feature.
      ## How was this patch tested?
      
      one new unit test in `SessionCatalogSuite`.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #15011 from adrian-wang/dropcurrent.
      96cc1b56
    • Dongjoon Hyun's avatar
      [SPARK-18200][GRAPHX] Support zero as an initial capacity in OpenHashSet · d24e7364
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      [SPARK-18200](https://issues.apache.org/jira/browse/SPARK-18200) reports Apache Spark 2.x raises `java.lang.IllegalArgumentException: requirement failed: Invalid initial capacity` while running `triangleCount`. The root cause is that `VertexSet`, a type alias of `OpenHashSet`, does not allow zero as a initial size. This PR loosens the restriction to allow zero.
      
      ## How was this patch tested?
      
      Pass the Jenkins test with a new test case in `OpenHashSetSuite`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15741 from dongjoon-hyun/SPARK-18200.
      d24e7364
  4. Nov 02, 2016
    • gatorsmile's avatar
      [SPARK-18175][SQL] Improve the test case coverage of implicit type casting · 9ddec863
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
      So far, we have limited test case coverage about implicit type casting. We need to draw a matrix to find all the possible casting pairs.
      - Reorged the existing test cases
      - Added all the possible type casting pairs
      - Drawed a matrix to show the implicit type casting. The table is very wide. Maybe hard to review. Thus, you also can access the same table via the link to [a google sheet](https://docs.google.com/spreadsheets/d/19PS4ikrs-Yye_mfu-rmIKYGnNe-NmOTt5DDT1fOD3pI/edit?usp=sharing).
      
      SourceType\CastToType | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | BooleanType | StringType | DateType | TimestampType | ArrayType | MapType | StructType | NullType | CalendarIntervalType | DecimalType | NumericType | IntegralType
      ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |  -----------
      **ByteType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(3, 0) | ByteType | ByteType
      **ShortType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(5, 0) | ShortType | ShortType
      **IntegerType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(10, 0) | IntegerType | IntegerType
      **LongType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(20, 0) | LongType | LongType
      **DoubleType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(30, 15) | DoubleType | IntegerType
      **FloatType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(14, 7) | FloatType | IntegerType
      **Dec(10, 2)** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(10, 2) | Dec(10, 2) | IntegerType
      **BinaryType** | X    | X    | X    | X    | X    | X    | X    | BinaryType | X    | StringType | X    | X    | X    | X    | X    | X    | X    | X    | X    | X
      **BooleanType** | X    | X    | X    | X    | X    | X    | X    | X    | BooleanType | StringType | X    | X    | X    | X    | X    | X    | X    | X    | X    | X
      **StringType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | X    | StringType | DateType | TimestampType | X    | X    | X    | X    | X    | DecimalType(38, 18) | DoubleType | X
      **DateType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | StringType | DateType | TimestampType | X    | X    | X    | X    | X    | X    | X    | X
      **TimestampType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | StringType | DateType | TimestampType | X    | X    | X    | X    | X    | X    | X    | X
      **ArrayType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | ArrayType* | X    | X    | X    | X    | X    | X    | X
      **MapType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | MapType* | X    | X    | X    | X    | X    | X
      **StructType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | StructType* | X    | X    | X    | X    | X
      **NullType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | BooleanType | StringType | DateType | TimestampType | ArrayType | MapType | StructType | NullType | CalendarIntervalType | DecimalType(38, 18) | DoubleType | IntegerType
      **CalendarIntervalType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | CalendarIntervalType | X    | X    | X
      Note: ArrayType\*, MapType\*, StructType\* are castable only when the internal child types also match; otherwise, not castable
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15691 from gatorsmile/implicitTypeCasting.
      9ddec863
    • hyukjinkwon's avatar
      [SPARK-17963][SQL][DOCUMENTATION] Add examples (extend) in each expression and... · 7eb2ca8e
      hyukjinkwon authored
      [SPARK-17963][SQL][DOCUMENTATION] Add examples (extend) in each expression and improve documentation
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to change the documentation for functions. Please refer the discussion from https://github.com/apache/spark/pull/15513
      
      The changes include
      - Re-indent the documentation
      - Add examples/arguments in `extended` where the arguments are multiple or specific format (e.g. xml/ json).
      
      For examples, the documentation was updated as below:
      ### Functions with single line usage
      
      **Before**
      - `pow`
      
        ``` sql
        Usage: pow(x1, x2) - Raise x1 to the power of x2.
        Extended Usage:
        > SELECT pow(2, 3);
         8.0
        ```
      - `current_timestamp`
      
        ``` sql
        Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
        Extended Usage:
        No example for current_timestamp.
        ```
      
      **After**
      - `pow`
      
        ``` sql
        Usage: pow(expr1, expr2) - Raises `expr1` to the power of `expr2`.
        Extended Usage:
            Examples:
              > SELECT pow(2, 3);
               8.0
        ```
      
      - `current_timestamp`
      
        ``` sql
        Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
        Extended Usage:
            No example/argument for current_timestamp.
        ```
      ### Functions with (already) multiple line usage
      
      **Before**
      - `approx_count_distinct`
      
        ``` sql
        Usage: approx_count_distinct(expr) - Returns the estimated cardinality by HyperLogLog++.
            approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated cardinality by HyperLogLog++
              with relativeSD, the maximum estimation error allowed.
      
        Extended Usage:
        No example for approx_count_distinct.
        ```
      - `percentile_approx`
      
        ``` sql
        Usage:
              percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
              column `col` at the given percentage. The value of percentage must be between 0.0
              and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
              controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
              better accuracy, `1.0/accuracy` is the relative error of the approximation.
      
              percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate
              percentile array of column `col` at the given percentage array. Each value of the
              percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is
              a positive integer literal which controls approximation accuracy at the cost of memory.
              Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of
              the approximation.
      
        Extended Usage:
        No example for percentile_approx.
        ```
      
      **After**
      - `approx_count_distinct`
      
        ``` sql
        Usage:
            approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++.
              `relativeSD` defines the maximum estimation error allowed.
      
        Extended Usage:
            No example/argument for approx_count_distinct.
        ```
      
      - `percentile_approx`
      
        ``` sql
        Usage:
            percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
              column `col` at the given percentage. The value of percentage must be between 0.0
              and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
              controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
              better accuracy, `1.0/accuracy` is the relative error of the approximation.
              When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
              In this case, returns the approximate percentile array of column `col` at the given
              percentage array.
      
        Extended Usage:
            Examples:
              > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
               [10.0,10.0,10.0]
              > SELECT percentile_approx(10.0, 0.5, 100);
               10.0
        ```
      ## How was this patch tested?
      
      Manually tested
      
      **When examples are multiple**
      
      ``` sql
      spark-sql> describe function extended reflect;
      Function: reflect
      Class: org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
      Usage: reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
      Extended Usage:
          Examples:
            > SELECT reflect('java.util.UUID', 'randomUUID');
             c33fb387-8500-4bfa-81d2-6e0e3e930df2
            > SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
             a5cf6c42-0c85-418f-af6c-3e4e5b1328f2
      ```
      
      **When `Usage` is in single line**
      
      ``` sql
      spark-sql> describe function extended min;
      Function: min
      Class: org.apache.spark.sql.catalyst.expressions.aggregate.Min
      Usage: min(expr) - Returns the minimum value of `expr`.
      Extended Usage:
          No example/argument for min.
      ```
      
      **When `Usage` is already in multiple lines**
      
      ``` sql
      spark-sql> describe function extended percentile_approx;
      Function: percentile_approx
      Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
      Usage:
          percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
            column `col` at the given percentage. The value of percentage must be between 0.0
            and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
            controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
            better accuracy, `1.0/accuracy` is the relative error of the approximation.
            When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
            In this case, returns the approximate percentile array of column `col` at the given
            percentage array.
      
      Extended Usage:
          Examples:
            > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
             [10.0,10.0,10.0]
            > SELECT percentile_approx(10.0, 0.5, 100);
             10.0
      ```
      
      **When example/argument is missing**
      
      ``` sql
      spark-sql> describe function extended rank;
      Function: rank
      Class: org.apache.spark.sql.catalyst.expressions.Rank
      Usage:
          rank() - Computes the rank of a value in a group of values. The result is one plus the number
            of rows preceding or equal to the current row in the ordering of the partition. The values
            will produce gaps in the sequence.
      
      Extended Usage:
          No example/argument for rank.
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15677 from HyukjinKwon/SPARK-17963-1.
      7eb2ca8e
    • Wenchen Fan's avatar
      [SPARK-17470][SQL] unify path for data source table and locationUri for hive serde table · 3a1bc6f4
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Due to a limitation of hive metastore(table location must be directory path, not file path), we always store `path` for data source table in storage properties, instead of the `locationUri` field. However, we should not expose this difference to `CatalogTable` level, but just treat it as a hack in `HiveExternalCatalog`, like we store table schema of data source table in table properties.
      
      This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`, both data source table and hive serde table should use the `locationUri` field.
      
      This PR also unifies the way we handle default table location for managed table. Previously, the default table location of hive serde managed table is set by external catalog, but the one of data source table is set by command. After this PR, we follow the hive way and the default table location is always set by external catalog.
      
      For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm.
      For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`.
      
      To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options.
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15024 from cloud-fan/path.
      3a1bc6f4
    • Reynold Xin's avatar
      [SPARK-18214][SQL] Simplify RuntimeReplaceable type coercion · fd90541c
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      RuntimeReplaceable is used to create aliases for expressions, but the way it deals with type coercion is pretty weird (each expression is responsible for how to handle type coercion, which does not obey the normal implicit type cast rules).
      
      This patch simplifies its handling by allowing the analyzer to traverse into the actual expression of a RuntimeReplaceable.
      
      ## How was this patch tested?
      - Correctness should be guaranteed by existing unit tests already
      - Removed SQLCompatibilityFunctionSuite and moved it sql-compatibility-functions.sql
      - Added a new test case in sql-compatibility-functions.sql for verifying explain behavior.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15723 from rxin/SPARK-18214.
      fd90541c
    • Steve Loughran's avatar
      [SPARK-17058][BUILD] Add maven snapshots-and-staging profile to build/test... · 37d95227
      Steve Loughran authored
      [SPARK-17058][BUILD] Add maven snapshots-and-staging profile to build/test against staging artifacts
      
      ## What changes were proposed in this pull request?
      
      Adds a `snapshots-and-staging profile` so that  RCs of projects like Hadoop and HBase can be used in developer-only build and test runs. There's a comment above the profile telling people not to use this in production.
      
      There's no attempt to do the same for SBT, as Ivy is different.
      ## How was this patch tested?
      
      Tested by building against the Hadoop 2.7.3 RC 1 JARs
      
      without the profile (and without any local copy of the 2.7.3 artifacts), the build failed
      
      ```
      mvn install -DskipTests -Pyarn,hadoop-2.7,hive -Dhadoop.version=2.7.3
      
      ...
      
      [INFO] ------------------------------------------------------------------------
      [INFO] Building Spark Project Launcher 2.1.0-SNAPSHOT
      [INFO] ------------------------------------------------------------------------
      Downloading: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.7.3/hadoop-client-2.7.3.pom
      [WARNING] The POM for org.apache.hadoop:hadoop-client:jar:2.7.3 is missing, no dependency information available
      Downloading: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.7.3/hadoop-client-2.7.3.jar
      [INFO] ------------------------------------------------------------------------
      [INFO] Reactor Summary:
      [INFO]
      [INFO] Spark Project Parent POM ........................... SUCCESS [  4.482 s]
      [INFO] Spark Project Tags ................................. SUCCESS [ 17.402 s]
      [INFO] Spark Project Sketch ............................... SUCCESS [ 11.252 s]
      [INFO] Spark Project Networking ........................... SUCCESS [ 13.458 s]
      [INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  9.043 s]
      [INFO] Spark Project Unsafe ............................... SUCCESS [ 16.027 s]
      [INFO] Spark Project Launcher ............................. FAILURE [  1.653 s]
      [INFO] Spark Project Core ................................. SKIPPED
      ...
      ```
      
      With the profile, the build completed
      
      ```
      mvn install -DskipTests -Pyarn,hadoop-2.7,hive,snapshots-and-staging -Dhadoop.version=2.7.3
      ```
      
      Author: Steve Loughran <stevel@apache.org>
      
      Closes #14646 from steveloughran/stevel/SPARK-17058-support-asf-snapshots.
      37d95227
    • Jeff Zhang's avatar
      [SPARK-18160][CORE][YARN] spark.files & spark.jars should not be passed to driver in yarn mode · 3c24299b
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
      spark.files is still passed to driver in yarn mode, so SparkContext will still handle it which cause the error in the jira desc.
      
      ## How was this patch tested?
      
      Tested manually in a 5 node cluster. As this issue only happens in multiple node cluster, so I didn't write test for it.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #15669 from zjffdu/SPARK-18160.
      3c24299b
    • Xiangrui Meng's avatar
      [SPARK-14393][SQL] values generated by non-deterministic functions shouldn't... · 02f20310
      Xiangrui Meng authored
      [SPARK-14393][SQL] values generated by non-deterministic functions shouldn't change after coalesce or union
      
      ## What changes were proposed in this pull request?
      
      When a user appended a column using a "nondeterministic" function to a DataFrame, e.g., `rand`, `randn`, and `monotonically_increasing_id`, the expected semantic is the following:
      - The value in each row should remain unchanged, as if we materialize the column immediately, regardless of later DataFrame operations.
      
      However, since we use `TaskContext.getPartitionId` to get the partition index from the current thread, the values from nondeterministic columns might change if we call `union` or `coalesce` after. `TaskContext.getPartitionId` returns the partition index of the current Spark task, which might not be the corresponding partition index of the DataFrame where we defined the column.
      
      See the unit tests below or JIRA for examples.
      
      This PR uses the partition index from `RDD.mapPartitionWithIndex` instead of `TaskContext` and fixes the partition initialization logic in whole-stage codegen, normal codegen, and codegen fallback. `initializeStatesForPartition(partitionIndex: Int)` was added to `Projection`, `Nondeterministic`, and `Predicate` (codegen) and initialized right after object creation in `mapPartitionWithIndex`. `newPredicate` now returns a `Predicate` instance rather than a function for proper initialization.
      ## How was this patch tested?
      
      Unit tests. (Actually I'm not very confident that this PR fixed all issues without introducing new ones ...)
      
      cc: rxin davies
      
      Author: Xiangrui Meng <meng@databricks.com>
      
      Closes #15567 from mengxr/SPARK-14393.
      02f20310
    • buzhihuojie's avatar
      [SPARK-17895] Improve doc for rangeBetween and rowsBetween · 742e0fea
      buzhihuojie authored
      ## What changes were proposed in this pull request?
      
      Copied description for row and range based frame boundary from https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L56
      
      Added examples to show different behavior of rangeBetween and rowsBetween when involving duplicate values.
      
      Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before opening a pull request.
      
      Author: buzhihuojie <ren.weiluo@gmail.com>
      
      Closes #15727 from david-weiluo-ren/improveDocForRangeAndRowsBetween.
      742e0fea
    • Takeshi YAMAMURO's avatar
      [SPARK-17683][SQL] Support ArrayType in Literal.apply · 4af0ce2d
      Takeshi YAMAMURO authored
      ## What changes were proposed in this pull request?
      
      This pr is to add pattern-matching entries for array data in `Literal.apply`.
      ## How was this patch tested?
      
      Added tests in `LiteralExpressionSuite`.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #15257 from maropu/SPARK-17683.
      4af0ce2d
Loading