  1. Nov 07, 2016
• [SPARK-18125][SQL] Fix a compilation error in codegen due to splitExpression · df40ee2b
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
As reported in the JIRA, the Java code produced by codegen sometimes fails to compile.
      
      Code snippet to test it:
      
          case class Route(src: String, dest: String, cost: Int)
          case class GroupedRoutes(src: String, dest: String, routes: Seq[Route])
      
          val ds = sc.parallelize(Array(
            Route("a", "b", 1),
            Route("a", "b", 2),
            Route("a", "c", 2),
            Route("a", "d", 10),
            Route("b", "a", 1),
            Route("b", "a", 5),
            Route("b", "c", 6))
          ).toDF.as[Route]
      
          val grped = ds.map(r => GroupedRoutes(r.src, r.dest, Seq(r)))
            .groupByKey(r => (r.src, r.dest))
            .reduceGroups { (g1: GroupedRoutes, g2: GroupedRoutes) =>
              GroupedRoutes(g1.src, g1.dest, g1.routes ++ g2.routes)
            }.map(_._2)
      
The problem here is that `ReferenceToExpressions` evaluates the children vars into local variables, and the result expression is then evaluated using those variables. In the case above, the result expression's generated code is long enough to be split into separate methods by `CodegenContext.splitExpression`, so the local variables are no longer in scope and the generated code fails to compile.
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #15693 from viirya/fix-codege-compilation-error.
      
      (cherry picked from commit a814eeac)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
• [SPARK-16904][SQL] Removal of Hive Built-in Hash Functions and TestHiveFunctionRegistry · 41010295
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      
The Hive built-in `hash` function has not been used in Spark since Spark 2.0. The public interface does not allow users to unregister the Spark built-in functions, so users will never fall back to Hive's built-in `hash` function.

The only exception here is `TestHiveFunctionRegistry`, which allows unregistering the built-in functions; that is how Hive's hash function gets loaded in the test cases. If we simply disabled it, 10+ test cases would fail because the results differ from the Hive golden answer files.

This PR removes `hash` from the list of `hiveFunctions` in `HiveSessionCatalog` and also removes `TestHiveFunctionRegistry`. This removal makes it easier to remove `TestHiveSessionState` in the future.
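
For reference, the `hash` users get since 2.0 is Spark's own Murmur3-based implementation; a quick illustration (assuming a `SparkSession` named `spark`, as in the shell):

```scala
import org.apache.spark.sql.functions.{col, hash}

// Spark's built-in Murmur3-based hash, not Hive's:
spark.range(3).select(col("id"), hash(col("id"))).show()
```
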
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #14498 from gatorsmile/removeHash.
      
      (cherry picked from commit 57626a55)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18296][SQL] Use consistent naming for expression test suites · 2fa1a632
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      We have an undocumented naming convention to call expression unit tests ExpressionsSuite, and the end-to-end tests FunctionsSuite. It'd be great to make all test suites consistent with this naming convention.
      
      ## How was this patch tested?
      This is a test-only naming change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15793 from rxin/SPARK-18296.
      
      (cherry picked from commit 9db06c44)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18167][SQL] Disable flaky hive partition pruning test. · 9ebd5e56
      Reynold Xin authored
      
      (cherry picked from commit 07ac3f09)
Signed-off-by: Reynold Xin <rxin@databricks.com>
  2. Nov 06, 2016
• [SPARK-18173][SQL] data source tables should support truncating partition · 9c78d355
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
Previously, `TRUNCATE TABLE ... PARTITION` would always truncate the whole table for data source tables. This PR fixes that and improves `InMemoryCatalog` so that the command works with it as well.
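
A minimal illustration of the corrected behaviour (assuming a `SparkSession` named `spark`; table, column, and partition values are made up):

```scala
// Partitioned data source table:
spark.sql("CREATE TABLE sales (amount INT, dt STRING) USING parquet PARTITIONED BY (dt)")
spark.sql("INSERT INTO sales VALUES (42, '2016-11-05'), (7, '2016-11-06')")

// Previously this truncated the whole table for data source tables;
// now only the matching partition is cleared:
spark.sql("TRUNCATE TABLE sales PARTITION (dt = '2016-11-06')")
```
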
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15688 from cloud-fan/truncate.
      
      (cherry picked from commit 46b2e499)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18269][SQL] CSV datasource should read null properly when schema is larger than parsed tokens · a8fbcdbf
      hyukjinkwon authored
      
      ## What changes were proposed in this pull request?
      
Currently, there are three cases when reading CSV via the datasource in `PERMISSIVE` parse mode:
      
      - schema == parsed tokens (from each line)
        No problem to cast the value in the tokens to the field in the schema as they are equal.
      
      - schema < parsed tokens (from each line)
        It slices the tokens into the number of fields in schema.
      
      - schema > parsed tokens (from each line)
  It appends `null` to the parsed tokens so that the values can be safely cast against the schema.
      
      However, when `null` is appended in the third case, we should take `null` into account when casting the values.
      
For `StringType` this is fine, as `UTF8String.fromString(datum)` produces `null` when the input is `null`. Therefore, the problem only occurs when a schema is explicitly given and it includes data types other than `StringType`.
      
The code below:
      
      ```scala
      val path = "/tmp/a"
Seq("1").toDF().write.text(path)
      val schema = StructType(
        StructField("a", IntegerType, true) ::
        StructField("b", IntegerType, true) :: Nil)
      spark.read.schema(schema).option("header", "false").csv(path).show()
      ```
      
      prints
      
      **Before**
      
      ```
      java.lang.NumberFormatException: null
      at java.lang.Integer.parseInt(Integer.java:542)
      at java.lang.Integer.parseInt(Integer.java:615)
      at scala.collection.immutable.StringLike$class.toInt(StringLike.scala:272)
      at scala.collection.immutable.StringOps.toInt(StringOps.scala:29)
      at org.apache.spark.sql.execution.datasources.csv.CSVTypeCast$.castTo(CSVInferSchema.scala:24)
      ```
      
      **After**
      
      ```
      +---+----+
      |  a|   b|
      +---+----+
      |  1|null|
      +---+----+
      ```
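
The shape of the guard the fix adds can be sketched against plain Scala stand-ins (illustrative only, not the actual `CSVTypeCast.castTo` code):

```scala
// Simplified stand-ins for Spark's DataType hierarchy (illustrative only):
sealed trait SimpleType
case object IntType extends SimpleType
case object StrType extends SimpleType

def castTo(datum: String, castType: SimpleType, nullable: Boolean = true): Any = {
  // Tokens padded with null to match a wider schema stay null instead of being parsed.
  if (datum == null && nullable) {
    null
  } else {
    castType match {
      case IntType => datum.toInt // previously threw NumberFormatException for padded nulls
      case StrType => datum       // StringType already tolerated null
    }
  }
}
```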
      
      ## How was this patch tested?
      
Unit tests in `CSVSuite.scala` and `CSVTypeCastSuite.scala`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15767 from HyukjinKwon/SPARK-18269.
      
      (cherry picked from commit 556a3b7d)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18210][ML] Pipeline.copy does not create an instance with the same UID · d2f2cf68
      Wojciech Szymanski authored
      
      ## What changes were proposed in this pull request?
      
      Motivation:
      `org.apache.spark.ml.Pipeline.copy(extra: ParamMap)` does not create an instance with the same UID. It does not conform to the method specification from its base class `org.apache.spark.ml.param.Params.copy(extra: ParamMap)`
      
      Solution:
      - fix for Pipeline UID
      - introduced new tests for `org.apache.spark.ml.Pipeline.copy`
      - minor improvements in test for `org.apache.spark.ml.PipelineModel.copy`
      
      ## How was this patch tested?
      
      Introduced new unit test: `org.apache.spark.ml.PipelineSuite."Pipeline.copy"`
      Improved existing unit test: `org.apache.spark.ml.PipelineSuite."PipelineModel.copy"`
      
      Author: Wojciech Szymanski <wk.szymanski@gmail.com>
      
      Closes #15759 from wojtek-szymanski/SPARK-18210.
      
      (cherry picked from commit b89d0556)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
• [SPARK-17854][SQL] rand/randn allows null/long as input seed · dcbf3fd4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
This PR proposes that `rand`/`randn` accept `null` as input in Scala/SQL and `LongType` as input in SQL. In these cases, the seed is treated as `0`.
      
      So, this PR includes both changes below:
      - `null` support
      
        It seems MySQL also accepts this.
      
        ``` sql
        mysql> select rand(0);
        +---------------------+
        | rand(0)             |
        +---------------------+
        | 0.15522042769493574 |
        +---------------------+
        1 row in set (0.00 sec)
      
        mysql> select rand(NULL);
        +---------------------+
        | rand(NULL)          |
        +---------------------+
        | 0.15522042769493574 |
        +---------------------+
        1 row in set (0.00 sec)
        ```
      
  and also Hive does, according to [HIVE-14694](https://issues.apache.org/jira/browse/HIVE-14694).
      
  So the code below:
      
        ``` scala
        spark.range(1).selectExpr("rand(null)").show()
        ```
      
        prints..
      
        **Before**
      
        ```
          Input argument to rand must be an integer literal.;; line 1 pos 0
        org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0
        at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465)
        at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:444)
        ```
      
        **After**
      
        ```
          +-----------------------+
          |rand(CAST(NULL AS INT))|
          +-----------------------+
          |    0.13385709732307427|
          +-----------------------+
        ```
      - `LongType` support in SQL.
      
  In addition, it makes the function take `LongType` consistently across Scala/SQL.
      
  In more detail, the code below:
      
        ``` scala
        spark.range(1).select(rand(1), rand(1L)).show()
        spark.range(1).selectExpr("rand(1)", "rand(1L)").show()
        ```
      
        prints..
      
        **Before**
      
        ```
        +------------------+------------------+
        |           rand(1)|           rand(1)|
        +------------------+------------------+
        |0.2630967864682161|0.2630967864682161|
        +------------------+------------------+
      
        Input argument to rand must be an integer literal.;; line 1 pos 0
        org.apache.spark.sql.AnalysisException: Input argument to rand must be an integer literal.;; line 1 pos 0
        at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$$anonfun$5.apply(FunctionRegistry.scala:465)
        at
        ```
      
        **After**
      
        ```
        +------------------+------------------+
        |           rand(1)|           rand(1)|
        +------------------+------------------+
        |0.2630967864682161|0.2630967864682161|
        +------------------+------------------+
      
        +------------------+------------------+
        |           rand(1)|           rand(1)|
        +------------------+------------------+
        |0.2630967864682161|0.2630967864682161|
        +------------------+------------------+
        ```
      ## How was this patch tested?
      
      Unit tests in `DataFrameSuite.scala` and `RandomSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15432 from HyukjinKwon/SPARK-17854.
      
      (cherry picked from commit 340f09d1)
Signed-off-by: Sean Owen <sowen@cloudera.com>
• [SPARK-18276][ML] ML models should copy the training summary and set parent · c42301f1
      sethah authored
      
      ## What changes were proposed in this pull request?
      
Only some of the models that contain a training summary currently set it in the copy method: Linear/Logistic regression do, while GeneralizedLinearRegression, GaussianMixture, KMeans, and BisectingKMeans do not. Additionally, these copy methods did not set the parent pointer of the copied model. This patch modifies the copy methods of the four models mentioned above to copy the training summary and set the parent.
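
A hedged sketch of what the new tests assert, using KMeans as one of the affected models (the `dataset` DataFrame is assumed to exist and contain a features column):

```scala
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.param.ParamMap

val estimator = new KMeans().setK(2)
val model = estimator.fit(dataset)     // `dataset` is an assumed training DataFrame
val copied = model.copy(ParamMap.empty)

assert(copied.hasSummary)              // training summary is now carried over
assert(copied.parent eq model.parent)  // parent pointer is now set on the copy
```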
      
      ## How was this patch tested?
      
      Add unit tests in Linear/Logistic/GeneralizedLinear regression and GaussianMixture/KMeans/BisectingKMeans to check the parent pointer of the copied model and check that the copied model has a summary.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15773 from sethah/SPARK-18276.
      
      (cherry picked from commit 23ce0d1e)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
  3. Nov 05, 2016
  4. Nov 04, 2016
• [SPARK-18167] Re-enable the non-flaky parts of SQLQuerySuite · 0a303a69
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
It seems the proximate cause of the test failures is that `cast(str as decimal)` in Derby raises an exception instead of returning NULL. This is a problem since Hive sometimes inserts `__HIVE_DEFAULT_PARTITION__` entries into the partition table, as documented here: https://github.com/apache/hive/blob/trunk/metastore/src/java/org/apache/hadoop/hive/metastore/MetaStoreDirectSql.java#L1034
      
      
      
Basically, when these special default partitions are present, partition pruning pushdown using the SQL-direct mode will fail due to this cast exception. As commented in `MetaStoreDirectSql.java` above, this is normally fine since Hive falls back to JDO pruning; however, when the pruning predicate contains an unsupported operator such as `>`, that will fail as well.
      
      The only remaining question is why this behavior is nondeterministic. We know that when the test flakes, retries do not help, therefore the cause must be environmental. The current best hypothesis is that some config is different between different jenkins runs, which is why this PR prints out the Spark SQL and Hive confs for the test. The hope is that by comparing the config state for failure vs success we can isolate the root cause of the flakiness.
      
      **Update:** we could not isolate the issue. It does not seem to be due to configuration differences. As such, I'm going to enable the non-flaky parts of the test since we are fairly confident these issues only occur with Derby (which is not used in production).
      
      ## How was this patch tested?
      
      N/A
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #15725 from ericl/print-confs-out.
      
      (cherry picked from commit 4cee2ce2)
Signed-off-by: Yin Huai <yhuai@databricks.com>
• [SPARK-17337][SQL] Do not pushdown predicates through filters with predicate subqueries · e51978c3
      Herman van Hovell authored
      
      ## What changes were proposed in this pull request?
      The `PushDownPredicate` rule can create a wrong result if we try to push a filter containing a predicate subquery through a project when the subquery and the project share attributes (have the same source).
      
The current PR fixes this by making sure that we do not push down when there is a predicate subquery that outputs the same attributes as the filter's new child plan.
      
      ## How was this patch tested?
Added a test to `SubquerySuite`. nsyca has done previous work on this; I have taken the test from his initial PR.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15761 from hvanhovell/SPARK-17337.
      
      (cherry picked from commit 550cd56e)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
• [SPARK-18197][CORE] Optimise AppendOnlyMap implementation · a2d7e25e
      Adam Roberts authored
      
      ## What changes were proposed in this pull request?
This improvement works by running the fastest comparison test first. We observed a 1% throughput improvement on PageRank (HiBench large profile) with this change.

We profiled with tprof: before the change, `AppendOnlyMap.changeValue` (where the optimisation occurs) accounted for 8053 profiling ticks, representing 0.72% of the overall application time.

After the change, the method accounted for only 2786 ticks, or 0.25% of the overall time.
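
The principle can be sketched as follows (illustrative only, not the actual `AppendOnlyMap.changeValue` code): order the compound key comparison so that the cheap test short-circuits the expensive one.

```scala
// Reference equality is cheap and runs first; the potentially costly equals()
// only runs when the references differ.
def sameKey(curKey: AnyRef, k: AnyRef): Boolean =
  (k eq curKey) || k.equals(curKey)
```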
      
      ## How was this patch tested?
      Existing unit tests and for performance we used HiBench large, profiling with tprof and IBM Healthcenter.
      
      Author: Adam Roberts <aroberts@uk.ibm.com>
      
      Closes #15714 from a-roberts/patch-9.
      
      (cherry picked from commit a42d738c)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18200][GRAPHX][FOLLOW-UP] Support zero as an initial capacity in OpenHashSet · cfe76028
      Dongjoon Hyun authored
      
      ## What changes were proposed in this pull request?
      
      This is a follow-up PR of #15741 in order to keep `nextPowerOf2` consistent.
      
      **Before**
      ```
      nextPowerOf2(0) => 2
      nextPowerOf2(1) => 1
      nextPowerOf2(2) => 2
      nextPowerOf2(3) => 4
      nextPowerOf2(4) => 4
      nextPowerOf2(5) => 8
      ```
      
      **After**
      ```
      nextPowerOf2(0) => 1
      nextPowerOf2(1) => 1
      nextPowerOf2(2) => 2
      nextPowerOf2(3) => 4
      nextPowerOf2(4) => 4
      nextPowerOf2(5) => 8
      ```
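
A minimal implementation consistent with the "After" table (a sketch; the actual `OpenHashSet` helper may differ in details):

```scala
def nextPowerOf2(n: Int): Int = {
  if (n == 0) {
    1
  } else {
    val highBit = Integer.highestOneBit(n)
    if (highBit == n) n else highBit << 1 // round up to the next power of two
  }
}

// Matches the table above:
assert((0 to 5).map(nextPowerOf2) == Seq(1, 1, 2, 4, 4, 8))
```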
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15754 from dongjoon-hyun/SPARK-18200-2.
      
      (cherry picked from commit 27602c33)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-14393][SQL][DOC] update doc for python and R · 8e145a94
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      minor doc update that should go to master & branch-2.1
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #15747 from felixcheung/pySPARK-14393.
      
      (cherry picked from commit a08463b1)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
• [SPARK-18259][SQL] Do not capture Throwable in QueryExecution · 91d56715
      Herman van Hovell authored
      
      ## What changes were proposed in this pull request?
`QueryExecution.toString` currently catches `java.lang.Throwable`; this is far from best practice and can lead to confusing situations or invalid application states. This PR fixes this by only catching `AnalysisException`s.
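
A hedged sketch of the narrowed error handling (the real helper also formats the various plans; the name follows the description, not necessarily the source):

```scala
import org.apache.spark.sql.AnalysisException

// Only analysis failures are turned into a printable string; anything else propagates.
def stringOrError(section: => String): String =
  try section catch { case e: AnalysisException => e.toString }
```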
      
      ## How was this patch tested?
      Added a `QueryExecutionSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15760 from hvanhovell/SPARK-18259.
      
      (cherry picked from commit aa412c55)
Signed-off-by: Reynold Xin <rxin@databricks.com>
  5. Nov 03, 2016
• [SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6... · 37550c49
      Sean Owen authored
      [SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0
      
      ## What changes were proposed in this pull request?
      
      Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated in Spark 2.1.0. This does not actually implement any of the change in SPARK-18138, just peppers the documentation with notices about it.
      
      ## How was this patch tested?
      
      Doc build
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15733 from srowen/SPARK-18138.
      
      (cherry picked from commit dc4c6009)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18257][SS] Improve error reporting for FileStressSuite · af60b1eb
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      This patch improves error reporting for FileStressSuite, when there is an error in Spark itself (not user code). This works by simply tightening the exception verification, and gets rid of the unnecessary thread for starting the stream.
      
Also renamed the class to FileStreamStressSuite to make it more obvious it is a streaming suite.
      
      ## How was this patch tested?
      This is a test only change and I manually verified error reporting by injecting some bug in the addBatch code for FileStreamSink.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15757 from rxin/SPARK-18257.
      
      (cherry picked from commit f22954ad)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18212][SS][KAFKA] increase executor poll timeout · 2daca62c
      cody koeninger authored
      
      ## What changes were proposed in this pull request?
      
      Increase poll timeout to try and address flaky test
      
      ## How was this patch tested?
      
      Ran existing unit tests
      
      Author: cody koeninger <cody@koeninger.org>
      
      Closes #15737 from koeninger/SPARK-18212.
      
      (cherry picked from commit 67659c9a)
Signed-off-by: Michael Armbrust <michael@databricks.com>
• [SPARK-18099][YARN] Fail if same files added to distributed cache for --files and --archives · 569f77a1
      Kishor Patil authored
      ## What changes were proposed in this pull request?
      
During spark-submit, if the YARN dist cache is instructed to add the same file under both --files and --archives, this code change ensures the Spark YARN distributed cache behaviour is retained, i.e. it warns and fails when the same file is mentioned in both --files and --archives.
      ## How was this patch tested?
      
Manually tested:
1. If the same jar is mentioned in --jars and --files, the job is still submitted; the behaviour from [SPARK-14423] #12203 is unchanged.
2. If the same file is mentioned in --files and --archives, the job fails to submit.
      
      
      Author: Kishor Patil <kpatil@yahoo-inc.com>
      
      Closes #15627 from kishorvpatil/spark18099.
      
      (cherry picked from commit 098e4ca9)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
• [SPARK-18237][HIVE] hive.exec.stagingdir has no effect · 3e139e23
      福星 authored
      
`hive.exec.stagingdir` has no effect in Spark 2.0.1.
Hive confs in hive-site.xml are loaded into `hadoopConf`, so we should use `hadoopConf` in `InsertIntoHiveTable` instead of `SessionState.conf`.
      
      Author: 福星 <fuxing@wacai.com>
      
      Closes #15744 from ClassNotFoundExp/master.
      
      (cherry picked from commit 16293311)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18244][SQL] Rename partitionProviderIsHive -> tracksPartitionsInCatalog · 4f91630c
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      This patch renames partitionProviderIsHive to tracksPartitionsInCatalog, as the old name was too Hive specific.
      
      ## How was this patch tested?
      Should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15750 from rxin/SPARK-18244.
      
      (cherry picked from commit b17057c0)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-17981][SPARK-17957][SQL] Fix Incorrect Nullability Setting to False in FilterExec · c2876bfb
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      
When `FilterExec` contains `isNotNull`, which could be inferred and pushed down or specified by users, we set the nullability of the involved columns to false if the top-level expression is null-intolerant. However, this is not correct: if the top-level expression is not a leaf expression, it can still tolerate null when it has null-tolerant child expressions.

For example, consider `cast(coalesce(a#5, a#15) as double)`. Although `cast` is a null-intolerant expression, `coalesce` is obviously null-tolerant, so the whole expression can absorb null.
      
      When the nullability is wrong, we could generate incorrect results in different cases. For example,
      
      ``` Scala
          val df1 = Seq((1, 2), (2, 3)).toDF("a", "b")
          val df2 = Seq((2, 5), (3, 4)).toDF("a", "c")
          val joinedDf = df1.join(df2, Seq("a"), "outer").na.fill(0)
          val df3 = Seq((3, 1)).toDF("a", "d")
          joinedDf.join(df3, "a").show
      ```
      
      The optimized plan is like
      
      ```
      Project [a#29, b#30, c#31, d#42]
      +- Join Inner, (a#29 = a#41)
         :- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int) AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30, cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31]
         :  +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as int))
         :     +- Join FullOuter, (a#5 = a#15)
         :        :- LocalRelation [a#5, b#6]
         :        +- LocalRelation [a#15, c#16]
         +- LocalRelation [a#41, d#42]
      ```
      
      Without the fix, it returns an empty result. With the fix, it can return a correct answer:
      
      ```
      +---+---+---+---+
      |  a|  b|  c|  d|
      +---+---+---+---+
      |  3|  0|  4|  1|
      +---+---+---+---+
      ```
      ### How was this patch tested?
      
      Added test cases to verify the nullability changes in FilterExec. Also added a test case for verifying the reported incorrect result.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15523 from gatorsmile/nullabilityFilterExec.
      
      (cherry picked from commit 66a99f4a)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
• [SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark GBTClassifier · 99891e56
      Zheng RuiFeng authored
      
      ## What changes were proposed in this pull request?
      Add missing 'subsamplingRate' of pyspark GBTClassifier
      
      ## How was this patch tested?
      existing tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #15692 from zhengruifeng/gbt_subsamplingRate.
      
      (cherry picked from commit 9dc9f9a5)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
• [SQL] minor - internal doc improvement for InsertIntoTable. · 71104c9c
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      I was reading this part of the code and was really confused by the "partition" parameter. This patch adds some documentation for it to reduce confusion in the future.
      
      I also looked around other logical plans but most of them are either already documented, or pretty self-evident to people that know Spark SQL.
      
      ## How was this patch tested?
      N/A - doc change only.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15749 from rxin/doc-improvement.
      
      (cherry picked from commit 0ea5d5b2)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-18219] Move commit protocol API (internal) from sql/core to core module · bc7f05f5
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      This patch moves the new commit protocol API from sql/core to core module, so we can use it in the future in the RDD API.
      
As part of this patch, I also moved the specification of the random UUID for the write path out of the commit protocol, and instead pass in a job id.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15731 from rxin/SPARK-18219.
      
      (cherry picked from commit 937af592)
Signed-off-by: Reynold Xin <rxin@databricks.com>
• [SPARK-17122][SQL] support drop current database · c4c5328f
      Daoyuan Wang authored
      
      ## What changes were proposed in this pull request?
      
In Spark 1.6 and earlier, we could drop the database currently in use. In Spark 2.0, the native implementation prevents us from dropping the current database, which may break some old queries. This PR re-enables the feature.
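
An illustration of the re-enabled behaviour (database name is made up; assumes a `SparkSession` named `spark`):

```scala
spark.sql("CREATE DATABASE tmp_db")
spark.sql("USE tmp_db")
// In Spark 2.0 this failed because tmp_db is the current database; it is allowed again now.
spark.sql("DROP DATABASE tmp_db CASCADE")
```
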
      ## How was this patch tested?
      
      one new unit test in `SessionCatalogSuite`.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #15011 from adrian-wang/dropcurrent.
      
      (cherry picked from commit 96cc1b56)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
• [SPARK-18200][GRAPHX] Support zero as an initial capacity in OpenHashSet · 965c964c
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
[SPARK-18200](https://issues.apache.org/jira/browse/SPARK-18200) reports that Apache Spark 2.x raises `java.lang.IllegalArgumentException: requirement failed: Invalid initial capacity` while running `triangleCount`. The root cause is that `VertexSet`, a type alias of `OpenHashSet`, does not allow zero as an initial size. This PR loosens the restriction to allow zero.
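
The kind of GraphX job that hit the exception, sketched below (the graph contents are illustrative; assumes a `SparkContext` named `sc`):

```scala
import org.apache.spark.graphx.{Edge, Graph}

val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1), Edge(3L, 1L, 1)))
val graph = Graph.fromEdges(edges, defaultValue = 0)

// Previously this could fail with "Invalid initial capacity" inside VertexSet/OpenHashSet.
val triangles = graph.triangleCount().vertices.collect()
```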
      
      ## How was this patch tested?
      
      Pass the Jenkins test with a new test case in `OpenHashSetSuite`.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #15741 from dongjoon-hyun/SPARK-18200.
      
      (cherry picked from commit d24e7364)
Signed-off-by: Reynold Xin <rxin@databricks.com>
  6. Nov 02, 2016
• [SPARK-18175][SQL] Improve the test case coverage of implicit type casting · 2cf39d63
      gatorsmile authored
      ### What changes were proposed in this pull request?
      
So far, we have had limited test case coverage of implicit type casting. We need a matrix that covers all the possible casting pairs.
- Reorganized the existing test cases
- Added all the possible type casting pairs
- Drew a matrix to show the implicit type casting. The table is very wide and may be hard to review, so you can also access it via [a Google sheet](https://docs.google.com/spreadsheets/d/19PS4ikrs-Yye_mfu-rmIKYGnNe-NmOTt5DDT1fOD3pI/edit?usp=sharing).
      
      SourceType\CastToType | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | BooleanType | StringType | DateType | TimestampType | ArrayType | MapType | StructType | NullType | CalendarIntervalType | DecimalType | NumericType | IntegralType
      ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ |  -----------
      **ByteType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(3, 0) | ByteType | ByteType
      **ShortType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(5, 0) | ShortType | ShortType
      **IntegerType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(10, 0) | IntegerType | IntegerType
      **LongType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(20, 0) | LongType | LongType
      **DoubleType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(30, 15) | DoubleType | IntegerType
      **FloatType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(14, 7) | FloatType | IntegerType
      **Dec(10, 2)** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X    | X    | StringType | X    | X    | X    | X    | X    | X    | X    | DecimalType(10, 2) | Dec(10, 2) | IntegerType
      **BinaryType** | X    | X    | X    | X    | X    | X    | X    | BinaryType | X    | StringType | X    | X    | X    | X    | X    | X    | X    | X    | X    | X
      **BooleanType** | X    | X    | X    | X    | X    | X    | X    | X    | BooleanType | StringType | X    | X    | X    | X    | X    | X    | X    | X    | X    | X
      **StringType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | X    | StringType | DateType | TimestampType | X    | X    | X    | X    | X    | DecimalType(38, 18) | DoubleType | X
      **DateType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | StringType | DateType | TimestampType | X    | X    | X    | X    | X    | X    | X    | X
      **TimestampType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | StringType | DateType | TimestampType | X    | X    | X    | X    | X    | X    | X    | X
      **ArrayType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | ArrayType* | X    | X    | X    | X    | X    | X    | X
      **MapType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | MapType* | X    | X    | X    | X    | X    | X
      **StructType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | StructType* | X    | X    | X    | X    | X
      **NullType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | BooleanType | StringType | DateType | TimestampType | ArrayType | MapType | StructType | NullType | CalendarIntervalType | DecimalType(38, 18) | DoubleType | IntegerType
      **CalendarIntervalType** | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | X    | CalendarIntervalType | X    | X    | X
      Note: ArrayType\*, MapType\*, StructType\* are castable only when the internal child types also match; otherwise, not castable
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15691 from gatorsmile/implicitTypeCasting.
      
      (cherry picked from commit 9ddec863)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
• [SPARK-17963][SQL][DOCUMENTATION] Add examples (extend) in each expression and... · 1e29f0a0
      hyukjinkwon authored
      [SPARK-17963][SQL][DOCUMENTATION] Add examples (extend) in each expression and improve documentation
      
      ## What changes were proposed in this pull request?
      
This PR proposes to change the documentation for functions. Please refer to the discussion at https://github.com/apache/spark/pull/15513
      
      
      
      The changes include
      - Re-indent the documentation
- Add examples/arguments in `extended` where there are multiple arguments or a specific format is required (e.g. xml/json).
      
For example, the documentation was updated as below:
      ### Functions with single line usage
      
      **Before**
      - `pow`
      
        ``` sql
        Usage: pow(x1, x2) - Raise x1 to the power of x2.
        Extended Usage:
        > SELECT pow(2, 3);
         8.0
        ```
      - `current_timestamp`
      
        ``` sql
        Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
        Extended Usage:
        No example for current_timestamp.
        ```
      
      **After**
      - `pow`
      
        ``` sql
        Usage: pow(expr1, expr2) - Raises `expr1` to the power of `expr2`.
        Extended Usage:
            Examples:
              > SELECT pow(2, 3);
               8.0
        ```
      
      - `current_timestamp`
      
        ``` sql
        Usage: current_timestamp() - Returns the current timestamp at the start of query evaluation.
        Extended Usage:
            No example/argument for current_timestamp.
        ```
      ### Functions with (already) multiple line usage
      
      **Before**
      - `approx_count_distinct`
      
        ``` sql
        Usage: approx_count_distinct(expr) - Returns the estimated cardinality by HyperLogLog++.
            approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated cardinality by HyperLogLog++
              with relativeSD, the maximum estimation error allowed.
      
        Extended Usage:
        No example for approx_count_distinct.
        ```
      - `percentile_approx`
      
        ``` sql
        Usage:
              percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
              column `col` at the given percentage. The value of percentage must be between 0.0
              and 1.0. The `accuracy` parameter (default: 10000) is a positive integer literal which
              controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
              better accuracy, `1.0/accuracy` is the relative error of the approximation.
      
              percentile_approx(col, array(percentage1 [, percentage2]...) [, accuracy]) - Returns the approximate
              percentile array of column `col` at the given percentage array. Each value of the
              percentage array must be between 0.0 and 1.0. The `accuracy` parameter (default: 10000) is
              a positive integer literal which controls approximation accuracy at the cost of memory.
              Higher value of `accuracy` yields better accuracy, `1.0/accuracy` is the relative error of
              the approximation.
      
        Extended Usage:
        No example for percentile_approx.
        ```
      
      **After**
      - `approx_count_distinct`
      
        ``` sql
        Usage:
            approx_count_distinct(expr[, relativeSD]) - Returns the estimated cardinality by HyperLogLog++.
              `relativeSD` defines the maximum estimation error allowed.
      
        Extended Usage:
            No example/argument for approx_count_distinct.
        ```
      
      - `percentile_approx`
      
        ``` sql
        Usage:
            percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
              column `col` at the given percentage. The value of percentage must be between 0.0
              and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
              controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
              better accuracy, `1.0/accuracy` is the relative error of the approximation.
              When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
              In this case, returns the approximate percentile array of column `col` at the given
              percentage array.
      
        Extended Usage:
            Examples:
              > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
               [10.0,10.0,10.0]
              > SELECT percentile_approx(10.0, 0.5, 100);
               10.0
        ```
      ## How was this patch tested?
      
      Manually tested
      
      **When examples are multiple**
      
      ``` sql
      spark-sql> describe function extended reflect;
      Function: reflect
      Class: org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
      Usage: reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with reflection.
      Extended Usage:
          Examples:
            > SELECT reflect('java.util.UUID', 'randomUUID');
             c33fb387-8500-4bfa-81d2-6e0e3e930df2
            > SELECT reflect('java.util.UUID', 'fromString', 'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
             a5cf6c42-0c85-418f-af6c-3e4e5b1328f2
      ```
      
      **When `Usage` is in single line**
      
      ``` sql
      spark-sql> describe function extended min;
      Function: min
      Class: org.apache.spark.sql.catalyst.expressions.aggregate.Min
      Usage: min(expr) - Returns the minimum value of `expr`.
      Extended Usage:
          No example/argument for min.
      ```
      
      **When `Usage` is already in multiple lines**
      
      ``` sql
      spark-sql> describe function extended percentile_approx;
      Function: percentile_approx
      Class: org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
      Usage:
          percentile_approx(col, percentage [, accuracy]) - Returns the approximate percentile value of numeric
            column `col` at the given percentage. The value of percentage must be between 0.0
            and 1.0. The `accuracy` parameter (default: 10000) is a positive numeric literal which
            controls approximation accuracy at the cost of memory. Higher value of `accuracy` yields
            better accuracy, `1.0/accuracy` is the relative error of the approximation.
            When `percentage` is an array, each value of the percentage array must be between 0.0 and 1.0.
            In this case, returns the approximate percentile array of column `col` at the given
            percentage array.
      
      Extended Usage:
          Examples:
            > SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
             [10.0,10.0,10.0]
            > SELECT percentile_approx(10.0, 0.5, 100);
             10.0
      ```
      
      **When example/argument is missing**
      
      ``` sql
      spark-sql> describe function extended rank;
      Function: rank
      Class: org.apache.spark.sql.catalyst.expressions.Rank
      Usage:
          rank() - Computes the rank of a value in a group of values. The result is one plus the number
            of rows preceding or equal to the current row in the ordering of the partition. The values
            will produce gaps in the sequence.
      
      Extended Usage:
          No example/argument for rank.
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15677 from HyukjinKwon/SPARK-17963-1.
      
      (cherry picked from commit 7eb2ca8e)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
• [SPARK-17470][SQL] unify path for data source table and locationUri for hive serde table · 5ea2f9e5
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
Due to a limitation of the Hive metastore (a table location must be a directory path, not a file path), we always store the `path` of a data source table in its storage properties instead of the `locationUri` field. However, we should not expose this difference at the `CatalogTable` level; we should just treat it as a hack inside `HiveExternalCatalog`, like we store the table schema of a data source table in table properties.
      
      This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`, both data source table and hive serde table should use the `locationUri` field.
      
This PR also unifies the way we handle the default table location for managed tables. Previously, the default table location of a hive serde managed table was set by the external catalog, while that of a data source table was set by the command. After this PR, we follow the hive way and the default table location is always set by the external catalog.
      
      For managed non-file-based tables, we will assign a default table location and create an empty directory for it, the table location will be removed when the table is dropped. This is reasonable as metastore doesn't care about whether a table is file-based or not, and an empty table directory has no harm.
      For external non-file-based tables, ideally we can omit the table location, but due to a hive metastore issue, we will assign a random location to it, and remove it right after the table is created. See SPARK-15269 for more details. This is fine as it's well isolated in `HiveExternalCatalog`.
      
      To keep the existing behaviour of the `path` option, in this PR we always add the `locationUri` to storage properties using key `path`, before passing storage properties to `DataSource` as data source options.
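
A hedged illustration of the preserved `path` option behaviour (location and table name are examples, not from the patch):

```scala
// The `path` option works as before; internally it is now backed by locationUri.
spark.range(10).write
  .format("parquet")
  .option("path", "/tmp/events")
  .saveAsTable("events")
```
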
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15024 from cloud-fan/path.
      
      (cherry picked from commit 3a1bc6f4)
Signed-off-by: Yin Huai <yhuai@databricks.com>
• [SPARK-18214][SQL] Simplify RuntimeReplaceable type coercion · 2aff2ea8
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      RuntimeReplaceable is used to create aliases for expressions, but the way it deals with type coercion is pretty weird (each expression is responsible for how to handle type coercion, which does not obey the normal implicit type cast rules).
      
      This patch simplifies its handling by allowing the analyzer to traverse into the actual expression of a RuntimeReplaceable.
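
For context, `nvl` is one such alias: it parses as its own expression but is replaced by `coalesce` at runtime, so the analyzer can now apply the normal implicit casts to the underlying expression (example is illustrative):

```scala
spark.sql("SELECT nvl(NULL, 2.5), nvl('a', 'b')").show()
spark.sql("EXPLAIN SELECT nvl(NULL, 2.5)").show(truncate = false)
```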
      
      ## How was this patch tested?
      - Correctness should be guaranteed by existing unit tests already
- Removed SQLCompatibilityFunctionSuite and moved its tests to sql-compatibility-functions.sql
      - Added a new test case in sql-compatibility-functions.sql for verifying explain behavior.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15723 from rxin/SPARK-18214.
      
      (cherry picked from commit fd90541c)
Signed-off-by: Reynold Xin <rxin@databricks.com>