Skip to content
Snippets Groups Projects
  1. Feb 16, 2017
  2. Feb 15, 2017
    • Kevin Yu's avatar
      [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN subquery 4th batch · 8487902a
      Kevin Yu authored
      ## What changes were proposed in this pull request?
      
      This is 4th batch of test case for IN/NOT IN subquery. In this PR, it has these test files:
      
      `in-set-operations.sql`
      `in-with-cte.sql`
      `not-in-joins.sql`
      
      Here are the queries and results from running on DB2.
      
      [in-set-operations DB2 version](https://github.com/apache/spark/files/772846/in-set-operations.sql.db2.txt)
      [Output of in-set-operations](https://github.com/apache/spark/files/772848/in-set-operations.sql.db2.out.txt)
      [in-with-cte DB2 version](https://github.com/apache/spark/files/772849/in-with-cte.sql.db2.txt)
      [Output of in-with-cte](https://github.com/apache/spark/files/772856/in-with-cte.sql.db2.out.txt)
      [not-in-joins DB2 version](https://github.com/apache/spark/files/772851/not-in-joins.sql.db2.txt)
      [Output of not-in-joins](https://github.com/apache/spark/files/772852/not-in-joins.sql.db2.out.txt)
      
      ## How was this patch tested?
      
      This pr is adding new test cases. We compare the result from spark with the result from another RDBMS(We used DB2 LUW). If the results are the same, we assume the result is correct.
      
      Author: Kevin Yu <qyu@us.ibm.com>
      
      Closes #16915 from kevinyu98/spark-18871-44.
      8487902a
    • Shixiong Zhu's avatar
      [SPARK-19603][SS] Fix StreamingQuery explain command · fc02ef95
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `StreamingQuery.explain` doesn't show the correct streaming physical plan right now because `ExplainCommand` receives a runtime batch plan and its `logicalPlan.isStreaming` is always false.
      
      This PR adds `streaming` parameter to `ExplainCommand` to allow `StreamExecution` to specify that it's a streaming plan.
      
      Examples of the explain outputs:
      
      - streaming DataFrame.explain()
      ```
      == Physical Plan ==
      *HashAggregate(keys=[value#518], functions=[count(1)])
      +- StateStoreSave [value#518], OperatorStateId(<unknown>,0,0), Append, 0
         +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
            +- StateStoreRestore [value#518], OperatorStateId(<unknown>,0,0)
               +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
                  +- Exchange hashpartitioning(value#518, 5)
                     +- *HashAggregate(keys=[value#518], functions=[partial_count(1)])
                        +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                           +- *MapElements <function1>, obj#517: java.lang.String
                              +- *DeserializeToObject value#513.toString, obj#516: java.lang.String
                                 +- StreamingRelation MemoryStream[value#513], [value#513]
      ```
      
      - StreamingQuery.explain(extended = false)
      ```
      == Physical Plan ==
      *HashAggregate(keys=[value#518], functions=[count(1)])
      +- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0
         +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
            +- StateStoreRestore [value#518], OperatorStateId(...,0,0)
               +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
                  +- Exchange hashpartitioning(value#518, 5)
                     +- *HashAggregate(keys=[value#518], functions=[partial_count(1)])
                        +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                           +- *MapElements <function1>, obj#517: java.lang.String
                              +- *DeserializeToObject value#543.toString, obj#516: java.lang.String
                                 +- LocalTableScan [value#543]
      ```
      
      - StreamingQuery.explain(extended = true)
      ```
      == Parsed Logical Plan ==
      Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
         +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
            +- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String
               +- LocalRelation [value#543]
      
      == Analyzed Logical Plan ==
      value: string, count(1): bigint
      Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
         +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
            +- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String
               +- LocalRelation [value#543]
      
      == Optimized Logical Plan ==
      Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
         +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
            +- DeserializeToObject value#543.toString, obj#516: java.lang.String
               +- LocalRelation [value#543]
      
      == Physical Plan ==
      *HashAggregate(keys=[value#518], functions=[count(1)], output=[value#518, count(1)#524L])
      +- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0
         +- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L])
            +- StateStoreRestore [value#518], OperatorStateId(...,0,0)
               +- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L])
                  +- Exchange hashpartitioning(value#518, 5)
                     +- *HashAggregate(keys=[value#518], functions=[partial_count(1)], output=[value#518, count#530L])
                        +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                           +- *MapElements <function1>, obj#517: java.lang.String
                              +- *DeserializeToObject value#543.toString, obj#516: java.lang.String
                                 +- LocalTableScan [value#543]
      ```
      
      ## How was this patch tested?
      
      The updated unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16934 from zsxwing/SPARK-19603.
      fc02ef95
    • Yun Ni's avatar
      [SPARK-18080][ML][PYTHON] Python API & Examples for Locality Sensitive Hashing · 08c1972a
      Yun Ni authored
      ## What changes were proposed in this pull request?
      This pull request includes python API and examples for LSH. The API changes was based on yanboliang 's PR #15768 and resolved conflicts and API changes on the Scala API. The examples are consistent with Scala examples of MinHashLSH and BucketedRandomProjectionLSH.
      
      ## How was this patch tested?
      API and examples are tested using spark-submit:
      `bin/spark-submit examples/src/main/python/ml/min_hash_lsh.py`
      `bin/spark-submit examples/src/main/python/ml/bucketed_random_projection_lsh.py`
      
      User guide changes are generated and manually inspected:
      `SKIP_API=1 jekyll build`
      
      Author: Yun Ni <yunn@uber.com>
      Author: Yanbo Liang <ybliang8@gmail.com>
      Author: Yunni <Euler57721@gmail.com>
      
      Closes #16715 from Yunni/spark-18080.
      08c1972a
    • Shixiong Zhu's avatar
      [SPARK-19599][SS] Clean up HDFSMetadataLog · 21b4ba2d
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog.
      
      This PR includes the following changes:
      - ~~Remove the workaround codes for HADOOP-10622.~~ Unfortunately, there is another issue [HADOOP-14084](https://issues.apache.org/jira/browse/HADOOP-14084) that prevents us from removing the workaround codes.
      - Remove unnecessary `writer: (T, OutputStream) => Unit` and just call `serialize` directly.
      - Remove catching FileNotFoundException.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16932 from zsxwing/metadata-cleanup.
      21b4ba2d
    • Yin Huai's avatar
      [SPARK-19604][TESTS] Log the start of every Python test · f6c3bba2
      Yin Huai authored
      ## What changes were proposed in this pull request?
      Right now, we only have info level log after we finish the tests of a Python test file. We should also log the start of a test. So, if a test is hanging, we can tell which test file is running.
      
      ## How was this patch tested?
      This is a change for python tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #16935 from yhuai/SPARK-19604.
      f6c3bba2
    • Takuya UESHIN's avatar
      [SPARK-18937][SQL] Timezone support in CSV/JSON parsing · 865b2fd8
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up pr of #16308.
      
      This pr enables timezone support in CSV/JSON parsing.
      
      We should introduce `timeZone` option for CSV/JSON datasources (the default value of the option is session local timezone).
      
      The datasources should use the `timeZone` option to format/parse to write/read timestamp values.
      Notice that while reading, if the timestampFormat has the timezone info, the timezone will not be used because we should respect the timezone in the values.
      
      For example, if you have timestamp `"2016-01-01 00:00:00"` in `GMT`, the values written with the default timezone option, which is `"GMT"` because session local timezone is `"GMT"` here, are:
      
      ```scala
      scala> spark.conf.set("spark.sql.session.timeZone", "GMT")
      
      scala> val df = Seq(new java.sql.Timestamp(1451606400000L)).toDF("ts")
      df: org.apache.spark.sql.DataFrame = [ts: timestamp]
      
      scala> df.show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      
      scala> df.write.json("/path/to/gmtjson")
      ```
      
      ```sh
      $ cat /path/to/gmtjson/part-*
      {"ts":"2016-01-01T00:00:00.000Z"}
      ```
      
      whereas setting the option to `"PST"`, they are:
      
      ```scala
      scala> df.write.option("timeZone", "PST").json("/path/to/pstjson")
      ```
      
      ```sh
      $ cat /path/to/pstjson/part-*
      {"ts":"2015-12-31T16:00:00.000-08:00"}
      ```
      
      We can properly read these files even if the timezone option is wrong because the timestamp values have timezone info:
      
      ```scala
      scala> val schema = new StructType().add("ts", TimestampType)
      schema: org.apache.spark.sql.types.StructType = StructType(StructField(ts,TimestampType,true))
      
      scala> spark.read.schema(schema).json("/path/to/gmtjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      
      scala> spark.read.schema(schema).option("timeZone", "PST").json("/path/to/gmtjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      ```
      
      And even if `timezoneFormat` doesn't contain timezone info, we can properly read the values with setting correct timezone option:
      
      ```scala
      scala> df.write.option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson")
      ```
      
      ```sh
      $ cat /path/to/jstjson/part-*
      {"ts":"2016-01-01T09:00:00"}
      ```
      
      ```scala
      // wrong result
      scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").json("/path/to/jstjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 09:00:00|
      +-------------------+
      
      // correct result
      scala> spark.read.schema(schema).option("timestampFormat", "yyyy-MM-dd'T'HH:mm:ss").option("timeZone", "JST").json("/path/to/jstjson").show()
      +-------------------+
      |ts                 |
      +-------------------+
      |2016-01-01 00:00:00|
      +-------------------+
      ```
      
      This pr also makes `JsonToStruct` and `StructToJson` `TimeZoneAwareExpression` to be able to evaluate values with timezone option.
      
      ## How was this patch tested?
      
      Existing tests and added some tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #16750 from ueshin/issues/SPARK-18937.
      865b2fd8
    • windpiger's avatar
      [SPARK-19329][SQL] Reading from or writing to a datasource table with a non... · 6a9a85b8
      windpiger authored
      [SPARK-19329][SQL] Reading from or writing to a datasource table with a non pre-existing location should succeed
      
      ## What changes were proposed in this pull request?
      
      when we insert data into a datasource table use `sqlText`, and the table has an not exists location,
      this will throw an Exception.
      
      example:
      
      ```
      spark.sql("create table t(a string, b int) using parquet")
      spark.sql("alter table t set location '/xx'")
      spark.sql("insert into table t select 'c', 1")
      ```
      
      Exception:
      ```
      com.google.common.util.concurrent.UncheckedExecutionException: org.apache.spark.sql.AnalysisException: Path does not exist: /xx;
      at com.google.common.cache.LocalCache$LocalLoadingCache.getUnchecked(LocalCache.java:4814)
      at com.google.common.cache.LocalCache$LocalLoadingCache.apply(LocalCache.java:4830)
      at org.apache.spark.sql.hive.HiveMetastoreCatalog.lookupRelation(HiveMetastoreCatalog.scala:122)
      at org.apache.spark.sql.hive.HiveSessionCatalog.lookupRelation(HiveSessionCatalog.scala:69)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveRelations$$lookupTableFromCatalog(Analyzer.scala:456)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:465)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$$anonfun$apply$8.applyOrElse(Analyzer.scala:463)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolveOperators$1.apply(LogicalPlan.scala:61)
      at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:70)
      at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:60)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:463)
      at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations$.apply(Analyzer.scala:453)
      ```
      
      As discussed following comments, we should unify the action when we reading from or writing to a datasource table with a non pre-existing locaiton:
      
      1. reading from a datasource table: return 0 rows
      2. writing to a datasource table:  write data successfully
      
      ## How was this patch tested?
      unit test added
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #16672 from windpiger/insertNotExistLocation.
      6a9a85b8
    • Dongjoon Hyun's avatar
      [SPARK-19607][HOTFIX] Finding QueryExecution that matches provided executionId · 59dc26e3
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      #16940 adds a test case which does not stop the spark job. It causes many failures of other test cases.
      
      - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7/2403/consoleFull
      - https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/2600/consoleFull
      
      ```
      [info]   org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243). To ignore this error, set spark.driver.allowMultipleContexts = true. The currently running SparkContext was created at:
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16943 from dongjoon-hyun/SPARK-19607-2.
      59dc26e3
    • jiangxingbo's avatar
      [SPARK-19331][SQL][TESTS] Improve the test coverage of SQLViewSuite · 3755da76
      jiangxingbo authored
      Move `SQLViewSuite` from `sql/hive` to `sql/core`, so we can test the view supports without hive metastore. Also moved the test cases that specified to hive to `HiveSQLViewSuite`.
      
      Improve the test coverage of SQLViewSuite, cover the following cases:
      1. view resolution(possibly a referenced table/view have changed after the view creation);
      2. handle a view with user specified column names;
      3. improve the test cases for a nested view.
      
      Also added a test case for cyclic view reference, which is a known issue that is not fixed yet.
      
      N/A
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #16674 from jiangxb1987/view-test.
      3755da76
    • Felix Cheung's avatar
      [SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column · 671bc08e
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Add coalesce on DataFrame for down partitioning without shuffle and coalesce on Column
      
      ## How was this patch tested?
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16739 from felixcheung/rcoalesce.
      671bc08e
    • zero323's avatar
      [SPARK-19160][PYTHON][SQL] Add udf decorator · c97f4e17
      zero323 authored
      ## What changes were proposed in this pull request?
      
      This PR adds `udf` decorator syntax as proposed in [SPARK-19160](https://issues.apache.org/jira/browse/SPARK-19160).
      
      This allows users to define UDF using simplified syntax:
      
      ```python
      from pyspark.sql.decorators import udf
      
      udf(IntegerType())
      def add_one(x):
          """Adds one"""
          if x is not None:
              return x + 1
       ```
      
      without need to define a separate function and udf.
      
      ## How was this patch tested?
      
      Existing unit tests to ensure backward compatibility and additional unit tests covering new functionality.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16533 from zero323/SPARK-19160.
      c97f4e17
    • VinceShieh's avatar
      [SPARK-19590][PYSPARK][ML] Update the document for QuantileDiscretizer in pyspark · 6eca21ba
      VinceShieh authored
      ## What changes were proposed in this pull request?
      This PR is to document the changes on QuantileDiscretizer in pyspark for PR:
      https://github.com/apache/spark/pull/15428
      
      ## How was this patch tested?
      No test needed
      
      Signed-off-by: VinceShieh <vincent.xieintel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #16922 from VinceShieh/spark-19590.
      6eca21ba
    • Liang-Chi Hsieh's avatar
      [SPARK-16475][SQL] broadcast hint for SQL queries - disallow space as the delimiter · acf71c63
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      A follow-up to disallow space as the delimiter in broadcast hint.
      
      ## How was this patch tested?
      
      Jenkins test.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16941 from viirya/disallow-space-delimiter.
      acf71c63
    • Dilip Biswal's avatar
      [SPARK-18872][SQL][TESTS] New test cases for EXISTS subquery (Joins + CTE) · a8a13982
      Dilip Biswal authored
      ## What changes were proposed in this pull request?
      
      This PR adds the third and final set of tests for EXISTS subquery.
      
      File name                        | Brief description
      ------------------------| -----------------
      exists-cte.sql              |Tests Exist subqueries referencing CTE
      exists-joins-and-set-ops.sql|Tests Exists subquery used in Joins (Both when joins occurs in outer and suquery blocks)
      
      DB2 results are attached here as reference :
      
      [exists-cte-db2.txt](https://github.com/apache/spark/files/752091/exists-cte-db2.txt)
      [exists-joins-and-set-ops-db2.txt](https://github.com/apache/spark/files/753283/exists-joins-and-set-ops-db2.txt) (updated)
      
      ## How was this patch tested?
      The test result is compared with the result run from another SQL engine (in this case is IBM DB2). If the result are equivalent, we assume the result is correct.
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #16802 from dilipbiswal/exists-pr3.
      a8a13982
    • Nattavut Sutyanyong's avatar
      [SPARK-18873][SQL][TEST] New test cases for scalar subquery (part 2 of 2) -... · 5ad10c53
      Nattavut Sutyanyong authored
      [SPARK-18873][SQL][TEST] New test cases for scalar subquery (part 2 of 2) - scalar subquery in predicate context
      
      ## What changes were proposed in this pull request?
      This PR adds new test cases for scalar subquery in predicate context
      
      ## How was this patch tested?
      The test result is compared with the result run from another SQL engine (in this case is IBM DB2). If the result are equivalent, we assume the result is correct.
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #16798 from nsyca/18873-2.
      5ad10c53
    • Kevin Yu's avatar
      [SPARK-18871][SQL][TESTS] New test cases for IN/NOT IN subquery 2nd batch · d22db627
      Kevin Yu authored
      ## What changes were proposed in this pull request?
      
      This is 2nd batch of test case for IN/NOT IN subquery.  In this PR, it has these test cases:
      `in-limit.sql`
      `in-order-by.sql`
      `not-in-group-by.sql`
      
      These are the queries and results from running on DB2.
      [in-limit DB2 version](https://github.com/apache/spark/files/743267/in-limit.sql.db2.out.txt)
      [in-order-by DB2 version](https://github.com/apache/spark/files/743269/in-order-by.sql.db2.txt)
      [not-in-group-by DB2 version](https://github.com/apache/spark/files/743271/not-in-group-by.sql.db2.txt)
      [output of in-limit.sql DB2](https://github.com/apache/spark/files/743276/in-limit.sql.db2.out.txt)
      [output of in-order-by.sql DB2](https://github.com/apache/spark/files/743278/in-order-by.sql.db2.out.txt)
      [output of not-in-group-by.sql DB2](https://github.com/apache/spark/files/743279/not-in-group-by.sql.db2.out.txt)
      
      ## How was this patch tested?
      
      This pr is adding new test cases.
      
      Author: Kevin Yu <qyu@us.ibm.com>
      
      Closes #16759 from kevinyu98/spark-18871-2.
      d22db627
    • Zhenhua Wang's avatar
      [SPARK-17076][SQL] Cardinality estimation for join based on basic column statistics · 601b9c3e
      Zhenhua Wang authored
      ## What changes were proposed in this pull request?
      
      Support cardinality estimation and stats propagation for all join types.
      
      Limitations:
      - For inner/outer joins without any equal condition, we estimate it like cartesian product.
      - For left semi/anti joins, since we can't apply the heuristics for inner join to it, for now we just propagate the statistics from left side. We should support them when other advanced stats (e.g. histograms) are available in spark.
      
      ## How was this patch tested?
      
      Add a new test suite.
      
      Author: Zhenhua Wang <wzh_zju@163.com>
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #16228 from wzhfy/joinEstimate.
      601b9c3e
    • Wenchen Fan's avatar
      [SPARK-19587][SQL] bucket sorting columns should not be picked from partition columns · 8b75f8c1
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      We will throw an exception if bucket columns are part of partition columns, this should also apply to sort columns.
      
      This PR also move the checking logic from `DataFrameWriter` to `PreprocessTableCreation`, which is the central place for checking and normailization.
      
      ## How was this patch tested?
      
      updated test.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #16931 from cloud-fan/bucket.
      8b75f8c1
    • Reynold Xin's avatar
      [SPARK-16475][SQL] broadcast hint for SQL queries - follow up · 733c59ec
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      A small update to https://github.com/apache/spark/pull/16925
      
      1. Rename SubstituteHints -> ResolveHints to be more consistent with rest of the rules.
      2. Added more documentation in the rule and be more defensive / future proof to skip views as well as CTEs.
      
      ## How was this patch tested?
      This pull request contains no real logic change and all behavior should be covered by existing tests.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #16939 from rxin/SPARK-16475.
      733c59ec
    • Ala Luszczak's avatar
      [SPARK-19607] Finding QueryExecution that matches provided executionId · b55563c1
      Ala Luszczak authored
      ## What changes were proposed in this pull request?
      
      Implementing a mapping between executionId and corresponding QueryExecution in SQLExecution.
      
      ## How was this patch tested?
      
      Adds a unit test.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #16940 from ala/execution-id.
      b55563c1
    • wm624@hotmail.com's avatar
      [SPARK-19456][SPARKR] Add LinearSVC R API · 3973403d
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      Linear SVM classifier is newly added into ML and python API has been added. This JIRA is to add R side API.
      
      Marked as WIP, as I am designing unit tests.
      
      ## How was this patch tested?
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16800 from wangmiao1981/svc.
      3973403d
  3. Feb 14, 2017
    • Tyson Condie's avatar
      [SPARK-19584][SS][DOCS] update structured streaming documentation around batch mode · 447b2b53
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      Revision to structured-streaming-kafka-integration.md to reflect new Batch query specification and options.
      
      zsxwing tdas
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      
      Closes #16918 from tcondie/kafka-docs.
      447b2b53
    • sureshthalamati's avatar
      [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the... · f48c5a57
      sureshthalamati authored
      [SPARK-19318][SQL] Fix to treat JDBC connection properties specified by the user in case-sensitive manner.
      
      ## What changes were proposed in this pull request?
      The reason for test failure is that the property “oracle.jdbc.mapDateToTimestamp” set by the test was getting converted into all lower case. Oracle database expects this property in case-sensitive manner.
      
      This test was passing in previous releases because connection properties were sent as user specified for the test case scenario. Fixes to handle all option uniformly in case-insensitive manner, converted the JDBC connection properties also to lower case.
      
      This PR  enhances CaseInsensitiveMap to keep track of input case-sensitive keys , and uses those when creating connection properties that are passed to the JDBC connection.
      
      Alternative approach PR https://github.com/apache/spark/pull/16847  is to pass original input keys to JDBC data source by adding check in the  Data source class and handle case-insensitivity in the JDBC source code.
      
      ## How was this patch tested?
      Added new test cases to JdbcSuite , and OracleIntegrationSuite. Ran docker integration tests passed on my laptop, all tests passed successfully.
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #16891 from sureshthalamati/jdbc_case_senstivity_props_fix-SPARK-19318.
      f48c5a57
    • Reynold Xin's avatar
      [SPARK-16475][SQL] Broadcast hint for SQL Queries · da7aef7a
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This pull request introduces a simple hint infrastructure to SQL and implements broadcast join hint using the infrastructure.
      
      The hint syntax looks like the following:
      ```
      SELECT /*+ BROADCAST(t) */ * FROM t
      ```
      
      For broadcast hint, we accept "BROADCAST", "BROADCASTJOIN", and "MAPJOIN", and a sequence of relation aliases can be specified in the hint. A broadcast hint plan node will be inserted on top of any relation (that is not aliased differently), subquery, or common table expression that match the specified name.
      
      The hint resolution works by recursively traversing down the query plan to find a relation or subquery that matches one of the specified broadcast aliases. The traversal does not go past beyond any existing broadcast hints, subquery aliases. This rule happens before common table expressions.
      
      Note that there was an earlier patch in https://github.com/apache/spark/pull/14426. This is a rewrite of that patch, with different semantics and simpler test cases.
      
      ## How was this patch tested?
      Added a new unit test suite for the broadcast hint rule (SubstituteHintsSuite) and new test cases for parser change (in PlanParserSuite). Also added end-to-end test case in BroadcastSuite.
      
      Author: Reynold Xin <rxin@databricks.com>
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #16925 from rxin/SPARK-16475-broadcast-hint.
      da7aef7a
    • Felix Cheung's avatar
      [SPARK-19387][SPARKR] Tests do not run with SparkR source package in CRAN check · a3626ca3
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      - this is cause by changes in SPARK-18444, SPARK-18643 that we no longer install Spark when `master = ""` (default), but also related to SPARK-18449 since the real `master` value is not known at the time the R code in `sparkR.session` is run. (`master` cannot default to "local" since it could be overridden by spark-submit commandline or spark config)
      - as a result, while running SparkR as a package in IDE is working fine, CRAN check is not as it is launching it via non-interactive script
      - fix is to add check to the beginning of each test and vignettes; the same would also work by changing `sparkR.session()` to `sparkR.session(master = "local")` in tests, but I think being more explicit is better.
      
      ## How was this patch tested?
      
      Tested this by reverting version to 2.1, since it needs to download the release jar with matching version. But since there are changes in 2.2 (specifically around SparkR ML) that are incompatible with 2.1, some tests are failing in this config. Will need to port this to branch-2.1 and retest with 2.1 release jar.
      
      manually as:
      ```
      # modify DESCRIPTION to revert version to 2.1.0
      SPARK_HOME=/usr/spark R CMD build pkg
      # run cran check without SPARK_HOME
      R CMD check --as-cran SparkR_2.1.0.tar.gz
      ```
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16720 from felixcheung/rcranchecktest.
      a3626ca3
    • Jong Wook Kim's avatar
      [SPARK-19501][YARN] Reduce the number of HDFS RPCs during YARN deployment · ab9872db
      Jong Wook Kim authored
      ## What changes were proposed in this pull request?
      
      As discussed in [JIRA](https://issues.apache.org/jira/browse/SPARK-19501), this patch addresses the problem where too many HDFS RPCs are made when there are many URIs specified in `spark.yarn.jars`, potentially adding hundreds of RTTs to YARN before the application launches. This becomes significant when submitting the application to a non-local YARN cluster (where the RTT may be in order of 100ms, for example). For each URI specified, the current implementation makes at least two HDFS RPCs, for:
      
      - [Calling `getFileStatus()` before uploading each file to the distributed cache in `ClientDistributedCacheManager.addResource()`](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71).
      - [Resolving any symbolic links in each of the file URI](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L377-L379), which repeatedly makes HDFS RPCs until the all symlinks are resolved. (see [`FileContext.resolve(Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java#L2189-L2195), [`FSLinkResolver.resolve(FileContext, Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSLinkResolver.java#L79-L112), and [`AbstractFileSystem.resolvePath()`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/AbstractFileSystem.java#L464-L468).)
      
      The first `getFileStatus` RPC can be removed, using `statCache` populated with the file statuses retrieved with [the previous `globStatus` call](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L531).
      
      The second one can be largely reduced by caching the symlink resolution results in a mutable.HashMap. This patch adds a local variable in `yarn.Client.prepareLocalResources()` and passes it as an additional parameter to `yarn.Client.copyFileToRemote`.  [The symlink resolution code was added in 2013](https://github.com/apache/spark/commit/a35472e1dd2ea1b5a0b1fb6b382f5a98f5aeba5a#diff-b050df3f55b82065803d6e83453b9706R187) and has not changed since. I am assuming that this is still required, but otherwise we can remove using `symlinkCache` and symlink resolution altogether.
      
      ## How was this patch tested?
      
      This patch is based off 8e8afb3a, currently the latest YARN patch on master. All tests except a few in spark-hive passed with `./dev/run-tests` on my machine, using JDK 1.8.0_112 on macOS 10.12.3; also tested myself with this modified version of SPARK 2.2.0-SNAPSHOT which performed a normal deployment and execution on a YARN cluster without errors.
      
      Author: Jong Wook Kim <jongwook@nyu.edu>
      
      Closes #16916 from jongwook/SPARK-19501.
      ab9872db
    • hyukjinkwon's avatar
      [SPARK-19571][R] Fix SparkR test break on Windows via AppVeyor · f776e3b4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems wintuils for Hadoop 2.6.5 not exiting for now in https://github.com/steveloughran/winutils
      
      This breaks the tests in SparkR on Windows so this PR proposes to use winutils built by Hadoop 2.6.4 for now.
      
      ## How was this patch tested?
      
      Manually via AppVeyor
      
      **Before**
      
      https://ci.appveyor.com/project/spark-test/spark/build/627-r-test-break
      
      **After**
      
      https://ci.appveyor.com/project/spark-test/spark/build/629-r-test-break
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16927 from HyukjinKwon/spark-r-windows-break.
      f776e3b4
    • Sheamus K. Parkes's avatar
      [SPARK-18541][PYTHON] Add metadata parameter to pyspark.sql.Column.alias() · 7b64f7aa
      Sheamus K. Parkes authored
      ## What changes were proposed in this pull request?
      
      Add a `metadata` keyword parameter to `pyspark.sql.Column.alias()` to allow users to mix-in metadata while manipulating `DataFrame`s in `pyspark`.  Without this, I believe it was necessary to pass back through `SparkSession.createDataFrame` each time a user wanted to manipulate `StructField.metadata` in `pyspark`.
      
      This pull request also improves consistency between the Scala and Python APIs (i.e. I did not add any functionality that was not already in the Scala API).
      
      Discussed ahead of time on JIRA with marmbrus
      
      ## How was this patch tested?
      
      Added unit tests (and doc tests).  Ran the pertinent tests manually.
      
      Author: Sheamus K. Parkes <shea.parkes@milliman.com>
      
      Closes #16094 from shea-parkes/pyspark-column-alias-metadata.
      7b64f7aa
    • zero323's avatar
      [SPARK-19162][PYTHON][SQL] UserDefinedFunction should validate that func is callable · e0eeb0f8
      zero323 authored
      ## What changes were proposed in this pull request?
      
      UDF constructor checks if `func` argument is callable and if it is not, fails fast instead of waiting for an action.
      
      ## How was this patch tested?
      
      Unit tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16535 from zero323/SPARK-19162.
      e0eeb0f8
    • zero323's avatar
      [SPARK-19453][PYTHON][SQL][DOC] Correct and extend DataFrame.replace docstring · 9c4405e8
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Provides correct description of the semantics of a `dict` argument passed as `to_replace`.
      - Describes type requirements for collection arguments.
      - Describes behavior with `to_replace: List[T]` and `value: T`
      
      ## How was this patch tested?
      
      Manual testing, documentation build.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16792 from zero323/SPARK-19453.
      9c4405e8
    • Xiao Li's avatar
      [SPARK-19589][SQL] Removal of SQLGEN files · 457850e6
      Xiao Li authored
      ### What changes were proposed in this pull request?
      SQLGen is removed. Thus, the generated files should be removed too.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #16921 from gatorsmile/removeSQLGenFiles.
      457850e6
    • Sunitha Kambhampati's avatar
      [SPARK-19585][DOC][SQL] Fix the cacheTable and uncacheTable api call in the doc · 9b5e460a
      Sunitha Kambhampati authored
      ## What changes were proposed in this pull request?
      
      https://spark.apache.org/docs/latest/sql-programming-guide.html#caching-data-in-memory
      In the doc, the call spark.cacheTable(“tableName”) and spark.uncacheTable(“tableName”) actually needs to be spark.catalog.cacheTable and spark.catalog.uncacheTable
      
      ## How was this patch tested?
      Built the docs and verified the change shows up fine.
      
      Author: Sunitha Kambhampati <skambha@us.ibm.com>
      
      Closes #16919 from skambha/docChange.
      9b5e460a
  4. Feb 13, 2017
    • Xin Wu's avatar
      [SPARK-19539][SQL] Block duplicate temp table during creation · 1ab97310
      Xin Wu authored
      ## What changes were proposed in this pull request?
      Current `CREATE TEMPORARY TABLE ... ` is deprecated and recommend users to use `CREATE TEMPORARY VIEW ...` And it does not support `IF NOT EXISTS `clause. However, if there is an existing temporary view defined, it is possible to unintentionally replace this existing view by issuing `CREATE TEMPORARY TABLE ...`  with the same table/view name.
      
      This PR is to disallow `CREATE TEMPORARY TABLE ...` with an existing view name.
      Under the cover, `CREATE TEMPORARY TABLE ...` will be changed to create temporary view, however, passing in a flag `replace=false`, instead of currently `true`. So when creating temporary view under the cover, if there is existing view with the same name, the operation will be blocked.
      
      ## How was this patch tested?
      New unit test case is added and updated some existing test cases to adapt the new behavior
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #16878 from xwu0226/block_duplicate_temp_table.
      1ab97310
    • ouyangxiaochen's avatar
      [SPARK-19115][SQL] Supporting Create Table Like Location · 6e45b547
      ouyangxiaochen authored
      What changes were proposed in this pull request?
      
      Support CREATE [EXTERNAL] TABLE LIKE LOCATION... syntax for Hive serde and datasource tables.
      In this PR,we follow SparkSQL design rules :
      
          supporting create table like view or physical table or temporary view with location.
          creating a table with location,this table will be an external table other than managed table.
      
      How was this patch tested?
      
      Add new test cases and update existing test cases
      
      Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn>
      
      Closes #16868 from ouyangxiaochen/spark19115.
      6e45b547
    • zero323's avatar
      [SPARK-19429][PYTHON][SQL] Support slice arguments in Column.__getitem__ · e02ac303
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add support for `slice` arguments in `Column.__getitem__`.
      - Remove obsolete `__getslice__` bindings.
      
      ## How was this patch tested?
      
      Existing unit tests, additional tests covering `[]` with `slice`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16771 from zero323/SPARK-19429.
      e02ac303
    • Marcelo Vanzin's avatar
      [SPARK-19520][STREAMING] Do not encrypt data written to the WAL. · 0169360e
      Marcelo Vanzin authored
      Spark's I/O encryption uses an ephemeral key for each driver instance.
      So driver B cannot decrypt data written by driver A since it doesn't
      have the correct key.
      
      The write ahead log is used for recovery, thus needs to be readable by
      a different driver. So it cannot be encrypted by Spark's I/O encryption
      code.
      
      The BlockManager APIs used by the WAL code to write the data automatically
      encrypt data, so changes are needed so that callers can to opt out of
      encryption.
      
      Aside from that, the "putBytes" API in the BlockManager does not do
      encryption, so a separate situation arised where the WAL would write
      unencrypted data to the BM and, when those blocks were read, decryption
      would fail. So the WAL code needs to ask the BM to encrypt that data
      when encryption is enabled; this code is not optimal since it results
      in a (temporary) second copy of the data block in memory, but should be
      OK for now until a more performant solution is added. The non-encryption
      case should not be affected.
      
      Tested with new unit tests, and by running streaming apps that do
      recovery using the WAL data with I/O encryption turned on.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16862 from vanzin/SPARK-19520.
      0169360e
    • hyukjinkwon's avatar
      [SPARK-19435][SQL] Type coercion between ArrayTypes · 9af8f743
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support type coercion between `ArrayType`s where the element types are compatible.
      
      **Before**
      
      ```
      Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got GREATEST(array<int>, array<double>).; line 1 pos 0;
      
      Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve 'least(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got LEAST(array<int>, array<double>).; line 1 pos 0;
      
      sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
      org.apache.spark.sql.AnalysisException: incompatible types found in column a for inline table; line 1 pos 14
      
      Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
      org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(DoubleType,false) <> ArrayType(IntegerType,false) at the first column of the second table;;
      
      sql("SELECT IF(1=1, array(1), array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve '(IF((1 = 1), array(1), array(1.0D)))' due to data type mismatch: differing types in '(IF((1 = 1), array(1), array(1.0D)))' (array<int> and array<double>).; line 1 pos 7;
      ```
      
      **After**
      
      ```scala
      Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
      res5: org.apache.spark.sql.DataFrame = [greatest(a, array(1.0)): array<double>]
      
      Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
      res6: org.apache.spark.sql.DataFrame = [least(a, array(1.0)): array<double>]
      
      sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
      res8: org.apache.spark.sql.DataFrame = [a: array<double>]
      
      Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
      res10: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: array<double>]
      
      sql("SELECT IF(1=1, array(1), array(1D))")
      res15: org.apache.spark.sql.DataFrame = [(IF((1 = 1), array(1), array(1.0))): array<double>]
      ```
      
      ## How was this patch tested?
      
      Unit tests in `TypeCoercion` and Jenkins tests and
      
      building with scala 2.10
      
      ```scala
      ./dev/change-scala-version.sh 2.10
      ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16777 from HyukjinKwon/SPARK-19435.
      9af8f743
Loading