  1. Jul 23, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-21512][SQL][TEST] DatasetCacheSuite needs to execute unpersistent after executing peristent · 481f0792
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR avoids reusing a persisted dataset across test cases by unpersisting the dataset at the end of each test case.
      
      In `DatasetCacheSuite`, the test case `"get storage level"` does not unpersist the dataset after persisting it. The same dataset is then persisted by the test case `"persist and then rebind right encoder when join 2 datasets"`. Thus, when both test cases run, the second one does not actually persist the dataset, because the data is already cached.
      
      When we run only the second test case, it does persist the dataset. It is not good to change the behavior of the second test case; instead, the first test case should correctly unpersist the dataset, as sketched below.
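      A minimal sketch of the intended cleanup (illustrative plain-Scala code with made-up names, not the suite's actual test):
      
      ```
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.storage.StorageLevel
      
      // Sketch: whatever a test case persists, it also unpersists before finishing,
      // so no cached data leaks into later test cases that persist the same dataset.
      object UnpersistAfterTest {
        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[*]").appName("cache-demo").getOrCreate()
          val ds = spark.range(10)
      
          ds.persist(StorageLevel.MEMORY_AND_DISK)
          assert(ds.storageLevel == StorageLevel.MEMORY_AND_DISK)
      
          ds.unpersist()  // the cleanup this PR adds at the end of each test case
          assert(ds.storageLevel == StorageLevel.NONE)
      
          spark.stop()
        }
      }
      ```
      
      Without the fix, running the suite prints the warnings below: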
      
      ```
      Testing started at 17:52 ...
      01:52:15.053 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      01:52:48.595 WARN org.apache.spark.sql.execution.CacheManager: Asked to cache already cached data.
      01:52:48.692 WARN org.apache.spark.sql.execution.CacheManager: Asked to cache already cached data.
      01:52:50.864 WARN org.apache.spark.storage.RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
      01:52:50.864 WARN org.apache.spark.storage.RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
      01:52:50.868 WARN org.apache.spark.storage.BlockManager: Block rdd_8_1 replicated to only 0 peer(s) instead of 1 peers
      01:52:50.868 WARN org.apache.spark.storage.BlockManager: Block rdd_8_0 replicated to only 0 peer(s) instead of 1 peers
      ```
      
      After this PR, these messages no longer appear:
      ```
      Testing started at 18:14 ...
      02:15:05.329 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      
      Process finished with exit code 0
      ```
      
      ## How was this patch tested?
      
      Used the existing test
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18719 from kiszk/SPARK-21512.
      481f0792
    • Reynold Xin's avatar
      [MINOR] Remove **** in test case names in FlatMapGroupsWithStateSuite · a4eac8b0
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes the `****` string from test names in FlatMapGroupsWithStateSuite. `***` is a common string developers grep for when using ScalaTest (because it immediately shows the failing test cases). The existence of `****` in test names disrupts that workflow.
      
      ## How was this patch tested?
      N/A - test only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18715 from rxin/FlatMapGroupsWithStateStar.
      a4eac8b0
  2. Jul 21, 2017
    • Takuya UESHIN's avatar
      [SPARK-21472][SQL][FOLLOW-UP] Introduce ArrowColumnVector as a reader for Arrow vectors. · 2f146842
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of #18680.
      
      In some environments, a compile error occurs:
      
      ```
      .../sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java:243:
      error: not found: type Array
        public void loadBytes(Array array) {
                              ^
      ```
      
      This PR fixes it.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18701 from ueshin/issues/SPARK-21472_fup1.
      2f146842
  3. Jul 20, 2017
    • Wenchen Fan's avatar
      [SPARK-10063] Follow-up: remove dead code related to an old output committer · 3ac60930
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      DirectParquetOutputCommitter was removed from Spark as it was deemed unsafe to use. However, we still had some code that generates a warning about it. This patch removes that code as well.
      
      This is kind of a follow-up of https://github.com/apache/spark/pull/16796
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18689 from cloud-fan/minor.
      3ac60930
    • Takuya UESHIN's avatar
      [SPARK-21472][SQL] Introduce ArrowColumnVector as a reader for Arrow vectors. · cb19880c
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Introducing `ArrowColumnVector` as a reader for Arrow vectors.
      It extends `ColumnVector`, so we will be able to use it with `ColumnarBatch` and its functionalities.
      Currently it supports primitive types as well as `StringType`, `ArrayType`, and `StructType`.
      
      ## How was this patch tested?
      
      Added tests for `ArrowColumnVector` and existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18680 from ueshin/issues/SPARK-21472.
      cb19880c
    • gatorsmile's avatar
      [SPARK-21477][SQL][MINOR] Mark LocalTableScanExec's input data transient · 256358f6
      gatorsmile authored
      ## What changes were proposed in this pull request?
      This PR marks the parameters `rows` and `unsafeRow` of `LocalTableScanExec` as transient, which avoids serializing unneeded objects.
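      A minimal illustration of the mechanism (plain Scala with a hypothetical class, not the actual `LocalTableScanExec` code): a `@transient` field is skipped by Java serialization, so bulky driver-side data does not travel with the serialized object.
      
      ```
      import java.io._
      
      // Hypothetical class for illustration; only `name` survives serialization.
      class Plan(@transient val rows: Seq[Int], val name: String) extends Serializable
      
      object TransientDemo {
        def main(args: Array[String]): Unit = {
          val bytes = new ByteArrayOutputStream()
          val oos = new ObjectOutputStream(bytes)
          oos.writeObject(new Plan(Seq(1, 2, 3), "scan"))
          oos.close()
      
          val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
          val copy = in.readObject().asInstanceOf[Plan]
      
          println(copy.name)  // "scan"
          println(copy.rows)  // null: the @transient field was not serialized
        }
      }
      ```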
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18686 from gatorsmile/LocalTableScanExec.
      256358f6
  4. Jul 19, 2017
    • Xiang Gao's avatar
      [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null... · b7a40f64
      Xiang Gao authored
      [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating DataFrame using python
      
      ## What changes were proposed in this pull request?
      This reopens https://github.com/apache/spark/pull/14198, with merge conflicts resolved.
      
      ueshin Could you please take a look at my code?
      
      Fix bugs where certain array types result in an array of nulls when creating a DataFrame using Python.
      
      Python's `array.array` has richer types than Python itself; e.g., we can have `array('f', [1, 2, 3])` and `array('d', [1, 2, 3])`. The code in spark-sql and pyspark did not take this into consideration, which could result in an array of null values when a row contains an `array('f')`.
      
      A simple snippet to reproduce this bug:
      
      ```
      from pyspark import SparkContext
      from pyspark.sql import SQLContext,Row,DataFrame
      from array import array
      
      sc = SparkContext()
      sqlContext = SQLContext(sc)
      
      row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
      rows = sc.parallelize([ row1 ])
      df = sqlContext.createDataFrame(rows)
      df.show()
      ```
      
      which produces the output
      
      ```
      +---------------+------------------+
      |    doublearray|        floatarray|
      +---------------+------------------+
      |[1.0, 2.0, 3.0]|[null, null, null]|
      +---------------+------------------+
      ```
      
      ## How was this patch tested?
      
      New test case added
      
      Author: Xiang Gao <qasdfgtyuiop@gmail.com>
      Author: Gao, Xiang <qasdfgtyuiop@gmail.com>
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18444 from zasdfgbnm/fix_array_infer.
      b7a40f64
    • Burak Yavuz's avatar
      [SPARK-21463] Allow userSpecifiedSchema to override partition inference... · 2c9d5ef1
      Burak Yavuz authored
      [SPARK-21463] Allow userSpecifiedSchema to override partition inference performed by MetadataLogFileIndex
      
      ## What changes were proposed in this pull request?
      
      When using the MetadataLogFileIndex to read back a table, we don't respect the user-provided schema as the proper column types. This can lead to issues when reading strings that look like dates and get truncated to DateType, or longs that get truncated to IntegerType simply because no large value happens to exist in the data.
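      A hedged sketch of the read-back path (path and column names are placeholders; assumes an active SparkSession `spark`): with this change, the explicit schema should take precedence over types inferred from the partition directories.
      
      ```
      import org.apache.spark.sql.types.{StringType, StructType}
      
      // "date" is a partition column whose values look like dates; with a
      // user-specified schema it stays a StringType instead of being inferred.
      val userSchema = new StructType()
        .add("value", StringType)
        .add("date", StringType)
      
      val df = spark.read
        .schema(userSchema)
        .parquet("/tmp/stream-output")  // directory previously written by a streaming file sink
      ```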
      
      ## How was this patch tested?
      
      Unit tests and manual tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #18676 from brkyvz/stream-partitioning.
      2c9d5ef1
    • Corey Woodfield's avatar
      [SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of Dataset#joinWith · 8cd9cdf1
      Corey Woodfield authored
      ## What changes were proposed in this pull request?
      
      Two invalid join types were mistakenly listed in the javadoc for joinWith, in the Dataset class. I presume these were copied from the javadoc of join, but since joinWith returns a Dataset\<Tuple2\>, left_semi and left_anti are invalid, as they only return values from one of the datasets instead of from both.
      
      ## How was this patch tested?
      
      I ran the following code:
      ```
      public static void main(String[] args) {
      	SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
      	Dataset<Row> one = spark.createDataFrame(Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)), Bean.class);
      	Dataset<Row> two = spark.createDataFrame(Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)), Bean.class);
      
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show();} catch (Exception e) {e.printStackTrace();}
      }
      ```
      which tests all the different join types; the last two (left_semi and left_anti) threw exceptions. The same code using join instead of joinWith worked fine. The Bean class was just a Java bean with a single int field, x.
      
      Author: Corey Woodfield <coreywoodfield@gmail.com>
      
      Closes #18462 from coreywoodfield/master.
      8cd9cdf1
    • DFFuture's avatar
      [SPARK-21446][SQL] Fix setAutoCommit never executed · c9729187
      DFFuture authored
      ## What changes were proposed in this pull request?
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446
      `options.asConnectionProperties` cannot contain `fetchsize`, because `fetchsize` is a Spark-only option and Spark-only options are excluded from the connection properties.
      So this PR changes the properties passed to `beforeFetch` from `options.asConnectionProperties.asScala.toMap` to `options.asProperties.asScala.toMap`.
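      For context, a hedged usage sketch (URL and table name are placeholders; assumes an active SparkSession `spark`): `fetchsize` is a Spark-only data source option, which is why it has to be read from `options.asProperties` rather than from the JDBC connection properties.
      
      ```
      // Reading over JDBC with a fetch size; for PostgreSQL, the beforeFetch hook
      // must disable autocommit for the fetch size to take effect.
      val df = spark.read
        .format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/testdb")
        .option("dbtable", "public.revenue")
        .option("fetchsize", "1000")  // Spark-only option, excluded from asConnectionProperties
        .load()
      ```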
      
      ## How was this patch tested?
      
      Author: DFFuture <albert.zhang23@gmail.com>
      
      Closes #18665 from DFFuture/sparksql_pg.
      c9729187
    • Tathagata Das's avatar
      [SPARK-21464][SS] Minimize deprecation warnings caused by ProcessingTime class · 70fe99dc
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Use of the `ProcessingTime` class was deprecated in favor of `Trigger.ProcessingTime` in Spark 2.2. However, remaining uses of `ProcessingTime` cause deprecation warnings during compilation. This cannot be avoided entirely because, even though it is deprecated as a public API, `ProcessingTime` instances are used internally in `TriggerExecutor`. This PR minimizes the warnings by removing its uses from tests as much as possible.
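      For reference, a hedged sketch of the non-deprecated form (sink and interval are illustrative; assumes a streaming DataFrame `df`):
      
      ```
      import org.apache.spark.sql.streaming.Trigger
      
      // Trigger.ProcessingTime is the supported public API; the deprecated
      // ProcessingTime class remains only for internal use in TriggerExecutor.
      val query = df.writeStream
        .format("console")
        .trigger(Trigger.ProcessingTime("10 seconds"))
        .start()
      ```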
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18678 from tdas/SPARK-21464.
      70fe99dc
    • donnyzone's avatar
      [SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results failures in some cases · 6b6dd682
      donnyzone authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441
      
      This issue can be reproduced by the following example:
      
      ```
      val spark = SparkSession
         .builder()
         .appName("smj-codegen")
         .master("local")
         .config("spark.sql.autoBroadcastJoinThreshold", "1")
         .getOrCreate()
      val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
      val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
      val df = df1.join(df2, df1("key") === df2("key"))
         .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
         .select("int")
         df.show()
      ```
      
      To conclude, the issue happens when:
      (1) SortMergeJoin condition contains CodegenFallback expressions.
      (2) In the physical plan tree, the SortMergeJoin node is a child of the root node, e.g., the Project in the example above.
      
      This patch fixes the logic in `CollapseCodegenStages` rule.
      
      ## How was this patch tested?
      Unit test and manual verification in our cluster.
      
      Author: donnyzone <wellfengzhu@gmail.com>
      
      Closes #18656 from DonnyZone/Fix_SortMergeJoinExec.
      6b6dd682
    • jinxing's avatar
      [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid OOM. · 4eb081cc
      jinxing authored
      ## What changes were proposed in this pull request?
      
      In `SlidingWindowFunctionFrame`, all rows whose input value is less than or equal to the output row's upper bound are currently added to the buffer, and then all rows whose input value is smaller than the output row's lower bound are dropped from the buffer.
      This can make the buffer very large even though the window is small.
      For example:
      ```
      select a, b, sum(a)
      over (partition by b order by a range between 1000000 following and 1000001 following)
      from table
      ```
      We can refine the logic to add only the qualifying rows to the buffer.
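      A hedged DataFrame-API sketch of the same kind of window (assumes a DataFrame `table` with columns `a` and `b`; illustrative, not the PR's code):
      
      ```
      import org.apache.spark.sql.expressions.Window
      import org.apache.spark.sql.functions.{col, sum}
      
      // A narrow range frame far ahead of the current row; before this change the
      // frame buffer could hold many rows that never fall inside the frame bounds.
      val w = Window.partitionBy("b").orderBy("a").rangeBetween(1000000, 1000001)
      val result = table.select(col("a"), col("b"), sum(col("a")).over(w))
      ```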
      
      ## How was this patch tested?
      Manual test:
      Ran the SQL
      `select shop, shopInfo, district, sum(revenue) over(partition by district order by revenue range between 100 following and 200 following) from revenueList limit 10`
      against a table with 4 columns (shop: String, shopInfo: String, district: String, revenue: Int). The biggest partition is around 2 GB, containing 200k rows.
      The executor was configured with 2 GB of memory.
      With the change in this PR, it works fine. Without this change, the exception below is thrown.
      ```
      MemoryError: Java heap space
      	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)
      	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:62)
      	at org.apache.spark.sql.execution.window.SlidingWindowFunctionFrame.write(WindowFunctionFrame.scala:201)
      	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:365)
      	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:289)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
      	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
      	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      	at org.apache.spark.scheduler.Task.run(Task.scala:108)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      ```
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18634 from jinxing64/SPARK-21414.
      4eb081cc
  5. Jul 18, 2017
    • xuanyuanking's avatar
      [SPARK-21435][SQL] Empty files should be skipped while write to file · 81c99a5b
      xuanyuanking authored
      ## What changes were proposed in this pull request?
      
      Add `EmptyDirectoryWriteTask` for tasks with no rows while writing files. Fix the empty result for the Parquet format by leaving the first partition for metadata writing.
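      As an illustration of the scenario (path is a placeholder; assumes an active SparkSession `spark`), writing a DataFrame whose partitions contain no rows is the case this change targets:
      
      ```
      // All partitions are empty; after this change, tasks with no rows should no
      // longer emit data files, while the Parquet metadata is still written.
      val empty = spark.range(0).toDF("id")
      empty.write.mode("overwrite").parquet("/tmp/empty-out")
      ```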
      
      ## How was this patch tested?
      
      Add new test in `FileFormatWriterSuite `
      
      Author: xuanyuanking <xyliyuanjian@gmail.com>
      
      Closes #18654 from xuanyuanking/SPARK-21435.
      81c99a5b
    • Tathagata Das's avatar
      [SPARK-21462][SS] Added batchId to StreamingQueryProgress.json · 84f1b25f
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      - Added batchId to StreamingQueryProgress.json as that was missing from the generated json.
      - Also, removed recently added numPartitions from StatefulOperatorProgress as this value does not change through the query run, and there are other ways to find that.
      
      ## How was this patch tested?
      Updated unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18675 from tdas/SPARK-21462.
      84f1b25f
    • Sean Owen's avatar
      [SPARK-21415] Triage scapegoat warnings, part 1 · e26dac5f
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Address scapegoat warnings for:
      - BigDecimal double constructor
      - Catching NPE
      - Finalizer without super
      - List.size is O(n)
      - Prefer Seq.empty
      - Prefer Set.empty
      - reverse.map instead of reverseMap
      - Type shadowing
      - Unnecessary if condition.
      - Use .log1p
      - Var could be val
      
      In some instances, like Seq.empty, I avoided making the change even where valid in test code, to keep the scope of the change smaller. Those issues concern performance, which does not matter for tests.
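      For illustration, two of the categories above as simplified before/after snippets (plain Scala, not actual Spark code):
      
      ```
      object ScapegoatExamples {
        def main(args: Array[String]): Unit = {
          val x = 1e-12
      
          // "Use .log1p": log1p(x) is more accurate than log(1 + x) for small x
          val before = math.log(1.0 + x)
          val after  = math.log1p(x)
      
          // "Prefer Seq.empty": reuses a shared empty instance instead of building a new one
          val beforeSeq: Seq[Int] = Seq[Int]()
          val afterSeq: Seq[Int]  = Seq.empty[Int]
      
          println((before, after, beforeSeq, afterSeq))
        }
      }
      ```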
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #18635 from srowen/Scapegoat1.
      e26dac5f
  6. Jul 17, 2017
    • Tathagata Das's avatar
      [SPARK-21409][SS] Follow up PR to allow different types of custom metrics to be exposed · e9faae13
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Implementations may expose both timing and size metrics. This PR enables that.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18661 from tdas/SPARK-21409-2.
      e9faae13
    • Tathagata Das's avatar
      [SPARK-21409][SS] Expose state store memory usage in SQL metrics and progress updates · 9d8c8317
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Currently, there is no tracking of memory usage of state stores. This JIRA is to expose that through SQL metrics and StreamingQueryProgress.
      
      Additionally, added the ability to expose implementation-specific metrics through the StateStore APIs to the SQLMetrics.
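      A hedged sketch of reading the new metric from an active query (assumes a running StreamingQuery `query`; field names follow `StreamingQueryProgress`):
      
      ```
      // Inspect state store metrics from the most recent progress update.
      val progress = query.lastProgress
      progress.stateOperators.foreach { op =>
        println(s"state rows: ${op.numRowsTotal}, memory used: ${op.memoryUsedBytes} bytes")
      }
      ```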
      
      ## How was this patch tested?
      Added unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18629 from tdas/SPARK-21409.
      9d8c8317
    • gatorsmile's avatar
      [SPARK-21354][SQL] INPUT FILE related functions do not support more than one sources · e398c281
      gatorsmile authored
      ### What changes were proposed in this pull request?
      The built-in functions `input_file_name`, `input_file_block_start`, and `input_file_block_length` should not support more than one source, matching what Hive does. Currently, Spark does not block such queries, and the output is ambiguous/non-deterministic; it could come from either side.
      
      ```
      hive> select *, INPUT__FILE__NAME FROM t1, t2;
      FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One Tables/Subqueries
      ```
      
      This PR blocks it and issues an error.
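      For reference, a single-source usage that remains valid (path is a placeholder; assumes an active SparkSession `spark`; illustrative, not the PR's test):
      
      ```
      import org.apache.spark.sql.functions.input_file_name
      
      // With a single source the function is well defined; with more than one
      // relation in the query, analysis now fails instead of returning an
      // arbitrary side's file name.
      spark.read.text("/tmp/t1").select(input_file_name()).show(false)
      ```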
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18580 from gatorsmile/inputFileName.
      e398c281
  7. Jul 16, 2017
  8. Jul 14, 2017
  9. Jul 13, 2017
    • Sean Owen's avatar
      [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
      425c4ada
  10. Jul 12, 2017
    • Wenchen Fan's avatar
      [SPARK-17701][SQL] Refactor RowDataSourceScanExec so its sameResult call does not compare strings · 780586a9
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently, `RowDataSourceScanExec` and `FileSourceScanExec` rely on a "metadata" string map to implement equality comparison, since the RDDs they depend on cannot be directly compared. This has resulted in a number of correctness bugs around exchange reuse, e.g. SPARK-17673 and SPARK-16818.
      
      To make these comparisons less brittle, we should refactor these classes to compare constructor parameters directly instead of relying on the metadata map.
      
      This PR refactors `RowDataSourceScanExec`, `FileSourceScanExec` will be fixed in the follow-up PR.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18600 from cloud-fan/minor.
      780586a9
    • liuxian's avatar
      [SPARK-21007][SQL] Add SQL function - RIGHT && LEFT · aaad34dc
      liuxian authored
      ## What changes were proposed in this pull request?
      Add the SQL functions RIGHT and LEFT, same as MySQL:
      https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_left
      https://dev.mysql.com/doc/refman/5.7/en/string-functions.html#function_right
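      
      Example usage of the newly added functions (assumes an active SparkSession `spark`; expected results follow the MySQL semantics linked above):
      
      ```
      // Per the MySQL semantics, this returns "Spark" and "SQL".
      spark.sql("SELECT left('Spark SQL', 5), right('Spark SQL', 3)").show()
      ```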
      
      ## How was this patch tested?
      unit test
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #18228 from 10110346/lx-wip-0607.
      aaad34dc
    • Burak Yavuz's avatar
      [SPARK-21370][SS] Add test for state reliability when one read-only state... · e0af76a3
      Burak Yavuz authored
      [SPARK-21370][SS] Add test for state reliability when one read-only state store aborts after read-write state store commits
      
      ## What changes were proposed in this pull request?
      
      During Streaming Aggregation, we have two StateStores per task, one used as read-only in
      `StateStoreRestoreExec`, and one read-write used in `StateStoreSaveExec`. `StateStore.abort`
      will be called for these StateStores if they haven't committed their results. We need to
      make sure that `abort` in read-only store after a `commit` in the read-write store doesn't
      accidentally lead to the deletion of state.
      
      This PR adds a test for this condition.
      
      ## How was this patch tested?
      
      This PR adds a test.
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #18603 from brkyvz/ss-test.
      e0af76a3
    • Jane Wang's avatar
      [SPARK-12139][SQL] REGEX Column Specification · 2cbfc975
      Jane Wang authored
      ## What changes were proposed in this pull request?
      Hive interprets regular expressions, e.g., `(a)?+.+`, in the query specification. This PR enables Spark to support this feature when hive.support.quoted.identifiers is set to true.
      
      ## How was this patch tested?
      
      - Added unit tests in SQLQuerySuite.scala
      - Ran spark-shell and tested the originally failing query:
      scala> hc.sql("SELECT `(a|b)?+.+` from test1").collect.foreach(println)
      
      Author: Jane Wang <janewang@fb.com>
      
      Closes #18023 from janewangfb/support_select_regex.
      2cbfc975
  11. Jul 11, 2017
    • gatorsmile's avatar
      [SPARK-19285][SQL] Implement UDF0 · d3e07165
      gatorsmile authored
      ### What changes were proposed in this pull request?
      This PR implements UDF0. `UDF0` is needed when users need to implement a Java UDF with no arguments.
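      A hedged sketch of registering a zero-argument Java UDF from Scala (name and body are illustrative; assumes an active SparkSession `spark` and the `register` overload for `UDF0` added by this PR):
      
      ```
      import org.apache.spark.sql.api.java.UDF0
      import org.apache.spark.sql.types.StringType
      
      // Register a zero-argument UDF and call it from SQL.
      spark.udf.register("banner", new UDF0[String] {
        override def call(): String = "spark"
      }, StringType)
      
      spark.sql("SELECT banner()").show()
      ```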
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18598 from gatorsmile/udf0.
      d3e07165
    • hyukjinkwon's avatar
      [SPARK-21365][PYTHON] Deduplicate logics parsing DDL type/schema definition · ebc124d4
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR deals with four points as below:
      
      - Reuse existing DDL parser APIs rather than reimplementing within PySpark
      
      - Support DDL-formatted strings, e.g., `field type, field type`.
      
      - Support case-insensitivity for parsing.
      
      - Support nested data types as below:
      
        **Before**
        ```
        >>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show()
        ...
        ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int>
        ```
      
        ```
        >>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show()
        ...
        ValueError: The strcut field string format is: 'field_name:field_type', but got: a: struct<b: int>
        ```
      
        ```
        >>> spark.createDataFrame([[1]], "a int").show()
        ...
        ValueError: Could not parse datatype: a int
        ```
      
        **After**
        ```
        >>> spark.createDataFrame([[[1]]], "struct<a: struct<b: int>>").show()
        +---+
        |  a|
        +---+
        |[1]|
        +---+
        ```
      
        ```
        >>> spark.createDataFrame([[[1]]], "a: struct<b: int>").show()
        +---+
        |  a|
        +---+
        |[1]|
        +---+
        ```
      
        ```
        >>> spark.createDataFrame([[1]], "a int").show()
        +---+
        |  a|
        +---+
        |  1|
        +---+
        ```
      
      ## How was this patch tested?
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18590 from HyukjinKwon/deduplicate-python-ddl.
      ebc124d4
    • Xingbo Jiang's avatar
      [SPARK-21366][SQL][TEST] Add sql test for window functions · 66d21686
      Xingbo Jiang authored
      ## What changes were proposed in this pull request?
      
      Add SQL tests for window functions, and remove unnecessary test cases in `WindowQuerySuite`.
      
      ## How was this patch tested?
      
      Added `window.sql` and the corresponding output file.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18591 from jiangxb1987/window.
      66d21686
    • hyukjinkwon's avatar
      [SPARK-21263][SQL] Do not allow partially parsing double and floats via NumberFormat in CSV · 7514db1d
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes removing the use of `NumberFormat.parse` to disallow partially parsed data. For example, the following currently succeeds even though the input is not a valid double:
      
      ```
      scala> spark.read.schema("a DOUBLE").option("mode", "FAILFAST").csv(Seq("10u12").toDS).show()
      +----+
      |   a|
      +----+
      |10.0|
      +----+
      ```
      
      ## How was this patch tested?
      
      Unit tests added in `UnivocityParserSuite` and `CSVSuite`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18532 from HyukjinKwon/SPARK-21263.
      7514db1d
  12. Jul 10, 2017
    • jinxing's avatar
      [SPARK-21315][SQL] Skip some spill files when generateIterator(startIndex) in... · 97a1aa2c
      jinxing authored
      [SPARK-21315][SQL] Skip some spill files when generateIterator(startIndex) in ExternalAppendOnlyUnsafeRowArray.
      
      ## What changes were proposed in this pull request?
      
      In the current code, it is expensive to use `UnboundedFollowingWindowFunctionFrame`, because it iterates from the start to the lower bound every time the `write` method is called. When traversing the iterator, it is possible to skip some spilled files and thus save time.
      
      ## How was this patch tested?
      
      Added unit test
      
      Did a small benchmark:
      
      Put 2,000,200 rows into `UnsafeExternalSorter` -- 2 spill files (each containing 1,000,000 rows) plus an inMemSorter containing 200 rows.
      Moved the iterator forward to index=2000001.
      
      *With this change*:
      `getIterator(2000001)` costs almost 0ms~1ms;
      *Without this change*:
      `for(int i=0; i<2000001; i++) getIterator().loadNext()` costs about 300ms.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18541 from jinxing64/SPARK-21315.
      97a1aa2c
    • gatorsmile's avatar
      [SPARK-21350][SQL] Fix the error message when the number of arguments is wrong when invoking a UDF · 1471ee7a
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Users get a very confusing error when they specify the wrong number of parameters.
      ```Scala
          val df = spark.emptyDataFrame
          spark.udf.register("foo", (_: String).length)
          df.selectExpr("foo(2, 3, 4)")
      ```
      ```
      org.apache.spark.sql.UDFSuite$$anonfun$9$$anonfun$apply$mcV$sp$12 cannot be cast to scala.Function3
      java.lang.ClassCastException: org.apache.spark.sql.UDFSuite$$anonfun$9$$anonfun$apply$mcV$sp$12 cannot be cast to scala.Function3
      	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.<init>(ScalaUDF.scala:109)
      ```
      
      This PR captures the exception and issues an error message that is consistent with what we do for built-in functions. After the fix, the error message is improved to:
      ```
      Invalid number of arguments for function foo; line 1 pos 0
      org.apache.spark.sql.AnalysisException: Invalid number of arguments for function foo; line 1 pos 0
      	at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:119)
      ```
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18574 from gatorsmile/statsCheck.
      1471ee7a
    • Takeshi Yamamuro's avatar
      [SPARK-21043][SQL] Add unionByName in Dataset · a2bec6c9
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR adds `unionByName` to `Dataset`.
      Here is how to use it:
      ```
      val df1 = Seq((1, 2, 3)).toDF("col0", "col1", "col2")
      val df2 = Seq((4, 5, 6)).toDF("col1", "col2", "col0")
      df1.unionByName(df2).show
      
      // output:
      // +----+----+----+
      // |col0|col1|col2|
      // +----+----+----+
      // |   1|   2|   3|
      // |   6|   4|   5|
      // +----+----+----+
      ```
      
      ## How was this patch tested?
      Added tests in `DataFrameSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #18300 from maropu/SPARK-21043-2.
      a2bec6c9
    • Bryan Cutler's avatar
      [SPARK-13534][PYSPARK] Using Apache Arrow to increase performance of DataFrame.toPandas · d03aebbe
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      Integrate Apache Arrow with Spark to increase performance of `DataFrame.toPandas`.  This has been done by using Arrow to convert data partitions on the executor JVM to Arrow payload byte arrays where they are then served to the Python process.  The Python DataFrame can then collect the Arrow payloads where they are combined and converted to a Pandas DataFrame.  Data types except complex, date, timestamp, and decimal  are currently supported, otherwise an `UnsupportedOperation` exception is thrown.
      
      Additions to Spark include a Scala package private method `Dataset.toArrowPayload` that will convert data partitions in the executor JVM to `ArrowPayload`s as byte arrays so they can be easily served.  A package private class/object `ArrowConverters` that provide data type mappings and conversion routines.  In Python, a private method `DataFrame._collectAsArrow` is added to collect Arrow payloads and a SQLConf "spark.sql.execution.arrow.enable" can be used in `toPandas()` to enable using Arrow (uses the old conversion by default).
      
      ## How was this patch tested?
      Added a new test suite `ArrowConvertersSuite` that will run tests on conversion of Datasets to Arrow payloads for supported types.  The suite will generate a Dataset and matching Arrow JSON data, then the dataset is converted to an Arrow payload and finally validated against the JSON data.  This will ensure that the schema and data has been converted correctly.
      
      Added PySpark tests to verify the `toPandas` method is producing equal DataFrames with and without pyarrow.  A roundtrip test to ensure the pandas DataFrame produced by pyspark is equal to a one made directly with pandas.
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      Author: Li Jin <ice.xelloss@gmail.com>
      Author: Li Jin <li.jin@twosigma.com>
      Author: Wes McKinney <wes.mckinney@twosigma.com>
      
      Closes #18459 from BryanCutler/toPandas_with_arrow-SPARK-13534.
      d03aebbe
    • hyukjinkwon's avatar
      [SPARK-21266][R][PYTHON] Support schema a DDL-formatted string in dapply/gapply/from_json · 2bfd5acc
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR supports schema in a DDL formatted string for `from_json` in R/Python and `dapply` and `gapply` in R, which are commonly used and/or consistent with Scala APIs.
      
      Additionally, this PR exposes `structType` in R to allow working around in other possible corner cases.
      
      **Python**
      
      `from_json`
      
      ```python
      from pyspark.sql.functions import from_json
      
      data = [(1, '''{"a": 1}''')]
      df = spark.createDataFrame(data, ("key", "value"))
      df.select(from_json(df.value, "a INT").alias("json")).show()
      ```
      
      **R**
      
      `from_json`
      
      ```R
      df <- sql("SELECT named_struct('name', 'Bob') as people")
      df <- mutate(df, people_json = to_json(df$people))
      head(select(df, from_json(df$people_json, "name STRING")))
      ```
      
      `structType.character`
      
      ```R
      structType("a STRING, b INT")
      ```
      
      `dapply`
      
      ```R
      dapply(createDataFrame(list(list(1.0)), "a"), function(x) {x}, "a DOUBLE")
      ```
      
      `gapply`
      
      ```R
      gapply(createDataFrame(list(list(1.0)), "a"), "a", function(key, x) { x }, "a DOUBLE")
      ```
      
      ## How was this patch tested?
      
      Doc tests for `from_json` in Python and unit tests `test_sparkSQL.R` in R.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18498 from HyukjinKwon/SPARK-21266.
      2bfd5acc
    • Juliusz Sompolski's avatar
      [SPARK-21272] SortMergeJoin LeftAnti does not update numOutputRows · 18b3b00e
      Juliusz Sompolski authored
      ## What changes were proposed in this pull request?
      
      Updating the numOutputRows metric was missing from one return path of the LeftAnti SortMergeJoin.
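      An illustrative left-anti join of the kind affected (data is made up; assumes an active SparkSession `spark`). The broadcast threshold is disabled so the planner picks a sort-merge join; the numOutputRows metric for the join node is visible in the SQL tab of the web UI.
      
      ```
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "-1")  // force a sort-merge join
      
      val left  = spark.range(0, 1000).toDF("id")
      val right = spark.range(0, 500).toDF("id")
      val onlyLeft = left.join(right, Seq("id"), "left_anti")
      println(onlyLeft.count())  // 500 rows flow through the LeftAnti SortMergeJoin
      ```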
      
      ## How was this patch tested?
      
      Non-zero output rows manually seen in metrics.
      
      Author: Juliusz Sompolski <julek@databricks.com>
      
      Closes #18494 from juliuszsompolski/SPARK-21272.
      18b3b00e
    • Takeshi Yamamuro's avatar
      [SPARK-20460][SQL] Make it more consistent to handle column name duplication · 647963a2
      Takeshi Yamamuro authored
      ## What changes were proposed in this pull request?
      This PR makes the handling of column name duplication more consistent. In the current master, error handling differs when column name duplication is hit:
      ```
      // json
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq("""{"a":1, "a":1}"""""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
      scala> spark.read.format("json").schema(schema).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#12, a#13.;
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
      
      scala> spark.read.format("json").load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Duplicate column(s) : "a" found, cannot save to JSON format;
        at org.apache.spark.sql.execution.datasources.json.JsonDataSource.checkConstraints(JsonDataSource.scala:81)
        at org.apache.spark.sql.execution.datasources.json.JsonDataSource.inferSchema(JsonDataSource.scala:63)
        at org.apache.spark.sql.execution.datasources.json.JsonFileFormat.inferSchema(JsonFileFormat.scala:57)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
        at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$7.apply(DataSource.scala:176)
      
      // csv
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
      scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#41, a#42.;
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
      
      // If `inferSchema` is true, a CSV format is duplicate-safe (See SPARK-16896)
      scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
      +---+---+
      | a0| a1|
      +---+---+
      |  1|  1|
      +---+---+
      
      // parquet
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
      scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Reference 'a' is ambiguous, could be: a#110, a#111.;
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:287)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolve(LogicalPlan.scala:181)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:153)
        at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan$$anonfun$resolve$1.apply(LogicalPlan.scala:152)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
      ```
      When this patch is applied, the results change to:
      ```
      
      // json
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq("""{"a":1, "a":1}"""""").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
      scala> spark.read.format("json").schema(schema).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
        at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
        at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
      
      scala> spark.read.format("json").load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
        at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
        at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:156)
      
      // csv
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq("a,a", "1,1").toDF().coalesce(1).write.mode("overwrite").text("/tmp/data")
      scala> spark.read.format("csv").schema(schema).option("header", false).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
        at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
        at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
        at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
      
      scala> spark.read.format("csv").option("header", true).load("/tmp/data").show
      +---+---+
      | a0| a1|
      +---+---+
      |  1|  1|
      +---+---+
      
      // parquet
      scala> val schema = StructType(StructField("a", IntegerType) :: StructField("a", IntegerType) :: Nil)
      scala> Seq((1, 1)).toDF("a", "b").coalesce(1).write.mode("overwrite").parquet("/tmp/data")
      scala> spark.read.format("parquet").schema(schema).option("header", false).load("/tmp/data").show
      org.apache.spark.sql.AnalysisException: Found duplicate column(s) in datasource: "a";
        at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtil.scala:47)
        at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtil.scala:33)
        at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:186)
        at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:368)
      ```
      
      ## How was this patch tested?
      Added tests in `DataFrameReaderWriterSuite` and `SQLQueryTestSuite`.
      
      Author: Takeshi Yamamuro <yamamuro@apache.org>
      
      Closes #17758 from maropu/SPARK-20460.
      647963a2
    • Wenchen Fan's avatar
      [SPARK-21100][SQL][FOLLOWUP] cleanup code and add more comments for Dataset.summary · 0e80ecae
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Some code cleanup and additional comments to make the code more readable. The way result rows are generated was changed to be clearer.
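      A usage sketch of the API this follow-up cleans up (column name is illustrative; assumes an active SparkSession `spark`):
      
      ```
      val df = spark.range(0, 100).toDF("value")
      df.summary().show()                       // default statistics: count, mean, stddev, min, quartiles, max
      df.summary("count", "min", "max").show()  // a user-selected subset of statistics
      ```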
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18570 from cloud-fan/summary.
      0e80ecae
  13. Jul 09, 2017
    • Wenchen Fan's avatar
      [SPARK-18016][SQL][FOLLOWUP] merge declareAddedFunctions, initNestedClasses... · 680b33f1
      Wenchen Fan authored
      [SPARK-18016][SQL][FOLLOWUP] merge declareAddedFunctions, initNestedClasses and declareNestedClasses
      
      ## What changes were proposed in this pull request?
      
      These three methods have to be used together, so it makes more sense to merge them into one method; then the caller only needs to call one method.
      
      ## How was this patch tested?
      
      existing tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18579 from cloud-fan/minor.
      680b33f1