  1. Jul 27, 2017
    • Wenchen Fan's avatar
      [SPARK-21319][SQL] Fix memory leak in sorter · 9f5647d6
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      `UnsafeExternalSorter.recordComparator` can be either `KVComparator` or `RowComparator`, and both of them keep references to the input rows they compared most recently.

      After sorting, we return the sorted iterator to upstream operators. However, the upstream operators may take a while to fully consume the sorted iterator, and `UnsafeExternalSorter` is registered to `TaskContext` at [here](https://github.com/apache/spark/blob/v2.2.0/core/src/main/java/org/apache/spark/util/collection/unsafe/sort/UnsafeExternalSorter.java#L159-L161), which means we keep the `UnsafeExternalSorter` instance, and the last compared input rows, in memory until the sorted iterator is fully consumed.

      Things get worse if we sort within partitions of a dataset and coalesce all partitions into one, as we then keep a lot of input rows in memory and it takes a long time to consume all the sorted iterators.

      This PR takes over https://github.com/apache/spark/pull/18543. The idea is that we no longer keep the record comparator instance in `UnsafeExternalSorter`, but only a generator of record comparators.
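      A hedged sketch of the pattern (class and member names here are illustrative, not Spark's actual internals): the sorter holds only a comparator factory, so the comparator, along with any rows it cached from its last comparison, becomes garbage-collectable as soon as sorting finishes.

      ```
      import java.util.Comparator

      // Illustrative sketch, not the actual UnsafeExternalSorter: keeping a
      // factory instead of a comparator instance means the comparator (which
      // may hold references to the rows it compared last) only lives for the
      // duration of the sort itself.
      class LeakFreeSorter(comparatorFactory: () => Comparator[Array[Byte]]) {
        def sort(records: Array[Array[Byte]]): Array[Array[Byte]] = {
          val comparator = comparatorFactory()   // created per sort, discarded afterwards
          records.sortWith((a, b) => comparator.compare(a, b) < 0)
        }
      }
      ```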
      
      close #18543
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18679 from cloud-fan/memory-leak.
      9f5647d6
    • Takuya UESHIN's avatar
      [SPARK-21440][SQL][PYSPARK] Refactor ArrowConverters and add ArrayType and StructType support. · 2ff35a05
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      This is a refactoring of `ArrowConverters` and related classes.
      
      1. Refactor `ColumnWriter` as `ArrowWriter`.
      2. Add `ArrayType` and `StructType` support.
      3. Refactor `ArrowConverters` to skip intermediate `ArrowRecordBatch` creation.
      
      ## How was this patch tested?
      
      Added some tests and existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18655 from ueshin/issues/SPARK-21440.
      2ff35a05
    • Kazuaki Ishizaki's avatar
      [SPARK-21271][SQL] Ensure Unsafe.sizeInBytes is a multiple of 8 · ebbe589d
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR ensures that `Unsafe.sizeInBytes` is a multiple of 8. If this requirement is not satisfied, `Unsafe.hashCode` causes an assertion violation.
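      For context, a hedged sketch of the word-alignment requirement (the helper below is illustrative; Spark has its own `ByteArrayMethods.roundNumberOfBytesToNearestWord` utility): the hash code is computed over the buffer 8 bytes at a time, so the size must land on a word boundary.

      ```
      // Illustrative helper, not the actual patch: round a row size up to the
      // next 8-byte word boundary so word-at-a-time hashing never trips its
      // assertion.
      def roundUpToWordBoundary(numBytes: Int): Int = {
        val remainder = numBytes % 8
        if (remainder == 0) numBytes else numBytes + (8 - remainder)
      }

      assert(roundUpToWordBoundary(13) == 16)
      assert(roundUpToWordBoundary(24) == 24)
      ```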
      
      ## How was this patch tested?
      
      Will add test cases
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18503 from kiszk/SPARK-21271.
      ebbe589d
  2. Jul 26, 2017
    • hyukjinkwon's avatar
      [SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions · 60472dbf
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This generates documentation for Spark SQL built-in functions.

      One drawback is that this requires a proper build to generate the built-in function list.
      Once it is built, generating the documentation via `sql/create-docs.sh` only takes a few seconds.

      Please see https://spark-test.github.io/sparksqldoc/, which I hosted to show the output documentation.

      A few more tasks remain to make the documentation prettier, for example separating `Arguments:` and `Examples:`, but I think this should be done within `ExpressionDescription` and `ExpressionInfo` rather than by manually parsing them. I will address these in a follow-up.

      This requires `pip install mkdocs` to generate HTML pages from the markdown files.
      
      ## How was this patch tested?
      
      Manually tested:
      
      ```
      cd docs
      jekyll build
      ```
      ,
      
      ```
      cd docs
      jekyll serve
      ```
      
      and
      
      ```
      cd sql
      create-docs.sh
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #18702 from HyukjinKwon/SPARK-21485.
      60472dbf
  3. Jul 25, 2017
    • gatorsmile's avatar
      [SPARK-20586][SQL] Add deterministic to ScalaUDF · ebc24a9b
      gatorsmile authored
      ### What changes were proposed in this pull request?
      Like [Hive UDFType](https://hive.apache.org/javadocs/r2.0.1/api/org/apache/hadoop/hive/ql/udf/UDFType.html), we should allow users to add extra flags for ScalaUDF and JavaUDF too. _stateful_/_impliesOrder_ are not applicable to our Scala UDF. Thus, we only add the following flag.
      
      - deterministic: Certain optimizations should not be applied if the UDF is not deterministic. A deterministic UDF returns the same result each time it is invoked with a particular input. This determinism only needs to hold within the context of a query.
      
      When the deterministic flag is not correctly set, the results could be wrong.
      
      For ScalaUDF in Dataset APIs, users can call the following extra APIs for `UserDefinedFunction` to make the corresponding changes.
      - `nonDeterministic`: updates a UserDefinedFunction to be non-deterministic (see the sketch below).
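      A hedged usage sketch: this description names the API `nonDeterministic`, while released Spark versions expose the equivalent method on `UserDefinedFunction` as `asNondeterministic()`, which is what the sketch uses; treat the exact name as version-dependent.

      ```
      import org.apache.spark.sql.functions.udf

      // Marking a UDF as non-deterministic tells the optimizer not to collapse,
      // reorder, or re-execute it freely. The UDF body here is illustrative.
      val noisy = udf((x: Int) => x + scala.util.Random.nextInt(10)).asNondeterministic()

      // Example use (assuming a DataFrame df with an integer column "value"):
      // df.select(noisy($"value"))
      ```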
      
      Also fixed the Java UDF name loss issue.
      
      Will submit a separate PR for `distinctLike` for UDAFs.
      
      ### How was this patch tested?
      Added test cases for both ScalaUDF and Java UDF.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: Wenchen Fan <cloud0fan@gmail.com>
      
      Closes #17848 from gatorsmile/udfRegister.
      ebc24a9b
  4. Jul 24, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-21516][SQL][TEST] Overriding afterEach() in DatasetCacheSuite must call super.afterEach() · 7f295059
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR ensures that the overriding `afterEach()` method in `DatasetCacheSuite` calls `super.afterEach()`. When we override the `afterEach()` method in a test suite, we have to call `super.afterEach()`.
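      A hedged sketch of the pattern in a ScalaTest suite (illustrative, not the actual DatasetCacheSuite code):

      ```
      import org.scalatest.{BeforeAndAfterEach, FunSuite}

      // The override calls super.afterEach() in a finally block so the parent
      // cleanup runs even if the suite-specific cleanup throws.
      class ExampleSuite extends FunSuite with BeforeAndAfterEach {
        override def afterEach(): Unit = {
          try {
            // suite-specific cleanup (e.g. unpersisting cached datasets) goes here
          } finally {
            super.afterEach()
          }
        }

        test("example") { assert(1 + 1 == 2) }
      }
      ```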
      
      This is a follow-up of #18719 and SPARK-21512.
      
      ## How was this patch tested?
      
      Used the existing test suite
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18721 from kiszk/SPARK-21516.
      7f295059
    • Wenchen Fan's avatar
      [SPARK-17528][SQL][FOLLOWUP] remove unnecessary data copy in object hash aggregate · 86664338
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      In #18483, we fixed the data copy bug when saving into `InternalRow`, and removed all workarounds for this bug in the aggregate code path. However, the object hash aggregate was missed; this PR fixes it.

      This patch is also a prerequisite for #17419, which shows that the DataFrame version is slower than the RDD version because of this issue.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18712 from cloud-fan/minor.
      86664338
  5. Jul 23, 2017
    • Kazuaki Ishizaki's avatar
      [SPARK-21512][SQL][TEST] DatasetCacheSuite needs to execute unpersistent after executing peristent · 481f0792
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR avoids reusing a persisted dataset across test cases by unpersisting the dataset at the end of each test case.

      In `DatasetCacheSuite`, the test case `"get storage level"` does not unpersist the dataset after persisting it. The same dataset is then persisted again by the test case `"persist and then rebind right encoder when join 2 datasets"`. Thus, when these test cases run together, the second case does not actually persist the dataset, because the data is already cached, as the warnings below show.

      When only the second case is run, it does persist the dataset. It is not good to change the behavior of the second test case; instead, the first test case should correctly unpersist the dataset.
      
      ```
      Testing started at 17:52 ...
      01:52:15.053 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      01:52:48.595 WARN org.apache.spark.sql.execution.CacheManager: Asked to cache already cached data.
      01:52:48.692 WARN org.apache.spark.sql.execution.CacheManager: Asked to cache already cached data.
      01:52:50.864 WARN org.apache.spark.storage.RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
      01:52:50.864 WARN org.apache.spark.storage.RandomBlockReplicationPolicy: Expecting 1 replicas with only 0 peer/s.
      01:52:50.868 WARN org.apache.spark.storage.BlockManager: Block rdd_8_1 replicated to only 0 peer(s) instead of 1 peers
      01:52:50.868 WARN org.apache.spark.storage.BlockManager: Block rdd_8_0 replicated to only 0 peer(s) instead of 1 peers
      ```
      
      After this PR, these warnings no longer appear:
      ```
      Testing started at 18:14 ...
      02:15:05.329 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
      
      Process finished with exit code 0
      ```
      
      ## How was this patch tested?
      
      Used the existing test
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #18719 from kiszk/SPARK-21512.
      481f0792
    • Reynold Xin's avatar
      [MINOR] Remove **** in test case names in FlatMapGroupsWithStateSuite · a4eac8b0
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes the `****` string from test names in FlatMapGroupsWithStateSuite. `***` is a common string developers grep for when using ScalaTest (because it immediately shows the failing test cases). The existence of `****` in test names disrupts that workflow.
      
      ## How was this patch tested?
      N/A - test only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #18715 from rxin/FlatMapGroupsWithStateStar.
      a4eac8b0
    • pj.fanning's avatar
      [SPARK-20871][SQL] limit logging of Janino code · 2a53fbfc
      pj.fanning authored
      ## What changes were proposed in this pull request?
      
      When the generated code is larger than 64 KB, Janino compilation fails and CodeGenerator.scala logs the entire generated code at ERROR level.
      SPARK-20871 suggests logging the code only at DEBUG level.
      Since the code is already logged at DEBUG level, this pull request proposes not including the formatted code in the ERROR logging and the exception message at all.
      When an exception occurs, the code is logged at INFO level, truncated if it is more than 1000 lines long.
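      A hedged sketch of the truncation idea (the `maxLines` name comes from this description; the helper itself is illustrative, not the actual CodeFormatter change):

      ```
      // Keep at most `maxLines` lines of generated code when logging, so a
      // >64 KB codegen failure does not dump the entire method body into the logs.
      def truncateForLog(code: String, maxLines: Int = 1000): String = {
        val lines = code.split("\n", -1)
        if (lines.length <= maxLines) code
        else (lines.take(maxLines) :+ s"[truncated to $maxLines lines, ${lines.length} total]").mkString("\n")
      }
      ```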
      
      ## How was this patch tested?
      
      Existing tests were run.
      An extra test case was added to CodeFormatterSuite to test the new maxLines parameter.
      
      Author: pj.fanning <pj.fanning@workday.com>
      
      Closes #18658 from pjfanning/SPARK-20871.
      2a53fbfc
    • Wenchen Fan's avatar
      [SPARK-10063] Follow-up: remove a useless test related to an old output committer · ccaee5b5
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      It's a follow-up of https://github.com/apache/spark/pull/18689, which forgot to remove a useless test.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18716 from cloud-fan/test.
      ccaee5b5
  6. Jul 21, 2017
    • Takuya UESHIN's avatar
      [SPARK-21472][SQL][FOLLOW-UP] Introduce ArrowColumnVector as a reader for Arrow vectors. · 2f146842
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up of #18680.
      
      In some environments, a compile error occurs, saying:
      
      ```
      .../sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/ArrowColumnVector.java:243:
      error: not found: type Array
        public void loadBytes(Array array) {
                              ^
      ```
      
      This PR fixes it.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18701 from ueshin/issues/SPARK-21472_fup1.
      2f146842
  7. Jul 20, 2017
    • Wenchen Fan's avatar
      [SPARK-10063] Follow-up: remove dead code related to an old output committer · 3ac60930
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      DirectParquetOutputCommitter was removed from Spark as it was deemed unsafe to use. However, we still had some code that generated a warning about it. This patch removes that code as well.
      
      This is kind of a follow-up of https://github.com/apache/spark/pull/16796
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #18689 from cloud-fan/minor.
      3ac60930
    • Takuya UESHIN's avatar
      [SPARK-21472][SQL] Introduce ArrowColumnVector as a reader for Arrow vectors. · cb19880c
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Introducing `ArrowColumnVector` as a reader for Arrow vectors.
      It extends `ColumnVector`, so we will be able to use it with `ColumnarBatch` and its functionalities.
      Currently it supports primitive types and `StringType`, `ArrayType` and `StructType`.
      
      ## How was this patch tested?
      
      Added tests for `ArrowColumnVector` and existing tests.
      
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18680 from ueshin/issues/SPARK-21472.
      cb19880c
    • gatorsmile's avatar
      [SPARK-21477][SQL][MINOR] Mark LocalTableScanExec's input data transient · 256358f6
      gatorsmile authored
      ## What changes were proposed in this pull request?
      This PR marks the parameters `rows` and `unsafeRow` of LocalTableScanExec as transient, which avoids serializing unneeded objects.
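      A hedged sketch of the pattern (the class body is illustrative, not the actual LocalTableScanExec):

      ```
      // Illustrative sketch, not the actual LocalTableScanExec: @transient keeps
      // the driver-side row data out of the serialized plan sent to executors.
      case class LocalScanLikeNode(@transient rows: Seq[Seq[Any]]) extends Serializable {
        // lazily materialized where the rows are actually available (the driver)
        @transient private lazy val materialized: Array[Seq[Any]] = rows.toArray
      }
      ```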
      
      ## How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18686 from gatorsmile/LocalTableScanExec.
      256358f6
  8. Jul 19, 2017
    • Xiang Gao's avatar
      [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null... · b7a40f64
      Xiang Gao authored
      [SPARK-16542][SQL][PYSPARK] Fix bugs about types that result an array of null when creating DataFrame using python
      
      ## What changes were proposed in this pull request?
      This is the reopen of https://github.com/apache/spark/pull/14198, with merge conflicts resolved.
      
      ueshin Could you please take a look at my code?
      
      Fix bugs about types that result in an array of null when creating a DataFrame using Python.
      
      Python's array.array has richer types than Python itself, e.g. we can have `array('f',[1,2,3])` and `array('d',[1,2,3])`. Code in spark-sql and pyspark did not take this into consideration, which could cause a problem where you get an array of null values when you have `array('f')` in your rows.

      A simple snippet to reproduce this bug:
      
      ```
      from pyspark import SparkContext
      from pyspark.sql import SQLContext,Row,DataFrame
      from array import array
      
      sc = SparkContext()
      sqlContext = SQLContext(sc)
      
      row1 = Row(floatarray=array('f',[1,2,3]), doublearray=array('d',[1,2,3]))
      rows = sc.parallelize([ row1 ])
      df = sqlContext.createDataFrame(rows)
      df.show()
      ```
      
      which gives the following output:
      
      ```
      +---------------+------------------+
      |    doublearray|        floatarray|
      +---------------+------------------+
      |[1.0, 2.0, 3.0]|[null, null, null]|
      +---------------+------------------+
      ```
      
      ## How was this patch tested?
      
      New test case added
      
      Author: Xiang Gao <qasdfgtyuiop@gmail.com>
      Author: Gao, Xiang <qasdfgtyuiop@gmail.com>
      Author: Takuya UESHIN <ueshin@databricks.com>
      
      Closes #18444 from zasdfgbnm/fix_array_infer.
      b7a40f64
    • Burak Yavuz's avatar
      [SPARK-21463] Allow userSpecifiedSchema to override partition inference... · 2c9d5ef1
      Burak Yavuz authored
      [SPARK-21463] Allow userSpecifiedSchema to override partition inference performed by MetadataLogFileIndex
      
      ## What changes were proposed in this pull request?
      
      When using the MetadataLogFileIndex to read back a table, we don't respect the user-provided schema for the column types. This can lead to issues such as strings that look like dates being narrowed to DateType, or longs being narrowed to IntegerType, simply because no value in the data actually requires a long.
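      A hedged usage sketch (path and schema are hypothetical): supplying an explicit schema when reading back data written by a streaming sink keeps the columns at the intended types instead of the inferred ones.

      ```
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

      val spark = SparkSession.builder().master("local[2]").appName("read-back").getOrCreate()

      // Hypothetical schema: "date" stays a string and "id" stays a long even if
      // inference would have guessed DateType / IntegerType from the values seen.
      val userSchema = StructType(Seq(
        StructField("id", LongType),
        StructField("date", StringType),
        StructField("value", StringType)))

      val df = spark.read.schema(userSchema).parquet("/tmp/stream-sink-output")
      ```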
      
      ## How was this patch tested?
      
      Unit tests and manual tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #18676 from brkyvz/stream-partitioning.
      2c9d5ef1
    • Corey Woodfield's avatar
      [SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of Dataset#joinWith · 8cd9cdf1
      Corey Woodfield authored
      ## What changes were proposed in this pull request?
      
      Two invalid join types were mistakenly listed in the javadoc for joinWith, in the Dataset class. I presume these were copied from the javadoc of join, but since joinWith returns a Dataset\<Tuple2\>, left_semi and left_anti are invalid, as they only return values from one of the datasets, instead of from both.
      
      ## How was this patch tested?
      
      I ran the following code :
      ```
      public static void main(String[] args) {
      	SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
      	Dataset<Row> one = spark.createDataFrame(Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)), Bean.class);
      	Dataset<Row> two = spark.createDataFrame(Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)), Bean.class);
      
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show();} catch (Exception e) {e.printStackTrace();}
      	try {two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show();} catch (Exception e) {e.printStackTrace();}
      }
      ```
      which tests all the different join types; the last two (left_semi and left_anti) threw exceptions. The same code using join instead of joinWith worked fine. The Bean class was just a Java bean with a single int field, x.
      
      Author: Corey Woodfield <coreywoodfield@gmail.com>
      
      Closes #18462 from coreywoodfield/master.
      8cd9cdf1
    • DFFuture's avatar
      [SPARK-21446][SQL] Fix setAutoCommit never executed · c9729187
      DFFuture authored
      ## What changes were proposed in this pull request?
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446
      `options.asConnectionProperties` cannot contain `fetchsize`, because `fetchsize` is a Spark-only option, and Spark-only options are excluded from the connection properties.
      So this change switches the properties passed to `beforeFetch` from `options.asConnectionProperties.asScala.toMap` to `options.asProperties.asScala.toMap`.
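      A hedged sketch of why the Spark-only key never reached the dialect (illustrative, not Spark's actual JDBCOptions code):

      ```
      import java.util.Properties
      import scala.collection.JavaConverters._

      // A property map built by filtering out Spark-only keys (like fetchsize)
      // can never hand fetchsize to the dialect's beforeFetch hook, so the
      // Postgres cursor setup that disables auto-commit never runs.
      val allOptions = new Properties()
      allOptions.setProperty("user", "test")
      allOptions.setProperty("fetchsize", "1000")

      val sparkOnlyKeys = Set("fetchsize")
      val connectionOnly = allOptions.asScala.toMap.filterKeys(k => !sparkOnlyKeys.contains(k))

      assert(!connectionOnly.contains("fetchsize"))          // what beforeFetch used to receive
      assert(allOptions.asScala.toMap.contains("fetchsize")) // what it receives after this change
      ```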
      
      ## How was this patch tested?
      
      Author: DFFuture <albert.zhang23@gmail.com>
      
      Closes #18665 from DFFuture/sparksql_pg.
      c9729187
    • Tathagata Das's avatar
      [SPARK-21464][SS] Minimize deprecation warnings caused by ProcessingTime class · 70fe99dc
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Use of the `ProcessingTime` class was deprecated in favor of `Trigger.ProcessingTime` in Spark 2.2. However, remaining uses of ProcessingTime cause deprecation warnings during compilation. This cannot be avoided entirely: even though it is deprecated as a public API, ProcessingTime instances are still used internally in TriggerExecutor. This PR minimizes the warnings by removing its uses from tests as much as possible.
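      A hedged sketch of the preferred, non-deprecated form (the source and sink here are illustrative):

      ```
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.streaming.Trigger

      val spark = SparkSession.builder().master("local[2]").appName("trigger-example").getOrCreate()

      // Use Trigger.ProcessingTime rather than the deprecated ProcessingTime
      // class when setting a micro-batch interval.
      val query = spark.readStream
        .format("rate")
        .load()
        .writeStream
        .format("console")
        .trigger(Trigger.ProcessingTime("10 seconds"))
        .start()
      ```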
      
      ## How was this patch tested?
      Existing tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18678 from tdas/SPARK-21464.
      70fe99dc
    • donnyzone's avatar
      [SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results failures in some cases · 6b6dd682
      donnyzone authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441
      
      This issue can be reproduced by the following example:
      
      ```
      val spark = SparkSession
         .builder()
         .appName("smj-codegen")
         .master("local")
         .config("spark.sql.autoBroadcastJoinThreshold", "1")
         .getOrCreate()
      val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
      val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
      val df = df1.join(df2, df1("key") === df2("key"))
         .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
         .select("int")
         df.show()
      ```
      
      To conclude, the issue happens when:
      (1) the SortMergeJoin condition contains CodegenFallback expressions, and
      (2) in the physical plan tree, the SortMergeJoin node is the child of the root node, e.g., the Project in the above example.
      
      This patch fixes the logic in `CollapseCodegenStages` rule.
      
      ## How was this patch tested?
      Unit test and manual verification in our cluster.
      
      Author: donnyzone <wellfengzhu@gmail.com>
      
      Closes #18656 from DonnyZone/Fix_SortMergeJoinExec.
      6b6dd682
    • jinxing's avatar
      [SPARK-21414] Refine SlidingWindowFunctionFrame to avoid OOM. · 4eb081cc
      jinxing authored
      ## What changes were proposed in this pull request?
      
      Currently, `SlidingWindowFunctionFrame` adds to the buffer all rows whose input value is less than or equal to the output row's upper bound, and then drops from the buffer all rows whose input value is smaller than the output row's lower bound.
      This can make the buffer very big even though the window itself is small.
      For example:
      ```
      select a, b, sum(a)
      over (partition by b order by a range between 1000000 following and 1000001 following)
      from table
      ```
      We can refine the logic to add only the qualifying rows to the buffer.
      
      ## How was this patch tested?
      Manual test:
      Run sql
      `select shop, shopInfo, district, sum(revenue) over(partition by district order by revenue range between 100 following and 200 following) from revenueList limit 10`
      against a table with 4 columns (shop: String, shopInfo: String, district: String, revenue: Int). The biggest partition is around 2 GB, containing about 200k rows.
      The executor is configured with 2 GB of memory.
      With the change in this PR, it works fine. Without this change, the exception below is thrown.
      ```
      MemoryError: Java heap space
      	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)
      	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:62)
      	at org.apache.spark.sql.execution.window.SlidingWindowFunctionFrame.write(WindowFunctionFrame.scala:201)
      	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:365)
      	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:289)
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
      	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
      	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
      	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
      	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
      	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
      	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
      	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
      	at org.apache.spark.scheduler.Task.run(Task.scala:108)
      	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
      	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
      	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
      ```
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #18634 from jinxing64/SPARK-21414.
      4eb081cc
  9. Jul 18, 2017
  10. Jul 17, 2017
    • aokolnychyi's avatar
      [SPARK-21332][SQL] Incorrect result type inferred for some decimal expressions · 0be5fb41
      aokolnychyi authored
      ## What changes were proposed in this pull request?
      
      This PR changes the direction of expression transformation in the DecimalPrecision rule. Previously, the expressions were transformed down, which led to incorrect result types when decimal expressions had other decimal expressions as their operands. The root cause of this issue was visiting outer nodes before their children. Consider the example below:
      
      ```
          val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
          val sc = spark.sparkContext
          val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
          val df = spark.createDataFrame(rdd, inputSchema)
      
          // Works correctly since no nested decimal expression is involved
          // Expected result type: (26, 6) * (26, 6) = (38, 12)
          df.select($"col" * $"col").explain(true)
          df.select($"col" * $"col").printSchema()
      
          // Gives a wrong result since there is a nested decimal expression that should be visited first
          // Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
          df.select($"col" * $"col" * $"col").explain(true)
          df.select($"col" * $"col" * $"col").printSchema()
      ```
      
      The example above gives the following output:
      
      ```
      // Correct result without sub-expressions
      == Parsed Logical Plan ==
      'Project [('col * 'col) AS (col * col)#4]
      +- LogicalRDD [col#1]
      
      == Analyzed Logical Plan ==
      (col * col): decimal(38,12)
      Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4]
      +- LogicalRDD [col#1]
      
      == Optimized Logical Plan ==
      Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
      +- LogicalRDD [col#1]
      
      == Physical Plan ==
      *Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
      +- Scan ExistingRDD[col#1]
      
      // Schema
      root
       |-- (col * col): decimal(38,12) (nullable = true)
      
      // Incorrect result with sub-expressions
      == Parsed Logical Plan ==
      'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11]
      +- LogicalRDD [col#1]
      
      == Analyzed Logical Plan ==
      ((col * col) * col): decimal(38,12)
      Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11]
      +- LogicalRDD [col#1]
      
      == Optimized Logical Plan ==
      Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
      +- LogicalRDD [col#1]
      
      == Physical Plan ==
      *Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
      +- Scan ExistingRDD[col#1]
      
      // Schema
      root
       |-- ((col * col) * col): decimal(38,12) (nullable = true)
      ```
      
      ## How was this patch tested?
      
      This PR was tested with available unit tests. Moreover, there are tests to cover previously failing scenarios.
      
      Author: aokolnychyi <anton.okolnychyi@sap.com>
      
      Closes #18583 from aokolnychyi/spark-21332.
      0be5fb41
    • Tathagata Das's avatar
      [SPARK-21409][SS] Follow up PR to allow different types of custom metrics to be exposed · e9faae13
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Implementations may expose both timing and size metrics. This PR enables that.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18661 from tdas/SPARK-21409-2.
      e9faae13
    • gatorsmile's avatar
      [MINOR] Improve SQLConf messages · a8c6d0f6
      gatorsmile authored
      ### What changes were proposed in this pull request?
      The current SQLConf messages of `spark.sql.hive.convertMetastoreParquet` and `spark.sql.hive.convertMetastoreOrc` are not very clear to end users. This PR is to improve them.
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18657 from gatorsmile/msgUpdates.
      a8c6d0f6
    • Tathagata Das's avatar
      [SPARK-21409][SS] Expose state store memory usage in SQL metrics and progress updates · 9d8c8317
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Currently, there is no tracking of memory usage of state stores. This JIRA is to expose that through SQL metrics and StreamingQueryProgress.
      
      Additionally, this adds the ability to expose implementation-specific metrics through the StateStore APIs to the SQL metrics.
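      A hedged sketch of reading the exposed numbers from a running query's progress, assuming an active StreamingQuery named `query`; the `memoryUsedBytes` field name is taken to be what this change adds, so treat it as an assumption.

      ```
      // `query` is assumed to be a started StreamingQuery with stateful operators.
      val progress = query.lastProgress
      if (progress != null) {
        progress.stateOperators.foreach { op =>
          println(s"state rows=${op.numRowsTotal}, memory used=${op.memoryUsedBytes} bytes")
        }
      }
      ```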
      
      ## How was this patch tested?
      Added unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #18629 from tdas/SPARK-21409.
      9d8c8317
    • gatorsmile's avatar
      [SPARK-21354][SQL] INPUT FILE related functions do not support more than one sources · e398c281
      gatorsmile authored
      ### What changes were proposed in this pull request?
      The built-in functions `input_file_name`, `input_file_block_start`, and `input_file_block_length` do not support more than one source, matching what Hive does. Currently, Spark does not block this, and the output is ambiguous/non-deterministic: it could come from either side.
      
      ```
      hive> select *, INPUT__FILE__NAME FROM t1, t2;
      FAILED: SemanticException Column INPUT__FILE__NAME Found in more than One Tables/Subqueries
      ```
      
      This PR blocks it and issues an error.
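      A hedged Scala sketch of the newly blocked usage (tables t1 and t2 are hypothetical, and an active SparkSession named `spark` is assumed): after this change, the query is expected to fail analysis instead of returning a file name from an arbitrary side.

      ```
      // Referencing input_file_name() while scanning more than one file-based
      // source is now rejected with an error during analysis.
      val failing = spark.sql(
        "SELECT t1.*, input_file_name() FROM t1 JOIN t2 ON t1.id = t2.id")
      ```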
      
      ### How was this patch tested?
      Added a test case
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18580 from gatorsmile/inputFileName.
      e398c281
  11. Jul 16, 2017
  12. Jul 14, 2017
  13. Jul 13, 2017
    • Sean Owen's avatar
      [SPARK-19810][BUILD][CORE] Remove support for Scala 2.10 · 425c4ada
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove Scala 2.10 build profiles and support
      - Replace some 2.10 support in scripts with commented placeholders for 2.12 later
      - Remove deprecated API calls from 2.10 support
      - Remove usages of deprecated context bounds where possible
      - Remove Scala 2.10 workarounds like ScalaReflectionLock
      - Other minor Scala warning fixes
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17150 from srowen/SPARK-19810.
      425c4ada
  14. Jul 12, 2017