  1. Mar 24, 2017
    • [SPARK-19911][STREAMING] Add builder interface for Kinesis DStreams · 707e5018
      Adam Budde authored
      ## What changes were proposed in this pull request?
      
      - Add new KinesisDStream.scala containing KinesisDStream.Builder class (see the usage sketch after this list)
      - Add KinesisDStreamBuilderSuite test suite
      - Make KinesisInputDStream ctor args package private for testing
      - Add JavaKinesisDStreamBuilderSuite test suite
      - Add args to KinesisInputDStream and KinesisReceiver for optional
        service-specific auth (Kinesis, DynamoDB and CloudWatch)
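
      For orientation, here is a hedged usage sketch of the builder pattern this adds. It is written against the Kinesis builder as documented for Spark 2.2 (`KinesisInputDStream.builder`); the class added in this commit is still named `KinesisDStream.Builder`, so the entry point and method names below are assumptions, and the stream, region and application names are placeholders.

      ```scala
      import com.amazonaws.services.kinesis.clientlibrary.lib.worker.InitialPositionInStream
      import org.apache.spark.SparkConf
      import org.apache.spark.storage.StorageLevel
      import org.apache.spark.streaming.{Seconds, StreamingContext}
      import org.apache.spark.streaming.kinesis.KinesisInputDStream

      val ssc = new StreamingContext(new SparkConf().setAppName("KinesisBuilderSketch"), Seconds(10))

      // Builder-style construction replaces the long positional argument lists of
      // KinesisUtils.createStream; options left unset fall back to defaults.
      val kinesisStream = KinesisInputDStream.builder
        .streamingContext(ssc)
        .streamName("myKinesisStream")
        .endpointUrl("https://kinesis.us-east-1.amazonaws.com")
        .regionName("us-east-1")
        .initialPositionInStream(InitialPositionInStream.LATEST)
        .checkpointAppName("myKinesisApp")   // also names the DynamoDB table used by the KCL
        .checkpointInterval(Seconds(10))
        .storageLevel(StorageLevel.MEMORY_AND_DISK_2)
        .build()
      ```
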
      ## How was this patch tested?
      
      Added ```KinesisDStreamBuilderSuite``` to verify the builder class works as expected
      
      Author: Adam Budde <budde@amazon.com>
      
      Closes #17250 from budde/KinesisStreamBuilder.
    • [SQL][MINOR] Fix for typo in Analyzer · 9299d071
      Jacek Laskowski authored
      ## What changes were proposed in this pull request?
      
      Fix for typo in Analyzer
      
      ## How was this patch tested?
      
      local build
      
      Author: Jacek Laskowski <jacek@japila.pl>
      
      Closes #17409 from jaceklaskowski/analyzer-typo.
    • [SPARK-15040][ML][PYSPARK] Add Imputer to PySpark · d9f4ce69
      Nick Pentreath authored
      Add Python wrapper for `Imputer` feature transformer.
      
      ## How was this patch tested?
      
      New doc tests and tweak to PySpark ML `tests.py`
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #17316 from MLnick/SPARK-15040-pyspark-imputer.
    • [SPARK-19970][SQL][FOLLOW-UP] Table owner should be USER instead of PRINCIPAL... · 344f38b0
      Xiao Li authored
      [SPARK-19970][SQL][FOLLOW-UP] Table owner should be USER instead of PRINCIPAL in kerberized clusters #17311
      
      ### What changes were proposed in this pull request?
      This is a follow-up for the PR: https://github.com/apache/spark/pull/17311
      
      - For safety, use `sessionState` to get the user name, instead of calling `SessionState.get()` in the function `toHiveTable`.
      - Passing `user names` instead of `conf` when calling `toHiveTable`.
      
      ### How was this patch tested?
      N/A
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17405 from gatorsmile/user.
    • [SPARK-19820][CORE] Add interface to kill tasks w/ a reason · 8e558041
      Eric Liang authored
      This commit adds a killTaskAttempt method to SparkContext, to allow users to
      kill tasks so that they can be re-scheduled elsewhere.
      
      This also refactors the task kill path to allow specifying a reason for the task kill. The reason is propagated opaquely through events, and will show up in the UI automatically as `(N killed: $reason)` and `TaskKilled: $reason`. Without this change, there is no way to provide the user feedback through the UI.
      
      Currently used reasons are "stage cancelled", "another attempt succeeded", and "killed via SparkContext.killTask". The user can also specify a custom reason through `SparkContext.killTask`.
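
      As a hedged sketch of calling the new API (the task attempt id is a placeholder, and the parameter names follow the public `SparkContext.killTaskAttempt` signature rather than anything shown in this diff):

      ```scala
      import org.apache.spark.{SparkConf, SparkContext}

      val sc = new SparkContext(new SparkConf().setAppName("KillTaskSketch").setMaster("local[*]"))

      // The task attempt id comes from the UI or from a SparkListener's TaskInfo;
      // the reason string is what shows up in the UI as described above.
      sc.killTaskAttempt(12345L, interruptThread = true, reason = "straggler, re-scheduling elsewhere")
      ```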
      
      cc rxin
      
      In the stage overview UI the reasons are summarized:
      ![1](https://cloud.githubusercontent.com/assets/14922/23929209/a83b2862-08e1-11e7-8b3e-ae1967bbe2e5.png)
      
      Within the stage UI you can see individual task kill reasons:
      ![2](https://cloud.githubusercontent.com/assets/14922/23929200/9a798692-08e1-11e7-8697-72b27ad8a287.png)
      
      Existing tests; also tried killing some stages in the UI and verified the messages are as expected.
      
      Author: Eric Liang <ekl@databricks.com>
      Author: Eric Liang <ekl@google.com>
      
      Closes #17166 from ericl/kill-reason.
    • [SPARK-16929] Improve performance when check speculatable tasks. · 19596c28
      jinxing authored
      ## What changes were proposed in this pull request?
      1. Use a MedianHeap to record the durations of successful tasks. When checking speculatable tasks, we can then get the median duration in O(1) time (see the sketch after this list).

      2. `checkSpeculatableTasks` synchronizes on `TaskSchedulerImpl`. If `checkSpeculatableTasks` doesn't finish within 100ms, the thread may release the lock and then immediately re-acquire it. Change `scheduleAtFixedRate` to `scheduleWithFixedDelay` when scheduling `checkSpeculatableTasks`.
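
      For intuition, a minimal two-heap median sketch (illustrative only, not the `MedianHeap` class added in this PR): the lower half of the values lives in a max-heap and the upper half in a min-heap, so the median is always available from the heap tops.

      ```scala
      import scala.collection.mutable

      class MedianHeapSketch {
        private val smaller = mutable.PriorityQueue.empty[Double]                            // max-heap: lower half
        private val larger = mutable.PriorityQueue.empty[Double](Ordering[Double].reverse)   // min-heap: upper half

        def insert(x: Double): Unit = {
          if (smaller.isEmpty || x <= smaller.head) smaller.enqueue(x) else larger.enqueue(x)
          // Rebalance so that the two halves differ in size by at most one.
          if (smaller.size > larger.size + 1) larger.enqueue(smaller.dequeue())
          else if (larger.size > smaller.size + 1) smaller.enqueue(larger.dequeue())
        }

        def median: Double = {
          if (smaller.size == larger.size) (smaller.head + larger.head) / 2.0
          else if (smaller.size > larger.size) smaller.head
          else larger.head
        }
      }
      ```
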
      ## How was this patch tested?
      Added MedianHeapSuite.
      
      Author: jinxing <jinxing6042@126.com>
      
      Closes #16867 from jinxing64/SPARK-16929.
  2. Mar 23, 2017
    • [SPARK-19959][SQL] Fix to throw NullPointerException in df[java.lang.Long].collect · bb823ca4
      Kazuaki Ishizaki authored
      ## What changes were proposed in this pull request?
      
      This PR fixes a `NullPointerException` in the code generated by Catalyst. When we run the following code, we get the `NullPointerException` below. This is because there is no null check for `inputadapter_value`, while `java.lang.Long inputadapter_value` at line 30 may be `null`.

      This happens when the DataFrame's type is a nullable primitive type such as `java.lang.Long` and whole-stage codegen is used. While the physical plan keeps `nullable=true` in `input[0, java.lang.Long, true].longValue`, `BoundReference.doGenCode` ignores `nullable=true`. Thus, no null check is generated and a `NullPointerException` occurs.

      This PR checks the nullability and generates the null check when needed.
      ```java
      sparkContext.parallelize(Seq[java.lang.Long](0L, null, 2L), 1).toDF.collect
      ```
      
      ```java
      Caused by: java.lang.NullPointerException
      	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(generated.java:37)
      	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
      	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:393)
      ...
      ```
      
      Generated code without this PR
      ```java
      /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
      /* 006 */   private Object[] references;
      /* 007 */   private scala.collection.Iterator[] inputs;
      /* 008 */   private scala.collection.Iterator inputadapter_input;
      /* 009 */   private UnsafeRow serializefromobject_result;
      /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
      /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
      /* 012 */
      /* 013 */   public GeneratedIterator(Object[] references) {
      /* 014 */     this.references = references;
      /* 015 */   }
      /* 016 */
      /* 017 */   public void init(int index, scala.collection.Iterator[] inputs) {
      /* 018 */     partitionIndex = index;
      /* 019 */     this.inputs = inputs;
      /* 020 */     inputadapter_input = inputs[0];
      /* 021 */     serializefromobject_result = new UnsafeRow(1);
      /* 022 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
      /* 023 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
      /* 024 */
      /* 025 */   }
      /* 026 */
      /* 027 */   protected void processNext() throws java.io.IOException {
      /* 028 */     while (inputadapter_input.hasNext() && !stopEarly()) {
      /* 029 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 030 */       java.lang.Long inputadapter_value = (java.lang.Long)inputadapter_row.get(0, null);
      /* 031 */
      /* 032 */       boolean serializefromobject_isNull = true;
      /* 033 */       long serializefromobject_value = -1L;
      /* 034 */       if (!false) {
      /* 035 */         serializefromobject_isNull = false;
      /* 036 */         if (!serializefromobject_isNull) {
      /* 037 */           serializefromobject_value = inputadapter_value.longValue();
      /* 038 */         }
      /* 039 */
      /* 040 */       }
      /* 041 */       serializefromobject_rowWriter.zeroOutNullBytes();
      /* 042 */
      /* 043 */       if (serializefromobject_isNull) {
      /* 044 */         serializefromobject_rowWriter.setNullAt(0);
      /* 045 */       } else {
      /* 046 */         serializefromobject_rowWriter.write(0, serializefromobject_value);
      /* 047 */       }
      /* 048 */       append(serializefromobject_result);
      /* 049 */       if (shouldStop()) return;
      /* 050 */     }
      /* 051 */   }
      /* 052 */ }
      ```
      
      Generated code with this PR
      
      ```java
      /* 005 */ final class GeneratedIterator extends org.apache.spark.sql.execution.BufferedRowIterator {
      /* 006 */   private Object[] references;
      /* 007 */   private scala.collection.Iterator[] inputs;
      /* 008 */   private scala.collection.Iterator inputadapter_input;
      /* 009 */   private UnsafeRow serializefromobject_result;
      /* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder serializefromobject_holder;
      /* 011 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter serializefromobject_rowWriter;
      /* 012 */
      /* 013 */   public GeneratedIterator(Object[] references) {
      /* 014 */     this.references = references;
      /* 015 */   }
      /* 016 */
      /* 017 */   public void init(int index, scala.collection.Iterator[] inputs) {
      /* 018 */     partitionIndex = index;
      /* 019 */     this.inputs = inputs;
      /* 020 */     inputadapter_input = inputs[0];
      /* 021 */     serializefromobject_result = new UnsafeRow(1);
      /* 022 */     this.serializefromobject_holder = new org.apache.spark.sql.catalyst.expressions.codegen.BufferHolder(serializefromobject_result, 0);
      /* 023 */     this.serializefromobject_rowWriter = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(serializefromobject_holder, 1);
      /* 024 */
      /* 025 */   }
      /* 026 */
      /* 027 */   protected void processNext() throws java.io.IOException {
      /* 028 */     while (inputadapter_input.hasNext() && !stopEarly()) {
      /* 029 */       InternalRow inputadapter_row = (InternalRow) inputadapter_input.next();
      /* 030 */       boolean inputadapter_isNull = inputadapter_row.isNullAt(0);
      /* 031 */       java.lang.Long inputadapter_value = inputadapter_isNull ? null : ((java.lang.Long)inputadapter_row.get(0, null));
      /* 032 */
      /* 033 */       boolean serializefromobject_isNull = true;
      /* 034 */       long serializefromobject_value = -1L;
      /* 035 */       if (!inputadapter_isNull) {
      /* 036 */         serializefromobject_isNull = false;
      /* 037 */         if (!serializefromobject_isNull) {
      /* 038 */           serializefromobject_value = inputadapter_value.longValue();
      /* 039 */         }
      /* 040 */
      /* 041 */       }
      /* 042 */       serializefromobject_rowWriter.zeroOutNullBytes();
      /* 043 */
      /* 044 */       if (serializefromobject_isNull) {
      /* 045 */         serializefromobject_rowWriter.setNullAt(0);
      /* 046 */       } else {
      /* 047 */         serializefromobject_rowWriter.write(0, serializefromobject_value);
      /* 048 */       }
      /* 049 */       append(serializefromobject_result);
      /* 050 */       if (shouldStop()) return;
      /* 051 */     }
      /* 052 */   }
      /* 053 */ }
      ```
      
      ## How was this patch tested?
      
      Added new test suites in `DataFrameSuites`
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #17302 from kiszk/SPARK-19959.
    • [SPARK-19636][ML] Feature parity for correlation statistics in MLlib · d27daa54
      Timothy Hunter authored
      ## What changes were proposed in this pull request?
      
      This patch adds DataFrame-based support for the correlation statistics found in `org.apache.spark.mllib.stat.correlation.Statistics`, following the design doc discussed in the JIRA ticket.
      
      The current implementation is a simple wrapper around the `spark.mllib` implementation. Future optimizations can be implemented at a later stage.
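
      As a usage sketch, written against the `Correlation.corr` API as it is documented in `org.apache.spark.ml.stat` for Spark 2.2 (the class name at this exact commit may differ):

      ```scala
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.ml.stat.Correlation
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("CorrelationSketch").master("local[*]").getOrCreate()
      import spark.implicits._

      val data = Seq(
        Vectors.dense(1.0, 2.0, 3.0),
        Vectors.dense(4.0, 5.0, 6.0),
        Vectors.dense(7.0, 8.0, 10.0)
      ).map(Tuple1.apply).toDF("features")

      // Returns a one-row DataFrame whose single cell is the correlation Matrix.
      Correlation.corr(data, "features").show(truncate = false)              // Pearson by default
      Correlation.corr(data, "features", "spearman").show(truncate = false)
      ```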
      
      ## How was this patch tested?
      
      ```
      build/sbt "testOnly org.apache.spark.ml.stat.StatisticsSuite"
      ```
      
      Author: Timothy Hunter <timhunter@databricks.com>
      
      Closes #17108 from thunterdb/19636.
    • Fix compilation of the Scala 2.10 master branch · 93581fbc
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      Fixes break caused by: https://github.com/apache/spark/commit/746a558de2136f91f8fe77c6e51256017aa50913
      
      ## How was this patch tested?
      
      Compiled with `build/sbt -Dscala2.10 sql/compile` locally
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #17403 from brkyvz/onceTrigger2.10.
    • [SPARK-10849][SQL] Adds option to the JDBC data source write for user to... · c7911807
      sureshthalamati authored
      [SPARK-10849][SQL] Adds option to the JDBC data source write for user to specify database column type for the create table
      
      ## What changes were proposed in this pull request?
      Currently the JDBC data source creates tables in the target database using the default type mapping and the JDBC dialect mechanism. If users want to specify a different database data type for only some of the columns, there is no option available. In scenarios where the default mapping does not work, users are forced to create tables in the target database before writing. This workaround is probably not acceptable from a usability point of view. This PR provides a user-defined type mapping for specific columns.

      The solution is to allow users to specify the database column data types for the create table as a JDBC data source option (`createTableColumnTypes`) on write. Data type information can be specified in the same format as table schema DDL (e.g. `name CHAR(64), comments VARCHAR(1024)`).

      Not all supported target database types can be specified; the data types also have to be valid Spark SQL data types. For example, users cannot specify the target database CLOB data type. This will be supported in a follow-up PR.
      
      Example:
      ```Scala
      df.write
      .option("createTableColumnTypes", "name CHAR(64), comments VARCHAR(1024)")
      .jdbc(url, "TEST.DBCOLTYPETEST", properties)
      ```
      ## How was this patch tested?
      Added new test cases to the JDBCWriteSuite
      
      Author: sureshthalamati <suresh.thalamati@gmail.com>
      
      Closes #16209 from sureshthalamati/jdbc_custom_dbtype_option_json-spark-10849.
    • [SPARK-19567][CORE][SCHEDULER] Support some Schedulable variables immutability and access · b7be05a2
      erenavsarogullari authored
      ## What changes were proposed in this pull request?
      Some `Schedulable` entities' (`Pool` and `TaskSetManager`) variables need refactoring for _immutability_ and _access modifier_ levels as follows (see the sketch below):
      - From `var` to `val` (where there is no requirement for mutation): this is important to support immutability as much as possible.
        - Sample => `Pool`: `weight`, `minShare`, `priority`, `name` and `taskSetSchedulingAlgorithm`.
      - Access modifiers: in particular, access to `var`s needs to be restricted from other parts of the codebase to prevent potential side effects.
        - `TaskSetManager`: `tasksSuccessful`, `totalResultSize`, `calculatedTasks` etc.

      This PR is related to #15604 and has been created separately to keep the patch content isolated and to help the reviewers.
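
      A minimal illustration of the two patterns (not the actual `Pool`/`TaskSetManager` code):

      ```scala
      package org.apache.spark.scheduler

      // Turn never-reassigned `var`s into `val`s, and confine the remaining
      // mutation behind narrower access modifiers.
      private[scheduler] class SchedulableSketch(
          val name: String,      // was a `var`, never reassigned after construction
          val weight: Int,
          val minShare: Int) {

        private var _tasksSuccessful: Int = 0               // still mutable internally
        def tasksSuccessful: Int = _tasksSuccessful         // read-only view for callers
        private[scheduler] def incTasksSuccessful(): Unit = _tasksSuccessful += 1
      }
      ```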
      
      ## How was this patch tested?
      Added new UTs and existing UT coverage.
      
      Author: erenavsarogullari <erenavsarogullari@gmail.com>
      
      Closes #16905 from erenavsarogullari/SPARK-19567.
    • [SPARK-19876][SS][WIP] OneTime Trigger Executor · 746a558d
      Tyson Condie authored
      ## What changes were proposed in this pull request?
      
      An additional trigger and trigger executor that will execute a single trigger only. One can use this OneTime trigger to have more control over the scheduling of triggers.
      
      In addition, this patch requires an optimization to StreamExecution that logs a commit record at the end of successfully processing a batch. This new commit log is used to determine the next batch (offsets) to process after a restart, instead of using the offset log itself; using the offset log would always re-process the previously logged batch, which would not permit a OneTime trigger feature.
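
      A hedged sketch of how a one-time trigger is used from the user side, assuming the `Trigger.Once()` spelling the public API eventually settled on (the rate source and console sink are just stand-ins):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.streaming.Trigger

      val spark = SparkSession.builder().appName("OnceTriggerSketch").master("local[*]").getOrCreate()

      val input = spark.readStream.format("rate").load()

      // Process whatever is available as a single batch, then stop -- useful for
      // cron-style scheduling of a streaming query.
      val query = input.writeStream
        .format("console")
        .trigger(Trigger.Once())
        .start()

      query.awaitTermination()
      ```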
      
      ## How was this patch tested?
      
      A number of existing tests have been revised. These tests all assumed that when restarting a stream, the last batch in the offset log is to be re-processed. Given that we now have a commit log that will tell us if that last batch was processed successfully, the results/assumptions of those tests needed to be revised accordingly.
      
      In addition, a OneTime trigger test was added to StreamingQuerySuite, which tests:
      - The semantics of OneTime trigger (i.e., on start, execute a single batch, then stop).
      - The case when the commit log was not able to successfully log the completion of a batch before restart, which would mean that we should fall back to what's in the offset log.
      - A OneTime trigger execution that results in an exception being thrown.
      
      marmbrus tdas zsxwing
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Tyson Condie <tcondie@gmail.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17219 from tcondie/stream-commit.
    • Typo fixup in comment · b0ae6a38
      Ye Yin authored
      ## What changes were proposed in this pull request?
      
      Fix a typo in a comment.
      
      ## How was this patch tested?
      
      Not needed.
      
      Author: Ye Yin <eyniy@qq.com>
      
      Closes #17396 from hustcat/fix.
    • [INFRA] Close stale PRs · b70c03a4
      Sean Owen authored
      Closes #16819
      Closes #13467
      Closes #16083
      Closes #17135
      Closes #8785
      Closes #16278
      Closes #16997
      Closes #17073
      Closes #17220
      
      Added:
      Closes #12059
      Closes #12524
      Closes #12888
      Closes #16061
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #17386 from srowen/StalePRs.
    • [MINOR][BUILD] Fix javadoc8 break · aefe7989
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Several javadoc8 breaks have been introduced. This PR proposes to fix those instances so that we can build the Scala/Java API docs.
      
      ```
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:6: error: reference not found
      [error]  * <code>flatMapGroupsWithState</code> operations on {link KeyValueGroupedDataset}.
      [error]                                                             ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:10: error: reference not found
      [error]  * Both, <code>mapGroupsWithState</code> and <code>flatMapGroupsWithState</code> in {link KeyValueGroupedDataset}
      [error]                                                                                            ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:51: error: reference not found
      [error]  *    {link GroupStateTimeout.ProcessingTimeTimeout}) or event time (i.e.
      [error]              ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:52: error: reference not found
      [error]  *    {link GroupStateTimeout.EventTimeTimeout}).
      [error]              ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/streaming/GroupState.java:158: error: reference not found
      [error]  *           Spark SQL types (see {link Encoder} for more details).
      [error]                                          ^
      [error] .../spark/mllib/target/java/org/apache/spark/ml/fpm/FPGrowthParams.java:26: error: bad use of '>'
      [error]    * Number of partitions (>=1) used by parallel FP-growth. By default the param is not set, and
      [error]                            ^
      [error] .../spark/sql/core/src/main/java/org/apache/spark/api/java/function/FlatMapGroupsWithStateFunction.java:30: error: reference not found
      [error]  * {link org.apache.spark.sql.KeyValueGroupedDataset#flatMapGroupsWithState(
      [error]           ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:211: error: reference not found
      [error]    * See {link GroupState} for more details.
      [error]                 ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:232: error: reference not found
      [error]    * See {link GroupState} for more details.
      [error]                 ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:254: error: reference not found
      [error]    * See {link GroupState} for more details.
      [error]                 ^
      [error] .../spark/sql/core/target/java/org/apache/spark/sql/KeyValueGroupedDataset.java:277: error: reference not found
      [error]    * See {link GroupState} for more details.
      [error]                 ^
      [error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
      [error]  * {link TaskMetrics} &amp; {link MetricsSystem} objects are not thread safe.
      [error]           ^
      [error] .../spark/core/target/java/org/apache/spark/TaskContextImpl.java:10: error: reference not found
      [error]  * {link TaskMetrics} &amp; {link MetricsSystem} objects are not thread safe.
      [error]                                     ^
      [info] 13 errors
      ```
      
      ```
      jekyll 3.3.1 | Error:  Unidoc generation failed
      ```
      
      ## How was this patch tested?
      
      Manually via `jekyll build`
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17389 from HyukjinKwon/minor-javadoc8-fix.
    • [SPARK-18579][SQL] Use ignoreLeadingWhiteSpace and ignoreTrailingWhiteSpace options in CSV writing · 07c12c09
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support _not_ trimming the white spaces when writing out. These options are `false` by default in the CSV reading path but `true` by default in the CSV writing path of the univocity parser.

      Neither the `ignoreLeadingWhiteSpace` nor the `ignoreTrailingWhiteSpace` option is being used for writing, and therefore we are always trimming the white spaces.

      It seems we should provide a way to keep these white spaces easily.

      With the data below:
      
      ```scala
      val df = spark.read.csv(Seq("a , b  , c").toDS)
      df.show()
      ```
      
      ```
      +---+----+---+
      |_c0| _c1|_c2|
      +---+----+---+
      | a | b  |  c|
      +---+----+---+
      ```
      
      **Before**
      
      ```scala
      df.write.csv("/tmp/text.csv")
      spark.read.text("/tmp/text.csv").show()
      ```
      
      ```
      +-----+
      |value|
      +-----+
      |a,b,c|
      +-----+
      ```
      
      It seems this can't be worked around via `quoteAll` either.
      
      ```scala
      df.write.option("quoteAll", true).csv("/tmp/text.csv")
      spark.read.text("/tmp/text.csv").show()
      ```
      ```
      +-----------+
      |      value|
      +-----------+
      |"a","b","c"|
      +-----------+
      ```
      
      **After**
      
      ```scala
      df.write.option("ignoreLeadingWhiteSpace", false).option("ignoreTrailingWhiteSpace", false).csv("/tmp/text.csv")
      spark.read.text("/tmp/text.csv").show()
      ```
      
      ```
      +----------+
      |     value|
      +----------+
      |a , b  , c|
      +----------+
      ```
      
      Note that this case is possible in R
      
      ```r
      > system("cat text.csv")
      f1,f2,f3
      a , b  , c
      > df <- read.csv(file="text.csv")
      > df
        f1   f2 f3
      1 a   b    c
      > write.csv(df, file="text1.csv", quote=F, row.names=F)
      > system("cat text1.csv")
      f1,f2,f3
      a , b  , c
      ```
      
      ## How was this patch tested?
      
      Unit tests in `CSVSuite` and manual tests for Python.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17310 from HyukjinKwon/SPARK-18579.
  3. Mar 22, 2017
    • [BUILD][MINOR] Fix 2.10 build · 12cd0070
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      https://github.com/apache/spark/pull/17385 breaks the 2.10 sbt/maven builds by hitting an empty-string interpolation bug (https://issues.scala-lang.org/browse/SI-7919).
      
      https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-sbt-scala-2.10/4072/
      https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-scala-2.10/3987/
      
      ## How was this patch tested?
      
      Compiles
      
      Author: Sameer Agarwal <sameerag@cs.berkeley.edu>
      
      Closes #17391 from sameeragarwal/build-fix.
    • [SPARK-20057][SS] Renamed KeyedState to GroupState in mapGroupsWithState · 82b598b9
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Since the state is tied to a "group" in the "mapGroupsWithState" operations, it's better to call the state "GroupState" instead of tying it to a key. This would make it more general if you extend this operation to RelationGroupedDataset and the Python APIs.
      
      ## How was this patch tested?
      Existing unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17385 from tdas/SPARK-20057.
    • [SPARK-20018][SQL] Pivot with timestamp and count should not print internal representation · 80fd0703
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      Currently, when we perform count with timestamp types, it prints the internal representation as the column name as below:
      
      ```scala
      Seq(new java.sql.Timestamp(1)).toDF("a").groupBy("a").pivot("a").count().show()
      ```
      
      ```
      +--------------------+----+
      |                   a|1000|
      +--------------------+----+
      |1969-12-31 16:00:...|   1|
      +--------------------+----+
      ```
      
      This PR proposes to use external Scala value instead of the internal representation in the column names as below:
      
      ```
      +--------------------+-----------------------+
      |                   a|1969-12-31 16:00:00.001|
      +--------------------+-----------------------+
      |1969-12-31 16:00:...|                      1|
      +--------------------+-----------------------+
      ```
      
      ## How was this patch tested?
      
      Unit test in `DataFramePivotSuite` and manual tests.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17348 from HyukjinKwon/SPARK-20018.
    • [SPARK-19949][SQL][FOLLOW-UP] Clean up parse modes and update related comments · 46581838
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to make the `mode` options in both CSV and JSON use `case object`s, and fixes some comments related to the previous fix (see the sketch below).

      Also, this PR modifies some tests related to parse modes.
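
      For illustration, a sketch of the `case object` approach (the names here are illustrative rather than the exact internal classes):

      ```scala
      sealed trait ParseMode { def name: String }
      case object PermissiveMode extends ParseMode { val name = "PERMISSIVE" }
      case object DropMalformedMode extends ParseMode { val name = "DROPMALFORMED" }
      case object FailFastMode extends ParseMode { val name = "FAILFAST" }

      object ParseMode {
        // Case-insensitive lookup, falling back to PERMISSIVE for unknown values.
        def fromString(mode: String): ParseMode = mode.toUpperCase match {
          case PermissiveMode.name    => PermissiveMode
          case DropMalformedMode.name => DropMalformedMode
          case FailFastMode.name      => FailFastMode
          case _                      => PermissiveMode
        }
      }
      ```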
      
      ## How was this patch tested?
      
      Modified unit tests in both `CSVSuite.scala` and `JsonSuite.scala`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17377 from HyukjinKwon/SPARK-19949.
    • [SPARK-20027][DOCS] Compilation fix in java docs. · 0caade63
      Prashant Sharma authored
      ## What changes were proposed in this pull request?
      
      During `build/sbt publish-local`, the build breaks due to javadoc errors. This patch fixes those errors.
      
      ## How was this patch tested?
      
      Tested by running the sbt build.
      
      Author: Prashant Sharma <prashsh1@in.ibm.com>
      
      Closes #17358 from ScrapCodes/docs-fix.
    • [SPARK-20021][PYSPARK] Miss backslash in python code · facfd608
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      Add a backslash for line continuation in the Python code.
      
      ## How was this patch tested?
      
      Jenkins.
      
      Author: uncleGen <hustyugm@gmail.com>
      Author: dylon <hustyugm@gmail.com>
      
      Closes #17352 from uncleGen/python-example-doc.
    • [SPARK-20023][SQL] Output table comment for DESC FORMATTED · 7343a094
      Xiao Li authored
      ### What changes were proposed in this pull request?
      Currently, `DESC FORMATTED` does not output the table comment, unlike `DESC EXTENDED`. This PR fixes that.

      Also correct the following displayed names in `DESC FORMATTED`, to be consistent with `DESC EXTENDED`:
      - `"Create Time:"` -> `"Created:"`
      - `"Last Access Time:"` -> `"Last Access:"`
      
      ### How was this patch tested?
      Added test cases in `describe.sql`
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17381 from gatorsmile/descFormattedTableComment.
  4. Mar 21, 2017
    • [SPARK-19925][SPARKR] Fix SparkR spark.getSparkFiles fails when it was called on executors. · 478fbc86
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      SparkR ```spark.getSparkFiles``` fails when it is called on executors; see details at [SPARK-19925](https://issues.apache.org/jira/browse/SPARK-19925).
      
      ## How was this patch tested?
      Added unit tests, and verified this fix on standalone and YARN clusters.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #17274 from yanboliang/spark-19925.
    • [SPARK-20030][SS] Event-time-based timeout for MapGroupsWithState · c1e87e38
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Adds an event-time-based timeout. The user sets the timeout timestamp directly using `KeyedState.setTimeoutTimestamp`. A key times out when the watermark crosses its timeout timestamp.
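
      A hedged sketch using the user-facing names as they ended up in Spark 2.2 (`GroupState`, `GroupStateTimeout.EventTimeTimeout`); at this commit the class is still called `KeyedState`, and the rate-source plumbing below is only illustrative.

      ```scala
      import java.sql.Timestamp

      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}

      case class Event(user: String, eventTime: Timestamp)

      val spark = SparkSession.builder().appName("EventTimeTimeoutSketch").master("local[*]").getOrCreate()
      import spark.implicits._

      // The rate source stands in for a real event stream; its value/timestamp
      // columns are mapped onto the Event case class.
      val events = spark.readStream.format("rate").load()
        .select(($"value" % 10).cast("string").as("user"), $"timestamp".as("eventTime"))
        .as[Event]
        .withWatermark("eventTime", "10 minutes")

      def countPerUser(user: String, rows: Iterator[Event], state: GroupState[Long]): (String, Long) = {
        if (state.hasTimedOut) {
          // The watermark has crossed the timeout timestamp set below: emit and clear.
          val finalCount = state.get
          state.remove()
          (user, finalCount)
        } else {
          val batch = rows.toSeq
          val newCount = state.getOption.getOrElse(0L) + batch.size
          state.update(newCount)
          // The key times out once the watermark passes max event time + 30 minutes.
          state.setTimeoutTimestamp(batch.map(_.eventTime.getTime).max, "30 minutes")
          (user, newCount)
        }
      }

      val counts = events
        .groupByKey(_.user)
        .mapGroupsWithState(GroupStateTimeout.EventTimeTimeout)(countPerUser)

      counts.writeStream.outputMode("update").format("console").start()
      ```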
      
      ## How was this patch tested?
      Unit tests
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #17361 from tdas/SPARK-20030.
    • [SPARK-20051][SS] Fix StreamSuite flaky test - recover from v2.1 checkpoint · 2d73fcce
      Kunal Khamar authored
      ## What changes were proposed in this pull request?
      
      There is a race condition between calling stop on a streaming query and deleting directories in `withTempDir` that causes the test to fail. The fix is to delete lazily using a JVM shutdown hook.
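
      A minimal sketch of the deferred-deletion idea using a plain JVM shutdown hook (Spark's own `ShutdownHookManager` utility is the more likely home for this in the actual fix; the code below sticks to standard APIs):

      ```scala
      import java.io.File
      import java.nio.file.Files

      def deleteRecursively(f: File): Unit = {
        Option(f.listFiles()).foreach(_.foreach(deleteRecursively))
        f.delete()
      }

      val tempDir: File = Files.createTempDirectory("stream-test-").toFile

      // Defer cleanup to JVM shutdown so that a streaming query that is still
      // stopping cannot race with the directory deletion.
      Runtime.getRuntime.addShutdownHook(new Thread {
        override def run(): Unit = deleteRecursively(tempDir)
      })
      ```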
      
      ## How was this patch tested?
      
      - Unit test
        - repeated 300 runs with no failure
      
      Author: Kunal Khamar <kkhamar@outlook.com>
      
      Closes #17382 from kunalkhamar/partition-bugfix.
    • [SPARK-19919][SQL] Defer throwing the exception for empty paths in CSV datasource into `DataSource` · 9281a3d5
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to defer throwing the exception within `DataSource`.
      
      Currently, if other datasources fail to infer the schema, they return `None` and this is then validated in `DataSource` as below:
      
      ```
      scala> spark.read.json("emptydir")
      org.apache.spark.sql.AnalysisException: Unable to infer schema for JSON. It must be specified manually.;
      ```
      
      ```
      scala> spark.read.orc("emptydir")
      org.apache.spark.sql.AnalysisException: Unable to infer schema for ORC. It must be specified manually.;
      ```
      
      ```
      scala> spark.read.parquet("emptydir")
      org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
      ```
      
      However, CSV checks this within the datasource implementation and throws a different exception message, as below:
      
      ```
      scala> spark.read.csv("emptydir")
      java.lang.IllegalArgumentException: requirement failed: Cannot infer schema from an empty set of files
      ```
      
      We could remove this duplicated check and validate this in one place in the same way with the same message.
      
      ## How was this patch tested?
      
      Unit test in `CSVSuite` and manual test.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #17256 from HyukjinKwon/SPARK-19919.
    • clarify array_contains function description · a04dcde8
      Will Manning authored
      ## What changes were proposed in this pull request?
      
      The description in the comment for array_contains is vague/incomplete (i.e., doesn't mention that it returns `null` if the array is `null`); this PR fixes that.
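
      A small illustration of the documented behavior (this is the existing semantics of the function, unchanged by this PR):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.{array_contains, col}

      val spark = SparkSession.builder().appName("ArrayContainsSketch").master("local[*]").getOrCreate()

      // One row with a real array, one row where the array itself is NULL.
      val df = spark.sql("SELECT array(1, 2, 3) AS arr UNION ALL SELECT CAST(NULL AS array<int>) AS arr")

      // true for the first row, and NULL (not false) for the NULL array.
      df.select(col("arr"), array_contains(col("arr"), 2).as("contains_2")).show()
      ```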
      
      ## How was this patch tested?
      
      No testing, since it merely changes a comment.
      
      Please review http://spark.apache.org/contributing.html before opening a pull request.
      
      Author: Will Manning <lwwmanning@gmail.com>
      
      Closes #17380 from lwwmanning/patch-1.
    • [SPARK-19237][SPARKR][CORE] On Windows spark-submit should handle when java is not installed · a8877bdb
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      When SparkR is installed as a R package there might not be any java runtime.
      If it is not there, SparkR's `sparkR.session()` will block waiting for the connection timeout, hanging the R IDE/shell without any notification or message.
      
      ## How was this patch tested?
      
      manually
      
      - [x] need to test on Windows
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16596 from felixcheung/rcheckjava.
    • [SPARK-20017][SQL] change the nullability of function 'StringToMap' from 'false' to 'true' · 7dbc162f
      zhaorongsheng authored
      ## What changes were proposed in this pull request?
      
      Change the nullability of function `StringToMap` from `false` to `true`.
      
      Author: zhaorongsheng <334362872@qq.com>
      
      Closes #17350 from zhaorongsheng/bug-fix_strToMap_NPE.
    • [SPARK-20039][ML] rename ChiSquare to ChiSquareTest · ae4b91d1
      Joseph K. Bradley authored
      ## What changes were proposed in this pull request?
      
      I realized that since ChiSquare is in the package stat, it's pretty unclear if it's the hypothesis test, distribution, or what. This PR renames it to ChiSquareTest to clarify this.
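
      A usage sketch against the renamed API (mirroring the `ml.stat` examples in the Spark 2.2 docs; the column names and data are placeholders):

      ```scala
      import org.apache.spark.ml.linalg.Vectors
      import org.apache.spark.ml.stat.ChiSquareTest
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("ChiSquareTestSketch").master("local[*]").getOrCreate()
      import spark.implicits._

      val data = Seq(
        (0.0, Vectors.dense(0.5, 10.0)),
        (0.0, Vectors.dense(1.5, 20.0)),
        (1.0, Vectors.dense(1.5, 30.0)),
        (0.0, Vectors.dense(3.5, 30.0)),
        (1.0, Vectors.dense(3.5, 40.0))
      ).toDF("label", "features")

      // One row holding pValues, degreesOfFreedom and statistics for each feature vs. the label.
      ChiSquareTest.test(data, "features", "label").show(truncate = false)
      ```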
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #17368 from jkbradley/SPARK-20039.
    • [SPARK-19261][SQL] Alter add columns for Hive serde and some datasource tables · 4c0ff5f5
      Xin Wu authored
      ## What changes were proposed in this pull request?
      Support `ALTER TABLE ADD COLUMNS (...)` syntax for Hive serde and some datasource tables (see the example after the list below).
      In this PR, we consider a few aspects:
      
      1. View is not supported for `ALTER ADD COLUMNS`
      
      2. Since tables created in SparkSQL with Hive DDL syntax populate table properties with schema information, we need to make sure the schema is consistent before and after the ALTER operation for future use.
      
      3. For embedded-schema formats, such as `parquet`, we need to make sure that predicates on the newly-added columns can be evaluated or pushed down properly. If a data file does not contain the newly-added columns, such predicates should behave as if the column values are NULL.
      
      4. For datasource tables, this feature does not support the following:
      4.1 TEXT format, since only one default column `value` is inferred for text format data.
      4.2 ORC format, since SparkSQL native ORC reader does not support the difference between user-specified-schema and inferred schema from ORC files.
      4.3 Third-party datasource types that implement RelationProvider, including the built-in JDBC format, since different vendor implementations may have different ways of dealing with the schema.
      4.4 Other datasource types, such as `parquet`, `json`, `csv`, `hive` are supported.
      
      5. Column names being added cannot duplicate any existing data column or partition column names. Case sensitivity is taken into consideration according to the SQL configuration.
      
      6. This feature also supports In-Memory catalog, while Hive support is turned off.
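
      A small example of the syntax on a datasource (parquet) table with the in-memory catalog; the table and column names are placeholders:

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("AlterAddColumnsSketch").master("local[*]").getOrCreate()

      spark.sql("CREATE TABLE events (id INT, name STRING) USING parquet")
      spark.sql("INSERT INTO events VALUES (1, 'a')")

      // Add columns after the fact; rows in existing files return NULL for the new columns.
      spark.sql("ALTER TABLE events ADD COLUMNS (ts TIMESTAMP, country STRING)")
      spark.sql("SELECT id, name, ts, country FROM events").show()
      ```
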
      ## How was this patch tested?
      Add new test cases
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #16626 from xwu0226/alter_add_columns.
    • [SPARK-20041][DOC] Update docs for NaN handling in approxQuantile · 63f077fb
      Zheng RuiFeng authored
      ## What changes were proposed in this pull request?
      Update docs for NaN handling in approxQuantile.
      
      ## How was this patch tested?
      existing tests.
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #17369 from zhengruifeng/doc_quantiles_nan.
    • [SPARK-17080][SQL][FOLLOWUP] Improve documentation, change buildJoin method... · 14865d7f
      wangzhenhua authored
      [SPARK-17080][SQL][FOLLOWUP] Improve documentation, change buildJoin method structure and add a debug log
      
      ## What changes were proposed in this pull request?
      
      1. Improve documentation for class `Cost` and `JoinReorderDP` and method `buildJoin()`.
      2. Change code structure of `buildJoin()` to make the logic clearer.
      3. Add a debug-level log to record information for join reordering, including time cost, the number of items and the number of plans in memo.
      
      ## How was this patch tested?
      
      Not related.
      
      Author: wangzhenhua <wangzhenhua@huawei.com>
      
      Closes #17353 from wzhfy/reorderFollow.
    • [SPARK-19998][BLOCK MANAGER] Change the exception log to add RDD id of the related the block · 650d03cf
      jianran.tfh authored
      ## What changes were proposed in this pull request?

      The exception "java.lang.Exception: Could not compute split, block $blockId not found" does not include the RDD id, while the log message "BlockManager: Removing RDD $id" has only the RDD id, so it is hard to tell that the block was missing because the RDD had been removed. It is better for the block-not-found exception to also include the RDD id.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: jianran.tfh <jianran.tfh@taobao.com>
      Author: jianran <tanfanhua1984@163.com>
      
      Closes #17334 from jianran/SPARK-19998.
    • [SPARK-20011][ML][DOCS] Clarify documentation for ALS 'rank' parameter · 7620aed8
      christopher snow authored
      ## What changes were proposed in this pull request?
      
      API documentation and collaborative filtering documentation page changes to clarify inconsistent description of ALS rank parameter.
      
       - [DOCS] was previously: "rank is the number of latent factors in the model."
       - [API] was previously:  "rank - number of features to use"
      
      This change describes rank in both places consistently as:
      
       - "Number of features to use (also referred to as the number of latent factors)"
      
      Author: Chris Snow <chris.snowuk.ibm.com>
      
      Author: christopher snow <chsnow123@gmail.com>
      
      Closes #17345 from snowch/SPARK-20011.
    • [SPARK-20024][SQL][TEST-MAVEN] SessionCatalog reset need to set the current... · d2dcd679
      Xiao Li authored
      [SPARK-20024][SQL][TEST-MAVEN] SessionCatalog reset need to set the current database of ExternalCatalog
      
      ### What changes were proposed in this pull request?
      The SessionCatalog API setCurrentDatabase does not set the current database of the underlying ExternalCatalog. Thus, weird errors could appear in the test suites after we call reset. We need to fix it.

      So far, we have not found a direct impact on the other code paths, because we expect all the SessionCatalog APIs to always use the current database value we manage, unless some code paths skip it. Thus, we fix it in the test-only function reset().
      
      ### How was this patch tested?
      Multiple test case failures were observed in the mvn build; also added a test case in SessionCatalogSuite.
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #17354 from gatorsmile/useDB.
  5. Mar 20, 2017
    • [SPARK-19949][SQL] unify bad record handling in CSV and JSON · 68d65fae
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Currently JSON and CSV have exactly the same logic for handling bad records; this PR tries to abstract it and put it in an upper level to reduce code duplication.
      
      The overall idea is that we make the JSON and CSV parsers throw a BadRecordException, and then the upper level, FailureSafeParser, handles bad records according to the parse mode (see the sketch below).
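
      For intuition, a hedged sketch of that layering (class and parameter names below are illustrative, not the internal API):

      ```scala
      // The format-specific parser signals a bad record by throwing, and a shared
      // failure-safe wrapper reacts according to the parse mode.
      case class BadRecordException(rawRecord: String, cause: Throwable) extends Exception(cause)

      class FailureSafeParserSketch[T](
          parse: String => T,               // JSON- or CSV-specific parsing logic
          mode: String,                     // "PERMISSIVE", "DROPMALFORMED" or "FAILFAST"
          keepRawRecord: String => T) {     // e.g. a row carrying the raw string in a special column

        def parseRecord(record: String): Option[T] =
          try Some(parse(record)) catch {
            case BadRecordException(raw, cause) => mode match {
              case "PERMISSIVE"    => Some(keepRawRecord(raw))  // keep the raw record
              case "DROPMALFORMED" => None                      // drop the record
              case "FAILFAST"      => throw new RuntimeException(s"Malformed record: $raw", cause)
              case other           => throw new IllegalArgumentException(s"Unknown parse mode: $other")
            }
          }
      }
      ```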
      
      Behavior changes:
      1. With PERMISSIVE mode, if the number of tokens doesn't match the schema, the CSV parser previously treated the record as legal and parsed as many tokens as possible. After this PR, we treat it as an illegal record and put the raw record string in a special column, but we still parse as many tokens as possible.
      2. all logging is removed as they are not very useful in practice.
      
      ## How was this patch tested?
      
      existing tests
      
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Wenchen Fan <cloud0fan@gmail.com>
      
      Closes #17315 from cloud-fan/bad-record2.
    • [SPARK-19912][SQL] String literals should be escaped for Hive metastore partition pruning · 21e366ae
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since the current `HiveShim`'s `convertFilters` does not escape string literals, the following correctness issues exist. This PR aims to return the correct result and also show a clearer exception message.
      
      **BEFORE**
      
      ```scala
      scala> Seq((1, "p1", "q1"), (2, "p1\" and q=\"q1", "q2")).toDF("a", "p", "q").write.partitionBy("p", "q").saveAsTable("t1")
      
      scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
      +---+
      |  a|
      +---+
      +---+
      
      scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
      java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from ...
      ```
      
      **AFTER**
      
      ```scala
      scala> spark.table("t1").filter($"p" === "p1\" and q=\"q1").select($"a").show
      +---+
      |  a|
      +---+
      |  2|
      +---+
      
      scala> spark.table("t1").filter($"p" === "'\"").select($"a").show
      java.lang.UnsupportedOperationException: Partition filter cannot have both `"` and `'` characters
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins test with new test cases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #17266 from dongjoon-hyun/SPARK-19912.
    • [SPARK-17204][CORE] Fix replicated off heap storage · 7fa116f8
      Michael Allman authored
      (Jira: https://issues.apache.org/jira/browse/SPARK-17204)
      
      ## What changes were proposed in this pull request?
      
      There are a couple of bugs in the `BlockManager` with respect to support for replicated off-heap storage. First, the locally-stored off-heap byte buffer is disposed of when it is replicated. It should not be. Second, the replica byte buffers are stored as heap byte buffers instead of direct byte buffers even when the storage level memory mode is off-heap. This PR addresses both of these problems.
      
      ## How was this patch tested?
      
      `BlockManagerReplicationSuite` was enhanced to fill in the coverage gaps. It now fails if either of the bugs in this PR exist.
      
      Author: Michael Allman <michael@videoamp.com>
      
      Closes #16499 from mallman/spark-17204-replicated_off_heap_storage.