Skip to content
Snippets Groups Projects
  1. Nov 23, 2016
    • hyukjinkwon's avatar
      [SPARK-18179][SQL] Throws analysis exception with a proper message for... · fabb5aea
      hyukjinkwon authored
      [SPARK-18179][SQL] Throws analysis exception with a proper message for unsupported argument types in reflect/java_method function
      
      ## What changes were proposed in this pull request?
      
      This PR proposes throwing an `AnalysisException` with a proper message rather than `NoSuchElementException` with the message ` key not found: TimestampType` when unsupported types are given to `reflect` and `java_method` functions.
      
      ```scala
      spark.range(1).selectExpr("reflect('java.lang.String', 'valueOf', cast('1990-01-01' as timestamp))")
      ```
      
      produces
      
      **Before**
      
      ```
      java.util.NoSuchElementException: key not found: TimestampType
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:59)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:59)
        at org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection$$anonfun$findMethod$1$$anonfun$apply$1.apply(CallMethodViaReflection.scala:159)
      ...
      ```
      
      **After**
      
      ```
      cannot resolve 'reflect('java.lang.String', 'valueOf', CAST('1990-01-01' AS TIMESTAMP))' due to data type mismatch: arguments from the third require boolean, byte, short, integer, long, float, double or string expressions; line 1 pos 0;
      'Project [unresolvedalias(reflect(java.lang.String, valueOf, cast(1990-01-01 as timestamp)), Some(<function1>))]
      +- Range (0, 1, step=1, splits=Some(2))
      ...
      ```
      
      Added message is,
      
      ```
      arguments from the third require boolean, byte, short, integer, long, float, double or string expressions
      ```
      
      ## How was this patch tested?
      
      Tests added in `CallMethodViaReflection`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15694 from HyukjinKwon/SPARK-18179.
      
      (cherry picked from commit 2559fb4b)
      Signed-off-by: default avatarReynold Xin <rxin@databricks.com>
      fabb5aea
  2. Nov 22, 2016
    • Yanbo Liang's avatar
      [SPARK-18501][ML][SPARKR] Fix spark.glm errors when fitting on collinear data · fc5fee83
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      * Fix SparkR ```spark.glm``` errors when fitting on collinear data, since ```standard error of coefficients, t value and p value``` are not available in this condition.
      * Scala/Python GLM summary should throw exception if users get ```standard error of coefficients, t value and p value``` but the underlying WLS was solved by local "l-bfgs".
      
      ## How was this patch tested?
      Add unit tests.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15930 from yanboliang/spark-18501.
      
      (cherry picked from commit 982b82e3)
      Signed-off-by: default avatarYanbo Liang <ybliang8@gmail.com>
      fc5fee83
    • Shixiong Zhu's avatar
      [SPARK-18530][SS][KAFKA] Change Kafka timestamp column type to TimestampType · 3be2d1e0
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      Changed Kafka timestamp column type to TimestampType.
      
      ## How was this patch tested?
      
      `test("Kafka column types")`.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15969 from zsxwing/SPARK-18530.
      
      (cherry picked from commit d0212eb0)
      Signed-off-by: default avatarShixiong Zhu <shixiong@databricks.com>
      3be2d1e0
    • Dilip Biswal's avatar
      [SPARK-18533] Raise correct error upon specification of schema for datasource... · 4b96ffb1
      Dilip Biswal authored
      [SPARK-18533] Raise correct error upon specification of schema for datasource tables created using CTAS
      
      ## What changes were proposed in this pull request?
      Fixes the inconsistency of error raised between data source and hive serde
      tables when schema is specified in CTAS scenario. In the process the grammar for
      create table (datasource) is simplified.
      
      **before:**
      ``` SQL
      spark-sql> create table t2 (c1 int, c2 int) using parquet as select * from t1;
      Error in query:
      mismatched input 'as' expecting {<EOF>, '.', 'OPTIONS', 'CLUSTERED', 'PARTITIONED'}(line 1, pos 64)
      
      == SQL ==
      create table t2 (c1 int, c2 int) using parquet as select * from t1
      ----------------------------------------------------------------^^^
      ```
      
      **After:**
      ```SQL
      spark-sql> create table t2 (c1 int, c2 int) using parquet as select * from t1
               > ;
      Error in query:
      Operation not allowed: Schema may not be specified in a Create Table As Select (CTAS) statement(line 1, pos 0)
      
      == SQL ==
      create table t2 (c1 int, c2 int) using parquet as select * from t1
      ^^^
      ```
      ## How was this patch tested?
      Added a new test in CreateTableAsSelectSuite
      
      Author: Dilip Biswal <dbiswal@us.ibm.com>
      
      Closes #15968 from dilipbiswal/ctas.
      
      (cherry picked from commit 39a1d306)
      Signed-off-by: default avatargatorsmile <gatorsmile@gmail.com>
      4b96ffb1
    • gatorsmile's avatar
      [SPARK-16803][SQL] SaveAsTable does not work when target table is a Hive serde table · 64b9de9c
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      
      In Spark 2.0, `SaveAsTable` does not work when the target table is a Hive serde table, but Spark 1.6 works.
      
      **Spark 1.6**
      
      ``` Scala
      scala> sql("create table sample.sample stored as SEQUENCEFILE as select 1 as key, 'abc' as value")
      res2: org.apache.spark.sql.DataFrame = []
      
      scala> val df = sql("select key, value as value from sample.sample")
      df: org.apache.spark.sql.DataFrame = [key: int, value: string]
      
      scala> df.write.mode("append").saveAsTable("sample.sample")
      
      scala> sql("select * from sample.sample").show()
      +---+-----+
      |key|value|
      +---+-----+
      |  1|  abc|
      |  1|  abc|
      +---+-----+
      ```
      
      **Spark 2.0**
      
      ``` Scala
      scala> df.write.mode("append").saveAsTable("sample.sample")
      org.apache.spark.sql.AnalysisException: Saving data in MetastoreRelation sample, sample
       is not supported.;
      ```
      
      So far, we do not plan to support it in Spark 2.1 due to the risk. Spark 1.6 works because it internally uses insertInto. But, if we change it back it will break the semantic of saveAsTable (this method uses by-name resolution instead of using by-position resolution used by insertInto). More extra changes are needed to support `hive` as a `format` in DataFrameWriter.
      
      Instead, users should use insertInto API. This PR corrects the error messages. Users can understand how to bypass it before we support it in a separate PR.
      ### How was this patch tested?
      
      Test cases are added
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #15926 from gatorsmile/saveAsTableFix5.
      
      (cherry picked from commit 9c42d4a7)
      Signed-off-by: default avatargatorsmile <gatorsmile@gmail.com>
      64b9de9c
    • Shixiong Zhu's avatar
      [SPARK-18373][SPARK-18529][SS][KAFKA] Make failOnDataLoss=false work with Spark jobs · bd338f60
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This PR adds `CachedKafkaConsumer.getAndIgnoreLostData` to handle corner cases of `failOnDataLoss=false`.
      
      It also resolves [SPARK-18529](https://issues.apache.org/jira/browse/SPARK-18529
      
      ) after refactoring codes: Timeout will throw a TimeoutException.
      
      ## How was this patch tested?
      
      Because I cannot find any way to manually control the Kafka server to clean up logs, it's impossible to write unit tests for each corner case. Therefore, I just created `test("stress test for failOnDataLoss=false")` which should cover most of corner cases.
      
      I also modified some existing tests to test for both `failOnDataLoss=false` and `failOnDataLoss=true` to make sure it doesn't break existing logic.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15820 from zsxwing/failOnDataLoss.
      
      (cherry picked from commit 2fd101b2)
      Signed-off-by: default avatarTathagata Das <tathagata.das1565@gmail.com>
      bd338f60
    • Burak Yavuz's avatar
      [SPARK-18465] Add 'IF EXISTS' clause to 'UNCACHE' to not throw exceptions when table doesn't exist · fb2ea54a
      Burak Yavuz authored
      
      ## What changes were proposed in this pull request?
      
      While this behavior is debatable, consider the following use case:
      ```sql
      UNCACHE TABLE foo;
      CACHE TABLE foo AS
      SELECT * FROM bar
      ```
      The command above fails the first time you run it. But I want to run the command above over and over again, and I don't want to change my code just for the first run of it.
      The issue is that subsequent `CACHE TABLE` commands do not overwrite the existing table.
      
      Now we can do:
      ```sql
      UNCACHE TABLE IF EXISTS foo;
      CACHE TABLE foo AS
      SELECT * FROM bar
      ```
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #15896 from brkyvz/uncache.
      
      (cherry picked from commit bdc8153e)
      Signed-off-by: default avatarHerman van Hovell <hvanhovell@databricks.com>
      fb2ea54a
    • Wenchen Fan's avatar
      [SPARK-18507][SQL] HiveExternalCatalog.listPartitions should only call getTable once · fa360134
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      HiveExternalCatalog.listPartitions should only call `getTable` once, instead of calling it for every partitions.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15978 from cloud-fan/perf.
      
      (cherry picked from commit 702cd403)
      Signed-off-by: default avatarAndrew Or <andrewor14@gmail.com>
      fa360134
    • Nattavut Sutyanyong's avatar
      [SPARK-18504][SQL] Scalar subquery with extra group by columns returning incorrect result · 0e624e99
      Nattavut Sutyanyong authored
      
      ## What changes were proposed in this pull request?
      
      This PR blocks an incorrect result scenario in scalar subquery where there are GROUP BY column(s)
      that are not part of the correlated predicate(s).
      
      Example:
      // Incorrect result
      Seq(1).toDF("c1").createOrReplaceTempView("t1")
      Seq((1,1),(1,2)).toDF("c1","c2").createOrReplaceTempView("t2")
      sql("select (select sum(-1) from t2 where t1.c1=t2.c1 group by t2.c2) from t1").show
      
      // How can selecting a scalar subquery from a 1-row table return 2 rows?
      
      ## How was this patch tested?
      sql/test, catalyst/test
      new test case covering the reported problem is added to SubquerySuite.scala
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #15936 from nsyca/scalarSubqueryIncorrect-1.
      
      (cherry picked from commit 45ea46b7)
      Signed-off-by: default avatarHerman van Hovell <hvanhovell@databricks.com>
      0e624e99
    • Wenchen Fan's avatar
      [SPARK-18519][SQL] map type can not be used in EqualTo · 0e60e4b8
      Wenchen Fan authored
      
      ## What changes were proposed in this pull request?
      
      Technically map type is not orderable, but can be used in equality comparison. However, due to the limitation of the current implementation, map type can't be used in equality comparison so that it can't be join key or grouping key.
      
      This PR makes this limitation explicit, to avoid wrong result.
      
      ## How was this patch tested?
      
      updated tests.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15956 from cloud-fan/map-type.
      
      (cherry picked from commit bb152cdf)
      Signed-off-by: default avatarHerman van Hovell <hvanhovell@databricks.com>
      0e60e4b8
    • hyukjinkwon's avatar
      [SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across... · 36cd10d1
      hyukjinkwon authored
      [SPARK-18447][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across Python API documentation
      
      ## What changes were proposed in this pull request?
      
      It seems in Python, there are
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `.. note::`
      
      This PR proposes to fix those to `.. note::` to be consistent.
      
      **Before**
      
      <img width="567" alt="2016-11-21 1 18 49" src="https://cloud.githubusercontent.com/assets/6477701/20464305/85144c86-af88-11e6-8ee9-90f584dd856c.png">
      
      <img width="617" alt="2016-11-21 12 42 43" src="https://cloud.githubusercontent.com/assets/6477701/20464263/27be5022-af88-11e6-8577-4bbca7cdf36c.png">
      
      **After**
      
      <img width="554" alt="2016-11-21 1 18 42" src="https://cloud.githubusercontent.com/assets/6477701/20464306/8fe48932-af88-11e6-83e1-fc3cbf74407d.png">
      
      <img width="628" alt="2016-11-21 12 42 51" src="https://cloud.githubusercontent.com/assets/6477701/20464264/2d3e156e-af88-11e6-93f3-cab8d8d02983.png
      
      ">
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "Note: " .
      grep -r "NOTE: " .
      grep -r "Note that " .
      ```
      
      And then fixed one by one comparing with API documentation.
      
      After that, manually tested via `make html` under `./python/docs`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15947 from HyukjinKwon/SPARK-18447.
      
      (cherry picked from commit 933a6548)
      Signed-off-by: default avatarSean Owen <sowen@cloudera.com>
      36cd10d1
    • hyukjinkwon's avatar
      [SPARK-18514][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that` across R API documentation · 63aa01ff
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      It seems in R, there are
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      
      This PR proposes to fix those to `Note:` to be consistent.
      
      **Before**
      
      ![2016-11-21 11 30 07](https://cloud.githubusercontent.com/assets/6477701/20468848/2f27b0fa-afde-11e6-89e3-993701269dbe.png)
      
      **After**
      
      ![2016-11-21 11 29 44](https://cloud.githubusercontent.com/assets/6477701/20468851/39469664-afde-11e6-9929-ad80be7fc405.png
      
      )
      
      ## How was this patch tested?
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " .
      grep -r "Note that " .
      ```
      
      And then fixed one by one comparing with API documentation.
      
      After that, manually tested via `sh create-docs.sh` under `./R`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15952 from HyukjinKwon/SPARK-18514.
      
      (cherry picked from commit 4922f9cd)
      Signed-off-by: default avatarSean Owen <sowen@cloudera.com>
      63aa01ff
    • Yanbo Liang's avatar
      [SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download Spark package. · c7021407
      Yanbo Liang authored
      
      ## What changes were proposed in this pull request?
      When running SparkR job in yarn-cluster mode, it will download Spark package from apache website which is not necessary.
      ```
      ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
      ```
      The following is output:
      ```
      Attaching package: ‘SparkR’
      
      The following objects are masked from ‘package:stats’:
      
          cov, filter, lag, na.omit, predict, sd, var, window
      
      The following objects are masked from ‘package:base’:
      
          as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
          rank, rbind, sample, startsWith, subset, summary, transform, union
      
      Spark not found in SPARK_HOME:
      Spark not found in the cache directory. Installation will start.
      MirrorUrl not provided.
      Looking for preferred site from apache website...
      ......
      ```
      There's no ```SPARK_HOME``` in yarn-cluster mode since the R process is in a remote host of the yarn cluster rather than in the client host. The JVM comes up first and the R process then connects to it. So in such cases we should never have to download Spark as Spark is already running.
      
      ## How was this patch tested?
      Offline test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #15888 from yanboliang/spark-18444.
      
      (cherry picked from commit acb97157)
      Signed-off-by: default avatarYanbo Liang <ybliang8@gmail.com>
      c7021407
  3. Nov 21, 2016
  4. Nov 20, 2016
    • Takuya UESHIN's avatar
      [SPARK-18467][SQL] Extracts method for preparing arguments from StaticInvoke,... · fb4e6359
      Takuya UESHIN authored
      [SPARK-18467][SQL] Extracts method for preparing arguments from StaticInvoke, Invoke and NewInstance and modify to short circuit if arguments have null when `needNullCheck == true`.
      
      ## What changes were proposed in this pull request?
      
      This pr extracts method for preparing arguments from `StaticInvoke`, `Invoke` and `NewInstance` and modify to short circuit if arguments have `null` when `propageteNull == true`.
      
      The steps are as follows:
      
      1. Introduce `InvokeLike` to extract common logic from `StaticInvoke`, `Invoke` and `NewInstance` to prepare arguments.
      `StaticInvoke` and `Invoke` had a risk to exceed 64kb JVM limit to prepare arguments but after this patch they can handle them because they share the preparing code of NewInstance, which handles the limit well.
      
      2. Remove unneeded null checking and fix nullability of `NewInstance`.
      Avoid some of nullabilty checking which are not needed because the expression is not nullable.
      
      3. Modify to short circuit if arguments have `null` when `needNullCheck == true`.
      If `needNullCheck == true`, preparing arguments can be skipped if we found one of them is `null`, so modified to short circuit in the case.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #15901 from ueshin/issues/SPARK-18467.
      
      (cherry picked from commit 65854797)
      Signed-off-by: default avatarWenchen Fan <wenchen@databricks.com>
      fb4e6359
    • Reynold Xin's avatar
      [HOTFIX][SQL] Fix DDLSuite failure. · f8662db7
      Reynold Xin authored
      
      (cherry picked from commit b625a36e)
      Signed-off-by: default avatarReynold Xin <rxin@databricks.com>
      f8662db7
    • Herman van Hovell's avatar
      [SPARK-17732][SQL] Revert ALTER TABLE DROP PARTITION should support comparators · cffaf503
      Herman van Hovell authored
      This reverts commit 1126c319.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #15948 from hvanhovell/SPARK-17732.
      cffaf503
    • hyukjinkwon's avatar
      [SPARK-3359][BUILD][DOCS] Print examples and disable group and tparam tags in javadoc · bc3e7b3b
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes/fixes two things.
      
      - Remove many errors to generate javadoc with Java8 from unrecognisable tags, `tparam` and `group`.
      
        ```
        [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:18: error: unknown tag: group
        [error]   /** group setParam */
        [error]       ^
        [error] .../spark/mllib/target/java/org/apache/spark/ml/classification/Classifier.java:8: error: unknown tag: tparam
        [error]  * tparam FeaturesType  Type of input features.  E.g., <code>Vector</code>
        [error]    ^
        ...
        ```
      
        It does not fully resolve the problem but remove many errors. It seems both `group` and `tparam` are unrecognisable in javadoc. It seems we can't print them pretty in javadoc in a way of `example` here because they appear differently (both examples can be found in http://spark.apache.org/docs/2.0.2/api/scala/index.html#org.apache.spark.ml.classification.Classifier).
      
      - Print `example` in javadoc.
        Currently, there are few `example` tag in several places.
      
        ```
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This operation might be used to evaluate a graph
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We might use this operation to change the vertex values
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function might be used to initialize edge
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example We can use this function to compute the in-degree of each
        ./graphx/src/main/scala/org/apache/spark/graphx/Graph.scala:   * example This function is used to update the vertices with new values based on external data.
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphLoader.scala:   * example Loads a file in the following format:
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function is used to update the vertices with new
        ./graphx/src/main/scala/org/apache/spark/graphx/GraphOps.scala:   * example This function can be used to filter the graph based on some property, without
        ./graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala: * example We can use the Pregel abstraction to implement PageRank:
        ./graphx/src/main/scala/org/apache/spark/graphx/VertexRDD.scala: * example Construct a `VertexRDD` from a plain RDD:
        ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkCommandLine.scala: * example new SparkCommandLine(Nil).settings
        ./repl/scala-2.10/src/main/scala/org/apache/spark/repl/SparkIMain.scala:   * example addImports("org.apache.spark.SparkContext")
        ./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/LiteralGenerator.scala: * example {{{
        ```
      
      **Before**
      
        <img width="505" alt="2016-11-20 2 43 23" src="https://cloud.githubusercontent.com/assets/6477701/20457285/26f07e1c-aecb-11e6-9ae9-d9dee66845f4.png">
      
      **After**
        <img width="499" alt="2016-11-20 1 27 17" src="https://cloud.githubusercontent.com/assets/6477701/20457240/409124e4-aeca-11e6-9a91-0ba514148b52.png
      
      ">
      
      ## How was this patch tested?
      
      Maunally tested by `jekyll build` with Java 7 and 8
      
      ```
      java version "1.7.0_80"
      Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
      Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)
      ```
      
      ```
      java version "1.8.0_45"
      Java(TM) SE Runtime Environment (build 1.8.0_45-b14)
      Java HotSpot(TM) 64-Bit Server VM (build 25.45-b02, mixed mode)
      ```
      
      Note: this does not make sbt unidoc suceed with Java 8 yet but it reduces the number of errors with Java 8.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15939 from HyukjinKwon/SPARK-3359-javadoc.
      
      (cherry picked from commit c528812c)
      Signed-off-by: default avatarSean Owen <sowen@cloudera.com>
      bc3e7b3b
  5. Nov 19, 2016
    • Reynold Xin's avatar
      [SQL] Fix documentation for Concat and ConcatWs · 063da0c8
      Reynold Xin authored
      
      (cherry picked from commit a64f25d8)
      Signed-off-by: default avatarReynold Xin <rxin@databricks.com>
      063da0c8
    • Reynold Xin's avatar
      [SPARK-18508][SQL] Fix documentation error for DateDiff · 94a9eed1
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      The previous documentation and example for DateDiff was wrong.
      
      ## How was this patch tested?
      Doc only change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15937 from rxin/datediff-doc.
      
      (cherry picked from commit bce9a036)
      Signed-off-by: default avatarReynold Xin <rxin@databricks.com>
      94a9eed1
    • Kazuaki Ishizaki's avatar
      [SPARK-18458][CORE] Fix signed integer overflow problem at an expression in RadixSort.java · b0b2f108
      Kazuaki Ishizaki authored
      
      ## What changes were proposed in this pull request?
      
      This PR avoids that a result of an expression is negative due to signed integer overflow (e.g. 0x10?????? * 8 < 0). This PR casts each operand to `long` before executing a calculation. Since the result is interpreted as long, the result of the expression is positive.
      
      ## How was this patch tested?
      
      Manually executed query82 of TPC-DS with 100TB
      
      Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
      
      Closes #15907 from kiszk/SPARK-18458.
      
      (cherry picked from commit d93b6552)
      Signed-off-by: default avatarReynold Xin <rxin@databricks.com>
      b0b2f108
    • sethah's avatar
      [SPARK-18456][ML][FOLLOWUP] Use matrix abstraction for coefficients in LogisticRegression training · 15eb86c2
      sethah authored
      ## What changes were proposed in this pull request?
      
      This is a follow up to some of the discussion [here](https://github.com/apache/spark/pull/15593
      
      ). During LogisticRegression training, we store the coefficients combined with intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use matrix where possible, which makes the code more readable and greatly simplifies the indexing.
      
      Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit `MutableInnerProductModule[DenseMatrix[Double], Double]` which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the `regParamL1Fun` because OWLQN in Breeze requires a `MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]]` (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark.
      
      ## How was this patch tested?
      
      This is internal code refactoring - the current unit tests passing show us that the change did not break anything. No added functionality in this patch.
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #15893 from sethah/logreg_refactor.
      
      (cherry picked from commit 856e0042)
      Signed-off-by: default avatarDB Tsai <dbtsai@dbtsai.com>
      15eb86c2
    • Sean Owen's avatar
      [SPARK-18448][CORE] Fix @since 2.1.0 on new SparkSession.close() method · 15ad3a31
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Fix since 2.1.0 on new SparkSession.close() method. I goofed in https://github.com/apache/spark/pull/15932
      
       because it was back-ported to 2.1 instead of just master as originally planned.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15938 from srowen/SPARK-18448.2.
      
      (cherry picked from commit ded5fefb)
      Signed-off-by: default avatarSean Owen <sowen@cloudera.com>
      15ad3a31
    • Sean Owen's avatar
      [SPARK-18353][CORE] spark.rpc.askTimeout defalut value is not 120s · 30a6fbbb
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Avoid hard-coding spark.rpc.askTimeout to non-default in Client; fix doc about spark.rpc.askTimeout default
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15833 from srowen/SPARK-18353.
      
      (cherry picked from commit 8b1e1088)
      Signed-off-by: default avatarSean Owen <sowen@cloudera.com>
      30a6fbbb
    • hyukjinkwon's avatar
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note... · 4b396a65
      hyukjinkwon authored
      [SPARK-18445][BUILD][DOCS] Fix the markdown for `Note:`/`NOTE:`/`Note that`/`'''Note:'''` across Scala/Java API documentation
      
      It seems in Scala/Java,
      
      - `Note:`
      - `NOTE:`
      - `Note that`
      - `'''Note:'''`
      - `note`
      
      This PR proposes to fix those to `note` to be consistent.
      
      **Before**
      
      - Scala
        ![2016-11-17 6 16 39](https://cloud.githubusercontent.com/assets/6477701/20383180/1a7aed8c-acf2-11e6-9611-5eaf6d52c2e0.png)
      
      - Java
        ![2016-11-17 6 14 41](https://cloud.githubusercontent.com/assets/6477701/20383096/c8ffc680-acf1-11e6-914a-33460bf1401d.png)
      
      **After**
      
      - Scala
        ![2016-11-17 6 16 44](https://cloud.githubusercontent.com/assets/6477701/20383167/09940490-acf2-11e6-937a-0d5e1dc2cadf.png)
      
      - Java
        ![2016-11-17 6 13 39](https://cloud.githubusercontent.com/assets/6477701/20383132/e7c2a57e-acf1-11e6-9c47-b849674d4d88.png
      
      )
      
      The notes were found via
      
      ```bash
      grep -r "NOTE: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// NOTE: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \ # note that this is a regular expression. So actual matches were mostly `org/apache/spark/api/java/functions ...`
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note that " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note that " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "Note: " . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// Note: " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      ```bash
      grep -r "'''Note:'''" . | \ # Note:|NOTE:|Note that|'''Note:'''
      grep -v "// '''Note:''' " | \  # starting with // does not appear in API documentation.
      grep -E '.scala|.java' | \ # java/scala files
      grep -v Suite | \ # exclude tests
      grep -v Test | \ # exclude tests
      grep -e 'org.apache.spark.api.java' \ # packages appear in API documenation
      -e 'org.apache.spark.api.java.function' \
      -e 'org.apache.spark.api.r' \
      ...
      ```
      
      And then fixed one by one comparing with API documentation/access modifiers.
      
      After that, manually tested via `jekyll build`.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15889 from HyukjinKwon/SPARK-18437.
      
      (cherry picked from commit d5b1d5fc)
      Signed-off-by: default avatarSean Owen <sowen@cloudera.com>
      4b396a65
    • Sean Owen's avatar
      [SPARK-18448][CORE] SparkSession should implement java.lang.AutoCloseable like JavaSparkContext · 693401be
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      Just adds `close()` + `Closeable` as a synonym for `stop()`. This makes it usable in Java in try-with-resources, as suggested by ash211  (`Closeable` extends `AutoCloseable` BTW)
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #15932 from srowen/SPARK-18448.
      
      (cherry picked from commit db9fb9ba)
      Signed-off-by: default avatarSean Owen <sowen@cloudera.com>
      693401be
  6. Nov 18, 2016
    • Shixiong Zhu's avatar
      [SPARK-18497][SS] Make ForeachSink support watermark · b4bad04c
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      The issue in ForeachSink is the new created DataSet still uses the old QueryExecution. When `foreachPartition` is called, `QueryExecution.toString` will be called and then fail because it doesn't know how to plan EventTimeWatermark.
      
      This PR just replaces the QueryExecution with IncrementalExecution to fix the issue.
      
      ## How was this patch tested?
      
      `test("foreach with watermark")`.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15934 from zsxwing/SPARK-18497.
      
      (cherry picked from commit 2a40de40)
      Signed-off-by: default avatarTathagata Das <tathagata.das1565@gmail.com>
      b4bad04c
    • Reynold Xin's avatar
      [SPARK-18505][SQL] Simplify AnalyzeColumnCommand · 4b1df0e8
      Reynold Xin authored
      
      ## What changes were proposed in this pull request?
      I'm spending more time at the design & code level for cost-based optimizer now, and have found a number of issues related to maintainability and compatibility that I will like to address.
      
      This is a small pull request to clean up AnalyzeColumnCommand:
      
      1. Removed warning on duplicated columns. Warnings in log messages are useless since most users that run SQL don't see them.
      2. Removed the nested updateStats function, by just inlining the function.
      3. Renamed a few functions to better reflect what they do.
      4. Removed the factory apply method for ColumnStatStruct. It is a bad pattern to use a apply method that returns an instantiation of a class that is not of the same type (ColumnStatStruct.apply used to return CreateNamedStruct).
      5. Renamed ColumnStatStruct to just AnalyzeColumnCommand.
      6. Added more documentation explaining some of the non-obvious return types and code blocks.
      
      In follow-up pull requests, I'd like to address the following:
      
      1. Get rid of the Map[String, ColumnStat] map, since internally we should be using Attribute to reference columns, rather than strings.
      2. Decouple the fields exposed by ColumnStat and internals of Spark SQL's execution path. Currently the two are coupled because ColumnStat takes in an InternalRow.
      3. Correctness: Remove code path that stores statistics in the catalog using the base64 encoding of the UnsafeRow format, which is not stable across Spark versions.
      4. Clearly document the data representation stored in the catalog for statistics.
      
      ## How was this patch tested?
      Affected test cases have been updated.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #15933 from rxin/SPARK-18505.
      
      (cherry picked from commit 6f7ff750)
      Signed-off-by: default avatarReynold Xin <rxin@databricks.com>
      4b1df0e8
    • Shixiong Zhu's avatar
      [SPARK-18477][SS] Enable interrupts for HDFS in HDFSMetadataLog · 136f687c
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      HDFS `write` may just hang until timeout if some network error happens. It's better to enable interrupts to allow stopping the query fast on HDFS.
      
      This PR just changes the logic to only disable interrupts for local file system, as HADOOP-10622 only happens for local file system.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #15911 from zsxwing/interrupt-on-dfs.
      
      (cherry picked from commit e5f5c29e)
      Signed-off-by: default avatarTathagata Das <tathagata.das1565@gmail.com>
      136f687c
    • hyukjinkwon's avatar
      [SPARK-18422][CORE] Fix wholeTextFiles test to pass on Windows in JavaAPISuite · 6717981e
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the test `wholeTextFiles` in `JavaAPISuite.java`. This is failed due to the different path format on Windows.
      
      For example, the path in `container` was
      
      ```
      C:\projects\spark\target\tmp\1478967560189-0/part-00000
      ```
      
      whereas `new URI(res._1()).getPath()` was as below:
      
      ```
      /C:/projects/spark/target/tmp/1478967560189-0/part-00000
      ```
      
      ## How was this patch tested?
      
      Tests in `JavaAPISuite.java`.
      
      Tested via AppVeyor.
      
      **Before**
      Build: https://ci.appveyor.com/project/spark-test/spark/build/63-JavaAPISuite-1
      Diff: https://github.com/apache/spark/compare/master...spark-test:JavaAPISuite-1
      
      ```
      [info] Test org.apache.spark.JavaAPISuite.wholeTextFiles started
      [error] Test org.apache.spark.JavaAPISuite.wholeTextFiles failed: java.lang.AssertionError: expected:<spark is easy to use.
      [error] > but was:<null>, took 0.578 sec
      [error]     at org.apache.spark.JavaAPISuite.wholeTextFiles(JavaAPISuite.java:1089)
      ...
      ```
      
      **After**
      Build started: [CORE] `org.apache.spark.JavaAPISuite` [![PR-15866](https://ci.appveyor.com/api/projects/status/github/spark-test/spark?branch=198DDA52-F201-4D2B-BE2F-244E0C1725B2&svg=true)](https://ci.appveyor.com/project/spark-test/spark/branch/198DDA52-F201-4D2B-BE2F-244E0C1725B2)
      Diff: https://github.com/apache/spark/compare/master...spark-test:198DDA52-F201-4D2B-BE2F-244E0C1725B2
      
      
      
      ```
      [info] Test org.apache.spark.JavaAPISuite.wholeTextFiles started
      ...
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #15866 from HyukjinKwon/SPARK-18422.
      
      (cherry picked from commit 40d59ff5)
      Signed-off-by: default avatarSean Owen <sowen@cloudera.com>
      6717981e
    • Andrew Ray's avatar
      [SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all... · ec622eb7
      Andrew Ray authored
      [SPARK-18457][SQL] ORC and other columnar formats using HiveShim read all columns when doing a simple count
      
      ## What changes were proposed in this pull request?
      
      When reading zero columns (e.g., count(*)) from ORC or any other format that uses HiveShim, actually set the read column list to empty for Hive to use.
      
      ## How was this patch tested?
      
      Query correctness is handled by existing unit tests. I'm happy to add more if anyone can point out some case that is not covered.
      
      Reduction in data read can be verified in the UI when built with a recent version of Hadoop say:
      ```
      build/mvn -Pyarn -Phadoop-2.7 -Dhadoop.version=2.7.0 -Phive -DskipTests clean package
      ```
      However the default Hadoop 2.2 that is used for unit tests does not report actual bytes read and instead just full file sizes (see FileScanRDD.scala line 80). Therefore I don't think there is a good way to add a unit test for this.
      
      I tested with the following setup using above build options
      ```
      case class OrcData(intField: Long, stringField: String)
      spark.range(1,1000000).map(i => OrcData(i, s"part-$i")).toDF().write.format("orc").save("orc_test")
      
      sql(
            s"""CREATE EXTERNAL TABLE orc_test(
               |  intField LONG,
               |  stringField STRING
               |)
               |STORED AS ORC
               |LOCATION '${System.getProperty("user.dir") + "/orc_test"}'
             """.stripMargin)
      ```
      
      ## Results
      
      query | Spark 2.0.2 | this PR
      ---|---|---
      `sql("select count(*) from orc_test").collect`|4.4 MB|199.4 KB
      `sql("select intField from orc_test").collect`|743.4 KB|743.4 KB
      `sql("select * from orc_test").collect`|4.4 MB|4.4 MB
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #15898 from aray/sql-orc-no-col.
      
      (cherry picked from commit 795e9fc9)
      Signed-off-by: default avatarReynold Xin <rxin@databricks.com>
      ec622eb7
    • Tyson Condie's avatar
      [SPARK-18187][SQL] CompactibleFileStreamLog should not use "compactInterval"... · 5912c19e
      Tyson Condie authored
      [SPARK-18187][SQL] CompactibleFileStreamLog should not use "compactInterval" direcly with user setting.
      
      ## What changes were proposed in this pull request?
      CompactibleFileStreamLog relys on "compactInterval" to detect a compaction batch. If the "compactInterval" is reset by user, CompactibleFileStreamLog will return wrong answer, resulting data loss. This PR procides a way to check the validity of 'compactInterval', and calculate an appropriate value.
      
      ## How was this patch tested?
      When restart a stream, we change the 'spark.sql.streaming.fileSource.log.compactInterval' different with the former one.
      
      The primary solution to this issue was given by uncleGen
      Added extensions include an additional metadata field in OffsetSeq and CompactibleFileStreamLog APIs. zsxwing
      
      Author: Tyson Condie <tcondie@gmail.com>
      Author: genmao.ygm <genmao.ygm@genmaoygmdeMacBook-Air.local>
      
      Closes #15852 from tcondie/spark-18187.
      
      (cherry picked from commit 51baca22)
      Signed-off-by: default avatarShixiong Zhu <shixiong@databricks.com>
      5912c19e
  7. Nov 17, 2016
    • Josh Rosen's avatar
      [SPARK-18462] Fix ClassCastException in SparkListenerDriverAccumUpdates event · e8b1955e
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch fixes a `ClassCastException: java.lang.Integer cannot be cast to java.lang.Long` error which could occur in the HistoryServer while trying to process a deserialized `SparkListenerDriverAccumUpdates` event.
      
      The problem stems from how `jackson-module-scala` handles primitive type parameters (see https://github.com/FasterXML/jackson-module-scala/wiki/FAQ#deserializing-optionint-and-other-primitive-challenges
      
       for more details). This was causing a problem where our code expected a field to be deserialized as a `(Long, Long)` tuple but we got an `(Int, Int)` tuple instead.
      
      This patch hacks around this issue by registering a custom `Converter` with Jackson in order to deserialize the tuples as `(Object, Object)` and perform the appropriate casting.
      
      ## How was this patch tested?
      
      New regression tests in `SQLListenerSuite`.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #15922 from JoshRosen/SPARK-18462.
      
      (cherry picked from commit d9dd979d)
      Signed-off-by: default avatarReynold Xin <rxin@databricks.com>
      e8b1955e
    • Wenchen Fan's avatar
      [SPARK-18360][SQL] default table path of tables in default database should... · fc466be4
      Wenchen Fan authored
      [SPARK-18360][SQL] default table path of tables in default database should depend on the location of default database
      
      ## What changes were proposed in this pull request?
      
      The current semantic of the warehouse config:
      
      1. it's a static config, which means you can't change it once your spark application is launched.
      2. Once a database is created, its location won't change even the warehouse path config is changed.
      3. default database is a special case, although its location is fixed, but the locations of tables created in it are not. If a Spark app starts with warehouse path B(while the location of default database is A), then users create a table `tbl` in default database, its location will be `B/tbl` instead of `A/tbl`. If uses change the warehouse path config to C, and create another table `tbl2`, its location will still be `B/tbl2` instead of `C/tbl2`.
      
      rule 3 doesn't make sense and I think we made it by mistake, not intentionally. Data source tables don't follow rule 3 and treat default database like normal ones.
      
      This PR fixes hive serde tables to make it consistent with data source tables.
      
      ## How was this patch tested?
      
      HiveSparkSubmitSuite
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #15812 from cloud-fan/default-db.
      
      (cherry picked from commit ce13c267)
      Signed-off-by: default avatarYin Huai <yhuai@databricks.com>
      fc466be4
Loading