Skip to content
Snippets Groups Projects
  1. Feb 14, 2017
  2. Feb 13, 2017
    • Xin Wu's avatar
      [SPARK-19539][SQL] Block duplicate temp table during creation · 1ab97310
      Xin Wu authored
      ## What changes were proposed in this pull request?
      Current `CREATE TEMPORARY TABLE ... ` is deprecated and recommend users to use `CREATE TEMPORARY VIEW ...` And it does not support `IF NOT EXISTS `clause. However, if there is an existing temporary view defined, it is possible to unintentionally replace this existing view by issuing `CREATE TEMPORARY TABLE ...`  with the same table/view name.
      
      This PR is to disallow `CREATE TEMPORARY TABLE ...` with an existing view name.
      Under the cover, `CREATE TEMPORARY TABLE ...` will be changed to create temporary view, however, passing in a flag `replace=false`, instead of currently `true`. So when creating temporary view under the cover, if there is existing view with the same name, the operation will be blocked.
      
      ## How was this patch tested?
      New unit test case is added and updated some existing test cases to adapt the new behavior
      
      Author: Xin Wu <xinwu@us.ibm.com>
      
      Closes #16878 from xwu0226/block_duplicate_temp_table.
      1ab97310
    • ouyangxiaochen's avatar
      [SPARK-19115][SQL] Supporting Create Table Like Location · 6e45b547
      ouyangxiaochen authored
      What changes were proposed in this pull request?
      
      Support CREATE [EXTERNAL] TABLE LIKE LOCATION... syntax for Hive serde and datasource tables.
      In this PR,we follow SparkSQL design rules :
      
          supporting create table like view or physical table or temporary view with location.
          creating a table with location,this table will be an external table other than managed table.
      
      How was this patch tested?
      
      Add new test cases and update existing test cases
      
      Author: ouyangxiaochen <ou.yangxiaochen@zte.com.cn>
      
      Closes #16868 from ouyangxiaochen/spark19115.
      6e45b547
    • zero323's avatar
      [SPARK-19429][PYTHON][SQL] Support slice arguments in Column.__getitem__ · e02ac303
      zero323 authored
      ## What changes were proposed in this pull request?
      
      - Add support for `slice` arguments in `Column.__getitem__`.
      - Remove obsolete `__getslice__` bindings.
      
      ## How was this patch tested?
      
      Existing unit tests, additional tests covering `[]` with `slice`.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16771 from zero323/SPARK-19429.
      e02ac303
    • Marcelo Vanzin's avatar
      [SPARK-19520][STREAMING] Do not encrypt data written to the WAL. · 0169360e
      Marcelo Vanzin authored
      Spark's I/O encryption uses an ephemeral key for each driver instance.
      So driver B cannot decrypt data written by driver A since it doesn't
      have the correct key.
      
      The write ahead log is used for recovery, thus needs to be readable by
      a different driver. So it cannot be encrypted by Spark's I/O encryption
      code.
      
      The BlockManager APIs used by the WAL code to write the data automatically
      encrypt data, so changes are needed so that callers can to opt out of
      encryption.
      
      Aside from that, the "putBytes" API in the BlockManager does not do
      encryption, so a separate situation arised where the WAL would write
      unencrypted data to the BM and, when those blocks were read, decryption
      would fail. So the WAL code needs to ask the BM to encrypt that data
      when encryption is enabled; this code is not optimal since it results
      in a (temporary) second copy of the data block in memory, but should be
      OK for now until a more performant solution is added. The non-encryption
      case should not be affected.
      
      Tested with new unit tests, and by running streaming apps that do
      recovery using the WAL data with I/O encryption turned on.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16862 from vanzin/SPARK-19520.
      0169360e
    • hyukjinkwon's avatar
      [SPARK-19435][SQL] Type coercion between ArrayTypes · 9af8f743
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      This PR proposes to support type coercion between `ArrayType`s where the element types are compatible.
      
      **Before**
      
      ```
      Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve 'greatest(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got GREATEST(array<int>, array<double>).; line 1 pos 0;
      
      Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve 'least(`a`, array(1.0D))' due to data type mismatch: The expressions should all have the same type, got LEAST(array<int>, array<double>).; line 1 pos 0;
      
      sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
      org.apache.spark.sql.AnalysisException: incompatible types found in column a for inline table; line 1 pos 14
      
      Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
      org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. ArrayType(DoubleType,false) <> ArrayType(IntegerType,false) at the first column of the second table;;
      
      sql("SELECT IF(1=1, array(1), array(1D))")
      org.apache.spark.sql.AnalysisException: cannot resolve '(IF((1 = 1), array(1), array(1.0D)))' due to data type mismatch: differing types in '(IF((1 = 1), array(1), array(1.0D)))' (array<int> and array<double>).; line 1 pos 7;
      ```
      
      **After**
      
      ```scala
      Seq(Array(1)).toDF("a").selectExpr("greatest(a, array(1D))")
      res5: org.apache.spark.sql.DataFrame = [greatest(a, array(1.0)): array<double>]
      
      Seq(Array(1)).toDF("a").selectExpr("least(a, array(1D))")
      res6: org.apache.spark.sql.DataFrame = [least(a, array(1.0)): array<double>]
      
      sql("SELECT * FROM values (array(0)), (array(1D)) as data(a)")
      res8: org.apache.spark.sql.DataFrame = [a: array<double>]
      
      Seq(Array(1)).toDF("a").union(Seq(Array(1D)).toDF("b"))
      res10: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [a: array<double>]
      
      sql("SELECT IF(1=1, array(1), array(1D))")
      res15: org.apache.spark.sql.DataFrame = [(IF((1 = 1), array(1), array(1.0))): array<double>]
      ```
      
      ## How was this patch tested?
      
      Unit tests in `TypeCoercion` and Jenkins tests and
      
      building with scala 2.10
      
      ```scala
      ./dev/change-scala-version.sh 2.10
      ./build/mvn -Pyarn -Phadoop-2.4 -Dscala-2.10 -DskipTests clean package
      ```
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16777 from HyukjinKwon/SPARK-19435.
      9af8f743
    • Shixiong Zhu's avatar
      [SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using... · 905fdf0c
      Shixiong Zhu authored
      [SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using ExecutorClassLoader to load Netty generated classes
      
      ## What changes were proposed in this pull request?
      
      Netty's `MessageToMessageEncoder` uses [Javassist](https://github.com/netty/netty/blob/91a0bdc17a8298437d6de08a8958d753799bd4a6/common/src/main/java/io/netty/util/internal/JavassistTypeParameterMatcherGenerator.java#L62) to generate a matcher class and the implementation calls `Class.forName` to check if this class is already generated. If `MessageEncoder` or `MessageDecoder` is created in `ExecutorClassLoader.findClass`, it will cause `ClassCircularityError`. This is because loading this Netty generated class will call `ExecutorClassLoader.findClass` to search this class, and `ExecutorClassLoader` will try to use RPC to load it and cause to load the non-exist matcher class again. JVM will report `ClassCircularityError` to prevent such infinite recursion.
      
      ##### Why it only happens in Maven builds
      
      It's because Maven and SBT have different class loader tree. The Maven build will set a URLClassLoader as the current context class loader to run the tests and expose this issue. The class loader tree is as following:
      
      ```
      bootstrap class loader ------ ... ----- REPL class loader ---- ExecutorClassLoader
      |
      |
      URLClasssLoader
      ```
      
      The SBT build uses the bootstrap class loader directly and `ReplSuite.test("propagation of local properties")` is the first test in ReplSuite, which happens to load `io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher` into the bootstrap class loader (Note: in maven build, it's loaded into URLClasssLoader so it cannot be found in ExecutorClassLoader). This issue can be reproduced in SBT as well. Here are the produce steps:
      - Enable `hadoop.caller.context.enabled`.
      - Replace `Class.forName` with `Utils.classForName` in `object CallerContext`.
      - Ignore `ReplSuite.test("propagation of local properties")`.
      - Run `ReplSuite` using SBT.
      
      This PR just creates a singleton MessageEncoder and MessageDecoder and makes sure they are created before switching to ExecutorClassLoader. TransportContext will be created when creating RpcEnv and that happens before creating ExecutorClassLoader.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16859 from zsxwing/SPARK-17714.
      905fdf0c
    • Shixiong Zhu's avatar
      [SPARK-19542][SS] Delete the temp checkpoint if a query is stopped without errors · 3dbff9be
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When a query uses a temp checkpoint dir, it's better to delete it if it's stopped without errors.
      
      ## How was this patch tested?
      
      New unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16880 from zsxwing/delete-temp-checkpoint.
      3dbff9be
    • Ala Luszczak's avatar
      [SPARK-19514] Enhancing the test for Range interruption. · 0417ce87
      Ala Luszczak authored
      Improve the test for SPARK-19514, so that it's clear which stage is being cancelled.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #16914 from ala/fix-range-test.
      0417ce87
    • Josh Rosen's avatar
      [SPARK-19529] TransportClientFactory.createClient() shouldn't call awaitUninterruptibly() · 1c4d10b1
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch replaces a single `awaitUninterruptibly()` call with a plain `await()` call in Spark's `network-common` library in order to fix a bug which may cause tasks to be uncancellable.
      
      In Spark's Netty RPC layer, `TransportClientFactory.createClient()` calls `awaitUninterruptibly()` on a Netty future while waiting for a connection to be established. This creates problem when a Spark task is interrupted while blocking in this call (which can happen in the event of a slow connection which will eventually time out). This has bad impacts on task cancellation when `interruptOnCancel = true`.
      
      As an example of the impact of this problem, I experienced significant numbers of uncancellable "zombie tasks" on a production cluster where several tasks were blocked trying to connect to a dead shuffle server and then continued running as zombies after I cancelled the associated Spark stage. The zombie tasks ran for several minutes with the following stack:
      
      ```
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:460)
      io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607)
      io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
      org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
      org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) => holding Monitor(java.lang.Object1849476028})
      org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
      org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
      org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
      org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:
      350)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
      org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
      org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
      [...]
      ```
      
      As far as I can tell, `awaitUninterruptibly()` might have been used in order to avoid having to declare that methods throw `InterruptedException` (this code is written in Java, hence the need to use checked exceptions). This patch simply replaces this with a regular, interruptible `await()` call,.
      
      This required several interface changes to declare a new checked exception (these are internal interfaces, though, and this change doesn't significantly impact binary compatibility).
      
      An alternative approach would be to wrap `InterruptedException` into `IOException` in order to avoid having to change interfaces. The problem with this approach is that the `network-shuffle` project's `RetryingBlockFetcher` code treats `IOExceptions` as transitive failures when deciding whether to retry fetches, so throwing a wrapped `IOException` might cause an interrupted shuffle fetch to be retried, further prolonging the lifetime of a cancelled zombie task.
      
      Note that there are three other `awaitUninterruptibly()` in the codebase, but those calls have a hard 10 second timeout and are waiting on a `close()` operation which is expected to complete near instantaneously, so the impact of uninterruptibility there is much smaller.
      
      ## How was this patch tested?
      
      Manually.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16866 from JoshRosen/SPARK-19529.
      1c4d10b1
    • zero323's avatar
      [SPARK-19427][PYTHON][SQL] Support data type string as a returnType argument of UDF · ab88b241
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add support for data type string as a return type argument of `UserDefinedFunction`:
      
      ```python
      f = udf(lambda x: x, "integer")
       f.returnType
      
      ## IntegerType
      ```
      
      ## How was this patch tested?
      
      Existing unit tests, additional unit tests covering new feature.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16769 from zero323/SPARK-19427.
      ab88b241
    • zero323's avatar
      [SPARK-19506][ML][PYTHON] Import warnings in pyspark.ml.util · 5e7cd332
      zero323 authored
      ## What changes were proposed in this pull request?
      
      Add missing `warnings` import.
      
      ## How was this patch tested?
      
      Manual tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16846 from zero323/SPARK-19506.
      5e7cd332
    • hyukjinkwon's avatar
      [SPARK-19544][SQL] Improve error message when some column types are compatible... · 4321ff9e
      hyukjinkwon authored
      [SPARK-19544][SQL] Improve error message when some column types are compatible and others are not in set operations
      
      ## What changes were proposed in this pull request?
      
      This PR proposes to fix the error message when some data types are compatible and others are not in set/union operation.
      
      Currently, the code below:
      
      ```scala
      Seq((1,("a", 1))).toDF.union(Seq((1L,("a", "b"))).toDF)
      ```
      
      throws an exception saying `LongType` and `IntegerType` are incompatible types. It should say something about `StructType`s with more readable format as below:
      
      **Before**
      
      ```
      Union can only be performed on tables with the compatible column types.
      LongType <> IntegerType at the first column of the second table;;
      ```
      
      **After**
      
      ```
      Union can only be performed on tables with the compatible column types.
      struct<_1:string,_2:string> <> struct<_1:string,_2:int> at the second column of the second table;;
      ```
      
      *I manually inserted a newline in the messages above for readability only in this PR description.
      
      ## How was this patch tested?
      
      Unit tests in `AnalysisErrorSuite`, manual tests and build wth Scala 2.10.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #16882 from HyukjinKwon/SPARK-19544.
      4321ff9e
    • windpiger's avatar
      [SPARK-19496][SQL] to_date udf to return null when input date is invalid · 04ad8225
      windpiger authored
      ## What changes were proposed in this pull request?
      
      Currently the udf  `to_date` has different return value with an invalid date input.
      
      ```
      SELECT to_date('2015-07-22', 'yyyy-dd-MM') ->  return `2016-10-07`
      SELECT to_date('2014-31-12')    -> return null
      ```
      
      As discussed in JIRA [SPARK-19496](https://issues.apache.org/jira/browse/SPARK-19496), we should return null in both situations when the input date is invalid
      
      ## How was this patch tested?
      unit test added
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #16870 from windpiger/to_date.
      04ad8225
    • Armin Braun's avatar
      [SPARK-19562][BUILD] Added exclude for dev/pr-deps to gitignore · 8f03ad54
      Armin Braun authored
      ## What changes were proposed in this pull request?
      
      Just adding a missing .gitignore entry.
      
      ## How was this patch tested?
      
      Entry added, now repo is not dirty anymore after running the build.
      
      Author: Armin Braun <me@obrown.io>
      
      Closes #16904 from original-brownbear/SPARK-19562.
      8f03ad54
    • Xiao Li's avatar
      [SPARK-19574][ML][DOCUMENTATION] Fix Liquid Exception: Start indices amount is... · 855a1b75
      Xiao Li authored
      [SPARK-19574][ML][DOCUMENTATION] Fix Liquid Exception: Start indices amount is not equal to end indices amount
      
      ### What changes were proposed in this pull request?
      ```
      Liquid Exception: Start indices amount is not equal to end indices amount, see /Users/xiao/IdeaProjects/sparkDelivery/docs/../examples/src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java. in ml-features.md
      ```
      
      So far, the build is broken after merging https://github.com/apache/spark/pull/16789
      
      This PR is to fix it.
      
      ## How was this patch tested?
      Manual
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #16908 from gatorsmile/docMLFix.
      855a1b75
    • Liwei Lin's avatar
      [SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not be in the same group · 2bdbc870
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      In `KafkaOffsetReader`, when error occurs, we abort the existing consumer and create a new consumer. In our current implementation, the first consumer and the second consumer would be in the same group (which leads to SPARK-19559), **_violating our intention of the two consumers not being in the same group._**
      
      The cause is that, in our current implementation, the first consumer is created before `groupId` and `nextId` are initialized in the constructor. Then even if `groupId` and `nextId` are increased during the creation of that first consumer, `groupId` and `nextId` would still be initialized to default values in the constructor for the second consumer.
      
      We should make sure that `groupId` and `nextId` are initialized before any consumer is created.
      
      ## How was this patch tested?
      
      Ran 100 times of `KafkaSourceSuite`; all passed
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16902 from lw-lin/SPARK-19564-.
      2bdbc870
  3. Feb 12, 2017
    • titicaca's avatar
      [SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp column · bc0a0e63
      titicaca authored
      ## What changes were proposed in this pull request?
      
      Fix a bug in collect method for collecting timestamp column, the bug can be reproduced as shown in the following codes and outputs:
      
      ```
      library(SparkR)
      sparkR.session(master = "local")
      df <- data.frame(col1 = c(0, 1, 2),
                       col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))
      
      sdf1 <- createDataFrame(df)
      print(dtypes(sdf1))
      df1 <- collect(sdf1)
      print(lapply(df1, class))
      
      sdf2 <- filter(sdf1, "col1 > 0")
      print(dtypes(sdf2))
      df2 <- collect(sdf2)
      print(lapply(df2, class))
      ```
      
      As we can see from the printed output, the column type of col2 in df2 is converted to numeric unexpectedly, when NA exists at the top of the column.
      
      This is caused by method `do.call(c, list)`, if we convert a list, i.e. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01"))`, the class of the result is numeric instead of POSIXct.
      
      Therefore, we need to cast the data type of the vector explicitly.
      
      ## How was this patch tested?
      
      The patch can be tested manually with the same code above.
      
      Author: titicaca <fangzhou.yang@hotmail.com>
      
      Closes #16689 from titicaca/sparkr-dev.
      bc0a0e63
    • windpiger's avatar
      [SPARK-19448][SQL] optimize some duplication functions between HiveClientImpl and HiveUtils · 3881f342
      windpiger authored
      ## What changes were proposed in this pull request?
      
      There are some duplicate functions between `HiveClientImpl` and `HiveUtils`, we can merge them to one place. such as: `toHiveTable` 、`toHivePartition`、`fromHivePartition`.
      
      And additional modify is change `MetastoreRelation.attributes` to `MetastoreRelation.dataColKeys`
      https://github.com/apache/spark/blob/master/sql/hive/src/main/scala/org/apache/spark/sql/hive/MetastoreRelation.scala#L234
      
      ## How was this patch tested?
      N/A
      
      Author: windpiger <songjun@outlook.com>
      
      Closes #16787 from windpiger/todoInMetaStoreRelation.
      3881f342
  4. Feb 11, 2017
    • Kay Ousterhout's avatar
      [SPARK-19537] Move pendingPartitions to ShuffleMapStage. · 0fbecc73
      Kay Ousterhout authored
      The pendingPartitions instance variable should be moved to ShuffleMapStage,
      because it is only used by ShuffleMapStages. This change is purely refactoring
      and does not change functionality.
      
      I fixed this in an attempt to clarify some of the discussion around #16620, which I was having trouble reasoning about.  I stole the helpful comment Imran wrote for pendingPartitions and used it here.
      
      cc squito markhamstra jinxing64
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #16876 from kayousterhout/SPARK-19537.
      0fbecc73
  5. Feb 10, 2017
    • Herman van Hovell's avatar
      [SPARK-19548][SQL] Support Hive UDFs which return typed Lists/Maps · 226d3884
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      This PR adds support for Hive UDFs that return fully typed java Lists or Maps, for example `List<String>` or `Map<String, Integer>`.  It is also allowed to nest these structures, for example `Map<String, List<Integer>>`. Raw collections or collections using wildcards are still not supported, and cannot be supported due to the lack of type information.
      
      ## How was this patch tested?
      Modified existing tests in `HiveUDFSuite`, and I have added test cases for raw collection and collection using wildcards.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16886 from hvanhovell/SPARK-19548.
      226d3884
    • Ala Luszczak's avatar
      [SPARK-19549] Allow providing reason for stage/job cancelling · d785217b
      Ala Luszczak authored
      ## What changes were proposed in this pull request?
      
      This change add an optional argument to `SparkContext.cancelStage()` and `SparkContext.cancelJob()` functions, which allows the caller to provide exact reason  for the cancellation.
      
      ## How was this patch tested?
      
      Adds unit test.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #16887 from ala/cancel.
      d785217b
    • sueann's avatar
      [SPARK-18613][ML] make spark.mllib LDA dependencies in spark.ml LDA private · 3a43ae7c
      sueann authored
      ## What changes were proposed in this pull request?
      spark.ml.*LDAModel classes were exposing spark.mllib LDA models via protected methods. Made them package (clustering) private.
      
      ## How was this patch tested?
      ```
      build/sbt doc  # "millib.clustering" no longer appears in the docs for *LDA* classes
      build/sbt compile  # compiles
      build/sbt
      > mllib/testOnly   # tests pass
      ```
      
      Author: sueann <sueann@databricks.com>
      
      Closes #16860 from sueann/SPARK-18613.
      3a43ae7c
    • Herman van Hovell's avatar
      [SPARK-19459][SQL] Add Hive datatype (char/varchar) to StructField metadata · de8a03e6
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      Reading from an existing ORC table which contains `char` or `varchar` columns can fail with a `ClassCastException` if the table metadata has been created using Spark. This is caused by the fact that spark internally replaces `char` and `varchar` columns with a `string` column.
      
      This PR fixes this by adding the hive type to the `StructField's` metadata under the `HIVE_TYPE_STRING` key. This is picked up by the `HiveClient` and the ORC reader, see https://github.com/apache/spark/pull/16060 for more details on how the metadata is used.
      
      ## How was this patch tested?
      Added a regression test to `OrcSourceSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16804 from hvanhovell/SPARK-19459.
      de8a03e6
    • Eren Avsarogullari's avatar
      [SPARK-19466][CORE][SCHEDULER] Improve Fair Scheduler Logging · dadff5f0
      Eren Avsarogullari authored
      Fair Scheduler Logging for the following cases can be useful for the user.
      
      1. If **valid** `spark.scheduler.allocation.file` property is set, user can be informed and aware which scheduler file is processed when `SparkContext` initializes.
      
      2. If **invalid** `spark.scheduler.allocation.file` property is set, currently, the following stacktrace is shown to user. In addition to this, more meaningful message can be shown to user by emphasizing the problem at building level of fair scheduler. Also other potential issues can be covered at this level as **Fair Scheduler can not be built. + exception stacktrace**
      ```
      Exception in thread "main" java.io.FileNotFoundException: INVALID_FILE (No such file or directory)
      	at java.io.FileInputStream.open0(Native Method)
      	at java.io.FileInputStream.open(FileInputStream.java:195)
      	at java.io.FileInputStream.<init>(FileInputStream.java:138)
      	at java.io.FileInputStream.<init>(FileInputStream.java:93)
      	at org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$buildPools$1.apply(SchedulableBuilder.scala:76)
      	at org.apache.spark.scheduler.FairSchedulableBuilder$$anonfun$buildPools$1.apply(SchedulableBuilder.scala:75)
      ```
      3. If `spark.scheduler.allocation.file` property is not set and **default** fair scheduler file (**fairscheduler.xml**) is found in classpath, it will be loaded but currently, user is not informed for using default file so logging can be useful as **Fair Scheduler file: fairscheduler.xml is found successfully and will be parsed.**
      
      4. If **spark.scheduler.allocation.file** property is not set and **default** fair scheduler file does not exist in classpath, currently, user is not informed so logging can be useful as **No Fair Scheduler file found.**
      
      Also this PR is related with https://github.com/apache/spark/pull/15237 to emphasize fileName in warning logs when fair scheduler file has invalid minShare, weight or schedulingMode values.
      
      ## How was this patch tested?
      Added new Unit Tests.
      
      Author: erenavsarogullari <erenavsarogullari@gmail.com>
      
      Closes #16813 from erenavsarogullari/SPARK-19466.
      dadff5f0
    • Hervé's avatar
      Encryption of shuffle files · c5a66356
      Hervé authored
      Hello
      
      According to my understanding of commits 4b4e329e & 8b325b17, one may now encrypt shuffle files regardless of the cluster manager in use.
      
      However I have limited understanding of the code, I'm not able to find out whether theses changes also comprise all "temporary local storage, such as shuffle files, cached data, and other application files".
      
      Please feel free to amend or reject my PR if I'm wrong.
      
      dud
      
      Author: Hervé <dud225@users.noreply.github.com>
      
      Closes #16885 from dud225/patch-1.
      c5a66356
    • Devaraj K's avatar
      [SPARK-10748][MESOS] Log error instead of crashing Spark Mesos dispatcher when... · 8640dc08
      Devaraj K authored
      [SPARK-10748][MESOS] Log error instead of crashing Spark Mesos dispatcher when a job is misconfigured
      
      ## What changes were proposed in this pull request?
      
      Now handling the spark exception which gets thrown for invalid job configuration, marking that job as failed and continuing to launch the other drivers instead of throwing the exception.
      ## How was this patch tested?
      
      I verified manually, now the misconfigured jobs move to Finished Drivers section in UI and continue to launch the other jobs.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #13077 from devaraj-kavali/SPARK-10748.
      8640dc08
    • jerryshao's avatar
      [SPARK-19545][YARN] Fix compile issue for Spark on Yarn when building against Hadoop 2.6.0~2.6.3 · 8e8afb3a
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Due to the newly added API in Hadoop 2.6.4+, Spark builds against Hadoop 2.6.0~2.6.3 will meet compile error. So here still reverting back to use reflection to handle this issue.
      
      ## How was this patch tested?
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16884 from jerryshao/SPARK-19545.
      8e8afb3a
    • Burak Yavuz's avatar
      [SPARK-19543] from_json fails when the input row is empty · d5593f7f
      Burak Yavuz authored
      ## What changes were proposed in this pull request?
      
      Using from_json on a column with an empty string results in: java.util.NoSuchElementException: head of empty list.
      
      This is because `parser.parse(input)` may return `Nil` when `input.trim.isEmpty`
      
      ## How was this patch tested?
      
      Regression test in `JsonExpressionsSuite`
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #16881 from brkyvz/json-fix.
      d5593f7f
  6. Feb 09, 2017
    • jinxing's avatar
      [SPARK-19263] Fix race in SchedulerIntegrationSuite. · fd6c3a0b
      jinxing authored
      ## What changes were proposed in this pull request?
      
      All the process of offering resource and generating `TaskDescription` should be guarded by taskScheduler.synchronized in `reviveOffers`, otherwise there is race condition.
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: jinxing <jinxing@meituan.com>
      
      Closes #16831 from jinxing64/SPARK-19263-FixRaceInTest.
      fd6c3a0b
    • Shixiong Zhu's avatar
      [SPARK-19481] [REPL] [MAVEN] Avoid to leak SparkContext in Signaling.cancelOnInterrupt · 303f00a4
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `Signaling.cancelOnInterrupt` leaks a SparkContext per call and it makes ReplSuite unstable.
      
      This PR adds `SparkContext.getActive` to allow `Signaling.cancelOnInterrupt` to get the active `SparkContext` to avoid the leak.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16825 from zsxwing/SPARK-19481.
      303f00a4
    • José Hiram Soltren's avatar
      [SPARK-16554][CORE] Automatically Kill Executors and Nodes when they are Blacklisted · 6287c94f
      José Hiram Soltren authored
      ## What changes were proposed in this pull request?
      
      In SPARK-8425, we introduced a mechanism for blacklisting executors and nodes (hosts). After a certain number of failures, these resources would be "blacklisted" and no further work would be assigned to them for some period of time.
      
      In some scenarios, it is better to fail fast, and to simply kill these unreliable resources. This changes proposes to do so by having the BlacklistTracker kill unreliable resources when they would otherwise be "blacklisted".
      
      In order to be thread safe, this code depends on the CoarseGrainedSchedulerBackend sending a message to the driver backend in order to do the actual killing. This also helps to prevent a race which would permit work to begin on a resource (executor or node), between the time the resource is marked for killing and the time at which it is finally killed.
      
      ## How was this patch tested?
      
      ./dev/run-tests
      Ran https://github.com/jsoltren/jose-utils/blob/master/blacklist/test-blacklist.sh, and checked logs to see executors and nodes being killed.
      
      Testing can likely be improved here; suggestions welcome.
      
      Author: José Hiram Soltren <jose@cloudera.com>
      
      Closes #16650 from jsoltren/SPARK-16554-submit.
      6287c94f
    • jiangxingbo's avatar
      [SPARK-19025][SQL] Remove SQL builder for operators · af63c52f
      jiangxingbo authored
      ## What changes were proposed in this pull request?
      
      With the new approach of view resolution, we can get rid of SQL generation on view creation, so let's remove SQL builder for operators.
      
      Note that, since all sql generation for operators is defined in one file (org.apache.spark.sql.catalyst.SQLBuilder), it’d be trivial to recover it in the future.
      
      ## How was this patch tested?
      
      N/A
      
      Author: jiangxingbo <jiangxb1987@gmail.com>
      
      Closes #16869 from jiangxb1987/SQLBuilder.
      af63c52f
    • Bogdan Raducanu's avatar
      [SPARK-19512][SQL] codegen for compare structs fails · 1af0dee4
      Bogdan Raducanu authored
      ## What changes were proposed in this pull request?
      
      Set currentVars to null in GenerateOrdering.genComparisons before genCode is called. genCode ignores INPUT_ROW if currentVars is not null and in genComparisons we want it to use INPUT_ROW.
      
      ## How was this patch tested?
      
      Added test with 2 queries in WholeStageCodegenSuite
      
      Author: Bogdan Raducanu <bogdan.rdc@gmail.com>
      
      Closes #16852 from bogdanrdc/SPARK-19512.
      1af0dee4
    • Ala Luszczak's avatar
      [SPARK-19514] Making range interruptible. · 4064574d
      Ala Luszczak authored
      ## What changes were proposed in this pull request?
      
      Previously range operator could not be interrupted. For example, using DAGScheduler.cancelStage(...) on a query with range might have been ineffective.
      
      This change adds periodic checks of TaskContext.isInterrupted to codegen version, and InterruptibleOperator to non-codegen version.
      
      I benchmarked the performance of codegen version on a sample query `spark.range(1000L * 1000 * 1000 * 10).count()` and there is no measurable difference.
      
      ## How was this patch tested?
      
      Adds a unit test.
      
      Author: Ala Luszczak <ala@databricks.com>
      
      Closes #16872 from ala/SPARK-19514b.
      4064574d
Loading