  1. Feb 17, 2017
    • Davies Liu
      [SPARK-19500] [SQL] Fix off-by-one bug in BytesToBytesMap · 6e3abed8
      Davies Liu authored
      
      ## What changes were proposed in this pull request?
      
      Radix sort requires that half of the array be free (as temporary space), so we use 0.5 as the scale factor to make sure that BytesToBytesMap will not have more items than 1/2 of its capacity. It turned out this is not guaranteed: the current implementation of append() could leave one more item than the threshold (1/2 of capacity) in the array, which breaks the requirement of radix sort (failing the assert in 2.2, or failing to insert into InMemorySorter in 2.1).
      
      This PR fixes the off-by-one bug in BytesToBytesMap.
      
      This PR also fixes a bug, introduced by #15722, where the array would never grow if it failed to grow once (staying at its initial capacity).
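      A minimal, self-contained sketch of the invariant and the off-by-one failure mode (illustrative only, not the actual BytesToBytesMap code; method names and return types are assumptions):
      
      ```scala
      // Illustrative sketch only -- not the real BytesToBytesMap. Radix sort
      // needs the upper half of the array as scratch space, so no append may
      // leave more than capacity / 2 entries behind.
      final class MapSketch(capacity: Int) {
        private val growthThreshold = (capacity * 0.5).toInt
        private var numKeys = 0
      
        // Buggy shape: the bound is checked with '>', so the append that ends up
        // with growthThreshold + 1 entries is still accepted -- one too many.
        def appendBuggy(): Boolean = {
          if (numKeys > growthThreshold) return false
          numKeys += 1
          true
        }
      
        // Fixed shape: refuse the append that would cross the threshold.
        def appendFixed(): Boolean = {
          if (numKeys >= growthThreshold) return false
          numKeys += 1
          true
        }
      
        def holdsRadixSortInvariant: Boolean = numKeys <= capacity / 2
      }
      ```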
      
      ## How was this patch tested?
      
      Added regression test.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #16844 from davies/off_by_one.
      
      (cherry picked from commit 3d0c3af0)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      6e3abed8
    • Stan Zhai
      [SPARK-19622][WEBUI] Fix a http error in a paged table when using a `Go` button to search. · 55958bcd
      Stan Zhai authored
      ## What changes were proposed in this pull request?
      
      The search function of the paged table is not available because we don't strip the hash data from the request path.
      
      ![](https://issues.apache.org/jira/secure/attachment/12852996/screenshot-1.png)
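      A hedged sketch of the fix idea (illustrative only, not the actual PagedTable / Web UI code): strip the URL fragment before the page-size and page-number parameters are appended for the `Go` button, so the hash data never ends up in the request path.
      
      ```scala
      // Hedged sketch -- stripFragment is an assumed helper name, not Spark code.
      def stripFragment(url: String): String = {
        val hashIndex = url.indexOf('#')
        if (hashIndex >= 0) url.substring(0, hashIndex) else url
      }
      
      // stripFragment("http://host:4040/stages/?id=1#completed")
      //   == "http://host:4040/stages/?id=1"
      ```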
      
      ## How was this patch tested?
      
      Tested manually with my browser.
      
      Author: Stan Zhai <zhaishidan@haizhi.com>
      
      Closes #16953 from stanzhai/fix-webui-paged-table.
      
      (cherry picked from commit 021062af)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      55958bcd
  2. Feb 15, 2017
    • Felix Cheung
      [SPARK-19399][SPARKR][BACKPORT-2.1] fix tests broken by merge · 252dd05f
      Felix Cheung authored
      ## What changes were proposed in this pull request?
      
      Fix tests broken by the git merge for #16739.
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16950 from felixcheung/fixrtest.
      252dd05f
    • Shixiong Zhu
      [SPARK-19603][SS] Fix StreamingQuery explain command · db7adb61
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      `StreamingQuery.explain` doesn't show the correct streaming physical plan right now because `ExplainCommand` receives a runtime batch plan and its `logicalPlan.isStreaming` is always false.
      
      This PR adds a `streaming` parameter to `ExplainCommand` to allow `StreamExecution` to specify that it's a streaming plan.
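      A hedged usage sketch (the source, sink and names here are illustrative, not the code used for the outputs below) of how these explain outputs are obtained:
      
      ```scala
      import spark.implicits._
      
      val lines = spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()
      
      val counts = lines.as[String].groupByKey(identity).count()
      
      // streaming DataFrame.explain(): shows the streaming physical plan
      counts.toDF().explain()
      
      val query = counts.writeStream
        .outputMode("complete")
        .format("console")
        .start()
      
      // StreamingQuery.explain(): with this fix, shows the plan of the running
      // streaming query instead of a runtime batch plan
      query.explain()
      query.explain(extended = true)
      ```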
      
      Examples of the explain outputs:
      
      - streaming DataFrame.explain()
      ```
      == Physical Plan ==
      *HashAggregate(keys=[value#518], functions=[count(1)])
      +- StateStoreSave [value#518], OperatorStateId(<unknown>,0,0), Append, 0
         +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
            +- StateStoreRestore [value#518], OperatorStateId(<unknown>,0,0)
               +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
                  +- Exchange hashpartitioning(value#518, 5)
                     +- *HashAggregate(keys=[value#518], functions=[partial_count(1)])
                        +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                           +- *MapElements <function1>, obj#517: java.lang.String
                              +- *DeserializeToObject value#513.toString, obj#516: java.lang.String
                                 +- StreamingRelation MemoryStream[value#513], [value#513]
      ```
      
      - StreamingQuery.explain(extended = false)
      ```
      == Physical Plan ==
      *HashAggregate(keys=[value#518], functions=[count(1)])
      +- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0
         +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
            +- StateStoreRestore [value#518], OperatorStateId(...,0,0)
               +- *HashAggregate(keys=[value#518], functions=[merge_count(1)])
                  +- Exchange hashpartitioning(value#518, 5)
                     +- *HashAggregate(keys=[value#518], functions=[partial_count(1)])
                        +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                           +- *MapElements <function1>, obj#517: java.lang.String
                              +- *DeserializeToObject value#543.toString, obj#516: java.lang.String
                                 +- LocalTableScan [value#543]
      ```
      
      - StreamingQuery.explain(extended = true)
      ```
      == Parsed Logical Plan ==
      Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
         +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
            +- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String
               +- LocalRelation [value#543]
      
      == Analyzed Logical Plan ==
      value: string, count(1): bigint
      Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
         +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
            +- DeserializeToObject cast(value#543 as string).toString, obj#516: java.lang.String
               +- LocalRelation [value#543]
      
      == Optimized Logical Plan ==
      Aggregate [value#518], [value#518, count(1) AS count(1)#524L]
      +- SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
         +- MapElements <function1>, class java.lang.String, [StructField(value,StringType,true)], obj#517: java.lang.String
            +- DeserializeToObject value#543.toString, obj#516: java.lang.String
               +- LocalRelation [value#543]
      
      == Physical Plan ==
      *HashAggregate(keys=[value#518], functions=[count(1)], output=[value#518, count(1)#524L])
      +- StateStoreSave [value#518], OperatorStateId(...,0,0), Complete, 0
         +- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L])
            +- StateStoreRestore [value#518], OperatorStateId(...,0,0)
               +- *HashAggregate(keys=[value#518], functions=[merge_count(1)], output=[value#518, count#530L])
                  +- Exchange hashpartitioning(value#518, 5)
                     +- *HashAggregate(keys=[value#518], functions=[partial_count(1)], output=[value#518, count#530L])
                        +- *SerializeFromObject [staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, input[0, java.lang.String, true], true) AS value#518]
                           +- *MapElements <function1>, obj#517: java.lang.String
                              +- *DeserializeToObject value#543.toString, obj#516: java.lang.String
                                 +- LocalTableScan [value#543]
      ```
      
      ## How was this patch tested?
      
      The updated unit test.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16934 from zsxwing/SPARK-19603.
      
      (cherry picked from commit fc02ef95)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      db7adb61
    • Yin Huai
      [SPARK-19604][TESTS] Log the start of every Python test · b9ab4c0e
      Yin Huai authored
      
      ## What changes were proposed in this pull request?
      Right now, we only have an info-level log after we finish the tests of a Python test file. We should also log the start of a test, so that if a test is hanging, we can tell which test file is running.
      
      ## How was this patch tested?
      This is a change for python tests.
      
      Author: Yin Huai <yhuai@databricks.com>
      
      Closes #16935 from yhuai/SPARK-19604.
      
      (cherry picked from commit f6c3bba2)
      Signed-off-by: Yin Huai <yhuai@databricks.com>
      b9ab4c0e
    • Shixiong Zhu
      [SPARK-19599][SS] Clean up HDFSMetadataLog · 88c43f4f
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      SPARK-19464 removed support for Hadoop 2.5 and earlier, so we can do some cleanup for HDFSMetadataLog.
      
      This PR includes the following changes:
      - ~~Remove the workaround code for HADOOP-10622.~~ Unfortunately, there is another issue, [HADOOP-14084](https://issues.apache.org/jira/browse/HADOOP-14084), that prevents us from removing the workaround code.
      - Remove unnecessary `writer: (T, OutputStream) => Unit` and just call `serialize` directly.
      - Remove catching FileNotFoundException.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16932 from zsxwing/metadata-cleanup.
      
      (cherry picked from commit 21b4ba2d)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      88c43f4f
    • Felix Cheung
      [SPARK-19399][SPARKR] Add R coalesce API for DataFrame and Column · 6c353990
      Felix Cheung authored
      
      Add coalesce on DataFrame for reducing the number of partitions without a shuffle, and coalesce on Column.
      
      manual, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16739 from felixcheung/rcoalesce.
      
      (cherry picked from commit 671bc08e)
      Signed-off-by: Felix Cheung <felixcheung@apache.org>
      6c353990
  3. Feb 14, 2017
  4. Feb 13, 2017
    • Marcelo Vanzin
      [SPARK-19520][STREAMING] Do not encrypt data written to the WAL. · 7fe3543f
      Marcelo Vanzin authored
      
      Spark's I/O encryption uses an ephemeral key for each driver instance.
      So driver B cannot decrypt data written by driver A since it doesn't
      have the correct key.
      
      The write ahead log is used for recovery, thus needs to be readable by
      a different driver. So it cannot be encrypted by Spark's I/O encryption
      code.
      
      The BlockManager APIs used by the WAL code to write the data automatically
      encrypt data, so changes are needed so that callers can opt out of
      encryption.
      
      Aside from that, the "putBytes" API in the BlockManager does not do
      encryption, so a separate situation arose where the WAL would write
      unencrypted data to the BM and, when those blocks were read, decryption
      would fail. So the WAL code needs to ask the BM to encrypt that data
      when encryption is enabled; this code is not optimal since it results
      in a (temporary) second copy of the data block in memory, but should be
      OK for now until a more performant solution is added. The non-encryption
      case should not be affected.
      
      Tested with new unit tests, and by running streaming apps that do
      recovery using the WAL data with I/O encryption turned on.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16862 from vanzin/SPARK-19520.
      
      (cherry picked from commit 0169360e)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      7fe3543f
    • Josh Rosen
      [SPARK-19529] TransportClientFactory.createClient() shouldn't call awaitUninterruptibly() · 5db23473
      Josh Rosen authored
      
      This patch replaces a single `awaitUninterruptibly()` call with a plain `await()` call in Spark's `network-common` library in order to fix a bug which may cause tasks to be uncancellable.
      
      In Spark's Netty RPC layer, `TransportClientFactory.createClient()` calls `awaitUninterruptibly()` on a Netty future while waiting for a connection to be established. This creates problem when a Spark task is interrupted while blocking in this call (which can happen in the event of a slow connection which will eventually time out). This has bad impacts on task cancellation when `interruptOnCancel = true`.
      
      As an example of the impact of this problem, I experienced significant numbers of uncancellable "zombie tasks" on a production cluster where several tasks were blocked trying to connect to a dead shuffle server and then continued running as zombies after I cancelled the associated Spark stage. The zombie tasks ran for several minutes with the following stack:
      
      ```
      java.lang.Object.wait(Native Method)
      java.lang.Object.wait(Object.java:460)
      io.netty.util.concurrent.DefaultPromise.await0(DefaultPromise.java:607)
      io.netty.util.concurrent.DefaultPromise.awaitUninterruptibly(DefaultPromise.java:301)
      org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:224)
      org.apache.spark.network.client.TransportClientFactory.createClient(TransportClientFactory.java:179) => holding Monitor(java.lang.Object@1849476028)
      org.apache.spark.network.shuffle.ExternalShuffleClient$1.createAndStart(ExternalShuffleClient.java:105)
      org.apache.spark.network.shuffle.RetryingBlockFetcher.fetchAllOutstanding(RetryingBlockFetcher.java:140)
      org.apache.spark.network.shuffle.RetryingBlockFetcher.start(RetryingBlockFetcher.java:120)
      org.apache.spark.network.shuffle.ExternalShuffleClient.fetchBlocks(ExternalShuffleClient.java:114)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.sendRequest(ShuffleBlockFetcherIterator.scala:169)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.fetchUpToMaxBytes(ShuffleBlockFetcherIterator.scala:350)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.initialize(ShuffleBlockFetcherIterator.scala:286)
      org.apache.spark.storage.ShuffleBlockFetcherIterator.<init>(ShuffleBlockFetcherIterator.scala:120)
      org.apache.spark.shuffle.BlockStoreShuffleReader.read(BlockStoreShuffleReader.scala:45)
      org.apache.spark.sql.execution.ShuffledRowRDD.compute(ShuffledRowRDD.scala:169)
      org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
      org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
      [...]
      ```
      
      As far as I can tell, `awaitUninterruptibly()` might have been used in order to avoid having to declare that methods throw `InterruptedException` (this code is written in Java, hence the need to use checked exceptions). This patch simply replaces it with a regular, interruptible `await()` call.
      
      This required several interface changes to declare a new checked exception (these are internal interfaces, though, and this change doesn't significantly impact binary compatibility).
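      A hedged sketch of the change (the real code is Java, in network-common's TransportClientFactory; the method name and timeout handling here are illustrative):
      
      ```scala
      import java.io.IOException
      import io.netty.channel.ChannelFuture
      
      @throws[IOException]
      @throws[InterruptedException]
      def waitForChannel(cf: ChannelFuture, connectionTimeoutMs: Long): Unit = {
        // Before: cf.awaitUninterruptibly(connectionTimeoutMs) -- an interrupt
        // (e.g. a task kill) is swallowed and the thread keeps blocking.
        // After: await() lets InterruptedException propagate, so a cancelled
        // task stops blocking here instead of becoming a zombie.
        if (!cf.await(connectionTimeoutMs)) {
          throw new IOException(s"Connecting timed out ($connectionTimeoutMs ms)")
        }
      }
      ```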
      
      An alternative approach would be to wrap `InterruptedException` into `IOException` in order to avoid having to change interfaces. The problem with this approach is that the `network-shuffle` project's `RetryingBlockFetcher` code treats `IOExceptions` as transient failures when deciding whether to retry fetches, so throwing a wrapped `IOException` might cause an interrupted shuffle fetch to be retried, further prolonging the lifetime of a cancelled zombie task.
      
      Note that there are three other `awaitUninterruptibly()` calls in the codebase, but those calls have a hard 10 second timeout and are waiting on a `close()` operation which is expected to complete near-instantaneously, so the impact of uninterruptibility there is much smaller.
      
      Manually.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #16866 from JoshRosen/SPARK-19529.
      
      (cherry picked from commit 1c4d10b1)
      Signed-off-by: Cheng Lian <lian@databricks.com>
      5db23473
    • Shixiong Zhu
      [SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using... · 328b2298
      Shixiong Zhu authored
      [SPARK-17714][CORE][TEST-MAVEN][TEST-HADOOP2.6] Avoid using ExecutorClassLoader to load Netty generated classes
      
      ## What changes were proposed in this pull request?
      
      Netty's `MessageToMessageEncoder` uses [Javassist](https://github.com/netty/netty/blob/91a0bdc17a8298437d6de08a8958d753799bd4a6/common/src/main/java/io/netty/util/internal/JavassistTypeParameterMatcherGenerator.java#L62) to generate a matcher class, and the implementation calls `Class.forName` to check if this class has already been generated. If `MessageEncoder` or `MessageDecoder` is created in `ExecutorClassLoader.findClass`, it will cause `ClassCircularityError`: loading this Netty-generated class calls `ExecutorClassLoader.findClass` to search for it, and `ExecutorClassLoader` tries to use RPC to load it, which triggers loading the not-yet-existing matcher class again. The JVM reports `ClassCircularityError` to prevent such infinite recursion.
      
      ##### Why it only happens in Maven builds
      
      It's because Maven and SBT have different class loader trees. The Maven build sets a URLClassLoader as the current context class loader to run the tests, which exposes this issue. The class loader tree is as follows:
      
      ```
      bootstrap class loader ------ ... ----- REPL class loader ---- ExecutorClassLoader
      |
      |
      URLClassLoader
      ```
      
      The SBT build uses the bootstrap class loader directly and `ReplSuite.test("propagation of local properties")` is the first test in ReplSuite, which happens to load `io/netty/util/internal/__matchers__/org/apache/spark/network/protocol/MessageMatcher` into the bootstrap class loader (Note: in the Maven build, it's loaded into the URLClassLoader so it cannot be found in ExecutorClassLoader). This issue can be reproduced in SBT as well. Here are the reproduction steps:
      - Enable `hadoop.caller.context.enabled`.
      - Replace `Class.forName` with `Utils.classForName` in `object CallerContext`.
      - Ignore `ReplSuite.test("propagation of local properties")`.
      - Run `ReplSuite` using SBT.
      
      This PR just creates a singleton MessageEncoder and MessageDecoder and makes sure they are created before switching to ExecutorClassLoader. TransportContext will be created when creating RpcEnv and that happens before creating ExecutorClassLoader.
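      A hedged, self-contained sketch of the idea (the real change is in Java, in TransportContext; the stand-in class names below are assumptions, not Spark or Netty API):
      
      ```scala
      // Stand-ins for Netty/Spark's MessageEncoder and MessageDecoder.
      class MessageEncoderSketch
      class MessageDecoderSketch
      
      object SharedCodecs {
        // Initialized the first time this object is referenced -- e.g. while the
        // RpcEnv/TransportContext is created, before the REPL installs
        // ExecutorClassLoader as the context class loader, so Javassist's
        // matcher class is generated with a safe class loader.
        val encoder = new MessageEncoderSketch
        val decoder = new MessageDecoderSketch
      }
      ```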
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16859 from zsxwing/SPARK-17714.
      
      (cherry picked from commit 905fdf0c)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      328b2298
    • Shixiong Zhu
      [SPARK-19542][SS] Delete the temp checkpoint if a query is stopped without errors · c5a7cb02
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      When a query uses a temp checkpoint dir, it's better to delete it if it's stopped without errors.
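      A hedged usage sketch of the behavior (the source and sink choices are illustrative):
      
      ```scala
      val lines = spark.readStream
        .format("socket")
        .option("host", "localhost")
        .option("port", 9999)
        .load()
      
      // No checkpointLocation is set, so a temporary checkpoint directory is used.
      val query = lines.writeStream
        .format("console")
        .start()
      
      // Stopped without errors: with this change the temp checkpoint directory
      // is deleted; if the query had failed, it would be left in place.
      query.stop()
      ```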
      
      ## How was this patch tested?
      
      New unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16880 from zsxwing/delete-temp-checkpoint.
      
      (cherry picked from commit 3dbff9be)
      Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
      c5a7cb02
    • zero323
      [SPARK-19506][ML][PYTHON] Import warnings in pyspark.ml.util · ef4fb7eb
      zero323 authored
      
      ## What changes were proposed in this pull request?
      
      Add missing `warnings` import.
      
      ## How was this patch tested?
      
      Manual tests.
      
      Author: zero323 <zero323@users.noreply.github.com>
      
      Closes #16846 from zero323/SPARK-19506.
      
      (cherry picked from commit 5e7cd332)
      Signed-off-by: Holden Karau <holden@us.ibm.com>
      ef4fb7eb
    • Xiao Li
      [SPARK-19574][ML][DOCUMENTATION] Fix Liquid Exception: Start indices amount is... · a3b67513
      Xiao Li authored
      [SPARK-19574][ML][DOCUMENTATION] Fix Liquid Exception: Start indices amount is not equal to end indices amount
      
      ### What changes were proposed in this pull request?
      ```
      Liquid Exception: Start indices amount is not equal to end indices amount, see /Users/xiao/IdeaProjects/sparkDelivery/docs/../examples/src/main/java/org/apache/spark/examples/ml/JavaTokenizerExample.java. in ml-features.md
      ```
      
      The build has been broken since merging https://github.com/apache/spark/pull/16789. This PR fixes it.
      
      ## How was this patch tested?
      Manual
      
      Author: Xiao Li <gatorsmile@gmail.com>
      
      Closes #16908 from gatorsmile/docMLFix.
      
      (cherry picked from commit 855a1b75)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      a3b67513
    • Liwei Lin
      [SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not be in the same group · fe4fcc57
      Liwei Lin authored
      
      ## What changes were proposed in this pull request?
      
      In `KafkaOffsetReader`, when an error occurs, we abort the existing consumer and create a new consumer. In our current implementation, the first consumer and the second consumer would be in the same group (which leads to SPARK-19559), **_violating our intention of the two consumers not being in the same group._**
      
      The cause is that, in our current implementation, the first consumer is created before `groupId` and `nextId` are initialized in the constructor. Then even if `groupId` and `nextId` are increased during the creation of that first consumer, `groupId` and `nextId` would still be initialized to default values in the constructor for the second consumer.
      
      We should make sure that `groupId` and `nextId` are initialized before any consumer is created.
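      A hedged, self-contained sketch of the initialization-order pitfall (not the actual KafkaOffsetReader code; names are illustrative):
      
      ```scala
      // Scala runs constructor statements top to bottom, so a consumer created
      // before groupId/nextId are initialized would see their default values,
      // and a replacement consumer would end up in the same group.
      class OffsetReaderSketch {
        // After the fix: these fields are initialized before any consumer exists.
        private var nextId = 0
        private var groupId: String = nextGroupId()
      
        // Only now is a consumer created, with a properly initialized group id;
        // a replacement consumer calls nextGroupId() again and gets a new group.
        private var consumerGroup: String = groupId
      
        private def nextGroupId(): String = {
          val id = s"spark-kafka-source-${java.util.UUID.randomUUID}-$nextId"
          nextId += 1
          id
        }
      }
      ```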
      
      ## How was this patch tested?
      
      Ran `KafkaSourceSuite` 100 times; all passed.
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16902 from lw-lin/SPARK-19564-.
      
      (cherry picked from commit 2bdbc870)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      fe4fcc57
  5. Feb 12, 2017
    • wm624@hotmail.com
      [SPARK-19319][BACKPORT-2.1][SPARKR] SparkR Kmeans summary returns error when... · 06e77e00
      wm624@hotmail.com authored
      [SPARK-19319][BACKPORT-2.1][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal to k
      
      ## What changes were proposed in this pull request?
      
      Backport fix of #16666
      
      ## How was this patch tested?
      
      Backport unit tests
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16761 from wangmiao1981/kmeansport.
      06e77e00
    • titicaca
      [SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp column · 173c2387
      titicaca authored
      
      ## What changes were proposed in this pull request?
      
      Fix a bug in the collect method for collecting a timestamp column. The bug can be reproduced with the following code and output:
      
      ```
      library(SparkR)
      sparkR.session(master = "local")
      df <- data.frame(col1 = c(0, 1, 2),
                       col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))
      
      sdf1 <- createDataFrame(df)
      print(dtypes(sdf1))
      df1 <- collect(sdf1)
      print(lapply(df1, class))
      
      sdf2 <- filter(sdf1, "col1 > 0")
      print(dtypes(sdf2))
      df2 <- collect(sdf2)
      print(lapply(df2, class))
      ```
      
      As we can see from the printed output, the column type of col2 in df2 is unexpectedly converted to numeric when an NA exists at the top of the column.
      
      This is caused by `do.call(c, list)`: if we combine a list, e.g. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))`, the class of the result is numeric instead of POSIXct.
      
      Therefore, we need to cast the data type of the vector explicitly.
      
      ## How was this patch tested?
      
      The patch can be tested manually with the same code above.
      
      Author: titicaca <fangzhou.yang@hotmail.com>
      
      Closes #16689 from titicaca/sparkr-dev.
      
      (cherry picked from commit bc0a0e63)
      Signed-off-by: Felix Cheung <felixcheung@apache.org>
      173c2387
  6. Feb 10, 2017
    • Andrew Ray
      [SPARK-18717][SQL] Make code generation for Scala Map work with immutable.Map also · e580bb03
      Andrew Ray authored
      
      ## What changes were proposed in this pull request?
      
      Fixes compile errors in generated code when a user has a case class with a `scala.collection.immutable.Map` instead of a `scala.collection.Map`. Since ArrayBasedMapData.toScalaMap returns the immutable version, we can make it work with both.
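      A hedged example of the previously failing case (class and column names are made up):
      
      ```scala
      import spark.implicits._
      
      // A case class field typed as scala.collection.immutable.Map rather than
      // scala.collection.Map.
      case class Pet(name: String, attributes: scala.collection.immutable.Map[String, String])
      
      val ds = Seq(Pet("rex", Map("color" -> "brown"))).toDS()
      // Before the fix this could hit compile errors in the generated code;
      // now both Map flavours work.
      ds.collect()
      ```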
      
      ## How was this patch tested?
      
      Additional unit tests.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #16161 from aray/fix-map-codegen.
      
      (cherry picked from commit 46d30ac4)
      Signed-off-by: Cheng Lian <lian@databricks.com>
      e580bb03
    • Burak Yavuz
      [SPARK-19543] from_json fails when the input row is empty · 7b5ea000
      Burak Yavuz authored
      
      ## What changes were proposed in this pull request?
      
      Using from_json on a column with an empty string results in: java.util.NoSuchElementException: head of empty list.
      
      This is because `parser.parse(input)` may return `Nil` when `input.trim.isEmpty`.
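      A hedged reproduction sketch based on the description (column and schema names are illustrative):
      
      ```scala
      import org.apache.spark.sql.functions.from_json
      import org.apache.spark.sql.types.{IntegerType, StructType}
      import spark.implicits._
      
      val schema = new StructType().add("a", IntegerType)
      val df = Seq("""{"a": 1}""", "").toDF("json")
      
      // Previously the empty-string row threw
      // java.util.NoSuchElementException: head of empty list; now it yields null.
      df.select(from_json($"json", schema)).show()
      ```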
      
      ## How was this patch tested?
      
      Regression test in `JsonExpressionsSuite`
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #16881 from brkyvz/json-fix.
      
      (cherry picked from commit d5593f7f)
      Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      7b5ea000
    • Bogdan Raducanu
      [SPARK-19512][BACKPORT-2.1][SQL] codegen for compare structs fails #16852 · ff5818b8
      Bogdan Raducanu authored
      ## What changes were proposed in this pull request?
      
      Set currentVars to null in GenerateOrdering.genComparisons before genCode is called. genCode ignores INPUT_ROW if currentVars is not null and in genComparisons we want it to use INPUT_ROW.
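      A hedged illustration (not the added test itself) of the kind of query that goes through GenerateOrdering.genComparisons -- ordering rows by a struct-typed column:
      
      ```scala
      import org.apache.spark.sql.functions.struct
      import spark.implicits._
      
      val df = Seq((3, "c"), (1, "a"), (2, "b")).toDF("id", "name")
        .select(struct($"id", $"name").as("s"))
      
      // Sorting by a struct column requires generated comparison code for the
      // struct's fields; this is the code path being fixed.
      df.sort($"s").show()
      ```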
      
      ## How was this patch tested?
      
      Added test with 2 queries in WholeStageCodegenSuite
      
      Author: Bogdan Raducanu <bogdan.rdc@gmail.com>
      
      Closes #16875 from bogdanrdc/SPARK-19512-2.1.
      ff5818b8
  7. Feb 09, 2017
    • Stan Zhai
      [SPARK-19509][SQL] Grouping Sets do not respect nullable grouping columns · a3d5300a
      Stan Zhai authored
      ## What changes were proposed in this pull request?
      The analyzer currently does not check if a column used in grouping sets is actually nullable itself. This can cause the nullability of the column to be incorrect, which can cause null pointer exceptions down the line. This PR fixes that by also considering the nullability of the column.
      
      This is only a problem for Spark 2.1 and below. The latest master uses a different approach.
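      A hedged illustration (not the added regression test) of the kind of query that exercises this path -- an aggregation over grouping sets, where the grouping column's nullability has to be tracked correctly:
      
      ```scala
      spark.range(10)
        .selectExpr("id", "if(id % 2 = 0, null, 'odd') AS k")  // k is nullable
        .createOrReplaceTempView("t")
      
      spark.sql(
        "SELECT k, count(*) AS cnt FROM t GROUP BY k GROUPING SETS ((k), ())"
      ).show()
      ```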
      
      Closes https://github.com/apache/spark/pull/16874
      
      ## How was this patch tested?
      Added a regression test to `SQLQueryTestSuite.grouping_set`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16873 from hvanhovell/SPARK-19509.
      a3d5300a
    • Shixiong Zhu
      [SPARK-19481] [REPL] [MAVEN] Avoid to leak SparkContext in Signaling.cancelOnInterrupt · b3fd36a1
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      `Signaling.cancelOnInterrupt` leaks a SparkContext per call and it makes ReplSuite unstable.
      
      This PR adds `SparkContext.getActive` to allow `Signaling.cancelOnInterrupt` to get the active `SparkContext` to avoid the leak.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16825 from zsxwing/SPARK-19481.
      
      (cherry picked from commit 303f00a4)
      Signed-off-by: Davies Liu <davies.liu@gmail.com>
      b3fd36a1
  8. Feb 08, 2017
    • Tathagata Das
      [SPARK-19413][SS] MapGroupsWithState for arbitrary stateful operations for branch-2.1 · 502c927b
      Tathagata Das authored
      This is a follow-up PR for merging #16758 into the Spark 2.1 branch.
      
      ## What changes were proposed in this pull request?
      
      `mapGroupsWithState` is a new API for arbitrary stateful operations in Structured Streaming, similar to `DStream.mapWithState`
      
      *Requirements*
      - Users should be able to specify a function that can do the following
      - Access the input row corresponding to a key
      - Access the previous state corresponding to a key
      - Optionally, update or remove the state
      - Output any number of new rows (or none at all)
      
      *Proposed API*
      ```
      // ------------ New methods on KeyValueGroupedDataset ------------
      class KeyValueGroupedDataset[K, V] {
      	// Scala friendly
      	def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => U)
              def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => Iterator[U])
      	// Java friendly
             def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
             def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
      }
      
      // ------------------- New Java-friendly function classes -------------------
      public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
        R call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
      }
      public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
        Iterator<R> call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
      }
      
      // ---------------------- Wrapper class for state data ----------------------
      trait KeyedState[S] {
      	def exists(): Boolean
        def get(): S              // throws an Exception if state does not exist
      	def getOption(): Option[S]
      	def update(newState: S): Unit
      	def remove(): Unit		// exists() will be false after this
      }
      ```
      
      Key Semantics of the State class
      - The state can be null.
      - If state.remove() is called, then state.exists() will return false, and getOption will return None.
      - After state.update(newState) is called, state.exists() will return true, and getOption will return Some(...).
      - None of the operations are thread-safe. This is to avoid memory barriers.
      
      *Usage*
      ```
      val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
          val newCount = words.size + runningCount.getOption.getOrElse(0L)
          runningCount.update(newCount)
         (word, newCount)
      }
      
      dataset					                        // type is Dataset[String]
        .groupByKey[String](w => w)        	                // generates KeyValueGroupedDataset[String, String]
        .mapGroupsWithState[Long, (String, Long)](stateFunc)	// returns Dataset[(String, Long)]
      ```
      
      ## How was this patch tested?
      New unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16850 from tdas/mapWithState-branch-2.1.
      502c927b
    • Herman van Hovell
      [SPARK-18609][SPARK-18841][SQL][BACKPORT-2.1] Fix redundant Alias removal in the optimizer · 71b6eacf
      Herman van Hovell authored
      This is a backport of https://github.com/apache/spark/commit/73ee73945e369a862480ef4ac64e55c797bd7d90
      
      ## What changes were proposed in this pull request?
      The optimizer tries to remove redundant alias-only projections from the query plan using the `RemoveAliasOnlyProject` rule. The current rule identifies and removes such a project and rewrites the project's attributes in the **entire** tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE) and the duplicated part contains the alias-only project; in this case the rewrite will break the tree.
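      A hedged illustration of the problematic shape described above (view and column names are made up): the view's plan ends in an alias-only Project, and a self-join duplicates that subtree.
      
      ```scala
      import spark.implicits._
      
      spark.range(5).select($"id".as("a")).createOrReplaceTempView("v")
      
      // Both sides of the join contain the same alias-only Project; rewriting
      // its attributes across the entire tree could corrupt one of the copies.
      spark.sql("SELECT * FROM v t1 JOIN v t2 ON t1.a = t2.a").show()
      ```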
      
      This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.
      
      The current tree transformation infrastructure works very well if the transformation at hand requires little or no global contextual information. In this case we need to know both the attributes that were not to be moved, and we also needed to know which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involve traversing the query plan multiple times (which is super slow). I have moved around some code in `TreeNode`, `QueryPlan` and `LogicalPlan` to make this much more straightforward; this basically allows you to manually traverse the tree.
      
      ## How was this patch tested?
      I have added unit tests to `RemoveRedundantAliasAndProjectSuite` and I have added integration tests to the `SQLQueryTestSuite.union` and `SQLQueryTestSuite.cte` test cases.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16843 from hvanhovell/SPARK-18609-2.1.
      71b6eacf
  9. Feb 07, 2017
  10. Feb 06, 2017
    • uncleGen
      [SPARK-19407][SS] defaultFS is used FileSystem.get instead of getting it from uri scheme · 62fab5be
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      ```
      Caused by: java.lang.IllegalArgumentException: Wrong FS: s3a://**************/checkpoint/7b2231a3-d845-4740-bfa3-681850e5987f/metadata, expected: file:///
      	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
      	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
      	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
      	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
      	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
      	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
      	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
      	at org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51)
      	at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:100)
      	at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
      	at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
      	at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
      ```
      
      This can easily be replicated on a Spark standalone cluster by providing a checkpoint location URI scheme other than "file://" and not overriding it in the config.
      
      Workaround: `--conf spark.hadoop.fs.defaultFS=s3a://somebucket`, or set it in SparkConf or spark-defaults.conf.
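      A hedged sketch of the fix pattern implied by the title (not the actual StreamMetadata/StreamExecution code): resolve the FileSystem from the path's own scheme instead of from fs.defaultFS.
      
      ```scala
      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}
      
      def fsFor(checkpointLocation: String, hadoopConf: Configuration): FileSystem = {
        val path = new Path(checkpointLocation)
        // Before: FileSystem.get(hadoopConf) always returns the defaultFS
        // (file:/// here), which then rejects an s3a:// checkpoint path with
        // "Wrong FS: ... expected: file:///".
        path.getFileSystem(hadoopConf)
      }
      ```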
      
      ## How was this patch tested?
      
      Existing unit tests.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16815 from uncleGen/SPARK-19407.
      
      (cherry picked from commit 7a0a630e)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      62fab5be
    • Herman van Hovell
      [SPARK-19472][SQL] Parser should not mistake CASE WHEN(...) for a function call · f55bd4c7
      Herman van Hovell authored
      
      ## What changes were proposed in this pull request?
      The SQL parser can mistake a `WHEN (...)` used in `CASE` for a function call. This happens in cases like the following:
      ```sql
      select case when (1) + case when 1 > 0 then 1 else 0 end = 2 then 1 else 0 end
      from tb
      ```
      This PR fixes this by re-organizing the case related parsing rules.
      
      ## How was this patch tested?
      Added a regression test to the `ExpressionParserSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16821 from hvanhovell/SPARK-19472.
      
      (cherry picked from commit cb2677b8)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      f55bd4c7
  11. Feb 01, 2017
    • Shixiong Zhu
      [SPARK-19432][CORE] Fix an unexpected failure when connecting timeout · 7c23bd49
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      When a connection times out, `ask` may fail with a confusing message:
      
      ```
      17/02/01 23:15:19 INFO Worker: Connecting to master ...
      java.lang.IllegalArgumentException: requirement failed: TransportClient has not yet been set.
              at scala.Predef$.require(Predef.scala:224)
              at org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
              at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
              at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
              at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
              at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
              at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
      ```
      
      It's better to provide a meaningful message.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16773 from zsxwing/connect-timeout.
      
      (cherry picked from commit 8303e20c)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      7c23bd49
    • Devaraj K
      [SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED · f9464641
      Devaraj K authored
      
      ## What changes were proposed in this pull request?
      
      The killed status was not copied when building the newTaskInfo object, which drops unnecessary details to reduce memory usage. This patch copies the killed status to the newTaskInfo object, which corrects the displayed status from the wrong status to KILLED in the Web UI.
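      A hedged, self-contained sketch of the omission (types and field names are illustrative, not the actual TaskInfo/UI code):
      
      ```scala
      case class UITaskInfo(finished: Boolean, failed: Boolean, killed: Boolean) {
        def status: String =
          if (!finished) "RUNNING"
          else if (failed) "FAILED"
          else if (killed) "KILLED"
          else "SUCCESS"
      }
      
      // The trimmed copy kept for the UI used to drop the killed flag
      // (defaulting it to false), so a killed task rendered as SUCCESS;
      // the fix carries it over.
      def trim(info: UITaskInfo): UITaskInfo =
        UITaskInfo(info.finished, info.failed, killed = info.killed)
      ```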
      
      ## How was this patch tested?
      
      Current behaviour of displaying tasks in the stage UI page:
      
      | Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      |143	|10	|0	|SUCCESS	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|
      |156	|11	|0	|SUCCESS	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|
      
      Web UI display after applying the patch:
      
      | Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      |143	|10	|0	|KILLED	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  | 0.0 B / 0	| TaskKilled (killed intentionally)|
      |156	|11	|0	|KILLED	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  |0.0 B / 0	| TaskKilled (killed intentionally)|
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #16725 from devaraj-kavali/SPARK-19377.
      
      (cherry picked from commit df4a27cc)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      f9464641
    • Zheng RuiFeng
      [SPARK-19410][DOC] Fix brokens links in ml-pipeline and ml-tuning · 61cdc8c7
      Zheng RuiFeng authored
      
      ## What changes were proposed in this pull request?
      Fix broken links in ml-pipeline and ml-tuning:
      `<div data-lang="scala">`  ->   `<div data-lang="scala" markdown="1">`
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #16754 from zhengruifeng/doc_api_fix.
      
      (cherry picked from commit 04ee8cf6)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      61cdc8c7
  12. Jan 31, 2017
  13. Jan 30, 2017
    • gatorsmile
      [SPARK-19406][SQL] Fix function to_json to respect user-provided options · 07a1788e
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      Currently, the function `to_json` allows users to provide options for generating JSON. However, it does not pass them to `JacksonGenerator`. Thus, it ignores the user-provided options. This PR fixes that. Below is an example.
      
      ```Scala
      val df = Seq(Tuple1(Tuple1(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0")))).toDF("a")
      val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
      df.select(to_json($"a", options)).show(false)
      ```
      The current output is like
      ```
      +--------------------------------------+
      |structtojson(a)                       |
      +--------------------------------------+
      |{"_1":"2015-08-26T18:00:00.000-07:00"}|
      +--------------------------------------+
      ```
      
      After the fix, the output is like
      ```
      +-------------------------+
      |structtojson(a)          |
      +-------------------------+
      |{"_1":"26/08/2015 18:00"}|
      +-------------------------+
      ```
      ### How was this patch tested?
      Added test cases for both `from_json` and `to_json`
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16745 from gatorsmile/toJson.
      
      (cherry picked from commit f9156d29)
      Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      07a1788e