  1. Feb 13, 2017
    • Liwei Lin's avatar
      [SPARK-19564][SPARK-19559][SS][KAFKA] KafkaOffsetReader's consumers should not be in the same group · fe4fcc57
      Liwei Lin authored
      
      ## What changes were proposed in this pull request?
      
In `KafkaOffsetReader`, when an error occurs, we abort the existing consumer and create a new consumer. In our current implementation, the first consumer and the second consumer would be in the same group (which leads to SPARK-19559), **_violating our intention of the two consumers not being in the same group._**
      
The cause is that, in our current implementation, the first consumer is created before `groupId` and `nextId` are initialized in the constructor. So even though `groupId` and `nextId` are advanced during the creation of that first consumer, they are then re-initialized to their default values by the constructor, and the second consumer ends up in the same group.
      
      We should make sure that `groupId` and `nextId` are initialized before any consumer is created.
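
The underlying pitfall is Scala's top-to-bottom constructor initialization order; a minimal sketch with illustrative names (not the actual `KafkaOffsetReader` code):

```scala
// Illustrative names only, not the actual KafkaOffsetReader code.
class Reader(prefix: String) {
  // Buggy ordering: this initializer runs first, while nextId is still 0 and groupId null;
  // the two field initializers below then overwrite whatever nextGroupId() set.
  private var consumer: String = createConsumer()
  private var nextId = 0
  private var groupId: String = _

  private def nextGroupId(): String = {
    groupId = s"$prefix-$nextId"
    nextId += 1
    groupId
  }

  private def createConsumer(): String = s"consumer in group ${nextGroupId()}"

  // Recreating the consumer after an error reuses "<prefix>-0": the same group as before.
  def recreateConsumer(): Unit = { consumer = createConsumer() }
}
// The fix is simply to declare (and thus initialize) nextId and groupId before consumer.
```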
      
      ## How was this patch tested?
      
Ran `KafkaSourceSuite` 100 times; all passed
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16902 from lw-lin/SPARK-19564-.
      
      (cherry picked from commit 2bdbc870)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      fe4fcc57
  2. Feb 12, 2017
    • wm624@hotmail.com's avatar
      [SPARK-19319][BACKPORT-2.1][SPARKR] SparkR Kmeans summary returns error when... · 06e77e00
      wm624@hotmail.com authored
      [SPARK-19319][BACKPORT-2.1][SPARKR] SparkR Kmeans summary returns error when the cluster size doesn't equal to k
      
      ## What changes were proposed in this pull request?
      
      Backport fix of #16666
      
      ## How was this patch tested?
      
      Backport unit tests
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #16761 from wangmiao1981/kmeansport.
      06e77e00
    • titicaca's avatar
      [SPARK-19342][SPARKR] bug fixed in collect method for collecting timestamp column · 173c2387
      titicaca authored
      
      ## What changes were proposed in this pull request?
      
Fix a bug in the collect method for collecting a timestamp column. The bug can be reproduced with the following code and output:
      
      ```
      library(SparkR)
      sparkR.session(master = "local")
      df <- data.frame(col1 = c(0, 1, 2),
                       col2 = c(as.POSIXct("2017-01-01 00:00:01"), NA, as.POSIXct("2017-01-01 12:00:01")))
      
      sdf1 <- createDataFrame(df)
      print(dtypes(sdf1))
      df1 <- collect(sdf1)
      print(lapply(df1, class))
      
      sdf2 <- filter(sdf1, "col1 > 0")
      print(dtypes(sdf2))
      df2 <- collect(sdf2)
      print(lapply(df2, class))
      ```
      
As we can see from the printed output, the column type of col2 in df2 is unexpectedly converted to numeric when NA appears at the top of the column.
      
This is caused by `do.call(c, list)`: if we combine a list this way, e.g. `do.call(c, list(NA, as.POSIXct("2017-01-01 12:00:01")))`, the class of the result is numeric instead of POSIXct.
      
      Therefore, we need to cast the data type of the vector explicitly.
      
      ## How was this patch tested?
      
      The patch can be tested manually with the same code above.
      
      Author: titicaca <fangzhou.yang@hotmail.com>
      
      Closes #16689 from titicaca/sparkr-dev.
      
      (cherry picked from commit bc0a0e63)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
      173c2387
  3. Feb 10, 2017
    • Andrew Ray's avatar
      [SPARK-18717][SQL] Make code generation for Scala Map work with immutable.Map also · e580bb03
      Andrew Ray authored
      
      ## What changes were proposed in this pull request?
      
Fixes compile errors in generated code when the user has a case class with a `scala.collection.immutable.Map` instead of a `scala.collection.Map`. Since ArrayBasedMapData.toScalaMap returns the immutable version, we can make it work with both.
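
A hedged reproduction sketch (assumes a SparkSession `spark`; the case class is illustrative):

```scala
import spark.implicits._

// Before this fix, an immutable.Map field in a case class could produce generated
// (de)serialization code that fails to compile; collection.Map already worked.
case class MapHolder(m: scala.collection.immutable.Map[String, Int])

val ds = Seq(MapHolder(Map("a" -> 1, "b" -> 2))).toDS()
ds.map(h => h.m.getOrElse("a", 0)).show()   // forces codegen for the Map field
```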
      
      ## How was this patch tested?
      
      Additional unit tests.
      
      Author: Andrew Ray <ray.andrew@gmail.com>
      
      Closes #16161 from aray/fix-map-codegen.
      
      (cherry picked from commit 46d30ac4)
Signed-off-by: Cheng Lian <lian@databricks.com>
      e580bb03
    • Burak Yavuz's avatar
      [SPARK-19543] from_json fails when the input row is empty · 7b5ea000
      Burak Yavuz authored
      
      ## What changes were proposed in this pull request?
      
      Using from_json on a column with an empty string results in: java.util.NoSuchElementException: head of empty list.
      
This is because `parser.parse(input)` may return `Nil` when `input.trim.isEmpty`.
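
A small reproduction sketch (assumes a SparkSession `spark`; the schema is arbitrary):

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StructType}
import spark.implicits._

val schema = new StructType().add("a", IntegerType)

// Before the fix, the empty-string row triggered
// java.util.NoSuchElementException: head of empty list
Seq("""{"a": 1}""", "").toDF("value")
  .select(from_json(col("value"), schema).as("parsed"))
  .show()
```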
      
      ## How was this patch tested?
      
      Regression test in `JsonExpressionsSuite`
      
      Author: Burak Yavuz <brkyvz@gmail.com>
      
      Closes #16881 from brkyvz/json-fix.
      
      (cherry picked from commit d5593f7f)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      7b5ea000
    • Bogdan Raducanu's avatar
      [SPARK-19512][BACKPORT-2.1][SQL] codegen for compare structs fails #16852 · ff5818b8
      Bogdan Raducanu authored
      ## What changes were proposed in this pull request?
      
Set `currentVars` to null in `GenerateOrdering.genComparisons` before `genCode` is called. `genCode` ignores `INPUT_ROW` if `currentVars` is not null, and in `genComparisons` we want it to use `INPUT_ROW`.
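
A hedged sketch of the kind of query that exercises `GenerateOrdering`'s generated comparisons over structs (assumes a SparkSession `spark`; not necessarily the queries added to the suite):

```scala
import spark.implicits._

// Sorting by a struct-typed column goes through GenerateOrdering, which generates
// the field-by-field comparisons this change fixes.
val df = spark.range(10).selectExpr("named_struct('a', id, 'b', id % 3) AS s")
df.sort("s").show()
df.select($"s").orderBy($"s".desc).show()
```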
      
      ## How was this patch tested?
      
      Added test with 2 queries in WholeStageCodegenSuite
      
      Author: Bogdan Raducanu <bogdan.rdc@gmail.com>
      
      Closes #16875 from bogdanrdc/SPARK-19512-2.1.
      ff5818b8
  4. Feb 09, 2017
    • Stan Zhai's avatar
      [SPARK-19509][SQL] Grouping Sets do not respect nullable grouping columns · a3d5300a
      Stan Zhai authored
      ## What changes were proposed in this pull request?
The analyzer currently does not check if a column used in grouping sets is actually nullable itself. This can cause the nullability of the column to be incorrect, which can cause null pointer exceptions down the line. This PR fixes that by also considering the nullability of the column.
      
      This is only a problem for Spark 2.1 and below. The latest master uses a different approach.
      
      Closes https://github.com/apache/spark/pull/16874
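
A hedged example of the class of query affected (assumes a SparkSession `spark`): grouping sets emit null for columns absent from a given set, so a grouping column that starts out non-nullable must be marked nullable in the output:

```scala
import spark.implicits._

Seq((1, 10), (1, 20), (2, 30)).toDF("a", "b").createOrReplaceTempView("t")

// For the grouping set (a), column b is null in the output (and vice versa),
// even though both columns are non-nullable in the source data.
spark.sql("""
  SELECT a, b, count(*) AS cnt
  FROM t
  GROUP BY a, b GROUPING SETS ((a), (b))
""").show()
```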
      
      ## How was this patch tested?
      Added a regression test to `SQLQueryTestSuite.grouping_set`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16873 from hvanhovell/SPARK-19509.
      a3d5300a
    • Shixiong Zhu's avatar
      [SPARK-19481] [REPL] [MAVEN] Avoid to leak SparkContext in Signaling.cancelOnInterrupt · b3fd36a1
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      `Signaling.cancelOnInterrupt` leaks a SparkContext per call and it makes ReplSuite unstable.
      
      This PR adds `SparkContext.getActive` to allow `Signaling.cancelOnInterrupt` to get the active `SparkContext` to avoid the leak.
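
A rough sketch of the look-up-at-use-time idea (purely illustrative; this is not Spark's actual implementation):

```scala
import java.util.concurrent.atomic.AtomicReference

// Simplified idea: keep one shared reference to the active context and look it up
// at signal time, instead of capturing a context per cancelOnInterrupt registration.
object ActiveContext {
  private val ref = new AtomicReference[Option[AnyRef]](None)
  def set(ctx: AnyRef): Unit = ref.set(Option(ctx))
  def clear(): Unit = ref.set(None)
  def get: Option[AnyRef] = ref.get()
}

def onInterrupt(): Unit =
  ActiveContext.get.foreach(ctx => println(s"cancelling all jobs of $ctx"))
```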
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16825 from zsxwing/SPARK-19481.
      
      (cherry picked from commit 303f00a4)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
      b3fd36a1
  5. Feb 08, 2017
    • Tathagata Das's avatar
      [SPARK-19413][SS] MapGroupsWithState for arbitrary stateful operations for branch-2.1 · 502c927b
      Tathagata Das authored
This is a follow-up PR for merging #16758 into the Spark 2.1 branch.
      
      ## What changes were proposed in this pull request?
      
      `mapGroupsWithState` is a new API for arbitrary stateful operations in Structured Streaming, similar to `DStream.mapWithState`
      
      *Requirements*
- Users should be able to specify a function that can do the following:
  - Access the input row corresponding to a key
  - Access the previous state corresponding to a key
  - Optionally, update or remove the state
  - Output any number of new rows (or none at all)
      
      *Proposed API*
      ```
      // ------------ New methods on KeyValueGroupedDataset ------------
      class KeyValueGroupedDataset[K, V] {
  // Scala friendly
  def mapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => U)
  def flatMapGroupsWithState[S: Encoder, U: Encoder](func: (K, Iterator[V], KeyedState[S]) => Iterator[U])
  // Java friendly
  def mapGroupsWithState[S, U](func: MapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
  def flatMapGroupsWithState[S, U](func: FlatMapGroupsWithStateFunction[K, V, S, U], stateEncoder: Encoder[S], resultEncoder: Encoder[U])
      }
      
      // ------------------- New Java-friendly function classes -------------------
public interface MapGroupsWithStateFunction<K, V, S, R> extends Serializable {
  R call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
}
public interface FlatMapGroupsWithStateFunction<K, V, S, R> extends Serializable {
  Iterator<R> call(K key, Iterator<V> values, KeyedState<S> state) throws Exception;
}
      
      // ---------------------- Wrapper class for state data ----------------------
      trait KeyedState[S] {
  def exists(): Boolean
  def get(): S                  // throws Exception if state does not exist
  def getOption(): Option[S]
  def update(newState: S): Unit
  def remove(): Unit            // exists() will be false after this
      }
      ```
      
      Key Semantics of the State class
      - The state can be null.
- If state.remove() is called, then state.exists() will return false, and getOption will return None.
- After state.update(newState) is called, state.exists() will return true, and getOption will return Some(...).
      - None of the operations are thread-safe. This is to avoid memory barriers.
      
      *Usage*
      ```
val stateFunc = (word: String, words: Iterator[String], runningCount: KeyedState[Long]) => {
          val newCount = words.size + runningCount.getOption.getOrElse(0L)
          runningCount.update(newCount)
         (word, newCount)
      }
      
      dataset					                        // type is Dataset[String]
        .groupByKey[String](w => w)        	                // generates KeyValueGroupedDataset[String, String]
        .mapGroupsWithState[Long, (String, Long)](stateFunc)	// returns Dataset[(String, Long)]
      ```
      
      ## How was this patch tested?
      New unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #16850 from tdas/mapWithState-branch-2.1.
      502c927b
    • Herman van Hovell's avatar
      [SPARK-18609][SPARK-18841][SQL][BACKPORT-2.1] Fix redundant Alias removal in the optimizer · 71b6eacf
      Herman van Hovell authored
      This is a backport of https://github.com/apache/spark/commit/73ee73945e369a862480ef4ac64e55c797bd7d90
      
      ## What changes were proposed in this pull request?
The optimizer tries to remove redundant alias-only projections from the query plan using the `RemoveAliasOnlyProject` rule. The current rule identifies and removes such a project and rewrites the project's attributes in the **entire** tree. This causes problems when parts of the tree are duplicated (for instance a self join on a temporary view/CTE) and the duplicated part contains the alias-only project; in this case the rewrite will break the tree.
      
      This PR fixes these problems by using a blacklist for attributes that are not to be moved, and by making sure that attribute remapping is only done for the parent tree, and not for unrelated parts of the query plan.
      
The current tree transformation infrastructure works very well if the transformation at hand requires little or no global contextual information. In this case we need to know both which attributes were not to be moved and which child attributes were modified. This cannot be done easily using the current infrastructure, and solutions typically involve traversing the query plan multiple times (which is super slow). I have moved around some code in `TreeNode`, `QueryPlan` and `LogicalPlan` to make this much more straightforward; this basically allows you to manually traverse the tree.
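
A hedged sketch of the problematic shape, assuming a SparkSession `spark`: a self-join over a temporary view whose plan ends in an alias-only projection:

```scala
import spark.implicits._

// The view's plan ends in a projection that only re-aliases a column; before this fix,
// removing that projection could rewrite attributes in the other branch of the self-join.
Seq(1, 2, 3).toDF("id").select($"id".as("x")).createOrReplaceTempView("v")
spark.sql("SELECT a.x FROM v a JOIN v b ON a.x = b.x").explain(true)
```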
      
      ## How was this patch tested?
      I have added unit tests to `RemoveRedundantAliasAndProjectSuite` and I have added integration tests to the `SQLQueryTestSuite.union` and `SQLQueryTestSuite.cte` test cases.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16843 from hvanhovell/SPARK-18609-2.1.
      71b6eacf
  6. Feb 07, 2017
  7. Feb 06, 2017
    • uncleGen's avatar
      [SPARK-19407][SS] defaultFS is used FileSystem.get instead of getting it from uri scheme · 62fab5be
      uncleGen authored
      ## What changes were proposed in this pull request?
      
      ```
      Caused by: java.lang.IllegalArgumentException: Wrong FS: s3a://**************/checkpoint/7b2231a3-d845-4740-bfa3-681850e5987f/metadata, expected: file:///
      	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
      	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
      	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
      	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
      	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
      	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
      	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
      	at org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51)
      	at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:100)
      	at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
      	at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
      	at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
      ```
      
This can easily be replicated on a Spark standalone cluster by providing a checkpoint location URI scheme other than "file://" and not overriding it in the config.

Workaround: `--conf spark.hadoop.fs.defaultFS=s3a://somebucket`, or set it in SparkConf or spark-defaults.conf.
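
For reference, a brief sketch of the difference the title refers to, using the standard Hadoop API (the paths are placeholders):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hadoopConf = new Configuration()
val checkpointPath = new Path("s3a://somebucket/checkpoint")

// Wrong for non-default schemes: resolves against fs.defaultFS (often file:///)
val defaultFs = FileSystem.get(hadoopConf)

// Correct: resolve the filesystem from the path's own URI scheme (s3a:// here)
val pathFs = checkpointPath.getFileSystem(hadoopConf)
```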
      
      ## How was this patch tested?
      
Existing unit tests.
      
      Author: uncleGen <hustyugm@gmail.com>
      
      Closes #16815 from uncleGen/SPARK-19407.
      
      (cherry picked from commit 7a0a630e)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      62fab5be
    • Herman van Hovell's avatar
      [SPARK-19472][SQL] Parser should not mistake CASE WHEN(...) for a function call · f55bd4c7
      Herman van Hovell authored
      
      ## What changes were proposed in this pull request?
      The SQL parser can mistake a `WHEN (...)` used in `CASE` for a function call. This happens in cases like the following:
      ```sql
      select case when (1) + case when 1 > 0 then 1 else 0 end = 2 then 1 else 0 end
      from tb
      ```
      This PR fixes this by re-organizing the case related parsing rules.
      
      ## How was this patch tested?
      Added a regression test to the `ExpressionParserSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #16821 from hvanhovell/SPARK-19472.
      
      (cherry picked from commit cb2677b8)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      f55bd4c7
  8. Feb 01, 2017
    • Shixiong Zhu's avatar
      [SPARK-19432][CORE] Fix an unexpected failure when connecting timeout · 7c23bd49
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
When a connection times out, `ask` may fail with a confusing message:
      
      ```
      17/02/01 23:15:19 INFO Worker: Connecting to master ...
      java.lang.IllegalArgumentException: requirement failed: TransportClient has not yet been set.
              at scala.Predef$.require(Predef.scala:224)
              at org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
              at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
              at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
              at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
              at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
              at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
      ```
      
      It's better to provide a meaningful message.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16773 from zsxwing/connect-timeout.
      
      (cherry picked from commit 8303e20c)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      7c23bd49
    • Devaraj K's avatar
      [SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED · f9464641
      Devaraj K authored
      
      ## What changes were proposed in this pull request?
      
Copying of the killed status was missing when building the newTaskInfo object (which drops unnecessary details to reduce memory usage). This patch copies the killed status to the newTaskInfo object, which corrects the displayed status from the wrong status to KILLED in the Web UI.
      
      ## How was this patch tested?
      
      Current behaviour of displaying tasks in stage UI page,
      
      | Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      |143	|10	|0	|SUCCESS	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|
      |156	|11	|0	|SUCCESS	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		| |0.0 B / 0	|TaskKilled (killed intentionally)|
      
      Web UI display after applying the patch,
      
      | Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
      | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
      |143	|10	|0	|KILLED	|NODE_LOCAL	|6 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  | 0.0 B / 0	| TaskKilled (killed intentionally)|
      |156	|11	|0	|KILLED	|NODE_LOCAL	|5 / x.xx.x.x stdout stderr|2017/01/25 07:49:27	|0 ms |		|0.0 B / 0		|  |0.0 B / 0	| TaskKilled (killed intentionally)|
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #16725 from devaraj-kavali/SPARK-19377.
      
      (cherry picked from commit df4a27cc)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      f9464641
    • Zheng RuiFeng's avatar
      [SPARK-19410][DOC] Fix brokens links in ml-pipeline and ml-tuning · 61cdc8c7
      Zheng RuiFeng authored
      
      ## What changes were proposed in this pull request?
Fix broken links in ml-pipeline and ml-tuning
      `<div data-lang="scala">`  ->   `<div data-lang="scala" markdown="1">`
      
      ## How was this patch tested?
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #16754 from zhengruifeng/doc_api_fix.
      
      (cherry picked from commit 04ee8cf6)
Signed-off-by: Sean Owen <sowen@cloudera.com>
      61cdc8c7
  9. Jan 31, 2017
  10. Jan 30, 2017
    • gatorsmile's avatar
      [SPARK-19406][SQL] Fix function to_json to respect user-provided options · 07a1788e
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
Currently, the function `to_json` allows users to provide options for generating JSON. However, it does not pass them to `JacksonGenerator`. Thus, the user-provided options are ignored. This PR fixes that. Below is an example.
      
      ```Scala
      val df = Seq(Tuple1(Tuple1(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0")))).toDF("a")
      val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
      df.select(to_json($"a", options)).show(false)
      ```
      The current output is like
      ```
      +--------------------------------------+
      |structtojson(a)                       |
      +--------------------------------------+
      |{"_1":"2015-08-26T18:00:00.000-07:00"}|
      +--------------------------------------+
      ```
      
      After the fix, the output is like
      ```
      +-------------------------+
      |structtojson(a)          |
      +-------------------------+
      |{"_1":"26/08/2015 18:00"}|
      +-------------------------+
      ```
      ### How was this patch tested?
      Added test cases for both `from_json` and `to_json`
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16745 from gatorsmile/toJson.
      
      (cherry picked from commit f9156d29)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      07a1788e
    • gatorsmile's avatar
      [SPARK-19396][DOC] JDBC Options are Case In-sensitive · 445438c9
      gatorsmile authored
      ### What changes were proposed in this pull request?
JDBC option keys are case-insensitive, after the PR https://github.com/apache/spark/pull/15884 was merged into Spark 2.1.
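
For illustration only (the connection URL and table below are placeholders), option keys can be written in any case:

```scala
// "FETCHSIZE" and "fetchsize" are treated the same after #15884.
val jdbcDF = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://localhost/test")   // placeholder connection URL
  .option("dbtable", "public.people")                   // placeholder table
  .option("FETCHSIZE", "100")
  .load()
```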
      
      ### How was this patch tested?
      N/A
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #16734 from gatorsmile/fixDocCaseInsensitive.
      
      (cherry picked from commit c0eda7e8)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      445438c9
  11. Jan 27, 2017
    • Felix Cheung's avatar
[SPARK-19324][SPARKR] Spark JVM stdout output is getting dropped in SparkR · 9a49f9af
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
This mostly affects running a job from the driver in client mode when results are expected to come through stdout (which should be somewhat rare, but possible)
      
      Before:
      ```
      > a <- as.DataFrame(cars)
      > b <- group_by(a, "dist")
      > c <- count(b)
      > sparkR.callJMethod(c$countjc, "explain", TRUE)
      NULL
      ```
      
      After:
      ```
      > a <- as.DataFrame(cars)
      > b <- group_by(a, "dist")
      > c <- count(b)
      > sparkR.callJMethod(c$countjc, "explain", TRUE)
      count#11L
      NULL
      ```
      
Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex examples with calls to `println` on the Scala/JVM side whose output is getting dropped.
      
      ## How was this patch tested?
      
      manual
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16670 from felixcheung/rjvmstdout.
      
      (cherry picked from commit a7ab6f9a)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
      9a49f9af
    • Felix Cheung's avatar
      [SPARK-19333][SPARKR] Add Apache License headers to R files · 4002ee97
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
Add license headers
      
      ## How was this patch tested?
      
      Manual run to check vignettes html is created properly
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16709 from felixcheung/rfilelicense.
      
      (cherry picked from commit 385d7384)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
      4002ee97
  12. Jan 26, 2017
    • Felix Cheung's avatar
      [SPARK-18788][SPARKR] Add API for getNumPartitions · ba2a5ada
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
With documentation noting that this converts the DataFrame into an RDD
      
      ## How was this patch tested?
      
      unit tests, manual tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16668 from felixcheung/rgetnumpartitions.
      
      (cherry picked from commit 90817a6c)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
      ba2a5ada
    • Marcelo Vanzin's avatar
      [SPARK-19220][UI] Make redirection to HTTPS apply to all URIs. (branch-2.1) · 59502bbc
      Marcelo Vanzin authored
      The redirect handler was installed only for the root of the server;
      any other context ended up being served directly through the HTTP
      port. Since every sub page (e.g. application UIs in the history
      server) is a separate servlet context, this meant that everything
      but the root was accessible via HTTP still.
      
      The change adds separate names to each connector, and binds contexts
      to specific connectors so that content is only served through the
      HTTPS connector when it's enabled. In that case, the only thing that
      binds to the HTTP connector is the redirect handler.
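
A rough sketch of the binding mechanism using standard Jetty APIs (connector names, contexts, and TLS setup are illustrative, not the actual Spark code):

```scala
import org.eclipse.jetty.server.{Connector, Handler, Server, ServerConnector}
import org.eclipse.jetty.server.handler.ContextHandlerCollection
import org.eclipse.jetty.servlet.ServletContextHandler

val server = new Server()

// Give each connector a name...
val httpConnector = new ServerConnector(server)
httpConnector.setName("http")
val httpsConnector = new ServerConnector(server)   // TLS configuration omitted
httpsConnector.setName("https")
server.setConnectors(Array[Connector](httpConnector, httpsConnector))

// ...and bind each context to a connector via the "@name" virtual-host syntax.
val appContext = new ServletContextHandler()
appContext.setContextPath("/history")
appContext.setVirtualHosts(Array("@https"))        // content only over HTTPS

val redirectContext = new ServletContextHandler()
redirectContext.setContextPath("/")
redirectContext.setVirtualHosts(Array("@http"))    // HTTP serves only the redirect

val contexts = new ContextHandlerCollection()
contexts.setHandlers(Array[Handler](appContext, redirectContext))
server.setHandler(contexts)
```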
      
      Tested with new unit tests and by checking a live history server.
      
      (cherry picked from commit d3dcb63b)
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16711 from vanzin/SPARK-19220_2.1.
      59502bbc
    • Takeshi YAMAMURO's avatar
      [SPARK-19338][SQL] Add UDF names in explain · b12a76a4
      Takeshi YAMAMURO authored
      
      ## What changes were proposed in this pull request?
This PR adds a variable for the UDF name in `ScalaUDF`.
Then, if the variable is filled, `DataFrame#explain` prints the name.
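
A small usage sketch (assumes a SparkSession `spark` with implicits imported; the UDF is illustrative):

```scala
import spark.implicits._

val toUpper = spark.udf.register("toUpper", (s: String) => s.toUpperCase)

// With this change, explain() shows the registered name (e.g. "UDF:toUpper(s)")
// instead of an anonymous UDF.
Seq("spark").toDF("s").select(toUpper($"s")).explain()
```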
      
      ## How was this patch tested?
      Added a test in `UDFSuite`.
      
      Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>
      
      Closes #16707 from maropu/SPARK-19338.
      
      (cherry picked from commit 9f523d31)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      b12a76a4
  13. Jan 25, 2017
    • Tathagata Das's avatar
      [SPARK-14804][SPARK][GRAPHX] Fix checkpointing of VertexRDD/EdgeRDD · 0d7e3852
      Tathagata Das authored
      
      ## What changes were proposed in this pull request?
      
EdgeRDD/VertexRDD override checkpoint() and isCheckpointed() to forward these to the internal partitionsRDD. So when checkpoint() is called on them, it's the partitionsRDD that actually gets checkpointed. However, since isCheckpointed() is also overridden to call partitionsRDD.isCheckpointed, EdgeRDD/VertexRDD.isCheckpointed returns true even though this RDD is actually not checkpointed.
      
This would have been fine except that the RDD's internal logic for computing the RDD depends on isCheckpointed(). So for VertexRDD/EdgeRDD, since isCheckpointed is true, Spark tries to read checkpoint data of the VertexRDD/EdgeRDD when computing them, even though they are not actually checkpointed. Through a crazy sequence of call forwarding, it reads the checkpoint data of the partitionsRDD and tries to cast it to the types in VertexRDD/EdgeRDD. This leads to a ClassCastException.
      
The minimal fix that does not change any public behavior is to modify the RDD internals to not use the public, overridable API for internal logic.
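
A hedged sketch of the symptom using the public GraphX API (assumes a SparkContext `sc` and a writable checkpoint directory; the data is illustrative):

```scala
import org.apache.spark.graphx.{Edge, Graph}

sc.setCheckpointDir("/tmp/graphx-checkpoints")   // placeholder directory

val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows")))
val graph = Graph.fromEdges(edges, defaultValue = 0)

graph.vertices.cache()
graph.vertices.checkpoint()
graph.vertices.count()   // materializes and checkpoints the internal partitionsRDD
// Pre-fix: isCheckpointed forwards to partitionsRDD and reports true, yet the
// VertexRDD itself was never checkpointed; recomputation could then try to read
// the partitionsRDD checkpoint data as vertex data and fail with a ClassCastException.
println(graph.vertices.isCheckpointed)
```
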
      ## How was this patch tested?
      
      New unit tests.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #15396 from tdas/SPARK-14804.
      
      (cherry picked from commit 47d5d0dd)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
      0d7e3852
    • Holden Karau's avatar
      [SPARK-19064][PYSPARK] Fix pip installing of sub components · a5c10ff2
      Holden Karau authored
      
      ## What changes were proposed in this pull request?
      
Fix installation of the mllib and ml sub-components, and more eagerly clean up cache files during the test script & make-distribution.
      
      ## How was this patch tested?
      
      Updated sanity test script to import mllib and ml sub-components.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #16465 from holdenk/SPARK-19064-fix-pip-install-sub-components.
      
      (cherry picked from commit 965c82d8)
Signed-off-by: Holden Karau <holden@us.ibm.com>
      a5c10ff2
    • Marcelo Vanzin's avatar
      [SPARK-18750][YARN] Follow up: move test to correct directory in 2.1 branch. · 97d3353e
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16704 from vanzin/SPARK-18750_2.1.
      97d3353e
    • Marcelo Vanzin's avatar
      [SPARK-19307][PYSPARK] Make sure user conf is propagated to SparkContext. · c9f075ab
      Marcelo Vanzin authored
      
      The code was failing to propagate the user conf in the case where the
      JVM was already initialized, which happens when a user submits a
      python script via spark-submit.
      
      Tested with new unit test and by running a python script in a real cluster.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16682 from vanzin/SPARK-19307.
      
      (cherry picked from commit 92afaa93)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      c9f075ab
    • Nattavut Sutyanyong's avatar
      [SPARK-18863][SQL] Output non-aggregate expressions without GROUP BY in a... · af954553
      Nattavut Sutyanyong authored
      [SPARK-18863][SQL] Output non-aggregate expressions without GROUP BY in a subquery does not yield an error
      
      ## What changes were proposed in this pull request?
This PR will report proper error messages when a subquery expression contains an invalid plan. This problem is fixed by calling CheckAnalysis for the plan inside a subquery.
      
      ## How was this patch tested?
      Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery.
      
      ````
      -- TC 01.01
      -- The column t2b in the SELECT of the subquery is invalid
      -- because it is neither an aggregate function nor a GROUP BY column.
      select t1a, t2b
      from   t1, t2
      where  t1b = t2c
      and    t2b = (select max(avg)
                    from   (select   t2b, avg(t2b) avg
                            from     t2
                            where    t2a = t1.t1b
                           )
                   )
      ;
      
      -- TC 01.02
      -- Invalid due to the column t2b not part of the output from table t2.
      select *
      from   t1
      where  t1a in (select   min(t2a)
                     from     t2
                     group by t2c
                     having   t2c in (select   max(t3c)
                                      from     t3
                                      group by t3b
                                      having   t3b > t2b ))
      ;
      ````
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #16572 from nsyca/18863.
      
      (cherry picked from commit f1ddca5f)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      af954553
    • Marcelo Vanzin's avatar
      [SPARK-18750][YARN] Avoid using "mapValues" when allocating containers. · f391ad2c
      Marcelo Vanzin authored
      
That method is prone to stack overflows when the input map is really large; instead, use plain "map". Also includes a unit test that caused stack overflows without the fix.
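
A minimal sketch of the hazard (illustrative data, not the YARN allocator code): `mapValues` returns a lazy view, so repeated application nests views and forcing the result recurses once per layer, while `.map` materializes each step:

```scala
val base: scala.collection.Map[String, Int] = Map("container1" -> 1)

// Lazy: each call just wraps the previous map in a view without evaluating it.
val nested = (1 to 50000).foldLeft(base)((m, _) => m.mapValues(_ + 1))
// nested("container1")   // forcing it recurses once per layer -> StackOverflowError

// Strict: .map builds a concrete Map at every step.
val strict = (1 to 50000).foldLeft(base)((m, _) => m.map { case (k, v) => (k, v + 1) })
println(strict("container1"))   // 50001
```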
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16667 from vanzin/SPARK-18750.
      
      (cherry picked from commit 76db394f)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>
      f391ad2c
    • aokolnychyi's avatar
      [SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide · e2f77392
      aokolnychyi authored
      ## What changes were proposed in this pull request?
      
      - A separate subsection for Aggregations under “Getting Started” in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
      - Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
      - Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
      - Python is not covered.
      - The PR might not resolve the ticket since I do not know what exactly was planned by the author.
      
      In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide references to these examples and does not contain hard-coded snippets.
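
For a flavor of the type-safe path described above, here is a brief `Aggregator` sketch in the spirit of the new subsection (a simple average; not necessarily the exact example added by the PR):

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

case class Employee(name: String, salary: Double)
case class AvgBuffer(sum: Double, count: Long)

object MyAverage extends Aggregator[Employee, AvgBuffer, Double] {
  def zero: AvgBuffer = AvgBuffer(0.0, 0L)
  def reduce(b: AvgBuffer, e: Employee): AvgBuffer = AvgBuffer(b.sum + e.salary, b.count + 1)
  def merge(b1: AvgBuffer, b2: AvgBuffer): AvgBuffer = AvgBuffer(b1.sum + b2.sum, b1.count + b2.count)
  def finish(b: AvgBuffer): Double = b.sum / b.count
  def bufferEncoder: Encoder[AvgBuffer] = Encoders.product
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

// Usage, given ds: Dataset[Employee]:
// ds.select(MyAverage.toColumn.name("average_salary")).show()
```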
      
      ## How was this patch tested?
      
      The patch was tested locally by building the docs. The examples were run as well.
      
![image](https://cloud.githubusercontent.com/assets/6235869/21292915/04d9d084-c515-11e6-811a-999d598dffba.png)
      
      Author: aokolnychyi <okolnychyyanton@gmail.com>
      
      Closes #16329 from aokolnychyi/SPARK-16046.
      
      (cherry picked from commit 3fdce814)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
      e2f77392
  14. Jan 24, 2017
    • Liwei Lin's avatar
      [SPARK-19330][DSTREAMS] Also show tooltip for successful batches · c1337879
      Liwei Lin authored
      ## What changes were proposed in this pull request?
      
      ### Before
      ![_streaming_before](https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png)
      
      ### After
![_streaming_after](https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png)
      
      ## How was this patch tested?
      
      Manually
      
      Author: Liwei Lin <lwlin7@gmail.com>
      
      Closes #16673 from lw-lin/streaming.
      
      (cherry picked from commit 40a4cfc7)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      c1337879
    • Nattavut Sutyanyong's avatar
      [SPARK-19017][SQL] NOT IN subquery with more than one column may return incorrect results · b94fb284
      Nattavut Sutyanyong authored
      
      ## What changes were proposed in this pull request?
      
This PR fixes the code in the Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in the rule `RewritePredicateSubquery`.
      
      Example:
      The query
      
       select a1,b1
       from   t1
       where  (a1,b1) not in (select a2,b2
                              from   t2);
      
has the (a1, b1) = (a2, b2) condition rewritten from (before this fix):
      
      Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))
      
      to (after this fix):
      
      Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))
      
      ## How was this patch tested?
      
      sql/test, catalyst/test and new test cases in SQLQueryTestSuite.
      
      Author: Nattavut Sutyanyong <nsy.can@gmail.com>
      
      Closes #16467 from nsyca/19017.
      
      (cherry picked from commit cdb691eb)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>
      b94fb284
    • Ilya Matiach's avatar
      [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case · d128b6a3
      Ilya Matiach authored
      [SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case where no children exist in updateAssignments
      
      ## What changes were proposed in this pull request?
      
      Fix a bug in which BisectingKMeans fails with error:
      java.util.NoSuchElementException: key not found: 166
              at scala.collection.MapLike$class.default(MapLike.scala:228)
              at scala.collection.AbstractMap.default(Map.scala:58)
              at scala.collection.MapLike$class.apply(MapLike.scala:141)
              at scala.collection.AbstractMap.apply(Map.scala:58)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
              at scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
              at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
              at scala.collection.immutable.List.foldLeft(List.scala:84)
              at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
              at scala.collection.immutable.List.reduceLeft(List.scala:84)
              at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231)
              at scala.collection.AbstractTraversable.minBy(Traversable.scala:105)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
              at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
              at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
              at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)
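
A hedged usage sketch of the public API involved (toy data and parameters, not the dataset that triggered the failure):

```scala
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

// Two tight pairs of points; prior to the fix, certain splits where a candidate
// child cluster ended up with no assignable points could hit "key not found".
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)
))

val model = new BisectingKMeans().setK(4).setSeed(1L).run(data)
model.clusterCenters.foreach(println)
```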
      
      ## How was this patch tested?
      
      The dataset was run against the code change to verify that the code works.  I will try to add unit tests to the code.
      
      
      Author: Ilya Matiach <ilmat@microsoft.com>
      
      Closes #16355 from imatiach-msft/ilmat/fix-kmeans.
      d128b6a3
    • Felix Cheung's avatar
      [SPARK-18823][SPARKR] add support for assigning to column · 9c04e427
      Felix Cheung authored
      
      ## What changes were proposed in this pull request?
      
      Support for
      ```
      df[[myname]] <- 1
      df[[2]] <- df$eruptions
      ```
      
      ## How was this patch tested?
      
      manual tests, unit tests
      
      Author: Felix Cheung <felixcheung_m@hotmail.com>
      
      Closes #16663 from felixcheung/rcolset.
      
      (cherry picked from commit f27e0247)
Signed-off-by: Felix Cheung <felixcheung@apache.org>
      9c04e427
    • Shixiong Zhu's avatar
      [SPARK-19268][SS] Disallow adaptive query execution for streaming queries · 570e5e11
      Shixiong Zhu authored
      
      ## What changes were proposed in this pull request?
      
      As adaptive query execution may change the number of partitions in different batches, it may break streaming queries. Hence, we should disallow this feature in Structured Streaming.
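
For context, the SQL conf at issue is `spark.sql.adaptive.enabled`; a hedged sketch of a streaming aggregation where per-batch changes to shuffle partitioning would be a problem (the socket source and console sink are placeholders):

```scala
// The flag at issue; before this change a streaming query could pick it up.
spark.conf.set("spark.sql.adaptive.enabled", "true")

val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", "9999")
  .load()

// The aggregation introduces a shuffle, which is exactly where adaptive execution
// would change the partition count from batch to batch.
val counts = lines.groupBy("value").count()

val query = counts.writeStream
  .outputMode("complete")
  .format("console")
  .start()
```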
      
      ## How was this patch tested?
      
      `test("SPARK-19268: Adaptive query execution should be disallowed")`.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #16683 from zsxwing/SPARK-19268.
      
      (cherry picked from commit 60bd91a3)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      570e5e11