  1. Mar 11, 2016
    • Josh Rosen's avatar
      [SPARK-13294][PROJECT INFRA] Remove MiMa's dependency on spark-class / Spark assembly · 6ca990fb
      Josh Rosen authored
      This patch removes the need to build a full Spark assembly before running the `dev/mima` script.
      
      - I modified the `tools` project to remove a direct dependency on Spark, so `sbt/sbt tools/fullClasspath` will now return the classpath for the `GenerateMIMAIgnore` class itself plus its own dependencies.
         - This required me to delete two classes full of dead code that we don't use anymore
      - `GenerateMIMAIgnore` now uses [ClassUtil](http://software.clapper.org/classutil/) to find all of the Spark classes rather than our homemade JAR traversal code. The problem in our own code was that it didn't handle folders of classes properly, which is necessary in order to generate excludes with an assembly-free Spark build.
      - `./dev/mima` no longer runs through `spark-class`, eliminating the need to reason about classpath ordering between `SPARK_CLASSPATH` and the assembly.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11178 from JoshRosen/remove-assembly-in-run-tests.
      6ca990fb
    • Zheng RuiFeng's avatar
      [SPARK-13672][ML] Add python examples of BisectingKMeans in ML and MLLIB · d18276cb
      Zheng RuiFeng authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13672
      
      ## What changes were proposed in this pull request?
      
      Add two Python examples of BisectingKMeans, one for ML and one for MLlib.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Zheng RuiFeng <ruifengz@foxmail.com>
      
      Closes #11515 from zhengruifeng/mllib_bkm_pe.
      d18276cb
    • Marcelo Vanzin's avatar
      [MINOR][CORE] Fix a duplicate "and" in a log message. · e33bc67c
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11642 from vanzin/spark-conf-typo.
      e33bc67c
  2. Mar 10, 2016
    • Wenchen Fan's avatar
      [HOT-FIX] fix compile · 74c4e265
      Wenchen Fan authored
      Fix the compilation failure introduced by https://github.com/apache/spark/pull/11555 because of a merge conflict.
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11648 from cloud-fan/hotbug.
      74c4e265
    • Wenchen Fan's avatar
      [SPARK-12718][SPARK-13720][SQL] SQL generation support for window functions · 6871cc8f
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      Add SQL generation support for window functions. The idea is simple: treat the `Window` operator like `Project`, i.e. add a subquery to its child when necessary, generate a `SELECT ... FROM ...` SQL string, and implement the `sql` method for window-related expressions, e.g. `WindowSpecDefinition`, `WindowFrame`, etc.
      
      This PR also fixes SPARK-13720 by improving the process of adding extra `SubqueryAlias` (the `RecoverScopingInfo` rule). Before this PR, we updated the qualifiers in the project list while adding the subquery. However, this is incomplete, as we also need to update the qualifiers in all ancestors that refer to attributes here. In this PR, we split `RecoverScopingInfo` into 2 rules: `AddSubQuery` and `UpdateQualifier`. `AddSubQuery` only adds a subquery when necessary, and `UpdateQualifier` re-propagates and updates qualifiers bottom up.
      
      Ideally we should put the bug fix part in an individual PR, but this bug also blocks the window stuff, so I put them together here.
      
      Many thanks to gatorsmile for the initial discussion and test cases!
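      
      For illustration, the kind of query covered is sketched below — a hedged, spark-shell-style example; the Hive table `src(key INT, value STRING)` and the enclosing Hive-enabled `sqlContext` are assumed for demonstration and are not part of this PR:
      
      ```scala
      // After this PR, the logical plan of a query like this can be converted
      // back into a SELECT ... FROM ... string by the SQL builder.
      val df = sqlContext.sql(
        """SELECT key, value,
          |       MAX(key) OVER (PARTITION BY value ORDER BY key) AS max_key
          |FROM src
        """.stripMargin)
      df.show()
      ```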
      
      ## How was this patch tested?
      
      new tests in `LogicalPlanToSQLSuite`
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #11555 from cloud-fan/window.
      6871cc8f
    • gatorsmile's avatar
      [SPARK-13732][SPARK-13797][SQL] Remove projectList from Window and Eliminate useless Window · 560489f4
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      `projectList` is useless; its value is always the same as `child.output`. Remove it from the `Window` class. The removal simplifies the code in the Analyzer and Optimizer.
      
      This PR is based on the discussion started by cloud-fan in a separate PR:
      https://github.com/apache/spark/pull/5604#discussion_r55140466
      
      This PR also eliminates useless `Window` operators.
      
      cloud-fan yhuai
      
      #### How was this patch tested?
      
      Existing test cases cover it.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      Author: xiaoli <lixiao1983@gmail.com>
      Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>
      
      Closes #11565 from gatorsmile/removeProjListWindow.
      560489f4
    • Yanbo Liang's avatar
      [SPARK-13389][SPARKR] SparkR support first/last with ignore NAs · 4d535d1f
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      
      Add SparkR support for `first`/`last` with the option to ignore NAs.
      
      cc sun-rui felixcheung shivaram
      
      ## How was this patch tested?
      
      unit tests
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11267 from yanboliang/spark-13389.
      4d535d1f
    • Sameer Agarwal's avatar
      [SPARK-13789] Infer additional constraints from attribute equality · c3a6269c
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for inferring an additional set of data constraints based on attribute equality. For example, if an operator has constraints of the form (`a = 5`, `a = b`), we can now automatically infer an additional constraint of the form `b = 5`.
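      
      As a self-contained toy sketch of the idea (illustrative classes only, not Catalyst's actual constraint code): once we know an equality between two attributes, any literal constraint on one side can be copied to the other.
      
      ```scala
      sealed trait Expr
      case class Attr(name: String) extends Expr
      case class Lit(value: Int) extends Expr
      case class EqualTo(left: Expr, right: Expr) extends Expr
      
      // If the constraint set contains `a = b` and `a = 5`, also add `b = 5`.
      def inferFromEquality(constraints: Set[EqualTo]): Set[EqualTo] = {
        val extra = for {
          EqualTo(a: Attr, b: Attr) <- constraints
          EqualTo(x: Attr, lit: Lit) <- constraints
          if x == a || x == b
          other = if (x == a) b else a
        } yield EqualTo(other, lit)
        constraints ++ extra
      }
      
      // inferFromEquality(Set(EqualTo(Attr("a"), Lit(5)), EqualTo(Attr("a"), Attr("b"))))
      // now also contains EqualTo(Attr("b"), Lit(5)).
      ```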
      
      ## How was this patch tested?
      
      Tested that new constraints are properly inferred for filters (by adding a new test) and equi-joins (by modifying an existing test)
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11618 from sameeragarwal/infer-isequal-constraints.
      c3a6269c
    • Oscar D. Lara Yejas's avatar
      [SPARK-13327][SPARKR] Added parameter validations for colnames<- · 416e71af
      Oscar D. Lara Yejas authored
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.attlocal.net>
      Author: Oscar D. Lara Yejas <odlaraye@oscars-mbp.usca.ibm.com>
      
      Closes #11220 from olarayej/SPARK-13312-3.
      416e71af
    • Dongjoon Hyun's avatar
      [MINOR][DOC] Fix supported hive version in doc · 88fa8666
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Today, Spark 1.6.1 and updated docs were released. Unfortunately, the docs still contain obsolete Hive version information: [Building Spark](http://spark.apache.org/docs/latest/building-spark.html#building-with-hive-and-jdbc-support). This PR fixes the following two lines.
      ```
      -By default Spark will build with Hive 0.13.1 bindings.
      +By default Spark will build with Hive 1.2.1 bindings.
      -# Apache Hadoop 2.4.X with Hive 13 support
      +# Apache Hadoop 2.4.X with Hive 1.2.1 support
      ```
      The `sql/README.md` file also describes this.
      
      ## How was this patch tested?
      
      Manual.
      
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11639 from dongjoon-hyun/fix_doc_hive_version.
      88fa8666
    • Cheng Lian's avatar
      [SPARK-13244][SQL] Migrates DataFrame to Dataset · 1d542785
      Cheng Lian authored
      ## What changes were proposed in this pull request?
      
      This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and making `DataFrame` a type alias of `Dataset[Row]`.
      
      Most Scala code changes are source compatible, but the Java API is broken, as Java knows nothing about Scala type aliases (the fixes mostly amount to replacing `DataFrame` with `Dataset<Row>`).
      
      There are several noticeable API changes related to those returning arrays:
      
      1.  `collect`/`take`
      
          -   Old APIs in class `DataFrame`:
      
              ```scala
              def collect(): Array[Row]
              def take(n: Int): Array[Row]
              ```
      
          -   New APIs in class `Dataset[T]`:
      
              ```scala
              def collect(): Array[T]
              def take(n: Int): Array[T]
      
              def collectRows(): Array[Row]
              def takeRows(n: Int): Array[Row]
              ```
      
          Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` on the Java side.
      
          Normally, Java users may fall back to `collectAsList` and `takeAsList`.  The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).
      
      1.  `randomSplit`
      
          -   Old APIs in class `DataFrame`:
      
              ```scala
              def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
              def randomSplit(weights: Array[Double]): Array[DataFrame]
              ```
      
          -   New APIs in class `Dataset[T]`:
      
              ```scala
              def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
              def randomSplit(weights: Array[Double]): Array[Dataset[T]]
              ```
      
          This has a similar problem as above, but it hasn't been addressed for the Java API yet. We can probably add `randomSplitAsList` to fix it.
      
      1.  `groupBy`
      
          Some original `DataFrame.groupBy` methods have conflicting signatures with original `Dataset.groupBy` methods. To distinguish the two, the typed `Dataset.groupBy` methods are renamed to `groupByKey`.
      
      Other noticeable changes:
      
      1.  Dataset always does eager analysis now
      
          We used to support disabling DataFrame eager analysis to help report a partially analyzed, malformed logical plan on analysis failure. However, Dataset encoders require eager analysis during Dataset construction. To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures. This plan is passed by `QueryExecution.assertAnalyzed`.
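      
      As a hedged, spark-shell-style sketch of what the unification means for Scala users (the `Person` class and data are made up; this is not code from the PR):
      
      ```scala
      // DataFrame is now just an alias: type DataFrame = Dataset[Row],
      // so the same object can be used untyped (Row) or viewed as a typed Dataset.
      import org.apache.spark.sql.{Dataset, Row}
      import sqlContext.implicits._
      
      case class Person(name: String, age: Long)
      
      val df = Seq(Person("alice", 29), Person("bob", 31)).toDF()  // DataFrame == Dataset[Row]
      val ds: Dataset[Person] = df.as[Person]                      // typed view of the same data
      
      val rows: Array[Row] = df.collect()        // Array[Row], since T = Row for a DataFrame
      val people: Array[Person] = ds.collect()   // typed collect on Dataset[T]
      ```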
      
      ## How was this patch tested?
      
      Existing tests do the work.
      
      ## TODO
      
      - [ ] Fix all tests
      - [ ] Re-enable MiMA check
      - [ ] Update ScalaDoc (`since`, `group`, and example code)
      
      Author: Cheng Lian <lian@databricks.com>
      Author: Yin Huai <yhuai@databricks.com>
      Author: Wenchen Fan <wenchen@databricks.com>
      Author: Cheng Lian <liancheng@users.noreply.github.com>
      
      Closes #11443 from liancheng/ds-to-df.
      1d542785
    • Shixiong Zhu's avatar
      [SPARK-13604][CORE] Sync worker's state after registering with master · 27fe6bac
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      This lists all the cases in which the Master cannot talk to a Worker for a while and the network then comes back.
      
      1. Master doesn't know the network issue (not yet timeout)
      
        a. Worker doesn't know the network issue (onDisconnected is not called)
          - Worker keeps sending Heartbeat. Neither Worker nor Master knows about the network issue, so there is nothing to do. (Eventually, Master will notice the heartbeat timeout if the network is not recovered.)
      
        b. Worker knows the network issue (onDisconnected is called)
          - Worker stops sending Heartbeat and sends `RegisterWorker` to master. Master will reply `RegisterWorkerFailed("Duplicate worker ID")`. Worker calls "System.exit(1)" (Finally, Master will notice the heartbeat timeout if network is not recovered) (May leak driver processes. See [SPARK-13602](https://issues.apache.org/jira/browse/SPARK-13602))
      
      2. Worker timeout (Master knows about the network issue). In this case, Master removes the Worker and its executors and drivers.
      
        a. Worker doesn't know the network issue (onDisconnected is not called)
          - Worker keeps sending Heartbeat.
          - If the network is back, say Master receives Heartbeat, Master sends `ReconnectWorker` to Worker
          - Worker send `RegisterWorker` to master.
          - Master accepts `RegisterWorker` but doesn't know executors and drivers in Worker. (may leak executors)
      
        b. Worker knows the network issue (onDisconnected is called)
          - Worker stops sending `Heartbeat` and sends `RegisterWorker` to Master.
          - Master accepts `RegisterWorker` but doesn't know executors and drivers in Worker. (may leak executors)
      
      This PR fixes executors and drivers leak in 2.a and 2.b when Worker reregisters with Master. The approach is making Worker send `WorkerLatestState` to sync the state after registering with master successfully. Then Master will ask Worker to kill unknown executors and drivers.
      
      Note: the Worker cannot just kill executors after registering with the Master, because in the Worker, `LaunchExecutor` and `RegisteredWorker` are processed in two threads. If `LaunchExecutor` happens before `RegisteredWorker`, the Worker's executor list will contain new executors after the Master accepts `RegisterWorker`. We should not kill these executors, so the Worker sends the list to the Master and lets the Master tell the Worker which executors should be killed.
      
      ## How was this patch tested?
      
      test("SPARK-13604: Master should ask Worker kill unknown executors and drivers")
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11455 from zsxwing/orphan-executors.
      27fe6bac
    • Davies Liu's avatar
      [SPARK-13751] [SQL] generate better code for Filter · 020ff8cd
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      This PR improves the codegen of Filter by:
      
      1. Filtering out rows early if they contain a null value that would cause the condition to evaluate to null or false. After this, we can simplify the condition, because the input is no longer nullable.
      
      2. Splitting the condition into conjunctive predicates, then checking them one by one (a small sketch of this splitting step follows the generated code below).
      
      Here is a piece of generated code for Filter in TPCDS Q55:
      ```java
      /* 109 */       /*** CONSUME: Filter ((((isnotnull(d_moy#149) && isnotnull(d_year#147)) && (d_moy#149 = 11)) && (d_year#147 = 1999)) && isnotnull(d_date_sk#141)) */
      /* 110 */       /* input[0, int] */
      /* 111 */       boolean project_isNull2 = rdd_row.isNullAt(0);
      /* 112 */       int project_value2 = project_isNull2 ? -1 : (rdd_row.getInt(0));
      /* 113 */       /* input[1, int] */
      /* 114 */       boolean project_isNull3 = rdd_row.isNullAt(1);
      /* 115 */       int project_value3 = project_isNull3 ? -1 : (rdd_row.getInt(1));
      /* 116 */       /* input[2, int] */
      /* 117 */       boolean project_isNull4 = rdd_row.isNullAt(2);
      /* 118 */       int project_value4 = project_isNull4 ? -1 : (rdd_row.getInt(2));
      /* 119 */
      /* 120 */       if (project_isNull3) continue;
      /* 121 */       if (project_isNull4) continue;
      /* 122 */       if (project_isNull2) continue;
      /* 123 */
      /* 124 */       /* (input[1, int] = 11) */
      /* 125 */       boolean filter_value6 = false;
      /* 126 */       filter_value6 = project_value3 == 11;
      /* 127 */       if (!filter_value6) continue;
      /* 128 */
      /* 129 */       /* (input[2, int] = 1999) */
      /* 130 */       boolean filter_value9 = false;
      /* 131 */       filter_value9 = project_value4 == 1999;
      /* 132 */       if (!filter_value9) continue;
      /* 133 */
      /* 134 */       filter_metricValue1.add(1);
      /* 135 */
      /* 136 */       /*** CONSUME: Project [d_date_sk#141] */
      /* 137 */
      /* 138 */       project_rowWriter1.write(0, project_value2);
      /* 139 */       append(project_result1.copy());
      ```
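      
      The "split into conjunctive predicates" step can be pictured with a small self-contained sketch (illustrative types only, not Spark's internal implementation):
      
      ```scala
      // Recursively flatten a tree of ANDs into a flat sequence of predicates,
      // so each one can be checked (and short-circuited) individually.
      sealed trait Expression
      case class And(left: Expression, right: Expression) extends Expression
      case class Predicate(description: String) extends Expression
      
      def splitConjunctivePredicates(condition: Expression): Seq[Expression] = condition match {
        case And(l, r) => splitConjunctivePredicates(l) ++ splitConjunctivePredicates(r)
        case other     => Seq(other)
      }
      
      // splitConjunctivePredicates(And(Predicate("d_moy = 11"), Predicate("d_year = 1999")))
      // => Seq(Predicate("d_moy = 11"), Predicate("d_year = 1999"))
      ```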
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11585 from davies/gen_filter.
      020ff8cd
    • Dongjoon Hyun's avatar
      [SPARK-3854][BUILD] Scala style: require spaces before `{`. · 91fed8e9
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Since the opening curly brace, '{', has many usages as discussed in [SPARK-3854](https://issues.apache.org/jira/browse/SPARK-3854), this PR adds a ScalaStyle rule that requires a space before '{' (preventing the '){' pattern shown below) and fixes the code accordingly. Enforcing this in ScalaStyle from now on will improve Scala code quality and reduce review time.
      ```
      // Correct:
      if (true) {
        println("Wow!")
      }
      
      // Incorrect:
      if (true){
         println("Wow!")
      }
      ```
      IntelliJ also shows new warnings based on this.
      
      ## How was this patch tested?
      
      Pass the Jenkins ScalaStyle test.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11637 from dongjoon-hyun/SPARK-3854.
      91fed8e9
    • Josh Rosen's avatar
      [SPARK-13696] Remove BlockStore class & simplify interfaces of mem. & disk stores · 81d48532
      Josh Rosen authored
      Today, both the MemoryStore and DiskStore implement a common `BlockStore` API, but I feel that this API is inappropriate because it abstracts away important distinctions between the behavior of these two stores.
      
      For instance, the disk store doesn't have a notion of storing deserialized objects, so it's confusing for it to expose object-based APIs like putIterator() and getValues() instead of only exposing binary APIs and pushing the responsibilities of serialization and deserialization to the client. Similarly, the DiskStore put() methods accepted a `StorageLevel` parameter even though the disk store can only store blocks in one form.
      
      As part of a larger BlockManager interface cleanup, this patch removes the BlockStore interface and refines the MemoryStore and DiskStore interfaces to reflect narrower sets of responsibilities for those components. Some of the benefits of this interface cleanup are reflected in simplifications to several unit tests to eliminate now-unnecessary mocking, significant simplification of the BlockManager's `getLocal()` and `doPut()` methods, and a narrower API between the MemoryStore and DiskStore.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11534 from JoshRosen/remove-blockstore-interface.
      81d48532
    • Tathagata Das's avatar
      [SQL][TEST] Increased timeouts to reduce flakiness in ContinuousQueryManagerSuite · 3d2b6f56
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      ContinuousQueryManagerSuite is sometimes flaky on Jenkins. I could not reproduce it on my machine, so I guess it is about the waiting times, which cause problems when Jenkins is loaded. I have increased the wait times in the hope that the suite will be less flaky.
      
      ## How was this patch tested?
      
      I reran the unit test many times in a loop on my machine. I am going to run it a few times on Jenkins; that's the real test.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #11638 from tdas/cqm-flaky-test.
      3d2b6f56
    • Nong Li's avatar
      [SPARK-13790] Speed up ColumnVector's getDecimal · 747d2f53
      Nong Li authored
      ## What changes were proposed in this pull request?
      
      We should reuse an object similar to the other non-primitive type getters. For
      a query that computes averages over decimal columns, this shows a 10% speedup
      on overall query times.
      
      ## How was this patch tested?
      
      Existing tests and this benchmark
      
      ```
      TPCDS Snappy:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)
      --------------------------------------------------------------------------------
      q27-agg (master)                       10627 / 11057         10.8          92.3
      q27-agg (this patch)                     9722 / 9832         11.8          84.4
      ```
      
      Author: Nong Li <nong@databricks.com>
      
      Closes #11624 from nongli/spark-13790.
      747d2f53
    • Sameer Agarwal's avatar
      [SPARK-13759][SQL] Add IsNotNull constraints for expressions with an inequality · 19f4ac6d
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR adds support for inferring `IsNotNull` constraints from expressions with an `!==`. More specifically, if an operator has a condition on `a !== b`, we know that both `a` and `b` in the operator output can no longer be null.
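      
      A hedged, spark-shell-style illustration of the effect (the example data and column names are made up; this is not the PR's code):
      
      ```scala
      import org.apache.spark.sql.functions.col
      
      // Two integer columns a and b (assumes the shell's predefined sqlContext).
      val df = sqlContext.range(0, 10).selectExpr("id AS a", "id % 3 AS b")
      
      // A predicate `a !== b` can only hold when both sides are non-null, so the optimizer
      // can behave as if the explicit IsNotNull filters below had been written by hand.
      val filtered = df.filter(col("a") !== col("b"))
      val explicit = df.filter(col("a").isNotNull && col("b").isNotNull && (col("a") !== col("b")))
      filtered.explain()  // the physical plan should show the inferred isnotnull conditions
      ```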
      
      ## How was this patch tested?
      
      1. Modified a test in `ConstraintPropagationSuite` to test for expressions with an inequality.
      2. Added a test in `NullFilteringSuite` to make sure an inner join with a "non-equal" condition appropriately filters out nulls from its inputs.
      
      cc nongli
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11594 from sameeragarwal/isnotequal-constraints.
      19f4ac6d
    • bomeng's avatar
      [SPARK-13727][CORE] SparkConf.contains does not consider deprecated keys · 235f4ac6
      bomeng authored
      The contains() method does not behave consistently with get() if the key is deprecated. For example:
      import org.apache.spark.SparkConf
      val conf = new SparkConf()
      conf.set("spark.io.compression.lz4.block.size", "12345")  // prints a deprecation warning
      conf.get("spark.io.compression.lz4.block.size")           // returns 12345
      conf.get("spark.io.compression.lz4.blockSize")            // returns 12345
      conf.contains("spark.io.compression.lz4.block.size")      // returns true
      conf.contains("spark.io.compression.lz4.blockSize")       // returns false
      
      The fix makes contains() and get() consistent.
      
      I've added a test case for this.
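      
      A minimal sketch of the intended behavior (plain Scala, not the actual `SparkConf` source; the alias map is illustrative):
      
      ```scala
      // contains() should consult configured alternatives (e.g. a deprecated key that
      // was renamed) the same way get() already does.
      def containsWithAliases(settings: Map[String, String],
                              key: String,
                              aliases: Map[String, Seq[String]]): Boolean = {
        settings.contains(key) ||
          aliases.getOrElse(key, Seq.empty).exists(settings.contains)
      }
      
      // With aliases = Map("spark.io.compression.lz4.blockSize" ->
      //   Seq("spark.io.compression.lz4.block.size")), setting the deprecated key
      // makes contains() of the new key return true as well.
      ```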
      
      Unit tests should be sufficient.
      
      Author: bomeng <bmeng@us.ibm.com>
      
      Closes #11568 from bomeng/SPARK-13727.
      235f4ac6
    • Liang-Chi Hsieh's avatar
      [SPARK-13636] [SQL] Directly consume UnsafeRow in wholestage codegen plans · d24801ad
      Liang-Chi Hsieh authored
      JIRA: https://issues.apache.org/jira/browse/SPARK-13636
      
      ## What changes were proposed in this pull request?
      
      As shown in the whole-stage codegen version of the Sort operator, when Sort is on top of Exchange (or another operator that produces UnsafeRow), we create variables from the UnsafeRow and then create another UnsafeRow from those variables. We should avoid this unnecessary unpacking and packing of variables from UnsafeRows.
      
      ## How was this patch tested?
      
      All existing whole-stage codegen tests should pass.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #11484 from viirya/direct-consume-unsaferow.
      d24801ad
    • mwws's avatar
      [SPARK-13758][STREAMING][CORE] enhance exception message to avoid misleading · 74267beb
      mwws authored
      We have a recoverable Spark Streaming job with checkpointing enabled. It executes correctly the first time, but throws the following exception when restarted and recovered from the checkpoint.
      ```
      org.apache.spark.SparkException: RDD transformations and actions can only be invoked by the driver, not inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
       	at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:87)
       	at org.apache.spark.rdd.RDD.withScope(RDD.scala:352)
       	at org.apache.spark.rdd.RDD.union(RDD.scala:565)
       	at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:23)
       	at org.apache.spark.streaming.Repo$$anonfun$createContext$1.apply(Repo.scala:19)
       	at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:627)
      ```
      
      According to the exception, I invoked transformations and actions inside other transformations, but I did not. The real reason is that I used an external RDD in a DStream operation. External RDD data is not stored in the checkpoint, so during recovery the `_sc` field of this RDD is null, which triggers the exception above. The error message is misleading: it says nothing about the real issue.
      Here is the code to reproduce it.
      
      ```scala
      object Repo {
      
        def createContext(ip: String, port: Int, checkpointDirectory: String):StreamingContext = {
      
          println("Creating new context")
          val sparkConf = new SparkConf().setAppName("Repo").setMaster("local[2]")
          val ssc = new StreamingContext(sparkConf, Seconds(2))
          ssc.checkpoint(checkpointDirectory)
      
          var cached = ssc.sparkContext.parallelize(Seq("apple, banana"))
      
          val words = ssc.socketTextStream(ip, port).flatMap(_.split(" "))
          words.foreachRDD((rdd: RDD[String]) => {
            val res = rdd.map(word => (word, word.length)).collect()
            println("words: " + res.mkString(", "))
      
            cached = cached.union(rdd)
            cached.checkpoint()
            println("cached words: " + cached.collect.mkString(", "))
          })
          ssc
        }
      
        def main(args: Array[String]) {
      
          val ip = "localhost"
          val port = 9999
          val dir = "/home/maowei/tmp"
      
          val ssc = StreamingContext.getOrCreate(dir,
            () => {
              createContext(ip, port, dir)
            })
          ssc.start()
          ssc.awaitTermination()
        }
      }
      ```
      
      Author: mwws <wei.mao@intel.com>
      
      Closes #11595 from mwws/SPARK-MissleadingLog.
      74267beb
    • Sean Owen's avatar
      [SPARK-13663][CORE] Upgrade Snappy Java to 1.1.2.1 · 927e22ef
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Update snappy to 1.1.2.1 to pull in a single fix -- the OOM fix we already worked around.
      Supersedes https://github.com/apache/spark/pull/11524
      
      ## How was this patch tested?
      
      Jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11631 from srowen/SPARK-13663.
      927e22ef
    • sethah's avatar
      [SPARK-11108][ML] OneHotEncoder should support other numeric types · 9fe38aba
      sethah authored
      Adding support for other numeric types (a brief usage sketch follows the list):
      
      * Integer
      * Short
      * Long
      * Float
      * Decimal
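      
      A hedged, spark-shell-style usage sketch (data is made up): after this change the input column no longer has to be DoubleType, so an integer-typed category index, for example, should work directly:
      
      ```scala
      import org.apache.spark.ml.feature.OneHotEncoder
      import sqlContext.implicits._
      
      // The category column is an Int rather than a Double.
      val df = Seq((0, 1), (1, 0), (2, 2), (3, 1)).toDF("id", "category")
      
      val encoder = new OneHotEncoder()
        .setInputCol("category")
        .setOutputCol("categoryVec")
      
      encoder.transform(df).show()
      ```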
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #9777 from sethah/SPARK-11108.
      9fe38aba
    • Dongjoon Hyun's avatar
      [MINOR][SQL] Replace DataFrameWriter.stream() with startStream() in comments. · 9525c563
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Following #11627, this PR replaces `DataFrameWriter.stream()` with `startStream()` in the comments of `ContinuousQueryListener.java`.
      
      ## How was this patch tested?
      
      Manual. (It only changes comments.)
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11629 from dongjoon-hyun/minor_rename.
      9525c563
    • JeremyNixon's avatar
      [SPARK-13706][ML] Add Python Example for Train Validation Split · 3e3c3d58
      JeremyNixon authored
      ## What changes were proposed in this pull request?
      
      This pull request adds a python example for train validation split.
      
      ## How was this patch tested?
      
      This was style tested through lint-python, generally tested with ./dev/run-tests, and run in notebook and shell environments. It was viewed in docs locally with jekyll serve.
      
      This contribution is my original work and I license it to Spark under its open source license.
      
      Author: JeremyNixon <jnixon2@gmail.com>
      
      Closes #11547 from JeremyNixon/tvs_example.
      3e3c3d58
  3. Mar 09, 2016
    • proflin's avatar
      [SPARK-7420][STREAMING][TESTS] Enable test: o.a.s.streaming.JobGeneratorSuite... · 8bcad28a
      proflin authored
      [SPARK-7420][STREAMING][TESTS] Enable test: o.a.s.streaming.JobGeneratorSuite "Do not clear received…
      
      ## How was this patch tested?
      
      unit test
      
      Author: proflin <proflin.me@gmail.com>
      
      Closes #11626 from lw-lin/SPARK-7420.
      8bcad28a
    • Reynold Xin's avatar
      [SPARK-13794][SQL] Rename DataFrameWriter.stream() DataFrameWriter.startStream() · 8a3acb79
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      The new name makes it more obvious with the verb "start" that we are actually starting some execution.
      
      ## How was this patch tested?
      This is just a rename. Existing unit tests should cover it.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11627 from rxin/SPARK-13794.
      8a3acb79
    • hyukjinkwon's avatar
      [SPARK-13766][SQL] Consistent file extensions for files written by internal data sources · aa0eba2c
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
      https://issues.apache.org/jira/browse/SPARK-13766
      This PR makes the file extensions written by internal data sources consistent.
      
      **Before**
      
      - TEXT, CSV and JSON
      ```
      [.COMPRESSION_CODEC_NAME]
      ```
      
      - Parquet
      ```
      [.COMPRESSION_CODEC_NAME].parquet
      ```
      
      - ORC
      ```
      .orc
      ```
      
      **After**
      
      - TEXT, CSV and JSON
      ```
      .txt[.COMPRESSION_CODEC_NAME]
      .csv[.COMPRESSION_CODEC_NAME]
      .json[.COMPRESSION_CODEC_NAME]
      ```
      
      - Parquet
      ```
      [.COMPRESSION_CODEC_NAME].parquet
      ```
      
      - ORC
      ```
      [.COMPRESSION_CODEC_NAME].orc
      ```
      
      When the compression codec is set,
      - For Parquet and ORC, the files still stay in Parquet and ORC format and simply contain compressed data internally, so I think it is okay to keep `.parquet` and `.orc` at the end.
      
      - For Text, CSV and JSON, the compressed files no longer stay in their original format (the on-disk format depends on the compression codec), so the `.json`, `.csv` and `.txt` extensions go before the compression extension.
      
      ## How was this patch tested?
      
      Unit tests, plus `./dev/run-tests` for code style checks.
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      
      Closes #11604 from HyukjinKwon/SPARK-13766.
      aa0eba2c
    • Yin Huai's avatar
      Revert "[SPARK-13760][SQL] Fix BigDecimal constructor for FloatType" · 79064612
      Yin Huai authored
      This reverts commit 926e9c45.
      79064612
    • Sameer Agarwal's avatar
      [SPARK-13760][SQL] Fix BigDecimal constructor for FloatType · 926e9c45
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      A very minor change to use `BigDecimal.decimal(f: Float)` instead of `BigDecimal(f: Float)`. The latter is deprecated and can result in inconsistencies due to an implicit conversion to `Double`.
      
      ## How was this patch tested?
      
      N/A
      
      cc yhuai
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11597 from sameeragarwal/bigdecimal.
      926e9c45
    • Sergiusz Urbaniak's avatar
      [SPARK-13492][MESOS] Configurable Mesos framework webui URL. · a4a0addc
      Sergiusz Urbaniak authored
      ## What changes were proposed in this pull request?
      
      Previously the Mesos framework webui URL was derived only from the Spark UI address, leaving no way to configure it. This commit makes it configurable. If unset, it falls back to the previous behavior.
      
      Motivation:
      This change is necessary in order to install Spark on DCOS and give it a custom service link. The `webui_url` is configured to point to a reverse proxy in the DCOS environment.
      
      ## How was this patch tested?
      
      Locally with unit tests, and on DCOS against the testing and stable revisions.
      
      Author: Sergiusz Urbaniak <sur@mesosphere.io>
      
      Closes #11369 from s-urbaniak/sur-webui-url.
      a4a0addc
    • Tristan Reid's avatar
      [MINOR] Fix typo in 'hypot' docstring · 5f7dbdba
      Tristan Reid authored
      Minor typo:  docstring for pyspark.sql.functions: hypot has extra characters
      
      N/A
      
      Author: Tristan Reid <treid@netflix.com>
      
      Closes #11616 from tristanreid/master.
      5f7dbdba
    • zhuol's avatar
      [SPARK-13775] History page sorted by completed time desc by default. · 238447db
      zhuol authored
      ## What changes were proposed in this pull request?
      Originally the page was sorted by App ID by default.
      After testing and user feedback, we think it is best to sort by completed time (descending).
      
      ## How was this patch tested?
      Manually test, with screenshot as follows.
      ![sorted-by-complete-time-desc](https://cloud.githubusercontent.com/assets/11683054/13647686/d6dea924-e5fa-11e5-8fc5-68e039b74b6f.png)
      
      Author: zhuol <zhuol@yahoo-inc.com>
      
      Closes #11608 from zhuoliu/13775.
      238447db
    • Shixiong Zhu's avatar
      [SPARK-13778][CORE] Set the executor state for a worker when removing it · 40e06767
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      When a worker is lost, the executors on that worker are also lost, but the Master's ApplicationPage still displays their state as running.
      
      This patch just sets the executor state to `LOST` when a worker is lost.
      
      ## How was this patch tested?
      
      manual tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11609 from zsxwing/SPARK-13778.
      40e06767
    • Andrew Or's avatar
      [SPARK-13747][SQL] Fix concurrent query with fork-join pool · 37fcda3e
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Fix this use case, which was already fixed in SPARK-10548 in 1.6 but was broken in master due to #9264:
      
      ```
      (1 to 100).par.foreach { _ => sc.parallelize(1 to 5).map { i => (i, i) }.toDF("a", "b").count() }
      ```
      
      This threw `IllegalArgumentException` consistently before this patch. For more detail, see the JIRA.
      
      ## How was this patch tested?
      
      New test in `SQLExecutionSuite`.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #11586 from andrewor14/fix-concurrent-sql.
      37fcda3e
    • Sameer Agarwal's avatar
      [SPARK-13781][SQL] Use ExpressionSets in ConstraintPropagationSuite · dbf2a7cf
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This PR is a small follow up on https://github.com/apache/spark/pull/11338 (https://issues.apache.org/jira/browse/SPARK-13092) to use `ExpressionSet` as part of the verification logic in `ConstraintPropagationSuite`.
      ## How was this patch tested?
      
      No new tests added. Just changes the verification logic in `ConstraintPropagationSuite`.
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #11611 from sameeragarwal/expression-set.
      dbf2a7cf
    • sethah's avatar
      [SPARK-11861][ML] Add feature importances for decision trees · e1772d3f
      sethah authored
      This patch adds an API entry point for single decision tree feature importances.
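      
      A hedged usage sketch (spark-shell style; the `training` DataFrame with "label" and "features" columns is assumed):
      
      ```scala
      import org.apache.spark.ml.classification.DecisionTreeClassifier
      
      val dt = new DecisionTreeClassifier()
        .setLabelCol("label")
        .setFeaturesCol("features")
      
      val model = dt.fit(training)
      // One importance value per feature, analogous to the existing ensemble API.
      println(model.featureImportances)
      ```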
      
      Author: sethah <seth.hendrickson16@gmail.com>
      
      Closes #9912 from sethah/SPARK-11861.
      e1772d3f
    • gatorsmile's avatar
      [SPARK-13527][SQL] Prune Filters based on Constraints · c6aa356c
      gatorsmile authored
      #### What changes were proposed in this pull request?
      
      Remove all the deterministic conditions in a [[Filter]] that are contained in the Child's Constraints.
      
      For example, the first query can be simplified to the second one.
      
      ```scala
          val queryWithUselessFilter = tr1
            .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
            .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
            .where(
              ("tr1.a".attr > 10 || "tr1.c".attr < 10) &&
              'd.attr < 100 &&
              "tr2.a".attr === "tr1.a".attr)
      ```
      ```scala
          val query = tr1
            .where("tr1.a".attr > 10 || "tr1.c".attr < 10)
            .join(tr2.where('d.attr < 100), Inner, Some("tr1.a".attr === "tr2.a".attr))
      ```
      #### How was this patch tested?
      
      Six test cases are added.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #11406 from gatorsmile/FilterRemoval.
      c6aa356c
    • Davies Liu's avatar
      [SPARK-13523] [SQL] Reuse exchanges in a query · 3dc9ae2e
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      It's possible to have common parts in a query, for example a self join; it would be good to avoid computing the duplicated part twice, to save CPU and memory (broadcast or cache).
      
      Exchange materializes the underlying RDD by shuffle or collect, so it's a great point at which to check for duplicates and reuse them. Duplicated exchanges are exchanges that generate exactly the same result inside a query.
      
      In order to find the duplicated exchanges, we should be able to compare SparkPlans to check whether they produce the same results. We already have that for LogicalPlan, so we should move it into QueryPlan to make it available for SparkPlan.
      
      Once we find the duplicated exchanges, we replace all of them with the same SparkPlan object (possibly wrapped by ReusedExchange for explain), and the plan tree becomes a DAG. Since all the planner rules only work with trees, this rule should be the last one in the entire planning process.
      
      After the rule, the plan looks like:
      
      ```
      WholeStageCodegen
      :  +- Project [id#0L]
      :     +- BroadcastHashJoin [id#0L], [id#2L], Inner, BuildRight, None
      :        :- Project [id#0L]
      :        :  +- BroadcastHashJoin [id#0L], [id#1L], Inner, BuildRight, None
      :        :     :- Range 0, 1, 4, 1024, [id#0L]
      :        :     +- INPUT
      :        +- INPUT
      :- BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
      :  +- WholeStageCodegen
      :     :  +- Range 0, 1, 4, 1024, [id#1L]
      +- ReusedExchange [id#2L], BroadcastExchange HashedRelationBroadcastMode(true,List(id#1L),List(id#1L))
      ```
      
      ![bjoin](https://cloud.githubusercontent.com/assets/40902/13414787/209e8c5c-df0a-11e5-8a0f-edff69d89e83.png)
      
      For a three-way SortMergeJoin:
      ```
      == Physical Plan ==
      WholeStageCodegen
      :  +- Project [id#0L]
      :     +- SortMergeJoin [id#0L], [id#4L], None
      :        :- INPUT
      :        +- INPUT
      :- WholeStageCodegen
      :  :  +- Project [id#0L]
      :  :     +- SortMergeJoin [id#0L], [id#3L], None
      :  :        :- INPUT
      :  :        +- INPUT
      :  :- WholeStageCodegen
      :  :  :  +- Sort [id#0L ASC], false, 0
      :  :  :     +- INPUT
      :  :  +- Exchange hashpartitioning(id#0L, 200), None
      :  :     +- WholeStageCodegen
      :  :        :  +- Range 0, 1, 4, 33554432, [id#0L]
      :  +- WholeStageCodegen
      :     :  +- Sort [id#3L ASC], false, 0
      :     :     +- INPUT
      :     +- ReusedExchange [id#3L], Exchange hashpartitioning(id#0L, 200), None
      +- WholeStageCodegen
         :  +- Sort [id#4L ASC], false, 0
         :     +- INPUT
         +- ReusedExchange [id#4L], Exchange hashpartitioning(id#0L, 200), None
      ```
      ![sjoin](https://cloud.githubusercontent.com/assets/40902/13414790/27aea61c-df0a-11e5-8cbf-fbc985c31d95.png)
      
      If execute()/executeBroadcast() on the same ShuffleExchange or BroadcastExchange is called by different parents, the exchange should cache the RDD/Broadcast and return the same one for all the parents.
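      
      A hedged, spark-shell-style example of a query that should benefit (column names are made up, and the exact plan output depends on the build):
      
      ```scala
      // A self-join over the same aggregated input: both sides need the same Exchange,
      // so the second one can become a ReusedExchange in the physical plan.
      val ids = sqlContext.range(0, 1000000).selectExpr("id", "id % 10 AS bucket")
      val agg = ids.groupBy("bucket").count()
      val joined = agg.as("l").join(agg.as("r"), "bucket")
      joined.explain()   // look for a ReusedExchange node once this rule is in place
      ```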
      
      ## How was this patch tested?
      
      Added some unit tests for this. Also did some manual tests on TPCDS queries Q59 and Q64; we can see that some exchanges are re-used (this requires a change in PhysicalRDD for sameResult, which is done in #11514).
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #11403 from davies/dedup.
      3dc9ae2e
    • Yanbo Liang's avatar
      [SPARK-13615][ML] GeneralizedLinearRegression supports save/load · 0dd06485
      Yanbo Liang authored
      ## What changes were proposed in this pull request?
      ```GeneralizedLinearRegression``` supports ```save/load```.
      cc mengxr
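      
      A hedged usage sketch of what this enables (spark-shell style; the `training` DataFrame and the path are made up):
      
      ```scala
      import org.apache.spark.ml.regression.{GeneralizedLinearRegression, GeneralizedLinearRegressionModel}
      
      val glr = new GeneralizedLinearRegression()
        .setFamily("gaussian")
        .setLink("identity")
      
      val model = glr.fit(training)
      model.save("/tmp/glr-model")
      val restored = GeneralizedLinearRegressionModel.load("/tmp/glr-model")
      ```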
      ## How was this patch tested?
      unit test.
      
      Author: Yanbo Liang <ybliang8@gmail.com>
      
      Closes #11465 from yanboliang/spark-13615.
      0dd06485