  1. Jan 04, 2015
    • [SPARK-5067][Core] Use '===' to compare well-defined case class · 72396522
      zsxwing authored
      A simple fix would be adding `assert(e1.appId == e2.appId)` for `SparkListenerApplicationStart`. But we can actually use `===` on well-defined case classes directly. So instead of patching each comparison, this change uses `===` to compare those well-defined case classes (ones whose fields all implement a correct `equals` method, such as primitive types), as sketched below.
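      For illustration, a minimal ScalaTest sketch of the idea (the event class here is hypothetical, with only primitive-ish fields):
      
      ```scala
      import org.scalatest.FunSuite
      
      // Hypothetical event: every field is a primitive or String, so the
      // synthesized case-class equals is correct and `===` can compare
      // whole values instead of field-by-field asserts.
      case class AppStarted(appId: String, time: Long)
      
      class AppStartedSuite extends FunSuite {
        test("=== compares every field of a well-defined case class") {
          val e1 = AppStarted("app-1", 42L)
          val e2 = AppStarted("app-1", 42L)
          assert(e1 === e2) // on failure, reports which values differed
        }
      }
      ```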
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3886 from zsxwing/SPARK-5067 and squashes the following commits:
      
      0a51711 [zsxwing] Use '===' to compare well-defined case class
    • [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs · 939ba1f8
      Josh Rosen authored
      This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery.
      
      Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists.  SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.
      
      In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times.  In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions.  When output spec. validation is enabled, the second calls to these actions will fail due to existing output.
      
      This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler.  This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.
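      A minimal sketch of the `DynamicVariable` pattern (identifiers here are hypothetical, not the patch's actual names):
      
      ```scala
      import scala.util.DynamicVariable
      
      object OutputSpecValidation {
        // Defaults to false; only code running inside withValue(true) { ... }
        // observes true, and the previous value is restored afterwards.
        private val bypass = new DynamicVariable[Boolean](false)
      
        def validationEnabled(confEnabled: Boolean): Boolean =
          !bypass.value && confEnabled
      
        // The streaming scheduler would submit its jobs inside this scope:
        def withValidationDisabled[T](body: => T): T = bypass.withValue(true)(body)
      }
      ```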
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits:
      
      36eaf35 [Josh Rosen] Add comment explaining use of transform() in test.
      6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform()
      7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide
      bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming.
      e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic.
      762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.
    • [SPARK-4631] unit test for MQTT · e767d7dd
      bilna authored
      Please review the unit test for MQTT
      
      Author: bilna <bilnap@am.amrita.edu>
      Author: Bilna P <bilna.p@gmail.com>
      
      Closes #3844 from Bilna/master and squashes the following commits:
      
      acea3a3 [bilna] Adding dependency with scope test
      28681fa [bilna] Merge remote-tracking branch 'upstream/master'
      fac3904 [bilna] Correction in Indentation and coding style
      ed9db4c [bilna] Merge remote-tracking branch 'upstream/master'
      4b34ee7 [Bilna P] Update MQTTStreamSuite.scala
      04503cf [bilna] Added embedded broker service for mqtt test
      89d804e [bilna] Merge remote-tracking branch 'upstream/master'
      fc8eb28 [bilna] Merge remote-tracking branch 'upstream/master'
      4b58094 [Bilna P] Update MQTTStreamSuite.scala
      b1ac4ad [bilna] Added BeforeAndAfter
      5f6bfd2 [bilna] Added BeforeAndAfter
      e8b6623 [Bilna P] Update MQTTStreamSuite.scala
      5ca6691 [Bilna P] Update MQTTStreamSuite.scala
      8616495 [bilna] [SPARK-4631] unit test for MQTT
    • [SPARK-4787] Stop SparkContext if a DAGScheduler init error occurs · 3fddc946
      Dale authored
      Author: Dale <tigerquoll@outlook.com>
      
      Closes #3809 from tigerquoll/SPARK-4787 and squashes the following commits:
      
      5661e01 [Dale] [SPARK-4787] Ensure that call to stop() doesn't lose the exception by using a finally block.
      2172578 [Dale] [SPARK-4787] Stop context properly if an exception occurs during DAGScheduler initialization.
    • [SPARK-794][Core] Remove sleep() in ClusterScheduler.stop · b96008d5
      Brennon York authored
      Removed `sleep()` from the `stop()` method of the `TaskSchedulerImpl` class. Per the JIRA ticket, the call is believed to be a legacy artifact, originally introduced in the `ClusterScheduler` class, that slows down testing.
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #3851 from brennonyork/SPARK-794 and squashes the following commits:
      
      04c3e64 [Brennon York] Removed sleep() from the stop() method
  2. Jan 03, 2015
    • [SPARK-5058] Updated broken links · 342612b6
      sigmoidanalytics authored
      Updated the broken link pointing to the KafkaWordCount example to the correct one.
      
      Author: sigmoidanalytics <mayur@sigmoidanalytics.com>
      
      Closes #3877 from sigmoidanalytics/patch-1 and squashes the following commits:
      
      3e19b31 [sigmoidanalytics] Updated broken links
  3. Jan 02, 2015
    • Fixed typos in streaming-kafka-integration.md · cdccc263
      Akhil Das authored
      Changed "projrect" to "project" :)
      
      Author: Akhil Das <akhld@darktech.ca>
      
      Closes #3876 from akhld/patch-1 and squashes the following commits:
      
      e0cf9ef [Akhil Das] Fixed typos in streaming-kafka-integration.md
    • [SPARK-3325][Streaming] Add a parameter to the method print in class DStream · bd88b718
      Yadong Qi authored
      This PR is a fixed version of the original PR #3237 by watermen and scwf.
      This adds the ability to specify how many elements to print in `DStream.print`.
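      A usage sketch of the new overload (assuming an existing `StreamingContext` named `ssc`):
      
      ```scala
      val lines = ssc.socketTextStream("localhost", 9999)
      lines.print()   // previous behavior: first 10 elements of each batch
      lines.print(25) // new parameter: first 25 elements of each batch
      ```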
      
      Author: Yadong Qi <qiyadong2010@gmail.com>
      Author: q00251598 <qiyadong@huawei.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3865 from tdas/print-num and squashes the following commits:
      
      cd34e9e [Tathagata Das] Fix bug
      7c09f16 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into HEAD
      bb35d1a [Yadong Qi] Update MimaExcludes.scala
      f8098ca [Yadong Qi] Update MimaExcludes.scala
      f6ac3cb [Yadong Qi] Update MimaExcludes.scala
      e4ed897 [Yadong Qi] Update MimaExcludes.scala
      3b9d5cf [wangfei] fix conflicts
      ec8a3af [q00251598] move to  Spark 1.3
      26a70c0 [q00251598] extend the Python DStream's print
      b589a4b [q00251598] add another print function
  4. Jan 01, 2015
    • [HOTFIX] Bind web UI to ephemeral port in DriverSuite · 01283980
      Josh Rosen authored
      The job launched by DriverSuite should bind the web UI to an ephemeral port, since it looks like port contention in this test has caused a large number of Jenkins failures when many builds are started simultaneously.  Our tests already disable the web UI, but this doesn't affect subprocesses launched by our tests.  In this case, I've opted to bind to an ephemeral port instead of disabling the UI because disabling features in this test may mask its ability to catch certain bugs.
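      A sketch of the relevant configuration; port 0 asks the OS for any free (ephemeral) port, so concurrently running test JVMs cannot collide on a fixed UI port:
      
      ```scala
      import org.apache.spark.SparkConf
      
      val conf = new SparkConf()
        .setMaster("local")
        .setAppName("DriverSuite sketch")
        .set("spark.ui.port", "0") // bind the web UI to an ephemeral port
      ```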
      
      See also: e24d3a9a
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3873 from JoshRosen/driversuite-webui-port and squashes the following commits:
      
      48cd05c [Josh Rosen] [HOTFIX] Bind web UI to ephemeral port in DriverSuite.
  5. Dec 31, 2014
    • [SPARK-5038] Add explicit return type for implicit functions. · 7749dd6c
      Reynold Xin authored
      As we learned in #3580, not explicitly typing implicit functions can lead to compiler bugs and potentially unexpected runtime behavior.
      
      This is a follow up PR for rest of Spark (outside Spark SQL). The original PR for Spark SQL can be found at https://github.com/apache/spark/pull/3859
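      For illustration, a before/after sketch of what "explicitly typing" an implicit function means (names are made up):
      
      ```scala
      object Implicits {
        class RichSeq(val xs: Seq[Double]) {
          def norm2: Double = math.sqrt(xs.map(x => x * x).sum)
        }
      
        // Before (fragile): the return type is left to inference.
        //   implicit def toRichSeq(xs: Seq[Double]) = new RichSeq(xs)
      
        // After: the return type is spelled out explicitly.
        implicit def toRichSeq(xs: Seq[Double]): RichSeq = new RichSeq(xs)
      }
      ```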
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #3860 from rxin/implicit and squashes the following commits:
      
      73702f9 [Reynold Xin] [SPARK-5038] Add explicit return type for implicit functions.
    • SPARK-2757 [BUILD] [STREAMING] Add Mima test for Spark Sink after 1.10 is released · 4bb12488
      Sean Owen authored
      Re-enable MiMa for Streaming Flume Sink module, now that 1.1.0 is released, per the JIRA TO-DO. That's pretty much all there is to this.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3842 from srowen/SPARK-2757 and squashes the following commits:
      
      50ff80e [Sean Owen] Exclude apparent false positive turned up by re-enabling MiMa checks for Streaming Flume Sink
      0e5ba5c [Sean Owen] Re-enable MiMa for Streaming Flume Sink module
    • [SPARK-5035] [Streaming] ReceiverMessage trait should extend Serializable · fe6efacc
      Josh Rosen authored
      Spark Streaming's ReceiverMessage trait should extend Serializable in order to fix a subtle bug that only occurs when running on a real cluster:
      
      If you attempt to send a fire-and-forget message to a remote Akka actor and that message cannot be serialized, then this seems to lead to more-or-less silent failures. As an optimization, Akka skips message serialization for messages sent within the same JVM. As a result, Spark's unit tests will never fail due to non-serializable Akka messages, but these will cause mostly-silent failures when running on a real cluster.
      
      Before this patch, here was the code for ReceiverMessage:
      
      ```
      /** Messages sent to the NetworkReceiver. */
      private[streaming] sealed trait ReceiverMessage
      private[streaming] object StopReceiver extends ReceiverMessage
      ```
      
      Since ReceiverMessage does not extend Serializable and StopReceiver is a regular `object`, not a `case object`, sending StopReceiver triggers serialization errors. As a result, graceful receiver shutdown is broken on real clusters (and in local-cluster mode) but works in local modes. To reproduce this, try running the word count example from the Streaming Programming Guide in the Spark shell:
      
      ```
      import org.apache.spark._
      import org.apache.spark.streaming._
      import org.apache.spark.streaming.StreamingContext._
      val ssc = new StreamingContext(sc, Seconds(10))
      // Create a DStream that will connect to hostname:port, like localhost:9999
      val lines = ssc.socketTextStream("localhost", 9999)
      // Split each line into words
      val words = lines.flatMap(_.split(" "))
      // Count each word in each batch
      val pairs = words.map(word => (word, 1))
      val wordCounts = pairs.reduceByKey(_ + _)
      // Print the first ten elements of each RDD generated in this DStream to the console
      wordCounts.print()
      ssc.start()
      Thread.sleep(10000)
      ssc.stop(true, true)
      ```
      
      Prior to this patch, this would work correctly in local mode but fail when running against a real cluster (it would report that some receivers were not shut down).
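      The fix itself is small: make the trait extend Serializable. A sketch of the changed code:
      
      ```scala
      /** Messages sent to the NetworkReceiver. */
      private[streaming] sealed trait ReceiverMessage extends Serializable
      private[streaming] object StopReceiver extends ReceiverMessage
      ```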
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3857 from JoshRosen/SPARK-5035 and squashes the following commits:
      
      71d0eae [Josh Rosen] [SPARK-5035] ReceiverMessage trait should extend Serializable.
    • SPARK-5020 [MLlib] GaussianMixtureModel.predictMembership() should take an RDD only · c4f0b4f3
      Travis Galoppo authored
      Removed unnecessary parameters to predictMembership()
      
      CC: jkbradley
      
      Author: Travis Galoppo <tjg2107@columbia.edu>
      
      Closes #3854 from tgaloppo/spark-5020 and squashes the following commits:
      
      1bf4669 [Travis Galoppo] renamed predictMembership() to predictSoft()
      0f1d96e [Travis Galoppo] SPARK-5020 - Removed superfluous parameters from predictMembership()
    • [SPARK-5028][Streaming]Add total received and processed records metrics to Streaming UI · fdc2aa49
      jerryshao authored
      This is a follow-up to [SPARK-4537](https://issues.apache.org/jira/browse/SPARK-4537), adding the total received records and processed records metrics back to the UI.
      
      ![screenshot](https://dl.dropboxusercontent.com/u/19230832/screenshot.png)
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3852 from jerryshao/SPARK-5028 and squashes the following commits:
      
      c8c4877 [jerryshao] Add total received and processed metrics to Streaming UI
    • [SPARK-4790][STREAMING] Fix ReceivedBlockTrackerSuite waits for old files to get deleted before continuing · 3610d3c6
      Hari Shreedharan authored
      
      Since the deletes are happening asynchronously, the getFileStatus call might throw an exception in older HDFS versions, if the delete happens between the time listFiles is called on the directory and getFileStatus is called on the file in the getFileStatus method.
      
      This PR addresses this by adding an option to delete the files synchronously and then waiting for the deletion to complete before proceeding.
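      A sketch of the shape of the change (hypothetical helper; the real code lives in the write-ahead log cleanup):
      
      ```scala
      import scala.concurrent.{Await, ExecutionContext, Future}
      import scala.concurrent.duration._
      
      object LogCleaner {
        implicit val ec: ExecutionContext = ExecutionContext.global
      
        // Deletion still runs on a background thread, but callers (tests)
        // can opt to block until the filesystem has settled.
        def cleanUpOldLogs(threshTime: Long, waitForCompletion: Boolean): Unit = {
          val deletion: Future[Unit] = Future { /* delete files older than threshTime */ }
          if (waitForCompletion) {
            Await.ready(deletion, 30.seconds)
          }
        }
      }
      ```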
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #3726 from harishreedharan/spark-4790 and squashes the following commits:
      
      bbbacd1 [Hari Shreedharan] Call cleanUpOldLogs only once in the tests.
      3255f17 [Hari Shreedharan] Add test for async deletion. Remove method from ReceiverTracker that does not take waitForCompletion.
      e4c83ec [Hari Shreedharan] Making waitForCompletion a mandatory param. Remove eventually from WALSuite since the cleanup method returns only after all files are deleted.
      af00fd1 [Hari Shreedharan] [SPARK-4790][STREAMING] Fix ReceivedBlockTrackerSuite waits for old files to get deleted before continuing.
    • [SPARK-5038][SQL] Add explicit return type for implicit functions in Spark SQL · c88a3d7f
      Reynold Xin authored
      As we learned in https://github.com/apache/spark/pull/3580, not explicitly typing implicit functions can lead to compiler bugs and potentially unexpected runtime behavior.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #3859 from rxin/sql-implicits and squashes the following commits:
      
      30c2c24 [Reynold Xin] [SPARK-5038] Add explicit return type for implicit functions in Spark SQL.
    • [HOTFIX] Disable Spark UI in SparkSubmitSuite tests · e24d3a9a
      Josh Rosen authored
      This should fix a major cause of build breaks when running many parallel tests.
    • SPARK-4547 [MLLIB] OOM when making bins in BinaryClassificationMetrics · 3d194cc7
      Sean Owen authored
      Now that I've implemented the basics here, I'm less convinced there is a need for this change, somehow. Callers can downsample before or after. Really the OOM is not in the ROC curve code, but in code that might `collect()` it for local analysis. Still, might be useful to down-sample since the ROC curve probably never needs millions of points.
      
      This is a first pass. Since the `(score,label)` pairs are already grouped and sorted, I think it's sufficient to just take every Nth such pair in order to downsample by a factor of N. This is just like retaining every Nth point on the curve, which I think is the goal. All of the data is still used to build the curve, of course.
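      A minimal sketch of that downsampling idea over an already-sorted sequence:
      
      ```scala
      // Keep every Nth point; every pair still contributed to the cumulative
      // counts, so only the plotting resolution of the curve is reduced.
      def downsample[T](curve: Seq[T], factor: Int): Seq[T] =
        curve.zipWithIndex.collect { case (p, i) if i % factor == 0 => p }
      
      downsample(Seq("p0", "p1", "p2", "p3", "p4"), 2) // Seq("p0", "p2", "p4")
      ```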
      
      What do you think about the API, and usefulness?
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3702 from srowen/SPARK-4547 and squashes the following commits:
      
      1d34d05 [Sean Owen] Indent and reorganize numBins scaladoc
      692d825 [Sean Owen] Change handling of large numBins, make 2nd consturctor instead of optional param, style change
      a03610e [Sean Owen] Add downsamplingFactor to BinaryClassificationMetrics
    • [SPARK-4298][Core] - The spark-submit cannot read Main-Class from Manifest. · 8e14c5eb
      Brennon York authored
      Resolves a bug where the `Main-Class` from a .jar file wasn't being read in properly. This was caused by the fact that the `primaryResource` object was a URI and needed to be normalized through a call to `.getPath` before it could be passed into the `JarFile` object.
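      A sketch of the failing pattern and the fix (jar path is hypothetical):
      
      ```scala
      import java.net.URI
      import java.util.jar.JarFile
      
      val primaryResource = new URI("file:/tmp/app.jar")
      
      // new JarFile(primaryResource.toString) fails: "file:/tmp/app.jar" is a
      // URI, not a filesystem path. getPath normalizes it to "/tmp/app.jar".
      val mainClass = new JarFile(primaryResource.getPath)
        .getManifest.getMainAttributes.getValue("Main-Class")
      ```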
      
      Author: Brennon York <brennon.york@capitalone.com>
      
      Closes #3561 from brennonyork/SPARK-4298 and squashes the following commits:
      
      5e0fce1 [Brennon York] Use string interpolation for error messages, moved comment line from original code to above its necessary code segment
      14daa20 [Brennon York] pushed mainClass assignment into match statement, removed spurious spaces, removed { } from case statements, removed return values
      c6dad68 [Brennon York] Set case statement to support multiple jar URI's and enabled the 'file' URI to load the main-class
      8d20936 [Brennon York] updated to reset the error message back to the default
      a043039 [Brennon York] updated to split the uri and jar vals
      8da7cbf [Brennon York] fixes SPARK-4298
    • [SPARK-4797] Replace breezeSquaredDistance · 06a9aa58
      Liang-Chi Hsieh authored
      This PR replaces slow breezeSquaredDistance.
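      The flavor of the replacement, as a minimal dense-dense sketch:
      
      ```scala
      // A tight while-loop over the raw arrays avoids Breeze conversions
      // and per-element iterator allocation.
      def sqdist(a: Array[Double], b: Array[Double]): Double = {
        require(a.length == b.length)
        var sum = 0.0
        var i = 0
        while (i < a.length) {
          val d = a(i) - b(i)
          sum += d * d
          i += 1
        }
        sum
      }
      ```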
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #3643 from viirya/faster_squareddistance and squashes the following commits:
      
      f28b275 [Liang-Chi Hsieh] Move the implementation to linalg.Vectors and rename as sqdist.
      0bc48ee [Liang-Chi Hsieh] Merge branch 'master' into faster_squareddistance
      ba34422 [Liang-Chi Hsieh] Fix bug.
      91849d0 [Liang-Chi Hsieh] Modified for comment.
      44a65ad [Liang-Chi Hsieh] Modified for comments.
      35db395 [Liang-Chi Hsieh] Fix bug and some modifications for comments.
      f4f5ebb [Liang-Chi Hsieh] Follow BLAS.dot pattern to replace intersect, diff with while-loop.
      a36e09f [Liang-Chi Hsieh] Use while-loop to replace foreach for better performance.
      d3e0628 [Liang-Chi Hsieh] Make the methods private.
      dd415bc [Liang-Chi Hsieh] Consider different cases of SparseVector and DenseVector.
      13669db [Liang-Chi Hsieh] Replace breezeSquaredDistance.
  6. Dec 30, 2014
    • [SPARK-1010] Clean up uses of System.setProperty in unit tests · 352ed6bb
      Josh Rosen authored
      Several of our tests call System.setProperty (or test code which implicitly sets system properties) and don't always reset/clear the modified properties, which can create ordering dependencies between tests and cause hard-to-diagnose failures.
      
      This patch removes most uses of System.setProperty from our tests, since in most cases we can use SparkConf to set these configurations (there are a few exceptions, including the tests of SparkConf itself).
      
      For the cases where we continue to use System.setProperty, this patch introduces a `ResetSystemProperties` ScalaTest mixin class which snapshots the system properties before individual tests and automatically restores them on test completion / failure. See the block comment at the top of the ResetSystemProperties class for more details.
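      A minimal sketch of such a mixin (simplified relative to the patch):
      
      ```scala
      import java.util.Properties
      import org.scalatest.{BeforeAndAfterEach, Suite}
      
      trait ResetSystemProperties extends BeforeAndAfterEach { this: Suite =>
        private var saved: Properties = _
      
        override def beforeEach(): Unit = {
          // Copy into a fresh Properties object; holding a reference alone
          // would still observe later mutations.
          saved = new Properties()
          saved.putAll(System.getProperties)
          super.beforeEach()
        }
      
        override def afterEach(): Unit = {
          try super.afterEach() finally System.setProperties(saved)
        }
      }
      ```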
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3739 from JoshRosen/cleanup-system-properties-in-tests and squashes the following commits:
      
      0236d66 [Josh Rosen] Replace setProperty uses in two example programs / tools
      3888fe3 [Josh Rosen] Remove setProperty use in LocalJavaStreamingContext
      4f4031d [Josh Rosen] Add note on why SparkSubmitSuite needs ResetSystemProperties
      4742a5b [Josh Rosen] Clarify ResetSystemProperties trait inheritance ordering.
      0eaf0b6 [Josh Rosen] Remove setProperty call in TaskResultGetterSuite.
      7a3d224 [Josh Rosen] Fix trait ordering
      3fdb554 [Josh Rosen] Remove setProperty call in TaskSchedulerImplSuite
      bee20df [Josh Rosen] Remove setProperty calls in SparkContextSchedulerCreationSuite
      655587c [Josh Rosen] Remove setProperty calls in JobCancellationSuite
      3f2f955 [Josh Rosen] Remove System.setProperty calls in DistributedSuite
      cfe9cce [Josh Rosen] Remove use of system properties in SparkContextSuite
      8783ab0 [Josh Rosen] Remove TestUtils.setSystemProperty, since it is subsumed by the ResetSystemProperties trait.
      633a84a [Josh Rosen] Remove use of system properties in FileServerSuite
      25bfce2 [Josh Rosen] Use ResetSystemProperties in UtilsSuite
      1d1aa5a [Josh Rosen] Use ResetSystemProperties in SizeEstimatorSuite
      dd9492b [Josh Rosen] Use ResetSystemProperties in AkkaUtilsSuite
      b0daff2 [Josh Rosen] Use ResetSystemProperties in BlockManagerSuite
      e9ded62 [Josh Rosen] Use ResetSystemProperties in TaskSchedulerImplSuite
      5b3cb54 [Josh Rosen] Use ResetSystemProperties in SparkListenerSuite
      0995c4b [Josh Rosen] Use ResetSystemProperties in SparkContextSchedulerCreationSuite
      c83ded8 [Josh Rosen] Use ResetSystemProperties in SparkConfSuite
      51aa870 [Josh Rosen] Use withSystemProperty in ShuffleSuite
      60a63a1 [Josh Rosen] Use ResetSystemProperties in JobCancellationSuite
      14a92e4 [Josh Rosen] Use withSystemProperty in FileServerSuite
      628f46c [Josh Rosen] Use ResetSystemProperties in DistributedSuite
      9e3e0dd [Josh Rosen] Add ResetSystemProperties test fixture mixin; use it in SparkSubmitSuite.
      4dcea38 [Josh Rosen] Move withSystemProperty to TestUtils class.
    • [SPARK-4998][MLlib]delete the "train" function · 035bac88
      Liu Jiongzhou authored
      The `train` function defined in `class DecisionTree` hides the functions of the same name in `object DecisionTree`. Deleting it makes those `object` methods usable, especially when they are invoked via Java reflection.
      
      JIRA[SPARK-4998]
      
      Author: Liu Jiongzhou <ljzzju@163.com>
      
      Closes #3836 from ljzzju/master and squashes the following commits:
      
      4e13133 [Liu Jiongzhou] [MLlib]delete the "train" function
    • [SPARK-4813][Streaming] Fix the issue that ContextWaiter didn't handle 'spurious wakeup' · 6a897829
      zsxwing authored
      Used `Condition` to rewrite `ContextWaiter` because it provides `awaitNanos`, a convenient API for timed waits.
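      A sketch of the `Condition`/`awaitNanos` pattern the rewrite relies on:
      
      ```scala
      import java.util.concurrent.TimeUnit
      import java.util.concurrent.locks.ReentrantLock
      
      class Waiter {
        private val lock = new ReentrantLock()
        private val condition = lock.newCondition()
        private var stopped = false
      
        // awaitNanos returns the remaining wait time, so looping on it
        // handles spurious wakeups without recomputing deadlines by hand.
        def waitForStop(timeoutMs: Long): Boolean = {
          lock.lock()
          try {
            var remaining = TimeUnit.MILLISECONDS.toNanos(timeoutMs)
            while (!stopped && remaining > 0) {
              remaining = condition.awaitNanos(remaining)
            }
            stopped
          } finally lock.unlock()
        }
      
        def notifyStop(): Unit = {
          lock.lock()
          try { stopped = true; condition.signalAll() } finally lock.unlock()
        }
      }
      ```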
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3661 from zsxwing/SPARK-4813 and squashes the following commits:
      
      52247f5 [zsxwing] Add explicit unit type
      be42bcf [zsxwing] Update as per review suggestion
      e06bd4f [zsxwing] Fix the issue that ContextWaiter didn't handle 'spurious wakeup'
    • [Spark-4995] Replace Vector.toBreeze.activeIterator with foreachActive · 0f31992c
      Jakub Dubovsky authored
      The new foreachActive method on Vector was introduced by SPARK-4431 as a more efficient alternative to vector.toBreeze.activeIterator. There are some parts of the codebase where it has not yet been adopted; this PR replaces the remaining uses.
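      A usage sketch of the preferred API:
      
      ```scala
      import org.apache.spark.mllib.linalg.Vectors
      
      val v = Vectors.sparse(5, Array(0, 3), Array(1.0, 4.0))
      
      // Visits only the active (stored) entries, with no Breeze round-trip.
      v.foreachActive { (index, value) =>
        println(s"$index -> $value")
      }
      ```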
      
      dbtsai
      
      Author: Jakub Dubovsky <dubovsky@avast.com>
      
      Closes #3846 from james64/SPARK-4995-foreachActive and squashes the following commits:
      
      3eb7e37 [Jakub Dubovsky] Scalastyle fix
      32fe6c6 [Jakub Dubovsky] activeIterator removed - IndexedRowMatrix.toBreeze
      47a4777 [Jakub Dubovsky] activeIterator removed in RowMatrix.toBreeze
      90a7d98 [Jakub Dubovsky] activeIterator removed in MLUtils.saveAsLibSVMFile
    • SPARK-3955 part 2 [CORE] [HOTFIX] Different versions between jackson-mapper-asl and jackson-core-asl · b239ea1c
      Sean Owen authored
      
      pwendell: https://github.com/apache/spark/commit/2483c1efb6429a7d8a20c96d18ce2fec93a1aff9 didn't actually add a reference to `jackson-core-asl` as intended, but rather a second, redundant reference to `jackson-mapper-asl`, as markhamstra picked up on (https://github.com/apache/spark/pull/3716#issuecomment-68180192). This just rectifies the typo. I missed it as well; the original PR https://github.com/apache/spark/pull/2818 had it correct, and I also didn't see the problem.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3829 from srowen/SPARK-3955 and squashes the following commits:
      
      6cfdc4e [Sean Owen] Actually refer to jackson-core-asl
    • [SPARK-4570][SQL]add BroadcastLeftSemiJoinHash · 07fa1910
      wangxiaojing authored
      JIRA issue: [SPARK-4570](https://issues.apache.org/jira/browse/SPARK-4570)
      This PR adds a `BroadcastLeftSemiJoinHash` operator to implement the broadcast join for `left semi join`.
      In a left semi join, if the size of the data from the right side is smaller than the user-settable threshold `AUTO_BROADCASTJOIN_THRESHOLD`, the planner marks it as the `broadcast` relation and marks the other relation as the stream side. The broadcast table is broadcast to all of the executors involved in the join as an `org.apache.spark.broadcast.Broadcast` object, and the join uses `joins.BroadcastLeftSemiJoinHash`; otherwise it uses `joins.LeftSemiJoinHash`.
      
      The benchmark suggests this makes the optimized version about 4x faster for `left semi join`:
      ```
      Original:
      left semi join : 9288 ms
      Optimized:
      left semi join : 1963 ms
      ```
      The micro benchmark loads `data1/kv3.txt` into a normal Hive table.
      Benchmark code:
      ```scala
      def benchmark(f: => Unit): Long = {
        val begin = System.currentTimeMillis()
        f
        val end = System.currentTimeMillis()
        end - begin
      }
      
      val sc = new SparkContext(
        new SparkConf()
          .setMaster("local")
          .setAppName(getClass.getSimpleName.stripSuffix("$")))
      val hiveContext = new HiveContext(sc)
      import hiveContext._
      
      sql("drop table if exists left_table")
      sql("drop table if exists right_table")
      sql("create table left_table (key int, value string)")
      sql(s"""load data local inpath "/data1/kv3.txt" into table left_table""")
      sql("create table right_table (key int, value string)")
      sql(
        """
          |from left_table
          |insert overwrite table right_table
          |select left_table.key, left_table.value
        """.stripMargin)
      
      val leftSemiJoin = sql(
        """select a.key from left_table a
          |left semi join right_table b on a.key = b.key""".stripMargin)
      val leftSemiJoinDuration = benchmark(leftSemiJoin.count())
      println(s"left semi join : $leftSemiJoinDuration ms")
      ```
      
      Author: wangxiaojing <u9jing@gmail.com>
      
      Closes #3442 from wangxiaojing/SPARK-4570 and squashes the following commits:
      
      a4a43c9 [wangxiaojing] rebase
      f103983 [wangxiaojing] change style
      fbe4887 [wangxiaojing] change style
      ff2e618 [wangxiaojing] add testsuite
      1a8da2a [wangxiaojing] add BroadcastLeftSemiJoinHash
    • [SPARK-4935][SQL] When hive.cli.print.header configured, spark-sql aborted if passed in an invalid sql · 8f29b7ca
      wangfei authored
      
      If an invalid sql such as `abdcdfsfs` was passed in, the spark-sql script aborted.
      
      Author: wangfei <wangfei1@huawei.com>
      Author: Fei Wang <wangfei1@huawei.com>
      
      Closes #3761 from scwf/patch-10 and squashes the following commits:
      
      46dc344 [Fei Wang] revert console.printError(rc.getErrorMessage())
      0330e07 [wangfei] avoid to print error message repeatedly
      1614a11 [wangfei] spark-sql abort when passed in a wrong sql
    • [SPARK-4386] Improve performance when writing Parquet files · 7425bec3
      Michael Davies authored
      Convert type of RowWriteSupport.attributes to Array.
      
      Analysis of the performance of writing very wide tables shows that time is spent predominantly in the apply method on the attributes var. The type of attributes was previously LinearSeqOptimized, whose apply is O(N), which made each row write O(N^2).
      
      Measurements on a 575-column table showed this change made a 6x improvement in write times.
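      A sketch of why the type swap matters for wide rows:
      
      ```scala
      val columns = 575
      val asList: List[String]   = List.tabulate(columns)(i => s"col$i")
      val asArray: Array[String] = asList.toArray
      
      // A row write touches every column by index:
      //   asList(i)  walks i cells  -> O(N) per access, O(N^2) per row
      //   asArray(i) is direct      -> O(1) per access, O(N)   per row
      (0 until columns).foreach { i => val attr = asArray(i) /* write field i */ }
      ```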
      
      Author: Michael Davies <Michael.BellDavies@gmail.com>
      
      Closes #3843 from MickDavies/SPARK-4386 and squashes the following commits:
      
      892519d [Michael Davies] [SPARK-4386] Improve performance when writing Parquet files
    • [SPARK-4937][SQL] Normalizes conjunctions and disjunctions to eliminate common predicates · 61a99f6a
      Cheng Lian authored
      This PR is a simplified version of several filter optimization rules introduced in #3778 authored by scwf. Newly introduced optimizations include:
      
      1. `a && a` => `a`
      2. `a || a` => `a`
      3. `(a && b && c && ...) || (a && b && d && ...)` => `a && b && (c || d || ...)`
      
      The 3rd rule is particularly useful for optimizing the following query, which is planned into a cartesian product
      
      ```sql
      SELECT *
        FROM t1, t2
       WHERE (t1.key = t2.key AND t1.value > 10)
          OR (t1.key = t2.key AND t2.value < 20)
      ```
      
      to the following one, which is planned into an equi-join:
      
      ```sql
      SELECT *
        FROM t1, t2
       WHERE t1.key = t2.key
         AND (t1.value > 10 OR t2.value < 20)
      ```
      
      The example above is quite artificial, but common predicates are likely to appear in real life complex queries (like the one mentioned in #3778).
      
      A difference between this PR and #3778 is that these optimizations are not limited to `Filter`, but are generalized to all logical plan nodes. Thanks to scwf for bringing up these optimizations, and chenghao-intel for the generalization suggestion.
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3784 from liancheng/normalize-filters and squashes the following commits:
      
      caca560 [Cheng Lian] Moves filter normalization into BooleanSimplification rule
      4ab3a58 [Cheng Lian] Fixes test failure, adds more tests
      5d54349 [Cheng Lian] Fixes typo in comment
      2abbf8e [Cheng Lian] Forgot our sacred Apache licence header...
      cf95639 [Cheng Lian] Adds an optimization rule for filter normalization
    • [SPARK-4928][SQL] Fix: Operator '>,<,>=,<=' with decimal between different precision report error · a75dd83b
      guowei2 authored
      When an operator compares decimals of different precisions, we need to change them to unlimited precision.
      
      Author: guowei2 <guowei2@asiainfo.com>
      
      Closes #3767 from guowei2/SPARK-4928 and squashes the following commits:
      
      c6a6e3e [guowei2] fix code style
      3214e0a [guowei2] add test case
      b4985a2 [guowei2] fix code style
      27adf42 [guowei2] Fix: Operation '>,<,>=,<=' with Decimal report error
    • [SPARK-4930][SQL][DOCS]Update SQL programming guide, CACHE TABLE is eager · 2deac748
      luogankun authored
      `CACHE TABLE tbl` is now __eager__ by default, not __lazy__.
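      A usage sketch (assuming a `sqlContext` and an existing table `logs`):
      
      ```scala
      sqlContext.sql("CACHE TABLE logs")      // eager: blocks until the table is cached
      sqlContext.sql("CACHE LAZY TABLE logs") // opt back into lazy caching
      ```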
      
      Author: luogankun <luogankun@gmail.com>
      
      Closes #3773 from luogankun/SPARK-4930 and squashes the following commits:
      
      cc17b7d [luogankun] [SPARK-4930][SQL][DOCS]Update SQL programming guide, add CACHE [LAZY] TABLE [AS SELECT] ...
      bffe0e8 [luogankun] [SPARK-4930][SQL][DOCS]Update SQL programming guide, CACHE TABLE tbl is eager
    • [SPARK-4916][SQL][DOCS]Update SQL programming guide about cache section · f7a41a0e
      luogankun authored
      `SchemaRDD.cache()` now uses in-memory columnar storage.
      
      Author: luogankun <luogankun@gmail.com>
      
      Closes #3759 from luogankun/SPARK-4916 and squashes the following commits:
      
      7b39864 [luogankun] [SPARK-4916]Update SQL programming guide
      6018122 [luogankun] Merge branch 'master' of https://github.com/apache/spark into SPARK-4916
      0b93785 [luogankun] [SPARK-4916]Update SQL programming guide
      99b2336 [luogankun] [SPARK-4916]Update SQL programming guide
    • [SPARK-4493][SQL] Tests for IsNull / IsNotNull in the ParquetFilterSuite · 19a8802e
      Cheng Lian authored
      This is a follow-up of #3367 and #3644.
      
      At the time #3644 was written, #3367 hadn't been merged yet, thus `IsNull` and `IsNotNull` filters are not covered in the first version of `ParquetFilterSuite`. This PR adds corresponding test cases.
      
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #3748 from liancheng/test-null-filters and squashes the following commits:
      
      1ab943f [Cheng Lian] IsNull and IsNotNull Parquet filter test case for boolean type
      bcd616b [Cheng Lian] Adds Parquet filter pushedown tests for IsNull and IsNotNull
    • [Spark-4512] [SQL] Unresolved Attribute Exception in Sort By · 53f0a00b
      Cheng Hao authored
      It caused an exception for queries like:
      `SELECT key+key FROM src SORT BY value;`
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3386 from chenghao-intel/sort and squashes the following commits:
      
      38c78cc [Cheng Hao] revert the SortPartition in SparkStrategies
      7e9dd15 [Cheng Hao] update the typo
      fcd1d64 [Cheng Hao] rebase the latest master and update the SortBy unit test
    • [SPARK-5002][SQL] Using ascending by default when not specify order in order by · daac2213
      wangfei authored
      Spark SQL did not support `SELECT a, b FROM testData2 ORDER BY a desc, b`; a column without an explicit direction now defaults to ascending.
      
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3838 from scwf/orderby and squashes the following commits:
      
      114b64a [wangfei] remove nouse methods
      48145d3 [wangfei] fix order, using asc by default
    • [SPARK-4904] [SQL] Remove the unnecessary code change in Generic UDF · 63b84b7d
      Cheng Hao authored
      Since #3429 has been merged, the bug of wrapping to Writable for HiveGenericUDF is resolved, so we can safely remove the foldable check in `HiveGenericUdf.eval`, which was discussed in #2802.
      
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3745 from chenghao-intel/generic_udf and squashes the following commits:
      
      622ad03 [Cheng Hao] Remove the unnecessary code change in Generic UDF
    • [SPARK-4959] [SQL] Attributes are case sensitive when using a select query from a projection · 5595eaa7
      Cheng Hao authored
      Author: Cheng Hao <hao.cheng@intel.com>
      
      Closes #3796 from chenghao-intel/spark_4959 and squashes the following commits:
      
      3ec08f8 [Cheng Hao] Replace the attribute in comparing its exprId other than itself
    • [SPARK-4975][SQL] Fix HiveInspectorSuite test failure · 65357f11
      scwf authored
      HiveInspectorSuite test failure:
      
      ```
      [info] - wrap / unwrap null, constant null and writables *** FAILED *** (21 milliseconds)
      [info] 1 did not equal 0 (HiveInspectorSuite.scala:136)
      ```
      
      This is because the original date (3914-10-23) does not equal the date returned by `unwrap` (3914-10-22). Setting the TimeZone and Locale fixes this.
      Another minor change renames `def checkValues(v1: Any, v2: Any): Unit` to `def checkValue(v1: Any, v2: Any): Unit` to make the code clearer.
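      A sketch of the kind of pinning that makes date round-trips deterministic (the specific zone and locale here are illustrative):
      
      ```scala
      import java.util.{Locale, TimeZone}
      
      // Fix the defaults before the suite runs so wrapping/unwrapping a date
      // cannot shift across a day boundary on differently-configured hosts.
      TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
      Locale.setDefault(Locale.US)
      ```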
      
      Author: scwf <wangfei1@huawei.com>
      Author: Fei Wang <wangfei1@huawei.com>
      
      Closes #3814 from scwf/fix-inspectorsuite and squashes the following commits:
      
      d8531ef [Fei Wang] Delete test.log
      72b19a9 [scwf] fix HiveInspectorSuite test error
    • [SQL] enable view test · 94d60b70
      Daoyuan Wang authored
      This is a follow-up of #3396; it just adds a test to the white list.
      
      Author: Daoyuan Wang <daoyuan.wang@intel.com>
      
      Closes #3826 from adrian-wang/viewtest and squashes the following commits:
      
      f105f68 [Daoyuan Wang] enable view test
    • [SPARK-4908][SQL] Prevent multiple concurrent hive native commands · 480bd1d2
      Michael Armbrust authored
      This is just a quick fix that locks when calling `runHive`. If we can find a way to avoid the error without a global lock, that would be better.
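      A sketch of the quick fix described (the `runHive` signature is assumed):
      
      ```scala
      object HiveCommandRunner {
        // Coarse-grained: serialize all native command executions process-wide,
        // trading concurrency for correctness until a finer lock is found.
        def runHive(cmd: String): Seq[String] = synchronized {
          // ... hand the command to the Hive driver and collect its output ...
          Seq.empty
        }
      }
      ```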
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #3834 from marmbrus/hiveConcurrency and squashes the following commits:
      
      bf25300 [Michael Armbrust] prevent multiple concurrent hive native commands