  1. Jan 25, 2015
    • SPARK-4430 [STREAMING] [TEST] Apache RAT Checks fail spuriously on test files · 0528b85c
      Sean Owen authored
      Another trivial one. The RAT failure was due to temp files from `FailureSuite` not being cleaned up. This just makes the cleanup more reliable by using the standard temp dir mechanism.
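      A minimal sketch of the approach, assuming Spark's internal `Utils.createTempDir` helper (which registers the directory for deletion at JVM shutdown); the file name below is illustrative:

      ```scala
      import org.apache.spark.util.Utils

      // Instead of hand-rolled temp files that can outlive the test run, acquire
      // a temp directory that Spark deletes via its shutdown hook.
      val tempDir = Utils.createTempDir()
      val checkpointFile = new java.io.File(tempDir, "checkpoint") // hypothetical test file
      ```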
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #4189 from srowen/SPARK-4430 and squashes the following commits:
      
      9ea63ff [Sean Owen] Properly acquire a temp directory to ensure it is cleaned up at shutdown, which helps avoid a RAT check failure
      0528b85c
  2. Jan 23, 2015
    • [SPARK-5315][Streaming] Fix reduceByWindow Java API not work bug · e0f7fb7f
      jerryshao authored
      The Java API's `reduceByWindow` is not actually Java-compatible; this change makes it so.
      
      The current solution is to deprecate the old API and add a new one. But since the old API is not actually correct, is keeping it meaningful, just to preserve binary compatibility? Also, even adding a new API still requires a Mima exclusion. So I'm not sure which is the better solution: changing the API in place, or deprecating the old API and adding a new one.
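      For reference, a hedged sketch of what a Java-friendly overload looks like inside the Java DStream wrapper; `Function2` is Spark's Java function interface, while the surrounding names (`T`, `dstream`, `JavaDStream[T]`) are the wrapper's context rather than code shown in this PR:

      ```scala
      import org.apache.spark.api.java.function.{Function2 => JFunction2}
      import org.apache.spark.streaming.Duration

      // Java callers cannot supply a Scala (T, T) => T, so the Java-compatible
      // overload takes the Java function interface and adapts it (sketch only).
      def reduceByWindow(
          reduceFunc: JFunction2[T, T, T],
          windowDuration: Duration,
          slideDuration: Duration): JavaDStream[T] =
        dstream.reduceByWindow((a: T, b: T) => reduceFunc.call(a, b), windowDuration, slideDuration)
      ```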
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #4104 from jerryshao/SPARK-5315 and squashes the following commits:
      
      5bc8987 [jerryshao] Address the comment
      c7aa1b4 [jerryshao] Deprecate the old one to keep binary compatible
      8e9dc67 [jerryshao] Fix JavaDStream reduceByWindow signature error
      e0f7fb7f
  3. Jan 22, 2015
    • [SPARK-5233][Streaming] Fix error replaying of WAL introduced bug · 3c3fa632
      jerryshao authored
      Because `BlockAllocationEvent` is missing in WAL recovery, the dangling event will mix into the new batch, which will lead to wrong results. Details can be seen in [SPARK-5233](https://issues.apache.org/jira/browse/SPARK-5233).
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #4032 from jerryshao/SPARK-5233 and squashes the following commits:
      
      f0b0c0b [jerryshao] Further address the comments
      a237c75 [jerryshao] Address the comments
      e356258 [jerryshao] Fix bug in unit test
      558bdc3 [jerryshao] Correctly replay the WAL log when recovering from failure
      3c3fa632
    • [SPARK-5147][Streaming] Delete the received data WAL log periodically · 3027f06b
      Tathagata Das authored
      This is a refactored fix based on jerryshao's PR #4037.
      This enables deletion of old WAL files containing the received block data.
      Improvements over #4037:
      - Respects the rememberDuration of all receiver streams. In #4037, if there were two receiver streams with different remember durations, the deletion would have been based on the shortest one, thus deleting data prematurely for the receiver stream with the longer remember duration. See the sketch after this list.
      - Added unit test to test creation of receiver WAL, automatic deletion, and respecting of remember duration.
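      A rough sketch of the threshold computation (names are illustrative); the key point is taking the maximum remember duration across all receiver input streams before cleaning:

      ```scala
      // Clean up only what even the longest-remembering receiver stream no longer needs.
      val maxRememberDuration = receiverInputStreams.map(_.rememberDuration).maxBy(_.milliseconds)
      val cleanupThreshTime = batchTime - maxRememberDuration
      receivedBlockTracker.cleanupOldBatches(cleanupThreshTime, waitForCompletion = false)
      ```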
      
      jerryshao I am going to merge this ASAP to make it into 1.2.1. Thanks for the initial draft of this PR; it made my job much easier.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #4149 from tdas/SPARK-5147 and squashes the following commits:
      
      730798b [Tathagata Das] Added comments.
      c4cf067 [Tathagata Das] Minor fixes
      2579b27 [Tathagata Das] Refactored the fix to make sure that the cleanup respects the remember duration of all the receiver streams
      2736fd1 [jerryshao] Delete the old WAL log periodically
      3027f06b
  4. Jan 21, 2015
    • [SPARK-5297][Streaming] Fix Java file stream type erasure problem · 424d8c6f
      jerryshao authored
      The current Java file stream doesn't support custom key/value types because the type information is lost; details can be seen in [SPARK-5297](https://issues.apache.org/jira/browse/SPARK-5297). This fixes the problem by deriving the correct `ClassTag` from the `Class[_]`.
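      The essence of the fix, sketched: `ClassTag(clazz)` rebuilds the tag from the `Class[_]` a Java caller passes, so the key/value types survive erasure (the helper name here is hypothetical):

      ```scala
      import scala.reflect.ClassTag

      // A Java caller hands us plain Class objects; ClassTag(clazz) recovers the
      // implicit tags that the underlying Scala fileStream needs after erasure.
      def classTagOf[T](clazz: Class[T]): ClassTag[T] = ClassTag(clazz)

      val keyTag = classTagOf(classOf[org.apache.hadoop.io.LongWritable])
      ```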
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #4101 from jerryshao/SPARK-5297 and squashes the following commits:
      
      e022ca3 [jerryshao] Add Mima exclusion
      ecd61b8 [jerryshao] Fix Java fileInputStream type erasure problem
      424d8c6f
    • [SPARK-5275] [Streaming] include python source code · bad6c572
      Davies Liu authored
      Include the python source code into assembly jar.
      
      cc mengxr pwendell
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #4128 from davies/build_streaming2 and squashes the following commits:
      
      546af4c [Davies Liu] fix indent
      48859b2 [Davies Liu] include python source code
      bad6c572
  5. Jan 20, 2015
    • [SPARK-4803] [streaming] Remove duplicate RegisterReceiver message · 4afad9c7
      Ilayaperumal Gopinathan authored
        - The ReceiverTracker receives `RegisterReceiver` messages twice:
           1) when the actor in `ReceiverSupervisorImpl`'s preStart is invoked
           2) after the receiver is started at the executor, in `onReceiverStart()` of `ReceiverSupervisorImpl`
      
      Though the RegisterReceiver message uses the same streamId, and the receiverInfo gets updated every time
      the message is processed at the `ReceiverTracker`, it makes sense to register the receiver only after the
      receiver is started.
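      A hedged sketch of the resulting flow (names are illustrative): registration happens exactly once, from `onReceiverStart()`, instead of also in the actor's preStart:

      ```scala
      // Register with the ReceiverTracker only after the receiver has actually
      // started; the duplicate registration from the actor's preStart is gone.
      override protected def onReceiverStart(): Unit = {
        val msg = RegisterReceiver(streamId, receiver.getClass.getSimpleName, hostname, actorRef)
        trackerActor ! msg
      }
      ```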
      
      Author: Ilayaperumal Gopinathan <igopinathan@pivotal.io>
      
      Closes #3648 from ilayaperumalg/RTActor-remove-prestart and squashes the following commits:
      
      868efab [Ilayaperumal Gopinathan] Increase receiverInfo collector timeout to 2 secs
      3118e5e [Ilayaperumal Gopinathan] Fix StreamingListenerSuite's startedReceiverStreamIds size
      634abde [Ilayaperumal Gopinathan] Remove duplicate RegisterReceiver message
      4afad9c7
  6. Jan 12, 2015
    • [SPARK-4999][Streaming] Change storeInBlockManager to false by default · 3aed3051
      jerryshao authored
      Currently a WAL-backed block is read out from HDFS and put into the BlockManager with storage level MEMORY_ONLY_SER by default. Since the WAL-backed block is already materialized in HDFS with fault tolerance, there is no need to put it into the BlockManager again by default.
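      Sketch of the changed default (constructor abbreviated; treat the parameter list as illustrative):

      ```scala
      import org.apache.spark.storage.StorageLevel

      // The WAL already holds a durable copy on HDFS, so re-inserting the block
      // into the BlockManager is now opt-in rather than the default.
      class WriteAheadLogBackedBlockRDD[T](
          /* ...other constructor args elided... */
          val storeInBlockManager: Boolean = false,
          val storageLevel: StorageLevel = StorageLevel.MEMORY_ONLY_SER)
      ```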
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3906 from jerryshao/SPARK-4999 and squashes the following commits:
      
      b95f95e [jerryshao] Change storeInBlockManager to false by default
      3aed3051
  7. Jan 10, 2015
  8. Jan 08, 2015
    • [SPARK-4048] Enhance and extend hadoop-provided profile. · 48cecf67
      Marcelo Vanzin authored
      This change does a few things to make the hadoop-provided profile more useful:
      
      - Create new profiles for other libraries / services that might be provided by the infrastructure
      - Simplify and fix the poms so that the profiles are only activated while building assemblies.
      - Fix tests so that they're able to run when the profiles are activated
      - Add a new env variable to be used by distributions that use these profiles to provide the runtime
        classpath for Spark jobs and daemons.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #2982 from vanzin/SPARK-4048 and squashes the following commits:
      
      82eb688 [Marcelo Vanzin] Add a comment.
      eb228c0 [Marcelo Vanzin] Fix borked merge.
      4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes.
      371ebee [Marcelo Vanzin] Review feedback.
      52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      322f882 [Marcelo Vanzin] Fix merge fail.
      f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
      9640503 [Marcelo Vanzin] Cleanup child process log message.
      115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom).
      e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile.
      7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles.
      1be73d4 [Marcelo Vanzin] Restore flume-provided profile.
      d1399ed [Marcelo Vanzin] Restore jetty dependency.
      82a54b9 [Marcelo Vanzin] Remove unused profile.
      5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles.
      1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver.
      f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provided list.
      9e4e001 [Marcelo Vanzin] Remove duplicate hive profile.
      d928d62 [Marcelo Vanzin] Redirect child stderr to parent's log.
      4d67469 [Marcelo Vanzin] Propagate SPARK_DIST_CLASSPATH on Yarn.
      417d90e [Marcelo Vanzin] Introduce "SPARK_DIST_CLASSPATH".
      2f95f0d [Marcelo Vanzin] Propagate classpath to child processes during testing.
      1adf91c [Marcelo Vanzin] Re-enable maven-install-plugin for a few projects.
      284dda6 [Marcelo Vanzin] Rework the "hadoop-provided" profile, add new ones.
      48cecf67
  9. Jan 06, 2015
    • SPARK-4159 [CORE] Maven build doesn't run JUnit test suites · 4cba6eb4
      Sean Owen authored
      This PR:
      
      - Reenables `surefire`, and copies config from `scalatest` (which is itself an old fork of `surefire`, so similar)
      - Tells `surefire` to test only Java tests
      - Enables `surefire` and `scalatest` for all children, and in turn eliminates some duplication.
      
      For me this causes the Scala and Java tests to be run once each, it seems, as desired. It doesn't affect the SBT build but works for Maven. I still need to verify that all of the Scala tests and Java tests are being run.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3651 from srowen/SPARK-4159 and squashes the following commits:
      
      2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete
      12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit.
      e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent
      4cba6eb4
    • [SPARK-1600] Refactor FileInputStream tests to remove Thread.sleep() calls and SystemClock usage · a6394bc2
      Josh Rosen authored
      This patch refactors Spark Streaming's FileInputStream tests to remove uses of Thread.sleep() and SystemClock, which should hopefully resolve some longstanding flakiness in these tests (see SPARK-1600).
      
      Key changes:
      
      - Modify FileInputDStream to use the scheduler's Clock instead of System.currentTimeMillis(); this allows it to be tested using ManualClock.
      - Fix a synchronization issue in ManualClock's `currentTime` method (see the sketch after this list).
      - Add a StreamingTestWaiter class which allows callers to block until a certain number of batches have finished.
      - Change the FileInputStream tests so that files' modification times are manually set based off of ManualClock; this eliminates many Thread.sleep calls.
      - Update these tests to use the withStreamingContext fixture.
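      A minimal sketch of the `ManualClock` synchronization fix mentioned above, with a stand-in `Clock` trait (the real one lives in the streaming scheduler):

      ```scala
      trait Clock { def currentTime(): Long } // minimal stand-in

      class ManualClock extends Clock {
        private var time = 0L

        // Reads and writes share one lock, so a test thread advancing the clock
        // is always visible to the scheduler thread (no torn or stale reads).
        def currentTime(): Long = synchronized { time }

        def setTime(timeToSet: Long): Unit = synchronized {
          time = timeToSet
          notifyAll()
        }
      }
      ```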
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3801 from JoshRosen/SPARK-1600 and squashes the following commits:
      
      e4494f4 [Josh Rosen] Address a potential race when setting file modification times
      8340bd0 [Josh Rosen] Use set comparisons for output.
      0b9c252 [Josh Rosen] Fix some ManualClock usage problems.
      1cc689f [Josh Rosen] ConcurrentHashMap -> SynchronizedMap
      db26c3a [Josh Rosen] Use standard timeout in ScalaTest `eventually` blocks.
      3939432 [Josh Rosen] Rename StreamingTestWaiter to BatchCounter
      0b9c3a1 [Josh Rosen] Wait for checkpoint to complete
      863d71a [Josh Rosen] Remove Thread.sleep that was used to make task run slowly
      b4442c3 [Josh Rosen] batchTimeToSelectedFiles should be thread-safe
      15b48ee [Josh Rosen] Replace several TestWaiter methods w/ ScalaTest eventually.
      fffc51c [Josh Rosen] Revert "Remove last remaining sleep() call"
      dbb8247 [Josh Rosen] Remove last remaining sleep() call
      566a63f [Josh Rosen] Fix log message and comment typos
      da32f3f [Josh Rosen] Fix log message and comment typos
      3689214 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-1600
      c8f06b1 [Josh Rosen] Remove Thread.sleep calls in FileInputStream CheckpointSuite test.
      d4f2d87 [Josh Rosen] Refactor file input stream tests to not rely on SystemClock.
      dda1403 [Josh Rosen] Add StreamingTestWaiter class.
      3c3efc3 [Josh Rosen] Synchronize `currentTime` in ManualClock
      a95ddc4 [Josh Rosen] Modify FileInputDStream to use Clock class.
      a6394bc2
  10. Jan 04, 2015
    • [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs · 939ba1f8
      Josh Rosen authored
      This patch disables output spec. validation for jobs launched through Spark Streaming, since this interferes with checkpoint recovery.
      
      Hadoop OutputFormats have a `checkOutputSpecs` method which performs certain checks prior to writing output, such as checking whether the output directory already exists.  SPARK-1100 added checks for FileOutputFormat, SPARK-1677 (#947) added a SparkConf configuration to disable these checks, and SPARK-2309 (#1088) extended these checks to run for all OutputFormats, not just FileOutputFormat.
      
      In Spark Streaming, we might have to re-process a batch during checkpoint recovery, so `save` actions may be called multiple times.  In addition to `DStream`'s own save actions, users might use `transform` or `foreachRDD` and call the `RDD` and `PairRDD` save actions.  When output spec. validation is enabled, the second calls to these actions will fail due to existing output.
      
      This patch automatically disables output spec. validation for jobs submitted by the Spark Streaming scheduler.  This is done by using Scala's `DynamicVariable` to propagate the bypass setting without having to mutate SparkConf or introduce a global variable.
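      A condensed sketch of the mechanism; the real names live in `PairRDDFunctions` and the streaming scheduler, so treat the specifics below as illustrative:

      ```scala
      import scala.util.DynamicVariable

      // Dynamically scoped flag: only code running inside withValue sees the bypass,
      // so SparkConf is never mutated and nothing global leaks between jobs.
      val disableOutputSpecValidation = new DynamicVariable[Boolean](false)

      def runStreamingJob(job: () => Unit): Unit =
        disableOutputSpecValidation.withValue(true) {
          job() // any save action invoked here skips checkOutputSpecs
        }

      // At the save-action site:
      def isOutputSpecValidationEnabled(confSetting: Boolean): Boolean =
        confSetting && !disableOutputSpecValidation.value
      ```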
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3832 from JoshRosen/SPARK-4835 and squashes the following commits:
      
      36eaf35 [Josh Rosen] Add comment explaining use of transform() in test.
      6485cf8 [Josh Rosen] Add test case in Streaming; fix bug for transform()
      7b3e06a [Josh Rosen] Remove Streaming-specific setting to undo this change; update conf. guide
      bf9094d [Josh Rosen] Revise disableOutputSpecValidation() comment to not refer to Spark Streaming.
      e581d17 [Josh Rosen] Deduplicate isOutputSpecValidationEnabled logic.
      762e473 [Josh Rosen] [SPARK-4835] Disable validateOutputSpecs for Spark Streaming jobs.
      939ba1f8
  11. Jan 02, 2015
    • [SPARK-3325][Streaming] Add a parameter to the method print in class DStream · bd88b718
      Yadong Qi authored
      This PR is a fixed version of the original PR #3237 by watermen and scwf.
      This adds the ability to specify how many elements to print in `DStream.print`.
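      Sketch of the resulting API inside `DStream` (implementation abbreviated; details may differ):

      ```scala
      // Print the first `num` elements of each RDD generated in this DStream.
      def print(num: Int): Unit = foreachRDD { (rdd, time) =>
        val firstNum = rdd.take(num + 1)
        println(s"Time: $time")
        firstNum.take(num).foreach(println)
        if (firstNum.length > num) println("...") // hint that output was truncated
      }

      // The no-arg form keeps the old behavior of showing 10 elements.
      def print(): Unit = print(10)
      ```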
      
      Author: Yadong Qi <qiyadong2010@gmail.com>
      Author: q00251598 <qiyadong@huawei.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: wangfei <wangfei1@huawei.com>
      
      Closes #3865 from tdas/print-num and squashes the following commits:
      
      cd34e9e [Tathagata Das] Fix bug
      7c09f16 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into HEAD
      bb35d1a [Yadong Qi] Update MimaExcludes.scala
      f8098ca [Yadong Qi] Update MimaExcludes.scala
      f6ac3cb [Yadong Qi] Update MimaExcludes.scala
      e4ed897 [Yadong Qi] Update MimaExcludes.scala
      3b9d5cf [wangfei] fix conflicts
      ec8a3af [q00251598] move to  Spark 1.3
      26a70c0 [q00251598] extend the Python DStream's print
      b589a4b [q00251598] add another print function
      bd88b718
  12. Dec 31, 2014
    • [SPARK-5035] [Streaming] ReceiverMessage trait should extend Serializable · fe6efacc
      Josh Rosen authored
      Spark Streaming's ReceiverMessage trait should extend Serializable in order to fix a subtle bug that only occurs when running on a real cluster:
      
      If you attempt to send a fire-and-forget message to a remote Akka actor and that message cannot be serialized, then this seems to lead to more-or-less silent failures. As an optimization, Akka skips message serialization for messages sent within the same JVM. As a result, Spark's unit tests will never fail due to non-serializable Akka messages, but these will cause mostly-silent failures when running on a real cluster.
      
      Before this patch, here was the code for ReceiverMessage:
      
      ```
      /** Messages sent to the NetworkReceiver. */
      private[streaming] sealed trait ReceiverMessage
      private[streaming] object StopReceiver extends ReceiverMessage
      ```
      
      Since ReceiverMessage does not extend Serializable and StopReceiver is a regular `object`, not a `case object`, StopReceiver will throw serialization errors. As a result, graceful receiver shutdown is broken on real clusters (and local-cluster mode) but works in local modes. If you want to reproduce this, try running the word count example from the Streaming Programming Guide in the Spark shell:
      
      ```
      import org.apache.spark._
      import org.apache.spark.streaming._
      import org.apache.spark.streaming.StreamingContext._
      val ssc = new StreamingContext(sc, Seconds(10))
      // Create a DStream that will connect to hostname:port, like localhost:9999
      val lines = ssc.socketTextStream("localhost", 9999)
      // Split each line into words
      val words = lines.flatMap(_.split(" "))
      // Count each word in each batch
      val pairs = words.map(word => (word, 1))
      val wordCounts = pairs.reduceByKey(_ + _)
      // Print the first ten elements of each RDD generated in this DStream to the console
      wordCounts.print()
      ssc.start()
      Thread.sleep(10000)
      ssc.stop(true, true)
      ```
      
      Prior to this patch, this would work correctly in local mode but fail when running against a real cluster (it would report that some receivers were not shut down).
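      And the fix itself, sketched, is one line on the trait:

      ```
      /** Messages sent to the NetworkReceiver. */
      private[streaming] sealed trait ReceiverMessage extends Serializable
      private[streaming] object StopReceiver extends ReceiverMessage
      ```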
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3857 from JoshRosen/SPARK-5035 and squashes the following commits:
      
      71d0eae [Josh Rosen] [SPARK-5035] ReceiverMessage trait should extend Serializable.
      fe6efacc
    • [SPARK-5028][Streaming]Add total received and processed records metrics to Streaming UI · fdc2aa49
      jerryshao authored
      This is follow-up work to [SPARK-4537](https://issues.apache.org/jira/browse/SPARK-4537), adding the total received records and processed records metrics back to the UI.
      
      ![screenshot](https://dl.dropboxusercontent.com/u/19230832/screenshot.png)
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3852 from jerryshao/SPARK-5028 and squashes the following commits:
      
      c8c4877 [jerryshao] Add total received and processed metrics to Streaming UI
      fdc2aa49
    • [SPARK-4790][STREAMING] Fix ReceivedBlockTrackerSuite waits for old files to get deleted before continuing · 3610d3c6
      Hari Shreedharan authored
      Since the deletes happen asynchronously, the getFileStatus call might throw an exception on older HDFS
      versions if the delete occurs between the time listFiles is called on the directory and the time
      getFileStatus is called on the file.
      
      This PR addresses this by adding an option to delete the files synchronously and then waiting for the deletion to
      complete before proceeding.
      
      Author: Hari Shreedharan <hshreedharan@apache.org>
      
      Closes #3726 from harishreedharan/spark-4790 and squashes the following commits:
      
      bbbacd1 [Hari Shreedharan] Call cleanUpOldLogs only once in the tests.
      3255f17 [Hari Shreedharan] Add test for async deletion. Remove method from ReceiverTracker that does not take waitForCompletion.
      e4c83ec [Hari Shreedharan] Making waitForCompletion a mandatory param. Remove eventually from WALSuite since the cleanup method returns only after all files are deleted.
      af00fd1 [Hari Shreedharan] [SPARK-4790][STREAMING] Fix ReceivedBlockTrackerSuite waits for old files to get deleted before continuing.
      3610d3c6
  13. Dec 30, 2014
    • [SPARK-1010] Clean up uses of System.setProperty in unit tests · 352ed6bb
      Josh Rosen authored
      Several of our tests call System.setProperty (or test code which implicitly sets system properties) and don't always reset/clear the modified properties, which can create ordering dependencies between tests and cause hard-to-diagnose failures.
      
      This patch removes most uses of System.setProperty from our tests, since in most cases we can use SparkConf to set these configurations (there are a few exceptions, including the tests of SparkConf itself).
      
      For the cases where we continue to use System.setProperty, this patch introduces a `ResetSystemProperties` ScalaTest mixin class which snapshots the system properties before individual tests and automatically restores them on test completion / failure.  See the block comment at the top of the ResetSystemProperties class for more details.
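      A minimal sketch of such a mixin, assuming ScalaTest's `BeforeAndAfterEach`; the real class clones the properties slightly differently:

      ```scala
      import java.util.Properties
      import org.scalatest.{BeforeAndAfterEach, Suite}

      trait ResetSystemProperties extends BeforeAndAfterEach { this: Suite =>
        private var oldProperties: Properties = _

        override def beforeEach(): Unit = {
          // Snapshot by copy: System.getProperties returns a live reference.
          oldProperties = new Properties()
          oldProperties.putAll(System.getProperties)
          super.beforeEach()
        }

        override def afterEach(): Unit = {
          try super.afterEach()
          finally System.setProperties(oldProperties) // restore even if the test failed
        }
      }
      ```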
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3739 from JoshRosen/cleanup-system-properties-in-tests and squashes the following commits:
      
      0236d66 [Josh Rosen] Replace setProperty uses in two example programs / tools
      3888fe3 [Josh Rosen] Remove setProperty use in LocalJavaStreamingContext
      4f4031d [Josh Rosen] Add note on why SparkSubmitSuite needs ResetSystemProperties
      4742a5b [Josh Rosen] Clarify ResetSystemProperties trait inheritance ordering.
      0eaf0b6 [Josh Rosen] Remove setProperty call in TaskResultGetterSuite.
      7a3d224 [Josh Rosen] Fix trait ordering
      3fdb554 [Josh Rosen] Remove setProperty call in TaskSchedulerImplSuite
      bee20df [Josh Rosen] Remove setProperty calls in SparkContextSchedulerCreationSuite
      655587c [Josh Rosen] Remove setProperty calls in JobCancellationSuite
      3f2f955 [Josh Rosen] Remove System.setProperty calls in DistributedSuite
      cfe9cce [Josh Rosen] Remove use of system properties in SparkContextSuite
      8783ab0 [Josh Rosen] Remove TestUtils.setSystemProperty, since it is subsumed by the ResetSystemProperties trait.
      633a84a [Josh Rosen] Remove use of system properties in FileServerSuite
      25bfce2 [Josh Rosen] Use ResetSystemProperties in UtilsSuite
      1d1aa5a [Josh Rosen] Use ResetSystemProperties in SizeEstimatorSuite
      dd9492b [Josh Rosen] Use ResetSystemProperties in AkkaUtilsSuite
      b0daff2 [Josh Rosen] Use ResetSystemProperties in BlockManagerSuite
      e9ded62 [Josh Rosen] Use ResetSystemProperties in TaskSchedulerImplSuite
      5b3cb54 [Josh Rosen] Use ResetSystemProperties in SparkListenerSuite
      0995c4b [Josh Rosen] Use ResetSystemProperties in SparkContextSchedulerCreationSuite
      c83ded8 [Josh Rosen] Use ResetSystemProperties in SparkConfSuite
      51aa870 [Josh Rosen] Use withSystemProperty in ShuffleSuite
      60a63a1 [Josh Rosen] Use ResetSystemProperties in JobCancellationSuite
      14a92e4 [Josh Rosen] Use withSystemProperty in FileServerSuite
      628f46c [Josh Rosen] Use ResetSystemProperties in DistributedSuite
      9e3e0dd [Josh Rosen] Add ResetSystemProperties test fixture mixin; use it in SparkSubmitSuite.
      4dcea38 [Josh Rosen] Move withSystemProperty to TestUtils class.
      352ed6bb
    • [SPARK-4813][Streaming] Fix the issue that ContextWaiter didn't handle 'spurious wakeup' · 6a897829
      zsxwing authored
      Rewrote `ContextWaiter` using `Condition`, because it provides a convenient `awaitNanos` API for timeouts.
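      The core of the rewrite, sketched with illustrative field names: `awaitNanos` returns the remaining wait time, so looping on the predicate both absorbs spurious wakeups and keeps the timeout accurate:

      ```scala
      import java.util.concurrent.TimeUnit
      import java.util.concurrent.locks.ReentrantLock

      class ContextWaiter {
        private val lock = new ReentrantLock()
        private val condition = lock.newCondition()
        private var stopped = false

        def notifyStop(): Unit = {
          lock.lock()
          try { stopped = true; condition.signalAll() } finally lock.unlock()
        }

        /** Returns true if stopped, false if the timeout elapsed first. */
        def waitForStopOrError(timeoutMs: Long): Boolean = {
          lock.lock()
          try {
            var nanos = TimeUnit.MILLISECONDS.toNanos(timeoutMs)
            // A spurious wakeup just re-enters the loop with the remaining time.
            while (!stopped && nanos > 0) {
              nanos = condition.awaitNanos(nanos)
            }
            stopped
          } finally lock.unlock()
        }
      }
      ```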
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3661 from zsxwing/SPARK-4813 and squashes the following commits:
      
      52247f5 [zsxwing] Add explicit unit type
      be42bcf [zsxwing] Update as per review suggestion
      e06bd4f [zsxwing] Fix the issue that ContextWaiter didn't handle 'spurious wakeup'
      6a897829
  14. Dec 26, 2014
  15. Dec 25, 2014
    • [SPARK-4608][Streaming] Reorganize StreamingContext implicit to improve API convenience · f9ed2b66
      zsxwing authored
      There is only one implicit function, `toPairDStreamFunctions`, in `StreamingContext`. This PR does a similar reorganization to [SPARK-4397](https://issues.apache.org/jira/browse/SPARK-4397).
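      The gist, sketched: move the implicit into the `DStream` companion object, where the compiler finds it without `import StreamingContext._` (mirroring what SPARK-4397 did for `RDD`):

      ```scala
      import scala.reflect.ClassTag

      object DStream {
        // Implicits in the companion object of a type in the conversion are found
        // automatically, so user code no longer needs an explicit import.
        implicit def toPairDStreamFunctions[K, V](stream: DStream[(K, V)])
            (implicit kt: ClassTag[K], vt: ClassTag[V], ord: Ordering[K] = null)
          : PairDStreamFunctions[K, V] = {
          new PairDStreamFunctions[K, V](stream)
        }
      }
      ```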
      
      Compiled the following code with Spark Streaming 1.1.0 and ran it with this PR. Everything is fine.
      ```Scala
      import org.apache.spark._
      import org.apache.spark.streaming._
      import org.apache.spark.streaming.StreamingContext._
      
      object StreamingApp {
      
        def main(args: Array[String]) {
          val conf = new SparkConf().setMaster("local[2]").setAppName("FileWordCount")
          val ssc = new StreamingContext(conf, Seconds(10))
          val lines = ssc.textFileStream("/some/path")
          val words = lines.flatMap(_.split(" "))
          val pairs = words.map(word => (word, 1))
          val wordCounts = pairs.reduceByKey(_ + _)
          wordCounts.print()
      
          ssc.start()
          ssc.awaitTermination()
        }
      }
      ```
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3464 from zsxwing/SPARK-4608 and squashes the following commits:
      
      aa6d44a [zsxwing] Fix a copy-paste error
      f74c190 [zsxwing] Merge branch 'master' into SPARK-4608
      e6f9cc9 [zsxwing] Update the docs
      27833bb [zsxwing] Remove `import StreamingContext._`
      c15162c [zsxwing] Reorganize StreamingContext implicit to improve API convenience
      f9ed2b66
    • [SPARK-4537][Streaming] Expand StreamingSource to add more metrics · f205fe47
      jerryshao authored
      Add `processingDelay`, `schedulingDelay` and `totalDelay` for the last completed batch. Add `lastReceivedBatchRecords` and `totalReceivedBatchRecords` to the received records counting.
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3466 from jerryshao/SPARK-4537 and squashes the following commits:
      
      00f5f7f [jerryshao] Change the code style and add totalProcessedRecords
      44721a6 [jerryshao] Further address the comments
      c097ddc [jerryshao] Address the comments
      02dd44f [jerryshao] Fix the addressed comments
      c7a9376 [jerryshao] Expand StreamingSource to add more metrics
      f205fe47
  16. Dec 24, 2014
    • [SPARK-4873][Streaming] Use `Future.zip` instead of `Future.flatMap` (for-loop) in WriteAheadLogBasedBlockHandler · b4d0db80
      zsxwing authored
      Use `Future.zip` instead of `Future.flatMap` (for-loop). `zip` implies the two Futures will run concurrently, while `flatMap` usually means one Future depends on the other.
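      A self-contained illustration of the difference, with stand-ins for the BlockManager and WAL writes:

      ```scala
      import scala.concurrent.{Await, Future}
      import scala.concurrent.duration._
      import scala.concurrent.ExecutionContext.Implicits.global

      def storeInBlockManager(): String = { Thread.sleep(100); "block-id" }     // stand-in
      def writeToWriteAheadLog(): String = { Thread.sleep(100); "wal-segment" } // stand-in

      // zip starts both futures immediately, so the two stores run concurrently;
      // a for-comprehension (flatMap) over unstarted futures would serialize them.
      val both: Future[(String, String)] =
        Future(storeInBlockManager()).zip(Future(writeToWriteAheadLog()))

      Await.result(both, 10.seconds) // completes in ~100ms, not ~200ms
      ```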
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3721 from zsxwing/SPARK-4873 and squashes the following commits:
      
      46a2cd9 [zsxwing] Use Future.zip instead of Future.flatMap(for-loop)
      b4d0db80
    • SPARK-4297 [BUILD] Build warning fixes omnibus · 29fabb1b
      Sean Owen authored
      There are a number of warnings generated in a normal, successful build right now. They're mostly Java unchecked cast warnings, which can be suppressed. But there's a grab bag of other Scala language warnings and so on that can all be easily fixed. The forthcoming PR fixes about 90% of the build warnings I see now.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3157 from srowen/SPARK-4297 and squashes the following commits:
      
      8c9e469 [Sean Owen] Suppress unchecked cast warnings, and several other build warning fixes
      29fabb1b
  17. Dec 23, 2014
    • [SPARK-4671][Streaming]Do not replicate streaming block when WAL is enabled · 3f5f4cc4
      jerryshao authored
      Currently the streaming block will be replicated when a specific storage level is set. Since the WAL is already fault-tolerant, replication is needless and will hurt the throughput of the streaming application.
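      A hedged sketch of the effective-storage-level logic (the helper name is illustrative):

      ```scala
      import org.apache.spark.storage.StorageLevel

      // With the WAL providing durability, replication above 1 only costs
      // throughput, so demote the requested level to a single replica.
      def effectiveStorageLevel(requested: StorageLevel): StorageLevel = {
        if (requested.replication > 1) {
          StorageLevel(requested.useDisk, requested.useMemory,
            requested.useOffHeap, requested.deserialized, 1)
        } else {
          requested
        }
      }
      ```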
      
      Hi tdas, as we discussed about this issue, I fixed it with this implementation. I'm not sure if this is the way you want it; would you mind taking a look at it? Thanks a lot.
      
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3534 from jerryshao/SPARK-4671 and squashes the following commits:
      
      500b456 [jerryshao] Do not replicate streaming block when WAL is enabled
      3f5f4cc4
    • [SPARK-4802] [streaming] Remove receiverInfo once receiver is de-registered · 10d69e9c
      Ilayaperumal Gopinathan authored
        Once the streaming receiver is de-registered at the executor, the `ReceiverTrackerActor` needs to
      remove the corresponding receiverInfo entry from the `receiverInfo` map at the `ReceiverTracker`.
      
      Author: Ilayaperumal Gopinathan <igopinathan@pivotal.io>
      
      Closes #3647 from ilayaperumalg/receiverInfo-RTracker and squashes the following commits:
      
      6eb97d5 [Ilayaperumal Gopinathan] Polishing based on the review
      3640c86 [Ilayaperumal Gopinathan] Remove receiverInfo once receiver is de-registered
      10d69e9c
  18. Dec 15, 2014
    • [SPARK-4668] Fix some documentation typos. · 8176b7a0
      Ryan Williams authored
      Author: Ryan Williams <ryan.blake.williams@gmail.com>
      
      Closes #3523 from ryan-williams/tweaks and squashes the following commits:
      
      d2eddaa [Ryan Williams] code review feedback
      ce27fc1 [Ryan Williams] CoGroupedRDD comment nit
      c6cfad9 [Ryan Williams] remove unnecessary if statement
      b74ea35 [Ryan Williams] comment fix
      b0221f0 [Ryan Williams] fix a gendered pronoun
      c71ffed [Ryan Williams] use names on a few boolean parameters
      89954aa [Ryan Williams] clarify some comments in {Security,Shuffle}Manager
      e465dac [Ryan Williams] Saved building-spark.md with Dillinger.io
      83e8358 [Ryan Williams] fix pom.xml typo
      dc4662b [Ryan Williams] typo fixes in tuning.md, configuration.md
      8176b7a0
    • [SPARK-4826] Fix generation of temp file names in WAL tests · f6b8591a
      Josh Rosen authored
      This PR should fix SPARK-4826, an issue where a bug in how we generate temp. file names was causing spurious test failures in the write ahead log suites.
      
      Closes #3695.
      Closes #3701.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3704 from JoshRosen/SPARK-4826 and squashes the following commits:
      
      f2307f5 [Josh Rosen] Use Spark Utils class for directory creation/deletion
      a693ddb [Josh Rosen] remove unused Random import
      b275e41 [Josh Rosen] Move creation of temp. dir to beforeEach/afterEach.
      9362919 [Josh Rosen] [SPARK-4826] Fix bug in generation of temp file names. in WAL suites.
      86c1944 [Josh Rosen] Revert "HOTFIX: Disabling failing block manager test"
      f6b8591a
    • 4c067387 (Patrick Wendell)
  19. Nov 25, 2014
    • [SPARK-4196][SPARK-4602][Streaming] Fix serialization issue in PairDStreamFunctions.saveAsNewAPIHadoopFiles · 8838ad7c
      Tathagata Das authored
      Solves two JIRAs in one shot:
      - Makes the ForeachDStream created by saveAsNewAPIHadoopFiles serializable for checkpoints
      - Makes the default configuration object used by saveAsNewAPIHadoopFiles be Spark's hadoop configuration
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3457 from tdas/savefiles-fix and squashes the following commits:
      
      bb4729a [Tathagata Das] Same treatment for saveAsHadoopFiles
      b382ea9 [Tathagata Das] Fix serialization issue in PairDStreamFunctions.saveAsNewAPIHadoopFiles.
      8838ad7c
    • [SPARK-4601][Streaming] Set correct call site for streaming jobs so that it is displayed correctly on the Spark UI · 69cd53ea
      Tathagata Das authored
      When running NetworkWordCount, the description of the word count jobs is set as "getCallsite at DStream:xxx". It should instead be the line number of the streaming application that has the output operation that led to the job being created. The call site was being set incorrectly in the thread launching the jobs; this PR fixes that.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3455 from tdas/streaming-callsite-fix and squashes the following commits:
      
      69fc26f [Tathagata Das] Set correct call site for streaming jobs so that it is displayed correctly on the Spark UI
      69cd53ea
    • [SPARK-4381][Streaming]Add warning log when user set spark.master to local in Spark Streaming and there's no job executed · fef27b29
      jerryshao authored
      Author: jerryshao <saisai.shao@intel.com>
      
      Closes #3244 from jerryshao/SPARK-4381 and squashes the following commits:
      
      d2486c7 [jerryshao] Improve the warning log
      d726e85 [jerryshao] Add local[1] to the filter condition
      eca428b [jerryshao] Add warning log
      fef27b29
    • [SPARK-4535][Streaming] Fix the error in comments · a51118a3
      q00251598 authored
      change `NetworkInputDStream` to `ReceiverInputDStream`
      change `ReceiverInputTracker` to `ReceiverTracker`
      
      Author: q00251598 <qiyadong@huawei.com>
      
      Closes #3400 from watermen/fix-comments and squashes the following commits:
      
      75d795c [q00251598] change 'NetworkInputDStream' to 'ReceiverInputDStream' && change 'ReceiverInputTracker' to 'ReceiverTracker'
      a51118a3
  20. Nov 24, 2014
    • [SPARK-4518][SPARK-4519][Streaming] Refactored file stream to prevent files from being processed multiple times · cb0e9b09
      Tathagata Das authored
      Because of a corner case, a file already selected for batch t can get considered again for batch t+2. This refactoring fixes it by remembering all the files selected in the last minute, so that this corner case does not arise. It also uses the Spark context's hadoop configuration to access the file system API for listing directories.
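      A rough sketch of the remembering logic, with illustrative names:

      ```scala
      import scala.collection.mutable

      // Files selected within the remember window are tracked, so a file chosen
      // for batch t can never be re-selected for a later batch.
      val recentlySelectedFiles = new mutable.HashSet[String]()

      def isNewFile(path: String, modTime: Long, currentTime: Long, rememberWindowMs: Long): Boolean =
        modTime >= currentTime - rememberWindowMs && !recentlySelectedFiles.contains(path)
      ```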
      
      pwendell Please take a look. I still have not run long-running integration tests, so I cannot say for sure whether this has indeed solved the issue. You could do a first pass on it in the meantime.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3419 from tdas/filestream-fix2 and squashes the following commits:
      
      c19dd8a [Tathagata Das] Addressed PR comments.
      513b608 [Tathagata Das] Updated docs.
      d364faf [Tathagata Das] Added the current time condition back
      5526222 [Tathagata Das] Removed unnecessary imports.
      38bb736 [Tathagata Das] Fix long line.
      203bbc7 [Tathagata Das] Un-ignore tests.
      eaef4e1 [Tathagata Das] Fixed SPARK-4519
      9dbd40a [Tathagata Das] Refactored FileInputDStream to remember last few batches.
      cb0e9b09
  21. Nov 19, 2014
    • [SPARK-4294][Streaming] UnionDStream stream should express the requirements in the same way as TransformedDStream · c3002c4a
      Yadong Qi authored
      In class TransformedDStream:
      ```scala
      require(parents.length > 0, "List of DStreams to transform is empty")
      require(parents.map(_.ssc).distinct.size == 1, "Some of the DStreams have different contexts")
      require(parents.map(_.slideDuration).distinct.size == 1,
        "Some of the DStreams have different slide durations")
      ```
      
      In class UnionDStream:
      ```scala
      if (parents.length == 0) {
        throw new IllegalArgumentException("Empty array of parents")
      }
      if (parents.map(_.ssc).distinct.size > 1) {
        throw new IllegalArgumentException("Array of parents have different StreamingContexts")
      }
      if (parents.map(_.slideDuration).distinct.size > 1) {
        throw new IllegalArgumentException("Array of parents have different slide times")
      }
      ```
      
      The function is the same, but the realization is not. I think they should be the same.
      
      Author: Yadong Qi <qiyadong2010@gmail.com>
      
      Closes #3152 from watermen/bug-fix1 and squashes the following commits:
      
      ed66db6 [Yadong Qi] Change transform to union
      b6b3b8b [Yadong Qi] The same function should have the same realization.
      c3002c4a
    • [SPARK-4481][Streaming][Doc] Fix the wrong description of updateFunc · 3bf7ceeb
      zsxwing authored
      Removed the sentence "If `this` function returns None, then the corresponding state key-value pair will be eliminated." from the description of `updateFunc: (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)]`.
      
      Author: zsxwing <zsxwing@gmail.com>
      
      Closes #3356 from zsxwing/SPARK-4481 and squashes the following commits:
      
      76a9891 [zsxwing] Add a note that keys may be added or removed
      0ebc42a [zsxwing] Fix the wrong description of updateFunc
      3bf7ceeb
    • [SPARK-4482][Streaming] Disable ReceivedBlockTracker's write ahead log by default · 22fc4e75
      Tathagata Das authored
      The write ahead log of ReceivedBlockTracker gets enabled as soon as the checkpoint directory is set. This should not happen; the WAL should be enabled only if it is enabled in the Spark configuration.
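      Sketch of the gating condition; `spark.streaming.receiver.writeAheadLog.enable` is the 1.2-era config key, while the helper name here is illustrative:

      ```scala
      import org.apache.spark.SparkConf

      // A checkpoint directory alone is no longer enough; the WAL must be
      // explicitly enabled in the configuration as well.
      def shouldCreateWriteAheadLog(conf: SparkConf, checkpointDir: Option[String]): Boolean =
        conf.getBoolean("spark.streaming.receiver.writeAheadLog.enable", false) &&
          checkpointDir.isDefined
      ```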
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #3358 from tdas/SPARK-4482 and squashes the following commits:
      
      b740136 [Tathagata Das] Fixed bug in ReceivedBlockTracker
      22fc4e75
  22. Nov 18, 2014
    • Bumping version to 1.3.0-SNAPSHOT. · 397d3aae
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #3277 from vanzin/version-1.3 and squashes the following commits:
      
      7c3c396 [Marcelo Vanzin] Added temp repo to sbt build.
      5f404ff [Marcelo Vanzin] Add another exclusion.
      19457e7 [Marcelo Vanzin] Update old version to 1.2, add temporary 1.2 repo.
      3c8d705 [Marcelo Vanzin] Workaround for MIMA checks.
      e940810 [Marcelo Vanzin] Bumping version to 1.3.0-SNAPSHOT.
      397d3aae
  23. Nov 17, 2014
    • [SPARK-4180] [Core] Prevent creation of multiple active SparkContexts · 0f3ceb56
      Josh Rosen authored
      This patch adds error-detection logic to throw an exception when attempting to create multiple active SparkContexts in the same JVM, since this is currently unsupported and has been known to cause confusing behavior (see SPARK-2243 for more details).
      
      **The solution implemented here is only a partial fix.**  A complete fix would have the following properties:
      
      1. Only one SparkContext may ever be under construction at any given time.
      2. Once a SparkContext has been successfully constructed, any subsequent construction attempts should fail until the active SparkContext is stopped.
      3. If the SparkContext constructor throws an exception, then all resources created in the constructor should be cleaned up (SPARK-4194).
      4. If a user attempts to create a SparkContext but the creation fails, then the user should be able to create new SparkContexts.
      
      This PR only provides 2) and 4); we should be able to provide all of these properties, but the correct fix will involve larger changes to SparkContext's construction / initialization, so we'll target it for a different Spark release.
      
      ### The correct solution:
      
      I think that the correct way to do this would be to move the construction of SparkContext's dependencies into a static method in the SparkContext companion object.  Specifically, we could make the default SparkContext constructor `private` and change it to accept a `SparkContextDependencies` object that contains all of SparkContext's dependencies (e.g. DAGScheduler, ContextCleaner, etc.).  Secondary constructors could call a method on the SparkContext companion object to create the `SparkContextDependencies` and pass the result to the primary SparkContext constructor.  For example:
      
      ```scala
      class SparkContext private (deps: SparkContextDependencies) {
        def this(conf: SparkConf) {
          this(SparkContext.getDeps(conf))
        }
      }
      
      object SparkContext {
        private[spark] def getDeps(conf: SparkConf): SparkContextDependencies = synchronized {
          if (anotherSparkContextIsActive) { throw new Exception(...) }
          var dagScheduler: DAGScheduler = null
          try {
              dagScheduler = new DAGScheduler(...)
              [...]
          } catch {
            case e: Exception =>
               Option(dagScheduler).foreach(_.stop())
                [...]
          }
          SparkContextDependencies(dagScheduler, ....)
        }
      }
      ```
      
      This gives us mutual exclusion and ensures that any resources created during the failed SparkContext initialization are properly cleaned up.
      
      This indirection is necessary to maintain binary compatibility.  In retrospect, it would have been nice if SparkContext had no private constructors and could only be created through builder / factory methods on its companion object, since this buys us lots of flexibility and makes dependency injection easier.
      
      ### Alternative solutions:
      
      As an alternative solution, we could refactor SparkContext's primary constructor to perform all object creation in a giant `try-finally` block.  Unfortunately, this will require us to turn a bunch of `vals` into `vars` so that they can be assigned from the `try` block.  If we still want `vals`, we could wrap each `val` in its own `try` block (since the try block can return a value), but this will lead to extremely messy code and won't guard against the introduction of future code which doesn't properly handle failures.
      
      The more complex approach outlined above gives us some nice dependency injection benefits, so I think that might be preferable to a `var`-ification.
      
      ### This PR's solution:
      
      - At the start of the constructor, check whether some other SparkContext is active; if so, throw an exception.
      - If another SparkContext might be under construction (or has thrown an exception during construction), allow the new SparkContext to begin construction but log a warning (since resources might have been leaked from a failed creation attempt).
      - At the end of the SparkContext constructor, check whether some other SparkContext constructor has raced and successfully created an active context.  If so, throw an exception.
      
      This guarantees that no two SparkContexts will ever be active and exposed to users (since we check at the very end of the constructor).  If two threads race to construct SparkContexts, then one of them will win and another will throw an exception.
      
      This exception can be turned into a warning by setting `spark.driver.allowMultipleContexts = true`.  The exception is disabled in unit tests, since there are some suites (such as Hive) that may require more significant refactoring to clean up their SparkContexts.  I've made a few changes to other suites' test fixtures to properly clean up SparkContexts so that the unit test logs contain fewer warnings.
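      A condensed sketch of the end-of-constructor check (the config key is real; the guard object and names are illustrative):

      ```scala
      import org.apache.spark.{SparkConf, SparkException}

      object SparkContextGuard { // stand-in for logic in SparkContext's companion
        private var activeContext: Option[AnyRef] = None

        // Called at the very end of the SparkContext constructor: if another
        // context won the race, fail unless the user explicitly opted out.
        def assertNoOtherContextIsRunning(sc: AnyRef, conf: SparkConf): Unit = synchronized {
          if (activeContext.exists(_ ne sc) &&
              !conf.getBoolean("spark.driver.allowMultipleContexts", false)) {
            throw new SparkException(
              "Only one SparkContext may be running in this JVM (see SPARK-2243).")
          }
        }
      }
      ```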
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #3121 from JoshRosen/SPARK-4180 and squashes the following commits:
      
      23c7123 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      d38251b [Josh Rosen] Address latest round of feedback.
      c0987d3 [Josh Rosen] Accept boolean instead of SparkConf in methods.
      85a424a [Josh Rosen] Incorporate more review feedback.
      372d0d3 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      f5bb78c [Josh Rosen] Update mvn build, too.
      d809cb4 [Josh Rosen] Improve handling of failed SparkContext creation attempts.
      79a7e6f [Josh Rosen] Fix commented out test
      a1cba65 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      7ba6db8 [Josh Rosen] Add utility to set system properties in tests.
      4629d5c [Josh Rosen] Set spark.driver.allowMultipleContexts=true in tests.
      ed17e14 [Josh Rosen] Address review feedback; expose hack workaround for existing unit tests.
      1c66070 [Josh Rosen] Merge remote-tracking branch 'origin/master' into SPARK-4180
      06c5c54 [Josh Rosen] Add / improve SparkContext cleanup in streaming BasicOperationsSuite
      d0437eb [Josh Rosen] StreamingContext.stop() should stop SparkContext even if StreamingContext has not been started yet.
      c4d35a2 [Josh Rosen] Log long form of creation site to aid debugging.
      918e878 [Josh Rosen] Document "one SparkContext per JVM" limitation.
      afaa7e3 [Josh Rosen] [SPARK-4180] Prevent creations of multiple active SparkContexts.
      0f3ceb56
  24. Nov 14, 2014
    • [SPARK-4062][Streaming]Add ReliableKafkaReceiver in Spark Streaming Kafka connector · 5930f64b
      jerryshao authored
      Add ReliableKafkaReceiver in Kafka connector to prevent data loss if WAL in Spark Streaming is enabled. Details and design doc can be seen in [SPARK-4062](https://issues.apache.org/jira/browse/SPARK-4062).
      
      Author: jerryshao <saisai.shao@intel.com>
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      Author: Saisai Shao <saisai.shao@intel.com>
      
      Closes #2991 from jerryshao/kafka-refactor and squashes the following commits:
      
      5461f1c [Saisai Shao] Merge pull request #8 from tdas/kafka-refactor3
      eae4ad6 [Tathagata Das] Refectored KafkaStreamSuiteBased to eliminate KafkaTestUtils and made Java more robust.
      fab14c7 [Tathagata Das] minor update.
      149948b [Tathagata Das] Fixed mistake
      14630aa [Tathagata Das] Minor updates.
      d9a452c [Tathagata Das] Minor updates.
      ec2e95e [Tathagata Das] Removed the receiver's locks and essentially reverted to Saisai's original design.
      2a20a01 [jerryshao] Address some comments
      9f636b3 [Saisai Shao] Merge pull request #5 from tdas/kafka-refactor
      b2b2f84 [Tathagata Das] Refactored Kafka receiver logic and Kafka testsuites
      e501b3c [jerryshao] Add Mima excludes
      b798535 [jerryshao] Fix the missed issue
      e5e21c1 [jerryshao] Change to while loop
      ea873e4 [jerryshao] Further address the comments
      98f3d07 [jerryshao] Fix comment style
      4854ee9 [jerryshao] Address all the comments
      96c7a1d [jerryshao] Update the ReliableKafkaReceiver unit test
      8135d31 [jerryshao] Fix flaky test
      a949741 [jerryshao] Address the comments
      16bfe78 [jerryshao] Change the ordering of imports
      0894aef [jerryshao] Add some comments
      77c3e50 [jerryshao] Code refactor and add some unit tests
      dd9aeeb [jerryshao] Initial commit for reliable Kafka receiver
      5930f64b