  1. Mar 13, 2016
    • [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items) · 18408528
      Sean Owen authored
      
      ## What changes were proposed in this pull request?
      
      - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
      - Same for `InputStreamReader` and `OutputStreamWriter` constructors
      - Standardizes on UTF-8 everywhere
      - Standardizes specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which would mean handling `UnsupportedEncodingException`)
      - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit https://github.com/srowen/spark/commit/1deecd8d9ca986d8adb1a42d315890ce5349d29c )
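      The difference can be illustrated in plain Java (a minimal sketch of the pattern, not Spark code):

      ```java
      import java.nio.charset.StandardCharsets;

      public class CharsetDemo {
          public static void main(String[] args) {
              String s = "caf\u00e9";
              // Relies on the platform default charset; results vary by JVM/locale:
              byte[] platformBytes = s.getBytes();
              // Explicit and portable. Passing the Charset constant also avoids the
              // checked UnsupportedEncodingException that getBytes("UTF-8") declares:
              byte[] utf8Bytes = s.getBytes(StandardCharsets.UTF_8);
              String roundTrip = new String(utf8Bytes, StandardCharsets.UTF_8);
              System.out.println(roundTrip.equals(s)); // true
              System.out.println(utf8Bytes.length);    // 5 ('\u00e9' is two bytes in UTF-8)
          }
      }
      ```

      The `Charset`-taking overloads of `String.getBytes` and `new String(byte[], ...)` throw no checked exception, which is why the constant is preferred over the string name.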
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11657 from srowen/SPARK-13823.
  2. Mar 09, 2016
    • [SPARK-13595][BUILD] Move docker, extras modules into external · 256704c7
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/`
      
      ## How was this patch tested?
      
      This is tested with Jenkins tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11523 from srowen/SPARK-13595.
  3. Mar 04, 2016
    • [SPARK-12073][STREAMING] backpressure rate controller consumes events preferentially from lagging partitions · f19228ee
      Jason White authored
      
      I'm pretty sure this is the reason we couldn't easily recover from an unbalanced Kafka partition under heavy load when using backpressure.
      
      `maxMessagesPerPartition` calculates an appropriate limit for the message rate from all partitions, and then divides by the number of partitions to determine how many messages to retrieve per partition. The problem with this approach is that when one partition is behind by millions of records (due to random Kafka issues), but the rate estimator calculates only 100k total messages can be retrieved, each partition (out of say 32) only retrieves max 100k/32=3125 messages.
      
      This PR (still needing a test) determines a per-partition desired message count by using the current lag for each partition to preferentially weight the total message limit among the partitions. In this situation, if each partition gets 1k messages, but 1 partition starts 1M behind, then the total number of messages to retrieve is (32 * 1k + 1M) = 1032000 messages, of which the one partition needs 1001000. So, it gets (1001000 / 1032000) = 97% of the 100k messages, and the other 31 partitions share the remaining 3%.
      
      Assuming all of the 100k messages are retrieved and processed within the batch window, the rate calculator will increase the number of messages to retrieve in the next batch, until it reaches a new stable point or the backlog is finished processing.
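      The lag-proportional split described above can be sketched as a small calculation (a hypothetical illustration of the arithmetic; names and shape are not the PR's actual code):

      ```java
      import java.util.HashMap;
      import java.util.Map;

      public class LagWeighting {
          // Split a total message budget across partitions in proportion
          // to each partition's current lag (messages still unconsumed).
          static Map<Integer, Long> weightByLag(Map<Integer, Long> lagPerPartition, long totalBudget) {
              long totalLag = lagPerPartition.values().stream().mapToLong(Long::longValue).sum();
              Map<Integer, Long> allocation = new HashMap<>();
              for (Map.Entry<Integer, Long> e : lagPerPartition.entrySet()) {
                  allocation.put(e.getKey(), totalBudget * e.getValue() / totalLag);
              }
              return allocation;
          }

          public static void main(String[] args) {
              Map<Integer, Long> lag = new HashMap<>();
              for (int p = 0; p < 32; p++) lag.put(p, 1000L); // 31 partitions 1k behind
              lag.put(0, 1_001_000L);                          // one partition ~1M behind
              // With a 100k budget, the lagging partition gets ~97% of it,
              // matching the worked example in the description.
              Map<Integer, Long> alloc = weightByLag(lag, 100_000L);
              System.out.println(alloc.get(0));
              System.out.println(alloc.get(1));
          }
      }
      ```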
      
      We're going to try deploying this internally at Shopify to see if this resolves our issue.
      
      tdas koeninger holdenk
      
      Author: Jason White <jason.white@shopify.com>
      
      Closes #10089 from JasonMWhite/rate_controller_offsets.
  4. Mar 03, 2016
    • [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule · b5f02d67
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
      This issue aims to remove unused imports from Java/Scala code and add an `UnusedImports` checkstyle rule to help developers.
      
      ## How was this patch tested?
      ```
      ./dev/lint-java
      ./build/sbt compile
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11438 from dongjoon-hyun/SPARK-13583.
    • [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x · e97fc7f1
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
      
      - Inner class should be static
      - Mismatched hashCode/equals
      - Overflow in compareTo
      - Unchecked warnings
      - Misuse of assert vs. junit.Assert
      - `get(a) + getOrElse(b)` -> `getOrElse(a, b)`
      - Array/String `.size` -> `.length` (occasionally, -> `.isEmpty` / `.nonEmpty`) to avoid implicit conversions
      - Dead code
      - tailrec
      - `exists(_ == x)` -> `contains(x)`; `find` + `nonEmpty` -> `exists`; `filter` + `size` -> `count`
      - `reduce(_ + _)` -> `sum`; `map` + `flatten` -> `flatMap`
      
      The most controversial may be `.size` -> `.length`, simply because of the sheer number of call sites it touches. It is intended to avoid implicit conversions that might be expensive in some places.
      
      ## How was this patch tested?
      
      Existing Jenkins unit tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11292 from srowen/SPARK-13423.
  5. Feb 25, 2016
  6. Feb 22, 2016
    • [SPARK-13186][STREAMING] migrate away from SynchronizedMap · 8f35d3ea
      Huaxin Gao authored
      The trait `SynchronizedMap` in package `mutable` is deprecated: "Synchronization via traits is deprecated as it is inherently unreliable." Change to `java.util.concurrent.ConcurrentHashMap` instead.
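      A minimal plain-Java sketch of the replacement idiom (illustrative only, not the Spark code in question):

      ```java
      import java.util.Map;
      import java.util.concurrent.ConcurrentHashMap;

      public class ConcurrentMapDemo {
          public static void main(String[] args) {
              // ConcurrentHashMap gives thread-safe access without mixing in
              // the deprecated SynchronizedMap trait.
              Map<String, Integer> counts = new ConcurrentHashMap<>();
              counts.put("a", 1);
              counts.merge("a", 1, Integer::sum); // atomic read-modify-write
              System.out.println(counts.get("a")); // 2
          }
      }
      ```

      Methods like `merge`, `compute`, and `putIfAbsent` replace the lock-around-check-then-act patterns that the trait-based synchronization encouraged.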
      
      Author: Huaxin Gao <huaxing@us.ibm.com>
      
      Closes #11250 from huaxingao/spark__13186.
  7. Feb 09, 2016
    • [SPARK-13165][STREAMING] Replace deprecated synchronizedBuffer in streaming · 159198ef
      Holden Karau authored
      Building with Scala 2.11 results in the warning: "trait SynchronizedBuffer in package mutable is deprecated: Synchronization via traits is deprecated as it is inherently unreliable. Consider java.util.concurrent.ConcurrentLinkedQueue as an alternative." We already use `ConcurrentLinkedQueue` elsewhere, so let's replace it.
      
      Some notes about how behaviour is different for reviewers:
      The `Seq` implicitly converted from a `SynchronizedBuffer` would continue to receive updates; when the same conversion is done explicitly on the `ConcurrentLinkedQueue`, this isn't the case. Hence some of the (internal and test) APIs change to pass an `Iterable`. `toSeq` is safe to use if there are no more updates.
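      The snapshot behaviour described above can be illustrated in plain Java (a hypothetical sketch, not the patch itself):

      ```java
      import java.util.ArrayList;
      import java.util.List;
      import java.util.concurrent.ConcurrentLinkedQueue;

      public class QueueSnapshotDemo {
          public static void main(String[] args) {
              ConcurrentLinkedQueue<String> events = new ConcurrentLinkedQueue<>();
              events.add("batch-1");
              events.add("batch-2");
              // An explicit copy is a snapshot: later additions to the queue
              // will NOT appear in it, unlike the implicit Seq view that a
              // SynchronizedBuffer used to provide.
              List<String> snapshot = new ArrayList<>(events);
              events.add("batch-3");
              System.out.println(snapshot.size()); // 2
              System.out.println(events.size());   // 3
          }
      }
      ```

      This is why callers that want to observe later updates must hold the queue (an `Iterable`) itself, and only materialize a `Seq`/`List` once no more updates are expected.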
      
      Author: Holden Karau <holden@us.ibm.com>
      Author: tedyu <yuzhihong@gmail.com>
      
      Closes #11067 from holdenk/SPARK-13165-replace-deprecated-synchronizedBuffer-in-streaming.
  8. Feb 07, 2016
  9. Jan 30, 2016
    • [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version · 289373b2
      Josh Rosen authored
      This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
      
      The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
      
      After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10608 from JoshRosen/SPARK-6363.
  10. Jan 22, 2016
  11. Jan 20, 2016
    • [SPARK-7799][SPARK-12786][STREAMING] Add "streaming-akka" project · b7d74a60
      Shixiong Zhu authored
      Include the following changes:
      
      1. Add "streaming-akka" project and org.apache.spark.streaming.akka.AkkaUtils for creating an actorStream
      2. Remove "StreamingContext.actorStream" and "JavaStreamingContext.actorStream"
      3. Update the ActorWordCount example and add the JavaActorWordCount example
      4. Make "streaming-zeromq" depend on "streaming-akka" and update the code accordingly
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10744 from zsxwing/streaming-akka-2.
  12. Jan 11, 2016
  13. Jan 08, 2016
    • [SPARK-4628][BUILD] Remove all non-Maven-Central repositories from build · 090d6913
      Josh Rosen authored
      This patch removes all non-Maven-central repositories from Spark's build, thereby avoiding any risk of future build-breaks due to us accidentally depending on an artifact which is not present in an immutable public Maven repository.
      
      I tested this by running
      
      ```
      build/mvn \
              -Phive \
              -Phive-thriftserver \
              -Pkinesis-asl \
              -Pspark-ganglia-lgpl \
              -Pyarn \
              dependency:go-offline
      ```
      
      inside of a fresh Ubuntu Docker container with no Ivy or Maven caches (I did a similar test for SBT).
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10659 from JoshRosen/SPARK-4628.
    • [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition · b9c83533
      Sean Owen authored
      Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #10570 from srowen/SPARK-12618.
  14. Jan 07, 2016
    • [SPARK-12510][STREAMING] Refactor ActorReceiver to support Java · c0c39750
      Shixiong Zhu authored
      This PR includes the following changes:
      
      1. Rename `ActorReceiver` to `ActorReceiverSupervisor`
      2. Remove `ActorHelper`
      3. Add a new `ActorReceiver` for Scala and `JavaActorReceiver` for Java
      4. Add `JavaActorWordCount` example
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10457 from zsxwing/java-actor-stream.
  15. Jan 05, 2016
  16. Dec 31, 2015
  17. Dec 19, 2015
  18. Dec 08, 2015
  19. Dec 04, 2015
    • [SPARK-12084][CORE] Fix code that uses ByteBuffer.array incorrectly · 3af53e61
      Shixiong Zhu authored
      `ByteBuffer` doesn't guarantee that all contents in `ByteBuffer.array` are valid; consider, e.g., a `ByteBuffer` returned by `ByteBuffer.slice`. We should not use the whole content of `ByteBuffer.array` unless we know that's correct.
      
      This patch fixes all places that use `ByteBuffer.array` incorrectly.
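      The pitfall can be shown in plain Java (a sketch of the general issue, not the patched Spark code):

      ```java
      import java.nio.ByteBuffer;
      import java.util.Arrays;

      public class ByteBufferDemo {
          public static void main(String[] args) {
              ByteBuffer buf = ByteBuffer.wrap(new byte[] {1, 2, 3, 4, 5});
              buf.position(2);
              ByteBuffer slice = buf.slice(); // logically holds {3, 4, 5}
              // slice.array() returns the WHOLE shared backing array, not just
              // the slice's contents -- using it directly is the bug:
              byte[] raw = slice.array(); // {1, 2, 3, 4, 5}
              // Correct: honor arrayOffset(), position(), and limit().
              byte[] valid = Arrays.copyOfRange(raw,
                      slice.arrayOffset() + slice.position(),
                      slice.arrayOffset() + slice.limit());
              System.out.println(Arrays.toString(valid)); // [3, 4, 5]
          }
      }
      ```

      The same applies to buffers produced by `duplicate()` or `ByteBuffer.wrap(array, offset, length)`; only the `arrayOffset()`/`position()`/`limit()` window of the backing array is valid.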
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10083 from zsxwing/bytebuffer-array.
    • [SPARK-12112][BUILD] Upgrade to SBT 0.13.9 · b7204e1d
      Josh Rosen authored
      We should upgrade to SBT 0.13.9, since this is a requirement in order to use SBT's new Maven-style resolution features (which will be done in a separate patch, because it's blocked by some binary compatibility issues in the POM reader plugin).
      
      I also upgraded Scalastyle to version 0.8.0, which was necessary in order to fix a Scala 2.10.5 compatibility issue (see https://github.com/scalastyle/scalastyle/issues/156). The newer Scalastyle is slightly stricter about whitespace surrounding tokens, so I fixed the new style violations.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10112 from JoshRosen/upgrade-to-sbt-0.13.9.
  20. Dec 03, 2015
  21. Nov 30, 2015
  22. Nov 17, 2015
  23. Nov 13, 2015
  24. Nov 10, 2015
    • [SPARK-11361][STREAMING] Show scopes of RDD operations inside DStream.foreachRDD and DStream.transform in DAG viz · 6600786d
      Tathagata Das authored
      
      Currently, when a DStream sets the scope for an RDD generated by it, that scope is not allowed to be overridden by the RDD operations. So in the case of `DStream.foreachRDD`, all the RDDs generated inside the foreachRDD get the same scope - `foreachRDD <time>`, as set by the `ForeachDStream`. So it is hard to debug generated RDDs in the RDD DAG viz in the Spark UI.
      
      This patch allows the RDD operations inside `DStream.transform` and `DStream.foreachRDD` to append their own scopes to the earlier DStream scope.
      
      I have also slightly tweaked how callsites are set such that the short callsite reflects the RDD operation name and line number. This tweak is necessary as callsites are not managed through scopes (which support nesting and overriding) and I didn't want to add another local property to control nesting and overriding of callsites.
      
      ## Before:
      ![image](https://cloud.githubusercontent.com/assets/663212/10808548/fa71c0c4-7da9-11e5-9af0-5737793a146f.png)
      
      ## After:
      ![image](https://cloud.githubusercontent.com/assets/663212/10808659/37bc45b6-7dab-11e5-8041-c20be6a9bc26.png)
      
      The code that was used to generate this is:
      ```
          val lines = ssc.socketTextStream(args(0), args(1).toInt, StorageLevel.MEMORY_AND_DISK_SER)
          val words = lines.flatMap(_.split(" "))
          val wordCounts = words.map(x => (x, 1)).reduceByKey(_ + _)
          wordCounts.foreachRDD { rdd =>
            val temp = rdd.map { _ -> 1 }.reduceByKey( _ + _)
            val temp2 = temp.map { _ -> 1}.reduceByKey(_ + _)
            val count = temp2.count
            println(count)
          }
      ```
      
      Note:
      - The inner scopes of the RDD operations map/reduceByKey inside foreachRDD are visible
      - The short callsites of stages refer to the line number of the RDD ops rather than the same line number of foreachRDD in all three cases.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #9315 from tdas/SPARK-11361.
  25. Oct 24, 2015
  26. Oct 08, 2015
  27. Oct 07, 2015
  28. Sep 15, 2015
  29. Sep 12, 2015
  30. Sep 09, 2015
    • [SPARK-10227] fatal warnings with sbt on Scala 2.11 · c1bc4f43
      Luc Bourlier authored
      The bulk of the changes concern the `@transient` annotation on class parameters. Often the compiler doesn't generate a field for these parameters, so the transient annotation would be unnecessary.
      But if the class parameters are used in methods, then fields are created. So it is safer to keep the annotations.
      
      The remainder are some potential bugs and deprecated syntax.
      
      Author: Luc Bourlier <luc.bourlier@typesafe.com>
      
      Closes #8433 from skyluc/issue/sbt-2.11.
  31. Aug 25, 2015
  32. Aug 24, 2015
    • [SPARK-9791][PACKAGE] Change private class to private[package] class to prevent unnecessary classes from showing up in the docs · 7478c8b6
      Tathagata Das authored
      
      In addition, some random cleanup of import ordering
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #8387 from tdas/SPARK-9791 and squashes the following commits:
      
      67f3ee9 [Tathagata Das] Change private class to private[package] class to prevent them from showing up in the docs