  1. Mar 21, 2016
    • [SPARK-14011][CORE][SQL] Enable `LineLength` Java checkstyle rule · 20fd2541
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      The [Spark Coding Style Guide](https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide) sets a 100-character limit on lines, but the check has been disabled for Java since 11/09/15. This PR re-enables the **LineLength** checkstyle rule. To support that, it also introduces the **RedundantImport** and **RedundantModifier** rules. The following is the diff on `checkstyle.xml`.
      
      ```xml
      -        <!-- TODO: 11/09/15 disabled - the lengths are currently > 100 in many places -->
      -        <!--
               <module name="LineLength">
                   <property name="max" value="100"/>
                   <property name="ignorePattern" value="^package.*|^import.*|a href|href|http://|https://|ftp://"/>
               </module>
      -        -->
               <module name="NoLineWrap"/>
               <module name="EmptyBlock">
                   <property name="option" value="TEXT"/>
      @@ -167,5 +164,7 @@
               </module>
               <module name="CommentsIndentation"/>
               <module name="UnusedImports"/>
      +        <module name="RedundantImport"/>
      +        <module name="RedundantModifier"/>
      ```
      
      ## How was this patch tested?
      
      Currently, `lint-java` is disabled in Jenkins. It needs a manual test.
      After passing the Jenkins tests, `dev/lint-java` should pass locally.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11831 from dongjoon-hyun/SPARK-14011.
      20fd2541
  2. Mar 19, 2016
    • [SPARK-10680][TESTS] Increase 'connectionTimeout' to make... · d630a203
      Shixiong Zhu authored
      [SPARK-10680][TESTS] Increase 'connectionTimeout' to make RequestTimeoutIntegrationSuite more stable
      
      ## What changes were proposed in this pull request?
      
      Increase 'connectionTimeout' to make RequestTimeoutIntegrationSuite more stable
      
      ## How was this patch tested?
      
      Existing unit tests
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #11833 from zsxwing/SPARK-10680.
      d630a203
  3. Mar 17, 2016
    • [SPARK-13921] Store serialized blocks as multiple chunks in MemoryStore · 6c2d894a
      Josh Rosen authored
      This patch modifies the BlockManager, MemoryStore, and several other storage components so that serialized cached blocks are stored as multiple small chunks rather than as a single contiguous ByteBuffer.
      
      This change will help to improve the efficiency of memory allocation and the accuracy of memory accounting when serializing blocks. Our current serialization code uses a ByteBufferOutputStream, which doubles and re-allocates its backing byte array; this increases the peak memory requirements during serialization (since we need to hold extra memory while expanding the array). In addition, we currently don't account for the extra wasted space at the end of the ByteBuffer's backing array, so a 129 megabyte serialized block may actually consume 256 megabytes of memory. After switching to storing blocks in multiple chunks, we'll be able to efficiently trim the backing buffers so that no space is wasted.
      
      This change is also a prerequisite to being able to cache blocks which are larger than 2GB (although full support for that depends on several other changes which have not been implemented yet).
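
As a rough illustration of the chunking idea described above (not the code from this patch), the sketch below collects written bytes in fixed-size chunks instead of one array that doubles on growth. The class name `ChunkedOutput` and the `chunkSize` parameter are hypothetical, chosen only for this example.

```java
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not Spark's class: bytes are collected in
// fixed-size chunks, so growth never copies or doubles one large array.
class ChunkedOutput extends OutputStream {
  private final int chunkSize;
  private final List<byte[]> chunks = new ArrayList<>();
  private int usedInLast; // bytes used in the most recent chunk

  ChunkedOutput(int chunkSize) {
    if (chunkSize <= 0) {
      throw new IllegalArgumentException("chunkSize must be positive");
    }
    this.chunkSize = chunkSize;
    this.usedInLast = chunkSize; // forces allocation on the first write
  }

  @Override
  public void write(int b) {
    if (usedInLast == chunkSize) {
      chunks.add(new byte[chunkSize]);
      usedInLast = 0;
    }
    chunks.get(chunks.size() - 1)[usedInLast++] = (byte) b;
  }

  // Returns one buffer per chunk. Only the last chunk is copied, down to the
  // bytes actually used, so wasted space is bounded by one chunk.
  List<ByteBuffer> toByteBuffers() {
    List<ByteBuffer> result = new ArrayList<>();
    for (int i = 0; i < chunks.size(); i++) {
      byte[] chunk = chunks.get(i);
      boolean last = (i == chunks.size() - 1);
      result.add(ByteBuffer.wrap(last ? Arrays.copyOf(chunk, usedInLast) : chunk));
    }
    return result;
  }
}
```

With a chunk size of 4, writing 10 bytes yields three buffers holding 4, 4, and 2 bytes, whereas a doubling array would typically end with unused capacity at its tail.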
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11748 from JoshRosen/chunked-block-serialization.
      6c2d894a
  4. Mar 16, 2016
    • [SPARK-13823][SPARK-13397][SPARK-13395][CORE] More warnings, StandardCharset follow up · 3b461d9e
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Follow up to https://github.com/apache/spark/pull/11657
      
      - Also update `String.getBytes("UTF-8")` to use `StandardCharsets.UTF_8`
      - And fix one last new Coverity warning that turned up (use of unguarded `wait()` replaced by simpler/more robust `java.util.concurrent` classes in tests)
      - And while we're here cleaning up Coverity warnings, just fix about 15 more build warnings
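
For readers unfamiliar with the `wait()` issue mentioned in the list above, here is a minimal sketch of the general pattern, assuming a test that needs to block until a background task finishes. It is not the code changed in this PR; `LatchExample` and `runAndAwait` are hypothetical names. A `CountDownLatch` does not suffer from spurious wakeups or missed notifications the way an unguarded `wait()` can.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

class LatchExample {
  // Runs the work on another thread and reports whether it finished
  // within the timeout. The latch replaces a hand-rolled wait()/notify().
  static boolean runAndAwait(Runnable work, long timeoutMs) {
    CountDownLatch done = new CountDownLatch(1);
    Thread worker = new Thread(() -> {
      try {
        work.run();
      } finally {
        done.countDown(); // always signal, even if the work throws
      }
    });
    worker.start();
    try {
      return done.await(timeoutMs, TimeUnit.MILLISECONDS);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // preserve the interrupt status
      return false;
    }
  }
}
```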
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11725 from srowen/SPARK-13823.2.
      3b461d9e
  5. Mar 14, 2016
    • [MINOR][COMMON] Fix copy-paste oversight in variable naming · e06493cb
      Bjorn Jonsson authored
      ## What changes were proposed in this pull request?
      
      `JavaUtils.java` has methods to convert time and byte strings for internal use. This change renames a variable used in `byteStringAs()` from `timeError` to `byteError`.
      
      Author: Bjorn Jonsson <bjornjon@gmail.com>
      
      Closes #11695 from bjornjon/master.
      e06493cb
    • [SPARK-12583][MESOS] Mesos shuffle service: Don't delete shuffle files before... · 310981d4
      Bertrand Bossy authored
      [SPARK-12583][MESOS] Mesos shuffle service: Don't delete shuffle files before application has stopped
      
      ## Problem description:
      
      The Mesos shuffle service has been completely unusable since Spark 1.6.0. The problem seems to have appeared with the move from akka to netty in the networking layer. Until now, a connection from the driver to each shuffle service was used as a signal for the shuffle service to determine whether the driver is still running. Since 1.6.0, this connection is closed after spark.shuffle.io.connectionTimeout (or spark.network.timeout if the former is not set) because it is idle. The shuffle service interprets this as a signal that the driver has stopped, even though the driver is still alive. Thus, shuffle files are deleted before the application has stopped.
      
      ### Context and analysis:
      
      spark shuffle fails with mesos after 2mins: https://issues.apache.org/jira/browse/SPARK-12583
      External shuffle service broken w/ Mesos: https://issues.apache.org/jira/browse/SPARK-13159
      
      This is a follow-up to #11207.
      
      ## What changes were proposed in this pull request?
      
      This PR adds a heartbeat signal from the Driver (in MesosExternalShuffleClient) to all registered external mesos shuffle service instances. In MesosExternalShuffleBlockHandler, a thread periodically checks whether a driver has timed out and cleans an application's shuffle files if this is the case.
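
The heartbeat-and-timeout bookkeeping described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `MesosExternalShuffleBlockHandler`: the class `HeartbeatTracker` and its methods are hypothetical, and the current time is passed in explicitly so the logic is easy to test. A periodic cleaner thread would call `expire` and then delete the shuffle files of each returned application.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Hypothetical sketch: remembers the last heartbeat per application and
// reports applications whose driver has been silent for too long.
class HeartbeatTracker {
  private final long timeoutMs;
  private final ConcurrentMap<String, Long> lastHeartbeat = new ConcurrentHashMap<>();

  HeartbeatTracker(long timeoutMs) {
    this.timeoutMs = timeoutMs;
  }

  // Called on registration and on every heartbeat message from a driver.
  void heartbeat(String appId, long nowMs) {
    lastHeartbeat.put(appId, nowMs);
  }

  // Removes and returns the applications that have timed out.
  List<String> expire(long nowMs) {
    List<String> expired = new ArrayList<>();
    for (Map.Entry<String, Long> e : lastHeartbeat.entrySet()) {
      Long seen = e.getValue();
      // remove(key, value) only succeeds if no newer heartbeat arrived meanwhile
      if (nowMs - seen > timeoutMs && lastHeartbeat.remove(e.getKey(), seen)) {
        expired.add(e.getKey());
      }
    }
    return expired;
  }
}
```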
      
      ## How was this patch tested?
      
      This patch has been tested on a small mesos test cluster using the spark-shell. Log output from mesos shuffle service:
      ```
      16/02/19 15:13:45 INFO mesos.MesosExternalShuffleBlockHandler: Received registration request from app 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 (remote address /xxx.xxx.xxx.xxx:52391, heartbeat timeout 120000 ms).
      16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-c84c0697-a3f9-4f61-9c64-4d3ee227c047], subDirsPerLocalDir=64, shuffleManager=sort}
      16/02/19 15:13:47 INFO shuffle.ExternalShuffleBlockResolver: Registered executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7} with ExecutorShuffleInfo{localDirs=[/foo/blockmgr-bf46497a-de80-47b9-88f9-563123b59e03], subDirsPerLocalDir=64, shuffleManager=sort}
      16/02/19 15:16:02 INFO mesos.MesosExternalShuffleBlockHandler: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 timed out. Removing shuffle files.
      16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Application 294def07-3249-4e0f-8d71-bf8c83c58a50-0018 removed, cleanupLocalDirs = true
      16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=3}'s 1 local dirs
      16/02/19 15:16:02 INFO shuffle.ExternalShuffleBlockResolver: Cleaning up executor AppExecId{appId=294def07-3249-4e0f-8d71-bf8c83c58a50-0018, execId=7}'s 1 local dirs
      ```
      Note: there are 2 executors running on this slave.
      
      Author: Bertrand Bossy <bertrand.bossy@teralytics.net>
      
      Closes #11272 from bbossy/SPARK-12583-mesos-shuffle-service-heartbeat.
      310981d4
  6. Mar 13, 2016
    • [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <->... · 18408528
      Sean Owen authored
      [SPARK-13823][CORE][STREAMING][SQL] Always specify Charset in String <-> byte[] conversions (and remaining Coverity items)
      
      ## What changes were proposed in this pull request?
      
      - Fixes calls to `new String(byte[])` or `String.getBytes()` that rely on platform default encoding, to use UTF-8
      - Same for `InputStreamReader` and `OutputStreamWriter` constructors
      - Standardizes on UTF-8 everywhere
      - Standardizes specifying the encoding with `StandardCharsets.UTF_8`, not the Guava constant or "UTF-8" (which means handling `UnsupportedEncodingException`)
      - (also addresses the other remaining Coverity scan issues, which are pretty trivial; these are separated into commit https://github.com/srowen/spark/commit/1deecd8d9ca986d8adb1a42d315890ce5349d29c )
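
A small sketch of the conversions described in the list above, using only standard JDK calls. The class `CharsetExample` is a hypothetical name for illustration. Passing `StandardCharsets.UTF_8` avoids both the platform default encoding and the checked `UnsupportedEncodingException` that the string-named overloads declare.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

class CharsetExample {
  // Encodes with an explicit charset instead of the platform default.
  static byte[] encode(String s) {
    return s.getBytes(StandardCharsets.UTF_8);
  }

  // Decodes with the same explicit charset.
  static String decode(byte[] bytes) {
    return new String(bytes, StandardCharsets.UTF_8);
  }

  // Readers take the charset as well; no checked exception is involved here.
  static Reader readerFor(byte[] bytes) {
    return new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);
  }
}
```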
      
      ## How was this patch tested?
      
      Jenkins tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11657 from srowen/SPARK-13823.
      18408528
  7. Mar 09, 2016
    • [SPARK-13702][CORE][SQL][MLLIB] Use diamond operator for generic instance creation in Java code. · c3689bc2
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      To make `docs/examples` (and other related code) simpler, more readable, and more user-friendly, this PR replaces existing code like the following with the `diamond` operator.
      
      ```
      -    final ArrayList<Product2<Object, Object>> dataToWrite =
      -      new ArrayList<Product2<Object, Object>>();
      +    final ArrayList<Product2<Object, Object>> dataToWrite = new ArrayList<>();
      ```
      
      Java 7 and higher support the **diamond** operator, which replaces the type arguments required to invoke the constructor of a generic class with an empty set of type parameters (<>). Currently, Spark's Java code uses it inconsistently.
      
      ## How was this patch tested?
      
      Manual.
      Pass the existing tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11541 from dongjoon-hyun/SPARK-13702.
      c3689bc2
    • [SPARK-13692][CORE][SQL] Fix trivial Coverity/Checkstyle defects · f3201aee
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the following potential bugs and Java coding style issues detected by Coverity and Checkstyle.
      
      - Implement both null and type checking in equals functions.
      - Fix wrong type casting logic in SimpleJavaBean2.equals.
      - Add `implements Cloneable` to `UTF8String` and `SortedIterator`.
      - Remove dereferencing before null check in `AbstractBytesToBytesMapSuite`.
      - Fix coding style: Add '{}' to single-statement `for` loops in mllib examples.
      - Remove unused imports in `ColumnarBatch` and `JavaKinesisStreamSuite`.
      - Remove unused fields in `ChunkFetchIntegrationSuite`.
      - Add `stop()` to prevent resource leak.
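
To illustrate the first item in the list above, here is a minimal sketch of an `equals` method that performs both the null check and the type check before casting. `PointBean` is a hypothetical class, not one of the beans touched by this PR.

```java
import java.util.Objects;

// Hypothetical bean showing null and type checks before the cast in equals.
final class PointBean {
  private final int x;
  private final String label;

  PointBean(int x, String label) {
    this.x = x;
    this.label = label;
  }

  @Override
  public boolean equals(Object o) {
    if (this == o) {
      return true;
    }
    // Reject null and foreign types before casting.
    if (o == null || getClass() != o.getClass()) {
      return false;
    }
    PointBean other = (PointBean) o;
    return x == other.x && Objects.equals(label, other.label);
  }

  // Kept consistent with equals: equal objects produce equal hash codes.
  @Override
  public int hashCode() {
    return Objects.hash(x, label);
  }
}
```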
      
      Please note that the last two checkstyle errors were introduced by commits added after [SPARK-13583](https://issues.apache.org/jira/browse/SPARK-13583).
      
      ## How was this patch tested?
      
      Manually, via `./dev/lint-java` and the Coverity site.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11530 from dongjoon-hyun/SPARK-13692.
      f3201aee
  8. Mar 07, 2016
    • [SPARK-529][CORE][YARN] Add type-safe config keys to SparkConf. · e1fb8579
      Marcelo Vanzin authored
      This is, in a way, the basics to enable SPARK-529 (which was closed as
      won't fix but I think is still valuable). In fact, Spark SQL created
      something for that, and this change basically factors out that code
      and inserts it into SparkConf, with some extra bells and whistles.
      
      To showcase the usage of this pattern, I modified the YARN backend
      to use the new config keys (defined in the new `config` package object
      under `o.a.s.deploy.yarn`). Most of the changes are mechanical, although
      logic had to be slightly modified in a handful of places.
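
The general idea of a type-safe config key can be sketched as below. This is only an illustration in Java under assumed names (`ConfigKey`, `TypedConf`, and the key `example.retries` are all hypothetical); the actual change is written in Scala and its API may look quite different. The point is that the key carries its name, default, and parser, so callers get a typed value without parsing strings themselves.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical typed key: the name, default, and parser travel together.
final class ConfigKey<T> {
  final String name;
  final T defaultValue;
  final Function<String, T> parser;

  ConfigKey(String name, T defaultValue, Function<String, T> parser) {
    this.name = name;
    this.defaultValue = defaultValue;
    this.parser = parser;
  }
}

// Hypothetical string-backed configuration with a typed accessor.
class TypedConf {
  private final Map<String, String> settings = new HashMap<>();

  TypedConf set(String name, String value) {
    settings.put(name, value);
    return this;
  }

  // Returns the parsed value, or the key's default when the key is unset.
  <T> T get(ConfigKey<T> key) {
    String raw = settings.get(key.name);
    return raw == null ? key.defaultValue : key.parser.apply(raw);
  }
}
```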
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #10205 from vanzin/conf-opts.
      e1fb8579
  9. Mar 04, 2016
  10. Mar 03, 2016
    • [SPARK-13583][CORE][STREAMING] Remove unused imports and add checkstyle rule · b5f02d67
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      After SPARK-6990, `dev/lint-java` keeps Java code healthy and helps PR review by saving much time.
      This issue aims to remove unused imports from Java/Scala code and add the `UnusedImports` checkstyle rule to help developers.
      
      ## How was this patch tested?
      ```
      ./dev/lint-java
      ./build/sbt compile
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11438 from dongjoon-hyun/SPARK-13583.
      b5f02d67
    • [SPARK-13423][WIP][CORE][SQL][STREAMING] Static analysis fixes for 2.x · e97fc7f1
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Make some cross-cutting code improvements according to static analysis. These are individually up for discussion since they exist in separate commits that can be reverted. The changes are broadly:
      
      - Inner class should be static
      - Mismatched hashCode/equals
      - Overflow in compareTo
      - Unchecked warnings
      - Misuse of assert, vs junit.assert
      - get(a) + getOrElse(b) -> getOrElse(a,b)
      - Array/String .size -> .length (occasionally, -> .isEmpty / .nonEmpty) to avoid implicit conversions
      - Dead code
      - tailrec
      - exists(_ == ) -> contains
      - find + nonEmpty -> exists
      - filter + size -> count
      - reduce(_+_) -> sum
      - map + flatten -> flatMap
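
As an example of the "Overflow in compareTo" item in the list above, the sketch below contrasts a subtraction-based comparison with `Integer.compare`. It is a generic illustration rather than code from this PR, and `SafeCompare` is a hypothetical name.

```java
class SafeCompare {
  // Broken: a - b can overflow, which flips the sign of the result.
  static int bySubtraction(int a, int b) {
    return a - b;
  }

  // Correct for all int values.
  static int safe(int a, int b) {
    return Integer.compare(a, b);
  }
}
```

For `a = Integer.MIN_VALUE` and `b = 1`, the subtraction wraps around to a positive number and claims that `a` is larger, while `Integer.compare` returns a negative result.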
      
      The most controversial may be .size -> .length simply because of its size. It is intended to avoid implicits that might be expensive in some places.
      
      ## How was this patch tested?
      
      Existing Jenkins unit tests.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #11292 from srowen/SPARK-13423.
      e97fc7f1
  11. Mar 01, 2016
    • [SPARK-13548][BUILD] Move tags and unsafe modules into common · b0ee7d43
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch moves tags and unsafe modules into common directory to remove 2 top level non-user-facing directories.
      
      ## How was this patch tested?
      Jenkins should suffice.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11426 from rxin/SPARK-13548.
      b0ee7d43
  12. Feb 28, 2016
    • [SPARK-13529][BUILD] Move network/* modules into common/network-* · 9e01dcc6
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      As the title says, this moves the three modules currently in network/ into common/network-*. This removes one top level, non-user-facing folder.
      
      ## How was this patch tested?
      Compilation and existing tests. We should run both SBT and Maven.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #11409 from rxin/SPARK-13529.
      9e01dcc6
  13. Feb 26, 2016
    • [MINOR][SQL] Fix modifier order. · 727e7801
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the modifier order, changing `abstract public` to `public abstract`.
      Currently, running `./dev/lint-java` reports the following error.
      ```
      Checkstyle checks failed at following occurrences:
      [ERROR] src/main/java/org/apache/spark/util/sketch/CountMinSketch.java:[53,10] (modifier) ModifierOrder: 'public' modifier out of order with the JLS suggestions.
      ```
      
      ## How was this patch tested?
      
      ```
      $ ./dev/lint-java
      Checkstyle checks passed.
      ```
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11390 from dongjoon-hyun/fix_modifier_order.
      727e7801
  14. Feb 22, 2016
  15. Jan 30, 2016
    • [SPARK-6363][BUILD] Make Scala 2.11 the default Scala version · 289373b2
      Josh Rosen authored
      This patch changes Spark's build to make Scala 2.11 the default Scala version. To be clear, this does not mean that Spark will stop supporting Scala 2.10: users will still be able to compile Spark for Scala 2.10 by following the instructions on the "Building Spark" page; however, it does mean that Scala 2.11 will be the default Scala version used by our CI builds (including pull request builds).
      
      The Scala 2.11 compiler is faster than 2.10, so I think we'll be able to look forward to a slight speedup in our CI builds (it looks like it's about 2X faster for the Maven compile-only builds, for instance).
      
      After this patch is merged, I'll update Jenkins to add new compile-only jobs to ensure that Scala 2.10 compilation doesn't break.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #10608 from JoshRosen/SPARK-6363.
      289373b2
  16. Jan 29, 2016
  17. Jan 28, 2016
  18. Jan 27, 2016
    • [SPARK-12938][SQL] DataFrame API for Bloom filter · 680afabe
      Wenchen Fan authored
      This PR integrates Bloom filter from spark-sketch into DataFrame. This version resorts to RDD.aggregate for building the filter. A more performant UDAF version can be built in future follow-up PRs.
      
      This PR also adds two specialized `put` variants (`putBinary` and `putLong`) to `BloomFilter`, which makes it easier to build a Bloom filter over a `DataFrame`.
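
For context, a Bloom filter with a type-specialized `putLong` can be sketched as below. This is a minimal stand-alone illustration, not the spark-sketch `BloomFilter` implementation: the class `TinyBloomFilter`, its constructor parameters, and the hash mixing function are assumptions made for the example.

```java
import java.util.BitSet;

// Hypothetical minimal Bloom filter for long keys; not the spark-sketch class.
class TinyBloomFilter {
  private final BitSet bits;
  private final int numBits;
  private final int numHashes;

  TinyBloomFilter(int numBits, int numHashes) {
    if (numBits <= 0 || numHashes <= 0) {
      throw new IllegalArgumentException("numBits and numHashes must be positive");
    }
    this.bits = new BitSet(numBits);
    this.numBits = numBits;
    this.numHashes = numHashes;
  }

  // 64-bit mixing function (SplitMix64-style finalizer), assumed for this sketch.
  private static long mix(long x) {
    x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
    x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
    return x ^ (x >>> 31);
  }

  // i-th bit index, derived from two 32-bit halves of one hash (double hashing).
  private int index(long item, int i) {
    long h = mix(item);
    int h1 = (int) h;
    int h2 = (int) (h >>> 32);
    return Math.floorMod(h1 + i * h2, numBits);
  }

  void putLong(long item) {
    for (int i = 0; i < numHashes; i++) {
      bits.set(index(item, i));
    }
  }

  // May return true for items never added (false positive), but never
  // returns false for an item that was added.
  boolean mightContainLong(long item) {
    for (int i = 0; i < numHashes; i++) {
      if (!bits.get(index(item, i))) {
        return false;
      }
    }
    return true;
  }
}
```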
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10937 from cloud-fan/bloom-filter.
      680afabe
  19. Jan 26, 2016
    • [SPARK-12935][SQL] DataFrame API for Count-Min Sketch · ce38a35b
      Cheng Lian authored
      This PR integrates Count-Min Sketch from spark-sketch into DataFrame. This version resorts to `RDD.aggregate` for building the sketch. A more performant UDAF version can be built in future follow-up PRs.
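
For readers new to the data structure, a Count-Min Sketch can be illustrated roughly as follows. This is a minimal stand-alone sketch, not the spark-sketch implementation: `TinyCountMinSketch`, its `depth` and `width` parameters, and the hash mixing are assumptions for the example. Each item increments one counter per row, and the estimate is the minimum over the rows, so it can overestimate a count but never underestimate it (for non-negative increments).

```java
// Hypothetical minimal Count-Min Sketch for long keys; not the spark-sketch class.
class TinyCountMinSketch {
  private final long[][] counts;
  private final int depth;
  private final int width;

  TinyCountMinSketch(int depth, int width) {
    if (depth <= 0 || width <= 0) {
      throw new IllegalArgumentException("depth and width must be positive");
    }
    this.depth = depth;
    this.width = width;
    this.counts = new long[depth][width];
  }

  // 64-bit mixing function (SplitMix64-style finalizer), assumed for this sketch.
  private static long mix(long x) {
    x = (x ^ (x >>> 30)) * 0xbf58476d1ce4e5b9L;
    x = (x ^ (x >>> 27)) * 0x94d049bb133111ebL;
    return x ^ (x >>> 31);
  }

  // Column for the item in the given row; each row uses a different offset.
  private int column(long item, int row) {
    long h = mix(item + row * 0x9E3779B97F4A7C15L);
    return Math.floorMod((int) (h >>> 32), width);
  }

  // Adds a non-negative count for the item to one counter in every row.
  void add(long item, long count) {
    for (int row = 0; row < depth; row++) {
      counts[row][column(item, row)] += count;
    }
  }

  // The minimum over the rows is an upper bound on the item's true count.
  long estimate(long item) {
    long min = Long.MAX_VALUE;
    for (int row = 0; row < depth; row++) {
      min = Math.min(min, counts[row][column(item, row)]);
    }
    return min;
  }
}
```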
      
      Author: Cheng Lian <lian@databricks.com>
      
      Closes #10911 from liancheng/cms-df-api.
      ce38a35b
    • [SPARK-12937][SQL] bloom filter serialization · 6743de3a
      Wenchen Fan authored
      This PR adds serialization support for BloomFilter.
      
      A version number is added to version the serialized binary format.
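
The idea of versioning a serialized binary format can be sketched as below. The layout here (a version int, a length int, then the 64-bit words) and the names `VersionedFormat`, `serialize`, and `deserialize` are assumptions for illustration only; the actual format written by this PR may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

// Hypothetical layout: [version:int][length:int][words:long * length].
class VersionedFormat {
  static final int VERSION = 1;

  static byte[] serialize(long[] words) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(bytes)) {
      out.writeInt(VERSION); // written first so readers can reject unknown formats
      out.writeInt(words.length);
      for (long word : words) {
        out.writeLong(word);
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }

  static long[] deserialize(byte[] data) {
    try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
      int version = in.readInt();
      if (version != VERSION) {
        throw new IllegalArgumentException("Unsupported format version: " + version);
      }
      long[] words = new long[in.readInt()];
      for (int i = 0; i < words.length; i++) {
        words[i] = in.readLong();
      }
      return words;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }
}
```

Checking the version before reading anything else lets a later release change the layout while still failing cleanly on data it does not understand.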
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #10920 from cloud-fan/bloom-filter.
      6743de3a
  20. Jan 25, 2016
  21. Jan 23, 2016