  1. Sep 06, 2016
  2. Sep 02, 2016
    • [SPARK-16711] YarnShuffleService doesn't re-init properly on YARN rolling upgrade · e79962f2
      Thomas Graves authored
      The Spark YARN shuffle service doesn't re-initialize the application credentials early enough, which causes any other Spark executors trying to fetch from that node during a rolling upgrade to fail with "java.lang.NullPointerException: Password cannot be null if SASL is enabled". Right now the Spark shuffle service relies on the YARN NodeManager to re-register the applications; unfortunately, this happens after we open the port for other executors to connect. If other executors connect before the re-registration, they get a NullPointerException, which isn't a retryable exception and causes them to fail pretty quickly. To solve this I added another LevelDB file so that the service can save and re-initialize all the applications before opening the port for other executors to connect to it. Adding another LevelDB was simpler from the code structure point of view.
      
      Most of the code changes are moving things to a common util class.
      
      Patch was tested manually on a YARN cluster with a rolling upgrade happening while a Spark job was running. Without the patch I consistently get the NullPointerException; with the patch the job gets a few Connection refused exceptions, but the retries kick in and it succeeds.
      
      Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
      
      Closes #14718 from tgravescs/SPARK-16711.
  3. Sep 01, 2016
    • [SPARK-17331][CORE][MLLIB] Avoid allocating 0-length arrays · 3893e8c5
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Avoid allocating some 0-length arrays, esp. in UTF8String, and by using Array.empty in Scala over Array[T]()
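
      As a rough Java illustration of the idea (not the patch itself), a class can hand out one shared zero-length array instead of allocating a new one on every call. The class name `EmptyArrays` and the helper `slice` are hypothetical.

```java
import java.util.Arrays;

final class EmptyArrays {
    // A zero-length array has no elements to mutate, so one instance can be shared safely.
    static final byte[] EMPTY_BYTES = new byte[0];

    // Hypothetical helper: copy a range of src, reusing the shared constant for empty results.
    static byte[] slice(byte[] src, int from, int to) {
        if (from >= to) {
            return EMPTY_BYTES;
        }
        return Arrays.copyOfRange(src, from, to);
    }
}
```

      Every empty result is then the same object, so repeated empty slices cost no allocation.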
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14895 from srowen/SPARK-17331.
  4. Aug 31, 2016
    • [SPARK-17332][CORE] Make Java Loggers static members · 5d84c7fd
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Make all Java Loggers static members
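
      A minimal sketch of the pattern, assuming `java.util.logging` purely to keep the example self-contained (it is not meant to mirror the logging library Spark uses). The class `ExampleHandler` is hypothetical.

```java
import java.util.logging.Logger;

final class ExampleHandler {
    // One logger per class, created once when the class is initialized,
    // instead of one logger field per instance.
    private static final Logger logger = Logger.getLogger(ExampleHandler.class.getName());

    void handle(String blockId) {
        logger.fine("Handling block " + blockId);
    }

    // Accessor used only to show that all instances share the same logger.
    static Logger logger() {
        return logger;
    }
}
```

      With a static member, creating many handler instances does not repeat the logger lookup.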
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14896 from srowen/SPARK-17332.
  5. Aug 30, 2016
  6. Aug 25, 2016
  7. Aug 22, 2016
    • [SPARK-17127] Make unaligned access in unsafe available for AArch64 · 083de00c
      Richael authored
      ## What changes were proposed in this pull request?
      
      Since Spark 2.0.0, when MemoryMode.OFF_HEAP is set, Spark checks whether the architecture supports unaligned access. If the check doesn't pass, an exception is raised.
      
      AArch64 also supports unaligned access, but currently only i386, x86, amd64, and X86_64 are included.
      
      I think we should include aarch64 when performing the check.
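
      The kind of check being discussed might look roughly like the sketch below. The class `UnalignedCheck`, the method `supportsUnaligned`, and the exact pattern (including the case-insensitive match) are assumptions for illustration, not the code in the patch.

```java
import java.util.regex.Pattern;

final class UnalignedCheck {
    // Architectures assumed to support unaligned access: the four named above plus aarch64.
    private static final Pattern UNALIGNED_ARCH =
        Pattern.compile("^(i386|x86|amd64|x86_64|aarch64)$", Pattern.CASE_INSENSITIVE);

    static boolean supportsUnaligned(String arch) {
        return arch != null && UNALIGNED_ARCH.matcher(arch).matches();
    }

    public static void main(String[] args) {
        // os.arch is the standard JVM system property naming the CPU architecture.
        String arch = System.getProperty("os.arch");
        System.out.println(arch + " supports unaligned access: " + supportsUnaligned(arch));
    }
}
```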
      
      ## How was this patch tested?
      
      Unit test suite
      
      Author: Richael <Richael.Zhuang@arm.com>
      
      Closes #14700 from yimuxi/zym_change_unsafe.
  8. Aug 04, 2016
    • [HOTFIX] Remove unnecessary imports from #12944 that broke build · d91c6755
      Josh Rosen authored
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14499 from JoshRosen/hotfix.
    • [SPARK-15074][SHUFFLE] Cache shuffle index file to speedup shuffle fetch · 9c15d079
      Sital Kedia authored
      ## What changes were proposed in this pull request?
      
      Shuffle fetch on a large intermediate dataset is slow because the shuffle service opens and closes the index file for each shuffle fetch. This change introduces a cache for the index information so that we can avoid accessing the index files for each block fetch.
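
      The caching idea can be sketched with a small LRU cache built on the JDK's `LinkedHashMap`. This is an illustration under assumptions: the class `IndexCache`, its loader function, and the eviction policy are hypothetical and do not reproduce the cache used in the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

final class IndexCache<K, V> {
    private final Function<K, V> loader;
    private final Map<K, V> cache;

    IndexCache(int maxEntries, Function<K, V> loader) {
        this.loader = loader;
        // accessOrder=true keeps entries ordered from least to most recently used.
        this.cache = new LinkedHashMap<K, V>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
                return size() > maxEntries;
            }
        };
    }

    // Returns the cached value, calling the loader (e.g. reading the index file) only on a miss.
    synchronized V get(K key) {
        V value = cache.get(key);
        if (value == null) {
            value = loader.apply(key);
            cache.put(key, value);
        }
        return value;
    }
}
```

      Repeated fetches for the same key then hit memory, and only a miss pays for the file access.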
      
      ## How was this patch tested?
      
      Tested by running a job on the cluster and the shuffle read time was reduced by 50%.
      
      Author: Sital Kedia <skedia@fb.com>
      
      Closes #12944 from sitalkedia/shuffle_service.
  9. Jul 19, 2016
  10. Jul 14, 2016
  11. Jul 13, 2016
    • [MINOR] Fix Java style errors and remove unused imports · f73891e0
      Xin Ren authored
      ## What changes were proposed in this pull request?
      
      Fix Java style errors and remove unused imports that were found at random.
      
      ## How was this patch tested?
      
      Tested on my local machine.
      
      Author: Xin Ren <iamshrek@126.com>
      
      Closes #14161 from keypointt/SPARK-16437.
  12. Jul 12, 2016
    • [SPARK-16405] Add metrics and source for external shuffle service · 68df47ac
      Yangyang Liu authored
      ## What changes were proposed in this pull request?
      
      Since the external shuffle service is essential for Spark, it needs better monitoring. To that end, we added several metrics to the shuffle service and exposed them to the metrics system through ExternalShuffleServiceSource.
      Metrics added to the shuffle service:
      * registeredExecutorsSize
      * openBlockRequestLatencyMillis
      * registerExecutorRequestLatencyMillis
      * blockTransferRateBytes
      
      JIRA Issue: https://issues.apache.org/jira/browse/SPARK-16405
      
      ## How was this patch tested?
      
      Some test cases are added to verify that the metrics appear as expected in the metrics system. Those unit tests are in `ExternalShuffleBlockHandlerSuite`.
      
      Author: Yangyang Liu <yangyangliu@fb.com>
      
      Closes #14080 from lovexi/yangyang-metrics.
  13. Jul 11, 2016
    • [SPARK-16477] Bump master version to 2.1.0-SNAPSHOT · ffcb6e05
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      After SPARK-16476 (committed earlier today as #14128), we can finally bump the version number.
      
      ## How was this patch tested?
      N/A
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #14130 from rxin/SPARK-16477.
  14. Jul 08, 2016
    • [SPARK-16420] Ensure compression streams are closed. · 67e085ef
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      This uses the try/finally pattern to ensure streams are closed after use. `UnsafeShuffleWriter` wasn't closing compression streams, causing them to leak resources until garbage collected. This was causing a problem with codecs that use off-heap memory.
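
      A minimal sketch of the try/finally pattern, using the JDK's `GZIPOutputStream` as a stand-in for a compression codec. The class `CloseCompressionStream` and its `compress` method are hypothetical and are not taken from `UnsafeShuffleWriter`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.util.zip.GZIPOutputStream;

final class CloseCompressionStream {
    static byte[] compress(byte[] data) {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try {
            OutputStream out = new GZIPOutputStream(sink);
            try {
                out.write(data);
            } finally {
                // Runs whether write() succeeds or throws, so the compressor's
                // resources are released promptly instead of waiting for GC.
                out.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sink.toByteArray();
    }
}
```

      In code that can use it, a try-with-resources statement gives the same guarantee with less nesting.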
      
      ## How was this patch tested?
      
      Current tests are sufficient. This should not change behavior.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #14093 from rdblue/SPARK-16420-unsafe-shuffle-writer-leak.
  15. Jul 06, 2016
  16. Jun 17, 2016
    • [SPARK-16018][SHUFFLE] Shade netty to load shuffle jar in Nodemanger · 298c4ae8
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      Shade the io.netty namespace so that we can use it in shuffle independently of the dependencies pulled in by Hadoop jars.
      
      ## How was this patch tested?
      Ran a decent-sized job involving shuffle write/read and tested the new spark-x-yarn-shuffle jar. After shading the io.netty namespace, the NodeManager loads and the shuffle job completes successfully.
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #13739 from dhruve/bug/SPARK-16018.
  17. Jun 03, 2016
    • [SPARK-15391] [SQL] manage the temporary memory of timsort · 3074f575
      Davies Liu authored
      ## What changes were proposed in this pull request?
      
      Currently, the memory for the temporary buffer used by TimSort is always allocated on-heap without bookkeeping, which could cause OOM in both on-heap and off-heap mode.
      
      This PR tries to manage that by preallocating it together with the pointer array, the same as RadixSort. It works for both on-heap and off-heap mode.
      
      This PR also changes the loadFactor of BytesToBytesMap to 0.5 (it was 0.70), which enables us to use radix sort and also makes sure that we have enough memory for TimSort.
      
      ## How was this patch tested?
      
      Existing tests.
      
      Author: Davies Liu <davies@databricks.com>
      
      Closes #13318 from davies/fix_timsort.
  18. May 29, 2016
    • [MINOR] Resolve a number of miscellaneous build warnings · ce1572d1
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      This change resolves a number of build warnings that have accumulated, before 2.x. It does not address a large number of deprecation warnings, especially related to the Accumulator API. That will happen separately.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #13377 from srowen/BuildWarnings.
  19. May 18, 2016
  20. May 17, 2016
  21. May 10, 2016
    • [SPARK-14963][YARN] Using recoveryPath if NM recovery is enabled · aab99d31
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      Since Hadoop 2.5, the YARN NodeManager supports NM recovery, which uses a recovery path for auxiliary services such as spark_shuffle and mapreduce_shuffle. This change uses that path instead of the NM local dir if NM recovery is enabled.
      
      ## How was this patch tested?
      
      Unit test + local test.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #12994 from jerryshao/SPARK-14963.
  22. May 07, 2016
    • [SPARK-15178][CORE] Remove LazyFileRegion instead use netty's DefaultFileRegion · 6e268b9e
      Sandeep Singh authored
      ## What changes were proposed in this pull request?
      Remove LazyFileRegion and use Netty's DefaultFileRegion instead. LazyFileRegion was created so that we didn't create a file descriptor before having to send the file.
      
      ## How was this patch tested?
      Existing tests
      
      Author: Sandeep Singh <sandeep@techaddict.me>
      
      Closes #12977 from techaddict/SPARK-15178.
  23. May 05, 2016
    • [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update... · 2c170dd3
      Dongjoon Hyun authored
      [SPARK-15134][EXAMPLE] Indent SparkSession builder patterns and update binary_classification_metrics_example.py
      
      ## What changes were proposed in this pull request?
      
      This issue addresses the comments in SPARK-15031 and also fixes java-linter errors.
      - Use multiline format in SparkSession builder patterns.
      - Update `binary_classification_metrics_example.py` to use `SparkSession`.
      - Fix Java Linter errors (in SPARK-13745, SPARK-15031, and so far)
      
      ## How was this patch tested?
      
      Passed the Jenkins tests and ran `dev/lint-java` manually.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12911 from dongjoon-hyun/SPARK-15134.
  24. May 04, 2016
    • [SPARK-15121] Improve logging of external shuffle handler · 0c00391f
      Thomas Graves authored
      ## What changes were proposed in this pull request?
      
      Add more informative logging in the external shuffle service to aid in debugging who is connecting to the YARN Nodemanager when the external shuffle service runs under it.
      
      ## How was this patch tested?
      
      Ran and saw logs coming out in log file.
      
      Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
      
      Closes #12900 from tgravescs/SPARK-15121.
  25. Apr 28, 2016
  26. Apr 26, 2016
    • [SPARK-14756][CORE] Use parseLong instead of valueOf · de6e6334
      Azeem Jiva authored
      ## What changes were proposed in this pull request?
      
      Use Long.parseLong, which returns a primitive.
      Using a series of append() calls avoids creating an extra StringBuilder.
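
      Both points can be shown in a few lines of Java. The class `ParseAndAppend` and its method names are hypothetical; only `Long.parseLong` and `StringBuilder.append` come from the description above.

```java
final class ParseAndAppend {
    // Long.parseLong returns a primitive long directly; Long.valueOf would return
    // a boxed Long that then has to be unboxed for a long return type.
    static long parseSize(String s) {
        return Long.parseLong(s);
    }

    // Chained append() calls write straight into one StringBuilder, whereas
    // append(a + b) has to build the concatenated string first.
    static String describe(String appId, long size) {
        return new StringBuilder()
            .append("app=").append(appId)
            .append(", size=").append(size)
            .toString();
    }
}
```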
      
      ## How was this patch tested?
      
      Unit tests
      
      Author: Azeem Jiva <azeemj@gmail.com>
      
      Closes #12520 from javawithjiva/minor.
  27. Apr 25, 2016
  28. Apr 18, 2016
    • [SPARK-14667] Remove HashShuffleManager · 5e92583d
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      The sort shuffle manager has been the default since Spark 1.2. It is time to remove the old hash shuffle manager.
      
      ## How was this patch tested?
      Removed some tests related to the old manager.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12423 from rxin/SPARK-14667.
  29. Apr 12, 2016
    • [SPARK-14547] Avoid DNS resolution for reusing connections · c439d88e
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch changes the connection creation logic in the network client module to avoid DNS resolution when reusing connections.
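
      One way to get this behavior, sketched under assumptions, is to key the connection cache on an unresolved address so that a lookup for an existing connection never touches DNS. The class `ConnectionPool`, the string stand-in for a connection, and the counter are hypothetical; `InetSocketAddress.createUnresolved` is the standard JDK call.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

final class ConnectionPool {
    private final ConcurrentHashMap<InetSocketAddress, String> pool = new ConcurrentHashMap<>();
    final AtomicInteger connects = new AtomicInteger();

    String getConnection(String host, int port) {
        // createUnresolved never performs a DNS lookup, so building the key is cheap.
        InetSocketAddress key = InetSocketAddress.createUnresolved(host, port);
        return pool.computeIfAbsent(key, k -> connect(host, port));
    }

    // Stand-in for real connection setup, the only place where resolution would be needed.
    private String connect(String host, int port) {
        connects.incrementAndGet();
        return "conn-" + host + ":" + port;
    }
}
```

      Only the first request for a given host and port reaches `connect`; later requests reuse the cached entry.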
      
      ## How was this patch tested?
      Testing in production. This is too difficult to test in isolation (for high fidelity unit tests, we'd need to change the DNS resolution behavior in the JVM).
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #12315 from rxin/SPARK-14547.
  30. Apr 06, 2016
    • [SPARK-14134][CORE] Change the package name used for shading classes. · 21d5ca12
      Marcelo Vanzin authored
      The current package name uses a dash, which is a little weird but seemed
      to work. That is, until a new test tried to mock a class that references
      one of those shaded types, and then things started failing.
      
      Most changes are just noise to fix the logging configs.
      
      For reference, SPARK-8815 also raised this issue, although at the time it
      did not cause any issues in Spark, so it was not addressed.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11941 from vanzin/SPARK-14134.
    • [SPARK-14290][CORE][NETWORK] avoid significant memory copy in netty's transferTo · c4bb02ab
      Zhang, Liye authored
      ## What changes were proposed in this pull request?
      When Netty transfers data that is not a `FileRegion`, the data is in the form of a `ByteBuf`. If the data is large, there is a significant performance issue because of the underlying memory copy in `sun.nio.ch.IOUtil.write`: CPU usage reaches 100% while network throughput stays very low.
      
      In this PR, if the data size is large, we split it into small chunks for each call to `WritableByteChannel.write()`, which avoids the wasted memory copy. Because the data can't be written within a single write, `transferTo` will be called multiple times.
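
      The chunking idea can be sketched with plain NIO buffers. The class `ChunkedWrite`, the method `writeChunk`, and the chunk size used in any caller are assumptions for illustration; the sketch is not the code from the patch.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

final class ChunkedWrite {
    // Offers at most chunkSize bytes to a single write() call and advances
    // the source buffer by however many bytes the channel accepted.
    static int writeChunk(WritableByteChannel target, ByteBuffer data, int chunkSize) {
        int length = Math.min(data.remaining(), chunkSize);
        ByteBuffer chunk = data.duplicate();
        chunk.limit(chunk.position() + length);
        try {
            int written = target.write(chunk);
            data.position(data.position() + written);
            return written;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

      A caller invokes `writeChunk` repeatedly until the buffer is drained, so each call hands the channel only a bounded slice rather than the whole payload.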
      
      ## How was this patch tested?
      Spark unit test and manual test.
      Manual test:
      `sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length`
      
      For more details, please refer to [SPARK-14290](https://issues.apache.org/jira/browse/SPARK-14290)
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #12083 from liyezhang556520/spark-14290.
  31. Apr 03, 2016
    • [SPARK-14355][BUILD] Fix typos in Exception/Testcase/Comments and static analysis results · 3f749f7e
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR contains the following 5 types of maintenance fixes over 59 files (+94 lines, -93 lines).
      - Fix typos(exception/log strings, testcase name, comments) in 44 lines.
      - Fix lint-java errors (MaxLineLength) in 6 lines. (New codes after SPARK-14011)
      - Use diamond operators in 40 lines. (New codes after SPARK-13702)
      - Fix redundant semicolon in 5 lines.
      - Rename class `InferSchemaSuite` to `CSVInferSchemaSuite` in CSVInferSchemaSuite.scala.
      
      ## How was this patch tested?
      
      Manual and pass the Jenkins tests.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #12139 from dongjoon-hyun/SPARK-14355.
  32. Apr 01, 2016
    • [SPARK-13992] Add support for off-heap caching · e41acb75
      Josh Rosen authored
      This patch adds support for caching blocks in the executor processes using direct / off-heap memory.
      
      ## User-facing changes
      
      **Updated semantics of `OFF_HEAP` storage level**: In Spark 1.x, the `OFF_HEAP` storage level indicated that an RDD should be cached in Tachyon. Spark 2.x removed the external block store API that Tachyon caching was based on (see #10752 / SPARK-12667), so `OFF_HEAP` became an alias for `MEMORY_ONLY_SER`. As of this patch, `OFF_HEAP` means "serialized and cached in off-heap memory or on disk". Via the `StorageLevel` constructor, `useOffHeap` can be set if `serialized == true` and can be used to construct custom storage levels which support replication.
      
      **Storage UI reporting**: the storage UI will now report whether in-memory blocks are stored on- or off-heap.
      
      **Only supported by UnifiedMemoryManager**: for simplicity, this feature is only supported when the default UnifiedMemoryManager is used; applications which use the legacy memory manager (`spark.memory.useLegacyMode=true`) are not currently able to allocate off-heap storage memory, so using off-heap caching will fail with an error when legacy memory management is enabled. Given that we plan to eventually remove the legacy memory manager, this is not a significant restriction.
      
      **Memory management policies:** the policies for dividing available memory between execution and storage are the same for both on- and off-heap memory. For off-heap memory, the total amount of memory available for use by Spark is controlled by `spark.memory.offHeap.size`, which is an absolute size. Off-heap storage memory obeys `spark.memory.storageFraction` in order to control the amount of unevictable storage memory. For example, if `spark.memory.offHeap.size` is 1 gigabyte and Spark uses the default `storageFraction` of 0.5, then up to 500 megabytes of off-heap cached blocks will be protected from eviction due to execution memory pressure. If necessary, we can split `spark.memory.storageFraction` into separate on- and off-heap configurations, but this doesn't seem necessary now and can be done later without any breaking changes.
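
      The arithmetic in that example can be written out as a small Java sketch. The class `OffHeapStorage` and the method `protectedStorageBytes` are hypothetical; only the two configuration names and the 0.5 default come from the text above.

```java
final class OffHeapStorage {
    // Bytes of off-heap storage memory that execution cannot evict:
    // spark.memory.offHeap.size multiplied by spark.memory.storageFraction.
    static long protectedStorageBytes(long offHeapSizeBytes, double storageFraction) {
        return (long) (offHeapSizeBytes * storageFraction);
    }

    public static void main(String[] args) {
        long oneGiB = 1L << 30;
        System.out.println(protectedStorageBytes(oneGiB, 0.5) + " bytes protected");
    }
}
```

      For a size of 1 GiB (2^30 bytes) and a fraction of 0.5 this gives 536,870,912 bytes, that is 512 MiB, which matches the rough "500 megabytes" figure.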
      
      **Use of off-heap memory does not imply use of off-heap execution (or vice-versa)**: for now, the settings controlling the use of off-heap execution memory (`spark.memory.offHeap.enabled`) and off-heap caching are completely independent, so Spark SQL can be configured to use off-heap memory for execution while continuing to cache blocks on-heap. If desired, we can change this in a followup patch so that `spark.memory.offHeap.enabled` affects the default storage level for cached SQL tables.
      
      ## Internal changes
      
      - Rename `ByteArrayChunkOutputStream` to `ChunkedByteBufferOutputStream`
        - It now returns a `ChunkedByteBuffer` instead of an array of byte arrays.
        - Its constructor now accepts an `allocator` function which is called to allocate `ByteBuffer`s. This allows us to control whether it allocates regular ByteBuffers or off-heap DirectByteBuffers.
        - Because block serialization is now performed during the unroll process, a `ChunkedByteBufferOutputStream` which is configured with a `DirectByteBuffer` allocator will use off-heap memory for both unroll and storage memory.
      - The `MemoryStore`'s MemoryEntries now track whether blocks are stored on- or off-heap.
        - `evictBlocksToFreeSpace()` now accepts a `MemoryMode` parameter so that we don't try to evict off-heap blocks in response to on-heap memory pressure (or vice-versa).
      - Make sure that off-heap buffers are properly de-allocated during MemoryStore eviction.
      - The JVM limits the total size of allocated direct byte buffers using the `-XX:MaxDirectMemorySize` flag and the default tends to be fairly low (< 512 megabytes in some JVMs). To work around this limitation, this patch adds a custom DirectByteBuffer allocator which ignores this memory limit.
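
      The allocator-function idea from the list above can be illustrated in Java with an `IntFunction<ByteBuffer>`. The class `ChunkAllocator` and the method `allocateChunks` are hypothetical, and the sketch uses the standard `ByteBuffer.allocateDirect`, which is still subject to `-XX:MaxDirectMemorySize`, rather than the custom allocator the patch adds.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

final class ChunkAllocator {
    // Allocates enough fixed-size chunks to hold totalBytes. The caller decides
    // where the memory lives by choosing the allocator function.
    static List<ByteBuffer> allocateChunks(int totalBytes, int chunkSize,
                                           IntFunction<ByteBuffer> allocator) {
        List<ByteBuffer> chunks = new ArrayList<>();
        for (int remaining = totalBytes; remaining > 0; remaining -= chunkSize) {
            chunks.add(allocator.apply(chunkSize));
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<ByteBuffer> onHeap = allocateChunks(10, 4, ByteBuffer::allocate);
        List<ByteBuffer> offHeap = allocateChunks(10, 4, ByteBuffer::allocateDirect);
        System.out.println(onHeap.get(0).isDirect() + " " + offHeap.get(0).isDirect());
    }
}
```

      Passing `ByteBuffer::allocate` keeps the chunks on-heap, while `ByteBuffer::allocateDirect` places them off-heap, without any change to the chunking code.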
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #11805 from JoshRosen/off-heap-caching.
  33. Mar 31, 2016
    • [SPARK-14242][CORE][NETWORK] avoid copy in compositeBuffer for frame decoder · 96941b12
      Zhang, Liye authored
      ## What changes were proposed in this pull request?
      In this patch, we set the initial `maxNumComponents` to `Integer.MAX_VALUE` instead of the default size (which is 16) when allocating `compositeBuffer` in `TransportFrameDecoder`. With the default `maxNumComponents`, `compositeBuffer` introduces too many underlying memory copies when the frame size is large (which results in many transport messages). For details, please refer to [SPARK-14242](https://issues.apache.org/jira/browse/SPARK-14242).
      
      ## How was this patch tested?
      Spark unit tests and manual tests.
      For manual tests, we can reproduce the performance issue with the following code:
      `sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length`
      It's easy to see the performance gain, both from the running time and CPU usage.
      
      Author: Zhang, Liye <liye.zhang@intel.com>
      
      Closes #12038 from liyezhang556520/spark-14242.
  34. Mar 29, 2016
    • [SPARK-14254][CORE] Add logs to help investigate the network performance · 7320f9bd
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      It would be very helpful for network performance investigation if we log the time spent on connecting and resolving host.
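
      A rough sketch of this kind of timing, assuming only JDK classes: it measures how long constructing an `InetSocketAddress` (which attempts to resolve the host name) takes and prints the result. The class `ResolveTiming`, the method name, and the message format are hypothetical.

```java
import java.net.InetSocketAddress;
import java.util.concurrent.TimeUnit;

final class ResolveTiming {
    static long timeResolveMillis(String host, int port) {
        long start = System.nanoTime();
        // The InetSocketAddress(String, int) constructor attempts to resolve the host name.
        InetSocketAddress address = new InetSocketAddress(host, port);
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - start);
        System.out.println("Resolving " + host + ":" + port + " took " + elapsedMs + " ms"
            + (address.isUnresolved() ? " (unresolved)" : ""));
        return elapsedMs;
    }
}
```

      The same start/stop pattern around the connect call would give the time spent establishing the connection.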
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #12046 from zsxwing/connection-time.