  1. Feb 28, 2015
  2. Dec 08, 2014
    • SPARK-3926 [CORE] Reopened: result of JavaRDD collectAsMap() is not serializable · e829bfa1
      Sean Owen authored
      My original 'fix' didn't fix the problem at all. Now there's a unit test to check whether it works. Of the two options to really fix it -- copy the `Map` to a `java.util.HashMap`, or copy and modify Scala's implementation in `Wrappers.MapWrapper` -- I went with the latter.
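
      To make the problem concrete, here is a minimal Java sketch of a serializability check. It illustrates the first option (copying into a `java.util.HashMap`), which is the one this commit did not take, and it is not the unit test from the patch. The class `MapSerializationCheck` and the helpers `canSerialize` and `nonSerializableView` are hypothetical names; the anonymous `AbstractMap` only stands in for a wrapper that does not implement `Serializable`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.UncheckedIOException;
import java.util.AbstractMap;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class, not part of Spark.
class MapSerializationCheck {

    /** Returns true if Java serialization can write the value, false if some part is not Serializable. */
    static boolean canSerialize(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Wraps a map in a view that does not implement Serializable (stand-in for a non-serializable wrapper). */
    static <K, V> Map<K, V> nonSerializableView(Map<K, V> backing) {
        return new AbstractMap<K, V>() {
            @Override
            public Set<Map.Entry<K, V>> entrySet() {
                return backing.entrySet();
            }
        };
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new HashMap<>();
        counts.put("a", 1);
        counts.put("b", 2);

        Map<String, Integer> view = nonSerializableView(counts);
        System.out.println("view serializable: " + canSerialize(view));

        // Copying the entries into a HashMap yields a map that Java serialization accepts.
        System.out.println("copy serializable: " + canSerialize(new HashMap<>(view)));
    }
}
```

      The copy costs an extra pass over the entries, whereas a serializable wrapper avoids that pass.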
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #3587 from srowen/SPARK-3926 and squashes the following commits:
      
      8586bb9 [Sean Owen] Remove unneeded no-arg constructor, and add additional note about copied code in LICENSE
      7bb0e66 [Sean Owen] Make SerializableMapWrapper actually serialize, and add unit test
  3. Nov 05, 2014
    • [SPARK-4242] [Core] Add SASL to external shuffle service · 4c42986c
      Aaron Davidson authored
      Does three things: (1) Adds SASL to ExternalShuffleClient, (2) puts SecurityManager in BlockManager's constructor, and (3) adds unit test.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #3108 from aarondav/sasl-client and squashes the following commits:
      
      48b622d [Aaron Davidson] Screw it, let's just get LimitedInputStream
      3543b70 [Aaron Davidson] Back out of pom change due to unknown test issue?
      b58518a [Aaron Davidson] ByteStreams.limit() not available :(
      cbe451a [Aaron Davidson] Address comments
      2bf2908 [Aaron Davidson] [SPARK-4242] [Core] Add SASL to external shuffle service
  4. Oct 27, 2014
    • SPARK-4022 [CORE] [MLLIB] Replace colt dependency (LGPL) with commons-math · bfa614b1
      Sean Owen authored
      This change replaces usages of colt with commons-math3 equivalents, and makes some minor necessary adjustments to related code and tests to match.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #2928 from srowen/SPARK-4022 and squashes the following commits:
      
      61a232f [Sean Owen] Fix failure due to different sampling in JavaAPISuite.sample()
      16d66b8 [Sean Owen] Simplify seeding with call to reseedRandomGenerator
      a1a78e0 [Sean Owen] Use Well19937c
      31c7641 [Sean Owen] Fix Python Poisson test by choosing a different seed; about 88% of seeds should work but 1 didn't, it seems
      5c9c67f [Sean Owen] Additional test fixes from review
      d8f88e0 [Sean Owen] Replace colt with commons-math3. Some tests do not pass yet.
  5. Aug 26, 2014
    • [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey() · f1e71d4c
      Davies Liu authored
      Use an external sort to support sorting large datasets in the reduce stage.
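
      For readers unfamiliar with the technique, the sketch below shows a generic external sort: buffer values up to a memory limit, spill each sorted buffer to a temporary file, then merge the spilled runs with a priority queue. The PySpark change itself is written in Python; this Java version is only an illustration of the general idea, not the code from the commit. The class `ExternalSortSketch`, the `maxInMemory` parameter, and the one-integer-per-line spill format are all assumptions made for the example.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration of an external sort; not taken from PySpark.
class ExternalSortSketch {

    /** One sorted run on disk, plus the value most recently read from it. */
    private static final class Run {
        final BufferedReader reader;
        int head;

        Run(BufferedReader reader) {
            this.reader = reader;
        }

        /** Loads the next value into head; returns false (and closes the file) when the run is exhausted. */
        boolean advance() throws IOException {
            String line = reader.readLine();
            if (line == null) {
                reader.close();
                return false;
            }
            head = Integer.parseInt(line);
            return true;
        }
    }

    /** Sorts the input while buffering at most maxInMemory values before spilling to disk. */
    static List<Integer> sort(Iterator<Integer> input, int maxInMemory) {
        try {
            List<Path> spills = new ArrayList<>();
            List<Integer> buffer = new ArrayList<>();
            while (input.hasNext()) {
                buffer.add(input.next());
                if (buffer.size() >= maxInMemory) {
                    spills.add(spill(buffer));
                    buffer.clear();
                }
            }
            if (!buffer.isEmpty()) {
                spills.add(spill(buffer));
            }
            return merge(spills);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Sorts the buffer in memory and writes it to a temporary file, one value per line. */
    private static Path spill(List<Integer> buffer) throws IOException {
        Collections.sort(buffer);
        Path file = Files.createTempFile("external-sort-", ".run");
        try (BufferedWriter out = Files.newBufferedWriter(file)) {
            for (int value : buffer) {
                out.write(Integer.toString(value));
                out.newLine();
            }
        }
        return file;
    }

    /** K-way merge of the sorted runs. A real implementation would stream the result instead of collecting it. */
    private static List<Integer> merge(List<Path> spills) throws IOException {
        PriorityQueue<Run> heap = new PriorityQueue<>((a, b) -> Integer.compare(a.head, b.head));
        for (Path path : spills) {
            Run run = new Run(Files.newBufferedReader(path));
            if (run.advance()) {
                heap.add(run);
            }
        }
        List<Integer> sorted = new ArrayList<>();
        while (!heap.isEmpty()) {
            Run smallest = heap.poll();
            sorted.add(smallest.head);
            if (smallest.advance()) {
                heap.add(smallest);
            }
        }
        for (Path path : spills) {
            Files.deleteIfExists(path);
        }
        return sorted;
    }
}
```

      For example, `ExternalSortSketch.sort(values.iterator(), 3)` sorts any number of values while holding at most three in the in-memory buffer and one per run in the merge heap.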
      
      Author: Davies Liu <davies.liu@gmail.com>
      
      Closes #1978 from davies/sort and squashes the following commits:
      
      bbcd9ba [Davies Liu] check spilled bytes in tests
      b125d2f [Davies Liu] add test for external sort in rdd
      eae0176 [Davies Liu] choose different disks from different processes and instances
      1f075ed [Davies Liu] Merge branch 'master' into sort
      eb53ca6 [Davies Liu] Merge branch 'master' into sort
      644abaf [Davies Liu] add license in LICENSE
      19f7873 [Davies Liu] improve tests
      55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
  6. Aug 02, 2014
  7. Jul 29, 2014
  8. Jul 22, 2014
    • SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage · 85d3596e
      Aaron Davidson authored
      ### Why and what?
      Currently, the AppendOnlyMap performs an "in-place" sort by converting its array of [key, value, key, value] pairs into an array of [(key, value), (key, value)] pairs. However, this causes us to allocate many Tuple2 objects, which come at a nontrivial overhead.
      
      This patch adds a Sorter API, intended for in-memory sorts, which simply ports the Android Timsort implementation (available under Apache v2) and abstracts the interface in a way that introduces no more than one virtual function invocation of overhead at each abstraction point.
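
      The core idea, sorting the flat [key, value, key, value] array by moving pairs in place rather than wrapping each pair in a tuple, can be sketched as follows. This is only an illustration: it uses a plain insertion sort instead of Timsort to stay short, and `KVArraySortSketch` and `sortPairs` are hypothetical names rather than the Sorter API added by this patch.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical illustration; the real patch ports Timsort, this sketch uses insertion sort.
class KVArraySortSketch {

    /** Sorts a flat [key0, value0, key1, value1, ...] array by key without allocating pair objects. */
    @SuppressWarnings("unchecked")
    static <K> void sortPairs(Object[] kv, Comparator<K> byKey) {
        if (kv.length % 2 != 0) {
            throw new IllegalArgumentException("expected an even-length key/value array");
        }
        int pairs = kv.length / 2;
        for (int i = 1; i < pairs; i++) {
            Object key = kv[2 * i];
            Object value = kv[2 * i + 1];
            int j = i - 1;
            // Shift pairs with larger keys one pair-slot to the right.
            while (j >= 0 && byKey.compare((K) kv[2 * j], (K) key) > 0) {
                kv[2 * (j + 1)] = kv[2 * j];
                kv[2 * (j + 1) + 1] = kv[2 * j + 1];
                j--;
            }
            kv[2 * (j + 1)] = key;
            kv[2 * (j + 1) + 1] = value;
        }
    }

    public static void main(String[] args) {
        Object[] kv = {3.5f, "c", 1.0f, "a", 2.25f, "b"};
        Comparator<Float> byKey = Comparator.naturalOrder();
        sortPairs(kv, byKey);
        System.out.println(Arrays.toString(kv));
    }
}
```

      Only two local references are held while a pair is being moved, so the sort allocates nothing per element.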
      
      Please compare our port of the Android Timsort sort with the original implementation: http://www.diffchecker.com/wiwrykcl
      
      ### Memory implications
      An AppendOnlyMap contains N kv pairs, which results in roughly 2N elements within its underlying array. Each of these elements is 4 bytes wide in a [compressed OOPS](https://wikis.oracle.com/display/HotSpotInternals/CompressedOops) system, which is the default.
      
      Today's approach immediately allocates N Tuple2 objects, which take up 24N bytes in total (as measured with YourKit), and then runs a Java sort over them. The Java 6 version immediately copies the entire array (4N bytes here), while the Java 7 version has a worst-case allocation of half the array (2N bytes).
      This results in a worst-case sorting overhead of 24N + 2N = 26N bytes (for Java 7).
      
      The Sorter does not require allocating any tuples, but since it uses Timsort, it may copy up to half the entire array in the worst case.
      This results in a worst-case sorting overhead of 4N bytes.
      
      Thus, we have reduced the worst-case overhead of the sort by roughly 22 bytes times the number of elements.
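
      Spelled out, the arithmetic behind these figures is as follows, using the sizes stated above (4-byte references, 24 bytes per Tuple2, an N-element tuple array, and a 2N-element KV array). The Java 6 total is not stated in the text; it is derived here from the same numbers.

```latex
\begin{align*}
\text{Tuple-sort, Java 6:} \quad & 24N + 4N = 28N \text{ bytes} \\
\text{Tuple-sort, Java 7:} \quad & 24N + \tfrac{1}{2}(4N) = 26N \text{ bytes} \\
\text{Sorter (KV-array):} \quad & \tfrac{1}{2}(2N \cdot 4) = 4N \text{ bytes} \\
\text{Reduction vs. Java 7:} \quad & 26N - 4N = 22N \text{ bytes}
\end{align*}
```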
      
      ### Performance implications
      As the destructiveSortedIterator is used for spilling in an ExternalAppendOnlyMap, the purpose of this patch is to provide stability by reducing memory usage rather than to improve performance. However, because it implements Timsort, it also brings a substantial performance boost over our prior implementation.
      
      Here are the results of a microbenchmark that sorted 25 million, randomly distributed (Float, Int) pairs. The Java Arrays.sort() tests were run **only on the keys**, and thus moved less data. Our current implementation is called "Tuple-sort using Arrays.sort()" while the new implementation is "KV-array using Sorter".
      
      <table>
      <tr><th>Test</th><th>First run (JDK6)</th><th>Average of 10 (JDK6)</th><th>First run (JDK7)</th><th>Average of 10 (JDK7)</th></tr>
      <tr><td>primitive Arrays.sort()</td><td>3216 ms</td><td>1190 ms</td><td>2724 ms</td><td>131 ms (!!)</td></tr>
      <tr><td>Arrays.sort()</td><td>18564 ms</td><td>2006 ms</td><td>13201 ms</td><td>878 ms</td></tr>
      <tr><td>Tuple-sort using Arrays.sort()</td><td>31813 ms</td><td>3550 ms</td><td>20990 ms</td><td>1919 ms</td></tr>
      <tr><td><b>KV-array using Sorter</b></td><td></td><td></td><td><b>15020 ms</b></td><td><b>834 ms</b></td></tr>
      </table>
      
      The results show that this Sorter performs exactly as expected (after the first run) -- it is as fast as the Java 7 Arrays.sort() (which shares the same algorithm), but is significantly faster than the Tuple-sort on Java 6 or 7.
      
      In short, this patch should significantly improve performance for users running either Java 6 or 7.
      
      Author: Aaron Davidson <aaron@databricks.com>
      
      Closes #1502 from aarondav/sort and squashes the following commits:
      
      652d936 [Aaron Davidson] Update license, move Sorter to java src
      a7b5b1c [Aaron Davidson] fix licenses
      5c0efaf [Aaron Davidson] Update tmpLength
      ec395c8 [Aaron Davidson] Ignore benchmark (again) and fix docs
      034bf10 [Aaron Davidson] Change to Apache v2 Timsort
      b97296c [Aaron Davidson] Don't try to run benchmark on Jenkins + private[spark]
      6307338 [Aaron Davidson] SPARK-2047: Introduce an in-mem Sorter, and use it to reduce mem usage
  9. May 14, 2014
    • SPARK-1827. LICENSE and NOTICE files need a refresh to contain transitive dependency info · 2e5a7cde
      Sean Owen authored
      LICENSE and NOTICE policy is explained here:
      
      http://www.apache.org/dev/licensing-howto.html
      http://www.apache.org/legal/3party.html
      
      This leads to the following changes.
      
      First, this change enables two extensions to maven-shade-plugin in assembly/ that will try to include and merge all NOTICE and LICENSE files. This can't hurt.
      
      This generates a consolidated NOTICE file that I manually added to NOTICE.
      
      Next, a list of all dependencies and their licenses was generated:
      `mvn ... license:aggregate-add-third-party`
      to create: `target/generated-sources/license/THIRD-PARTY.txt`
      
      Each dependency is listed with one or more licenses. Where there was more than one, I determined the most compatible license for each.
      
      For "unknown" license dependencies, I manually evaluated their license. Many are actually Apache projects or components of projects covered already. The only non-trivial one was Colt, which has its own (compatible) license.
      
      I ignored Apache-licensed and public domain dependencies as these require no further action (beyond NOTICE above).
      
      BSD and MIT licenses (permissive Category A licenses) are evidently supposed to be mentioned in LICENSE, so I added a section with the corresponding output from the THIRD-PARTY.txt file.
      
      Everything else (Category B licenses) is evidently supposed to be mentioned in NOTICE (?), so I did the same there.
      
      LICENSE contained some license statements for source code that is redistributed. I left this as I think that is the right place to put it.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #770 from srowen/SPARK-1827 and squashes the following commits:
      
      a764504 [Sean Owen] Add LICENSE and NOTICE info for all transitive dependencies as of 1.0
  10. Mar 02, 2014
    • Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic. · 012bd5fb
      Michael Armbrust authored
      This allows developers to pass options (such as -D) to sbt. I also modified the SparkBuild to ensure Spark-specific properties are propagated to forked test JVMs.
      
      Author: Michael Armbrust <michael@databricks.com>
      
      Closes #14 from marmbrus/sbtScripts and squashes the following commits:
      
      c008b18 [Michael Armbrust] Merge the old sbt-launch-lib.bash with the new sbt-launcher jar downloading logic.
  11. Sep 02, 2013
  12. Jul 16, 2013
  13. Dec 07, 2010