  1. Jan 12, 2014
  2. Jan 11, 2014
  3. Jan 10, 2014
    • Merge pull request #381 from mateiz/default-ttl · 1d7bef0c
      Matei Zaharia authored
      Fix default TTL for metadata cleaner
      
      It seems to have been set to 3500 in a previous commit for debugging, but it should be off by default.
    • Merge pull request #382 from RongGu/master · 44d6a8e3
      Patrick Wendell authored
      Fix a typo in comment lines
      
    • Small typo fix · 08370a52
      Patrick Wendell authored
    • Merge pull request #383 from tdas/driver-test · f2655310
      Patrick Wendell authored
      API for automatic driver recovery for streaming programs and other bug fixes
      
      1. Added Scala and Java API for automatically loading checkpoint if it exists in the provided checkpoint directory.
      
        Scala API: `StreamingContext.getOrCreate(<checkpoint dir>, <function to create new StreamingContext>)` returns a StreamingContext
        Java API: `JavaStreamingContext.getOrCreate(<checkpoint dir>, <factory obj of type JavaStreamingContextFactory>)` returns a JavaStreamingContext
      
        See the RecoverableNetworkWordCount below as an example of how to use it.
      
      2. Refactored streaming.Checkpoint*** code to fix bugs and make the DStream metadata checkpoint writing and reading more robust. Specifically, it fixes and improves the logic behind backing up and writing metadata checkpoint files. It also ensures that spark.driver.* and spark.hostPort are cleared from SparkConf before being written to the checkpoint.
      
      3. Fixed a bug in the cleanup of checkpointed RDDs created by DStream. Specifically, this fix ensures that a checkpointed RDD's files are not prematurely cleaned up, thus ensuring reliable recovery.
      
      4. TimeStampedHashMap is upgraded to optionally update the timestamp on map.get(key). This allows clearing of data based on access time (i.e., clear records that were last accessed before a threshold timestamp).
      
      5. Added caching for file modification time in FileInputDStream using the updated TimeStampedHashMap. Without the caching, enumerating the mod times to find new files can take seconds if there are 1000s of files. This cache is automatically cleared.
      
      This PR is not entirely final, as I may make some minor additions - a Java example, and adding StreamingContext.getOrCreate to the unit tests.
      
      Edit: Java example to be added later, unit test added.
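The access-time semantics in item 4 can be sketched in plain Scala. This is an illustrative stand-in, not Spark's actual TimeStampedHashMap: the class name and the explicit `now` parameter are invented here so the sketch stays self-contained and deterministic.

```scala
import scala.collection.mutable

// Illustrative sketch of a map that stores a timestamp per entry and,
// when configured, refreshes that timestamp on get(). Entries can then
// be cleared by last-access time rather than last-insertion time.
class TimestampedMap[K, V](updateTimestampOnGet: Boolean) {
  private case class Entry(value: V, var timestamp: Long)
  private val map = mutable.HashMap.empty[K, Entry]

  def put(key: K, value: V, now: Long): Unit =
    map(key) = Entry(value, now)

  def get(key: K, now: Long): Option[V] =
    map.get(key).map { e =>
      if (updateTimestampOnGet) e.timestamp = now // refresh on access
      e.value
    }

  // Drop every entry whose timestamp is older than the threshold.
  def clearOldValues(threshold: Long): Unit = {
    val stale = map.collect { case (k, e) if e.timestamp < threshold => k }
    stale.foreach(map.remove)
  }

  def size: Int = map.size
}
```

With `updateTimestampOnGet = true`, a record read just before `clearOldValues` runs survives the sweep even if it was inserted long ago; with `false`, the map behaves like the insertion-time version.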
    • Merge pull request #377 from andrewor14/master · d37408f3
      Patrick Wendell authored
      External Sorting for Aggregator and CoGroupedRDDs (Revisited)
      
      (This pull request is re-opened from https://github.com/apache/incubator-spark/pull/303, which was closed because Jenkins / github was misbehaving)
      
      The target issue for this patch is the out-of-memory exceptions triggered by aggregate operations such as reduce, groupBy, join, and cogroup. The existing AppendOnlyMap used by these operations resides purely in memory, and grows with the size of the input data until the amount of allocated memory is exceeded. Under large workloads, this problem is aggravated by the fact that OOM frequently occurs only after a very long (> 1 hour) map phase, in which case the entire job must be restarted.
      
      The solution is to spill the contents of this map to disk once a certain memory threshold is exceeded. This functionality is provided by ExternalAppendOnlyMap, which additionally sorts this buffer before writing it out to disk, and later merges these buffers back in sorted order.
      
      Under normal circumstances in which OOM is not triggered, ExternalAppendOnlyMap is simply a wrapper around AppendOnlyMap and incurs little overhead. Only when the memory usage is expected to exceed the given threshold does ExternalAppendOnlyMap spill to disk.
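The spill-and-merge behaviour described above can be sketched with a toy word-count map. This is a deliberately simplified stand-in, not Spark's ExternalAppendOnlyMap: the class name is invented, the spill trigger counts distinct keys rather than estimating memory usage, and the final merge re-reads spilled runs wholesale instead of doing a streaming merge-sort.

```scala
import java.io.{File, PrintWriter}
import scala.collection.mutable
import scala.io.Source

// Toy spillable map: counts keys in memory until `maxInMemory` distinct
// keys are held, then sorts the contents by key and spills them to a
// temp file. result() merges the in-memory map with every spilled run.
class SpillableCountMap(maxInMemory: Int) {
  private val current: mutable.Map[String, Long] =
    mutable.HashMap.empty[String, Long].withDefaultValue(0L)
  private val spills = mutable.ArrayBuffer.empty[File]

  def insert(key: String): Unit = {
    current(key) += 1
    if (current.size > maxInMemory) spill()
  }

  private def spill(): Unit = {
    val file = File.createTempFile("spill", ".txt")
    file.deleteOnExit()
    val out = new PrintWriter(file)
    // Writing each run in key order is what would let a real
    // implementation merge the runs with a streaming merge-sort.
    current.toSeq.sortBy(_._1).foreach { case (k, v) => out.println(s"$k\t$v") }
    out.close()
    spills += file
    current.clear()
  }

  def result(): Map[String, Long] = {
    val merged: mutable.Map[String, Long] =
      mutable.HashMap.empty[String, Long].withDefaultValue(0L)
    current.foreach { case (k, v) => merged(k) += v }
    for (file <- spills; line <- Source.fromFile(file).getLines()) {
      val parts = line.split('\t')
      merged(parts(0)) += parts(1).toLong
    }
    merged.toMap
  }
}
```

When the threshold is never crossed, no file I/O happens at all, mirroring the wrapper-with-little-overhead behaviour the description mentions.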
    • Merge remote-tracking branch 'apache/master' into driver-test · 4f39e79c
      Tathagata Das authored
      Conflicts:
      	streaming/src/main/scala/org/apache/spark/streaming/DStreamGraph.scala
    • Update documentation for externalSorting · 2e393cd5
      Andrew Or authored
    • Merge pull request #369 from pillis/master · 0eaf01c5
      Reynold Xin authored
      SPARK-961 Add a Vector.random() method
      
      Added the method and test cases
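A minimal sketch of the kind of factory SPARK-961 describes, assuming elements drawn uniformly from [0, 1); the function name and optional seeded generator are illustrative, not Spark's actual Vector API.

```scala
import scala.util.Random

// Build a length-n vector of uniform random doubles in [0, 1).
// Passing a seeded Random makes the result reproducible.
def randomVector(n: Int, rng: Random = new Random()): Array[Double] =
  Array.fill(n)(rng.nextDouble())
```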
    • Address Patrick's and Reynold's comments · e4c51d21
      Andrew Or authored
      Aside from trivial formatting changes, use nulls instead of Options for
      DiskMapIterator, and add documentation for spark.shuffle.externalSorting
      and spark.shuffle.memoryFraction.
      
      Also, set spark.shuffle.memoryFraction to 0.3 and spark.storage.memoryFraction to 0.6.
    • Fix a typo in comment lines · 94776f75
      RongGu authored