  1. Jan 14, 2014
    • Fixed loose ends in docs. · f8bd828c
      Tathagata Das authored
      f8bd828c
    • Merge remote-tracking branch 'apache/master' into filestream-fix · f8e239e0
      Tathagata Das authored
      Conflicts:
      	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
      f8e239e0
    • Removed StreamingContext.registerInputStream and registerOutputStream - they... · 4e497db8
      Tathagata Das authored
      Removed StreamingContext.registerInputStream and registerOutputStream - they were useless as InputDStream has been made to register itself. Also made DStream.register() private[streaming] - not useful to expose the confusing function. Updated a lot of documentation.
      4e497db8
    • Merge pull request #380 from mateiz/py-bayes · fdaabdc6
      Patrick Wendell authored
      Add Naive Bayes to Python MLlib, and some API fixes
      
      - Added a Python wrapper for Naive Bayes
      - Updated the Scala Naive Bayes to match the style of our other
        algorithms better and in particular make it easier to call from Java
        (added builder pattern, removed default value in train method)
      - Updated Python MLlib functions to not require a SparkContext; we can
        get that from the RDD the user gives
      - Added a toString method in LabeledPoint
      - Made the Python MLlib tests run as part of run-tests as well (before
        they could only be run individually through each file)
      fdaabdc6
    • Merge pull request #367 from ankurdave/graphx · 4a805aff
      Patrick Wendell authored
      GraphX: Unifying Graphs and Tables
      
      GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API that leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph-structured data at scale. See http://amplab.github.io/graphx/.
      
      Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak.
      
      Tasks left:
      - [x] Graph-level uncache
      - [x] Uncache previous iterations in Pregel
      - [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release)
        - [x] Describe GC issue with GraphLab
      - [ ] Write `docs/graphx-programming-guide.md`
        - [x] Mention future Bagel support in docs
        - [ ] Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again.
      - [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx
      - [x] Make Graph serializable to work around capture in Spark shell
      - [x] Rename graph -> graphx in package name and subproject
      - [x] Remove standalone PageRank
      - [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~
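
      The caching advice in the task list above could look roughly like the sketch below. It is only an illustration on a plain RDD, not code from GraphX itself; the object name, the local master, the data, and the per-iteration `map` are all assumptions made up for the example.

      ```scala
      import org.apache.spark.SparkContext
      import org.apache.spark.rdd.RDD

      // Hypothetical driver program illustrating the cache / materialize / uncache loop.
      object IterativeCachingSketch {
        def main(args: Array[String]): Unit = {
          val sc = new SparkContext("local", "IterativeCachingSketch")

          // Initial dataset: cached and forced so the first iteration reads it from memory.
          var current: RDD[Double] = sc.parallelize(1 to 1000).map(_.toDouble).cache()
          current.count()

          for (i <- 1 to 10) {
            val previous = current

            // Derive and cache the next iteration's RDD (placeholder computation).
            current = previous.map(_ * 0.5).cache()

            // Force materialization while `previous` is still cached.
            current.count()

            // `previous` will not be referenced again, so release its cached blocks.
            previous.unpersist()
          }

          println(current.reduce(_ + _))
          sc.stop()
        }
      }
      ```

      The ordering matters: the new RDD is materialized before the old one is unpersisted, so the old data never has to be recomputed from lineage.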
      4a805aff
    • Merge pull request #408 from pwendell/external-serializers · 945fe7a3
      Patrick Wendell authored
      Improvements to external sorting
      
      1. Adds the option of compressing outputs.
      2. Adds batching to the serialization to prevent OOM on the read side.
      3. Slight renaming of config options.
      4. Use Spark's buffer size for reads in addition to writes.
      945fe7a3
    • adding documentation about EdgeRDD · 4bafc4f4
      Joseph E. Gonzalez authored
      4bafc4f4
    • Merge pull request #413 from rxin/scaladoc · 68641bce
      Patrick Wendell authored
      Adjusted visibility of various components and documentation for 0.9.0 release.
      68641bce
    • Merge pull request #401 from andrewor14/master · 0ca0d4d6
      Patrick Wendell authored
      External sorting - Add number of bytes spilled to Web UI
      
      Additionally, update test suite for external sorting to induce spilling.
      0ca0d4d6
    • Fix all code examples in guide · af645be5
      Ankur Dave authored
      af645be5
    • Finish 6f6f8c92 · 2cd9358c
      Ankur Dave authored
      2cd9358c
    • Merge pull request #409 from tdas/unpersist · 08b9fec9
      Patrick Wendell authored
      Automatically unpersisting RDDs that have been cleaned up from DStreams
      
      Previously, RDDs generated by DStreams were forgotten but not unpersisted. The system relied on the BlockManager's natural LRU eviction to drop the data. The cleaner.ttl setting was a hammer for cleaning up RDDs, but it has to be set separately and very conservatively (at best, a few minutes). Automatic unpersisting lets the system handle this itself, which reduces memory usage. As a side effect, it will also improve GC performance, since fewer objects are stored in memory. In fact, for some workloads, it may allow RDDs to be cached as deserialized, which speeds up processing without too much GC overhead.
      
      This is disabled by default. To enable it, set the configuration property spark.streaming.unpersist to true. In a future release, this will be set to true by default.
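
      A minimal sketch of turning the flag on, assuming the application is configured through `SparkConf`. The master URL, application name, and batch interval are placeholders chosen for illustration; only the `spark.streaming.unpersist` key comes from the description above.

      ```scala
      import org.apache.spark.SparkConf
      import org.apache.spark.streaming.{Seconds, StreamingContext}

      // Placeholder master and app name; the relevant line is the .set(...) call.
      val conf = new SparkConf()
        .setMaster("local[2]")
        .setAppName("UnpersistExample")
        .set("spark.streaming.unpersist", "true") // opt in to automatic unpersisting

      // Placeholder batch interval of one second.
      val ssc = new StreamingContext(conf, Seconds(1))
      ```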
      
      Also, reduced the sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second. From my conversation with Matei, there does not seem to be any good reason for the sleep that lets messages be sent out to be so long.
      08b9fec9
    • Ankur Dave's avatar
      c6dbfd16
  2. Jan 13, 2014