- Jan 14, 2014
-
-
Patrick Wendell authored
-
Patrick Wendell authored
Add Naive Bayes to Python MLlib, and some API fixes
- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to better match the style of our other algorithms and, in particular, to make it easier to call from Java (added a builder pattern, removed the default value in the train method)
- Updated Python MLlib functions to not require a SparkContext; we can get that from the RDD the user gives
- Added a toString method to LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before, they could only be run individually through each file)
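On the Scala side, the change swaps the default lambda in `train` for a builder-style class. A minimal sketch of calling it, assuming the `setLambda`/`run` builder methods this commit describes (the helper name below is illustrative):

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Builder pattern: the smoothing parameter is set explicitly on the estimator
// rather than via a default value on train(), which also keeps the call
// straightforward from Java.
def trainModel(data: RDD[LabeledPoint]) =
  new NaiveBayes().setLambda(1.0).run(data)
```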
-
Patrick Wendell authored
GraphX: Unifying Graphs and Tables

GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph-structured data at scale. See http://amplab.github.io/graphx/.

Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak.

Tasks left:
- [x] Graph-level uncache
- [x] Uncache previous iterations in Pregel
- [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release)
  - [x] Describe GC issue with GraphLab
- [ ] Write `docs/graphx-programming-guide.md`
  - [x] Mention future Bagel support in docs
  - [ ] Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again (see the sketch after this list).
- [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx
- [x] Make Graph serializable to work around capture in the Spark shell
- [x] Rename graph -> graphx in package name and subproject
- [x] Remove standalone PageRank
- [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~
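For the caching/uncaching item above, here is a minimal sketch of the pattern in plain Spark, assuming a generic iterative computation; `step`, the storage level, and the forced materialization via `foreachPartition` are illustrative choices, not GraphX internals:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def iterate[T](initial: RDD[T], numIterations: Int)(step: RDD[T] => RDD[T]): RDD[T] = {
  var current: RDD[T] = initial.persist(StorageLevel.MEMORY_ONLY)
  current.foreachPartition(_ => ())        // force (materialize) the cached RDD
  for (_ <- 1 to numIterations) {
    val next: RDD[T] = step(current).persist(StorageLevel.MEMORY_ONLY)
    next.foreachPartition(_ => ())         // materialize this iteration's result
    current.unpersist(blocking = false)    // uncache data that won't be referenced again
    current = next
  }
  current
}
```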
-
Joseph E. Gonzalez authored
-
Patrick Wendell authored
Improvements to external sorting:
1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Uses Spark's buffer size for reads in addition to writes.
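A hedged sketch of what opting into compressed spill outputs might look like; the config keys below (e.g. `spark.shuffle.spill.compress`) are assumptions based on the compression option and renamed settings this commit mentions, not names confirmed by the message itself:

```scala
import org.apache.spark.SparkConf

// Assumed configuration keys for external sorting / spilling; treat the names
// as illustrative rather than authoritative.
val conf = new SparkConf()
  .setAppName("external-sort-example")
  .set("spark.shuffle.spill", "true")           // allow spilling to disk when in-memory maps grow too large
  .set("spark.shuffle.spill.compress", "true")  // assumed key enabling compression of spilled outputs
```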
-
Joseph E. Gonzalez authored
-
Patrick Wendell authored
Adjusted visibility of various components and documentation for 0.9.0 release.
-
Patrick Wendell authored
External sorting - Add the number of bytes spilled to the Web UI. Additionally, update the test suite for external sorting to induce spilling.
-
Ankur Dave authored
-
Ankur Dave authored
-
Patrick Wendell authored
Automatically unpersisting RDDs that have been cleaned up from DStreams

Earlier, RDDs generated by DStreams were forgotten but not unpersisted; the system relied on the natural BlockManager LRU to drop the data. The cleaner.ttl was a hammer to clean up RDDs, but it needs to be set separately and very conservatively (at best, a few minutes). This automatic unpersisting lets the system handle cleanup on its own, which reduces memory usage. As a side effect it will also improve GC performance, since there are fewer objects stored in memory. In fact, for some workloads, it may allow RDDs to be cached as deserialized objects, which speeds up processing without too much GC overhead.

This is disabled by default. To enable it, set the configuration spark.streaming.unpersist to true. In a future release, this will be set to true by default.

Also reduced the sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second. From my conversation with Matei, there does not seem to be any good reason for the sleep (which lets messages be sent out) to be so long.
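A minimal sketch of enabling the new behavior; the config key comes straight from the message above, while the app name and batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-unpersist-example")
  .set("spark.streaming.unpersist", "true")   // opt in; disabled by default in this release

// RDDs generated by DStreams on this context are unpersisted once no longer
// needed, instead of waiting for the BlockManager LRU or cleaner.ttl.
val ssc = new StreamingContext(conf, Seconds(1))
```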
-
Ankur Dave authored
-
Ankur Dave authored
-
- Jan 13, 2014
-
-
Ankur Dave authored
-
Ankur Dave authored
-
Ankur Dave authored
-
Ankur Dave authored
-
Ankur Dave authored
-
Ankur Dave authored
The loop occurred when numEdges < numVertices. This commit fixes it by allowing generateRandomEdges to generate a multigraph.
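A hedged sketch of the idea behind the fix: sample destination vertices with replacement, so duplicate edges (a multigraph) are allowed and the generator never loops waiting for enough distinct edges; the function name and signature below are simplified stand-ins, not the actual GraphX utility:

```scala
import scala.util.Random
import org.apache.spark.graphx.Edge

// Illustrative stand-in for generateRandomEdges: sampling with replacement can
// repeat (src, dst) pairs, so numEdges edges can always be produced even when
// numEdges exceeds the number of distinct candidate vertices.
def randomEdgesFrom(src: Int, numEdges: Int, maxVertexId: Int, seed: Long = 42L): Array[Edge[Int]] = {
  val rand = new Random(seed)
  Array.fill(numEdges)(Edge(src, rand.nextInt(maxVertexId), 1))
}
```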
-
Ankur Dave authored
-
Ankur Dave authored
-
Andrew Or authored
-
Matei Zaharia authored
-
Patrick Wendell authored
Add default value for HadoopRDD's `cloneRecords` constructor arg. A small mend to https://github.com/apache/incubator-spark/pull/359/files#diff-1 for backwards compatibility.
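A hedged illustration of the backwards-compatibility idea: give the new constructor argument a default so existing call sites keep compiling. The class and signature below are simplified stand-ins, not HadoopRDD's real constructor:

```scala
// Simplified stand-in for the pattern: the new cloneRecords flag gets a default
// value, so callers written before the argument existed still compile unchanged.
class HadoopRDDLike(
    path: String,
    minSplits: Int,
    cloneRecords: Boolean = true) {
  // ...
}

// Pre-existing call site, unaware of cloneRecords, still works:
val rdd = new HadoopRDDLike("hdfs://namenode/data/input", 2)
```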
-
Reynold Xin authored
-
Patrick Wendell authored
Improved logic of finding new files in FileInputDStream

Earlier, if HDFS had a hiccup and reported the existence of a new file (mod time T sec) at time T + 1 sec, fileStream could have missed that file. With this change, it should be able to find files that are delayed by up to <batch size> seconds. That is, even if a file is reported at T + <batch time> sec, the file stream should be able to catch it.

The new logic, at a high level, is as follows. It keeps track of the new files it found in the previous interval and the mod time of the oldest of those files (let's call it X). Then, in the current interval, it ignores files that were seen in the previous interval and files whose mod time is older than X. So if a new file gets reported by HDFS in the current interval but has a mod time in the previous interval, it will be considered. However, if its mod time is earlier than the previous interval (that is, earlier than X), it will be ignored. This is a current limitation, and future versions will improve this behavior further.

Also reduced line lengths in DStream to <= 100 chars.
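A minimal sketch of the filtering rule described above, assuming we track the set of files found in the previous interval and the mod time of the oldest of them (X); the helper below is illustrative, not FileInputDStream's actual code:

```scala
// Decide whether a file reported by HDFS in the current interval should be
// picked up, given what was seen in the previous interval.
def shouldAccept(
    path: String,
    modTime: Long,
    oldestPrevModTime: Long,            // X: mod time of the oldest file found last interval
    filesFoundLastInterval: Set[String]): Boolean = {
  if (filesFoundLastInterval.contains(path)) false   // already picked up last interval
  else if (modTime < oldestPrevModTime) false        // older than X: assume it was handled long ago
  else true                                          // delayed report, still within reach: accept it
}
```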
-
Harvey authored
Add default value for HadoopRDD's `cloneRecords` constructor arg, to maintain backwards compatibility.
-
Joseph E. Gonzalez authored
-
Patrick Wendell authored
-
Reynold Xin authored
-
Reynold Xin authored
-
Reynold Xin authored
-
Reynold Xin authored
-
Joseph E. Gonzalez authored
-
Joseph E. Gonzalez authored
-
Reynold Xin authored
-
Reynold Xin authored
-
Reynold Xin authored
-
Reynold Xin authored
Conflicts: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
-
Reynold Xin authored
-