- Jan 14, 2014
-
-
Patrick Wendell authored
-
Patrick Wendell authored
Add Naive Bayes to Python MLlib, and some API fixes
- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to better match the style of our other algorithms and, in particular, to make it easier to call from Java (added a builder pattern, removed the default value in the train method)
- Updated Python MLlib functions to not require a SparkContext; we can get that from the RDD the user gives
- Added a toString method to LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before, they could only be run individually through each file)
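On the Scala side, the change swaps the default lambda in `train` for a builder-style class. A minimal sketch of calling it, assuming the `setLambda`/`run` builder methods this commit describes (the helper name below is illustrative):

```scala
import org.apache.spark.mllib.classification.NaiveBayes
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Builder pattern: the smoothing parameter is set explicitly on the estimator
// rather than via a default value on train(), which also keeps the call
// straightforward from Java.
def trainModel(data: RDD[LabeledPoint]) =
  new NaiveBayes().setLambda(1.0).run(data)
```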
-
Patrick Wendell authored
GraphX: Unifying Graphs and Tables

GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph-structured data at scale. See http://amplab.github.io/graphx/.

Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak.

Tasks left:
- [x] Graph-level uncache
- [x] Uncache previous iterations in Pregel
- [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release)
  - [x] Describe GC issue with GraphLab
- [ ] Write `docs/graphx-programming-guide.md`
  - [x] Mention future Bagel support in docs
  - [ ] Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again (see the sketch after this list).
- [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx
- [x] Make Graph serializable to work around capture in the Spark shell
- [x] Rename graph -> graphx in package name and subproject
- [x] Remove standalone PageRank
- [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~
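For the caching/uncaching item above, here is a minimal sketch of the pattern in plain Spark, assuming a generic iterative computation; `step`, the storage level, and the forced materialization via `foreachPartition` are illustrative choices, not GraphX internals:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

def iterate[T](initial: RDD[T], numIterations: Int)(step: RDD[T] => RDD[T]): RDD[T] = {
  var current: RDD[T] = initial.persist(StorageLevel.MEMORY_ONLY)
  current.foreachPartition(_ => ())        // force (materialize) the cached RDD
  for (_ <- 1 to numIterations) {
    val next: RDD[T] = step(current).persist(StorageLevel.MEMORY_ONLY)
    next.foreachPartition(_ => ())         // materialize this iteration's result
    current.unpersist(blocking = false)    // uncache data that won't be referenced again
    current = next
  }
  current
}
```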
-
Joseph E. Gonzalez authored
-
Patrick Wendell authored
Improvements to external sorting:
1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Uses Spark's buffer size for reads in addition to writes.
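A hedged sketch of what opting into compressed spill outputs might look like; the config keys below (e.g. `spark.shuffle.spill.compress`) are assumptions based on the compression option and renamed settings this commit mentions, not names confirmed by the message itself:

```scala
import org.apache.spark.SparkConf

// Assumed configuration keys for external sorting / spilling; treat the names
// as illustrative rather than authoritative.
val conf = new SparkConf()
  .setAppName("external-sort-example")
  .set("spark.shuffle.spill", "true")           // allow spilling to disk when in-memory maps grow too large
  .set("spark.shuffle.spill.compress", "true")  // assumed key enabling compression of spilled outputs
```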
-
Joseph E. Gonzalez authored
-
Patrick Wendell authored
Adjusted visibility of various components and documentation for 0.9.0 release.
-
Patrick Wendell authored
External sorting - Add the number of bytes spilled to the Web UI. Additionally, update the test suite for external sorting to induce spilling.
-
Ankur Dave authored
-
Ankur Dave authored
-
Patrick Wendell authored
Automatically unpersisting RDDs that have been cleaned up from DStreams

Earlier, RDDs generated by DStreams were forgotten but not unpersisted; the system relied on the natural BlockManager LRU to drop the data. The cleaner.ttl was a hammer to clean up RDDs, but it needs to be set separately and very conservatively (at best, a few minutes). This automatic unpersisting lets the system handle cleanup on its own, which reduces memory usage. As a side effect it will also improve GC performance, since there are fewer objects stored in memory. In fact, for some workloads, it may allow RDDs to be cached as deserialized objects, which speeds up processing without too much GC overhead.

This is disabled by default. To enable it, set the configuration spark.streaming.unpersist to true. In a future release, this will be set to true by default.

Also reduced the sleep time in TaskSchedulerImpl.stop() from 5 seconds to 1 second. From my conversation with Matei, there does not seem to be any good reason for the sleep (which lets messages be sent out) to be so long.
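A minimal sketch of enabling the new behavior; the config key comes straight from the message above, while the app name and batch interval are illustrative:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("streaming-unpersist-example")
  .set("spark.streaming.unpersist", "true")   // opt in; disabled by default in this release

// RDDs generated by DStreams on this context are unpersisted once no longer
// needed, instead of waiting for the BlockManager LRU or cleaner.ttl.
val ssc = new StreamingContext(conf, Seconds(1))
```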
-
Ankur Dave authored
-
Ankur Dave authored
-
- Jan 13, 2014
-
-
Ankur Dave authored
-
Ankur Dave authored
-
Ankur Dave authored
-
Ankur Dave authored
-
Ankur Dave authored
-
Ankur Dave authored
The loop occurred when numEdges < numVertices. This commit fixes it by allowing generateRandomEdges to generate a multigraph.
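A hedged sketch of the idea behind the fix: sample destination vertices with replacement, so duplicate edges (a multigraph) are allowed and the generator never loops waiting for enough distinct edges; the function name and signature below are simplified stand-ins, not the actual GraphX utility:

```scala
import scala.util.Random
import org.apache.spark.graphx.Edge

// Illustrative stand-in for generateRandomEdges: sampling with replacement can
// repeat (src, dst) pairs, so numEdges edges can always be produced even when
// numEdges exceeds the number of distinct candidate vertices.
def randomEdgesFrom(src: Int, numEdges: Int, maxVertexId: Int, seed: Long = 42L): Array[Edge[Int]] = {
  val rand = new Random(seed)
  Array.fill(numEdges)(Edge(src, rand.nextInt(maxVertexId), 1))
}
```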
-
Ankur Dave authored
-
Ankur Dave authored
-
Andrew Or authored
-
Matei Zaharia authored
-
Patrick Wendell authored
Add default value for HadoopRDD's `cloneRecords` constructor arg. A small mend to https://github.com/apache/incubator-spark/pull/359/files#diff-1 for backwards compatibility.
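A hedged illustration of the backwards-compatibility idea: give the new constructor argument a default so existing call sites keep compiling. The class and signature below are simplified stand-ins, not HadoopRDD's real constructor:

```scala
// Simplified stand-in for the pattern: the new cloneRecords flag gets a default
// value, so callers written before the argument existed still compile unchanged.
class HadoopRDDLike(
    path: String,
    minSplits: Int,
    cloneRecords: Boolean = true) {
  // ...
}

// Pre-existing call site, unaware of cloneRecords, still works:
val rdd = new HadoopRDDLike("hdfs://namenode/data/input", 2)
```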
-
Reynold Xin authored
-
Patrick Wendell authored
Improved logic of finding new files in FileInputDStream

Earlier, if HDFS had a hiccup and reported the existence of a new file (mod time T sec) at time T + 1 sec, fileStream could have missed that file. With this change, it should be able to find files that are delayed by up to <batch size> seconds. That is, even if a file is reported at T + <batch time> sec, the file stream should be able to catch it.

The new logic, at a high level, is as follows. It keeps track of the new files it found in the previous interval and the mod time of the oldest of those files (let's call it X). Then, in the current interval, it ignores files that were seen in the previous interval and files whose mod time is older than X. So if a new file gets reported by HDFS in the current interval but has a mod time in the previous interval, it will be considered. However, if its mod time is earlier than the previous interval (that is, earlier than X), it will be ignored. This is a current limitation, and future versions will improve this behavior further.

Also reduced line lengths in DStream to <= 100 chars.
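A minimal sketch of the filtering rule described above, assuming we track the set of files found in the previous interval and the mod time of the oldest of them (X); the helper below is illustrative, not FileInputDStream's actual code:

```scala
// Decide whether a file reported by HDFS in the current interval should be
// picked up, given what was seen in the previous interval.
def shouldAccept(
    path: String,
    modTime: Long,
    oldestPrevModTime: Long,            // X: mod time of the oldest file found last interval
    filesFoundLastInterval: Set[String]): Boolean = {
  if (filesFoundLastInterval.contains(path)) false   // already picked up last interval
  else if (modTime < oldestPrevModTime) false        // older than X: assume it was handled long ago
  else true                                          // delayed report, still within reach: accept it
}
```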
-
Harvey authored
Add default value for HadoopRDD's `cloneRecords` constructor arg, to maintain backwards compatibility.
-
Joseph E. Gonzalez authored
-
Patrick Wendell authored
-
Reynold Xin authored
-
Reynold Xin authored
-
Reynold Xin authored
-
Reynold Xin authored
-
Joseph E. Gonzalez authored
-
Joseph E. Gonzalez authored
-
Reynold Xin authored
-
Reynold Xin authored
-
Reynold Xin authored
-
Reynold Xin authored
Conflicts: graphx/src/main/scala/org/apache/spark/graphx/Pregel.scala
-
Reynold Xin authored
-