Commits · 139c24ef08e6ffb090975c9808a2cba304eb79e0 · cs525-sp18-g07 / spark

Jan 15, 2014
- Merge pull request #435 from tdas/filestream-fix · 139c24ef
  Patrick Wendell authored 11 years ago
  
  Fixed the flaky tests by making SparkConf not serializable

  SparkConf was being serialized with CoGroupedRDD and Aggregator, which somehow caused OptionalJavaException while being deserialized as part of a ShuffleMapTask. SparkConf should not even be serializable (according to conversation with Matei). This change fixes that. @mateiz @pwendell
  139c24ef
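
The idea behind #435 can be illustrated outside Spark. What follows is a minimal Python sketch, not Spark's code: `DriverConf` and `task_settings` are hypothetical names, and Python's `pickle` stands in for Java serialization. A driver-side settings object refuses to be serialized, so accidentally capturing it in task state fails immediately rather than surfacing later as an obscure deserialization error. Tasks receive a plain snapshot of the values they need instead.

```python
import pickle


class DriverConf:
    """Hypothetical driver-side settings holder (a stand-in, not Spark's SparkConf)."""

    def __init__(self, settings):
        self._settings = dict(settings)

    def get(self, key, default=None):
        return self._settings.get(key, default)

    def __reduce_ex__(self, protocol):
        # pickle calls this hook to serialize an instance; raising here
        # makes any attempt to ship the object fail loudly and early.
        raise pickle.PicklingError("DriverConf must not be serialized")


def task_settings(conf, keys):
    """Copy only the needed values into a plain dict that is safe to ship."""
    return {key: conf.get(key) for key in keys}


# The keys below are illustrative settings for this sketch.
conf = DriverConf({"spark.shuffle.spill": "true", "spark.app.name": "demo"})

# Shipping a snapshot of plain values works.
shipped = pickle.loads(pickle.dumps(task_settings(conf, ["spark.shuffle.spill"])))

# Shipping the settings object itself is rejected.
try:
    pickle.dumps(conf)
    rejected = False
except pickle.PicklingError:
    rejected = True
```

The design choice mirrored here is to fail at serialization time on the driver, where the cause is obvious, instead of at deserialization time inside a task.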
- Merge pull request #434 from rxin/graphxmaven · 087487e9
  Patrick Wendell authored 11 years ago
  
  Fixed SVDPlusPlusSuite in Maven build. This should go into 0.9.0 also.
  087487e9
- Merge remote-tracking branch 'apache/master' into filestream-fix · 0e15bd78
  Tathagata Das authored 11 years ago
  
  0e15bd78
- Changed SparkConf to not be serializable. And also fixed unit-test log paths... · 1f4718c4
  Tathagata Das authored 11 years ago
  
  Changed SparkConf to not be serializable. And also fixed unit-test log paths in log4j.properties of external modules.
  1f4718c4
- Fixed SVDPlusPlusSuite in Maven build. · dfb15244
  Reynold Xin authored 11 years ago
  
  dfb15244
Jan 14, 2014

Merge pull request #424 from jegonzal/GraphXProgrammingGuide · 3a386e23

Reynold Xin authored 11 years ago

Additional edits for clarity in the graphx programming guide.

Added an overview of the Graph and GraphOps functions and fixed numerous typos.

3a386e23

Merge pull request #431 from ankurdave/graphx-caching-doc · ad294db3
Reynold Xin authored 11 years ago
```
Describe caching and uncaching in GraphX programming guide
```
ad294db3
Describe GraphX caching and uncaching in guide · 1210ec29
Ankur Dave authored 11 years ago

1210ec29
Merge pull request #428 from pwendell/writeable-objects · 74b46acd
Reynold Xin authored 11 years ago
```
Don't clone records for text files
```
74b46acd
Merge pull request #429 from ankurdave/graphx-examples-pom.xml · 193a0757
Reynold Xin authored 11 years ago
```
Add GraphX dependency to examples/pom.xml
```
193a0757
Merge pull request #427 from pwendell/deprecate-aggregator · d601a76d
Reynold Xin authored 11 years ago
```
Deprecate rather than remove old combineValuesByKey function
```
d601a76d
Add GraphX dependency to examples/pom.xml · 8ea056d7
Ankur Dave authored 11 years ago

8ea056d7
Style fix · b1b22b7a
Patrick Wendell authored 11 years ago

b1b22b7a
Adding fix covering combineCombinersByKey as well · 8ea2cd56
Patrick Wendell authored 11 years ago

8ea2cd56

Merge pull request #425 from rxin/scaladoc · 2ce23a55

Reynold Xin authored 11 years ago

API doc update & make Broadcast public

In #413, Broadcast was mistakenly made private[spark]; this change makes it public again. It also exposes id publicly, because the R frontend requires it.

Copied some of the documentation from the programming guide to API Doc for Broadcast and Accumulator.

This should be cherry picked into branch-0.9 as well for 0.9.0 release.

2ce23a55

Deprecate rather than remove old combineValuesByKey function · b683608c
Patrick Wendell authored 11 years ago

b683608c
Don't clone records for text files · 6f965a46
Patrick Wendell authored 11 years ago

6f965a46
Fixed a typo in JavaSparkContext's API doc. · f12e506c
Reynold Xin authored 11 years ago

f12e506c
Maintain Serializable API compatibility by reverting back to... · 1b5623fd
Reynold Xin authored 11 years ago
```
Maintain Serializable API compatibility by reverting back to java.io.Serializable for Broadcast and Accumulator.
```
1b5623fd
Added license header for package.scala in the Java API package. · 55db7741
Reynold Xin authored 11 years ago

55db7741
Added package doc for the Java API. · f8c12e94
Reynold Xin authored 11 years ago

f8c12e94
Updated API doc for Accumulable and Accumulator. · 6a12b9eb
Reynold Xin authored 11 years ago

6a12b9eb

Broadcast variable visibility change & doc update. · 71b3007d

Reynold Xin authored 11 years ago

Note that the Broadcast class was previously marked as private[spark] by accident. It needs to be public for broadcast variables to work. Also exposing the broadcast variable id.

71b3007d

Additional edits for clarity in the graphx programming guide. · 0bba7738
Joseph E. Gonzalez authored 11 years ago

0bba7738

Merge pull request #423 from jegonzal/GraphXProgrammingGuide · 3fcc68bf

Reynold Xin authored 11 years ago

Improving the graphx-programming-guide

This PR will track a few minor improvements to the content and formatting of the graphx-programming-guide.

3fcc68bf

Improving the graphx-programming-guide. · 486f37c5
Joseph E. Gonzalez authored 11 years ago

486f37c5
Merge pull request #420 from pwendell/header-files · fa75e5e1
Patrick Wendell authored 11 years ago
```
Add missing header files
```
fa75e5e1
Add missing header files · 23034798
Patrick Wendell authored 11 years ago

23034798

Merge pull request #416 from tdas/filestream-fix · 980250b1

Patrick Wendell authored 11 years ago

Removed unnecessary DStream operations and updated docs

Removed StreamingContext.registerInputStream and registerOutputStream, which were useless: InputDStream now registers itself, and merely registering a DStream as an output stream causes RDD objects to be created without ever being computed. Also made DStream.register() private[streaming] for the same reasons.

Updated docs; in particular, added package documentation for the streaming package.

Also changed NetworkWordCount's input storage level to MEMORY_ONLY: replication on the local machine fails and produces warning messages, which is scary for a new user trying out their first example.

980250b1

Fixed loose ends in docs. · f8bd828c
Tathagata Das authored 11 years ago

f8bd828c
Merge remote-tracking branch 'apache/master' into filestream-fix · f8e239e0
Tathagata Das authored 11 years ago
```
Conflicts:
	streaming/src/main/scala/org/apache/spark/streaming/dstream/DStream.scala
```
f8e239e0
Merge pull request #415 from pwendell/shuffle-compress · 055be5c6
Patrick Wendell authored 11 years ago
```
Enable compression by default for spills
```
055be5c6
Enable compression by default for spills · 0984647a
Patrick Wendell authored 11 years ago

0984647a

Removed StreamingContext.registerInputStream and registerOutputStream - they... · 4e497db8

Tathagata Das authored 11 years ago

Removed StreamingContext.registerInputStream and registerOutputStream - they were useless as InputDStream has been made to register itself. Also made DStream.register() private[streaming] - not useful to expose the confusing function. Updated a lot of documentation.

4e497db8

Merge pull request #380 from mateiz/py-bayes · fdaabdc6

Patrick Wendell authored 11 years ago

Add Naive Bayes to Python MLlib, and some API fixes

- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)

fdaabdc6

Merge pull request #367 from ankurdave/graphx · 4a805aff

Patrick Wendell authored 11 years ago

GraphX: Unifying Graphs and Tables

GraphX extends Spark's distributed fault-tolerant collections API and interactive console with a new graph API which leverages recent advances in graph systems (e.g., [GraphLab](http://graphlab.org)) to enable users to easily and interactively build, transform, and reason about graph structured data at scale. See http://amplab.github.io/graphx/.

Thanks to @jegonzal, @rxin, @ankurdave, @dcrankshaw, @jianpingjwang, @amatsukawa, @kellrott, and @adamnovak.

Tasks left:
- [x] Graph-level uncache
- [x] Uncache previous iterations in Pregel
- [x] ~~Uncache previous iterations in GraphLab~~ (postponed to post-release)
  - [x] Describe GC issue with GraphLab
- [ ] Write `docs/graphx-programming-guide.md`
  - [x] Mention future Bagel support in docs
  - [ ] Section on caching/uncaching in docs: As with Spark, cache something that is used more than once. In an iterative algorithm, try to cache and force (i.e., materialize) something every iteration, then uncache the cached things that depended on the newly materialized RDD but that won't be referenced again.
- [x] Undo modifications to core collections and instead copy them to org.apache.spark.graphx
- [x] Make Graph serializable to work around capture in Spark shell
- [x] Rename graph -> graphx in package name and subproject
- [x] Remove standalone PageRank
- [x] ~~Fix amplab/graphx#52 by checking `iter.hasNext`~~

4a805aff
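
The caching advice in the task list above (cache and force something every iteration, then uncache what will not be referenced again) can be sketched in plain Python. This is an illustration under stated assumptions, not GraphX or Spark code: `CachedDataset`, `cache_and_materialize`, `uncache`, and `iterate` are hypothetical names, and a list comprehension stands in for a distributed transformation.

```python
class CachedDataset:
    """Hypothetical stand-in for an RDD-like dataset with explicit caching."""

    live = set()  # names of datasets currently held in the cache

    def __init__(self, name, compute):
        self.name = name
        self._compute = compute  # lineage: how to rebuild the data
        self._data = None        # cached result, if any

    def cache_and_materialize(self):
        # Cache and force the computation right away.
        if self._data is None:
            self._data = self._compute()
            CachedDataset.live.add(self.name)
        return self

    def collect(self):
        # Use the cached copy if present, otherwise recompute from lineage.
        return self._data if self._data is not None else self._compute()

    def uncache(self):
        self._data = None
        CachedDataset.live.discard(self.name)


def iterate(initial, step, num_iters):
    current = CachedDataset("iter-0", lambda: list(initial)).cache_and_materialize()
    for i in range(1, num_iters + 1):
        prev = current
        current = CachedDataset(
            f"iter-{i}", lambda p=prev: [step(x) for x in p.collect()]
        )
        # Materialize the new dataset while its parent is still cached...
        current.cache_and_materialize()
        # ...then drop the parent, which will not be referenced again.
        prev.uncache()
    return current


result = iterate([1, 2, 3], lambda x: x * 2, 3)
print(result.collect())           # [8, 16, 24]
print(sorted(CachedDataset.live))  # ['iter-3']
```

The ordering matters: uncaching the parent before the child is materialized would force the child to recompute the parent from its lineage.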

Adding minimal additional functionality to EdgeRDD · 80e73ed0
Joseph E. Gonzalez authored 11 years ago

80e73ed0

Merge pull request #408 from pwendell/external-serializers · 945fe7a3

Patrick Wendell authored 11 years ago

Improvements to external sorting

1. Adds the option of compressing outputs.
2. Adds batching to the serialization to prevent OOM on the read side.
3. Slight renaming of config options.
4. Use Spark's buffer size for reads in addition to writes.

945fe7a3
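
Items 1 and 2 of #408 above (optional compression of outputs, and batched serialization so the reader never holds a whole spill in memory) can be sketched as follows. This is a rough Python illustration, not Spark's external sorting code: `write_spill`, `read_spill`, `BATCH_SIZE`, and the length-prefixed framing are assumptions made for the example, with `pickle` and `zlib` standing in for Spark's serializer and compression codec.

```python
import io
import pickle
import struct
import zlib

BATCH_SIZE = 3  # hypothetical batch size, kept tiny for the demo


def write_spill(records, out, batch_size=BATCH_SIZE, compress=True):
    """Write records as length-prefixed batches, optionally compressed."""
    batch = []

    def flush():
        if not batch:
            return
        payload = pickle.dumps(batch)
        if compress:
            payload = zlib.compress(payload)
        out.write(struct.pack(">I", len(payload)))  # 4-byte big-endian length
        out.write(payload)
        batch.clear()

    for record in records:
        batch.append(record)
        if len(batch) >= batch_size:
            flush()
    flush()  # write the final partial batch


def read_spill(inp, compressed=True):
    """Yield records one batch at a time, so memory use is bounded by a batch."""
    while True:
        header = inp.read(4)
        if not header:
            return
        (length,) = struct.unpack(">I", header)
        payload = inp.read(length)
        if compressed:
            payload = zlib.decompress(payload)
        yield from pickle.loads(payload)


buf = io.BytesIO()
write_spill(range(10), buf)
buf.seek(0)
restored = list(read_spill(buf))
```

Because each batch is framed and deserialized independently, the reader's peak memory depends on the batch size, not on the size of the spill file.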

adding documentation about EdgeRDD · 4bafc4f4
Joseph E. Gonzalez authored 11 years ago

4bafc4f4
Merge pull request #413 from rxin/scaladoc · 68641bce
Patrick Wendell authored 11 years ago
```
Adjusted visibility of various components and documentation for 0.9.0 release.
```
68641bce