  Oct 10, 2013
    • Merge remote-tracking branch 'tgravescs/sparkYarnDistCache' · 8f11c36f
      Matei Zaharia authored
      Closes #11
      
      Conflicts:
      	docs/running-on-yarn.md
      	yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientArguments.scala
    • Merge pull request #19 from aarondav/master-zk · c71499b7
      Matei Zaharia authored
      Standalone Scheduler fault tolerance using ZooKeeper
      
      This patch implements full distributed fault tolerance for standalone scheduler Masters.
      There is only one master Leader at a time, which is actively serving scheduling
      requests. If this Leader crashes, another master will eventually be elected, reconstruct
      the state from the first Master, and continue serving scheduling requests.
      
      Leader election is performed using the ZooKeeper leader election pattern. We try to minimize
      the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of
      retries and session monitoring on top of the ZooKeeper client.
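
      As an illustrative sketch of the generic ZooKeeper leader-election recipe only
      (not the code in this patch, which adds retries and session monitoring on top;
      the znode path below is hypothetical), each Master creates an ephemeral
      sequential znode and the lowest sequence number acts as Leader:
      ```
      import scala.collection.JavaConverters._
      import org.apache.zookeeper.{CreateMode, KeeperException, WatchedEvent, Watcher, ZooDefs, ZooKeeper}

      object LeaderElectionSketch {
        // Hypothetical znode path, used only for this sketch.
        val electionPath = "/spark-leader-election"

        // Returns true if this Master currently holds leadership.
        def runFor(zkUrl: String): Boolean = {
          val zk = new ZooKeeper(zkUrl, 30000, new Watcher {
            override def process(event: WatchedEvent): Unit = ()  // session events ignored in this sketch
          })

          // Make sure the persistent parent node exists; ignore the race where
          // another Master creates it first.
          try {
            zk.create(electionPath, Array.empty[Byte],
              ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT)
          } catch {
            case _: KeeperException.NodeExistsException => ()
          }

          // Each candidate Master creates an ephemeral sequential znode; it vanishes
          // automatically if this Master's ZooKeeper session dies.
          val myNode = zk.create(electionPath + "/candidate_", Array.empty[Byte],
            ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL_SEQUENTIAL)

          // The candidate with the lowest sequence number is the Leader; the others
          // would watch the election path and re-run this check on changes.
          val children = zk.getChildren(electionPath, false).asScala.sorted
          myNode.endsWith(children.head)
        }
      }
      ```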
      
      Master failover follows directly from the single-node Master recovery via the file
      system (patch d5a96fec), save that the Master state is stored in ZooKeeper instead.
      
      Configuration:
      By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE).
      By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url
      to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled.
      By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory
      to an appropriate directory accessible by the Master, we keep the file-system recovery behavior from d5a96fec.
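
      As an illustrative sketch only (Spark in this era reads configuration from Java
      system properties; the ZooKeeper hosts and directory below are placeholders, not
      values from this patch), the Master's JVM could be configured like this:
      ```
      // Hedged sketch: set in the Master's JVM before it starts.
      System.setProperty("spark.deploy.recoveryMode", "ZOOKEEPER")
      System.setProperty("spark.deploy.zookeeper.url", "zk1:2181,zk2:2181,zk3:2181")  // placeholder hosts

      // Or, to keep the single-node file-system recovery from d5a96fec:
      // System.setProperty("spark.deploy.recoveryMode", "FILESYSTEM")
      // System.setProperty("spark.deploy.recoveryDirectory", "/some/dir/visible/to/the/master")  // placeholder path
      ```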
      
      Additionally, places where a Master could be specified by a spark:// URL can now take
      comma-delimited lists to specify backup masters. Note that this is only used for registration
      of NEW Workers and application Clients. Once a Worker or Client has registered with the
      Master Leader, it is "in the system" and will never need to register again.
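
      For illustration only (host names and ports are placeholders, not from this
      patch), an application Client could pass both Masters when creating its context:
      ```
      import org.apache.spark.SparkContext

      // Hedged sketch: a comma-delimited Master list lets a new Client register with
      // whichever Master is currently the Leader (host1/host2 are placeholders).
      val sc = new SparkContext("spark://host1:7077,host2:7077", "FaultTolerantApp")
      ```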
    • Merge pull request #44 from mateiz/fast-map · cd08f734
      Matei Zaharia authored
      A fast and low-memory append-only map for shuffle operations
      
      This is a continuation of the old repo's pull request https://github.com/mesos/spark/pull/823 to add a more efficient hashmap class for shuffles. I've optimized and tested this more thoroughly now so I think it's good to go. I've also addressed some of the comments that were outstanding there.
      
      The idea is to reduce the cost of shuffles by taking advantage of the properties their hashmaps need. In particular, those hashmaps are append-only, and a common operation is updating a key's value based on the old value. The included AppendOnlyMap class uses open addressing, so it takes less space than java.util.HashMap (there is no linked list per bucket), does not support deletes, and has a changeValue operation that updates a key in place without traversing the hash chain twice. In micro-benchmarks against java.util.HashMap and scala.collection.mutable.HashMap, it is 20-30% smaller and 10-40% faster depending on the number and type of keys. It's also noticeably faster than fastutil's Object2ObjectOpenHashMap.
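
      As a rough illustration of the technique (this is not the AppendOnlyMap source;
      the class name and details below are invented for the sketch), an append-only,
      open-addressed map with an in-place changeValue might look like:
      ```
      // Illustrative sketch only: keys live in even slots and values in odd slots of
      // one flat array, collisions are resolved by linear probing, there is no remove,
      // and changeValue updates a key with a single probe sequence instead of a get
      // followed by a put.
      class SketchAppendOnlyMap[K, V](initialCapacity: Int = 64) {
        private var capacity = initialCapacity            // current table capacity
        private var data = new Array[AnyRef](2 * capacity)
        private var curSize = 0

        /** Update the value for `key`; `updateFunc` receives (hadValue, oldValue). */
        def changeValue(key: K, updateFunc: (Boolean, V) => V): V = {
          val k = key.asInstanceOf[AnyRef]
          var pos = (k.hashCode & 0x7fffffff) % capacity
          while (true) {
            val curKey = data(2 * pos)
            if (curKey == null) {                         // empty slot: insert new entry
              val newValue = updateFunc(false, null.asInstanceOf[V])
              data(2 * pos) = k
              data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
              curSize += 1
              if (curSize > 0.7 * capacity) grow()
              return newValue
            } else if (curKey == k) {                     // hit: update the value in place
              val newValue = updateFunc(true, data(2 * pos + 1).asInstanceOf[V])
              data(2 * pos + 1) = newValue.asInstanceOf[AnyRef]
              return newValue
            } else {
              pos = (pos + 1) % capacity                  // probe the next slot
            }
          }
          throw new IllegalStateException("unreachable")
        }

        private def grow(): Unit = {                      // double the table and re-insert
          val oldData = data
          val oldCapacity = capacity
          capacity *= 2
          data = new Array[AnyRef](2 * capacity)
          curSize = 0
          var i = 0
          while (i < oldCapacity) {
            if (oldData(2 * i) != null) {
              changeValue(oldData(2 * i).asInstanceOf[K],
                (_, _) => oldData(2 * i + 1).asInstanceOf[V])
            }
            i += 1
          }
        }
      }
      ```
      A reduceByKey-style sum then becomes map.changeValue(key, (had, old) => if (had) old + value else value), probing the table only once per update.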
      
      I've also tested this in Spark apps now. While the speed gain is modest (partly due to other overheads, like serialization), there is some, and I think the lower memory usage is worth it. Here's one example where the speedup is most noticeable, in spark-shell on local mode:
      ```
      scala> val nums = sc.parallelize(1 to 8).flatMap(x => (1 to 5e6.toInt)).cache
      
      scala> nums.count
      
      scala> def time(x: => Unit) = { val now = System.currentTimeMillis; x; System.currentTimeMillis - now }
      
      scala> (1 to 8).map(_ => time(nums.map(x => (x % 100000, x)).reduceByKey(_ + _).count) / 1000.0)
      ```
      
      This prints the following times before and after this change:
      ```
      Before: Vector(4.368, 2.635, 2.549, 2.522, 2.233, 2.222, 2.214, 2.195)
      
      After: Vector(3.588, 1.741, 1.706, 1.648, 1.777, 1.81, 1.776, 1.731)
      ```
      
      I've also run the spark-perf suite, enhanced with some tests that use Ints (https://github.com/amplab/spark-perf/pull/9), and it shows some speedup on those, but less on the string ones (presumably due to existing overhead): https://gist.github.com/mateiz/6897121.
    • Merge branch 'master' into fast-map · 001d13f7
      Matei Zaharia authored
      Conflicts:
      	core/src/test/scala/org/apache/spark/scheduler/SparkListenerSuite.scala
    • Address Matei's comments on documentation · 42d8b8ef
      Aaron Davidson authored
      Updates the documentation and changes some logError()s to logWarning()s.