Commit 3218fa79 authored by Matei Zaharia

Merge pull request #4 from MLnick/implicit-als

Adding algorithm for implicit feedback data to ALS

This PR adds the commonly used "implicit feedback" variant to ALS.

The implementation is based in part on Mahout's implementation, which is in turn based on [Collaborative Filtering for Implicit Feedback Datasets](http://research.yahoo.com/pub/2433). It has been adapted for the blocked approach used in MLlib.

I have tested this implementation against the MovieLens 100k, 1m and 10m datasets and confirmed that it produces essentially the same RMSE score as Mahout and as my own port of Mahout's implicit ALS implementation to Spark. RMSE is not necessarily the best metric to judge by for implicit feedback, but it provides a consistent metric for comparison.
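To make the metric caveat concrete, here is a minimal sketch in plain Python of two ways the error could be measured. `rmse` is the ordinary definition. `confidence_weighted_rmse` is only one assumed form of a confidence-weighted error (binary preference, confidence `1 + alpha * rating`, normalised by total confidence); it is not taken from Mahout or MLlib, and both function names are hypothetical.

```python
import math


def rmse(actual, predicted):
    """Ordinary RMSE between two equal-length sequences."""
    n = len(actual)
    return math.sqrt(sum((a - p) ** 2 for a, p in zip(actual, predicted)) / n)


def confidence_weighted_rmse(ratings, predicted, alpha=1.0):
    """Assumed form: error against a binary preference, weighted by confidence.

    preference = 1 if rating > 0 else 0
    confidence = 1 + alpha * rating   (the normalisation is an assumption)
    """
    weighted_sq_err = 0.0
    total_conf = 0.0
    for r, p in zip(ratings, predicted):
        pref = 1.0 if r > 0 else 0.0
        conf = 1.0 + alpha * r
        weighted_sq_err += conf * (pref - p) ** 2
        total_conf += conf
    return math.sqrt(weighted_sq_err / total_conf)
```

With ratings `[5, 3, 0]` and predictions `[1.0, 1.0, 0.0]`, the plain RMSE against the raw ratings is large even though every preference is predicted exactly, while the assumed confidence-weighted form is zero.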

It turned out to be more straightforward than I had thought to add this. The main additions are:
1. Added the `implicitPrefs` boolean flag and the `alpha` parameter.
2. Added the `computeYtY` method. In each least-squares step, the algorithm requires the computation of `YtY`, where `Y` is the user or item factor matrix. Since the factors are already block-distributed in an `RDD`, this is straightforward to compute, but it does add an extra operation compared with the explicit version (only twice per iteration).
3. Finally, the actual solve step in `updateBlock` boils down to:
    * a multiplication of the `XtX` matrix by `alpha * rating`
    * a multiplication of the `Xty` vector by `1 + alpha * rating`
    * when solving for the factor vector, the implicit variant adds the `YtY` matrix to the LHS
4. Added `trainImplicit` methods in the `ALS` object
5. Added test cases for both Scala and Java - based on achieving a confidence-weighted RMSE score < 0.4 (this is taken from Mahout's test cases)
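The solve step described in items 2 and 3 can be sketched for a single user in plain Python, without Spark or blocking. This is an illustration under stated assumptions, not the MLlib code: the function names `compute_yty`, `solve_linear` and `solve_implicit` are hypothetical, `lam` is added directly to the diagonal (the actual regularisation scaling may differ), and the dense Gaussian elimination stands in for whatever solver the implementation uses.

```python
def compute_yty(factors):
    """Sum of outer products y * y^T over all factor vectors (rank x rank)."""
    rank = len(factors[0])
    yty = [[0.0] * rank for _ in range(rank)]
    for y in factors:
        for i in range(rank):
            for j in range(rank):
                yty[i][j] += y[i] * y[j]
    return yty


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def solve_implicit(rated, yty, lam, alpha):
    """Factor vector for one user; rated is a list of (item_factor, rating)."""
    rank = len(yty)
    xtx = [[0.0] * rank for _ in range(rank)]
    xty = [0.0] * rank
    for y, r in rated:
        for i in range(rank):
            for j in range(rank):
                # each rating's outer product is scaled by alpha * rating
                xtx[i][j] += alpha * r * y[i] * y[j]
            # each rating's Xty contribution is scaled by 1 + alpha * rating
            xty[i] += (1.0 + alpha * r) * y[i]
    # the implicit variant adds YtY to the left-hand side
    # (assumption: lam goes straight onto the diagonal)
    lhs = [
        [xtx[i][j] + yty[i][j] + (lam if i == j else 0.0) for j in range(rank)]
        for i in range(rank)
    ]
    return solve_linear(lhs, xty)
```

For example, with two orthogonal item factors `[1.0, 0.0]` and `[0.0, 1.0]`, a user who rated only the first item with rating 2, `alpha=1.0` and `lam=0.0`, the sketch returns a factor vector that predicts preference 1 for the rated item and 0 for the other.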

It would be great to get some feedback on this and have people test things out against some datasets (MovieLens and others and perhaps proprietary datasets) both locally and on a cluster if possible. I have not yet tested on a cluster but will try to do that soon.

I have tried to make things as efficient as possible, but if there are potential improvements, let me know.

The results of a run against ml-1m are below (note the vanilla RMSE scores will be very different from the explicit variant):

**INPUTS**
```
iterations=10
factors=10
lambda=0.01
alpha=1
implicitPrefs=true
```

**RESULTS**

```
Spark MLlib 0.8.0-SNAPSHOT

RMSE = 3.1544
Time: 24.834 sec
```
```
My own port of Mahout's ALS to Spark (updated to 0.8.0-SNAPSHOT)

RMSE = 3.1543
Time: 58.708 sec
```
```
Mahout 0.8

time ./factorize-movielens-1M.sh /path/to/ratings/ml-1m/ratings.dat

real	3m48.648s
user	6m39.254s
sys	0m14.505s

RMSE = 3.1539
```

Results of a run against ml-10m:

```
Spark MLlib

RMSE = 3.1200
Time: 162.348 sec
```
```
Mahout 0.8

real	23m2.220s
user	43m39.185s
sys	0m25.316s

RMSE = 3.1187
```
parents e67d5b96 a5e58b8f