Commit 20900e5f authored by Nick Pentreath, committed by Joseph K. Bradley

[SPARK-15502][DOC][ML][PYSPARK] add guide note that ALS only supports integer ids

This PR adds a note to clarify that the ML API for ALS only supports integers for user/item ids, and that other types for these columns can be used but the ids must fall within integer range.

(See [SPARK-14891](https://issues.apache.org/jira/browse/SPARK-14891).)
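
To illustrate the constraint described above, here is a minimal sketch in plain Python of a pre-check that ids stored in a wider numeric type still fit in a 32-bit signed integer. The helper name `ids_outside_int_range` and the sample ids are hypothetical and are not part of Spark or of this patch. Rejecting fractional values is an extra precaution in this sketch, not something the note states.

```python
# Bounds of a 32-bit signed integer, the range the note refers to.
INT_MIN = -2**31
INT_MAX = 2**31 - 1


def ids_outside_int_range(ids):
    """Return the ids that could not be used as integer ids.

    Hypothetical helper for illustration only; it is not a Spark API.
    An id is flagged if it is out of the 32-bit range or (as an extra
    precaution assumed here) if it is not a whole number.
    """
    bad = []
    for value in ids:
        # Python ints are always whole; floats need an explicit check.
        is_whole = isinstance(value, int) or float(value).is_integer()
        if not is_whole or not (INT_MIN <= value <= INT_MAX):
            bad.append(value)
    return bad


# Sample ids: the last two fall just outside the integer range.
user_ids = [0, 42.0, 2147483647, 2147483648, -2147483649]
print(ids_outside_int_range(user_ids))
```

A check like this could be run on the user and item id columns before fitting, so that out-of-range ids are caught early instead of surfacing as an error during training.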

Also cleaned up a reference to `mllib` in the ML doc.

## How was this patch tested?
Built and viewed User Guide doc locally.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13278 from MLnick/SPARK-15502-als-int-id-doc-note.
parent be99a99f
@@ -29,6 +29,10 @@ following parameters:
*baseline* confidence in preference observations (defaults to 1.0).
* *nonnegative* specifies whether or not to use nonnegative constraints for least squares (defaults to `false`).
+**Note:** The DataFrame-based API for ALS currently only supports integers for user and item ids.
+Other numeric types are supported for the user and item id columns,
+but the ids must be within the integer value range.
### Explicit vs. implicit feedback
The standard approach to matrix factorization based collaborative filtering treats
@@ -36,7 +40,7 @@ the entries in the user-item matrix as *explicit* preferences given by the user
for example, users giving ratings to movies.
It is common in many real-world use cases to only have access to *implicit feedback* (e.g. views,
-clicks, purchases, likes, shares etc.). The approach used in `spark.mllib` to deal with such data is taken
+clicks, purchases, likes, shares etc.). The approach used in `spark.ml` to deal with such data is taken
from [Collaborative Filtering for Implicit Feedback Datasets](http://dx.doi.org/10.1109/ICDM.2008.22).
Essentially, instead of trying to model the matrix of ratings directly, this approach treats the data
as numbers representing the *strength* in observations of user actions (such as the number of clicks,