Skip to content
Snippets Groups Projects
Commit d12d2ad7 authored by Xiangrui Meng's avatar Xiangrui Meng
Browse files

[SPARK-5879][MLLIB] update PIC user guide and add a Java example

Updated PIC user guide to reflect API changes and added a simple Java example. The API is still not very Java-friendly. I created SPARK-5990 for this issue.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4680 from mengxr/SPARK-5897 and squashes the following commits:

847d216 [Xiangrui Meng] apache header
87719a2 [Xiangrui Meng] remove PIC image
2dd921f [Xiangrui Meng] update PIC user guide and add a Java example
parent aa8f10e8
No related branches found
No related tags found
No related merge requests found
docs/img/PIClusteringFiveCirclesInputsAndOutputs.png

243 KiB

......@@ -270,23 +270,92 @@ for i in range(2):
## Power iteration clustering (PIC)
Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering points given pointwise mutual affinity values. Internally the algorithm:
Power iteration clustering (PIC) is a scalable and efficient algorithm for clustering vertices of a
graph given pairwise similarties as edge properties,
described in [Lin and Cohen, Power Iteration Clustering](http://www.icml2010.org/papers/387.pdf).
It computes a pseudo-eigenvector of the normalized affinity matrix of the graph via
[power iteration](http://en.wikipedia.org/wiki/Power_iteration) and uses it to cluster vertices.
MLlib includes an implementation of PIC using GraphX as its backend.
It takes an `RDD` of `(srcId, dstId, similarity)` tuples and outputs a model with the clustering assignments.
The similarities must be nonnegative.
PIC assumes that the similarity measure is symmetric.
A pair `(srcId, dstId)` regardless of the ordering should appear at most once in the input data.
If a pair is missing from input, their similarity is treated as zero.
MLlib's PIC implementation takes the following (hyper-)parameters:
* `k`: number of clusters
* `maxIterations`: maximum number of power iterations
* `initializationMode`: initialization model. This can be either "random", which is the default,
to use a random vector as vertex properties, or "degree" to use normalized sum similarities.
* accepts a [Graph](api/graphx/index.html#org.apache.spark.graphx.Graph) that represents a normalized pairwise affinity between all input points.
* calculates the principal eigenvalue and eigenvector
* Clusters each of the input points according to their principal eigenvector component value
**Examples**
In the following, we show code snippets to demonstrate how to use PIC in MLlib.
<div class="codetabs">
<div data-lang="scala" markdown="1">
[`PowerIterationClustering`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClustering)
implements the PIC algorithm.
It takes an `RDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples representing the
affinity matrix.
Calling `PowerIterationClustering.run` returns a
[`PowerIterationClusteringModel`](api/scala/index.html#org.apache.spark.mllib.clustering.PowerIterationClusteringModel),
which contains the computed clustering assignments.
Details of this algorithm are found within [Power Iteration Clustering, Lin and Cohen]{www.icml2010.org/papers/387.pdf}
{% highlight scala %}
import org.apache.spark.mllib.clustering.PowerIterationClustering
import org.apache.spark.mllib.linalg.Vectors
Example outputs for a dataset inspired by the paper - but with five clusters instead of three- have he following output from our implementation:
val similarities: RDD[(Long, Long, Double)] = ...
val pic = new PowerIteartionClustering()
.setK(3)
.setMaxIterations(20)
val model = pic.run(similarities)
model.assignments.foreach { case (vertexId, clusterId) =>
println(s"$vertexId -> $clusterId")
}
{% endhighlight %}
A full example that produces the experiment described in the PIC paper can be found under
[`examples/`](https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/mllib/PowerIterationClusteringExample.scala).
</div>
<p style="text-align: center;">
<img src="img/PIClusteringFiveCirclesInputsAndOutputs.png"
title="The Property Graph"
alt="The Property Graph"
width="50%" />
<!-- Images are downsized intentionally to improve quality on retina displays -->
</p>
<div data-lang="java" markdown="1">
[`PowerIterationClustering`](api/java/org/apache/spark/mllib/clustering/PowerIterationClustering.html)
implements the PIC algorithm.
It takes an `JavaRDD` of `(srcId: Long, dstId: Long, similarity: Double)` tuples representing the
affinity matrix.
Calling `PowerIterationClustering.run` returns a
[`PowerIterationClusteringModel`](api/java/org/apache/spark/mllib/clustering/PowerIterationClusteringModel.html)
which contains the computed clustering assignments.
{% highlight java %}
import scala.Tuple2;
import scala.Tuple3;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.mllib.clustering.PowerIterationClustering;
import org.apache.spark.mllib.clustering.PowerIterationClusteringModel;
JavaRDD<Tuple3<Long, Long, Double>> similarities = ...
PowerIterationClustering pic = new PowerIterationClustering()
.setK(2)
.setMaxIterations(10);
PowerIterationClusteringModel model = pic.run(similarities);
for (Tuple2<Object, Object> assignment: model.assignments().toJavaRDD().collect()) {
System.out.println(assignment._1() + " -> " + assignment._2());
}
{% endhighlight %}
</div>
</div>
## Latent Dirichlet allocation (LDA)
......
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.examples.mllib;
import scala.Tuple2;
import scala.Tuple3;
import com.google.common.collect.Lists;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.mllib.clustering.PowerIterationClustering;
import org.apache.spark.mllib.clustering.PowerIterationClusteringModel;
/**
* Java example for graph clustering using power iteration clustering (PIC).
*/
public class JavaPowerIterationClusteringExample {
public static void main(String[] args) {
SparkConf sparkConf = new SparkConf().setAppName("JavaPowerIterationClusteringExample");
JavaSparkContext sc = new JavaSparkContext(sparkConf);
@SuppressWarnings("unchecked")
JavaRDD<Tuple3<Long, Long, Double>> similarities = sc.parallelize(Lists.newArrayList(
new Tuple3<Long, Long, Double>(0L, 1L, 0.9),
new Tuple3<Long, Long, Double>(1L, 2L, 0.9),
new Tuple3<Long, Long, Double>(2L, 3L, 0.9),
new Tuple3<Long, Long, Double>(3L, 4L, 0.1),
new Tuple3<Long, Long, Double>(4L, 5L, 0.9)));
PowerIterationClustering pic = new PowerIterationClustering()
.setK(2)
.setMaxIterations(10);
PowerIterationClusteringModel model = pic.run(similarities);
for (Tuple2<Object, Object> assignment: model.assignments().toJavaRDD().collect()) {
System.out.println(assignment._1() + " -> " + assignment._2());
}
sc.stop();
}
}
......@@ -17,6 +17,7 @@
package org.apache.spark.mllib.clustering
import org.apache.spark.api.java.JavaRDD
import org.apache.spark.{Logging, SparkException}
import org.apache.spark.annotation.Experimental
import org.apache.spark.graphx._
......@@ -115,6 +116,14 @@ class PowerIterationClustering private[clustering] (
pic(w0)
}
/**
* A Java-friendly version of [[PowerIterationClustering.run]].
*/
def run(similarities: JavaRDD[(java.lang.Long, java.lang.Long, java.lang.Double)])
: PowerIterationClusteringModel = {
run(similarities.rdd.asInstanceOf[RDD[(Long, Long, Double)]])
}
/**
* Runs the PIC algorithm.
*
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment