
layout: global
title: Bagel Programming Guide

Bagel is a Spark implementation of Google's Pregel graph processing framework. Bagel currently supports basic graph computation, combiners, and aggregators.

In the Pregel programming model, jobs run as a sequence of iterations called supersteps. In each superstep, each vertex in the graph runs a user-specified function that can update state associated with the vertex and send messages to other vertices for use in the next iteration.

This guide shows the programming model and features of Bagel by walking through an example implementation of PageRank on Bagel.

Linking with Bagel

To write a Bagel application, you will need to add Spark, its dependencies, and Bagel to your CLASSPATH:

1. Run sbt/sbt update to fetch Spark's dependencies, if you haven't already done so.
2. Run sbt/sbt assembly to build Spark and its dependencies into one JAR (core/target/scala_2.8.1/Spark Core-assembly-0.3-SNAPSHOT.jar) and Bagel into a second JAR (bagel/target/scala_2.8.1/Bagel-assembly-0.3-SNAPSHOT.jar).
3. Add these two JARs to your CLASSPATH.

Programming Model

Bagel operates on a graph represented as a distributed dataset of (K, V) pairs, where keys are vertex IDs and values are vertices plus their associated state. In each superstep, Bagel runs a user-specified compute function on each vertex. The function takes as input the current vertex state and a list of messages sent to that vertex during the previous superstep, and returns the new vertex state and a list of outgoing messages.
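To make the superstep contract concrete, here is a minimal plain-Scala sketch that runs on in-memory collections instead of a distributed dataset. It is not Bagel's implementation or API: `Msg`, `superstep`, and the toy compute function are hypothetical names chosen for illustration, and vertex state is reduced to a single `Int`.

```scala
// Plain-Scala sketch of the superstep loop; no Spark or Bagel required.
// All names here (Msg, superstep, compute) are hypothetical.
object SuperstepSketch {
  // A message addressed to a vertex ID, carrying an Int payload.
  case class Msg(targetId: String, value: Int)

  // One superstep: run `compute` on every vertex with the messages addressed
  // to it, then collect the new states and all outgoing messages.
  def superstep(
      verts: Map[String, Int],
      msgs: Seq[Msg],
      compute: (String, Int, Seq[Msg]) => (Int, Seq[Msg])
  ): (Map[String, Int], Seq[Msg]) = {
    val inbox = msgs.groupBy(_.targetId)
    val results = verts.toSeq.map { case (id, state) =>
      val (newState, out) = compute(id, state, inbox.getOrElse(id, Seq.empty))
      (id, newState, out)
    }
    val newVerts = results.map { case (id, s, _) => id -> s }.toMap
    val outMsgs = results.flatMap { case (_, _, out) => out }
    (newVerts, outMsgs)
  }

  def main(args: Array[String]): Unit = {
    // Toy compute function: add incoming values to the state, then
    // forward the new state to a fixed neighbor.
    val next = Map("a" -> "b", "b" -> "a")
    val compute = (id: String, state: Int, in: Seq[Msg]) => {
      val newState = state + in.map(_.value).sum
      (newState, Seq(Msg(next(id), newState)))
    }
    // Messages produced in the first superstep are consumed in the second.
    val (v1, m1) = superstep(Map("a" -> 1, "b" -> 2), Seq.empty, compute)
    val (v2, _) = superstep(v1, m1, compute)
    println(v2)
  }
}
```

The property this sketch is meant to show is that messages emitted in one superstep only become visible to their target vertices in the next one.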

For example, we can use Bagel to implement PageRank. Here, vertices represent pages, edges represent links between pages, and messages represent shares of PageRank sent to the pages that a particular page links to.
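As a small illustration of what a "share of PageRank" might look like, the sketch below splits a page's rank evenly across its outgoing links. The even split is the usual PageRank convention and is assumed here rather than taken from the text above; `rankShares` is a hypothetical helper, not part of Bagel.

```scala
// Hypothetical helper illustrating PageRank shares; not part of Bagel.
object RankShareSketch {
  // Split a page's rank evenly across its outgoing links (assumed convention).
  // Returns (targetId, share) pairs; a page with no links sends nothing.
  def rankShares(rank: Double, outLinks: Seq[String]): Seq[(String, Double)] =
    if (outLinks.isEmpty) Seq.empty
    else outLinks.map(target => (target, rank / outLinks.size))

  def main(args: Array[String]): Unit = {
    // A page with rank 0.5 and two outgoing links sends 0.25 to each target.
    rankShares(0.5, Seq("b", "c")).foreach(println)
  }
}
```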

We first extend the default Vertex class to store a Double representing the current PageRank of the vertex, and similarly extend the Message and Edge classes. Note that these need to be marked @serializable to allow Spark to transfer them across machines. We also import the Bagel types and implicit conversions.

{% highlight scala %}
import spark.bagel._
import spark.bagel.Bagel._

@serializable class PREdge(val targetId: String) extends Edge

@serializable class PRVertex(
  val id: String, val rank: Double, val outEdges: Seq[Edge],
  val active: Boolean) extends Vertex

@serializable class PRMessage(
  val targetId: String, val rankShare: Double) extends Message
{% endhighlight %}

Next, we load a sample graph from a text file as a distributed dataset and package it into PRVertex objects. We also cache the distributed dataset because Bagel will use it multiple times and we'd like to avoid recomputing it.

{% highlight scala %}
val input = sc.textFile("pagerank_data.txt")

val numVerts = input.count()

val verts = input.map(line => {
  val fields = line.split('\t')
  val (id, linksStr) = (fields(0), fields(1))
  val links = linksStr.split(',').map(new PREdge(_))
  (id, new PRVertex(id, 1.0 / numVerts, links, true))
}).cache
{% endhighlight %}
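The loading code implies an input format of one vertex per line: a vertex ID, a tab, then a comma-separated list of link targets. The plain-Scala sketch below applies the same parsing steps to a few in-memory lines, without Spark. The sample lines and the `parseLine` helper are hypothetical, and link targets are kept as plain strings rather than PREdge objects.

```scala
// Plain-Scala sketch of the line parsing above; no Spark required.
// parseLine and the sample data are hypothetical.
object ParseLineSketch {
  // Parse "<id>\t<link1>,<link2>,..." into an ID and its link targets.
  // Like the loading code above, this assumes every line has both fields.
  def parseLine(line: String): (String, Seq[String]) = {
    val fields = line.split('\t')
    val (id, linksStr) = (fields(0), fields(1))
    (id, linksStr.split(',').toSeq)
  }

  def main(args: Array[String]): Unit = {
    val lines = Seq("a\tb,c", "b\ta", "c\ta,b")
    val numVerts = lines.size
    // Each vertex starts with an equal share of the total rank.
    lines.map(parseLine).foreach { case (id, links) =>
      println(s"$id rank=${1.0 / numVerts} links=${links.mkString(",")}")
    }
  }
}
```

A line with no tab would make `fields(1)` throw, so real input may need validation before this step.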

We run the Bagel job, passing in verts, an empty distributed dataset of messages, and a custom compute function that runs PageRank for 10 iterations.