Skip to content
Snippets Groups Projects
Commit cf57b1b0 authored by Joseph E. Gonzalez's avatar Joseph E. Gonzalez
Browse files

Correcting typos in documentation.

parent 64c45935
No related branches found
No related tags found
No related merge requests found
...@@ -19,11 +19,11 @@ title: GraphX Programming Guide ...@@ -19,11 +19,11 @@ title: GraphX Programming Guide
GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. At a high-level, GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. At a high-level,
GraphX extends the Spark [RDD](api/core/index.html#org.apache.spark.rdd.RDD) by introducing the GraphX extends the Spark [RDD](api/core/index.html#org.apache.spark.rdd.RDD) by introducing the
[Resilient Distributed property Graph (RDG)](#property_graph): a directed multigraph with properties [Resilient Distributed property Graph (RDG)](#property_graph): a directed multigraph with properties
attached to each vertex and edge. To support graph computation, GraphX exposes a set of functions attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental
(e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and operators (e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and
[mapReduceTriplets](#mrTriplets)) as well as an optimized variant of the [mapReduceTriplets](#mrTriplets)) as well as an optimized variant of the [Pregel](#pregel) API. In
[Pregel](#pregel) API. In addition, GraphX includes a growing collection of graph addition, GraphX includes a growing collection of graph [algorithms](#graph_algorithms) and
[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify graph analytics tasks. [builders](#graph_builders) to simplify graph analytics tasks.
## Background on Graph-Parallel Computation ## Background on Graph-Parallel Computation
...@@ -65,15 +65,13 @@ in graph-parallel systems, GraphX is able to optimize the execution of graph ope ...@@ -65,15 +65,13 @@ in graph-parallel systems, GraphX is able to optimize the execution of graph ope
## GraphX Replaces the Spark Bagel API ## GraphX Replaces the Spark Bagel API
Prior to the release of GraphX, graph computation in Spark was expressed using Prior to the release of GraphX, graph computation in Spark was expressed using Bagel, an
Bagel, an implementation of the Pregel API. GraphX improves upon Bagel by implementation of Pregel. GraphX improves upon Bagel by exposing a richer property graph API, a
exposing a richer property graph API, a more streamlined version of the Pregel more streamlined version of the Pregel abstraction, and system optimizations to improve performance
abstraction, and system optimizations to improve performance and reduce memory and reduce memory overhead. While we plan to eventually deprecate the Bagel, we will continue to
overhead. While we plan to eventually deprecate the Bagel, we will continue to support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and [Bagel programming
support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and guide](bagel-programming-guide.html). However, we encourage Bagel users to explore the new GraphX
[Bagel programming guide](bagel-programming-guide.html). However, we encourage API and comment on issues that may complicate the transition from Bagel.
Bagel users to explore the new GraphX API and comment on issues that may
complicate the transition from Bagel.
# Getting Started # Getting Started
...@@ -94,41 +92,55 @@ The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a d ...@@ -94,41 +92,55 @@ The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a d
graph with user defined objects attached to each vertex and edge. A directed multigraph is a graph with user defined objects attached to each vertex and edge. A directed multigraph is a
directed graph with potentially multiple parallel edges sharing the same source and destination directed graph with potentially multiple parallel edges sharing the same source and destination
vertex. The ability to support parallel edges simplifies modeling scenarios where there can be vertex. The ability to support parallel edges simplifies modeling scenarios where there can be
multiple relationships (e.g., co-worker and friend) between the same vertices. Note, however there multiple relationships (e.g., co-worker and friend) between the same vertices. Each vertex is keyed
can only be one instance of each vertex. by a *unique* 64-bit long identifier (`VertexId`). Similarly, edges have corresponding source and
destination vertex identifiers. GraphX does not impose any ordering or constraints on the vertex
Like RDDs, property graphs are immutable, distributed, and fault-tolerant. Vertices are keyed by identifiers. The property graph is parameterized over the vertex `VD` and edge `ED` types. These
their vertex identifier (`VertexId`) which is a unique 64-bit long. Similarly, edges have are the types of the objects associated with each vertex and edge respectively.
corresponding source and destination vertex identifiers. GraphX does not impose any ordering or
constraints on the vertex identifiers.
The property graph is parameterized over the vertex `VD` and edge `ED` types. These are the types
of the objects associated with each vertex and edge respectively. In some cases it can be desirable
to have vertices of different types. However, this can be accomplished through inheritance.
> GraphX optimizes the representation of `VD` and `ED` when they are plain old data-types (e.g., > GraphX optimizes the representation of `VD` and `ED` when they are plain old data-types (e.g.,
> int, double, etc...) reducing the in memory footprint. > int, double, etc...) reducing the in memory footprint.
In some cases we may wish to have vertices with different property types in the same graph. This can
be accomplished through inheritance. For example to model users and products as a bipartie graph we
might do the following:
{% highlight scala %}
case class VertexProperty
case class UserProperty extends VertexProperty
(val name: String)
case class ProductProperty extends VertexProperty
(val name: String, val price: Double)
// The graph might then have the type:
val graph: Graph[VertexProperty, String]
{% endhighlight %}
Like RDDs, property graphs are immutable, distributed, and fault-tolerant. Changes to the values or
structure of the graph are accomplished by producing a new graph with the desired changes. The graph
is partitioned across the workers using a range of vertex-partitioning heuristics. As with RDDs,
each partition of the graph can be recreated on a different machine in the event of a failure.
Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the
properties for each vertex and edge: properties for each vertex and edge. As a consequence, the graph class contains members to access
the vertices and edges of the graph:
{% highlight scala %} {% highlight scala %}
class Graph[VD: ClassTag, ED: ClassTag] { val vertices: VertexRDD[VD]
val vertices: RDD[(VertexId, VD)] val edges: EdgeRDD[ED]
val edges: RDD[Edge[ED]]
// ...
}
{% endhighlight %} {% endhighlight %}
> Note that the vertices and edges of the graph are actually of type `VertexRDD[VD]` and The classes `VertexRDD[VD]` and `EdgeRDD[ED]` extend and are optimized versions of `RDD[(VertexId,
> `EdgeRDD[ED]` respectively. These classes extend and are optimized versions of `RDD[(VertexId, VD)]` and `RDD[Edge[ED]]` respectively. Both `VertexRDD[VD]` and `EdgeRDD[ED]` provide additional
> VD)]` and `RDD[Edge[ED]]` with additional functionality built around the internal index and column functionality built around graph computation and leverage internal optimizations. We discuss the
> oriented representations. We discuss the `VertexRDD` and `EdgeRDD` API in greater detail in the `VertexRDD` and `EdgeRDD` API in greater detail in the section on [vertex and edge
> section on [vertex and edge RDDs](#vertex_and_edge_rdds) RDDs](#vertex_and_edge_rdds) but for now they can be thought of as simply RDDs of the form:
`RDD[(VertexId, VD)]` and `RDD[Edge[ED]]`.
### Example Property Graph
For example, we might construct a property graph consisting of various collaborators on the GraphX Suppose we want to construct a property graph consisting of the various collaborators on the GraphX
project. The vertex property contains the username and occupation and the edge property contains project. The vertex property might contain the username and occupation. We could annotate edges
a string describing the relationships between the users. with a string describing the relationships between collaborators:
<p style="text-align: center;"> <p style="text-align: center;">
<img src="img/property_graph.png" <img src="img/property_graph.png"
...@@ -183,18 +195,19 @@ graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc"}.count ...@@ -183,18 +195,19 @@ graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc"}.count
graph.edges.filter(e => e.srcId > e.dstId).count graph.edges.filter(e => e.srcId > e.dstId).count
{% endhighlight %} {% endhighlight %}
> Note that `graph.vertices` returns an `RDD[(VertexId, (String, String))]` and so we must use the > Note that `graph.vertices` returns an `VertexRDD[(String, String)]` which extends
> scala `case` expression to deconstruct the tuple. Alternatively, `graph.edges` returns an `RDD` > `RDD[(VertexId, (String, String))]` and so we use the scala `case` expression to deconstruct
> containing `Edge[String]` objects. We could have also used the case class type constructor as > the tuple. Alternatively, `graph.edges` returns an `EdgeRDD` containing `Edge[String]` objects.
> in the following: > We could have also used the case class type constructor as in the following:
> {% highlight scala %} > {% highlight scala %}
graph.edges.filter { case Edge(src, dst, prop) => src < dst }.count graph.edges.filter { case Edge(src, dst, prop) => src < dst }.count
{% endhighlight %} {% endhighlight %}
In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view. In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view.
The triplet view logically joins the vertex and edge properties yielding an `RDD[EdgeTriplet[VD, The triplet view logically joins the vertex and edge properties yielding an `RDD[EdgeTriplet[VD,
ED]]` consisting of [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet). ED]]` containing instances of the
This *join* can be expressed in the following SQL expression: [`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) class. This *join* can be
expressed in the following SQL expression:
{% highlight sql %} {% highlight sql %}
SELECT src.id, dst.id, src.attr, e.attr, dst.attr SELECT src.id, dst.id, src.attr, e.attr, dst.attr
...@@ -266,7 +279,7 @@ defined `map` function. ...@@ -266,7 +279,7 @@ defined `map` function.
> does not preserve the structural indicies and would not benefit from the substantial system > does not preserve the structural indicies and would not benefit from the substantial system
> optimizations in GraphX. > optimizations in GraphX.
> {% highlight scala %} > {% highlight scala %}
val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr))} val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr)) }
val newGraph = Graph(newVertices, graph.edges) val newGraph = Graph(newVertices, graph.edges)
{% endhighlight %} {% endhighlight %}
...@@ -291,12 +304,9 @@ add more in the future. The following is a list of the basic structural operato ...@@ -291,12 +304,9 @@ add more in the future. The following is a list of the basic structural operato
{% highlight scala %} {% highlight scala %}
def reverse: Graph[VD, ED] def reverse: Graph[VD, ED]
def subgraph(epred: EdgeTriplet[VD,ED] => Boolean,
def subgraph(epred: EdgeTriplet[VD,ED] => Boolean = (x => true), vpred: (VertexID, VD) => Boolean): Graph[VD, ED]
vpred: (VertexID, VD) => Boolean = ((v,d) => true) ): Graph[VD, ED]
def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED] def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED] def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
{% endhighlight %} {% endhighlight %}
...@@ -309,7 +319,7 @@ The `subgraph` operator takes vertex and edge predicates and returns the graph c ...@@ -309,7 +319,7 @@ The `subgraph` operator takes vertex and edge predicates and returns the graph c
vertices that satisfy the vertex predicate (evaluate to true) and edges that satisfy the edge vertices that satisfy the vertex predicate (evaluate to true) and edges that satisfy the edge
predicate *and connect vertices that satisfy the vertex predicate*. The `subgraph` operator can be predicate *and connect vertices that satisfy the vertex predicate*. The `subgraph` operator can be
used in number of situations to restrict the graph to the vertices and edges of interest or used in number of situations to restrict the graph to the vertices and edges of interest or
eliminate broken links. For example in the following code we remove broken links: eliminate broken links. For example in the following code we remove broken links:
{% highlight scala %} {% highlight scala %}
val users: RDD[(VertexId, (String, String))] val users: RDD[(VertexId, (String, String))]
...@@ -322,32 +332,35 @@ val graph = Graph(users, relationships, defaultUser) ...@@ -322,32 +332,35 @@ val graph = Graph(users, relationships, defaultUser)
val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing") val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
{% endhighlight %} {% endhighlight %}
The `mask` operators returns the subgraph containing only the vertices and edges that are found in > Note in the above example only the vertex predicate is provided. The `subgraph` operator defaults
the input graph. This can be used in conjunction with the `subgraph` operator to restrict a graph > to `true` if the vertex or edge predicates are not provided.
based on the properties in another related graph. For example, we might run connected components
using the graph with missing vertices and then restrict the answer to the valid subgraph. The `mask` operator also constructs a subgraph by returning a graph that contains the vertices and
edges that are also found in the input graph. This can be used in conjunction with the `subgraph`
operator to restrict a graph based on the properties in another related graph. For example, we
might run connected components using the graph with missing vertices and then restrict the answer to
the valid subgraph.
{% highlight scala %} {% highlight scala %}
// Run Connected Components // Run Connected Components
val ccGraph = graph.connectedComponents() val ccGraph = graph.connectedComponents() // No longer contains missing field
// Remove missing vertices as well as the edges to connected to them // Remove missing vertices as well as the edges to connected to them
val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing") val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
// Restrict the answer to the valid subgraph // Restrict the answer to the valid subgraph
val validCCGraph = ccGraph.mask(validGraph) val validCCGraph = ccGraph.mask(validGraph)
{% endhighlight %} {% endhighlight %}
The `groupEdges` operator merges parallel edges: duplicate edges between pairs of vertices. In many The `groupEdges` operator merges parallel edges (i.e., duplicate edges between pairs of vertices) in
numerical applications parallel edges can be *added* (their weights combined) into a single edge the multigraph. In many numerical applications, parallel edges can be *added* (their weights
thereby reducing the graph size in memory as well as the cost of computation. combined) into a single edge thereby reducing the size of the graph.
## Join Operators ## Join Operators
<a name="join_operators"></a> <a name="join_operators"></a>
The ability to move between graph and collection views is a key part of GraphX. In many cases it is In many cases it is necessary to join data from external collections (RDDs) with graphs. For
necessary to join data from external collections (RDDs) with graphs. For example, we might have example, we might have extra user properties that we want to merge with an existing graph or we
extra user properties that we want to merge with an existing graph or we might want to pull vertex might want to pull vertex properties from one graph into another. These tasks can be accomplished
properties from one graph into another. These tasks can be accomplished using the *join* operators. using the *join* operators. Below we list the key join operators:
Below we list the key join operators:
{% highlight scala %} {% highlight scala %}
def joinVertices[U](table: RDD[(VertexID, U)])(map: (VertexID, VD, U) => VD) def joinVertices[U](table: RDD[(VertexID, U)])(map: (VertexID, VD, U) => VD)
...@@ -356,7 +369,7 @@ def outerJoinVertices[U, VD2](table: RDD[(VertexID, U)])(map: (VertexID, VD, Opt ...@@ -356,7 +369,7 @@ def outerJoinVertices[U, VD2](table: RDD[(VertexID, U)])(map: (VertexID, VD, Opt
: Graph[VD2, ED] : Graph[VD2, ED]
{% endhighlight %} {% endhighlight %}
The `joinVertices` operators, defined in The `joinVertices` operator, defined in
[`GraphOps.scala`](api/graphx/index.html#org.apache.spark.graphx.GraphOps), joins the vertices with [`GraphOps.scala`](api/graphx/index.html#org.apache.spark.graphx.GraphOps), joins the vertices with
the input RDD and returns a new graph with the vertex properties obtained by applying the user the input RDD and returns a new graph with the vertex properties obtained by applying the user
defined `map` function to the result of the joined vertices. Vertices without a matching value in defined `map` function to the result of the joined vertices. Vertices without a matching value in
......
0% Loading or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment