Commit cf57b1b0
Authored 11 years ago by Joseph E. Gonzalez

Correcting typos in documentation.

Parent: 64c45935
Changes: 1 changed file
docs/graphx-programming-guide.md (+79 additions, −66 deletions)
@@ -19,11 +19,11 @@ title: GraphX Programming Guide

GraphX is the new (alpha) Spark API for graphs and graph-parallel computation. At a high level,
GraphX extends the Spark [RDD](api/core/index.html#org.apache.spark.rdd.RDD) by introducing the
[Resilient Distributed property Graph (RDG)](#property_graph): a directed multigraph with properties
attached to each vertex and edge. To support graph computation, GraphX exposes a set of fundamental
operators (e.g., [subgraph](#structural_operators), [joinVertices](#join_operators), and
[mapReduceTriplets](#mrTriplets)) as well as an optimized variant of the [Pregel](#pregel) API. In
addition, GraphX includes a growing collection of graph [algorithms](#graph_algorithms) and
[builders](#graph_builders) to simplify graph analytics tasks.
## Background on Graph-Parallel Computation
...

@@ -65,15 +65,13 @@ in graph-parallel systems, GraphX is able to optimize the execution of graph ope

## GraphX Replaces the Spark Bagel API

Prior to the release of GraphX, graph computation in Spark was expressed using Bagel, an
implementation of Pregel. GraphX improves upon Bagel by exposing a richer property graph API, a
more streamlined version of the Pregel abstraction, and system optimizations to improve performance
and reduce memory overhead. While we plan to eventually deprecate Bagel, we will continue to
support the [Bagel API](api/bagel/index.html#org.apache.spark.bagel.package) and
[Bagel programming guide](bagel-programming-guide.html). However, we encourage Bagel users to
explore the new GraphX API and comment on issues that may complicate the transition from Bagel.
# Getting Started
...

@@ -94,41 +92,55 @@ The [property graph](api/graphx/index.html#org.apache.spark.graphx.Graph) is a d

graph with user defined objects attached to each vertex and edge. A directed multigraph is a
directed graph with potentially multiple parallel edges sharing the same source and destination
vertex. The ability to support parallel edges simplifies modeling scenarios where there can be
multiple relationships (e.g., co-worker and friend) between the same vertices. Each vertex is keyed
by a *unique* 64-bit long identifier (`VertexId`). Similarly, edges have corresponding source and
destination vertex identifiers. GraphX does not impose any ordering or constraints on the vertex
identifiers. The property graph is parameterized over the vertex `VD` and edge `ED` types. These
are the types of the objects associated with each vertex and edge respectively.

> GraphX optimizes the representation of `VD` and `ED` when they are plain old data-types (e.g.,
> int, double, etc...) reducing the in memory footprint.

In some cases we may wish to have vertices with different property types in the same graph. This
can be accomplished through inheritance. For example, to model users and products as a bipartite
graph we might do the following:

{% highlight scala %}
class VertexProperty()
case class UserProperty(val name: String) extends VertexProperty
case class ProductProperty(val name: String, val price: Double) extends VertexProperty
// The graph might then have the type:
val graph: Graph[VertexProperty, String]
{% endhighlight %}
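Outside of Spark, the same inheritance pattern can be exercised in plain Scala to show how pattern matching recovers the concrete property type. This is a minimal sketch with no Spark dependency; the `show` helper and the sample names are hypothetical, not part of GraphX:

```scala
// Sketch: the vertex property hierarchy from above, exercised with
// plain pattern matching (no Spark required).
class VertexProperty()
case class UserProperty(name: String) extends VertexProperty
case class ProductProperty(name: String, price: Double) extends VertexProperty

// Hypothetical helper: renders a property depending on its concrete type.
def show(p: VertexProperty): String = p match {
  case UserProperty(name)           => s"user $name"
  case ProductProperty(name, price) => s"product $name at $$$price"
  case _                            => "unknown"
}

val props: List[VertexProperty] =
  List(UserProperty("istoica"), ProductProperty("widget", 9.99))
```

Because the base type is a plain class (case classes cannot extend other case classes), both vertex kinds can live in one `Graph[VertexProperty, String]`.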
Like RDDs, property graphs are immutable, distributed, and fault-tolerant. Changes to the values or
structure of the graph are accomplished by producing a new graph with the desired changes. The graph
is partitioned across the workers using a range of vertex-partitioning heuristics. As with RDDs,
each partition of the graph can be recreated on a different machine in the event of a failure.
Logically the property graph corresponds to a pair of typed collections (RDDs) encoding the
properties for each vertex and edge. As a consequence, the graph class contains members to access
the vertices and edges of the graph:
{% highlight scala %}
class Graph[VD: ClassTag, ED: ClassTag] {
  val vertices: VertexRDD[VD]
  val edges: EdgeRDD[ED]
  // ...
}
{% endhighlight %}
The classes `VertexRDD[VD]` and `EdgeRDD[ED]` extend and are optimized versions of
`RDD[(VertexId, VD)]` and `RDD[Edge[ED]]` respectively. Both `VertexRDD[VD]` and `EdgeRDD[ED]`
provide additional functionality built around graph computation and leverage internal
optimizations. We discuss the `VertexRDD` and `EdgeRDD` API in greater detail in the section on
[vertex and edge RDDs](#vertex_and_edge_rdds) but for now they can be thought of as simply RDDs of
the form: `RDD[(VertexId, VD)]` and `RDD[Edge[ED]]`.
### Example Property Graph
Suppose we want to construct a property graph consisting of the various collaborators on the GraphX
project. The vertex property might contain the username and occupation. We could annotate edges
with a string describing the relationships between collaborators:

<p style="text-align: center;">
<img src="img/property_graph.png"
...

@@ -183,18 +195,19 @@ graph.vertices.filter { case (id, (name, pos)) => pos == "postdoc"}.count
graph.edges.filter(e => e.srcId > e.dstId).count
{% endhighlight %}
> Note that `graph.vertices` returns a `VertexRDD[(String, String)]` which extends
> `RDD[(VertexId, (String, String))]` and so we use the Scala `case` expression to deconstruct
> the tuple. Alternatively, `graph.edges` returns an `EdgeRDD` containing `Edge[String]` objects.
> We could have also used the case class type constructor as in the following:
>
> {% highlight scala %}
graph.edges.filter { case Edge(src, dst, prop) => src < dst }.count
{% endhighlight %}
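The same `case Edge(...)` deconstruction can be tried without a cluster by using a local stand-in for GraphX's `Edge` class. The simplified `Edge` and the sample data below are a sketch, not the Spark type:

```scala
// Sketch: Edge deconstruction on plain Scala collections, assuming a
// simplified local Edge class in place of org.apache.spark.graphx.Edge.
case class Edge[ED](srcId: Long, dstId: Long, attr: ED)

val edges = List(
  Edge(3L, 7L, "collab"),
  Edge(5L, 3L, "advisor"),
  Edge(2L, 5L, "colleague"),
  Edge(5L, 7L, "pi")
)

// Count edges whose source id is smaller than their destination id,
// using the case class extractor to name the fields.
val count = edges.count { case Edge(src, dst, prop) => src < dst }
```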
In addition to the vertex and edge views of the property graph, GraphX also exposes a triplet view.
The triplet view logically joins the vertex and edge properties yielding an
`RDD[EdgeTriplet[VD, ED]]` containing instances of the
[`EdgeTriplet`](api/graphx/index.html#org.apache.spark.graphx.EdgeTriplet) class. This *join* can
be expressed in the following SQL expression:

{% highlight sql %}
SELECT src.id, dst.id, src.attr, e.attr, dst.attr
...
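The triplet join can be mimicked on local collections to see what each triplet holds; the tuple layout and sample data below are a sketch, not the GraphX `EdgeTriplet` class:

```scala
// Sketch: the triplet join on plain Scala collections — pair each edge
// with the attributes of its source and destination vertices.
case class Edge[ED](srcId: Long, dstId: Long, attr: ED)

val vertices = Map(3L -> "rxin", 5L -> "franklin", 7L -> "jgonzal")
val edges = List(Edge(3L, 7L, "collab"), Edge(5L, 3L, "advisor"))

// Each tuple is (src.id, dst.id, src.attr, e.attr, dst.attr).
val triplets = edges.map { e =>
  (e.srcId, e.dstId, vertices(e.srcId), e.attr, vertices(e.dstId))
}
```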
@@ -266,7 +279,7 @@ defined `map` function.
> does not preserve the structural indices and would not benefit from the substantial system
> optimizations in GraphX.
> {% highlight scala %}
val newVertices = graph.vertices.map { case (id, attr) => (id, mapUdf(id, attr)) }
val newGraph = Graph(newVertices, graph.edges)
{% endhighlight %}
...

@@ -291,12 +304,9 @@ add more in the future. The following is a list of the basic structural operato
{% highlight scala %}
def reverse: Graph[VD, ED]
def subgraph(epred: EdgeTriplet[VD,ED] => Boolean = (x => true),
             vpred: (VertexID, VD) => Boolean = ((v,d) => true) ): Graph[VD, ED]
def mask[VD2, ED2](other: Graph[VD2, ED2]): Graph[VD, ED]
def groupEdges(merge: (ED, ED) => ED): Graph[VD,ED]
{% endhighlight %}
...

@@ -309,7 +319,7 @@ The `subgraph` operator takes vertex and edge predicates and returns the graph c
vertices that satisfy the vertex predicate (evaluate to true) and edges that satisfy the edge
predicate *and connect vertices that satisfy the vertex predicate*. The `subgraph` operator can be
used in a number of situations to restrict the graph to the vertices and edges of interest or
eliminate broken links. For example, in the following code we remove broken links:
{% highlight scala %}
val users: RDD[(VertexId, (String, String))]
...

@@ -322,32 +332,35 @@ val graph = Graph(users, relationships, defaultUser)
val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
{% endhighlight %}
> Note in the above example only the vertex predicate is provided. The `subgraph` operator defaults
> to `true` if the vertex or edge predicates are not provided.

The `mask` operator also constructs a subgraph by returning a graph that contains the vertices and
edges that are also found in the input graph. This can be used in conjunction with the `subgraph`
operator to restrict a graph based on the properties in another related graph. For example, we
might run connected components using the graph with missing vertices and then restrict the answer
to the valid subgraph.
{% highlight scala %}
// Run Connected Components
val ccGraph = graph.connectedComponents()
// Remove missing vertices as well as the edges connected to them
val validGraph = graph.subgraph((id, attr) => attr._2 != "Missing")
// Restrict the answer to the valid subgraph
val validCCGraph = ccGraph.mask(validGraph)
{% endhighlight %}
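The effect of `mask` can be sketched on plain sets of vertex ids: keep only the elements of one graph that also appear in the other. This simplification ignores edge masking and attribute handling and is not the GraphX implementation:

```scala
// Sketch: mask semantics on vertex id sets only — a vertex survives when
// it appears in both the target graph and the mask graph.
val ccVertices    = Set(1L, 2L, 3L, 4L, 5L)   // vertices of ccGraph
val validVertices = Set(1L, 2L, 4L)           // vertices of validGraph

// ccGraph.mask(validGraph) keeps the intersection, with ccGraph's attributes.
val masked = ccVertices.intersect(validVertices)
```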
The `groupEdges` operator merges parallel edges (i.e., duplicate edges between pairs of vertices)
in the multigraph. In many numerical applications, parallel edges can be *added* (their weights
combined) into a single edge, thereby reducing the size of the graph.
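The weight-combining behaviour of `groupEdges` can be sketched with a local `groupBy` and `reduce`. This is a plain-Scala stand-in under assumed sample data, not the GraphX implementation:

```scala
// Sketch: merge parallel edges (same source and destination) by summing
// their Double weights, mirroring groupEdges((a, b) => a + b).
case class Edge[ED](srcId: Long, dstId: Long, attr: ED)

val parallel = List(
  Edge(1L, 2L, 0.5), Edge(1L, 2L, 1.5), Edge(2L, 3L, 2.0)
)

val merged = parallel
  .groupBy(e => (e.srcId, e.dstId))
  .values
  .map(_.reduce((a, b) => Edge(a.srcId, a.dstId, a.attr + b.attr)))
  .toList
```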
## Join Operators
<a name="join_operators"></a>
In many cases it is necessary to join data from external collections (RDDs) with graphs. For
example, we might have extra user properties that we want to merge with an existing graph or we
might want to pull vertex properties from one graph into another. These tasks can be accomplished
using the *join* operators. Below we list the key join operators:
{% highlight scala %}
def joinVertices[U](table: RDD[(VertexID, U)])(map: (VertexID, VD, U) => VD)
...

@@ -356,7 +369,7 @@ def outerJoinVertices[U, VD2](table: RDD[(VertexID, U)])(map: (VertexID, VD, Opt
  : Graph[VD2, ED]
{% endhighlight %}
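The outer-join semantics suggested by the `Option` in the second signature can be previewed on local maps: every vertex is visited, and the map function receives `None` when the side table has no matching row. The maps and sample values here are a sketch, not the GraphX code path:

```scala
// Sketch: outerJoinVertices semantics on plain maps, assuming a vertex
// map and a side table keyed by the same vertex ids.
val vertices   = Map(1L -> "alice", 2L -> "bob")
val outDegrees = Map(1L -> 3)                 // side table; no row for 2L

// Each vertex keeps its property, paired with Option of the joined value.
val joined = vertices.map { case (id, name) =>
  id -> (name, outDegrees.get(id))            // None when no match
}
```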
The `joinVertices` operator, defined in
[`GraphOps.scala`](api/graphx/index.html#org.apache.spark.graphx.GraphOps), joins the vertices with
the input RDD and returns a new graph with the vertex properties obtained by applying the user
defined `map` function to the result of the joined vertices. Vertices without a matching value in
...