diff --git a/docs/_layouts/global.html b/docs/_layouts/global.html
index ad7969d012283a009a9b3bf71c6da3e80fbbdbcd..7721854685fb2e0b850f474b85b6eb24e46fa0f4 100755
--- a/docs/_layouts/global.html
+++ b/docs/_layouts/global.html
@@ -21,7 +21,7 @@
         <link rel="stylesheet" href="css/main.css">
         <script src="js/vendor/modernizr-2.6.1-respond-1.1.0.min.js"></script>
-
+
         <link rel="stylesheet" href="css/pygments-default.css">

         <!-- Google analytics script -->
@@ -67,10 +67,10 @@
                             <li class="divider"></li>
                             <li><a href="streaming-programming-guide.html">Spark Streaming</a></li>
                             <li><a href="mllib-guide.html">MLlib (Machine Learning)</a></li>
-                            <li><a href="bagel-programming-guide.html">Bagel (Pregel on Spark)</a></li>
+                            <li><a href="graphx-programming-guide.html">GraphX (Graph-Parallel Spark)</a></li>
                         </ul>
                     </li>
-
+
                     <li class="dropdown">
                         <a href="#" class="dropdown-toggle" data-toggle="dropdown">API Docs<b class="caret"></b></a>
                         <ul class="dropdown-menu">
@@ -79,7 +79,7 @@
                             <li class="divider"></li>
                             <li><a href="api/streaming/index.html#org.apache.spark.streaming.package">Spark Streaming</a></li>
                             <li><a href="api/mllib/index.html#org.apache.spark.mllib.package">MLlib (Machine Learning)</a></li>
-                            <li><a href="api/bagel/index.html#org.apache.spark.bagel.package">Bagel (Pregel on Spark)</a></li>
+                            <li><a href="api/graphx/index.html#org.apache.spark.graphx.package">GraphX (Graph-Parallel Spark)</a></li>
                         </ul>
                     </li>
@@ -161,7 +161,7 @@
     <script src="js/vendor/jquery-1.8.0.min.js"></script>
     <script src="js/vendor/bootstrap.min.js"></script>
     <script src="js/main.js"></script>
-
+
     <!-- A script to fix internal hash links because we have an overlapping top bar.
         Based on https://github.com/twitter/bootstrap/issues/193#issuecomment-2281510 -->
    <script>
diff --git a/docs/graphx-programming-guide.md b/docs/graphx-programming-guide.md
index ebc47f5d1c43c137ea580ee74ddcce3ab12c3973..a551e4306d9600d5debc68482d140a98e66f97e3 100644
--- a/docs/graphx-programming-guide.md
+++ b/docs/graphx-programming-guide.md
@@ -1,63 +1,141 @@
 ---
 layout: global
-title: "GraphX: Unifying Graphs and Tables"
+title: GraphX Programming Guide
 ---
+* This will become a table of contents (this text will be scraped).
+{:toc}

-GraphX extends the distributed fault-tolerant collections API and
-interactive console of [Spark](http://spark.incubator.apache.org) with
-a new graph API which leverages recent advances in graph systems
-(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
-interactively build, transform, and reason about graph structured data
-at scale.
-
-
-## Motivation
-
-From social networks and targeted advertising to protein modeling and
-astrophysics, big graphs capture the structure in data and are central
-to the recent advances in machine learning and data mining. Directly
-applying existing *data-parallel* tools (e.g.,
-[Hadoop](http://hadoop.apache.org) and
-[Spark](http://spark.incubator.apache.org)) to graph computation tasks
-can be cumbersome and inefficient. The need for intuitive, scalable
-tools for graph computation has lead to the development of new
-*graph-parallel* systems (e.g.,
-[Pregel](http://http://giraph.apache.org) and
-[GraphLab](http://graphlab.org)) which are designed to efficiently
-execute graph algorithms. Unfortunately, these systems do not address
-the challenges of graph construction and transformation and provide
-limited fault-tolerance and support for interactive analysis.
-
-{:.pagination-centered}
-
-
-## Solution
-
-The GraphX project combines the advantages of both data-parallel and
-graph-parallel systems by efficiently expressing graph computation
-within the [Spark](http://spark.incubator.apache.org) framework. We
-leverage new ideas in distributed graph representation to efficiently
-distribute graphs as tabular data-structures. Similarly, we leverage
-advances in data-flow systems to exploit in-memory computation and
-fault-tolerance. We provide powerful new operations to simplify graph
-construction and transformation. Using these primitives we implement
-the PowerGraph and Pregel abstractions in less than 20 lines of code.
-Finally, by exploiting the Scala foundation of Spark, we enable users
-to interactively load, transform, and compute on massive graphs.
-
-<p align="center">
-  <img src="https://raw.github.com/amplab/graphx/master/docs/img/tables_and_graphs.png" />
+<p style="text-align: center;">
+  <img src="img/graphx_logo.png"
+       title="GraphX Logo"
+       alt="GraphX"
+       width="65%" />
 </p>

-## Examples
+# Overview
+
+GraphX is the new (alpha) Spark API for graphs and graph-parallel
+computation. At a high level, GraphX extends the Spark
+[RDD](api/core/index.html#org.apache.spark.rdd.RDD) by
+introducing the [Resilient Distributed Property Graph (RDG)](#property_graph):
+a directed graph with properties attached to each vertex and edge.
+To support graph computation, GraphX exposes a set of functions
+(e.g., [mapReduceTriplets](#mrTriplets)) as well as optimized variants of the
+[Pregel](http://giraph.apache.org) and [GraphLab](http://graphlab.org)
+APIs. In addition, GraphX includes a growing collection of graph
+[algorithms](#graph_algorithms) and [builders](#graph_builders) to simplify
+graph analytics tasks.
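As an aside on the property-graph model the new overview describes, the following is a minimal conceptual sketch in plain Scala (no Spark required). The `Edge` class, `PropertyGraphSketch` object, and the in-degree computation are illustrative stand-ins, not the GraphX API itself; they only mimic the map-then-reduce-per-vertex semantics that `mapReduceTriplets` provides over distributed data.

```scala
// Conceptual sketch only: models a property graph with plain Scala
// collections. All names here are illustrative, not the GraphX API.
case class Edge(srcId: Long, dstId: Long, attr: String)

object PropertyGraphSketch {
  // Vertex property: (username, age); edge property: relationship type.
  val vertices: Map[Long, (String, Int)] =
    Map(1L -> ("alice", 28), 2L -> ("bob", 27), 3L -> ("charlie", 65))

  val edges: Seq[Edge] =
    Seq(Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"), Edge(3L, 1L, "follows"))

  // mapReduceTriplets, conceptually: map each edge triplet to messages
  // addressed to vertices, then reduce the messages arriving at each
  // vertex. Here every edge sends the message 1 to its destination and
  // the per-vertex reduction is a sum, yielding in-degrees.
  def inDegrees: Map[Long, Int] =
    edges
      .map(e => e.dstId -> 1)                                // map phase
      .groupBy(_._1)                                         // deliver to target vertex
      .map { case (vid, msgs) => vid -> msgs.map(_._2).sum } // reduce phase

  def main(args: Array[String]): Unit =
    println(inDegrees)
}
```

In GraphX itself the analogous operation runs over distributed RDDs of vertices and edges rather than local collections, which is what makes the restricted map/reduce structure worth having.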
+
+## Background on Graph-Parallel Computation
+
+From social networks to language modeling, the growing scale and importance of
+graph data have driven the development of numerous new *graph-parallel* systems
+(e.g., [Giraph](http://giraph.apache.org) and
+[GraphLab](http://graphlab.org)). By restricting the types of computation that can be
+expressed and introducing new techniques to partition and distribute graphs,
+these systems can efficiently execute sophisticated graph algorithms orders of
+magnitude faster than more general *data-parallel* systems.
+
+<p style="text-align: center;">
+  <img src="img/data_parallel_vs_graph_parallel.png"
+       title="Data-Parallel vs. Graph-Parallel"
+       alt="Data-Parallel vs. Graph-Parallel"
+       width="50%" />
+</p>
+
+However, the same restrictions that enable these substantial performance gains
+also make it difficult to express many of the important stages in a typical graph-analytics pipeline:
+constructing the graph, modifying its structure, or expressing computation that
+spans multiple graphs. As a consequence, existing graph analytics pipelines
+compose graph-parallel and data-parallel systems, leading to extensive data
+movement and duplication and a complicated programming model.
+
+<p style="text-align: center;">
+  <img src="img/graph_analytics_pipeline.png"
+       title="Graph Analytics Pipeline"
+       alt="Graph Analytics Pipeline"
+       width="50%" />
+</p>
+
+The goal of the GraphX project is to unify graph-parallel and data-parallel
+computation in one system with a single composable API. This goal is achieved
+through an API that enables users to view data both as a graph and as
+collections (i.e., RDDs) without data movement or duplication and by
+incorporating advances in graph-parallel systems to optimize the execution of
+operations on the graph view. In preliminary experiments we find that the GraphX
+system is able to achieve performance comparable to state-of-the-art
+graph-parallel systems while easily expressing the entire analytics pipeline.
+
+<p style="text-align: center;">
+  <img src="img/graphx_performance_comparison.png"
+       title="GraphX Performance Comparison"
+       alt="GraphX Performance Comparison"
+       width="50%" />
+</p>
+
+## GraphX Replaces the Spark Bagel API
+
+Prior to the release of GraphX, graph computation in Spark was expressed using
+Bagel, an implementation of the Pregel API. GraphX improves upon Bagel by exposing
+a richer property graph API, a more streamlined version of the Pregel abstraction,
+and system optimizations to improve performance and reduce memory
+overhead. While we plan to eventually deprecate Bagel, we will continue to
+support the API and the [Bagel programming guide](bagel-programming-guide.html). However,
+we encourage Bagel users to explore the new GraphX API and comment on issues that may
+complicate the transition from Bagel.
+
+# The Property Graph
+<a name="property_graph"></a>
+
+<p style="text-align: center;">
+  <img src="img/edge_cut_vs_vertex_cut.png"
+       title="Edge Cut vs. Vertex Cut"
+       alt="Edge Cut vs. Vertex Cut"
+       width="50%" />
+</p>
+
+<p style="text-align: center;">
+  <img src="img/property_graph.png"
+       title="The Property Graph"
+       alt="The Property Graph"
+       width="50%" />
+</p>
+
+<p style="text-align: center;">
+  <img src="img/vertex_routing_edge_tables.png"
+       title="RDD Graph Representation"
+       alt="RDD Graph Representation"
+       width="50%" />
+</p>
+
+
+# Graph Operators
+
+## Map Reduce Triplets (mapReduceTriplets)
+<a name="mrTriplets"></a>
+
+# Graph Algorithms
+<a name="graph_algorithms"></a>
+
+# Graph Builders
+<a name="graph_builders"></a>
+
+<p style="text-align: center;">
+  <img src="img/tables_and_graphs.png"
+       title="Tables and Graphs"
+       alt="Tables and Graphs"
+       width="50%" />
+</p>
+
+# Examples

 Suppose I want to build a graph from some text files, restrict the graph
 to important relationships and users, run page-rank on the sub-graph, and
 then finally return attributes associated with the top users.  I can do
 all of this in just a few lines with GraphX:

-```scala
+{% highlight scala %}
 // Connect to the Spark cluster
 val sc = new SparkContext("spark://master.amplab.org", "research")

@@ -89,108 +167,5 @@ val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices){

 println(userInfoWithPageRank.top(5))

-```
-
-
-## Online Documentation
-
-You can find the latest Spark documentation, including a programming
-guide, on the project webpage at
-<http://spark.incubator.apache.org/documentation.html>. This README
-file only contains basic setup instructions.
-
-
-## Building
-
-Spark requires Scala 2.9.3 (Scala 2.10 is not yet supported). The
-project is built using Simple Build Tool (SBT), which is packaged with
-it. To build Spark and its example programs, run:
-
-    sbt/sbt assembly
-
-Once you've built Spark, the easiest way to start using it is the
-shell:
-
-    ./spark-shell
-
-Or, for the Python API, the Python shell (`./pyspark`).
-
-Spark also comes with several sample programs in the `examples`
-directory. To run one of them, use `./run-example <class>
-<params>`. For example:
-
-    ./run-example org.apache.spark.examples.SparkLR local[2]
-
-will run the Logistic Regression example locally on 2 CPUs.
-
-Each of the example programs prints usage help if no params are given.
-
-All of the Spark samples take a `<master>` parameter that is the
-cluster URL to connect to. This can be a mesos:// or spark:// URL, or
-"local" to run locally with one thread, or "local[N]" to run locally
-with N threads.
-
-
-## A Note About Hadoop Versions
-
-Spark uses the Hadoop core library to talk to HDFS and other
-Hadoop-supported storage systems. Because the protocols have changed
-in different versions of Hadoop, you must build Spark against the same
-version that your cluster runs. You can change the version by setting
-the `SPARK_HADOOP_VERSION` environment when building Spark.
-
-For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
-versions without YARN, use:
-
-    # Apache Hadoop 1.2.1
-    $ SPARK_HADOOP_VERSION=1.2.1 sbt/sbt assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v1
-    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
-
-For Apache Hadoop 2.x, 0.23.x, Cloudera CDH MRv2, and other Hadoop versions
-with YARN, also set `SPARK_YARN=true`:
-
-    # Apache Hadoop 2.0.5-alpha
-    $ SPARK_HADOOP_VERSION=2.0.5-alpha SPARK_YARN=true sbt/sbt assembly
-
-    # Cloudera CDH 4.2.0 with MapReduce v2
-    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly
-
-For convenience, these variables may also be set through the
-`conf/spark-env.sh` file described below.
-
-When developing a Spark application, specify the Hadoop version by adding the
-"hadoop-client" artifact to your project's dependencies. For example, if you're
-using Hadoop 1.2.1 and build your application using SBT, add this entry to
-`libraryDependencies`:
-
-    "org.apache.hadoop" % "hadoop-client" % "1.2.1"
-
-If your project is built with Maven, add this to your POM file's
-`<dependencies>` section:
-
-    <dependency>
-      <groupId>org.apache.hadoop</groupId>
-      <artifactId>hadoop-client</artifactId>
-      <version>1.2.1</version>
-    </dependency>
-
-
-## Configuration
-
-Please refer to the [Configuration
-guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
-in the online documentation for an overview on how to configure Spark.
-
-
-## Contributing to GraphX
+{% endhighlight %}
-
-Contributions via GitHub pull requests are gladly accepted from their
-original author. Along with any pull requests, please state that the
-contribution is your original work and that you license the work to
-the project under the project's open source license. Whether or not
-you state this explicitly, by submitting any copyrighted material via
-pull request, email, or other means you agree to license the material
-under the project's open source license and warrant that you have the
-legal authority to do so.
diff --git a/docs/img/data_parallel_vs_graph_parallel.png b/docs/img/data_parallel_vs_graph_parallel.png
index d9aa811466327dd6960d3b2552112b5259c02a42..d3918f01d8f3b8d39d0d2c1c3ac11c99a5e8253a 100644
Binary files a/docs/img/data_parallel_vs_graph_parallel.png and b/docs/img/data_parallel_vs_graph_parallel.png differ
diff --git a/docs/img/edge_cut_vs_vertex_cut.png b/docs/img/edge_cut_vs_vertex_cut.png
new file mode 100644
index 0000000000000000000000000000000000000000..ae30396d3fe19c594f3ad715236fde4233fc0775
Binary files /dev/null and b/docs/img/edge_cut_vs_vertex_cut.png differ
diff --git a/docs/img/graph_analytics_pipeline.png b/docs/img/graph_analytics_pipeline.png
new file mode 100644
index 0000000000000000000000000000000000000000..6d606e01894aee1a0c02e636911e782ecba539a0
Binary files /dev/null and b/docs/img/graph_analytics_pipeline.png differ
diff --git a/docs/img/graphx_figures.pptx b/docs/img/graphx_figures.pptx
new file mode 100644
index 0000000000000000000000000000000000000000..c67ddb487687d1f795aa3aed63a4763d9a4a11e3
Binary files /dev/null and b/docs/img/graphx_figures.pptx differ
diff --git a/docs/img/graphx_logo.png b/docs/img/graphx_logo.png
new file mode 100644
index 0000000000000000000000000000000000000000..9869ac148cad5e362d46dead2b3995b2909f08e5
Binary files /dev/null and b/docs/img/graphx_logo.png differ
diff --git a/docs/img/graphx_performance_comparison.png b/docs/img/graphx_performance_comparison.png
new file mode 100644
index 0000000000000000000000000000000000000000..62dcf098c904fb9a4c3b97276fc3d98b321fd2d8
Binary files /dev/null and b/docs/img/graphx_performance_comparison.png differ
diff --git a/docs/img/property_graph.png b/docs/img/property_graph.png
new file mode 100644
index 0000000000000000000000000000000000000000..859d4013fb3177f10f00cfdf03e98e1adaa5e82c
Binary files /dev/null and b/docs/img/property_graph.png differ
diff --git a/docs/img/tables_and_graphs.png b/docs/img/tables_and_graphs.png
index 9af07d30811aa11bd9226c16a5a30b8a86fea1c7..ec37bb45a62f04a3aee93e0d7a57ac6c624edc22 100644
Binary files a/docs/img/tables_and_graphs.png and b/docs/img/tables_and_graphs.png differ
diff --git a/docs/img/vertex_routing_edge_tables.png b/docs/img/vertex_routing_edge_tables.png
new file mode 100644
index 0000000000000000000000000000000000000000..4379becc87ee4aee48e484bd420be8f9a0e4548d
Binary files /dev/null and b/docs/img/vertex_routing_edge_tables.png differ
diff --git a/docs/index.md b/docs/index.md
index 86d574daaab4a8fc530f5a7f80a179133c4553a6..7228809738d36eac0e45e7b34a03c8df1804f576 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -5,7 +5,7 @@ title: Spark Overview

 Apache Spark is a fast and general-purpose cluster computing system.
 It provides high-level APIs in [Scala](scala-programming-guide.html), [Java](java-programming-guide.html), and [Python](python-programming-guide.html) that make parallel jobs easy to write, and an optimized engine that supports general computation graphs.
-It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [Bagel](bagel-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
+It also supports a rich set of higher-level tools including [Shark](http://shark.cs.berkeley.edu) (Hive on Spark), [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).

 # Downloading

@@ -77,7 +77,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui
 * [Python Programming Guide](python-programming-guide.html): using Spark from Python
 * [Spark Streaming](streaming-programming-guide.html): using the alpha release of Spark Streaming
 * [MLlib (Machine Learning)](mllib-guide.html): Spark's built-in machine learning library
-* [Bagel (Pregel on Spark)](bagel-programming-guide.html): simple graph processing model
+* [GraphX (Graphs on Spark)](graphx-programming-guide.html): graph-parallel computation on Spark

 **API Docs:**

@@ -85,7 +85,7 @@ For this version of Spark (0.8.1) Hadoop 2.2.x (or newer) users will have to bui
 * [Spark for Python (Epydoc)](api/pyspark/index.html)
 * [Spark Streaming for Java/Scala (Scaladoc)](api/streaming/index.html)
 * [MLlib (Machine Learning) for Java/Scala (Scaladoc)](api/mllib/index.html)
-* [Bagel (Pregel on Spark) for Scala (Scaladoc)](api/bagel/index.html)
+* [GraphX (Graphs on Spark) for Scala (Scaladoc)](api/graphx/index.html)

 **Deployment guides:**