Commit 22374559, authored 11 years ago by Ankur Dave
Remove GraphX README
Parent: 74fdfac1
1 changed file: README.md (+53, −131)
# GraphX: Unifying Graphs and Tables
GraphX extends the distributed fault-tolerant collections API and
interactive console of [Spark](http://spark.incubator.apache.org) with
a new graph API which leverages recent advances in graph systems
(e.g., [GraphLab](http://graphlab.org)) to enable users to easily and
interactively build, transform, and reason about graph structured data
at scale.
## Motivation
From social networks and targeted advertising to protein modeling and
astrophysics, big graphs capture the structure in data and are central
to the recent advances in machine learning and data mining. Directly
applying existing *data-parallel* tools (e.g.,
[Hadoop](http://hadoop.apache.org) and
[Spark](http://spark.incubator.apache.org)) to graph computation tasks
can be cumbersome and inefficient. The need for intuitive, scalable
tools for graph computation has led to the development of new
*graph-parallel* systems (e.g., [Pregel](http://giraph.apache.org) and
[GraphLab](http://graphlab.org)) which are designed to efficiently
execute graph algorithms. Unfortunately, these systems do not address
the challenges of graph construction and transformation and provide
limited fault-tolerance and support for interactive analysis.
<p align="center">
  <img src="https://raw.github.com/amplab/graphx/master/docs/img/data_parallel_vs_graph_parallel.png" />
</p>
## Solution
The GraphX project combines the advantages of both data-parallel and
graph-parallel systems by efficiently expressing graph computation
within the [Spark](http://spark.incubator.apache.org) framework. We
leverage new ideas in distributed graph representation to efficiently
distribute graphs as tabular data-structures. Similarly, we leverage
advances in data-flow systems to exploit in-memory computation and
fault-tolerance. We provide powerful new operations to simplify graph
construction and transformation. Using these primitives we implement
the PowerGraph and Pregel abstractions in less than 20 lines of code.
Finally, by exploiting the Scala foundation of Spark, we enable users
to interactively load, transform, and compute on massive graphs.
<p align="center">
  <img src="https://raw.github.com/amplab/graphx/master/docs/img/tables_and_graphs.png" />
</p>
## Examples
Suppose I want to build a graph from some text files, restrict the graph
to important relationships and users, run page-rank on the sub-graph, and
then finally return attributes associated with the top users. I can do
all of this in just a few lines with GraphX:
```scala
// Connect to the Spark cluster
val sc = new SparkContext("spark://master.amplab.org", "research")

// Load my user data and parse into tuples of user id and attribute list
val users = sc.textFile("hdfs://user_attributes.tsv")
  .map(line => line.split("\t"))               // split each tab-separated line
  .map(parts => (parts.head, parts.tail))

// Parse the edge data which is already in userId -> userId format
val followerGraph = Graph.textFile(sc, "hdfs://followers.tsv")

// Attach the user attributes
val graph = followerGraph.outerJoinVertices(users) {
  case (uid, deg, Some(attrList)) => attrList
  // Some users may not have attributes so we set them as empty
  case (uid, deg, None) => Array.empty[String]
}

// Restrict the graph to users which have exactly two attributes
val subgraph = graph.subgraph((vid, attr) => attr.size == 2)

// Compute the PageRank
val pagerankGraph = Analytics.pagerank(subgraph)

// Get the attributes of the top pagerank users
val userInfoWithPageRank = subgraph.outerJoinVertices(pagerankGraph.vertices) {
  case (uid, attrList, Some(pr)) => (pr, attrList)
  // Vertices without a PageRank score default to 0.0
  case (uid, attrList, None) => (0.0, attrList)
}

println(userInfoWithPageRank.top(5))
```
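The Solution section above claims that the PowerGraph and Pregel abstractions can be written in a handful of lines on top of these primitives. As a rough illustration only, the sketch below expresses connected components with the `pregel` operator and `GraphLoader` from the GraphX API that was later merged into Spark; the corresponding names in this snapshot of the repository may differ, so treat it as a hedged sketch rather than this project's exact API.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.graphx._

// Illustrative sketch (API names from the later, merged GraphX): connected
// components via pregel. Every vertex repeatedly adopts the smallest vertex id
// it has seen, so all vertices in one component converge to the same label.
val sc = new SparkContext("local[2]", "pregel-sketch")
val graph = GraphLoader.edgeListFile(sc, "hdfs://followers.tsv")

// Start each vertex with its own id as its component label.
val labeled = graph.mapVertices((vid, _) => vid)

val components = labeled.pregel(Long.MaxValue)(
  (id, label, msg) => math.min(label, msg),      // vertex program: keep the smallest label
  triplet => {                                   // send the smaller label across each edge
    if (triplet.srcAttr < triplet.dstAttr) Iterator((triplet.dstId, triplet.srcAttr))
    else if (triplet.dstAttr < triplet.srcAttr) Iterator((triplet.srcId, triplet.dstAttr))
    else Iterator.empty
  },
  (a, b) => math.min(a, b)                       // merge concurrent messages
)

components.vertices.take(5).foreach(println)
```

Each superstep keeps the smallest label seen so far and only sends messages across edges whose endpoints still disagree, which is the essence of the Pregel "think like a vertex" model.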
# Apache Spark

Lightning-Fast Cluster Computing - <http://spark.incubator.apache.org/>

## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the project webpage at
<http://spark.incubator.apache.org/documentation.html>.
This README file only contains basic setup instructions.
## Building
Spark requires Scala 2.10. The project is built using Simple Build Tool (SBT),
which can be obtained [here](http://www.scala-sbt.org). If SBT is installed, we
will use the system version of sbt; otherwise we will attempt to download it
automatically. To build Spark and its example programs, run:

    ./sbt/sbt assembly
Once you've built Spark, the easiest way to start using it is the shell:

    ./bin/spark-shell

Or, for the Python API, the Python shell (`./bin/pyspark`).
Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> <params>`. For example:

    ./bin/run-example org.apache.spark.examples.SparkLR local[2]
will run the Logistic Regression example locally on 2 CPUs.
Each of the example programs prints usage help if no params are given.
All of the Spark samples take a `<master>` parameter that is the cluster URL
to connect to. This can be a mesos:// or spark:// URL, or "local" to run
locally with one thread, or "local[N]" to run locally with N threads.
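The same `<master>` string is what you pass when constructing a `SparkContext` in your own program. Below is a minimal, hypothetical sketch (the application name and the toy computation are made up) showing how that URL chooses where the job runs:

```scala
import org.apache.spark.SparkContext

// Minimal sketch: the first SparkContext argument is the same <master> URL the
// sample programs take. "local[2]" runs locally with two threads; a
// spark://host:7077 or mesos://host:5050 URL would run against a real cluster.
object MasterUrlExample {
  def main(args: Array[String]) {
    val sc = new SparkContext("local[2]", "MasterUrlExample")
    val evens = sc.parallelize(1 to 1000).filter(_ % 2 == 0).count()
    println("Counted " + evens + " even numbers")
    sc.stop()
  }
}
```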
## Running tests
Testing first requires [Building](#building) Spark. Once Spark is built, tests
can be run using `./sbt/sbt test`.
## A Note About Hadoop Versions
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.
You can change the version by setting the `SPARK_HADOOP_VERSION` environment
variable when building Spark.
For Apache Hadoop versions 1.x, Cloudera CDH MRv1, and other Hadoop
versions without YARN, use:
...
    # Cloudera CDH 4.2.0 with MapReduce v1
    $ SPARK_HADOOP_VERSION=2.0.0-mr1-cdh4.2.0 sbt/sbt assembly
For Apache Hadoop 2.2.X, 2.1.X, 2.0.X, 0.23.x, Cloudera CDH MRv2, and other
Hadoop versions with YARN, also set `SPARK_YARN=true`:
    # Apache Hadoop 2.0.5-alpha
    ...
    # Cloudera CDH 4.2.0 with MapReduce v2
    $ SPARK_HADOOP_VERSION=2.0.0-cdh4.2.0 SPARK_YARN=true sbt/sbt assembly

    # Apache Hadoop 2.2.X and newer
    $ SPARK_HADOOP_VERSION=2.2.0 SPARK_YARN=true sbt/sbt assembly

For convenience, these variables may also be set through the
`conf/spark-env.sh` file described below.
When developing a Spark application, specify the Hadoop version by adding the
"hadoop-client" artifact to your project's dependencies. For example, if you're
using Hadoop 1.2.1 and build your application using SBT, add this entry to
...

    "org.apache.hadoop" % "hadoop-client" % "1.2.1"
If your project is built with Maven, add this to your POM file's
`<dependencies>` section:

    <dependency>
      <groupId>org.apache.hadoop</groupId>
...
## Configuration
Please refer to the
[Configuration guide](http://spark.incubator.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.
## Apache Incubator Notice
Apache Spark is an effort undergoing incubation at The Apache Software
Foundation (ASF), sponsored by the Apache Incubator. Incubation is required of
all newly accepted projects until a further review indicates that the
infrastructure, communications, and decision making process have stabilized in
a manner consistent with other successful ASF projects. While incubation status
is not necessarily a reflection of the completeness or stability of the code,
it does indicate that the project has yet to be fully endorsed by the ASF.
## Contributing to Spark
Contributions via GitHub pull requests are gladly accepted from their original
author. Along with any pull requests, please state that the contribution is
your original work and that you license the work to the project under the
project's open source license. Whether or not you state this explicitly, by
submitting any copyrighted material via pull request, email, or other means
you agree to license the material under the project's open source license and
warrant that you have the legal authority to do so.