---
displayTitle: Spark Overview
title: Overview
description: Apache Spark SPARK_VERSION_SHORT documentation homepage
---
    
Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including [Spark SQL](sql-programming-guide.html) for SQL and structured data processing, [MLlib](mllib-guide.html) for machine learning, [GraphX](graphx-programming-guide.html) for graph processing, and [Spark Streaming](streaming-programming-guide.html).
    
# Downloading

Get Spark from the [downloads page](http://spark.apache.org/downloads.html) of the project website. This documentation is for Spark version {{site.SPARK_VERSION}}. The downloads page contains Spark packages for many popular HDFS versions. If you'd like to build Spark from scratch, visit [Building Spark](building-spark.html).
    
    Spark runs on both Windows and UNIX-like systems (e.g. Linux, Mac OS). It's easy to run
    locally on one machine --- all you need is to have `java` installed on your system `PATH`,
    or the `JAVA_HOME` environment variable pointing to a Java installation.
    
Spark runs on Java 7+, Python 2.6+ and R 3.1+. For the Scala API, Spark {{site.SPARK_VERSION}} uses Scala {{site.SCALA_BINARY_VERSION}}. You will need to use a compatible Scala version ({{site.SCALA_BINARY_VERSION}}.x).
    
    # Running the Examples and Shell
    
Spark comes with several sample programs. Scala, Java, Python and R examples are in the `examples/src/main` directory. To run one of the Java or Scala sample programs, use `bin/run-example <class> [params]` in the top-level Spark directory. (Behind the scenes, this invokes the more general [`spark-submit` script](submitting-applications.html) for launching applications.)
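For example, to compute an approximation to π with the bundled `SparkPi` program:

    ./bin/run-example SparkPi 10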
    
You can also run Spark interactively through a modified version of the Scala shell. This is a great way to learn the framework.
    
        ./bin/spark-shell --master local[2]
    
The `--master` option specifies the [master URL for a distributed cluster](submitting-applications.html#master-urls), or `local` to run locally with one thread, or `local[N]` to run locally with N threads. You should start by using `local` for testing. For a full list of options, run Spark shell with the `--help` option.
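
Inside the shell, a `SparkContext` is available as the variable `sc`. As a quick check that everything works (a minimal sketch; any small computation will do), you can sum the integers 1 through 100:

    scala> sc.parallelize(1 to 100).reduce(_ + _)
    res0: Int = 5050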
    
    
    Spark also provides a Python API. To run Spark interactively in a Python interpreter, use
    `bin/pyspark`:
    
        ./bin/pyspark --master local[2]
    
    Example applications are also provided in Python. For example,
    
        ./bin/spark-submit examples/src/main/python/pi.py 10
    
Since version 1.4, Spark also provides an experimental [R API](sparkr.html) (only the DataFrames API is included). To run Spark interactively in an R interpreter, use `bin/sparkR`:
    
        ./bin/sparkR --master local[2]
    
    Example applications are also provided in R. For example,
        
        ./bin/spark-submit examples/src/main/r/dataframe.R
    
    
    # Launching on a Cluster
    
The Spark [cluster mode overview](cluster-overview.html) explains the key concepts in running on a cluster. Spark can run by itself or on several existing cluster managers. It currently provides several options for deployment (a sample submission command follows the list):
    
    * [Amazon EC2](ec2-scripts.html): our EC2 scripts let you launch a cluster in about 5 minutes
    
    * [Standalone Deploy Mode](spark-standalone.html): simplest way to deploy Spark on a private cluster
    * [Apache Mesos](running-on-mesos.html)
    * [Hadoop YARN](running-on-yarn.html)
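
However you deploy it, applications are submitted to a cluster with the [`spark-submit` script](submitting-applications.html). A minimal sketch, assuming a standalone master reachable at `spark://host:7077` (substitute your own master URL) and the examples jar shipped under `lib/` in the binary release:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://host:7077 \
      lib/spark-examples-*.jar \
      100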
    
    
# Where to Go from Here

**Programming Guides:**
    * [Quick Start](quick-start.html): a quick introduction to the Spark API; start here!
    
* [Spark Programming Guide](programming-guide.html): detailed overview of Spark in all supported languages (Scala, Java, Python, R)
    
    * Modules built on Spark:
      * [Spark Streaming](streaming-programming-guide.html): processing real-time data streams
    
      * [Spark SQL and DataFrames](sql-programming-guide.html): support for structured data and relational queries
    
      * [MLlib](mllib-guide.html): built-in machine learning library
      * [GraphX](graphx-programming-guide.html): Spark's new API for graph processing
      * [Bagel (Pregel on Spark)](bagel-programming-guide.html): older, simple graph processing model
    
**API Docs:**

* [Spark Scala API (Scaladoc)](api/scala/index.html#org.apache.spark.package)
    * [Spark Java API (Javadoc)](api/java/index.html)
    
    * [Spark Python API (Sphinx)](api/python/index.html)
    * [Spark R API (Roxygen2)](api/R/index.html)
    
**Deployment Guides:**
    * [Cluster Overview](cluster-overview.html): overview of concepts and components when running on a cluster
    
    * [Submitting Applications](submitting-applications.html): packaging and deploying applications
    * Deployment modes:
      * [Amazon EC2](ec2-scripts.html): scripts that let you launch a cluster on EC2 in about 5 minutes
      * [Standalone Deploy Mode](spark-standalone.html): launch a standalone cluster quickly without a third-party cluster manager
      * [Mesos](running-on-mesos.html): deploy a private cluster using
          [Apache Mesos](http://mesos.apache.org)
      * [YARN](running-on-yarn.html): deploy Spark on top of Hadoop NextGen (YARN)
    
**Other Documents:**
    * [Configuration](configuration.html): customize Spark via its configuration system
    
    * [Monitoring](monitoring.html): track the behavior of your applications
    
    * [Tuning Guide](tuning.html): best practices to optimize performance and memory use
    
    * [Job Scheduling](job-scheduling.html): scheduling resources across and within Spark applications
    
    * [Security](security.html): Spark security support
    
    * [Hardware Provisioning](hardware-provisioning.html): recommendations for cluster hardware
    
    * [3<sup>rd</sup> Party Hadoop Distributions](hadoop-third-party-distributions.html): using common Hadoop distributions
    
    * Integration with other storage systems:
      * [OpenStack Swift](storage-openstack-swift.html)
    
    * [Building Spark](building-spark.html): build Spark using the Maven system
    
    * [Contributing to Spark](https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark)
    
    * [Supplemental Projects](https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects): related third party Spark projects
    
**External Resources:**
    * [Spark Homepage](http://spark.apache.org)
    
    * [Spark Wiki](https://cwiki.apache.org/confluence/display/SPARK)
    
    * [Spark Community](http://spark.apache.org/community.html) resources, including local meetups
    * [StackOverflow tag `apache-spark`](http://stackoverflow.com/questions/tagged/apache-spark)
    
    * [Mailing Lists](http://spark.apache.org/mailing-lists.html): ask questions about Spark here
    
* [AMP Camps](http://ampcamp.berkeley.edu/): a series of training camps at UC Berkeley that featured talks and exercises about Spark, Spark Streaming, Mesos, and more. [Videos](http://ampcamp.berkeley.edu/3/), [slides](http://ampcamp.berkeley.edu/3/) and [exercises](http://ampcamp.berkeley.edu/3/exercises/) are available online for free.
    
* [Code Examples](http://spark.apache.org/examples.html): more are also available in the `examples` subfolder of Spark ([Scala]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/scala/org/apache/spark/examples),
  [Java]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/java/org/apache/spark/examples),
  [Python]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/python),
  [R]({{site.SPARK_GITHUB_URL}}/tree/master/examples/src/main/r))