# Apache Spark

Spark is a fast and general cluster computing system for Big Data. It provides
high-level APIs in Scala, Java, Python, and R, and an optimized engine that
supports general computation graphs for data analysis. It also supports a
rich set of higher-level tools including Spark SQL for SQL and DataFrames,
MLlib for machine learning, GraphX for graph processing,
and Spark Streaming for stream processing.
    
## Online Documentation

You can find the latest Spark documentation, including a programming
guide, on the [project web page](http://spark.apache.org/documentation.html).
This README file only contains basic setup instructions.
    
## Building Spark

Spark is built using [Apache Maven](http://maven.apache.org/).
To build Spark and its example programs, run:

    build/mvn -DskipTests clean package

(You do not need to do this if you downloaded a pre-built package.)

You can build Spark using more than one thread by using the -T option with Maven; see ["Parallel builds in Maven 3"](https://cwiki.apache.org/confluence/display/MAVEN/Parallel+builds+in+Maven+3).
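For example, an illustrative parallel build (`-T 1C` is standard Maven syntax requesting one build thread per CPU core):

    # illustrative: adjust the thread count to your machine
    build/mvn -T 1C -DskipTests clean package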
    
More detailed documentation is available from the project site, at
["Building Spark"](http://spark.apache.org/docs/latest/building-spark.html).

For general development tips, including info on developing Spark using an IDE, see
["Useful Developer Tools"](http://spark.apache.org/developer-tools.html).
    
## Interactive Scala Shell

The easiest way to start using Spark is through the Scala shell:

    ./bin/spark-shell
    
Try the following command, which should return 1000:

    scala> sc.parallelize(1 to 1000).count()
    
## Interactive Python Shell

Alternatively, if you prefer Python, you can use the Python shell:

    ./bin/pyspark
    
And run the following command, which should also return 1000:

    >>> sc.parallelize(range(1000)).count()
    
## Example Programs
    
Spark also comes with several sample programs in the `examples` directory.
To run one of them, use `./bin/run-example <class> [params]`. For example:
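
    ./bin/run-example SparkPi

will run the Pi example locally.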
    
You can set the MASTER environment variable when running examples to submit
examples to a cluster. This can be a mesos:// or spark:// URL,
"yarn" to run on YARN, and "local" to run
locally with one thread, or "local[N]" to run locally with N threads. You
can also use an abbreviated class name if the class is in the `examples`
package. For instance:

    MASTER=spark://host:7077 ./bin/run-example SparkPi
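
As an illustrative variation, the same example can be run locally with, say, four worker threads:

    # illustrative: "local[4]" runs the example locally with 4 threads
    MASTER="local[4]" ./bin/run-example SparkPi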
    
Many of the example programs print usage help if no params are given.
    
## Running Tests
    
Testing first requires [building Spark](#building-spark). Once Spark is built, tests
can be run using:

    ./dev/run-tests
    
Please see the guidance on how to
[run tests for a module, or individual tests](http://spark.apache.org/developer-tools.html#individual-tests).
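
As a rough sketch (the module name and flags here are illustrative; the page above documents the supported workflow), Maven's `-pl` option can limit a run to a single module:

    # illustrative: run only the tests of the `core` module,
    # assuming its dependencies have already been built
    build/mvn -pl core test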
    
## A Note About Hadoop Versions
    
Spark uses the Hadoop core library to talk to HDFS and other Hadoop-supported
storage systems. Because the protocols have changed in different versions of
Hadoop, you must build Spark against the same version that your cluster runs.

Please refer to the build documentation at
["Specifying the Hadoop Version"](http://spark.apache.org/docs/latest/building-spark.html#specifying-the-hadoop-version)
for detailed guidance on building for a particular distribution of Hadoop, including
building for particular Hive and Hive Thriftserver distributions.
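
As an illustrative sketch (the profile and version number are placeholders; the page above lists the options your Spark release actually supports), a Hadoop-specific build might look like:

    # illustrative: build against a specific Hadoop version for YARN clusters
    build/mvn -Pyarn -Dhadoop.version=2.7.3 -DskipTests clean package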
    
Please refer to the [Configuration Guide](http://spark.apache.org/docs/latest/configuration.html)
in the online documentation for an overview on how to configure Spark.
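
As one small illustrative example, the shells and `spark-submit` accept `--conf` for setting individual properties (see the guide for the full list of settings):

    # illustrative: set executor memory for this shell session
    ./bin/spark-shell --conf spark.executor.memory=2g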
    
Please review the [Contribution to Spark guide](http://spark.apache.org/contributing.html)
for information on how to get started contributing to the project.