-
gatorsmile authored
This fix is to add one line to explain the current behavior of Spark SQL when writing Parquet files. All columns are forced to be nullable for compatibility reasons. Author: gatorsmile <gatorsmile@gmail.com> Closes #9314 from gatorsmile/lossNull.
gatorsmile authoredThis fix is to add one line to explain the current behavior of Spark SQL when writing Parquet files. All columns are forced to be nullable for compatibility reasons. Author: gatorsmile <gatorsmile@gmail.com> Closes #9314 from gatorsmile/lossNull.
layout: global
displayTitle: Spark SQL and DataFrame Guide
title: Spark SQL and DataFrames
- This will become a table of contents (this text will be scraped). {:toc}
Overview
Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.
Spark SQL can also be used to read data from an existing Hive installation. For more on how to configure this feature, please refer to the Hive Tables section.
DataFrames
A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
The DataFrame API is available in Scala, Java, Python, and R.
All of the examples on this page use sample data included in the Spark distribution and can be run in the spark-shell
, pyspark
shell, or sparkR
shell.
Starting Point: SQLContext
The entry point into all functionality in Spark SQL is the
SQLContext
class, or one of its
descendants. To create a basic SQLContext
, all you need is a SparkContext.
{% highlight scala %} val sc: SparkContext // An existing SparkContext. val sqlContext = new org.apache.spark.sql.SQLContext(sc)
// this is used to implicitly convert an RDD to a DataFrame. import sqlContext.implicits._ {% endhighlight %}
The entry point into all functionality in Spark SQL is the
SQLContext
class, or one of its
descendants. To create a basic SQLContext
, all you need is a SparkContext.
{% highlight java %} JavaSparkContext sc = ...; // An existing JavaSparkContext. SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc); {% endhighlight %}
The entry point into all relational functionality in Spark is the
SQLContext
class, or one
of its decedents. To create a basic SQLContext
, all you need is a SparkContext.
{% highlight python %} from pyspark.sql import SQLContext sqlContext = SQLContext(sc) {% endhighlight %}
The entry point into all relational functionality in Spark is the
SQLContext
class, or one of its decedents. To create a basic SQLContext
, all you need is a SparkContext.
{% highlight r %} sqlContext <- sparkRSQL.init(sc) {% endhighlight %}
In addition to the basic SQLContext
, you can also create a HiveContext
, which provides a
superset of the functionality provided by the basic SQLContext
. Additional features include
the ability to write queries using the more complete HiveQL parser, access to Hive UDFs, and the
ability to read data from Hive tables. To use a HiveContext
, you do not need to have an
existing Hive setup, and all of the data sources available to a SQLContext
are still available.
HiveContext
is only packaged separately to avoid including all of Hive's dependencies in the default
Spark build. If these dependencies are not a problem for your application then using HiveContext
is recommended for the 1.3 release of Spark. Future releases will focus on bringing SQLContext
up
to feature parity with a HiveContext
.
The specific variant of SQL that is used to parse queries can also be selected using the
spark.sql.dialect
option. This parameter can be changed using either the setConf
method on
a SQLContext
or by using a SET key=value
command in SQL. For a SQLContext
, the only dialect
available is "sql" which uses a simple SQL parser provided by Spark SQL. In a HiveContext
, the
default is "hiveql", though "sql" is also available. Since the HiveQL parser is much more complete,
this is recommended for most use cases.
Creating DataFrames
With a SQLContext
, applications can create DataFrame
s from an existing RDD
, from a Hive table, or from data sources.
As an example, the following creates a DataFrame
based on the content of a JSON file:
val df = sqlContext.read.json("examples/src/main/resources/people.json")
// Displays the content of the DataFrame to stdout df.show() {% endhighlight %}
DataFrame df = sqlContext.read().json("examples/src/main/resources/people.json");
// Displays the content of the DataFrame to stdout df.show(); {% endhighlight %}
df = sqlContext.read.json("examples/src/main/resources/people.json")
Displays the content of the DataFrame to stdout
df.show() {% endhighlight %}
df <- jsonFile(sqlContext, "examples/src/main/resources/people.json")
Displays the content of the DataFrame to stdout
showDF(df) {% endhighlight %}
DataFrame Operations
DataFrames provide a domain-specific language for structured data manipulation in Scala, Java, Python and R.
Here we include some basic examples of structured data processing using DataFrames: