Commit 84073eb1 authored by Michael Armbrust

[SQL][DOCS] Improve section on thrift-server

Taken from liancheng's updates. Merged conflicts with #2316.

Author: Michael Armbrust <michael@databricks.com>

Closes #2384 from marmbrus/sqlDocUpdate and squashes the following commits:

2db6319 [Michael Armbrust] @liancheng's updates
parent df90e81f
@@ -128,7 +128,7 @@ feature parity with a HiveContext.
</div>
The specific variant of SQL that is used to parse queries can also be selected using the
`spark.sql.dialect` option. This parameter can be changed using either the `setConf` method on
a SQLContext or by using a `SET key=value` command in SQL. For a SQLContext, the only dialect
available is "sql" which uses a simple SQL parser provided by Spark SQL. In a HiveContext, the
@@ -139,7 +139,7 @@ default is "hiveql", though "sql" is also available. Since the HiveQL parser is
Spark SQL supports operating on a variety of data sources through the `SchemaRDD` interface.
A SchemaRDD can be operated on like a normal RDD and can also be registered as a temporary table.
Registering a SchemaRDD as a table allows you to run SQL queries over its data. This section
describes the various methods for loading data into a SchemaRDD.
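
As a quick illustration, here is a minimal sketch of registering and querying a table, assuming an existing `SQLContext` named `sqlContext` and a SchemaRDD named `people` (both names are illustrative):

{% highlight scala %}
// Register the SchemaRDD as a temporary table so it can be queried with SQL.
people.registerTempTable("people")

// The result of a SQL query is itself a SchemaRDD and supports normal RDD operations.
val teenagers = sqlContext.sql("SELECT name FROM people WHERE age >= 13 AND age <= 19")
teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
{% endhighlight %}
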
## RDDs
@@ -152,7 +152,7 @@ while writing your Spark application.
The second method for creating SchemaRDDs is through a programmatic interface that allows you to
construct a schema and then apply it to an existing RDD. While this method is more verbose, it allows
you to construct SchemaRDDs when the columns and their types are not known until runtime.
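
A rough sketch of this programmatic route, assuming an existing `SQLContext` named `sqlContext` and an `RDD[String]` of comma-separated `name,age` records named `lines` (names chosen for illustration only):

{% highlight scala %}
import org.apache.spark.sql._

// Build the schema at runtime as a Seq of StructFields.
val schema = StructType(Seq(
  StructField("name", StringType, true),
  StructField("age", StringType, true)))

// Convert each record into a Row that matches the schema, then apply the schema.
val rowRDD = lines.map(_.split(",")).map(p => Row(p(0), p(1).trim))
val peopleSchemaRDD = sqlContext.applySchema(rowRDD, schema)
peopleSchemaRDD.registerTempTable("people")
{% endhighlight %}
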
### Inferring the Schema Using Reflection
<div class="codetabs">
@@ -193,7 +193,7 @@ teenagers.map(t => "Name: " + t(0)).collect().foreach(println)
<div data-lang="java" markdown="1">
Spark SQL supports automatically converting an RDD of [JavaBeans](http://stackoverflow.com/questions/3295496/what-is-a-javabean-exactly)
into a SchemaRDD. The BeanInfo, obtained using reflection, defines the schema of the table.
Currently, Spark SQL does not support JavaBeans that contain
nested or complex types such as Lists or Arrays. You can create a JavaBean by creating a
class that implements Serializable and has getters and setters for all of its fields.
@@ -480,7 +480,7 @@ for name in names.collect():
[Parquet](http://parquet.io) is a columnar format that is supported by many other data processing systems.
Spark SQL provides support for both reading and writing Parquet files that automatically preserves the schema
of the original data.
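
For instance, a minimal sketch, assuming an existing `SQLContext` named `sqlContext` and a SchemaRDD named `people`:

{% highlight scala %}
// Write the SchemaRDD out as a Parquet file; its schema is preserved in the file.
people.saveAsParquetFile("people.parquet")

// Read it back. The result is a SchemaRDD whose schema is recovered from the Parquet metadata.
val parquetFile = sqlContext.parquetFile("people.parquet")
parquetFile.registerTempTable("parquetFile")
{% endhighlight %}
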
### Loading Data Programmatically
@@ -562,7 +562,7 @@ for teenName in teenNames.collect():
</div>
</div>
### Configuration
@@ -808,7 +808,7 @@ memory usage and GC pressure. You can call `uncacheTable("tableName")` to remove
Note that if you call `cache` rather than `cacheTable`, tables will _not_ be cached using
the in-memory columnar format, and therefore `cacheTable` is strongly recommended for this use case.
Configuration of in-memory caching can be done using the `setConf` method on SQLContext or by running
`SET key=value` commands using SQL.
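
For example, a minimal sketch, assuming an existing `SQLContext` named `sqlContext`, a table already registered as `"tableName"`, and the in-memory columnar storage properties (`spark.sql.inMemoryColumnarStorage.compressed` and `spark.sql.inMemoryColumnarStorage.batchSize`) from Spark SQL's configuration:

{% highlight scala %}
// Cache the table using the in-memory columnar format.
sqlContext.cacheTable("tableName")

// Tune the columnar cache through setConf ...
sqlContext.setConf("spark.sql.inMemoryColumnarStorage.batchSize", "10000")

// ... or through a SQL SET command.
sqlContext.sql("SET spark.sql.inMemoryColumnarStorage.compressed=true")
{% endhighlight %}
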
<table class="table">
@@ -881,10 +881,32 @@ To start the JDBC server, run the following in the Spark directory:
./sbin/start-thriftserver.sh
-The default port the server listens on is 10000. To listen on customized host and port, please set
-the `HIVE_SERVER2_THRIFT_PORT` and `HIVE_SERVER2_THRIFT_BIND_HOST` environment variables. You may
-run `./sbin/start-thriftserver.sh --help` for a complete list of all available options. Now you can
-use beeline to test the Thrift JDBC server:
+This script accepts all `bin/spark-submit` command line options, plus a `--hiveconf` option to
+specify Hive properties. You may run `./sbin/start-thriftserver.sh --help` for a complete list of
+all available options. By default, the server listens on localhost:10000. You may override this
+behaviour via either environment variables, i.e.:
+
+{% highlight bash %}
+export HIVE_SERVER2_THRIFT_PORT=<listening-port>
+export HIVE_SERVER2_THRIFT_BIND_HOST=<listening-host>
+./sbin/start-thriftserver.sh \
+  --master <master-uri> \
+  ...
+{% endhighlight %}
+
+or system properties:
+
+{% highlight bash %}
+./sbin/start-thriftserver.sh \
+  --hiveconf hive.server2.thrift.port=<listening-port> \
+  --hiveconf hive.server2.thrift.bind.host=<listening-host> \
+  --master <master-uri>
+  ...
+{% endhighlight %}
+
+Now you can use beeline to test the Thrift JDBC server:
./bin/beeline
@@ -930,7 +952,7 @@ SQL deprecates this property in favor of `spark.sql.shuffle.partitions`, whose d
is 200. Users may customize this property via `SET`:
SET spark.sql.shuffle.partitions=10;
SELECT page, count(*) c
FROM logs_last_month_cached
GROUP BY page ORDER BY c DESC LIMIT 10;
@@ -1139,7 +1161,7 @@ evaluated by the SQL execution engine. A full list of the functions supported c
<div data-lang="scala" markdown="1">
All data types of Spark SQL are located in the package `org.apache.spark.sql`.
You can access them by doing
{% highlight scala %}
import org.apache.spark.sql._
{% endhighlight %}
@@ -1245,7 +1267,7 @@ import org.apache.spark.sql._
<tr>
<td> <b>StructType</b> </td>
<td> org.apache.spark.sql.Row </td>
<td>
StructType(<i>fields</i>)<br />
<b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
name are not allowed.
@@ -1267,7 +1289,7 @@ import org.apache.spark.sql._
All data types of Spark SQL are located in the package of
`org.apache.spark.sql.api.java`. To access or create a data type,
please use factory methods provided in
`org.apache.spark.sql.api.java.DataType`.
<table class="table">
@@ -1373,7 +1395,7 @@ please use factory methods provided in
<tr>
<td> <b>StructType</b> </td>
<td> org.apache.spark.sql.api.java </td>
<td>
DataType.createStructType(<i>fields</i>)<br />
<b>Note:</b> <i>fields</i> is a List or an array of StructFields.
Also, two fields with the same name are not allowed.
@@ -1394,7 +1416,7 @@ please use factory methods provided in
<div data-lang="python" markdown="1">
All data types of Spark SQL are located in the package of `pyspark.sql`.
You can access them by doing
{% highlight python %}
from pyspark.sql import *
{% endhighlight %}
@@ -1518,7 +1540,7 @@ from pyspark.sql import *
<tr>
<td> <b>StructType</b> </td>
<td> list or tuple </td>
<td>
StructType(<i>fields</i>)<br />
<b>Note:</b> <i>fields</i> is a Seq of StructFields. Also, two fields with the same
name are not allowed.