Commit 53aa8316 authored by Nicholas Chammas, committed by Michael Armbrust

[Docs] SQL doc formatting and typo fixes

As [reported on the dev list](http://apache-spark-developers-list.1001551.n3.nabble.com/VOTE-Release-Apache-Spark-1-1-0-RC2-tp8107p8131.html):
* Code fencing with triple-backticks doesn’t seem to work like it does on GitHub. Newlines are lost. Instead, use 4-space indent to format small code blocks.
* Nested bullets need 2 leading spaces, not 1.
* Spellcheck!

Author: Nicholas Chammas <nicholas.chammas@gmail.com>
Author: nchammas <nicholas.chammas@gmail.com>

Closes #2201 from nchammas/sql-doc-fixes and squashes the following commits:

873f889 [Nicholas Chammas] [Docs] fix skip-api flag
5195e0c [Nicholas Chammas] [Docs] SQL doc formatting and typo fixes
3b26c8d [nchammas] [Spark QA] Link to console output on test time out
parent e248328b
@@ -30,7 +30,7 @@ called `_site` containing index.html as well as the rest of the compiled files.
 You can modify the default Jekyll build as follows:
 # Skip generating API docs (which takes a while)
-$ SKIP_SCALADOC=1 jekyll build
+$ SKIP_API=1 jekyll build
 # Serve content locally on port 4000
 $ jekyll serve --watch
 # Build the site with extra features used on the live page
@@ -474,10 +474,10 @@ anotherPeople = sqlContext.jsonRDD(anotherPeopleRDD)
 Spark SQL also supports reading and writing data stored in [Apache Hive](http://hive.apache.org/).
 However, since Hive has a large number of dependencies, it is not included in the default Spark assembly.
-In order to use Hive you must first run '`sbt/sbt -Phive assembly/assembly`' (or use `-Phive` for maven).
+In order to use Hive you must first run "`sbt/sbt -Phive assembly/assembly`" (or use `-Phive` for maven).
 This command builds a new assembly jar that includes Hive. Note that this Hive assembly jar must also be present
 on all of the worker nodes, as they will need access to the Hive serialization and deserialization libraries
-(SerDes) in order to acccess data stored in Hive.
+(SerDes) in order to access data stored in Hive.
 Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`.
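
For orientation, here is a minimal Scala sketch of the Hive access this section of the guide describes. It assumes `sc` is an existing `SparkContext` (as in `spark-shell`) and uses a hypothetical `src` table; treat it as an illustration, not as text added by this commit.

```scala
// Minimal sketch of querying Hive through Spark SQL (Spark 1.1-era API).
// Assumes `sc` is an existing SparkContext and that the Hive-enabled
// assembly and conf/hive-site.xml are in place, as described above.
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)

// `src` is a hypothetical table; any existing Hive table works the same way.
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
hiveContext.sql("SELECT key, value FROM src LIMIT 10").collect().foreach(println)
```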
@@ -576,9 +576,8 @@ evaluated by the SQL execution engine. A full list of the functions supported c
 ## Running the Thrift JDBC server
-The Thrift JDBC server implemented here corresponds to the [`HiveServer2`]
-(https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) in Hive 0.12. You can test
-the JDBC server with the beeline script comes with either Spark or Hive 0.12.
+The Thrift JDBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2)
+in Hive 0.12. You can test the JDBC server with the beeline script comes with either Spark or Hive 0.12.
 To start the JDBC server, run the following in the Spark directory:
@@ -597,7 +596,7 @@ Connect to the JDBC server in beeline with:
 Beeline will ask you for a username and password. In non-secure mode, simply enter the username on
 your machine and a blank password. For secure mode, please follow the instructions given in the
-[beeline documentation](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients)
+[beeline documentation](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients).
 Configuration of Hive is done by placing your `hive-site.xml` file in `conf/`.
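
Besides beeline, any JDBC client can talk to the Thrift server. The sketch below is an assumption-laden illustration: the driver class, the default port 10000, and the blank password for non-secure mode are the usual HiveServer2 conventions, not something specified by this commit, and the Hive JDBC driver jar must be on the classpath.

```scala
// Sketch of a programmatic client for the Thrift JDBC server, as an
// alternative to the beeline CLI. Assumes the Hive JDBC driver is on the
// classpath and the server is listening on the default port 10000.
import java.sql.DriverManager

object JdbcClientSketch {
  def main(args: Array[String]): Unit = {
    Class.forName("org.apache.hive.jdbc.HiveDriver")
    // In non-secure mode, use your local username and a blank password.
    val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
    val stmt = conn.createStatement()
    val rs = stmt.executeQuery("SHOW TABLES")
    while (rs.next()) println(rs.getString(1))
    rs.close()
    stmt.close()
    conn.close()
  }
}
```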
@@ -616,11 +615,10 @@ In Shark, default reducer number is 1 and is controlled by the property `mapred.
 SQL deprecates this property by a new property `spark.sql.shuffle.partitions`, whose default value
 is 200. Users may customize this property via `SET`:
-```
-SET spark.sql.shuffle.partitions=10;
-SELECT page, count(*) c FROM logs_last_month_cached
-GROUP BY page ORDER BY c DESC LIMIT 10;
-```
+    SET spark.sql.shuffle.partitions=10;
+    SELECT page, count(*) c
+    FROM logs_last_month_cached
+    GROUP BY page ORDER BY c DESC LIMIT 10;
 You may also put this property in `hive-site.xml` to override the default value.
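
The same `SET` statement can also be issued from program code. A brief sketch, assuming a `HiveContext` named `hiveContext` as in the earlier sketch and the illustrative `logs_last_month_cached` table from the snippet above:

```scala
// Sketch: tuning spark.sql.shuffle.partitions from Scala before a
// shuffle-heavy aggregation. Assumes `hiveContext` from the sketch above;
// table and column names are illustrative.
hiveContext.sql("SET spark.sql.shuffle.partitions=10")
val topPages = hiveContext.sql(
  """SELECT page, count(*) c
    |FROM logs_last_month_cached
    |GROUP BY page ORDER BY c DESC LIMIT 10""".stripMargin)
topPages.collect().foreach(println)
```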
@@ -630,22 +628,18 @@ For now, the `mapred.reduce.tasks` property is still recognized, and is converte
 #### Caching
 The `shark.cache` table property no longer exists, and tables whose name end with `_cached` are no
-longer automcatically cached. Instead, we provide `CACHE TABLE` and `UNCACHE TABLE` statements to
+longer automatically cached. Instead, we provide `CACHE TABLE` and `UNCACHE TABLE` statements to
 let user control table caching explicitly:
-```
-CACHE TABLE logs_last_month;
-UNCACHE TABLE logs_last_month;
-```
+    CACHE TABLE logs_last_month;
+    UNCACHE TABLE logs_last_month;
-**NOTE** `CACHE TABLE tbl` is lazy, it only marks table `tbl` as "need to by cached if necessary",
+**NOTE:** `CACHE TABLE tbl` is lazy, it only marks table `tbl` as "need to by cached if necessary",
 but doesn't actually cache it until a query that touches `tbl` is executed. To force the table to be
 cached, you may simply count the table immediately after executing `CACHE TABLE`:
-```
-CACHE TABLE logs_last_month;
-SELECT COUNT(1) FROM logs_last_month;
-```
+    CACHE TABLE logs_last_month;
+    SELECT COUNT(1) FROM logs_last_month;
 Several caching related features are not supported yet:
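
As a companion to the `CACHE TABLE` / `UNCACHE TABLE` statements shown in the hunk above, a short Scala sketch of the programmatic equivalent; `cacheTable`/`uncacheTable` live on `SQLContext` in this era, and the table name here is illustrative:

```scala
// Sketch: caching a table from code instead of with CACHE TABLE.
// Assumes `hiveContext` from the earlier sketch. Like CACHE TABLE,
// caching is lazy; a count forces materialization.
hiveContext.cacheTable("logs_last_month")
hiveContext.sql("SELECT COUNT(1) FROM logs_last_month").collect()
// ... run queries against the cached data ...
hiveContext.uncacheTable("logs_last_month")
```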
@@ -655,7 +649,7 @@ Several caching related features are not supported yet:
 ### Compatibility with Apache Hive
-#### Deploying in Exising Hive Warehouses
+#### Deploying in Existing Hive Warehouses
 Spark SQL Thrift JDBC server is designed to be "out of the box" compatible with existing Hive
 installations. You do not need to modify your existing Hive Metastore or change the data placement
@@ -666,50 +660,50 @@ or partitioning of your tables.
 Spark SQL supports the vast majority of Hive features, such as:
 * Hive query statements, including:
   * `SELECT`
-  * `GROUP BY
+  * `GROUP BY`
   * `ORDER BY`
   * `CLUSTER BY`
   * `SORT BY`
 * All Hive operators, including:
-  * Relational operators (`=`, ``, `==`, `<>`, `<`, `>`, `>=`, `<=`, etc)
+  * Relational operators (`=`, `⇔`, `==`, `<>`, `<`, `>`, `>=`, `<=`, etc)
-  * Arthimatic operators (`+`, `-`, `*`, `/`, `%`, etc)
+  * Arithmetic operators (`+`, `-`, `*`, `/`, `%`, etc)
   * Logical operators (`AND`, `&&`, `OR`, `||`, etc)
   * Complex type constructors
-  * Mathemtatical functions (`sign`, `ln`, `cos`, etc)
+  * Mathematical functions (`sign`, `ln`, `cos`, etc)
   * String functions (`instr`, `length`, `printf`, etc)
 * User defined functions (UDF)
 * User defined aggregation functions (UDAF)
-* User defined serialization formats (SerDe's)
+* User defined serialization formats (SerDes)
 * Joins
   * `JOIN`
   * `{LEFT|RIGHT|FULL} OUTER JOIN`
   * `LEFT SEMI JOIN`
   * `CROSS JOIN`
 * Unions
-* Sub queries
+* Sub-queries
   * `SELECT col FROM ( SELECT a + b AS col from t1) t2`
 * Sampling
 * Explain
 * Partitioned tables
 * All Hive DDL Functions, including:
   * `CREATE TABLE`
   * `CREATE TABLE AS SELECT`
   * `ALTER TABLE`
 * Most Hive Data types, including:
   * `TINYINT`
   * `SMALLINT`
   * `INT`
   * `BIGINT`
   * `BOOLEAN`
   * `FLOAT`
   * `DOUBLE`
   * `STRING`
   * `BINARY`
   * `TIMESTAMP`
   * `ARRAY<>`
   * `MAP<>`
   * `STRUCT<>`
 #### Unsupported Hive Functionality
@@ -749,8 +743,7 @@ releases of Spark SQL.
 Hive automatically converts the join into a map join. We are adding this auto conversion in the
 next release.
 * Automatically determine the number of reducers for joins and groupbys: Currently in Spark SQL, you
-need to control the degree of parallelism post-shuffle using "SET
-spark.sql.shuffle.partitions=[num_tasks];". We are going to add auto-setting of parallelism in the
+need to control the degree of parallelism post-shuffle using "`SET spark.sql.shuffle.partitions=[num_tasks];`". We are going to add auto-setting of parallelism in the
 next release.
 * Meta-data only query: For queries that can be answered by using only meta data, Spark SQL still
 launches tasks to compute the result.