Commit 60472dbf authored by hyukjinkwon, committed by Reynold Xin

[SPARK-21485][SQL][DOCS] Spark SQL documentation generation for built-in functions

## What changes were proposed in this pull request?

This generates documentation for Spark SQL built-in functions.

One drawback is that this requires a proper build of Spark to generate the built-in function list.
Once Spark is built, generating the documentation with `sql/create-docs.sh` only takes a few seconds.

Please see https://spark-test.github.io/sparksqldoc/, which I hosted to show the output documentation.

There is a bit more work to be done to make the documentation prettier, for example separating `Arguments:` and `Examples:`, but I think this should be done within `ExpressionDescription` and `ExpressionInfo` rather than by manually parsing the descriptions. I will address this in a follow-up.

This requires `pip install mkdocs` to generate HTML from the markdown files.
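
For reference, a rough end-to-end sketch of a local run (assuming a working Spark build environment) is:

```
pip install mkdocs          # one-time setup
build/sbt clean package     # from the Spark repository root; needed for the built-in function list
cd sql
./create-docs.sh            # HTML output ends up under sql/site
```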

## How was this patch tested?

Manually tested:

```
cd docs
jekyll build
```

```
cd docs
jekyll serve
```

and

```
cd sql
create-docs.sh
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #18702 from HyukjinKwon/SPARK-21485.
parent cf29828d
@@ -47,6 +47,8 @@ dev/pr-deps/
 dist/
 docs/_site
 docs/api
+sql/docs
+sql/site
 lib_managed/
 lint-r-report.log
 log/
@@ -68,6 +68,6 @@ jekyll plugin to run `build/sbt unidoc` before building the site so if you haven
 may take some time as it generates all of the scaladoc. The jekyll plugin also generates the
 PySpark docs using [Sphinx](http://sphinx-doc.org/).
 
-NOTE: To skip the step of building and copying over the Scala, Python, R API docs, run `SKIP_API=1
-jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, and `SKIP_RDOC=1` can be used to skip a single
-step of the corresponding language.
+NOTE: To skip the step of building and copying over the Scala, Python, R and SQL API docs, run `SKIP_API=1
+jekyll`. In addition, `SKIP_SCALADOC=1`, `SKIP_PYTHONDOC=1`, `SKIP_RDOC=1` and `SKIP_SQLDOC=1` can be used
+to skip a single step of the corresponding language.
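
For example, with the flags above, the SQL documentation step alone can be skipped while the rest of the site is still built (a usage sketch, assuming the docs toolchain is installed):

```
cd docs
SKIP_SQLDOC=1 jekyll build
```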
@@ -86,6 +86,7 @@
 <li><a href="api/java/index.html">Java</a></li>
 <li><a href="api/python/index.html">Python</a></li>
 <li><a href="api/R/index.html">R</a></li>
+<li><a href="api/sql/index.html">SQL, Built-in Functions</a></li>
 </ul>
 </li>
@@ -150,4 +150,31 @@ if not (ENV['SKIP_API'] == '1')
     cp("../R/pkg/DESCRIPTION", "api")
   end
+
+  if not (ENV['SKIP_SQLDOC'] == '1')
+    # Build SQL API docs
+    puts "Moving to project root and building API docs."
+    curr_dir = pwd
+    cd("..")
+    puts "Running 'build/sbt clean package' from " + pwd + "; this may take a few minutes..."
+    system("build/sbt clean package") || raise("SQL doc generation failed")
+    puts "Moving back into docs dir."
+    cd("docs")
+    puts "Moving to SQL directory and building docs."
+    cd("../sql")
+    system("./create-docs.sh") || raise("SQL doc generation failed")
+    puts "Moving back into docs dir."
+    cd("../docs")
+    puts "Making directory api/sql"
+    mkdir_p "api/sql"
+    puts "cp -r ../sql/site/. api/sql"
+    cp_r("../sql/site/.", "api/sql")
+  end
 end
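
As a rough sketch of how this plugin is exercised (assuming jekyll, mkdocs and the Scala build toolchain are available), a plain `jekyll build` in `docs/` should run the block above and copy the SQL pages into the generated site:

```
cd docs
jekyll build
# the pages copied into api/sql should then appear under the generated site, e.g.
ls _site/api/sql/index.html
```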
@@ -9,3 +9,4 @@ Here you can read API docs for Spark and its submodules.
 - [Spark Java API (Javadoc)](api/java/index.html)
 - [Spark Python API (Sphinx)](api/python/index.html)
 - [Spark R API (Roxygen2)](api/R/index.html)
+- [Spark SQL, Built-in Functions (MkDocs)](api/sql/index.html)
@@ -100,6 +100,7 @@ options for deployment:
 * [Spark Java API (Javadoc)](api/java/index.html)
 * [Spark Python API (Sphinx)](api/python/index.html)
 * [Spark R API (Roxygen2)](api/R/index.html)
+* [Spark SQL, Built-in Functions (MkDocs)](api/sql/index.html)
 
 **Deployment Guides:**
@@ -8,3 +8,5 @@ Spark SQL is broken up into four subprojects:
 - Execution (sql/core) - A query planner / execution engine for translating Catalyst's logical query plans into Spark RDDs. This component also includes a new public interface, SQLContext, that allows users to execute SQL or LINQ statements against existing RDDs and Parquet files.
 - Hive Support (sql/hive) - Includes an extension of SQLContext called HiveContext that allows users to write queries using a subset of HiveQL and access data from a Hive Metastore using Hive SerDes. There are also wrappers that allows users to run queries that include Hive UDFs, UDAFs, and UDTFs.
 - HiveServer and CLI support (sql/hive-thriftserver) - Includes support for the SQL CLI (bin/spark-sql) and a HiveServer2 (for JDBC/ODBC) compatible server.
+
+Running `sql/create-docs.sh` generates SQL documentation for built-in functions under `sql/site`.
@@ -17,9 +17,16 @@
 package org.apache.spark.sql.api.python
 
+import org.apache.spark.sql.catalyst.analysis.FunctionRegistry
+import org.apache.spark.sql.catalyst.expressions.ExpressionInfo
 import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
 import org.apache.spark.sql.types.DataType
 
 private[sql] object PythonSQLUtils {
   def parseDataType(typeText: String): DataType = CatalystSqlParser.parseDataType(typeText)
+
+  // This is needed when generating SQL documentation for built-in functions.
+  def listBuiltinFunctionInfos(): Array[ExpressionInfo] = {
+    FunctionRegistry.functionSet.flatMap(f => FunctionRegistry.builtin.lookupFunction(f)).toArray
+  }
 }
sql/create-docs.sh (new file):
#!/bin/bash
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
# Script to create SQL API docs. This requires `mkdocs` and to build
# Spark first. After running this script the html docs can be found in
# $SPARK_HOME/sql/site
set -o pipefail
set -e
FWDIR="$(cd "`dirname "${BASH_SOURCE[0]}"`"; pwd)"
SPARK_HOME="$(cd "`dirname "${BASH_SOURCE[0]}"`"/..; pwd)"
if ! hash python 2>/dev/null; then
echo "Missing python in your path, skipping SQL documentation generation."
exit 0
fi
if ! hash mkdocs 2>/dev/null; then
echo "Missing mkdocs in your path, skipping SQL documentation generation."
exit 0
fi
# Now create the markdown file
rm -fr docs
mkdir docs
echo "Generating markdown files for SQL documentation."
"$SPARK_HOME/bin/spark-submit" gen-sql-markdown.py
# Now create the HTML files
echo "Generating HTML files for SQL documentation."
mkdocs build --clean
rm -fr docs
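
Not part of this patch, but a simple way to inspect the result locally (assuming Python 3 is on the path) is to serve the generated `sql/site` directory with a static file server:

```
cd sql/site
python -m http.server 8000
# then open http://localhost:8000 in a browser
```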
sql/gen-sql-markdown.py (new file):
#
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
#
import sys
import os
from collections import namedtuple
ExpressionInfo = namedtuple("ExpressionInfo", "className usage name extended")
def _list_function_infos(jvm):
"""
Returns a list of function information via JVM. Sorts wrapped expression infos by name
and returns them.
"""
jinfos = jvm.org.apache.spark.sql.api.python.PythonSQLUtils.listBuiltinFunctionInfos()
infos = []
for jinfo in jinfos:
name = jinfo.getName()
usage = jinfo.getUsage()
usage = usage.replace("_FUNC_", name) if usage is not None else usage
extended = jinfo.getExtended()
extended = extended.replace("_FUNC_", name) if extended is not None else extended
infos.append(ExpressionInfo(
className=jinfo.getClassName(),
usage=usage,
name=name,
extended=extended))
return sorted(infos, key=lambda i: i.name)
def _make_pretty_usage(usage):
"""
Makes the usage description pretty and returns a formatted string.
Otherwise, returns None.
"""
if usage is not None and usage.strip() != "":
usage = "\n".join(map(lambda u: u.strip(), usage.split("\n")))
return "%s\n\n" % usage
def _make_pretty_extended(extended):
"""
Makes the extended description pretty and returns a formatted string.
Otherwise, returns None.
"""
if extended is not None and extended.strip() != "":
extended = "\n".join(map(lambda u: u.strip(), extended.split("\n")))
return "```%s```\n\n" % extended
def generate_sql_markdown(jvm, path):
"""
Generates a markdown file after listing the function information. The output file
is created in `path`.
"""
with open(path, 'w') as mdfile:
for info in _list_function_infos(jvm):
mdfile.write("### %s\n\n" % info.name)
usage = _make_pretty_usage(info.usage)
extended = _make_pretty_extended(info.extended)
if usage is not None:
mdfile.write(usage)
if extended is not None:
mdfile.write(extended)
if __name__ == "__main__":
from pyspark.java_gateway import launch_gateway
jvm = launch_gateway().jvm
markdown_file_path = "%s/docs/index.md" % os.path.dirname(sys.argv[0])
generate_sql_markdown(jvm, markdown_file_path)
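
For illustration, each entry written to `index.md` is a level-3 heading, the usage text with `_FUNC_` replaced by the function name, and then the extended description wrapped in a fenced block. For a built-in such as `abs`, the start of the generated entry would look roughly like this (the exact wording comes from the function's `ExpressionDescription`):

```
### abs

abs(expr) - Returns the absolute value of the numeric value.
```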
sql/mkdocs.yml (new file):
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
site_name: Spark SQL, Built-in Functions
theme: readthedocs
pages:
- 'Functions': 'index.md'