Commits · d9ca9fd3e582f9d29f8887c095637c93a8b93651 · cs525-sp18-g07 / spark

May 10, 2016

[SPARK-14837][SQL][STREAMING] Added support in file stream source for reading... · d9ca9fd3

Tathagata Das authored 8 years ago

[SPARK-14837][SQL][STREAMING] Added support in file stream source for reading new files added to subdirs

## What changes were proposed in this pull request?
Currently, the file stream source can only find new files if they appear in the directory given to the source, but not if they appear in subdirs. This PR adds support for providing glob patterns when creating a file stream source so that it can find new files in nested directories based on the glob pattern.
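
As a rough illustration of why a glob is needed to reach files in subdirs, here is a minimal sketch in plain Scala. It uses the JDK's `java.nio` glob matcher rather than the Hadoop glob implementation that Spark relies on, so the syntax is similar but not guaranteed identical; `GlobSketch` and `matchesGlob` are hypothetical names used only for this example.

```scala
import java.nio.file.{FileSystems, Paths}

object GlobSketch {
  // Hypothetical helper: true if `path` matches the glob `pattern`.
  // Uses java.nio glob syntax (an assumption for illustration), where `*`
  // does not cross directory boundaries.
  def matchesGlob(pattern: String, path: String): Boolean = {
    val matcher = FileSystems.getDefault.getPathMatcher("glob:" + pattern)
    matcher.matches(Paths.get(path))
  }
}
```

With this helper, a pattern such as `data/*.txt` only covers files directly under `data`, while `data/*/*.txt` also covers files one level down, which is the kind of nested layout a partitioned directory produces.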

## How was this patch tested?

Unit test that tests when new files are discovered with globs and partitioned directories.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12616 from tdas/SPARK-14837.

d9ca9fd3

[SPARK-14936][BUILD][TESTS] FlumePollingStreamSuite is slow · 86475520

Xin Ren authored 8 years ago

https://issues.apache.org/jira/browse/SPARK-14936

## What changes were proposed in this pull request?

FlumePollingStreamSuite contains two tests which run for a minute each. This seems excessively slow and we should speed it up if possible.

In this PR, instead of creating each `StreamingContext` directly from `conf`, an underlying `SparkContext` is created once up front and is used to create each `StreamingContext`.

Running time is reduced by avoiding repeated `SparkContext` creation and teardown.

## How was this patch tested?

Tested on my local machine running `testOnly *.FlumePollingStreamSuite`

Author: Xin Ren <iamshrek@126.com>

Closes #12845 from keypointt/SPARK-14936.

86475520

[SPARK-15249][SQL] Use FunctionResource instead of (String, String) in... · da02d006

Sandeep Singh authored 8 years ago

[SPARK-15249][SQL] Use FunctionResource instead of (String, String) in CreateFunction and CatalogFunction for resource

Use FunctionResource instead of (String, String) in CreateFunction and CatalogFunction for resource
see: TODO's here
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/interface.scala#L36
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/command/functions.scala#L42

Existing tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13024 from techaddict/SPARK-15249.

da02d006

[SPARK-6005][TESTS] Fix flaky test: o.a.s.streaming.kafka.DirectKafkaStreamSuite.offset recovery · 9533f539

Shixiong Zhu authored 8 years ago

## What changes were proposed in this pull request?

Because this test extracts data from `DStream.generatedRDDs` before stopping, it may get data before checkpointing. Then after recovering from the checkpoint, `recoveredOffsetRanges` may contain something not in `offsetRangesBeforeStop`, which will fail the test. Adding `Thread.sleep(1000)` before `ssc.stop()` will reproduce this failure.

This PR just moves the logic of `offsetRangesBeforeStop` (also renamed to `offsetRangesAfterStop`) after `ssc.stop()` to fix the flaky test.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #12903 from zsxwing/SPARK-6005.

9533f539

[SPARK-15207][BUILD] Use Travis CI for Java Linter and JDK7/8 compilation test · 603c4f8e

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

Currently, Java Linter is disabled in Jenkins tests.

https://github.com/apache/spark/blob/master/dev/run-tests.py#L554

However, as of today, Spark has 721 Java files with 97362 lines of code (excluding blank lines and comments). That is about 1/3 of the Scala code.
```
--------------------------------------------------------------------------------
Language                      files          blank        comment           code
--------------------------------------------------------------------------------
Scala                          2353          62819         124060         318747
Java                            721          18617          23314          97362
```

This PR aims to take advantage of Travis CI to handle the following static analysis by adding a single file, `.travis.yml` without any additional burden on the existing servers.

- Java Linter
- JDK7/JDK8 maven compile

Note that this PR does not propose removing any of the above work items from Jenkins. That is possible, but we need to observe Travis CI's stability for a while first. The main goal of this issue is to reduce committers' overhead on linter-related PRs (the original PR and the follow-up fix PR).

## How was this patch tested?

Pass the Travis CI tests. Please see the following links.

https://travis-ci.org/dongjoon-hyun/spark/builds/128595350
https://travis-ci.org/dongjoon-hyun/spark/builds/128708372

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12980 from dongjoon-hyun/SPARK-15207.

603c4f8e

[SPARK-14986][SQL] Return correct result for empty LATERAL VIEW OUTER · d28c6754

Herman van Hovell authored 8 years ago

## What changes were proposed in this pull request?
A Generate with the `outer` flag enabled should always return one or more rows for every input row. The optimizer currently violates this by rewriting `outer` Generates that do not contain columns of the child plan into an unjoined generate, for example:
```sql
select e from a lateral view outer explode(a.b) as e
```
As a result, an `outer` Generate produces no output at all when the Generator's input expression is empty. This PR fixes this.
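
The intended `outer` semantics can be sketched in plain Scala, without Spark. This is only a model of the behaviour described above, not the optimizer or `Generate` code; `OuterGenerateSketch` and `explode` are hypothetical names, and `None` stands in for a SQL NULL.

```scala
object OuterGenerateSketch {
  // Models explode over a sequence of input rows, where each row is the
  // collection being exploded. Without `outer`, a row with an empty
  // collection produces nothing; with `outer`, it produces a single None.
  def explode[A](rows: Seq[Seq[A]], outer: Boolean): Seq[Option[A]] =
    rows.flatMap { elems =>
      if (elems.isEmpty && outer) Seq(Option.empty[A])
      else elems.map(e => Some(e): Option[A])
    }
}
```

In this model, every input row contributes at least one output row when `outer` is true, which is the invariant the PR restores.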

## How was this patch tested?
Added test case to `SQLQuerySuite`.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12906 from hvanhovell/SPARK-14986.

d28c6754

[SPARK-14642][SQL] import org.apache.spark.sql.expressions._ breaks udf under functions · 89f73f67

Subhobrata Dey authored 8 years ago

## What changes were proposed in this pull request?

PR fixes the import issue which breaks udf functions.

The following code snippet throws an error

```
scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> import org.apache.spark.sql.expressions._
import org.apache.spark.sql.expressions._

scala> udf((v: String) => v.stripSuffix("-abc"))
<console>:30: error: No TypeTag available for String
       udf((v: String) => v.stripSuffix("-abc"))
```

This PR resolves the issue.

## How was this patch tested?

patch tested with unit tests.

Author: Subhobrata Dey <sbcd90@gmail.com>

Closes #12458 from sbcd90/udfFuncBreak.

89f73f67

[SPARK-15195][PYSPARK][DOCS] Update ml.tuning PyDocs · 93353b01

Holden Karau authored 8 years ago

## What changes were proposed in this pull request?

Tag classes in ml.tuning as experimental, add docs for kfolds avg metric, and copy TrainValidationSplit scaladoc for more detailed explanation.

## How was this patch tested?

built docs locally

Author: Holden Karau <holden@us.ibm.com>

Closes #12967 from holdenk/SPARK-15195-pydoc-ml-tuning.

93353b01

[SPARK-15037][HOTFIX] Don't create 2 SparkSessions in constructor · 69641066

Andrew Or authored 8 years ago

## What changes were proposed in this pull request?

After #12907 `TestSparkSession` creates a spark session in one of the constructors just to get the `SparkContext` from it. This ends up creating 2 `SparkSession`s from one call, which is definitely not what we want.

## How was this patch tested?

Jenkins.

Author: Andrew Or <andrew@databricks.com>

Closes #13031 from andrewor14/sql-test.

69641066

[SPARK-15037][HOTFIX] Replace `sqlContext` and `sparkSession` with `spark`. · db3b4a20

Dongjoon Hyun authored 8 years ago

This replaces `sparkSession` with `spark` in CatalogSuite.scala.

Pass the Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13030 from dongjoon-hyun/hotfix_sparkSession.

db3b4a20

[HOTFIX] SQL test compilation error from merge conflict · cddb9da0
Andrew Or authored 8 years ago

cddb9da0

[SPARK-14603][SQL] Verification of Metadata Operations by Session Catalog · 5c6b0855

gatorsmile authored 8 years ago

Since we cannot rely on the underlying external catalog to throw exceptions for invalid metadata operations, let's do the verification in `SessionCatalog`.

- [X] The first step is to unify the error messages issued in Hive-specific Session Catalog and general Session Catalog.
- [X] The second step is to verify the inputs of metadata operations for partitioning-related operations. This is moved to a separate PR: https://github.com/apache/spark/pull/12801
- [X] The third step is to add database existence verification in `SessionCatalog`
- [X] The fourth step is to add table existence verification in `SessionCatalog`
- [X] The fifth step is to add function existence verification in `SessionCatalog`

Add test cases and verify the error messages we issued

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12385 from gatorsmile/verifySessionAPIs.

5c6b0855

[SPARK-15037][SQL][MLLIB] Use SparkSession instead of SQLContext in Scala/Java TestSuites · ed0b4070

Sandeep Singh authored 8 years ago

## What changes were proposed in this pull request?
Use SparkSession instead of SQLContext in Scala/Java TestSuites. Because this PR is already very big, the Python TestSuites are handled in a different PR.

## How was this patch tested?
Existing tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12907 from techaddict/SPARK-15037.

ed0b4070

[SPARK-12837][CORE] reduce network IO for accumulators · bcfee153

Wenchen Fan authored 8 years ago

Sending un-updated accumulators back to driver makes no sense, as merging a zero value accumulator is a no-op. We should only send back updated accumulators, to save network IO.
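
The reasoning above can be checked with a small plain-Scala model. This is a sketch under stated assumptions: `LongAcc`, `updatesToSend`, and `merge` are hypothetical stand-ins and not Spark's accumulator classes, and the driver is assumed to already hold a total for every registered accumulator id.

```scala
object AccumulatorSketch {
  // Hypothetical stand-in for a task-side long accumulator.
  final case class LongAcc(id: Long, value: Long) {
    def isZero: Boolean = value == 0L
  }

  // Task side: keep only the accumulators the task actually updated.
  def updatesToSend(accs: Seq[LongAcc]): Seq[LongAcc] = accs.filterNot(_.isZero)

  // Driver side: add each received update into the running totals.
  def merge(totals: Map[Long, Long], updates: Seq[LongAcc]): Map[Long, Long] =
    updates.foldLeft(totals) { (m, a) =>
      m.updated(a.id, m.getOrElse(a.id, 0L) + a.value)
    }
}
```

Merging the filtered updates gives the same totals as merging all of them, so dropping zero-valued accumulators only reduces the amount of data sent back.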

new test in `TaskContextSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12899 from cloud-fan/acc.

bcfee153

[SPARK-11249][LAUNCHER] Throw error if app resource is not provided. · 0b9cae42

Marcelo Vanzin authored 8 years ago

Without this, the code would build an invalid spark-submit command line,
and a more cryptic error would be presented to the user. Also, expose
a constant that allows users to set a dummy resource in cases where
they don't need an actual resource file; for backwards compatibility,
that uses the same "spark-internal" resource that Spark itself uses.

Tested via unit tests, run-example, spark-shell, and running the
thrift server with mixed spark and hive command line arguments.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #12909 from vanzin/SPARK-11249.

0b9cae42

[SPARK-13670][LAUNCHER] Propagate error from launcher to shell. · 36c5892b

Marcelo Vanzin authored 8 years ago

bash doesn't really propagate errors from subshells when using redirection
the way spark-class does; so, instead, this change captures the exit code
of the launcher process in the command array, and checks it before executing
the actual command.

Tested by injecting an error in Main.java (the launcher entry point) and
verifying the shell gets the right exit code from spark-class.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #12910 from vanzin/SPARK-13670.

36c5892b

[SPARK-13382][DOCS][PYSPARK] Update pyspark testing notes in build docs · 488863d8

Holden Karau authored 8 years ago

## What changes were proposed in this pull request?

The current build documents don't specify that for PySpark tests we need to include Hive in the assembly; otherwise the ORC tests fail.

## How was this patch tested?

Manually built the docs locally. Ran the provided build command followed by the PySpark SQL tests.

![pyspark2](https://cloud.githubusercontent.com/assets/59893/13190008/8829cde4-d70f-11e5-8ff5-a88b7894d2ad.png)

Author: Holden Karau <holden@us.ibm.com>

Closes #11278 from holdenk/SPARK-13382-update-pyspark-testing-notes-r2.

488863d8

[SPARK-14773] [SPARK-15179] [SQL] Fix SQL building and enable Hive tests · 26462653

Herman van Hovell authored 8 years ago

## What changes were proposed in this pull request?
This PR fixes SQL building for predicate subqueries and correlated scalar subqueries. It also enables most Hive subquery tests.

## How was this patch tested?
Enabled new tests in HiveComparisionSuite.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12988 from hvanhovell/SPARK-14773.

26462653

[SPARK-15154] [SQL] Change key types to Long in tests · 2dfb9cd1

Pete Robbins authored 8 years ago

## What changes were proposed in this pull request?

As reported in the Jira, the 2 tests changed here use a key of type Integer where the Spark SQL code assumes the type is Long. This PR changes the tests to use the correct key types.

## How was this patch tested?

Test builds run on both Big Endian and Little Endian platforms

Author: Pete Robbins <robbinspg@gmail.com>

Closes #13009 from robbinspg/HashedRelationSuiteFix.

2dfb9cd1

[SPARK-14127][SQL] "DESC <table>": Extracts schema information from table... · 8a12580d

Cheng Lian authored 8 years ago

[SPARK-14127][SQL] "DESC <table>": Extracts schema information from table properties for data source tables

## What changes were proposed in this pull request?

This is a follow-up of #12934 and #12844. This PR adds a set of utility methods in `DDLUtils` to help extract schema information (user-defined schema, partition columns, and bucketing information) from data source table properties. These utility methods are then used in `DescribeTableCommand` to refine output for data source tables. Before this PR, the aforementioned schema information was only shown as table properties, which are hard to read.

Sample output:

```
+----------------------------+---------------------------------------------------------+-------+
|col_name                    |data_type                                                |comment|
+----------------------------+---------------------------------------------------------+-------+
|a                           |bigint                                                   |       |
|b                           |bigint                                                   |       |
|c                           |bigint                                                   |       |
|d                           |bigint                                                   |       |
|# Partition Information     |                                                         |       |
|# col_name                  |                                                         |       |
|d                           |                                                         |       |
|                            |                                                         |       |
|# Detailed Table Information|                                                         |       |
|Database:                   |default                                                  |       |
|Owner:                      |lian                                                     |       |
|Create Time:                |Tue May 10 03:20:34 PDT 2016                             |       |
|Last Access Time:           |Wed Dec 31 16:00:00 PST 1969                             |       |
|Location:                   |file:/Users/lian/local/src/spark/workspace-a/target/...  |       |
|Table Type:                 |MANAGED                                                  |       |
|Table Parameters:           |                                                         |       |
|  rawDataSize               |-1                                                       |       |
|  numFiles                  |1                                                        |       |
|  transient_lastDdlTime     |1462875634                                               |       |
|  totalSize                 |684                                                      |       |
|  spark.sql.sources.provider|parquet                                                  |       |
|  EXTERNAL                  |FALSE                                                    |       |
|  COLUMN_STATS_ACCURATE     |false                                                    |       |
|  numRows                   |-1                                                       |       |
|                            |                                                         |       |
|# Storage Information       |                                                         |       |
|SerDe Library:              |org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe       |       |
|InputFormat:                |org.apache.hadoop.mapred.SequenceFileInputFormat         |       |
|OutputFormat:               |org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat|       |
|Compressed:                 |No                                                       |       |
|Num Buckets:                |2                                                        |       |
|Bucket Columns:             |[b]                                                      |       |
|Sort Columns:               |[c]                                                      |       |
|Storage Desc Parameters:    |                                                         |       |
|  path                      |file:/Users/lian/local/src/spark/workspace-a/target/...  |       |
|  serialization.format      |1                                                        |       |
+----------------------------+---------------------------------------------------------+-------+
```

## How was this patch tested?

Test cases are added in `HiveDDLSuite` to check command output.

Author: Cheng Lian <lian@databricks.com>

Closes #13025 from liancheng/spark-14127-extract-schema-info.

8a12580d

[SPARK-14963][YARN] Using recoveryPath if NM recovery is enabled · aab99d31

jerryshao authored 8 years ago

## What changes were proposed in this pull request?

Since Hadoop 2.5, the YARN NodeManager (NM) supports NM recovery, which uses a recovery path for auxiliary services such as spark_shuffle and mapreduce_shuffle. This change uses that path instead of the NM local dir if NM recovery is enabled.

## How was this patch tested?

Unit test + local test.

Author: jerryshao <sshao@hortonworks.com>

Closes #12994 from jerryshao/SPARK-14963.

aab99d31

[SPARK-14542][CORE] PipeRDD should allow configurable buffer size for… · a019e6ef

Sital Kedia authored 8 years ago

## What changes were proposed in this pull request?

Currently PipedRDD internally uses PrintWriter to write data to the stdin of the piped process, which by default uses a BufferedWriter with a buffer size of 8k. In our experiments, we have seen that an 8k buffer is too small and the job spends a significant amount of CPU time in system calls to copy the data. We should have a way to configure the buffer size for the writer.
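
A minimal sketch of the idea using only JDK classes: wrap the output stream in a `BufferedWriter` with an explicit size instead of relying on the default. `PipeWriterSketch`, `writerFor`, and `writeLines` are hypothetical names for this example, and this is not the code added to PipedRDD.

```scala
import java.io.{BufferedWriter, ByteArrayOutputStream, OutputStream, OutputStreamWriter, PrintWriter}
import java.nio.charset.StandardCharsets

object PipeWriterSketch {
  // Builds a PrintWriter whose underlying BufferedWriter uses the given
  // buffer size (in chars) rather than the JDK default.
  def writerFor(out: OutputStream, bufferSize: Int): PrintWriter =
    new PrintWriter(
      new BufferedWriter(new OutputStreamWriter(out, StandardCharsets.UTF_8), bufferSize))

  // Writes each line to an in-memory sink and returns what was written.
  def writeLines(lines: Seq[String], bufferSize: Int): String = {
    val sink = new ByteArrayOutputStream()
    val writer = writerFor(sink, bufferSize)
    try lines.foreach(l => writer.println(l)) finally writer.close()
    sink.toString("UTF-8")
  }
}
```

The output is the same for any buffer size; a larger buffer only changes how often data is flushed to the underlying stream, which is where the system-call cost comes from.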

## How was this patch tested?
Ran PipedRDDSuite tests.

Author: Sital Kedia <skedia@fb.com>

Closes #12309 from sitalkedia/bufferedPipedRDD.

a019e6ef

[SPARK-15215][SQL] Fix Explain Parsing and Output · 57064726

gatorsmile authored 8 years ago

#### What changes were proposed in this pull request?
This PR is to address a few existing issues in `EXPLAIN`:
- The `EXPLAIN` options `LOGICAL | FORMATTED | EXTENDED | CODEGEN` should not match zero or more times; they should match zero or one time. The parser should not allow users to specify more than one option in a single command.
- The option `LOGICAL` is not supported. Issue an exception when users specify this option in the command.
- The output of `EXPLAIN` contains a weird empty line when the output of the analyzed plan is empty. We should remove it. For example:
  ```
  == Parsed Logical Plan ==
  CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false

  == Analyzed Logical Plan ==

  CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false

  == Optimized Logical Plan ==
  CreateTable CatalogTable(`t`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.  HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(col,int,true,None)),List(),List(),List(),-1,,1462725171656,-1,Map(),None,None,None), false
  ...
  ```

#### How was this patch tested?
Added and modified a few test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12991 from gatorsmile/explainCreateTable.

57064726

May 09, 2016

[SPARK-15187][SQL] Disallow Dropping Default Database · f4537917

gatorsmile authored 8 years ago

#### What changes were proposed in this pull request?
In Hive Metastore, dropping the default database is not allowed. However, in `InMemoryCatalog`, this is allowed.

This PR disallows dropping the default database.

#### How was this patch tested?
We already had a test case in HiveDDLSuite. Now, we also add the same one in DDLSuite.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #12962 from gatorsmile/dropDefaultDB.

f4537917

[SPARK-15229][SQL] Make case sensitivity setting internal · 4b4344a8

Reynold Xin authored 8 years ago

## What changes were proposed in this pull request?
Our case sensitivity support is different from what ANSI SQL standards support. Postgres' behavior is that if an identifier is quoted, then it is treated as case sensitive; otherwise it is folded to lowercase. We will likely need to revisit this in the future and change our behavior. For now, the safest change to do for Spark 2.0 is to make the case sensitive option internal and discourage users from turning it on, effectively making Spark always case insensitive.
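
The Postgres rule mentioned above can be sketched as a small normalization function. This is a simplified model for illustration only: `IdentifierSketch.normalize` is a hypothetical helper, it only handles double-quoted identifiers, and it ignores escaped quotes inside the identifier.

```scala
object IdentifierSketch {
  // Quoted identifiers keep their case (quotes stripped);
  // unquoted identifiers are folded to lowercase.
  def normalize(ident: String): String =
    if (ident.length >= 2 && ident.startsWith("\"") && ident.endsWith("\""))
      ident.substring(1, ident.length - 1)
    else
      ident.toLowerCase(java.util.Locale.ROOT)
}
```

Under this rule `MyCol` and `mycol` name the same column, while `"MyCol"` names a different one.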

## How was this patch tested?
N/A - a small config documentation change.

Author: Reynold Xin <rxin@databricks.com>

Closes #13011 from rxin/SPARK-15229.

4b4344a8

[SPARK-15234][SQL] Fix spark.catalog.listDatabases.show() · 8f932fb8

Andrew Or authored 8 years ago

## What changes were proposed in this pull request?

Before:
```
scala> spark.catalog.listDatabases.show()
+--------------------+-----------+-----------+
|                name|description|locationUri|
+--------------------+-----------+-----------+
|Database[name='de...|
|Database[name='my...|
|Database[name='so...|
+--------------------+-----------+-----------+
```

After:
```
+-------+--------------------+--------------------+
|   name|         description|         locationUri|
+-------+--------------------+--------------------+
|default|Default Hive data...|file:/user/hive/w...|
|  my_db|  This is a database|file:/Users/andre...|
|some_db|                    |file:/private/var...|
+-------+--------------------+--------------------+
```

## How was this patch tested?

New test in `CatalogSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #13015 from andrewor14/catalog-show.

8f932fb8

[SPARK-15025][SQL] fix duplicate of PATH key in datasource table options · 980bba0d

xin Wu authored 8 years ago

## What changes were proposed in this pull request?
The issue is that when the user provides the path option with uppercase "PATH" key, `options` contains `PATH` key and will get into the non-external case in the following code in `createDataSourceTables.scala`, where a new key "path" is created with a default path.
```
val optionsWithPath =
      if (!options.contains("path")) {
        isExternal = false
        options + ("path" -> sessionState.catalog.defaultTablePath(tableIdent))
      } else {
        options
      }
```
So before the Hive table is created, serdeInfo.parameters contains both the "PATH" and "path" keys, pointing to different directories, and the Hive table's dataLocation takes the value of "path".

The fix in this PR is to convert `options` in the code above to `CaseInsensitiveMap` before checking for containing "path" key.
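
The effect of the fix can be sketched in plain Scala by normalizing the keys before the check. This is an illustration under assumptions: `PathOptionSketch`, `lowerCaseKeys`, and `withPath` are hypothetical names, lowercasing the keys stands in for Spark's `CaseInsensitiveMap`, and the table is assumed to be external whenever the user supplies a path.

```scala
object PathOptionSketch {
  // Hypothetical stand-in for a case-insensitive view of the options.
  def lowerCaseKeys(options: Map[String, String]): Map[String, String] =
    options.map { case (k, v) => k.toLowerCase(java.util.Locale.ROOT) -> v }

  // Returns the options with a "path" entry, plus whether the table is
  // treated as external (that is, the user supplied the path).
  def withPath(
      options: Map[String, String],
      defaultPath: String): (Map[String, String], Boolean) = {
    val normalized = lowerCaseKeys(options)
    if (normalized.contains("path")) (normalized, true)
    else (normalized + ("path" -> defaultPath), false)
  }
}
```

With the keys normalized first, a user-supplied `PATH` is recognized and no second `path` entry with the default location is added.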

## How was this patch tested?
A testcase is added

Author: xin Wu <xinwu@us.ibm.com>

Closes #12804 from xwu0226/SPARK-15025.

980bba0d

[SPARK-15209] Fix display of job descriptions with single quotes in web UI timeline · 3323d0f9

Josh Rosen authored 8 years ago

## What changes were proposed in this pull request?

This patch fixes an escaping bug in the Web UI's event timeline that caused Javascript errors when displaying timeline entries whose descriptions include single quotes.

The original bug can be reproduced by running

```scala
sc.setJobDescription("double quote: \" ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("single quote: ' ")
sc.parallelize(1 to 10).count()
```

and then browsing to the driver UI. Previously, this resulted in an "Uncaught SyntaxError" because the single quote from the description was not escaped and ended up closing a Javascript string literal too early.

The fix implemented here is to change the relevant Javascript to define its string literals using double-quotes. Our escaping logic already properly escapes double quotes in the description, so this is safe to do.

## How was this patch tested?

Tested manually in `spark-shell` using the following cases:

```scala
sc.setJobDescription("double quote: \" ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("single quote: ' ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("ampersand: &")
sc.parallelize(1 to 10).count()

sc.setJobDescription("newline: \n text after newline ")
sc.parallelize(1 to 10).count()

sc.setJobDescription("carriage return: \r text after return ")
sc.parallelize(1 to 10).count()
```

/cc sarutak for review.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12995 from JoshRosen/SPARK-15209.

3323d0f9

[SPARK-14972] Improve performance of JSON schema inference's compatibleType method · c3350cad

Josh Rosen authored 8 years ago

This patch improves the performance of `InferSchema.compatibleType` and `inferField`. The net result of this patch is a 6x speedup in local benchmarks running against cached data with a massive nested schema.

The key idea is to remove unnecessary sorting in `compatibleType`'s `StructType` merging code. This code takes two structs, merges the fields with matching names, and copies over the unique fields, producing a new schema which is the union of the two structs' schemas. Previously, this code performed a very inefficient `groupBy()` to match up fields with the same name, but this is unnecessary because `inferField` already sorts structs' fields by name: since both lists of fields are sorted, we can simply merge them in a single pass.
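
The single-pass merge described above can be sketched as follows. This is a simplified model, not the code in `InferSchema`: `Field` is a hypothetical record with a name and a type label, and `combine` is an assumed callback that resolves two fields with the same name.

```scala
object StructMergeSketch {
  // Hypothetical simplified field: a name plus a type label.
  final case class Field(name: String, dataType: String)

  // Merges two field lists that are both already sorted by name.
  // Fields with matching names are combined; unique fields are copied over.
  def mergeSorted(a: Vector[Field], b: Vector[Field])(
      combine: (Field, Field) => Field): Vector[Field] = {
    val out = Vector.newBuilder[Field]
    var i = 0
    var j = 0
    while (i < a.length && j < b.length) {
      val cmp = a(i).name.compareTo(b(j).name)
      if (cmp == 0) { out += combine(a(i), b(j)); i += 1; j += 1 }
      else if (cmp < 0) { out += a(i); i += 1 }
      else { out += b(j); j += 1 }
    }
    // Copy whatever remains in either list.
    while (i < a.length) { out += a(i); i += 1 }
    while (j < b.length) { out += b(j); j += 1 }
    out.result()
  }
}
```

Because both inputs are sorted, each field is visited once and the result is itself sorted by name, so no grouping or re-sorting step is needed.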

This patch also speeds up the existing field sorting in `inferField`: the old sorting code allocated unnecessary intermediate collections, while the new code uses mutable collections and performs in-place sorting.

I rewrote inefficient `equals()` implementations in `StructType` and `Metadata`, significantly reducing object allocations in those methods.

Finally, I replaced a `treeAggregate` call with `fold`: I doubt that `treeAggregate` will benefit us very much because the schemas would have to be enormous to realize large savings in network traffic. Since most schemas are probably fairly small in serialized form, they should typically fit within a direct task result and therefore can be incrementally merged at the driver as individual tasks finish. This change eliminates an entire (short) scheduler stage.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #12750 from JoshRosen/schema-inference-speedups.

c3350cad

[SPARK-15173][SQL] DataFrameWriter.insertInto should work with datasource table stored in hive · 2adb11f6

Wenchen Fan authored 8 years ago

When we parse `CREATE TABLE USING`, we should build a `CreateTableUsing` plan with `managedIfNoPath` set to true. Then we will add the default table path to the options when writing it to Hive.

new test in `SQLQuerySuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12949 from cloud-fan/bug.

2adb11f6

[SPARK-10653][CORE] Remove unnecessary things from SparkEnv · c3e23bc0

Alex Bozarth authored 8 years ago

## What changes were proposed in this pull request?

Removed blockTransferService and sparkFilesDir from SparkEnv since they're rarely used and don't need to be stored in the env. Edited their few usages to accommodate the change.

## How was this patch tested?

ran dev/run-tests locally

Author: Alex Bozarth <ajbozart@us.ibm.com>

Closes #12970 from ajbozarth/spark10653.

c3e23bc0

[SPARK-15166][SQL] Move some hive-specific code from SparkSession · 7bf9b120

Andrew Or authored 8 years ago

## What changes were proposed in this pull request?

This also simplifies the code being moved.

## How was this patch tested?

Existing tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12941 from andrewor14/move-code.

7bf9b120

[SPARK-15210][SQL] Add missing @DeveloperApi annotation in sql.types · dfdcab00

Zheng RuiFeng authored 8 years ago

add DeveloperApi annotation for `AbstractDataType` `MapType` `UserDefinedType`

local build

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12982 from zhengruifeng/types_devapi.

dfdcab00

[SAPRK-15220][UI] add hyperlink to running application and completed application · f8aca5b4

mwws authored 8 years ago

## What changes were proposed in this pull request?
Add hyperlinks to "running application" and "completed application" so that users can jump to the application tables directly. In my environment, I set up 1000+ workers and it's painful to scroll down past the worker list.

## How was this patch tested?
Manually tested.

![sceenshot](https://cloud.githubusercontent.com/assets/13216322/15105718/97e06768-15f6-11e6-809d-3574046751a9.png)

Author: mwws <wei.mao@intel.com>

Closes #12997 from mwws/SPARK_UI.

f8aca5b4

[MINOR][SQL] Enhance the exception message if checkpointLocation is not set · ee6a8d7e

jerryshao authored 8 years ago

Enhance the exception message when `checkpointLocation` is not set. Previously the message was:

```
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:347)
  at scala.None$.get(Option.scala:345)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$8.apply(DataFrameWriter.scala:338)
  at org.apache.spark.sql.DataFrameWriter$$anonfun$8.apply(DataFrameWriter.scala:338)
  at scala.collection.MapLike$class.getOrElse(MapLike.scala:128)
  at scala.collection.AbstractMap.getOrElse(Map.scala:59)
  at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:337)
  at org.apache.spark.sql.DataFrameWriter.startStream(DataFrameWriter.scala:277)
  ... 48 elided
```

This is not very meaningful, so the message is changed to be more specific.

Verified locally.
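
The general pattern can be sketched in plain Scala: look the option up and fail with an explicit message instead of letting a bare `None.get` surface. `CheckpointOptionSketch` is a hypothetical name, and the message text below is an assumption for illustration, not the wording used in the PR.

```scala
object CheckpointOptionSketch {
  // Returns the configured checkpoint location or fails with a message
  // that names the missing option (message wording is hypothetical).
  def checkpointLocation(options: Map[String, String]): String =
    options.getOrElse(
      "checkpointLocation",
      throw new IllegalArgumentException(
        "checkpointLocation must be specified as an option"))
}
```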

Author: jerryshao <sshao@hortonworks.com>

Closes #12998 from jerryshao/improve-exception-message.

ee6a8d7e

[SPARK-15067][YARN] YARN executors are launched with fixed perm gen size · 6747171e

Sean Owen authored 8 years ago

## What changes were proposed in this pull request?

Look for MaxPermSize arguments anywhere in an arg, to account for quoted args. See JIRA for discussion.
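
The difference between the two kinds of check can be sketched as below. This is an assumption about the shape of the change rather than the actual YARN code: `PermGenSketch` and its methods are hypothetical, and the prefix check is assumed to be what a stricter match would look like.

```scala
object PermGenSketch {
  private val Flag = "-XX:MaxPermSize="

  // Prefix check: misses the flag when the argument starts with a quote.
  def hasMaxPermSizePrefixOnly(args: Seq[String]): Boolean =
    args.exists(_.startsWith(Flag))

  // Substring check: finds the flag anywhere inside any argument.
  def hasMaxPermSize(args: Seq[String]): Boolean =
    args.exists(_.contains(Flag))
}
```

For an argument such as `"-XX:MaxPermSize=512m"` with the quotes included, only the substring check detects that the user already set the perm gen size.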

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #12985 from srowen/SPARK-15067.

6747171e

[SPARK-15225][SQL] Replace SQLContext with SparkSession in Encoder documentation · e083db2e

Liang-Chi Hsieh authored 8 years ago

`Encoder`'s doc mentions `sqlContext.implicits._`. We should use `sparkSession.implicits._` instead now.

Only doc update.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13002 from viirya/encoder-doc.

e083db2e

[SPARK-15223][DOCS] fix wrongly named config reference · 65b4ab28

Philipp Hoffmann authored 8 years ago

## What changes were proposed in this pull request?

The configuration setting `spark.executor.logs.rolling.size.maxBytes` was changed to `spark.executor.logs.rolling.maxSize` in 1.4 or so.

This commit fixes a remaining reference to the old name in the documentation.

Also the description for `spark.executor.logs.rolling.maxSize` was edited to clearly state that the unit for the size is bytes.

## How was this patch tested?

no tests

Author: Philipp Hoffmann <mail@philipphoffmann.de>

Closes #13001 from philipphoffmann/patch-3.

65b4ab28

[MINOR][DOCS] Remove remaining sqlContext in documentation at examples · 2992a215

hyukjinkwon authored 8 years ago

This PR removes `sqlContext` in examples. Actual usage was all replaced in https://github.com/apache/spark/pull/12809 but there are some in comments.

Manual style checking.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13006 from HyukjinKwon/minor-docs.

2992a215

[SPARK-14127][SQL] Makes 'DESC [EXTENDED|FORMATTED] <table>' support data source tables · 671b382a

Cheng Lian authored 8 years ago

## What changes were proposed in this pull request?

This is a follow-up of PR #12844. It makes the newly updated `DescribeTableCommand` support data source tables.

## How was this patch tested?

A test case is added to check `DESC [EXTENDED | FORMATTED] <table>` output.

Author: Cheng Lian <lian@databricks.com>

Closes #12934 from liancheng/spark-14127-desc-table-follow-up.

671b382a