- Jan 16, 2017
-
-
Felix Cheung authored
## What changes were proposed in this pull request? Refactored the scripts to remove duplication and give each script a clearer purpose. ## How was this patch tested? Manually. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16249 from felixcheung/rscripts.
-
Felix Cheung authored
## What changes were proposed in this pull request? Windows seems to be the only place with appauthor in the path, for which we should say "Apache" (and it is case-sensitive). The current path of `AppData\Local\spark\spark\Cache` is a bit odd. ## How was this patch tested? Manual. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16590 from felixcheung/rcachedir.
-
wm624@hotmail.com authored
## What changes were proposed in this pull request? spark.lda passes the optimizer "em" or "online" as a string to the backend. However, LDAWrapper doesn't set the optimizer based on the value from R. Therefore, for optimizer "em", the `isDistributed` field is FALSE, when it should be TRUE based on the Scala code. In addition, the `summary` method should bring back the results related to `DistributedLDAModel`. ## How was this patch tested? Manual tests by comparing with the Scala example. Modified the current unit tests: fixed the incorrect unit test and added the necessary tests for the `summary` method. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16464 from wangmiao1981/new.
-
jiangxingbo authored
## What changes were proposed in this pull request?
This PR is a follow-up to address the comments https://github.com/apache/spark/pull/16233/files#r95669988 and https://github.com/apache/spark/pull/16233/files#r95662299.

We try to wrap the child by:
1. Generate the `queryOutput`:
   1.1. If the query column names are defined, map the column names to attributes in the child output by name;
   1.2. Else set the child output attributes to `queryOutput`.
2. Map the `queryOutput` to the view output by index; if the corresponding attributes don't match, try to up-cast and alias the attribute in `queryOutput` to the attribute in the view output.
3. Add a Project over the child, with the new output generated by the previous steps.

If the view output matches neither the child output nor the query column names in number of columns, throw an AnalysisException.

## How was this patch tested?
Added new test cases in `SQLViewSuite`.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #16561 from jiangxb1987/alias-view.
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request?
We have a config `spark.sql.files.ignoreCorruptFiles` which can be used to ignore corrupt files when reading files in SQL. Currently the `ignoreCorruptFiles` config has two issues and can't work for Parquet:

1. We only ignore corrupt files in `FileScanRDD`. Actually, we begin to read those files as early as inferring the data schema from the files. For corrupt files, we can't read the schema and the program fails. A related issue was reported at http://apache-spark-developers-list.1001551.n3.nabble.com/Skip-Corrupted-Parquet-blocks-footer-tc20418.html
2. In `FileScanRDD`, we assume that we only begin to read the files when starting to consume the iterator. However, it is possible that the files are read before that. In this case, the `ignoreCorruptFiles` config doesn't work either.

This patch targets the Parquet datasource. If this direction is ok, we can address the same issue for other datasources like ORC. Two main changes in this patch:

1. Replace `ParquetFileReader.readAllFootersInParallel` by implementing the logic to read footers in a multi-threaded manner. We can't ignore corrupt files if we use `ParquetFileReader.readAllFootersInParallel`, so this patch implements similar logic in `readParquetFootersInParallel`.
2. In `FileScanRDD`, we need to ignore corrupt files too when we call `readFunction` to return the iterator.

One thing to notice: we read the schema from the Parquet file's footer. The method to read the footer, `ParquetFileReader.readFooter`, throws `RuntimeException`, instead of `IOException`, if it can't successfully read the footer. Please check out https://github.com/apache/parquet-mr/blob/df9d8e415436292ae33e1ca0b8da256640de9710/parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileReader.java#L470. So this patch catches `RuntimeException`. One concern is that it might also shadow runtime exceptions other than those from reading corrupt files.

## How was this patch tested?
Jenkins tests.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16474 from viirya/fix-ignorecorrupted-parquet-files.
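As a usage sketch (not part of the patch itself; the input path is hypothetical), the config can be enabled before reading a directory that may contain corrupt Parquet files:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("ignore-corrupt-files").getOrCreate()

// Ask Spark SQL to skip corrupt files instead of failing the query.
spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")

// Hypothetical input path: corrupt Parquet files under it are skipped,
// including (after this patch) while reading footers to infer the schema.
val df = spark.read.parquet("/data/events")
df.count()
```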
-
- Jan 15, 2017
-
-
gatorsmile authored
### What changes were proposed in this pull request?
```Scala
sql("CREATE TABLE tab (a STRING) STORED AS PARQUET")

// This table fetch is to fill the cache with zero leaf files
spark.table("tab").show()

sql(
  s"""
     |LOAD DATA LOCAL INPATH '$newPartitionDir' OVERWRITE
     |INTO TABLE tab
   """.stripMargin)

spark.table("tab").show()
```
In the above example, the returned result is empty after table loading. The metadata cache could be out of date after loading new data into the table, because loading/inserting does not update the cache.

So far, the metadata cache is only used for data source tables. Thus, for Hive serde tables, only the `parquet` and `orc` formats face such issues, because Hive serde tables in the parquet/orc format can be converted to data source tables when `spark.sql.hive.convertMetastoreParquet`/`spark.sql.hive.convertMetastoreOrc` is on. This PR refreshes the metadata cache after processing the `LOAD DATA` command.

In addition, Spark SQL does not convert **partitioned** Hive tables (orc/parquet) to data source tables in the write path, but the read path uses the metadata cache for both **partitioned** and non-partitioned Hive tables (orc/parquet). That means writing partitioned parquet/orc tables still uses `InsertIntoHiveTable`, instead of `InsertIntoHadoopFsRelationCommand`. To avoid reading the out-of-date cache, `InsertIntoHiveTable` needs to refresh the metadata cache for partitioned tables. Note, it does not need to refresh the cache for non-partitioned parquet/orc tables, because they do not go through `InsertIntoHiveTable` at all.

Based on the comments, this PR keeps the existing logic unchanged. That means we always refresh the table no matter whether the table is partitioned or not.

### How was this patch tested?
Added test cases in parquetSuites.scala

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16500 from gatorsmile/refreshInsertIntoHiveTable.
-
uncleGen authored
## What changes were proposed in this pull request? Fix outdated parameter descriptions in kafka010 ## How was this patch tested? cc koeninger zsxwing Author: uncleGen <hustyugm@gmail.com> Closes #16569 from uncleGen/SPARK-19206.
-
Shixiong Zhu authored
## What changes were proposed in this pull request? Upgrade Netty to `4.0.43.Final` to add the fix for https://github.com/netty/netty/issues/6153 ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16568 from zsxwing/SPARK-18971.
-
Maurus Cuelenaere authored
## What changes were proposed in this pull request? core/src/main/scala/org/apache/spark/SparkContext.scala contains the LOCAL_N_FAILURES_REGEX master mode, but it was never documented, so this patch documents it. ## How was this patch tested? By using the GitHub Markdown preview feature. Author: Maurus Cuelenaere <mcuelenaere@gmail.com> Closes #16562 from mcuelenaere/patch-1.
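For context, a minimal sketch of this master mode (the thread and failure counts below are arbitrary example values, not from the patch): the regex matches masters of the form `local[N,F]`, where N is the number of worker threads and F is the maximum number of task failures.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// "local[4,2]": run locally with 4 worker threads and allow each task to
// fail up to 2 times before the job is aborted.
val conf = new SparkConf()
  .setMaster("local[4,2]")
  .setAppName("local-n-failures-demo")
val sc = new SparkContext(conf)
```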
-
xiaojian.fxj authored
[SPARK-19042] Spark executor can't download the jars when the uber jar's HTTP URL contains query strings. If the uber jar's URL contains a query string, the Executor.updateDependencies method can't download the jars correctly. This is because `localName = name.split("/").last` won't get the expected jar file name. The fix is the same as for [SPARK-17855]. Author: xiaojian.fxj <xiaojian.fxj@alibaba-inc.com> Closes #16509 from hustfxj/bug.
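A minimal illustration of the failure mode, with a made-up URL: taking everything after the last `/` keeps the query string, so the local file name is wrong; one possible fix is to strip the query part first.

```scala
// Hypothetical jar URL with a query string (e.g. a signed download link).
val name = "http://repo.example.com/jars/app-assembly.jar?token=abc123"

// Buggy extraction: the query string leaks into the local file name.
val buggyLocalName = name.split("/").last
// => "app-assembly.jar?token=abc123"

// Sketch of a fix: drop the query part before taking the last path segment.
val fixedLocalName = new java.net.URI(name).getPath.split("/").last
// => "app-assembly.jar"
```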
-
Tsuyoshi Ozawa authored
## What changes were proposed in this pull request? Use Slf4JLoggerFactory.INSTANCE instead of creating a Slf4JLoggerFactory object with its constructor, which is deprecated. ## How was this patch tested? By running StateStoreRDDSuite. Author: Tsuyoshi Ozawa <ozawa@apache.org> Closes #16570 from oza/SPARK-19207.
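In Netty terms, the change amounts to something like the following sketch (where exactly Spark wires this up is not shown here):

```scala
import io.netty.util.internal.logging.{InternalLoggerFactory, Slf4JLoggerFactory}

// Deprecated: InternalLoggerFactory.setDefaultFactory(new Slf4JLoggerFactory())
// Preferred: reuse the shared singleton instance.
InternalLoggerFactory.setDefaultFactory(Slf4JLoggerFactory.INSTANCE)
```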
-
- Jan 14, 2017
-
-
windpiger authored
## What changes were proposed in this pull request? After [SPARK-19107](https://issues.apache.org/jira/browse/SPARK-19107), we can now treat Hive as a data source and create Hive tables with DataFrameWriter and Catalog. However, the support is not complete; there are still some cases we do not support. This PR makes DataFrameWriter.saveAsTable work with the hive format in overwrite mode. ## How was this patch tested? Unit test added. Author: windpiger <songjun@outlook.com> Closes #16549 from windpiger/saveAsTableWithHiveOverwrite.
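A usage sketch of what this enables; the DataFrame, the table name, and the `spark` session variable are assumed for illustration:

```scala
import org.apache.spark.sql.SaveMode

// Hypothetical DataFrame; with this change, overwriting an existing
// Hive-format table via saveAsTable is supported.
val df = spark.range(10).toDF("id")
df.write
  .format("hive")
  .mode(SaveMode.Overwrite)
  .saveAsTable("my_hive_table")
```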
-
hyukjinkwon authored
[SPARK-19221][PROJECT INFRA][R] Add winutils binaries to the path in AppVeyor tests for Hadoop libraries to call native codes properly

## What changes were proposed in this pull request?
It seems Hadoop libraries need winutils binaries for native libraries in the path. It is not a problem in tests for now because we are only testing SparkR on Windows via AppVeyor, but it can be a problem if we run Scala tests via AppVeyor as below:

```
- SPARK-18220: read Hive orc table with varchar column *** FAILED *** (3 seconds, 937 milliseconds)
  org.apache.spark.sql.execution.QueryExecutionException: FAILED: Execution Error, return code -101 from org.apache.hadoop.hive.ql.exec.mr.MapRedTask. org.apache.hadoop.io.nativeio.NativeIO$Windows.access0(Ljava/lang/String;I)Z
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:625)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$runHive$1.apply(HiveClientImpl.scala:609)
  at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$withHiveState$1.apply(HiveClientImpl.scala:283)
  ...
```

This PR proposes to add it to the `Path` for AppVeyor tests.

## How was this patch tested?
Manually via AppVeyor.

**Before** https://ci.appveyor.com/project/spark-test/spark/build/549-windows-complete/job/gc8a1pjua2bc4i8m
**After** https://ci.appveyor.com/project/spark-test/spark/build/572-windows-complete/job/c4vrysr5uvj2hgu7

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16584 from HyukjinKwon/set-path-appveyor.
-
- Jan 13, 2017
-
-
Yucai Yu authored
## What changes were proposed in this pull request? the offset of short is 4 in OffHeapColumnVector's putShorts, but actually it should be 2. ## How was this patch tested? unit test Author: Yucai Yu <yucai.yu@intel.com> Closes #16555 from yucai/offheap_short.
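A sketch of the arithmetic involved (the signature and parameter names below are illustrative, not the actual OffHeapColumnVector code): a short occupies 2 bytes, so the byte offset must advance by 2 per element.

```scala
import org.apache.spark.unsafe.Platform

// Illustrative only: write `count` copies of `value` as shorts starting at
// logical row `rowId` into an off-heap region whose base address is `data`.
def putShorts(data: Long, rowId: Int, count: Int, value: Short): Unit = {
  var i = 0
  while (i < count) {
    // 2 bytes per short; a stride of 4 (the bug) writes to the wrong addresses.
    Platform.putShort(null, data + 2L * (rowId + i), value)
    i += 1
  }
}
```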
-
Felix Cheung authored
## What changes were proposed in this pull request? To allow specifying the number of partitions when the DataFrame is created. ## How was this patch tested? Manual, unit tests. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16512 from felixcheung/rnumpart.
-
Vinayak authored
[SPARK-18687][PYSPARK][SQL] Backward compatibility - creating a DataFrame on a new SQLContext object fails with a Derby error. The change makes SQLContext reuse the active SparkSession during construction if the SparkContext supplied is the same as the currently active SparkContext. Without this change, a new SparkSession is instantiated, which results in a Derby error when attempting to create a DataFrame using a new SQLContext object, even though the SparkContext supplied to the new SQLContext is the same as the currently active one. Refer to https://issues.apache.org/jira/browse/SPARK-18687 for details on the error and a repro. Tested with existing unit tests and a new unit test added to pyspark-sql: /python/run-tests --python-executables=python --modules=pyspark-sql Author: Vinayak <vijoshi5@in.ibm.com> Author: Vinayak Joshi <vijoshi@users.noreply.github.com> Closes #16119 from vijoshi/SPARK-18687_master.
-
Andrew Ash authored
Otherwise the open parenthesis isn't closed in query plan descriptions of batch scans: PushedFilters: [In(COL_A, [1,2,4,6,10,16,219,815], IsNotNull(COL_B), ... Author: Andrew Ash <andrew@andrewash.com> Closes #16558 from ash211/patch-9.
-
Wenchen Fan authored
## What changes were proposed in this pull request? When we convert a string to an integral type, we first convert that string to `decimal(20, 0)`, so that we can turn a string in decimal format into a truncated integral, e.g. `CAST('1.2' AS int)` will return `1`. However, this brings problems when we convert a string with large numbers to an integral type, e.g. `CAST('1234567890123' AS int)` will return `1912276171`, while Hive returns null as we expect. This is a long-standing bug (it seems to have been there since the first day Spark SQL was created); this PR fixes it by adding native support for converting `UTF8String` to an integral type. ## How was this patch tested? New regression tests. Author: Wenchen Fan <wenchen@databricks.com> Closes #16550 from cloud-fan/string-to-int.
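A rough sketch of the intended semantics in plain Scala (this is not the actual UTF8String code): truncate any fractional part and return null on overflow instead of a silently wrapped value.

```scala
// Illustrative helper only: mimics the CAST(string AS INT) behavior described above.
def castStringToInt(s: String): Option[Int] = {
  // Drop the fractional part so "1.2" truncates to "1".
  val integral = s.trim.takeWhile(_ != '.')
  try {
    val asLong = integral.toLong
    // Overflow check: out-of-range values become null (None), as in Hive,
    // rather than a wrapped-around number such as 1912276171.
    if (asLong >= Int.MinValue && asLong <= Int.MaxValue) Some(asLong.toInt) else None
  } catch {
    case _: NumberFormatException => None
  }
}

// castStringToInt("1.2")           => Some(1)
// castStringToInt("1234567890123") => None
```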
-
wm624@hotmail.com authored
## What changes were proposed in this pull request? spark.kmeans doesn't have an interface to set initSteps, seed and tol. As the Spark KMeans algorithm doesn't take the same set of parameters as R's kmeans, we should maintain a different interface in spark.kmeans. This adds the missing parameters and the corresponding documentation. Modified existing unit tests to take the additional parameters. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16523 from wangmiao1981/kmeans.
-
- Jan 12, 2017
-
-
gatorsmile authored
### What changes were proposed in this pull request? `DataFrameWriter`'s [save() API](https://github.com/gatorsmile/spark/blob/5d38f09f47a767a342a0a8219c63efa2943b5d1f/sql/core/src/main/scala/org/apache/spark/sql/DataFrameWriter.scala#L207) is performing an unnecessary full filesystem scan for the saved files. The save() API is the most basic/core API in `DataFrameWriter`, so we should avoid this scan there. The related PR: https://github.com/apache/spark/pull/16090 ### How was this patch tested? Updated the existing test cases. Author: gatorsmile <gatorsmile@gmail.com> Closes #16481 from gatorsmile/saveFileScan.
-
wm624@hotmail.com authored
[SPARK-19110][MLLIB][FOLLOWUP] Add a unit test for testing logPrior and logLikelihood of DistributedLDAModel in MLLIB ## What changes were proposed in this pull request? #16491 added the fix to mllib and a unit test to ml. This follow-up PR adds unit tests to the mllib suite. ## How was this patch tested? Unit tests. Author: wm624@hotmail.com <wm624@hotmail.com> Closes #16524 from wangmiao1981/ldabug.
-
Takeshi YAMAMURO authored
## What changes were proposed in this pull request?
Pivoting adds backticks (e.g. 3_count(\`c\`)) to column names and, in some cases, this causes analysis exceptions like:

```
scala> val df = Seq((2, 3, 4), (3, 4, 5)).toDF("a", "x", "y")
scala> df.groupBy("a").pivot("x").agg(count("y"), avg("y")).na.fill(0)
org.apache.spark.sql.AnalysisException: syntax error in attribute name: `3_count(`y`)`;
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.e$1(unresolved.scala:134)
  at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute$.parseAttributeName(unresolved.scala:144)
  ...
```

So, this PR proposes to remove these backticks from column names.

## How was this patch tested?
Added a test in `DataFrameAggregateSuite`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #14812 from maropu/SPARK-17237.
-
Felix Cheung authored
## What changes were proposed in this pull request? Lower the "block locks were not released" log to info level, as it generates a lot of warnings when running ML and graph calls, as pointed out in the JIRA. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16513 from felixcheung/blocklockswarn.
-
Liang-Chi Hsieh authored
## What changes were proposed in this pull request? During SparkSession initialization, we store the created SparkSession instance in a class variable _instantiatedContext. Next time we can use SparkSession.builder.getOrCreate() to retrieve the existing SparkSession instance. However, when the active SparkContext is stopped and we create another new SparkContext to use, the existing SparkSession is still associated with the stopped SparkContext, so operations on this existing SparkSession will fail. We need to detect such a case in SparkSession and renew the class variable _instantiatedContext if needed. ## How was this patch tested? New test added in PySpark. Author: Liang-Chi Hsieh <viirya@gmail.com> Closes #16454 from viirya/fix-pyspark-sparksession.
-
Wenchen Fan authored
## What changes were proposed in this pull request? Currently nondeterministic expressions are allowed in `Aggregate` (see the [comment](https://github.com/apache/spark/blob/v2.0.2/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala#L249-L251)), but the `PullOutNondeterministic` analyzer rule failed to handle `Aggregate`; this PR fixes it. Closes https://github.com/apache/spark/pull/16379 There is still one remaining issue: `SELECT a + rand() FROM t GROUP BY a + rand()` is not allowed, because the 2 `rand()` calls are different (we generate a random seed as the default seed for `rand()`). https://issues.apache.org/jira/browse/SPARK-19035 is tracking this issue. ## How was this patch tested? A new test suite. Author: Wenchen Fan <wenchen@databricks.com> Closes #16404 from cloud-fan/groupby.
-
Eric Liang authored
## What changes were proposed in this pull request? Currently in SQL we implement overwrites by calling fs.delete() directly on the original data. This is not ideal since the original files end up deleted even if the job aborts. We should extend the commit protocol to allow file overwrites to be managed as well. ## How was this patch tested? Existing tests. I also fixed a bunch of tests that were depending on the commit protocol implementation being set to the legacy mapreduce one. cc rxin cloud-fan Author: Eric Liang <ekl@databricks.com> Author: Eric Liang <ekhliang@gmail.com> Closes #16554 from ericl/add-delete-protocol.
-
zero323 authored
## What changes were proposed in this pull request? Removes `UserDefinedFunction._broadcast` and `UserDefinedFunction.__del__` method. ## How was this patch tested? Existing unit tests. Author: zero323 <zero323@users.noreply.github.com> Closes #16538 from zero323/SPARK-19164.
-
Yanbo Liang authored
## What changes were proposed in this pull request? The ```ml.R``` example depends on the ```e1071``` package; if it's not available in the user's environment, it will fail. I think the example should not depend on third-party packages, so I updated it to remove the dependency. ## How was this patch tested? Manual test. Author: Yanbo Liang <ybliang8@gmail.com> Closes #16548 from yanboliang/spark-19158.
-
- Jan 11, 2017
-
-
hyukjinkwon authored
## What changes were proposed in this pull request?
This PR proposes to throw an exception for both JDBC APIs when user-specified schemas are not allowed or useless.

**DataFrameReader.jdbc(...)**

```scala
spark.read.schema(StructType(Nil)).jdbc(...)
```

**DataFrameReader.table(...)**

```scala
spark.read.schema(StructType(Nil)).table("usrdb.test")
```

## How was this patch tested?
Unit test in `JDBCSuite` and `DataFrameReaderWriterSuite`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14451 from HyukjinKwon/SPARK-16848.
-
wangzhenhua authored
## What changes were proposed in this pull request? In this pr, we add more test cases for project and aggregate estimation. ## How was this patch tested? Add test cases. Author: wangzhenhua <wangzhenhua@huawei.com> Closes #16551 from wzhfy/addTests.
-
Reynold Xin authored
## What changes were proposed in this pull request? This patch simplifies slightly the logical plan statistics cache implementation, as discussed in https://github.com/apache/spark/pull/16529 ## How was this patch tested? N/A - this has no behavior change. Author: Reynold Xin <rxin@databricks.com> Closes #16544 from rxin/SPARK-19149.
-
jiangxingbo authored
## What changes were proposed in this pull request?
We should be able to resolve a nested view. The main advantage is that if you update an underlying view, the current view also gets updated. The new approach should be compatible with older versions of SPARK/HIVE, which means:
1. The new approach should be able to resolve the views created by older versions of SPARK/HIVE;
2. The new approach should be able to resolve the views that are currently supported by SPARK SQL.

The new approach mainly brings in the following changes:
1. Add a new operator called `View` to keep track of the CatalogTable that describes the view, and the output attributes as well as the child of the view;
2. Update the `ResolveRelations` rule to resolve the relations and views; note that a nested view should be resolved correctly;
3. Add a `viewDefaultDatabase` variable to `CatalogTable` to keep track of the default database name used to resolve a view; if the `CatalogTable` is not a view, the variable should be `None`;
4. Add `AnalysisContext` to enable us to still support a view created with a CTE/Window query;
5. Enable view support without enabling Hive support (i.e., enableHiveSupport);
6. Fix a weird behavior: the result of a view query may have a different schema if the referenced table has been changed. After this PR, we try to cast the child output attributes to those from the view schema, and throw an AnalysisException if the cast is not allowed.

Note this is compatible with the views defined by older versions of Spark (before 2.2), which have an empty `defaultDatabase` and all the relations in `viewText` have the database part defined.

## How was this patch tested?
1. Added new tests in `SessionCatalogSuite` to test the function `lookupRelation`;
2. Added a new test case in `SQLViewSuite` to test resolving a nested view.

Author: jiangxingbo <jiangxb1987@gmail.com>

Closes #16233 from jiangxb1987/resolve-view.
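As a usage-level sketch of what nested view resolution means (the table and view names below are made up):

```scala
// Hypothetical setup: v2 is defined on top of v1, which is defined on a table.
spark.sql("CREATE TABLE t (id INT, name STRING) USING parquet")
spark.sql("CREATE VIEW v1 AS SELECT id, name FROM t")
spark.sql("CREATE VIEW v2 AS SELECT id FROM v1")

// Resolving v2 requires resolving the nested view v1 first; after this change,
// updating v1's definition is reflected the next time v2 is queried.
spark.sql("SELECT * FROM v2").show()
```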
-
Bryan Cutler authored
[SPARK-17568][CORE][DEPLOY] Add spark-submit option to override ivy settings used to resolve packages/artifacts ## What changes were proposed in this pull request? Add an option to spark-submit to allow overriding the default IvySettings used to resolve artifacts as part of the Spark Packages functionality. This will allow all artifact resolution to go through a centrally managed repository, such as Nexus or Artifactory, where site admins can better approve and control what is used with Spark apps. This change restructures the creation of the IvySettings object in two distinct ways. First, if the `spark.ivy.settings` option is not defined then `buildIvySettings` will create a default settings instance, as before, with the defined repositories (Maven Central) included. Second, if the option is defined, the ivy settings file will be loaded from the given path and only repositories defined within will be used for artifact resolution. ## How was this patch tested? Existing tests for the default behaviour, and manual tests that load an ivysettings.xml file with local and Nexus repositories defined. Added a new test to load a simple Ivy settings file with a local filesystem resolver. Author: Bryan Cutler <cutlerb@gmail.com> Author: Ian Hummel <ian@themodernlife.net> Closes #15119 from BryanCutler/spark-custom-IvySettings.
-
Felix Cheung authored
## What changes were proposed in this pull request?
```
df$foo <- 1
```
instead of
```
df$foo <- lit(1)
```

## How was this patch tested?
Unit tests.

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16510 from felixcheung/rlitcol.
-
jerryshao authored
Currently Spark can only get the token renewal interval from secure HDFS (hdfs://); if Spark runs with other secure file systems like webHDFS (webhdfs://), wasb (wasb://), or ADLS, it will ignore these tokens and not get renewal intervals from them. This makes Spark unable to work with these secure clusters. So instead of only checking the HDFS token, we should generalize to support different DelegationTokenIdentifiers. ## How was this patch tested? Manually verified in a secure cluster. Author: jerryshao <sshao@hortonworks.com> Closes #16432 from jerryshao/SPARK-19021.
-
wangzhenhua authored
## What changes were proposed in this pull request? Currently we have two sets of statistics in LogicalPlan: simple stats and stats estimated by CBO, but the computing logic and naming are quite confusing; we need to unify these two sets of stats. ## How was this patch tested? Just modified existing tests. Author: wangzhenhua <wangzhenhua@huawei.com> Author: Zhenhua Wang <wzh_zju@163.com> Closes #16529 from wzhfy/unifyStats.
-
- Jan 10, 2017
-
-
Wenchen Fan authored
## What changes were proposed in this pull request? The analyzer rule that supports querying files directly is added to `Analyzer.extendedResolutionRules` when SparkSession is created, according to the `spark.sql.runSQLOnFiles` flag. If the flag is off when we create `SparkSession`, this rule is not added, and we cannot query files directly even if we turn on the flag later. This PR fixes this bug by always adding that rule to `Analyzer.extendedResolutionRules`. ## How was this patch tested? New regression test. Author: Wenchen Fan <wenchen@databricks.com> Closes #16531 from cloud-fan/sql-on-files.
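A sketch of the behavior being fixed (the file path is hypothetical):

```scala
// Enable direct SQL-on-files after the session has already been created;
// before this fix, the rule was only installed if the flag was on at
// session-creation time, so this query would still fail.
spark.conf.set("spark.sql.runSQLOnFiles", "true")

// Query a file directly by its path (hypothetical location).
spark.sql("SELECT * FROM parquet.`/data/events/part-00000.parquet`").show()
```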
-
Shixiong Zhu authored
## What changes were proposed in this pull request? This PR allows update mode for non-aggregation streaming queries. It behaves the same as append mode if a query has no aggregations. ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #16520 from zsxwing/update-without-agg.
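A minimal sketch of what this permits (the source and sink choices below are arbitrary):

```scala
// A streaming query with no aggregation; after this change it can be started
// with output mode "update" (equivalent to "append" in this case).
val lines = spark.readStream
  .format("socket")
  .option("host", "localhost")
  .option("port", 9999)
  .load()

val query = lines.writeStream
  .outputMode("update")
  .format("console")
  .start()
```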
-
Sean Owen authored
## What changes were proposed in this pull request? Updates to libthrift 0.9.3 to address a CVE. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #16530 from srowen/SPARK-18997.
-
Felix Cheung authored
## What changes were proposed in this pull request? R family is a longer list than what Spark supports. ## How was this patch tested? manual Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #16511 from felixcheung/rdocglmfamily.
-