Commits · c875d81a3de3f209b9eb03adf96b7c740b2c7b52 · cs525-sp18-g07 / spark

May 25, 2016

[SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV · c875d81a

Jurriaan Pruis authored 8 years ago

## What changes were proposed in this pull request?

Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.

See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247

This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)
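As a rough illustration of the RFC 4180 rule this default targets (a double quote inside a quoted field is escaped by doubling it), here is a minimal sketch that uses only Python's standard `csv` module. It is not Spark or univocity code, and `to_csv_line` is a hypothetical helper name chosen for this example.

```python
import csv
import io


def to_csv_line(fields):
    """Serialize one record, doubling embedded quotes as RFC 4180 section 2 describes."""
    buf = io.StringIO()
    writer = csv.writer(
        buf,
        quoting=csv.QUOTE_MINIMAL,  # quote only fields that need it
        doublequote=True,           # escape " as "" inside a quoted field
        lineterminator="\r\n",      # RFC 4180 uses CRLF line breaks
    )
    writer.writerow(fields)
    return buf.getvalue()


line = to_csv_line(['say "hi"', "plain"])
print(line, end="")
```

Reading the line back with `csv.reader` should recover the original field, which is the property a downstream RFC 4180 parser relies on.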

https://issues.apache.org/jira/browse/SPARK-15493

## How was this patch tested?

Added a test that verifies the output is quoted correctly.

Author: Jurriaan Pruis <email@jurriaanpruis.nl>

Closes #13267 from jurriaan/quote-escaping.


[SPARK-15483][SQL] IncrementalExecution should use extra strategies. · 4b880674

Takuya UESHIN authored 8 years ago

## What changes were proposed in this pull request?

Extra strategies do not work for streams because `IncrementalExecution` uses a modified planner with stateful operations, and that planner does not include the extra strategies.

This PR fixes `IncrementalExecution` so that its planner includes the extra strategies.

## How was this patch tested?

I added a test to check if extra strategies work for streams.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #13261 from ueshin/issues/SPARK-15483.


[SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS · 1cb347fb

Nick Pentreath authored 8 years ago

Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.

We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.

Tests N/A.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.


[MINOR][MLLIB][STREAMING][SQL] Fix typos · 02c8072e

lfzCarlosC authored 8 years ago

Fixed typos in the source code of the [mllib], [streaming] and [SQL] components.

Testing: none needed; the changes are obvious.

Author: lfzCarlosC <lfz.carlos@gmail.com>

Closes #13298 from lfzCarlosC/master.


[MINOR][CORE] Fix a HadoopRDD log message and remove unused imports in rdd files. · d6d3e507

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

This PR fixes the following typos in log message and comments of `HadoopRDD.scala`. Also, this removes unused imports.
```scala
-      logWarning("Caching NewHadoopRDDs as deserialized objects usually leads to undesired" +
+      logWarning("Caching HadoopRDDs as deserialized objects usually leads to undesired" +
...
-      // since its not removed yet
+      // since it's not removed yet
```

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13294 from dongjoon-hyun/minor_rdd_fix_log_message.


[SPARK-15520][SQL] SparkSession builder in python should also allow overriding... · 8239fdcb

Eric Liang authored 8 years ago

[SPARK-15520][SQL] SparkSession builder in python should also allow overriding confs of existing sessions

## What changes were proposed in this pull request?

This fixes the python SparkSession builder to allow setting confs correctly. This was a leftover TODO from https://github.com/apache/spark/pull/13200.

## How was this patch tested?

Python doc tests.

cc andrewor14

Author: Eric Liang <ekl@databricks.com>

Closes #13289 from ericl/spark-15520.


[SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when this... · 01e7b9c8

Jeff Zhang authored 8 years ago

[SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when there is already an existing SparkContext

## What changes were proposed in this pull request?

Override the existing SparkContext if the provided SparkConf is different. The PySpark part hasn't been fixed yet; I will do that after the first round of review to ensure this is the correct approach.

## How was this patch tested?

Manually verify it in spark-shell.

rxin  Please help review it, I think this is a very critical issue for spark 2.0

Author: Jeff Zhang <zjffdu@apache.org>

Closes #13160 from zjffdu/SPARK-15345.


[SPARK-9044] Fix "Storage" tab in UI so that it reflects RDD name change. · b120fba6

Lukasz authored 8 years ago

## What changes were proposed in this pull request?

1. Making 'name' field of RDDInfo mutable.
2. In StorageListener: catching the fact that RDD's name was changed and updating it in RDDInfo.

## How was this patch tested?

1. Manual verification - the 'Storage' tab now behaves as expected.
2. The commit also contains a new unit test which verifies this.

Author: Lukasz <lgieron@gmail.com>

Closes #13264 from lgieron/SPARK-9044.


[SPARK-15436][SQL] Remove DescribeFunction and ShowFunctions · 4f27b8dd

Reynold Xin authored 8 years ago

## What changes were proposed in this pull request?
This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.

## How was this patch tested?
Created a new SparkSqlParserSuite.

Author: Reynold Xin <rxin@databricks.com>

Closes #13292 from rxin/SPARK-15436.


[SPARK-12071][DOC] Document the behaviour of NA in R · 9082b796

Krishna Kalyan authored 8 years ago

## What changes were proposed in this pull request?

Added a note to the "Upgrading From SparkR 1.5.x to 1.6.x" section stating that Spark SQL converts `NA` in R to `null`.

## How was this patch tested?

Document update, no tests.

Author: Krishna Kalyan <krishnakalyan3@gmail.com>

Closes #13268 from krishnakalyan3/spark-12071-1.


[SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc... · cd9f1690

Holden Karau authored 8 years ago

[SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build instructions

## What changes were proposed in this pull request?

PySpark: Add links to the predictors from the models in regression.py, improve linear and isotonic pydoc in minor ways.
User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" install on ubuntu and add sudo to match the rest of the commands.
User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some ubuntu but maybe more).

## How was this patch tested?

built pydocs locally, tested new user build instructions

Author: Holden Karau <holden@us.ibm.com>

Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.


[SPARK-15508][STREAMING][TESTS] Fix flaky test: JavaKafkaStreamSuite.testKafkaStream · c9c1c0e5

Shixiong Zhu authored 8 years ago

## What changes were proposed in this pull request?

`JavaKafkaStreamSuite.testKafkaStream` assumes that when `sent.size == result.size`, the contents of `sent` and `result` are the same. However, that is not true: the content of `result` may not be final yet.

This PR modifies the test to always retry the assertions, even if the contents of `sent` and `result` are not the same.
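The retry idea can be sketched outside of Spark as follows. This is a minimal Python illustration, not the code of the actual test; `eventually`, `check`, and the simulated `sent`/`result` maps are hypothetical names for this example.

```python
import time


def eventually(assertion, timeout=10.0, interval=0.1):
    """Re-run `assertion` until it stops raising AssertionError or `timeout` seconds elapse."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            return assertion()
        except AssertionError:
            if time.monotonic() >= deadline:
                raise  # give up and surface the last failure
            time.sleep(interval)


# Simulated consumer: `result` has the same size as `sent` but is not final yet.
sent = {"a": 1, "b": 2}
result = {"a": 1, "b": 0}
polls = []


def check():
    polls.append(1)
    if len(polls) == 3:
        result["b"] = 2      # the final value arrives late
    assert result == sent    # compare contents, not just sizes


eventually(check, timeout=5.0, interval=0.01)
```

The point of the design is that a size match alone never ends the wait; only a passing content assertion (or the timeout) does.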

Here is the failure in Jenkins: http://spark-tests.appspot.com/tests/org.apache.spark.streaming.kafka.JavaKafkaStreamSuite/testKafkaStream

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13281 from zsxwing/flaky-kafka-test.


May 24, 2016

[SPARK-15498][TESTS] fix slow tests · 50b660d7

Wenchen Fan authored 8 years ago

## What changes were proposed in this pull request?

This PR fixes 3 slow tests:

1. `ParquetQuerySuite.read/write wide table`: This is not a good unit test as it runs for more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more "unit".
2. `ParquetQuerySuite.returning batch for wide table`: Reduce the threshold and use a smaller data size.
3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: Improving `CodeFormatter.format` (introduced in https://github.com/apache/spark/pull/12979) dramatically speeds this up.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13273 from cloud-fan/test.


[SPARK-15365][SQL] When table size statistics are not available from... · 4acababc

Parth Brahmbhatt authored 8 years ago

[SPARK-15365][SQL] When table size statistics are not available from metastore, we should fallback to HDFS

## What changes were proposed in this pull request?
Currently, if a table is used in a join operation, we rely on the size returned by the metastore to decide whether the operation can be converted to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally falls back to HDFS if the statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.
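The fallback decision can be sketched in a few lines. This is a simplified Python illustration under stated assumptions, not the patch itself: `estimate_table_size`, `can_broadcast`, and the threshold constant are hypothetical names, and the file system lookup is passed in as a plain callable.

```python
BROADCAST_THRESHOLD_BYTES = 10 * 1024 * 1024  # hypothetical threshold for this sketch


def estimate_table_size(metastore_size, filesystem_size_fn):
    """Prefer the metastore statistic; fall back to the file system when it is missing."""
    if metastore_size is not None and metastore_size > 0:
        return metastore_size
    return filesystem_size_fn()  # e.g. summing file lengths under the table location


def can_broadcast(size_in_bytes, threshold=BROADCAST_THRESHOLD_BYTES):
    """A table qualifies for a broadcast join when its estimated size fits the threshold."""
    return size_in_bytes <= threshold


# No metastore statistics: the size comes from the (simulated) file system instead.
size = estimate_table_size(None, lambda: 2048)
print(size, can_broadcast(size))
```

Passing the lookup as a callable keeps the potentially expensive file system scan lazy, so it only runs when the metastore has no usable statistic.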

## How was this patch tested?
I have executed queries locally to test.

Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>

Closes #13150 from Parth-Brahmbhatt/SPARK-15365.


[SPARK-15518] Rename various scheduler backend for consistency · 14494da8

Reynold Xin authored 8 years ago

## What changes were proposed in this pull request?
This patch renames various scheduler backends to make them consistent:

- LocalScheduler -> LocalSchedulerBackend
- AppClient -> StandaloneAppClient
- AppClientListener -> StandaloneAppClientListener
- SparkDeploySchedulerBackend -> StandaloneSchedulerBackend
- CoarseMesosSchedulerBackend -> MesosCoarseGrainedSchedulerBackend
- MesosSchedulerBackend -> MesosFineGrainedSchedulerBackend

## How was this patch tested?
Updated test cases to reflect the name change.

Author: Reynold Xin <rxin@databricks.com>

Closes #13288 from rxin/SPARK-15518.


[SPARK-15512][CORE] repartition(0) should raise IllegalArgumentException · f08bf587

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

Previously, SPARK-8893 added constraints requiring a positive number of partitions for repartition/coalesce operations in general. This PR adds one missing part of that and adds two explicit test cases.

**Before**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0).collect()
res1: Array[Int] = Array()   // empty
scala> spark.sql("select 1").coalesce(0)
res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").coalesce(0).collect()
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
scala> spark.sql("select 1").repartition(0)
res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
scala> spark.sql("select 1").repartition(0).collect()
res4: Array[org.apache.spark.sql.Row] = Array()  // empty
```

**After**
```scala
scala> sc.parallelize(1 to 5).coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> sc.parallelize(1 to 5).repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").coalesce(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
scala> spark.sql("select 1").repartition(0)
java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
...
```

## How was this patch tested?

Pass the Jenkins tests with new testcases.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13282 from dongjoon-hyun/SPARK-15512.


[SPARK-15458][SQL][STREAMING] Disable schema inference for streaming datasets on file streams · e631b819

Tathagata Das authored 8 years ago

## What changes were proposed in this pull request?

If the user relies on the schema being inferred for file streams, the query can break easily for multiple reasons:
- accidentally running on a directory which has no data
- schema changing underneath
- on restart, the query will infer schema again, and may unexpectedly infer incorrect schema, as the file in the directory may be different at the time of the restart.

To avoid these complicated scenarios, for Spark 2.0 we are going to disable schema inference by default behind a config, so that the user is forced to state explicitly which schema they want, rather than having the system try to infer it and run into weird corner cases.

In this PR, I introduce a SQLConf that determines whether schema inference for file streams is allowed or not. It is disabled by default.

## How was this patch tested?
Updated unit tests that test error behavior with and without schema inference enabled.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13238 from tdas/SPARK-15458.


[SPARK-15502][DOC][ML][PYSPARK] add guide note that ALS only supports integer ids · 20900e5f

Nick Pentreath authored 8 years ago

This PR adds a note to clarify that the ML API for ALS only supports integers for user/item ids, and that other types for these columns can be used but the ids must fall within integer range.
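To make the constraint concrete, the following minimal Python sketch checks whether an id of another numeric type can be used as an integer id. It assumes "integer range" means a 32-bit signed integer; `to_int_id` is a hypothetical helper, not part of the ALS API.

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1  # assumed range: 32-bit signed integer


def to_int_id(value):
    """Return `value` as an int if it is integral and within range, else raise ValueError."""
    as_int = int(value)
    if as_int != value:
        raise ValueError(f"id {value!r} is not an integer value")
    if not INT_MIN <= as_int <= INT_MAX:
        raise ValueError(f"id {value!r} is outside the integer range")
    return as_int


print(to_int_id(42.0))  # a double-typed id that is still a valid integer id
```

Ids that fail this kind of check would need to be remapped (for example through an index lookup) before being used as user or item ids.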

(Refer [SPARK-14891](https://issues.apache.org/jira/browse/SPARK-14891)).

Also cleaned up a reference to `mllib` in the ML doc.

## How was this patch tested?
Built and viewed User Guide doc locally.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13278 from MLnick/SPARK-15502-als-int-id-doc-note.


[MINOR][CORE][TEST] Update obsolete `takeSample` test case. · be99a99f

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

This PR fixes some obsolete comments and assertion in `takeSample` testcase of `RDDSuite.scala`.

## How was this patch tested?

This fixes the testcase only.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13260 from dongjoon-hyun/SPARK-15481.


[SPARK-15388][SQL] Fix spark sql CREATE FUNCTION with hive 1.2.1 · 784cc07d

wangyang authored 8 years ago

## What changes were proposed in this pull request?

spark.sql("CREATE FUNCTION myfunc AS 'com.haizhi.bdp.udf.UDFGetGeoCode'") throws "org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:NoSuchObjectException(message:Function default.myfunc does not exist))" with hive 1.2.1.

I think it is introduced by pr #12853. Fixing it by catching Exception (not NoSuchObjectException) and string matching.

## How was this patch tested?

added a unit test and also tested it manually

Author: wangyang <wangyang@haizhi.com>

Closes #13177 from wangyang1992/fixCreateFunc2.


[SPARK-15405][YARN] Remove unnecessary upload of config archive. · a313a5ae

Marcelo Vanzin authored 8 years ago

We only need one copy of it. The client code that was uploading the
second copy just needs to be modified to update the metadata in the
cache, so that the AM knows where to find the configuration.

Tested by running app on YARN and verifying in the logs only one archive
is uploaded.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #13232 from vanzin/SPARK-15405.


[SPARK-15433] [PYSPARK] PySpark core test should not use SerDe from PythonMLLibAPI · 695d9a0f

Liang-Chi Hsieh authored 8 years ago

## What changes were proposed in this pull request?

Currently PySpark core test uses the `SerDe` from `PythonMLLibAPI` which includes many MLlib things. It should use `SerDeUtil` instead.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13214 from viirya/pycore-use-serdeutil.


[SPARK-13135] [SQL] Don't print expressions recursively in generated code · f8763b80

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

This PR is an up-to-date and slightly improved version of rxin's #11019 for
- (1) preventing recursive printing of expressions in generated code.

Since that is indeed the major function of this PR, he should be credited for the work he did. In addition to #11019, this PR improves the following in code generation.
- (2) Improve multiline comment indentation.
- (3) Reduce the number of empty lines (mainly consecutive empty lines).
- (4) Remove all space characters on empty lines.
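Items (3) and (4) above amount to a small text normalization pass. The sketch below shows one way to express it in Python; it is an illustration only, not the logic of the actual `CodeFormatter`, and `tidy_blank_lines` is a hypothetical name.

```python
def tidy_blank_lines(code):
    """Strip whitespace-only lines and collapse runs of empty lines into one."""
    out = []
    for line in code.split("\n"):
        if line.strip() == "":
            line = ""                      # (4) no space characters on empty lines
            if out and out[-1] == "":
                continue                   # (3) drop consecutive empty lines
        out.append(line)
    return "\n".join(out)


print(repr(tidy_blank_lines("a\n   \n\n\t\nb")))
```

Non-empty lines pass through untouched, so indentation of real code and comments is preserved.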

**Example**
```scala
spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6)
```

**Before**
```
Generated code:
/* 001 */ public Object generate(Object[] references) {
...
/* 005 */ /**
/* 006 */ * Codegend pipeline for
/* 007 */ * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 008 */ * +- Range 1, 1, 8, 999, [id#0L]
/* 009 */ */
...
/* 075 */     // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 076 */
/* 077 */     // PRODUCE: Range 1, 1, 8, 999, [id#0L]
/* 078 */
/* 079 */     // initialize Range
...
/* 092 */       // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 093 */
/* 094 */       // CONSUME: WholeStageCodegen
/* 095 */
/* 096 */       // (((input[0, bigint, false] + 1) + 2) + 3)
/* 097 */       // ((input[0, bigint, false] + 1) + 2)
/* 098 */       // (input[0, bigint, false] + 1)
...
/* 107 */       // (((input[0, bigint, false] + 4) + 5) + 6)
/* 108 */       // ((input[0, bigint, false] + 4) + 5)
/* 109 */       // (input[0, bigint, false] + 4)
...
/* 126 */ }
```

**After**
```
Generated code:
/* 001 */ public Object generate(Object[] references) {
...
/* 005 */ /**
/* 006 */  * Codegend pipeline for
/* 007 */  * Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 008 */  * +- Range 1, 1, 8, 999, [id#0L]
/* 009 */  */
...
/* 075 */     // PRODUCE: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 076 */     // PRODUCE: Range 1, 1, 8, 999, [id#0L]
/* 077 */     // initialize Range
...
/* 090 */       // CONSUME: Project [(((id#0L + 1) + 2) + 3) AS (((id + 1) + 2) + 3)#3L,(((id#0L + 4) + 5) + 6) AS (((id + 4) + 5) + 6)#4L]
/* 091 */       // CONSUME: WholeStageCodegen
/* 092 */       // (((input[0, bigint, false] + 1) + 2) + 3)
...
/* 101 */       // (((input[0, bigint, false] + 4) + 5) + 6)
...
/* 118 */ }
```

## How was this patch tested?

Pass the Jenkins tests and see the result of the following command manually.
```scala
scala> spark.range(1, 1000).select('id+1+2+3, 'id+4+5+6).queryExecution.debug.codegen()
```

Author: Dongjoon Hyun <dongjoon@apache.org>
Author: Reynold Xin <rxin@databricks.com>

Closes #13192 from dongjoon-hyun/SPARK-13135.


[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work · c24b6b67

Liang-Chi Hsieh authored 8 years ago

## What changes were proposed in this pull request?

Jackson supports the `allowNonNumericNumbers` option to parse non-standard non-numeric numbers such as "NaN", "Infinity" and "INF". The currently used Jackson version (2.5.3) does not support all of them. This patch upgrades the library and makes the two ignored tests in `JsonParsingOptionsSuite` pass.

## How was this patch tested?

`JsonParsingOptionsSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9759 from viirya/fix-json-nonnumric.


[SPARK-15442][ML][PYSPARK] Add 'relativeError' param to PySpark QuantileDiscretizer · 6075f5b4

Nick Pentreath authored 8 years ago

This PR adds the `relativeError` param to PySpark's `QuantileDiscretizer` to match Scala.

Also cleaned up a duplication of `numBuckets` where the param is both a class and instance attribute (I removed the instance attr to match the style of params throughout `ml`).

Finally, cleaned up the docs for `QuantileDiscretizer` to reflect that it now uses `approxQuantile`.

## How was this patch tested?

A little doctest and built API docs locally to check HTML doc generation.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13228 from MLnick/SPARK-15442-py-relerror-param.


[SPARK-15397][SQL] fix string udf locate as hive · d642b273

Daoyuan Wang authored 8 years ago

## What changes were proposed in this pull request?

In Hive, `locate("aa", "aaa", 0)` yields 0, `locate("aa", "aaa", 1)` yields 1 and `locate("aa", "aaa", 2)` yields 2, while in Spark, `locate("aa", "aaa", 0)` yields 1, `locate("aa", "aaa", 1)` yields 2 and `locate("aa", "aaa", 2)` yields 0. This results from a different interpretation of the third parameter of the `locate` UDF. It is the starting index and it is 1-based, so passing 0 always returns 0.
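The Hive semantics described here can be mirrored in a few lines of Python. This is a sketch of the expected behavior, not the Spark implementation.

```python
def locate(substr, s, pos=1):
    """1-based position of the first `substr` in `s` at or after `pos`; 0 if absent or pos < 1."""
    if pos < 1:
        return 0
    # str.find returns -1 when nothing is found, which maps to 0 after the +1 shift.
    return s.find(substr, pos - 1) + 1


for start in (0, 1, 2):
    print(start, locate("aa", "aaa", start))
```

Treating `pos` as a 1-based start index makes the return value and the argument use the same coordinate system, with 0 reserved for "not found".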

## How was this patch tested?

tested with modified `StringExpressionsSuite` and `StringFunctionsSuite`

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #13186 from adrian-wang/locate.


May 23, 2016

Revert "[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB" · de726b0d
Andrew Or authored 8 years ago
```
This reverts commit fa244e5a.
```

[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows beyond 64 KB · fa244e5a

Kazuaki Ishizaki authored 8 years ago

## What changes were proposed in this pull request?

This PR splits the generated code for `SafeProjection.apply` by using `ctx.splitExpressions()`. This is because the large code body for `NewInstance` may grow beyond the 64KB bytecode size limit for the `apply()` method.

## How was this patch tested?

Added new tests

Author: Kazuaki Ishizaki <ishizaki@jp.ibm.com>

Closes #13243 from kiszk/SPARK-15285.


[SPARK-15485][SQL][DOCS] Spark SQL Configuration · d2077164

gatorsmile authored 8 years ago

#### What changes were proposed in this pull request?
So far, the page Configuration in the official documentation does not have a section for Spark SQL.
http://spark.apache.org/docs/latest/configuration.html

For Spark users, the information and default values of these public configuration parameters are very useful. This PR is to add this missing section to the configuration.html.

rxin yhuai marmbrus

#### How was this patch tested?
Below is the generated webpage.
<img width="924" alt="screenshot 2016-05-23 11 35 57" src="https://cloud.githubusercontent.com/assets/11567269/15480492/b08fefc4-20da-11e6-9fa2-7cd5b699ed35.png">
<img width="914" alt="screenshot 2016-05-23 11 37 38" src="https://cloud.githubusercontent.com/assets/11567269/15480499/c5f9482e-20da-11e6-95ff-10821add1af4.png">
<img width="923" alt="screenshot 2016-05-23 11 36 11" src="https://cloud.githubusercontent.com/assets/11567269/15480506/cbd81644-20da-11e6-9d27-effb716b2fac.png">
<img width="920" alt="screenshot 2016-05-23 11 36 18" src="https://cloud.githubusercontent.com/assets/11567269/15480511/d013e332-20da-11e6-854a-cf8813c46f36.png">

Author: gatorsmile <gatorsmile@gmail.com>

Closes #13263 from gatorsmile/configurationSQL.


[SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with... · a15ca553

WeichenXu authored 8 years ago

[SPARK-15464][ML][MLLIB][SQL][TESTS] Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code

## What changes were proposed in this pull request?

Replace SQLContext and SparkContext with SparkSession using builder pattern in python test code.

## How was this patch tested?

Existing test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13242 from WeichenXu123/python_doctest_update_sparksession.


[SPARK-15311][SQL] Disallow DML on Regular Tables when Using In-Memory Catalog · 5afd927a

gatorsmile authored 8 years ago

#### What changes were proposed in this pull request?
So far, when using In-Memory Catalog, we allow DDL operations for the tables. However, the corresponding DML operations are not supported for the tables that are neither temporary nor data source tables. For example,
```SQL
CREATE TABLE tabName(i INT, j STRING)
SELECT * FROM tabName
INSERT OVERWRITE TABLE tabName SELECT 1, 'a'
```
In the above example, before this PR fix, we will get very confusing exception messages for either `SELECT` or `INSERT`
```
org.apache.spark.sql.AnalysisException: unresolved operator 'SimpleCatalogRelation default, CatalogTable(`default`.`tbl`,CatalogTableType(MANAGED),CatalogStorageFormat(None,Some(org.apache.hadoop.mapred.TextInputFormat),Some(org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat),None,false,Map()),List(CatalogColumn(i,int,true,None), CatalogColumn(j,string,true,None)),List(),List(),List(),-1,,1463928681802,-1,Map(),None,None,None,List()), None;
```

This PR is to issue appropriate exceptions in this case. The message will be like
```
org.apache.spark.sql.AnalysisException: Please enable Hive support when operating non-temporary tables: `tbl`;
```
#### How was this patch tested?
Added a test case in `DDLSuite`.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #13093 from gatorsmile/selectAfterCreate.


[SPARK-15431][SQL] Support LIST FILE(s)|JAR(s) command natively · 01659bc5

Xin Wu authored 8 years ago

## What changes were proposed in this pull request?
Currently command `ADD FILE|JAR <filepath | jarpath>` is supported natively in SparkSQL. However, when this command is run, the file/jar is added to the resources that can not be looked up by `LIST FILE(s)|JAR(s)` command because the `LIST` command is passed to Hive command processor in Spark-SQL or simply not supported in Spark-shell. There is no way users can find out what files/jars are added to the spark context.
Refer to [Hive commands](https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Cli)

This PR is to support following commands:
`LIST (FILE[s] [filepath ...] | JAR[s] [jarfile ...])`

### For example:
##### LIST FILE(s)
```
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt")
res1: org.apache.spark.sql.DataFrame = []
scala> spark.sql("add file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt")
res2: org.apache.spark.sql.DataFrame = []

scala> spark.sql("list file hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt").show(false)
+----------------------------------------------+
|result                                        |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
+----------------------------------------------+

scala> spark.sql("list files").show(false)
+----------------------------------------------+
|result                                        |
+----------------------------------------------+
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test1.txt|
|hdfs://bdavm009.svl.ibm.com:8020/tmp/test.txt |
+----------------------------------------------+
```

##### LIST JAR(s)
```
scala> spark.sql("add jar /Users/xinwu/spark/core/src/test/resources/TestUDTF.jar")
res9: org.apache.spark.sql.DataFrame = [result: int]

scala> spark.sql("list jar TestUDTF.jar").show(false)
+---------------------------------------------+
|result                                       |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+

scala> spark.sql("list jars").show(false)
+---------------------------------------------+
|result                                       |
+---------------------------------------------+
|spark://192.168.1.234:50131/jars/TestUDTF.jar|
+---------------------------------------------+
```
## How was this patch tested?
New test cases are added for Spark-SQL, Spark-Shell and SparkContext API code path.

Author: Xin Wu <xinwu@us.ibm.com>
Author: xin Wu <xinwu@us.ibm.com>

Closes #13212 from xwu0226/list_command.


[MINOR][SPARKR][DOC] Add a description for running unit tests in Windows · a8e97d17

hyukjinkwon authored 8 years ago

## What changes were proposed in this pull request?

This PR adds the description for running unit tests in Windows.

## How was this patch tested?

On a bare machine (Windows 7, 32-bit), this was manually built and tested.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13217 from HyukjinKwon/minor-r-doc.


[SPARK-15315][SQL] Adding error check to the CSV datasource writer for... · 03c7b7c4

sureshthalamati authored 8 years ago

[SPARK-15315][SQL] Adding error check to the CSV datasource writer for unsupported complex data types.

## What changes were proposed in this pull request?

Adds error handling to the CSV writer for unsupported complex data types. Currently, garbage gets written to the output CSV files if the DataFrame schema has complex data types.

## How was this patch tested?

Added new unit test case.

Author: sureshthalamati <suresh.thalamati@gmail.com>

Closes #13105 from sureshthalamati/csv_complex_types_SPARK-15315.


[MINOR][SQL][DOCS] Add notes of the deterministic assumption on UDF functions · 37c617e4

Dongjoon Hyun authored 8 years ago

## What changes were proposed in this pull request?

Spark assumes that UDF functions are deterministic. This PR adds explicit notes about that.

## How was this patch tested?

It's only about docs.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13087 from dongjoon-hyun/SPARK-15282.


[SPARK-15279][SQL] Catch conflicting SerDe when creating table · 2585d2b3

Andrew Or authored 8 years ago

## What changes were proposed in this pull request?

The user may do something like:
```
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS PARQUET
CREATE TABLE my_tab ROW FORMAT SERDE 'anything' STORED AS ... SERDE 'myserde'
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ORC
CREATE TABLE my_tab ROW FORMAT DELIMITED ... STORED AS ... SERDE 'myserde'
```
None of these should be allowed because the SerDes conflict. As of this patch:
- `ROW FORMAT DELIMITED` is only compatible with `TEXTFILE`
- `ROW FORMAT SERDE` is only compatible with `TEXTFILE`, `RCFILE` and `SEQUENCEFILE`
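The two compatibility rules above can be written down as a small lookup table. The following Python sketch only illustrates the rule set; it is not the parser code from the patch, and `check_row_format` is a hypothetical name.

```python
# File formats each row format may be combined with, per the rules listed above.
COMPATIBLE_FORMATS = {
    "DELIMITED": {"TEXTFILE"},
    "SERDE": {"TEXTFILE", "RCFILE", "SEQUENCEFILE"},
}


def check_row_format(row_format, stored_as):
    """Raise ValueError when ROW FORMAT and STORED AS name conflicting SerDes."""
    allowed = COMPATIBLE_FORMATS[row_format.upper()]
    if stored_as.upper() not in allowed:
        raise ValueError(
            f"ROW FORMAT {row_format.upper()} is only compatible with "
            f"{', '.join(sorted(allowed))}, not {stored_as.upper()}"
        )


check_row_format("SERDE", "TEXTFILE")  # accepted
try:
    check_row_format("DELIMITED", "ORC")
except ValueError as err:
    print(err)
```

Keeping the rules in one table means a new file format only needs a single entry rather than another branch in the validation logic.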

## How was this patch tested?

New tests in `DDLCommandSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #13068 from andrewor14/row-format-conflict.


[SPARK-15471][SQL] ScalaReflection cleanup · 07c36a2f

Wenchen Fan authored 8 years ago

## What changes were proposed in this pull request?

1. simplify the logic of deserializing option type.
2. simplify the logic of serializing array type, and remove silentSchemaFor
3. remove some unnecessary code.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13250 from cloud-fan/encoder.


[SPARK-14031][SQL] speedup CSV writer · 80091b8a

Davies Liu authored 8 years ago

## What changes were proposed in this pull request?

Currently, we create a CSVWriter for every row, which is very expensive and memory hungry; it took about 15 seconds to write out 1 million rows (two columns).

This PR writes the rows in batch mode, creating a CSVWriter for every 1k rows, which can write out 1 million rows in about 1 second (15X faster).
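The batching idea can be sketched with Python's standard `csv` module: one writer is created per batch of rows rather than per row. This is an analogy under stated assumptions, not the Spark writer; `write_csv_in_batches` is a hypothetical name and the batch size of 1000 simply mirrors the figure in the description.

```python
import csv
import io


def write_csv_in_batches(rows, batch_size=1000):
    """Yield CSV text chunks, creating one writer per batch of rows instead of one per row."""
    buf, writer, count = None, None, 0
    for row in rows:
        if writer is None:
            buf = io.StringIO()
            writer = csv.writer(buf, lineterminator="\n")
        writer.writerow(row)
        count += 1
        if count == batch_size:
            yield buf.getvalue()
            buf, writer, count = None, None, 0
    if writer is not None:
        yield buf.getvalue()  # flush the final, partially filled batch


chunks = list(write_csv_in_batches(((i, f"v{i}") for i in range(2500)), batch_size=1000))
print(len(chunks))
```

The per-writer setup cost is paid once per batch, while the bounded batch size keeps the in-memory buffer small.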

## How was this patch tested?

Manually benchmark it.

Author: Davies Liu <davies@databricks.com>

Closes #13229 from davies/csv_writer.


[SPARK-15425][SQL] Disallow cross joins by default · dafcb05c

Sameer Agarwal authored 8 years ago

## What changes were proposed in this pull request?

In order to prevent users from inadvertently writing queries with cartesian joins, this patch introduces a new conf `spark.sql.crossJoin.enabled` (set to `false` by default). Unless it is enabled, a query that contains one or more cartesian products fails with a `SparkException`.

## How was this patch tested?

Added a test to verify the new behavior in `JoinSuite`. Additionally, `SQLQuerySuite` and `SQLMetricsSuite` were modified to explicitly enable cartesian products.

Author: Sameer Agarwal <sameer@databricks.com>

Closes #13209 from sameeragarwal/disallow-cartesian.


May 22, 2016

[SPARK-15379][SQL] check special invalid date · fc44b694

wangyang authored 8 years ago

## What changes were proposed in this pull request?

When an invalid date string like "2015-02-29 00:00:00" is cast to a date or timestamp in Spark SQL, it used to return another valid date (2015-03-01 in this case) instead of null.
In this PR, invalid date strings like "2015-02-29" and "2016-04-31" return null when cast to a date or timestamp.
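The intended behavior, rejecting a calendar date that does not exist instead of rolling it over, can be sketched in Python as follows. This only illustrates the semantics and is not the Spark cast implementation; `cast_to_date` is a hypothetical name.

```python
from datetime import date


def cast_to_date(text):
    """Parse 'YYYY-MM-DD[ hh:mm:ss]' and return a date, or None if the calendar date is invalid."""
    try:
        year, month, day = (int(part) for part in text.split(" ")[0].split("-"))
        return date(year, month, day)  # raises ValueError for e.g. February 29 in a non-leap year
    except ValueError:
        return None


print(cast_to_date("2015-02-29 00:00:00"))  # invalid: 2015 is not a leap year
print(cast_to_date("2016-04-31"))           # invalid: April has 30 days
```

Returning `None` here plays the role of SQL `null`: the caller can tell the input was invalid instead of silently receiving a different date.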

## How was this patch tested?

Unit tests are added.


Author: wangyang <wangyang@haizhi.com>

Closes #13169 from wangyang1992/invalid_date.
