  1. May 26, 2016
    • Sean Zhong's avatar
      [SPARK-13445][SQL] Improves error message and add test coverage for Window function · b5859e0b
      Sean Zhong authored
      ## What changes were proposed in this pull request?
      
      Add a more verbose error message when the ORDER BY clause is missing when using a window function.
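      For illustration, a hypothetical query (table and column names are made up, and a `SparkSession` named `spark` is assumed) that hits the improved message:

      ```scala
      // Ranking window functions such as row_number() require an ORDER BY in the
      // OVER clause; omitting it should now fail with a clearer explanation that
      // the window must be ordered, rather than a terse analysis error.
      spark.sql(
        """SELECT name, row_number() OVER (PARTITION BY dept) AS rn
          |FROM employees""".stripMargin)
      ```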
      
      ## How was this patch tested?
      
      Unit test.
      
      Author: Sean Zhong <seanzhong@databricks.com>
      
      Closes #13333 from clockfly/spark-13445.
      b5859e0b
    • Sean Owen's avatar
      [SPARK-15457][MLLIB][ML] Eliminate some warnings from MLlib about deprecations · b0a03fee
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Several classes and methods have been deprecated and are creating lots of build warnings in branch-2.0. This issue is to identify and fix those items:
      * WithSGD classes: Change to make class not deprecated, object deprecated, and public class constructor deprecated. Any public use will require a deprecated API. We need to keep a non-deprecated private API since we cannot eliminate certain uses: Python API, streaming algs, and examples.
        * Use in PythonMLlibAPI: Change to using private constructors
        * Streaming algs: No warnings after we un-deprecate the classes
        * Examples: Deprecate or change ones which use deprecated APIs
      * MulticlassMetrics fields (precision, etc.)
      * LinearRegressionSummary.model field
      
      ## How was this patch tested?
      
      Existing tests.  Checked for warnings manually.
      
      Author: Sean Owen <sowen@cloudera.com>
      Author: Joseph K. Bradley <joseph@databricks.com>
      
      Closes #13314 from jkbradley/warning-cleanups.
      b0a03fee
    • Reynold Xin's avatar
      [SPARK-15552][SQL] Remove unnecessary private[sql] methods in SparkSession · 0f61d6ef
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      SparkSession has a list of unnecessary private[sql] methods. These methods cause some trouble because private[sql] does not apply to Java callers (Scala's qualified-private visibility is not enforced in bytecode, so the methods are effectively public from Java). In the cases where they are easy to remove, we can simply remove them. This patch does that.
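      A minimal sketch (the member name is illustrative) of why `private[sql]` is not a real barrier for Java callers:

      ```scala
      package org.apache.spark.sql

      class InternalHolder {
        // Scala compiles qualified-private members like this one to public JVM
        // methods, so Java code in any package can still call them; only Scala
        // code outside org.apache.spark.sql is stopped at compile time.
        private[sql] def internalHelper(): Int = 42
      }
      ```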
      
      As part of this pull request, I also replaced a bunch of protected[sql] with private[sql], to tighten up visibility.
      
      ## How was this patch tested?
      Updated test cases to reflect the changes.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13319 from rxin/SPARK-15552.
      0f61d6ef
    • Eric Liang's avatar
      [SPARK-15520][SQL] Also set sparkContext confs when using SparkSession builder in pyspark · 594a1bf2
      Eric Liang authored
      ## What changes were proposed in this pull request?
      
      Also sets confs on the underlying `sc` when using `SparkSession.builder.getOrCreate()`. This is a bug fix following up on a post-merge comment in https://github.com/apache/spark/pull/13289
      
      ## How was this patch tested?
      
      Python doc-tests.
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13309 from ericl/spark-15520-1.
      594a1bf2
    • Andrew Or's avatar
      [SPARK-15539][SQL] DROP TABLE throw exception if table doesn't exist · 2b1ac6ce
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Same as #13302, but for DROP TABLE.
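      An illustration of the intended behaviour (table name is made up; the exact exception type is not shown here):

      ```scala
      // Before this change, dropping a missing table could silently do nothing.
      spark.sql("DROP TABLE does_not_exist")            // now fails with an error
      spark.sql("DROP TABLE IF EXISTS does_not_exist")  // still a silent no-op
      ```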
      
      ## How was this patch tested?
      
      `DDLSuite`
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13307 from andrewor14/drop-table.
      2b1ac6ce
    • Steve Loughran's avatar
      [SPARK-13148][YARN] document zero-keytab Oozie application launch; add diagnostics · 01b350a4
      Steve Loughran authored
      This patch provides detail on what to do for keytab-less Oozie launches of Spark apps, and adds some debug-level diagnostics of what credentials have been submitted.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      Author: Steve Loughran <stevel@apache.org>
      
      Closes #11033 from steveloughran/stevel/feature/SPARK-13148-oozie.
      01b350a4
    • felixcheung's avatar
      [SPARK-10903][SPARKR] R - Simplify SQLContext method signatures and use a singleton · c76457c8
      felixcheung authored
      Eliminate the need to pass sqlContext to methods, since it is a singleton - and we don't want to support multiple contexts in an R session.
      
      Changes are done in a backward-compatible way, with a deprecation warning added. Method signatures for S3 methods are added in a concise, clean approach such that in the next release the deprecated signatures can be taken out easily/cleanly (just delete a few lines per method).
      
      Custom method dispatch is implemented to allow for multiple JVM reference types that are all 'jobj' in R and to avoid having to add 30 new exports.
      
      Author: felixcheung <felixcheung_m@hotmail.com>
      
      Closes #9192 from felixcheung/rsqlcontext.
      c76457c8
    • Villu Ruusmann's avatar
      [SPARK-15523][ML][MLLIB] Update JPMML to 1.2.15 · 6d506c9a
      Villu Ruusmann authored
      ## What changes were proposed in this pull request?
      
      See https://issues.apache.org/jira/browse/SPARK-15523
      
      This PR replaces PR #13293. It's isolated to a new branch, and contains some more squashed changes.
      
      ## How was this patch tested?
      
      1. Executed `mvn clean package` in `mllib` directory
      2. Executed `dev/test-dependencies.sh --replace-manifest` in the root directory.
      
      Author: Villu Ruusmann <villu.ruusmann@gmail.com>
      
      Closes #13297 from vruusmann/update-jpmml.
      6d506c9a
    • wm624@hotmail.com's avatar
      [SPARK-15492][ML][DOC] Binarization scala example copy & paste to spark-shell error · e451f7f0
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      The Binarization Scala example declares `val dataFrame: DataFrame = spark.createDataFrame(data).toDF("label", "feature")`, which cannot be pasted into spark-shell because `DataFrame` is not imported there. Compared with other examples, this explicit type annotation is not required.
      
      So I removed the explicit `DataFrame` type from the code.
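      A sketch of the change, assuming the example's existing `data` value and a `spark` session:

      ```scala
      // Before: cannot be pasted into spark-shell because DataFrame is not imported
      // val dataFrame: DataFrame = spark.createDataFrame(data).toDF("label", "feature")

      // After: drop the annotation and let the compiler infer the type
      val dataFrame = spark.createDataFrame(data).toDF("label", "feature")
      ```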
      ## How was this patch tested?
      
      Manually tested.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13266 from wangmiao1981/unit.
      e451f7f0
    • Bo Meng's avatar
      [SPARK-15537][SQL] fix dir delete issue · 53d4abe9
      Bo Meng authored
      ## What changes were proposed in this pull request?
      
      For some of the test cases, e.g. `OrcSourceSuite`, temp folders and temp files inside them are created, but after the tests finish the folders are not removed. This leaves lots of temp files behind and occupies space if we keep running the test cases.
      
      The reason is that `dir.delete()` does not work if the directory is not empty; we need to recursively delete the contents before deleting the folder itself.
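      A minimal sketch (not necessarily the exact helper used in the patch) of deleting a directory recursively:

      ```scala
      import java.io.File

      // File.delete() fails on a non-empty directory, so delete the children first.
      def deleteRecursively(file: File): Unit = {
        if (file.isDirectory) {
          Option(file.listFiles()).getOrElse(Array.empty[File]).foreach(deleteRecursively)
        }
        file.delete()
      }
      ```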
      
      ## How was this patch tested?
      
      Manually checked the temp folder to make sure the temp files were deleted.
      
      Author: Bo Meng <mengbo@hotmail.com>
      
      Closes #13304 from bomeng/SPARK-15537.
      53d4abe9
    • Reynold Xin's avatar
      [SPARK-15543][SQL] Rename DefaultSources to make them more self-describing · 361ebc28
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames various DefaultSources to make their names more self-describing. The choice of "DefaultSource" was from the days when we did not have a good way to specify short names.
      
      They are now named:
      - LibSVMFileFormat
      - CSVFileFormat
      - JdbcRelationProvider
      - JsonFileFormat
      - ParquetFileFormat
      - TextFileFormat
      
      Backward compatibility is maintained through aliasing.
      
      ## How was this patch tested?
      Updated relevant test cases too.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13311 from rxin/SPARK-15543.
      361ebc28
    • Imran Rashid's avatar
      [SPARK-10372] [CORE] basic test framework for entire spark scheduler · dfc9fc02
      Imran Rashid authored
      This is a basic framework for testing the entire scheduler.  The tests this adds aren't very interesting -- the point of this PR is just to set up the framework, keeping the initial change small, but it can be built upon to test more features (e.g., speculation, killing tasks, blacklisting, etc.).
      
      Author: Imran Rashid <irashid@cloudera.com>
      
      Closes #8559 from squito/SPARK-10372-scheduler-integs.
      dfc9fc02
  2. May 25, 2016
    • wm624@hotmail.com's avatar
      [SPARK-15439][SPARKR] Failed to run unit test in SparkR · 06bae8af
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
      There are some failures when running the SparkR unit tests.
      In this PR, I fixed two of these failures, in test_context.R and test_sparkSQL.R:
      - The first one is due to a different masked name; I added the missing names to the expected arrays.
      - The second one is because a PR removed the logic of a previous fix for the missing subset method.
      
      The file privilege issue is still there; I am debugging it. The SparkR shell can run this test case successfully:
      ```r
      test_that("pipeRDD() on RDDs", {
        actual <- collect(pipeRDD(rdd, "more"))
      ```
      but when using the run-tests script, it complains that the directory does not exist:
      ```
      cannot open file '/tmp/Rtmp4FQbah/filee2273f9d47f7': No such file or directory
      ```
      
      ## How was this patch tested?
      
      Manually tested.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #13284 from wangmiao1981/R.
      06bae8af
    • Sameer Agarwal's avatar
      [SPARK-15533][SQL] Deprecate Dataset.explode · 06ed1fa3
      Sameer Agarwal authored
      ## What changes were proposed in this pull request?
      
      This patch deprecates `Dataset.explode` and documents appropriate workarounds to use `flatMap()` or `functions.explode()` instead.
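      For illustration, a self-contained sketch (made-up data and column names) of the documented workarounds:

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.functions.{col, explode}

      object ExplodeWorkarounds {
        case class Entry(id: Long, words: Seq[String])

        def main(args: Array[String]): Unit = {
          val spark = SparkSession.builder().master("local[*]").appName("explode-demo").getOrCreate()
          import spark.implicits._

          val ds = Seq(Entry(1L, Seq("a", "b")), Entry(2L, Seq("c"))).toDS()

          // Instead of ds.explode(...): use the explode() function on the array column ...
          ds.select(col("id"), explode(col("words")).as("word")).show()

          // ... or flatMap on the typed Dataset.
          ds.flatMap(e => e.words.map(w => (e.id, w))).toDF("id", "word").show()

          spark.stop()
        }
      }
      ```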
      
      ## How was this patch tested?
      
      N/A
      
      Author: Sameer Agarwal <sameer@databricks.com>
      
      Closes #13312 from sameeragarwal/deprecate.
      06ed1fa3
    • Herman van Hovell's avatar
      [SPARK-15525][SQL][BUILD] Upgrade ANTLR4 SBT plugin · 527499b6
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      The ANTLR4 SBT plugin has been moved from its own repo to one on bintray. The version was also changed from `0.7.10` to `0.7.11`. The latter actually broke our build (ihji has fixed this by also adding `0.7.10` and others to the bintray repo).
      
      This PR upgrades the SBT-ANTLR4 plugin and ANTLR4 to their most recent versions (`0.7.11`/`4.5.3`). I have also removed a few obsolete build configurations.
      
      ## How was this patch tested?
      Manually running SBT/Maven builds.
      
      Author: Herman van Hovell <hvanhovell@questtec.nl>
      
      Closes #13299 from hvanhovell/SPARK-15525.
      527499b6
    • Andrew Or's avatar
      [SPARK-15534][SPARK-15535][SQL] Truncate table fixes · ee682fe2
      Andrew Or authored
      ## What changes were proposed in this pull request?
      
      Two changes:
      - When things fail, `TRUNCATE TABLE` just returns nothing. Instead, we should throw exceptions.
      - Remove `TRUNCATE TABLE ... COLUMN`, which was never supported by either Spark or Hive.
      
      ## How was this patch tested?
      Jenkins.
      
      Author: Andrew Or <andrew@databricks.com>
      
      Closes #13302 from andrewor14/truncate-table.
      ee682fe2
    • Gio Borje's avatar
      Log warnings for numIterations * miniBatchFraction < 1.0 · 589cce93
      Gio Borje authored
      ## What changes were proposed in this pull request?
      
      Add a warning log for the case that `numIterations * miniBatchFraction < 1.0` during gradient descent. If the product of those two numbers is less than `1.0`, then not all training examples will be used during optimization. To put this concretely, suppose that `numExamples = 100`, `miniBatchFraction = 0.06` and `numIterations = 3`. Then 3 iterations will occur, each sampling approximately 6 examples. In the best case each of the 6 examples is unique, so at most 18/100 examples are used.
      
      This may be counter-intuitive to most users, and it led to an issue during the development of another Spark ML model: https://github.com/zhengruifeng/spark-libFM/issues/11. If a user actually does not need the full training data set, it would be easier and more intuitive to downsample explicitly with `RDD.sample`.
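      A minimal sketch of the check this patch adds (the actual wording and logger call in `GradientDescent` may differ):

      ```scala
      // Warn when the expected coverage of the training data is below 100%.
      def warnIfUnderSampling(numIterations: Int, miniBatchFraction: Double): Unit = {
        if (numIterations * miniBatchFraction < 1.0) {
          println(s"Not all examples will be used if numIterations * miniBatchFraction < 1.0: " +
            s"numIterations=$numIterations, miniBatchFraction=$miniBatchFraction")
        }
      }

      warnIfUnderSampling(3, 0.2)  // 0.6 < 1.0, so the warning fires
      ```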
      
      ## How was this patch tested?
      
      `build/mvn -DskipTests clean package` build succeeds
      
      Author: Gio Borje <gborje@linkedin.com>
      
      Closes #13265 from Hydrotoast/master.
      589cce93
    • Bryan Cutler's avatar
      [MINOR] [PYSPARK] [EXAMPLES] Changed examples to use SparkSession.sparkContext instead of _sc · 9c297df3
      Bryan Cutler authored
      ## What changes were proposed in this pull request?
      
      Some PySpark examples need a SparkContext and get it by accessing _sc directly from the session.  These examples should use the provided property `sparkContext` in `SparkSession` instead.
      
      ## How was this patch tested?
      Ran modified examples
      
      Author: Bryan Cutler <cutlerb@gmail.com>
      
      Closes #13303 from BryanCutler/pyspark-session-sparkContext-MINOR.
      9c297df3
    • Takuya UESHIN's avatar
      [SPARK-14269][SCHEDULER] Eliminate unnecessary submitStage() call. · 698ef762
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Currently, `submitStage()` is called for waiting stages on every iteration of the event loop in `DAGScheduler` to submit all waiting stages, but most of these calls are unnecessary because they are unrelated to any change in stage status.
      The only case in which we should try to submit waiting stages is when their parent stages have successfully completed.
      
      This elimination can improve `DAGScheduler` performance.
      
      ## How was this patch tested?
      
      Added some checks, relied on other existing tests, and verified with our own projects.
      
      We have a project bottle-necked by `DAGScheduler`, having about 2000 stages.
      
      Before this patch, almost all of the execution time in the `Driver` process was spent processing `submitStage()` in the `dag-scheduler-event-loop` thread; after this patch, the performance improved as follows:
      
      |        | total execution time | `dag-scheduler-event-loop` thread time | `submitStage()` |
      |--------|---------------------:|---------------------------------------:|----------------:|
      | Before |              760 sec |                                710 sec |         667 sec |
      | After  |              440 sec |                                 14 sec |          10 sec |
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #12060 from ueshin/issues/SPARK-14269.
      698ef762
    • Jurriaan Pruis's avatar
      [SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV · c875d81a
      Jurriaan Pruis authored
      ## What changes were proposed in this pull request?
      
      Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.
      
      See https://github.com/uniVocity/univocity-parsers/blob/f3eb2af26374940e60d91d1703bde54619f50c51/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java#L231-L247
      
      This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)
      
      https://issues.apache.org/jira/browse/SPARK-15493
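      A sketch of the new writer option (assuming an existing DataFrame `df` and a made-up output path):

      ```scala
      // escapeQuotes now defaults to true, producing RFC 4180 style quoting;
      // it can still be turned off explicitly if the old behaviour is wanted.
      df.write
        .option("escapeQuotes", "false")
        .csv("/tmp/out-csv")
      ```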
      
      ## How was this patch tested?
      
      Added a test that verifies the output is quoted correctly.
      
      Author: Jurriaan Pruis <email@jurriaanpruis.nl>
      
      Closes #13267 from jurriaan/quote-escaping.
      c875d81a
    • Takuya UESHIN's avatar
      [SPARK-15483][SQL] IncrementalExecution should use extra strategies. · 4b880674
      Takuya UESHIN authored
      ## What changes were proposed in this pull request?
      
      Extra strategies do not work for streams because `IncrementalExecution` uses a modified planner with stateful operations, but that planner does not include the extra strategies.
      
      This PR fixes `IncrementalExecution` to include the extra strategies so that they can be used.
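      For context, a sketch of how extra strategies are registered (the strategy below is a do-nothing placeholder); after this fix they are also picked up when planning streaming queries:

      ```scala
      import org.apache.spark.sql.{SparkSession, Strategy}
      import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
      import org.apache.spark.sql.execution.SparkPlan

      // A strategy that never matches anything, just to show the wiring.
      object NoopStrategy extends Strategy {
        override def apply(plan: LogicalPlan): Seq[SparkPlan] = Nil
      }

      val spark = SparkSession.builder().master("local[*]").appName("extra-strategies").getOrCreate()
      spark.experimental.extraStrategies = NoopStrategy :: Nil
      ```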
      
      ## How was this patch tested?
      
      I added a test to check if extra strategies work for streams.
      
      Author: Takuya UESHIN <ueshin@happy-camper.st>
      
      Closes #13261 from ueshin/issues/SPARK-15483.
      4b880674
    • Nick Pentreath's avatar
      [SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS · 1cb347fb
      Nick Pentreath authored
      Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.
      
      We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.
      
      Tests N/A.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
      1cb347fb
    • lfzCarlosC's avatar
      [MINOR][MLLIB][STREAMING][SQL] Fix typos · 02c8072e
      lfzCarlosC authored
      Fixed typos in source code for the [mllib], [streaming], and [SQL] components.
      
      None and obvious.
      
      Author: lfzCarlosC <lfz.carlos@gmail.com>
      
      Closes #13298 from lfzCarlosC/master.
      02c8072e
    • Dongjoon Hyun's avatar
      [MINOR][CORE] Fix a HadoopRDD log message and remove unused imports in rdd files. · d6d3e507
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes the following typos in log message and comments of `HadoopRDD.scala`. Also, this removes unused imports.
      ```scala
      -      logWarning("Caching NewHadoopRDDs as deserialized objects usually leads to undesired" +
      +      logWarning("Caching HadoopRDDs as deserialized objects usually leads to undesired" +
      ...
      -      // since its not removed yet
      +      // since it's not removed yet
      ```
      
      ## How was this patch tested?
      
      Manual.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13294 from dongjoon-hyun/minor_rdd_fix_log_message.
      d6d3e507
    • Eric Liang's avatar
      [SPARK-15520][SQL] SparkSession builder in python should also allow overriding... · 8239fdcb
      Eric Liang authored
      [SPARK-15520][SQL] SparkSession builder in python should also allow overriding confs of existing sessions
      
      ## What changes were proposed in this pull request?
      
      This fixes the python SparkSession builder to allow setting confs correctly. This was a leftover TODO from https://github.com/apache/spark/pull/13200.
      
      ## How was this patch tested?
      
      Python doc tests.
      
      cc andrewor14
      
      Author: Eric Liang <ekl@databricks.com>
      
      Closes #13289 from ericl/spark-15520.
      8239fdcb
    • Jeff Zhang's avatar
      [SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when this... · 01e7b9c8
      Jeff Zhang authored
      [SPARK-15345][SQL][PYSPARK] SparkSession's conf doesn't take effect when there is already an existing SparkContext
      
      ## What changes were proposed in this pull request?
      
      Override the existing SparkContext if the provided SparkConf is different. The PySpark part hasn't been fixed yet; I will do that after the first round of review to ensure this is the correct approach.
      
      ## How was this patch tested?
      
      Manually verify it in spark-shell.
      
      rxin Please help review it; I think this is a very critical issue for Spark 2.0.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13160 from zjffdu/SPARK-15345.
      01e7b9c8
    • Lukasz's avatar
      [SPARK-9044] Fix "Storage" tab in UI so that it reflects RDD name change. · b120fba6
      Lukasz authored
      ## What changes were proposed in this pull request?
      
      1. Making 'name' field of RDDInfo mutable.
      2. In StorageListener: catching the fact that an RDD's name was changed and updating it in RDDInfo (see the sketch below).
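      A small illustration of the user-visible effect (names are made up), assuming an active `SparkContext` `sc`:

      ```scala
      // Rename a cached RDD after it has been materialized; the "Storage" tab
      // should now show the new name instead of the stale one.
      val rdd = sc.parallelize(1 to 100).setName("before-rename").cache()
      rdd.count()                  // materialize so it appears under "Storage"
      rdd.setName("after-rename")  // the tab now reflects the rename
      ```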
      
      ## How was this patch tested?
      
      1. Manual verification - the 'Storage' tab now behaves as expected.
      2. The commit also contains a new unit test which verifies this.
      
      Author: Lukasz <lgieron@gmail.com>
      
      Closes #13264 from lgieron/SPARK-9044.
      b120fba6
    • Reynold Xin's avatar
      [SPARK-15436][SQL] Remove DescribeFunction and ShowFunctions · 4f27b8dd
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch removes the last two commands defined in the catalyst module: DescribeFunction and ShowFunctions. They were unnecessary since the parser could just generate DescribeFunctionCommand and ShowFunctionsCommand directly.
      
      ## How was this patch tested?
      Created a new SparkSqlParserSuite.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13292 from rxin/SPARK-15436.
      4f27b8dd
    • Krishna Kalyan's avatar
      [SPARK-12071][DOC] Document the behaviour of NA in R · 9082b796
      Krishna Kalyan authored
      ## What changes were proposed in this pull request?
      
      Under the "Upgrading From SparkR 1.5.x to 1.6.x" section, added a note that SparkSQL converts `NA` in R to `null`.
      
      ## How was this patch tested?
      
      Document update, no tests.
      
      Author: Krishna Kalyan <krishnakalyan3@gmail.com>
      
      Closes #13268 from krishnakalyan3/spark-12071-1.
      9082b796
    • Holden Karau's avatar
      [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc... · cd9f1690
      Holden Karau authored
      [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build instructions
      
      ## What changes were proposed in this pull request?
      
      PySpark: Add links to the predictors from the models in regression.py, improve linear and isotonic pydoc in minor ways.
      User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" install on ubuntu and add sudo to match the rest of the commands.
      User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some ubuntu but maybe more).
      
      ## How was this patch tested?
      
      Built pydocs locally and tested the new user build instructions.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
      cd9f1690
    • Shixiong Zhu's avatar
      [SPARK-15508][STREAMING][TESTS] Fix flaky test: JavaKafkaStreamSuite.testKafkaStream · c9c1c0e5
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `JavaKafkaStreamSuite.testKafkaStream` assumes that when `sent.size == result.size`, the contents of `sent` and `result` should be the same. However, that's not true: the content of `result` may not yet be the final content.
      
      This PR modified the test to always retry the assertions even if the contents of `sent` and `result` are not same.
      
      Here is the failure in Jenkins: http://spark-tests.appspot.com/tests/org.apache.spark.streaming.kafka.JavaKafkaStreamSuite/testKafkaStream
      
      ## How was this patch tested?
      
      Jenkins unit tests.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #13281 from zsxwing/flaky-kafka-test.
      c9c1c0e5
  3. May 24, 2016
    • Wenchen Fan's avatar
      [SPARK-15498][TESTS] fix slow tests · 50b660d7
      Wenchen Fan authored
      ## What changes were proposed in this pull request?
      
      This PR fixes 3 slow tests:
      
      1. `ParquetQuerySuite.read/write wide table`: this is not a good unit test as it runs for more than 5 minutes. This PR removes it and adds a new regression test in `CodeGenerationSuite`, which is more of a unit test.
      2. `ParquetQuerySuite.returning batch for wide table`: reduce the threshold and use a smaller data size.
      3. `DatasetSuite.SPARK-14554: Dataset.map may generate wrong java code for wide table`: improving `CodeFormatter.format` (introduced in https://github.com/apache/spark/pull/12979) speeds it up dramatically.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Wenchen Fan <wenchen@databricks.com>
      
      Closes #13273 from cloud-fan/test.
      50b660d7
    • Parth Brahmbhatt's avatar
      [SPARK-15365][SQL] When table size statistics are not available from... · 4acababc
      Parth Brahmbhatt authored
      [SPARK-15365][SQL] When table size statistics are not available from metastore, we should fallback to HDFS
      
      ## What changes were proposed in this pull request?
      Currently, if a table is used in a join operation, we rely on the size returned by the metastore to decide whether we can convert the operation to a broadcast join. This optimization only kicks in for tables that have statistics available in the metastore. Hive generally rolls over to HDFS if the statistics are not available directly from the metastore, and this seems like a reasonable choice to adopt given the optimization benefit of using broadcast joins.
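      For reference, the size estimate feeds into the existing broadcast join threshold; a sketch (value in bytes, assuming a `spark` session):

      ```scala
      // Tables whose estimated size (from the metastore, or now from HDFS as a
      // fallback) is below this threshold are broadcast in joins.
      spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024)
      ```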
      
      ## How was this patch tested?
      I have executed queries locally to test.
      
      Author: Parth Brahmbhatt <pbrahmbhatt@netflix.com>
      
      Closes #13150 from Parth-Brahmbhatt/SPARK-15365.
      4acababc
    • Reynold Xin's avatar
      [SPARK-15518] Rename various scheduler backend for consistency · 14494da8
      Reynold Xin authored
      ## What changes were proposed in this pull request?
      This patch renames various scheduler backends to make them consistent:
      
      - LocalScheduler -> LocalSchedulerBackend
      - AppClient -> StandaloneAppClient
      - AppClientListener -> StandaloneAppClientListener
      - SparkDeploySchedulerBackend -> StandaloneSchedulerBackend
      - CoarseMesosSchedulerBackend -> MesosCoarseGrainedSchedulerBackend
      - MesosSchedulerBackend -> MesosFineGrainedSchedulerBackend
      
      ## How was this patch tested?
      Updated test cases to reflect the name change.
      
      Author: Reynold Xin <rxin@databricks.com>
      
      Closes #13288 from rxin/SPARK-15518.
      14494da8
    • Dongjoon Hyun's avatar
      [SPARK-15512][CORE] repartition(0) should raise IllegalArgumentException · f08bf587
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      Previously, SPARK-8893 added constraints requiring a positive number of partitions for repartition/coalesce operations in general. This PR adds the one missing part of that and adds two explicit test cases.
      
      **Before**
      ```scala
      scala> sc.parallelize(1 to 5).coalesce(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> sc.parallelize(1 to 5).repartition(0).collect()
      res1: Array[Int] = Array()   // empty
      scala> spark.sql("select 1").coalesce(0)
      res2: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
      scala> spark.sql("select 1").coalesce(0).collect()
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      scala> spark.sql("select 1").repartition(0)
      res3: org.apache.spark.sql.Dataset[org.apache.spark.sql.Row] = [1: int]
      scala> spark.sql("select 1").repartition(0).collect()
      res4: Array[org.apache.spark.sql.Row] = Array()  // empty
      ```
      
      **After**
      ```scala
      scala> sc.parallelize(1 to 5).coalesce(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> sc.parallelize(1 to 5).repartition(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> spark.sql("select 1").coalesce(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      scala> spark.sql("select 1").repartition(0)
      java.lang.IllegalArgumentException: requirement failed: Number of partitions (0) must be positive.
      ...
      ```
      
      ## How was this patch tested?
      
      Pass the Jenkins tests with new testcases.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13282 from dongjoon-hyun/SPARK-15512.
      f08bf587
    • Tathagata Das's avatar
      [SPARK-15458][SQL][STREAMING] Disable schema inference for streaming datasets on file streams · e631b819
      Tathagata Das authored
      ## What changes were proposed in this pull request?
      
      Relying on the schema being inferred in file streams can break easily for multiple reasons:
      - accidentally running on a directory which has no data
      - the schema changing underneath
      - on restart, the query will infer the schema again, and may unexpectedly infer an incorrect schema, as the files in the directory may be different at the time of the restart.
      
      To avoid these complicated scenarios, for Spark 2.0 we are going to disable schema inference by default behind a config, so that the user is forced to consider explicitly what schema they want, rather than having the system try to infer it and run into weird corner cases.
      
      In this PR, I introduce a SQLConf that determines whether schema inference for file streams is allowed or not. It is disabled by default.
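      A sketch of what this means in practice: supply the schema explicitly when creating a file stream (path and fields are made up; the exact name of the new SQLConf is not shown here):

      ```scala
      import org.apache.spark.sql.SparkSession
      import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

      val spark = SparkSession.builder().master("local[*]").appName("file-stream").getOrCreate()

      // With inference disabled by default, the schema must be stated up front.
      val schema = StructType(Seq(
        StructField("id", LongType),
        StructField("event", StringType)))

      val events = spark.readStream
        .schema(schema)          // explicit schema instead of inference
        .json("/tmp/input-events")
      ```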
      
      ## How was this patch tested?
      Updated unit tests that test error behavior with and without schema inference enabled.
      
      Author: Tathagata Das <tathagata.das1565@gmail.com>
      
      Closes #13238 from tdas/SPARK-15458.
      e631b819
    • Nick Pentreath's avatar
      [SPARK-15502][DOC][ML][PYSPARK] add guide note that ALS only supports integer ids · 20900e5f
      Nick Pentreath authored
      This PR adds a note to clarify that the ML API for ALS only supports integers for user/item ids, and that other types for these columns can be used but the ids must fall within integer range.
      
      (Refer [SPARK-14891](https://issues.apache.org/jira/browse/SPARK-14891)).
      
      Also cleaned up a reference to `mllib` in the ML doc.
      
      ## How was this patch tested?
      Built and viewed User Guide doc locally.
      
      Author: Nick Pentreath <nickp@za.ibm.com>
      
      Closes #13278 from MLnick/SPARK-15502-als-int-id-doc-note.
      20900e5f
    • Dongjoon Hyun's avatar
      [MINOR][CORE][TEST] Update obsolete `takeSample` test case. · be99a99f
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR fixes some obsolete comments and an assertion in the `takeSample` test case of `RDDSuite.scala`.
      
      ## How was this patch tested?
      
      This fixes the testcase only.
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #13260 from dongjoon-hyun/SPARK-15481.
      be99a99f
    • wangyang's avatar
      [SPARK-15388][SQL] Fix spark sql CREATE FUNCTION with hive 1.2.1 · 784cc07d
      wangyang authored
      ## What changes were proposed in this pull request?
      
      spark.sql("CREATE FUNCTION myfunc AS 'com.haizhi.bdp.udf.UDFGetGeoCode'") throws "org.apache.hadoop.hive.ql.metadata.HiveException:MetaException(message:NoSuchObjectException(message:Function default.myfunc does not exist))" with hive 1.2.1.
      
      I think it was introduced by PR #12853. It is fixed by catching `Exception` (not `NoSuchObjectException`) and string matching.
      
      ## How was this patch tested?
      
      added a unit test and also tested it manually
      
      Author: wangyang <wangyang@haizhi.com>
      
      Closes #13177 from wangyang1992/fixCreateFunc2.
      784cc07d
    • Marcelo Vanzin's avatar
      [SPARK-15405][YARN] Remove unnecessary upload of config archive. · a313a5ae
      Marcelo Vanzin authored
      We only need one copy of it. The client code that was uploading the
      second copy just needs to be modified to update the metadata in the
      cache, so that the AM knows where to find the configuration.
      
      Tested by running app on YARN and verifying in the logs only one archive
      is uploaded.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #13232 from vanzin/SPARK-15405.
      a313a5ae