Commits · dd1abef138581f30ab7a8dfacb616fe7dd64b421 · cs525-sp18-g07 / spark

Feb 07, 2017

[SPARK-19444][ML][DOCUMENTATION] Fix imports not being present in documentation · dd1abef1

Aseem Bansal authored 8 years ago


## What changes were proposed in this pull request?

SPARK-19444 imports not being present in documentation

## How was this patch tested?

Manual

## Disclaimer

Contribution is original work and I license the work to the project under the project’s open source license

Author: Aseem Bansal <anshbansal@users.noreply.github.com>

Closes #16789 from anshbansal/patch-1.

(cherry picked from commit aee2bd2c)
Signed-off-by: Sean Owen <sowen@cloudera.com>

dd1abef1

Feb 06, 2017

[SPARK-19407][SS] defaultFS is used FileSystem.get instead of getting it from uri scheme · 62fab5be

uncleGen authored 8 years ago

## What changes were proposed in this pull request?

```
Caused by: java.lang.IllegalArgumentException: Wrong FS: s3a://**************/checkpoint/7b2231a3-d845-4740-bfa3-681850e5987f/metadata, expected: file:///
	at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:649)
	at org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:82)
	at org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:606)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:824)
	at org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:601)
	at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
	at org.apache.hadoop.fs.FileSystem.exists(FileSystem.java:1426)
	at org.apache.spark.sql.execution.streaming.StreamMetadata$.read(StreamMetadata.scala:51)
	at org.apache.spark.sql.execution.streaming.StreamExecution.<init>(StreamExecution.scala:100)
	at org.apache.spark.sql.streaming.StreamingQueryManager.createQuery(StreamingQueryManager.scala:232)
	at org.apache.spark.sql.streaming.StreamingQueryManager.startQuery(StreamingQueryManager.scala:269)
	at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:262)
```

This is easy to reproduce on a Spark standalone cluster: set the checkpoint location to a URI whose scheme is anything other than `file://`, and do not override the default file system in the configuration.

Workaround: pass `--conf spark.hadoop.fs.defaultFS=s3a://somebucket`, or set the same property in `SparkConf` or `spark-defaults.conf`.
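The idea behind the fix can be illustrated with plain `java.net.URI`: take the file system from the path's own scheme and fall back to the configured default only when the path has none. In Hadoop terms this is roughly the difference between `FileSystem.get(conf)` and `path.getFileSystem(conf)`. The class and method names below (`SchemeResolver`, `schemeFor`) are hypothetical and are not part of Spark or Hadoop; this is a sketch, not the patch itself.

```java
import java.net.URI;

// Hypothetical helper, not Spark or Hadoop code: picks a file system scheme for a path.
final class SchemeResolver {
    // Returns the scheme of `path` if it has one, otherwise the scheme of `defaultFs`.
    static String schemeFor(String path, String defaultFs) {
        String scheme = URI.create(path).getScheme();
        return scheme != null ? scheme : URI.create(defaultFs).getScheme();
    }
}
```

With a default of `file:///`, a checkpoint path such as `s3a://bucket/checkpoint/metadata` resolves to `s3a`, while a bare path such as `/tmp/checkpoint` still resolves to `file`.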

## How was this patch tested?

existing ut

Author: uncleGen <hustyugm@gmail.com>

Closes #16815 from uncleGen/SPARK-19407.

(cherry picked from commit 7a0a630e)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

62fab5be

[SPARK-19472][SQL] Parser should not mistake CASE WHEN(...) for a function call · f55bd4c7

Herman van Hovell authored 8 years ago


## What changes were proposed in this pull request?
The SQL parser can mistake a `WHEN (...)` used in `CASE` for a function call. This happens in cases like the following:
```sql
select case when (1) + case when 1 > 0 then 1 else 0 end = 2 then 1 else 0 end
from tb
```
This PR fixes this by re-organizing the case related parsing rules.

## How was this patch tested?
Added a regression test to the `ExpressionParserSuite`.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #16821 from hvanhovell/SPARK-19472.

(cherry picked from commit cb2677b8)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

f55bd4c7

Feb 01, 2017

[SPARK-19432][CORE] Fix an unexpected failure when connecting timeout · 7c23bd49

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

When a connection attempt times out, `ask` may fail with a confusing message:

```
17/02/01 23:15:19 INFO Worker: Connecting to master ...
java.lang.IllegalArgumentException: requirement failed: TransportClient has not yet been set.
        at scala.Predef$.require(Predef.scala:224)
        at org.apache.spark.rpc.netty.RpcOutboxMessage.onTimeout(Outbox.scala:70)
        at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:232)
        at org.apache.spark.rpc.netty.NettyRpcEnv$$anonfun$ask$1.applyOrElse(NettyRpcEnv.scala:231)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:138)
        at scala.concurrent.Future$$anonfun$onFailure$1.apply(Future.scala:136)
        at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
```

It's better to provide a meaningful message.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16773 from zsxwing/connect-timeout.

(cherry picked from commit 8303e20c)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

7c23bd49

[SPARK-19377][WEBUI][CORE] Killed tasks should have the status as KILLED · f9464641

Devaraj K authored 8 years ago


## What changes were proposed in this pull request?

When the `newTaskInfo` object is created by dropping unnecessary details to reduce memory usage, the killed status was not copied over. This patch copies the killed status to the `newTaskInfo` object, so the Web UI shows KILLED instead of a wrong status.
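A minimal sketch of the kind of bug described here, assuming a simplified task-info class. `TaskInfoSketch` and its fields are illustrative stand-ins, not Spark's actual `TaskInfo`.

```java
// Hypothetical, simplified stand-in for a task-info object; not Spark's TaskInfo.
final class TaskInfoSketch {
    final long taskId;
    final String details; // bulky per-task data that is dropped to save memory
    boolean finished;
    boolean failed;
    boolean killed;

    TaskInfoSketch(long taskId, String details) {
        this.taskId = taskId;
        this.details = details;
    }

    // Builds a lighter copy without the details while keeping every status flag.
    TaskInfoSketch withoutDetails() {
        TaskInfoSketch copy = new TaskInfoSketch(taskId, "");
        copy.finished = finished;
        copy.failed = failed;
        copy.killed = killed; // forgetting this line makes a killed task look successful
        return copy;
    }

    // Status string as a UI might derive it from the flags.
    String status() {
        if (killed) return "KILLED";
        if (failed) return "FAILED";
        return finished ? "SUCCESS" : "RUNNING";
    }
}
```

If the `killed` flag is not copied, a finished and killed task reports `SUCCESS` from the trimmed copy, which matches the symptom in the tables below.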

## How was this patch tested?

Current behaviour when displaying tasks on the stage UI page:

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | SUCCESS | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | SUCCESS | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Web UI display after applying the patch:

| Index | ID | Attempt | Status | Locality Level | Executor ID / Host | Launch Time | Duration | GC Time | Input Size / Records | Write Time | Shuffle Write Size / Records | Errors |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 143 | 10 | 0 | KILLED | NODE_LOCAL | 6 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |
| 156 | 11 | 0 | KILLED | NODE_LOCAL | 5 / x.xx.x.x stdout stderr | 2017/01/25 07:49:27 | 0 ms | | 0.0 B / 0 | | 0.0 B / 0 | TaskKilled (killed intentionally) |

Author: Devaraj K <devaraj@apache.org>

Closes #16725 from devaraj-kavali/SPARK-19377.

(cherry picked from commit df4a27cc)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

f9464641

[SPARK-19410][DOC] Fix brokens links in ml-pipeline and ml-tuning · 61cdc8c7

Zheng RuiFeng authored 8 years ago


## What changes were proposed in this pull request?
Fix broken links in ml-pipeline and ml-tuning
`<div data-lang="scala">`  ->   `<div data-lang="scala" markdown="1">`

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16754 from zhengruifeng/doc_api_fix.

(cherry picked from commit 04ee8cf6)
Signed-off-by: Sean Owen <sowen@cloudera.com>

61cdc8c7

Jan 31, 2017

[SPARK-19378][SS] Ensure continuity of stateOperator and eventTime metrics... · d35a1268

Burak Yavuz authored 8 years ago

[SPARK-19378][SS] Ensure continuity of stateOperator and eventTime metrics even if there is no new data in trigger

In StructuredStreaming, if a new trigger was skipped because no new data arrived, we suddenly report nothing for the metrics `stateOperator`. We could however easily report the metrics from `lastExecution` to ensure continuity of metrics.
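The continuity idea can be sketched as "remember the last reported value and reuse it when a trigger is skipped". The class below is a hypothetical illustration; `MetricsReporter` and `lastNumRowsTotal` are assumed names and do not correspond to Spark's streaming internals.

```java
import java.util.Optional;

// Hypothetical sketch: names do not correspond to Spark's streaming internals.
final class MetricsReporter {
    private long lastNumRowsTotal = 0L; // metric value from the last executed trigger

    // `current` is empty when the trigger was skipped because no new data arrived.
    long report(Optional<Long> current) {
        current.ifPresent(v -> lastNumRowsTotal = v);
        return lastNumRowsTotal;
    }
}
```

A skipped trigger then reports the value from the last execution instead of nothing.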

Regression test in `StreamingQueryStatusAndProgressSuite`

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #16716 from brkyvz/state-agg.

(cherry picked from commit 081b7add)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

d35a1268

[BACKPORT-2.1][SPARKR][DOCS] update R API doc for subset/extract · e43f161b

Felix Cheung authored 8 years ago

## What changes were proposed in this pull request?

backport #16721 to branch-2.1

## How was this patch tested?

manual

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16749 from felixcheung/rsubsetdocbackport.

e43f161b

Jan 30, 2017

[SPARK-19406][SQL] Fix function to_json to respect user-provided options · 07a1788e

gatorsmile authored 8 years ago


### What changes were proposed in this pull request?
Currently, the function `to_json` allows users to provide options for generating JSON. However, it does not pass them to `JacksonGenerator`, so the user-provided options are ignored. This PR fixes that. Below is an example.

```Scala
val df = Seq(Tuple1(Tuple1(java.sql.Timestamp.valueOf("2015-08-26 18:00:00.0")))).toDF("a")
val options = Map("timestampFormat" -> "dd/MM/yyyy HH:mm")
df.select(to_json($"a", options)).show(false)
```
The current output looks like this:
```
+--------------------------------------+
|structtojson(a)                       |
+--------------------------------------+
|{"_1":"2015-08-26T18:00:00.000-07:00"}|
+--------------------------------------+
```

After the fix, the output looks like this:
```
+-------------------------+
|structtojson(a)          |
+-------------------------+
|{"_1":"26/08/2015 18:00"}|
+-------------------------+
```
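To see what the `timestampFormat` pattern from the example does on its own, the pattern can be applied with `java.time`. This is only an illustration of the pattern; Spark's JSON writer may use a different formatter class internally, and `TimestampFormatSketch` is a hypothetical name.

```java
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper showing the effect of the pattern "dd/MM/yyyy HH:mm".
final class TimestampFormatSketch {
    // Formats a timestamp with a user-provided pattern.
    static String format(LocalDateTime ts, String pattern) {
        return ts.format(DateTimeFormatter.ofPattern(pattern));
    }
}
```

Formatting `2015-08-26 18:00` with `dd/MM/yyyy HH:mm` yields `26/08/2015 18:00`, the value shown in the fixed output.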
### How was this patch tested?
Added test cases for both `from_json` and `to_json`

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16745 from gatorsmile/toJson.

(cherry picked from commit f9156d29)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

07a1788e

[SPARK-19396][DOC] JDBC Options are Case In-sensitive · 445438c9

gatorsmile authored 8 years ago

### What changes were proposed in this pull request?
JDBC options are case-insensitive now that PR https://github.com/apache/spark/pull/15884 has been merged into Spark 2.1.
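One common way to get case-insensitive option lookup is a map ordered by `String.CASE_INSENSITIVE_ORDER`. The sketch below uses that technique for illustration only; Spark has its own map implementation for this, and `CaseInsensitiveOptions` is a hypothetical name.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper: wraps options so that key lookup ignores case.
final class CaseInsensitiveOptions {
    static Map<String, String> of(Map<String, String> options) {
        Map<String, String> result = new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
        result.putAll(options);
        return result;
    }
}
```

With this wrapper, an option stored as `fetchSize` can be read back as `fetchsize` or `FETCHSIZE`.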

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16734 from gatorsmile/fixDocCaseInsensitive.

(cherry picked from commit c0eda7e8)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

445438c9

Jan 27, 2017

[SPARK-19324][SPARKR] Spark VJM stdout output is getting dropped in SparkR · 9a49f9af

Felix Cheung authored 8 years ago


## What changes were proposed in this pull request?

This mostly affects running a job from the driver in client mode when results are expected on stdout (which should be somewhat rare, but is possible).

Before:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
NULL
```

After:
```
> a <- as.DataFrame(cars)
> b <- group_by(a, "dist")
> c <- count(b)
> sparkR.callJMethod(c$countjc, "explain", TRUE)
count#11L
NULL
```

Now, `column.explain()` doesn't seem very useful (we can get more extensive output with `DataFrame.explain()`), but there are other, more complex examples with calls to `println` on the Scala/JVM side whose output is getting dropped.

## How was this patch tested?

manual

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16670 from felixcheung/rjvmstdout.

(cherry picked from commit a7ab6f9a)
Signed-off-by: Shivaram Venkataraman <shivaram@cs.berkeley.edu>

9a49f9af

[SPARK-19333][SPARKR] Add Apache License headers to R files · 4002ee97

Felix Cheung authored 8 years ago


## What changes were proposed in this pull request?

add header

## How was this patch tested?

Manual run to check vignettes html is created properly

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16709 from felixcheung/rfilelicense.

(cherry picked from commit 385d7384)
Signed-off-by: Felix Cheung <felixcheung@apache.org>

4002ee97

Jan 26, 2017

[SPARK-18788][SPARKR] Add API for getNumPartitions · ba2a5ada

Felix Cheung authored 8 years ago


## What changes were proposed in this pull request?

Adds `getNumPartitions`, with documentation noting that it converts the DataFrame into an RDD.

## How was this patch tested?

unit tests, manual tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16668 from felixcheung/rgetnumpartitions.

(cherry picked from commit 90817a6c)
Signed-off-by: Felix Cheung <felixcheung@apache.org>

ba2a5ada

[SPARK-19220][UI] Make redirection to HTTPS apply to all URIs. (branch-2.1) · 59502bbc

Marcelo Vanzin authored 8 years ago

The redirect handler was installed only for the root of the server;
any other context ended up being served directly through the HTTP
port. Since every sub page (e.g. application UIs in the history
server) is a separate servlet context, this meant that everything
but the root was accessible via HTTP still.

The change adds separate names to each connector, and binds contexts
to specific connectors so that content is only served through the
HTTPS connector when it's enabled. In that case, the only thing that
binds to the HTTP connector is the redirect handler.
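The job of the redirect handler can be sketched as rewriting any incoming HTTP URL, not just the root, to its HTTPS equivalent. The helper below is a hypothetical illustration using `java.net.URI`; it is not the Jetty-based code in Spark, and the port numbers used with it are arbitrary.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper: computes the HTTPS target for any request URL.
final class HttpsRedirectSketch {
    static String redirectTarget(String requestUrl, int securePort) {
        URI in = URI.create(requestUrl);
        try {
            // Keep host, path and query; switch the scheme and port.
            return new URI("https", null, in.getHost(), securePort,
                    in.getPath(), in.getQuery(), null).toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("cannot build redirect for " + requestUrl, e);
        }
    }
}
```

The path and query string are preserved, so a sub page such as an application UI is redirected to the same location on the HTTPS port.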

Tested with new unit tests and by checking a live history server.

(cherry picked from commit d3dcb63b)

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16711 from vanzin/SPARK-19220_2.1.

59502bbc

[SPARK-19338][SQL] Add UDF names in explain · b12a76a4

Takeshi YAMAMURO authored 8 years ago


## What changes were proposed in this pull request?
This PR adds a variable for the UDF name to `ScalaUDF`.
If the variable is set, `DataFrame#explain` prints the name.
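A minimal sketch of the behaviour, assuming an optional name that is included in the printed label when present. The label format and the name `UdfLabelSketch` are assumptions for illustration; the exact text printed by Spark may differ.

```java
import java.util.Optional;

// Hypothetical helper: builds the label shown for a UDF call in a plan.
final class UdfLabelSketch {
    static String label(Optional<String> udfName, String args) {
        // Use "UDF:<name>" when a name is known, plain "UDF" otherwise.
        return udfName.map(n -> "UDF:" + n).orElse("UDF") + "(" + args + ")";
    }
}
```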

## How was this patch tested?
Added a test in `UDFSuite`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #16707 from maropu/SPARK-19338.

(cherry picked from commit 9f523d31)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

b12a76a4

Jan 25, 2017

[SPARK-14804][SPARK][GRAPHX] Fix checkpointing of VertexRDD/EdgeRDD · 0d7e3852

Tathagata Das authored 8 years ago


## What changes were proposed in this pull request?

EdgeRDD/VertexRDD override checkpoint() and isCheckpointed() to forward these calls to the internal partitionsRDD. So when checkpoint() is called on them, it's the partitionsRDD that actually gets checkpointed. However, since isCheckpointed() is also overridden to call partitionsRDD.isCheckpointed, EdgeRDD/VertexRDD.isCheckpointed returns true even though this RDD is actually not checkpointed.

This would have been fine except that the RDD's internal logic for computing the RDD depends on isCheckpointed(). So for VertexRDD/EdgeRDD, since isCheckpointed is true, Spark tries to read checkpoint data of VertexRDD/EdgeRDD when computing, even though they are not actually checkpointed. Through a crazy sequence of call forwarding, it reads the checkpoint data of partitionsRDD and tries to cast it to the types in Vertex/EdgeRDD. This leads to a ClassCastException.

The minimal fix that does not change any public behavior is to modify the RDD internals so that they do not use the public, overridable API for internal logic.
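The design point can be shown with a small model: internal logic reads private state, so a subclass that overrides the public method cannot mislead it. The classes below are hypothetical and heavily simplified; they are not Spark's RDD code.

```java
// Hypothetical, heavily simplified model of the problem; not Spark's RDD code.
class BaseRdd {
    private boolean checkpointedInternally = false;

    // Public, overridable API.
    public boolean isCheckpointed() { return checkpointedInternally; }
    public void checkpoint() { checkpointedInternally = true; }

    // Internal logic consults the private state rather than isCheckpointed().
    final String compute() {
        return checkpointedInternally ? "read own checkpoint" : "compute from parent";
    }
}

// Forwards the public API to an inner RDD, as the wrapper RDDs do.
class WrapperRdd extends BaseRdd {
    private final BaseRdd partitionsRdd = new BaseRdd();

    @Override public boolean isCheckpointed() { return partitionsRdd.isCheckpointed(); }
    @Override public void checkpoint() { partitionsRdd.checkpoint(); }
}
```

After `checkpoint()` on the wrapper, `isCheckpointed()` returns true, but `compute()` still takes the "compute from parent" branch because the wrapper itself holds no checkpoint data.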
## How was this patch tested?

New unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #15396 from tdas/SPARK-14804.

(cherry picked from commit 47d5d0dd)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

0d7e3852

[SPARK-19064][PYSPARK] Fix pip installing of sub components · a5c10ff2

Holden Karau authored 8 years ago


## What changes were proposed in this pull request?

Fix installation of the mllib and ml sub-components, and clean up cache files more eagerly during the test script and make-distribution.

## How was this patch tested?

Updated sanity test script to import mllib and ml sub-components.

Author: Holden Karau <holden@us.ibm.com>

Closes #16465 from holdenk/SPARK-19064-fix-pip-install-sub-components.

(cherry picked from commit 965c82d8)
Signed-off-by: Holden Karau <holden@us.ibm.com>

a5c10ff2

[SPARK-18750][YARN] Follow up: move test to correct directory in 2.1 branch. · 97d3353e

Marcelo Vanzin authored 8 years ago

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16704 from vanzin/SPARK-18750_2.1.

97d3353e

[SPARK-19307][PYSPARK] Make sure user conf is propagated to SparkContext. · c9f075ab

Marcelo Vanzin authored 8 years ago


The code was failing to propagate the user conf in the case where the
JVM was already initialized, which happens when a user submits a
python script via spark-submit.

Tested with new unit test and by running a python script in a real cluster.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16682 from vanzin/SPARK-19307.

(cherry picked from commit 92afaa93)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

c9f075ab

[SPARK-18863][SQL] Output non-aggregate expressions without GROUP BY in a... · af954553

Nattavut Sutyanyong authored 8 years ago

[SPARK-18863][SQL] Output non-aggregate expressions without GROUP BY in a subquery does not yield an error

## What changes were proposed in this pull request?
This PR reports proper error messages when a subquery expression contains an invalid plan. The problem is fixed by calling CheckAnalysis on the plan inside a subquery.

## How was this patch tested?
Existing tests and two new test cases on 2 forms of subquery, namely, scalar subquery and in/exists subquery.

````
-- TC 01.01
-- The column t2b in the SELECT of the subquery is invalid
-- because it is neither an aggregate function nor a GROUP BY column.
select t1a, t2b
from   t1, t2
where  t1b = t2c
and    t2b = (select max(avg)
              from   (select   t2b, avg(t2b) avg
                      from     t2
                      where    t2a = t1.t1b
                     )
             )
;

-- TC 01.02
-- Invalid due to the column t2b not part of the output from table t2.
select *
from   t1
where  t1a in (select   min(t2a)
               from     t2
               group by t2c
               having   t2c in (select   max(t3c)
                                from     t3
                                group by t3b
                                having   t3b > t2b ))
;
````

Author: Nattavut Sutyanyong <nsy.can@gmail.com>

Closes #16572 from nsyca/18863.

(cherry picked from commit f1ddca5f)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

af954553

[SPARK-18750][YARN] Avoid using "mapValues" when allocating containers. · f391ad2c

Marcelo Vanzin authored 8 years ago


That method is prone to stack overflows when the input map is really
large; instead, use plain "map". Also includes a unit test that was
tested and caused stack overflows without the fix.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16667 from vanzin/SPARK-18750.

(cherry picked from commit 76db394f)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>

f391ad2c

[SPARK-16046][DOCS] Aggregations in the Spark SQL programming guide · e2f77392

aokolnychyi authored 8 years ago

## What changes were proposed in this pull request?

- A separate subsection for Aggregations under “Getting Started” in the Spark SQL programming guide. It mentions which aggregate functions are predefined and how users can create their own.
- Examples of using the `UserDefinedAggregateFunction` abstract class for untyped aggregations in Java and Scala.
- Examples of using the `Aggregator` abstract class for type-safe aggregations in Java and Scala.
- Python is not covered.
- The PR might not resolve the ticket since I do not know what exactly was planned by the author.

In total, there are four new standalone examples that can be executed via `spark-submit` or `run-example`. The updated Spark SQL programming guide references these examples and does not contain hard-coded snippets.

## How was this patch tested?

The patch was tested locally by building the docs. The examples were run as well.

![image](https://cloud.githubusercontent.com/assets/6235869/21292915/04d9d084-c515-11e6-811a-999d598dffba.png)

Author: aokolnychyi <okolnychyyanton@gmail.com>

Closes #16329 from aokolnychyi/SPARK-16046.

(cherry picked from commit 3fdce814)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

e2f77392

Jan 24, 2017

[SPARK-19330][DSTREAMS] Also show tooltip for successful batches · c1337879

Liwei Lin authored 8 years ago

## What changes were proposed in this pull request?

### Before
![_streaming_before](https://cloud.githubusercontent.com/assets/15843379/22181462/1e45c20c-e0c8-11e6-831c-8bf69722a4ee.png)

### After
![_streaming_after](https://cloud.githubusercontent.com/assets/15843379/22181464/23f38a40-e0c8-11e6-9a87-e27b1ffb1935.png)

## How was this patch tested?

Manually

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16673 from lw-lin/streaming.

(cherry picked from commit 40a4cfc7)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

c1337879

[SPARK-19017][SQL] NOT IN subquery with more than one column may return incorrect results · b94fb284

Nattavut Sutyanyong authored 8 years ago


## What changes were proposed in this pull request?

This PR fixes the code in Optimizer phase where the NULL-aware expression of a NOT IN query is expanded in Rule `RewritePredicateSubquery`.

Example:
The query

 select a1,b1
 from   t1
 where  (a1,b1) not in (select a2,b2
                        from   t2);

has the (a1, b1) = (a2, b2) rewritten from (before this fix):

Join LeftAnti, ((isnull((_1#2 = a2#16)) || isnull((_2#3 = b2#17))) || ((_1#2 = a2#16) && (_2#3 = b2#17)))

to (after this fix):

Join LeftAnti, (((_1#2 = a2#16) || isnull((_1#2 = a2#16))) && ((_2#3 = b2#17) || isnull((_2#3 = b2#17))))
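The difference between the two join conditions can be checked with a small three-valued-logic model, where `null` stands for SQL UNKNOWN and the anti join drops a left row only when the condition is TRUE. The class below is a hypothetical sketch and not Catalyst code.

```java
// Hypothetical model of the two anti-join conditions; null stands for SQL UNKNOWN.
final class NotInRewriteSketch {
    static Boolean eq(Integer x, Integer y) {
        return (x == null || y == null) ? null : Boolean.valueOf(x.equals(y));
    }

    static Boolean or(Boolean p, Boolean q) {
        if (Boolean.TRUE.equals(p) || Boolean.TRUE.equals(q)) return Boolean.TRUE;
        return (p == null || q == null) ? null : Boolean.FALSE;
    }

    static Boolean and(Boolean p, Boolean q) {
        if (Boolean.FALSE.equals(p) || Boolean.FALSE.equals(q)) return Boolean.FALSE;
        return (p == null || q == null) ? null : Boolean.TRUE;
    }

    static boolean isNull(Boolean p) { return p == null; }

    // Condition before the fix: any UNKNOWN comparison makes the pair match.
    static boolean before(Integer a1, Integer b1, Integer a2, Integer b2) {
        Boolean e1 = eq(a1, a2), e2 = eq(b1, b2);
        return Boolean.TRUE.equals(or(or(isNull(e1), isNull(e2)), and(e1, e2)));
    }

    // Condition after the fix: every column must be equal or UNKNOWN.
    static boolean after(Integer a1, Integer b1, Integer a2, Integer b2) {
        Boolean e1 = eq(a1, a2), e2 = eq(b1, b2);
        return Boolean.TRUE.equals(and(or(e1, isNull(e1)), or(e2, isNull(e2))));
    }
}
```

Take the left row `(1, 2)` and the subquery row `(NULL, 3)`. The tuples cannot be equal because `2 = 3` is false, so the left row should survive `NOT IN`. The old condition matches the pair and drops the row; the new condition does not.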

## How was this patch tested?

sql/test, catalyst/test and new test cases in SQLQueryTestSuite.

Author: Nattavut Sutyanyong <nsy.can@gmail.com>

Closes #16467 from nsyca/19017.

(cherry picked from commit cdb691eb)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

b94fb284

[SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case · d128b6a3

Ilya Matiach authored 8 years ago

[SPARK-16473][MLLIB] Fix BisectingKMeans Algorithm failing in edge case where no children exist in updateAssignments

## What changes were proposed in this pull request?

Fix a bug in which BisectingKMeans fails with error:
java.util.NoSuchElementException: key not found: 166
        at scala.collection.MapLike$class.default(MapLike.scala:228)
        at scala.collection.AbstractMap.default(Map.scala:58)
        at scala.collection.MapLike$class.apply(MapLike.scala:141)
        at scala.collection.AbstractMap.apply(Map.scala:58)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply$mcDJ$sp(BisectingKMeans.scala:338)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1$$anonfun$2.apply(BisectingKMeans.scala:337)
        at scala.collection.TraversableOnce$$anonfun$minBy$1.apply(TraversableOnce.scala:231)
        at scala.collection.LinearSeqOptimized$class.foldLeft(LinearSeqOptimized.scala:111)
        at scala.collection.immutable.List.foldLeft(List.scala:84)
        at scala.collection.LinearSeqOptimized$class.reduceLeft(LinearSeqOptimized.scala:125)
        at scala.collection.immutable.List.reduceLeft(List.scala:84)
        at scala.collection.TraversableOnce$class.minBy(TraversableOnce.scala:231)
        at scala.collection.AbstractTraversable.minBy(Traversable.scala:105)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:337)
        at org.apache.spark.mllib.clustering.BisectingKMeans$$anonfun$org$apache$spark$mllib$clustering$BisectingKMeans$$updateAssignments$1.apply(BisectingKMeans.scala:334)
        at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
        at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:389)

## How was this patch tested?

The dataset was run against the code change to verify that the code works.  I will try to add unit tests to the code.

Author: Ilya Matiach <ilmat@microsoft.com>

Closes #16355 from imatiach-msft/ilmat/fix-kmeans.

d128b6a3

[SPARK-18823][SPARKR] add support for assigning to column · 9c04e427

Felix Cheung authored 8 years ago


## What changes were proposed in this pull request?

Support for
```
df[[myname]] <- 1
df[[2]] <- df$eruptions
```

## How was this patch tested?

manual tests, unit tests

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16663 from felixcheung/rcolset.

(cherry picked from commit f27e0247)
Signed-off-by: Felix Cheung <felixcheung@apache.org>

9c04e427

[SPARK-19268][SS] Disallow adaptive query execution for streaming queries · 570e5e11

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

As adaptive query execution may change the number of partitions in different batches, it may break streaming queries. Hence, we should disallow this feature in Structured Streaming.

## How was this patch tested?

`test("SPARK-19268: Adaptive query execution should be disallowed")`.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16683 from zsxwing/SPARK-19268.

(cherry picked from commit 60bd91a3)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

570e5e11

[SPARK-9435][SQL] Reuse function in Java UDF to correctly support expressions... · 4a2be090

hyukjinkwon authored 8 years ago

[SPARK-9435][SQL] Reuse function in Java UDF to correctly support expressions that require equality comparison between ScalaUDF

## What changes were proposed in this pull request?

Currently, running the following code in Java

```java
spark.udf().register("inc", new UDF1<Long, Long>() {
  @Override
  public Long call(Long i) {
    return i + 1;
  }
}, DataTypes.LongType);

spark.range(10).toDF("x").createOrReplaceTempView("tmp");
Row result = spark.sql("SELECT inc(x) FROM tmp GROUP BY inc(x)").head();
Assert.assertEquals(7, result.getLong(0));
```

fails as below:

```
org.apache.spark.sql.AnalysisException: expression 'tmp.`x`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
Aggregate [UDF(x#19L)], [UDF(x#19L) AS UDF(x)#23L]
+- SubqueryAlias tmp, `tmp`
   +- Project [id#16L AS x#19L]
      +- Range (0, 10, step=1, splits=Some(8))

	at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
	at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:57)
```

The root cause is that we were creating the function every time it needed to be built, as shown below:

```scala
scala> def inc(i: Int) = i + 1
inc: (i: Int)Int

scala> (inc(_: Int)).hashCode
res15: Int = 1231799381

scala> (inc(_: Int)).hashCode
res16: Int = 2109839984

scala> (inc(_: Int)) == (inc(_: Int))
res17: Boolean = false
```

This seems to lead to comparison failures between `ScalaUDF`s created from the Java UDF API, for example, in `Expression.semanticEquals`.

The Scala API already seems to be fine.

Both can be tested easily as below if any reviewer is more comfortable with Scala:

```scala
val df = Seq((1, 10), (2, 11), (3, 12)).toDF("x", "y")
val javaUDF = new UDF1[Int, Int]  {
  override def call(i: Int): Int = i + 1
}
// spark.udf.register("inc", javaUDF, IntegerType) // Uncomment this for Java API
// spark.udf.register("inc", (i: Int) => i + 1)    // Uncomment this for Scala API
df.createOrReplaceTempView("tmp")
spark.sql("SELECT inc(y) FROM tmp GROUP BY inc(y)").show()
```
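The same identity problem can be reproduced in plain Java: a wrapper object created on every use never equals another one, while a single reused instance does. The class below is a hypothetical illustration and not the code in Spark's `UDFRegistration`.

```java
import java.util.function.Function;

// Hypothetical illustration of "create every time" versus "create once and reuse".
final class UdfIdentitySketch {
    // Builds a fresh function object on every call, so two results are never equal.
    static Function<Long, Long> freshInc() {
        return new Function<Long, Long>() {
            @Override public Long apply(Long i) { return i + 1; }
        };
    }

    // Created once and reused, so every use refers to the same object.
    static final Function<Long, Long> SHARED_INC = freshInc();
}
```

Two calls to `freshInc()` produce objects that are not equal, which is what breaks equality-based comparison of the expressions that hold them. Holding on to `SHARED_INC` avoids that.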

## How was this patch tested?

Unit test in `JavaUDFSuite.java` and `./dev/lint-java`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16553 from HyukjinKwon/SPARK-9435.

(cherry picked from commit e576c1ed)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

4a2be090

Jan 23, 2017

[SPARK-19306][CORE] Fix inconsistent state in DiskBlockObject when expection occurred · ed5d1e72

jerryshao authored 8 years ago


## What changes were proposed in this pull request?

In `DiskBlockObjectWriter`, when an error occurs during writing, `revertPartialWritesAndClose` is called. If this method fails in turn, for example because the disk is full, it throws an exception without resetting the state of the writer and skips the revert. This PR fixes the issue so that the user has a chance to recover from such a failure.
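A minimal sketch of the pattern, assuming the state reset is moved into a `finally` block so that it runs even when the revert throws. `WriterSketch` is hypothetical and much simpler than the real writer; the actual patch may handle the exception differently.

```java
// Hypothetical, simplified writer; not Spark's DiskBlockObjectWriter.
final class WriterSketch {
    boolean open = true;

    // Stand-in for truncating the file back to the last committed position.
    private void truncate(boolean diskFull) {
        if (diskFull) throw new IllegalStateException("no space left on device");
    }

    void revertPartialWritesAndClose(boolean diskFull) {
        try {
            truncate(diskFull);
        } finally {
            open = false; // state is reset even when the revert itself throws
        }
    }
}
```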

## How was this patch tested?

Existing test.

Author: jerryshao <sshao@hortonworks.com>

Closes #16657 from jerryshao/SPARK-19306.

(cherry picked from commit e4974721)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

ed5d1e72

[SPARK-19155][ML] Make family case insensitive in GLM · 1e07a719

actuaryzhang authored 8 years ago


## What changes were proposed in this pull request?
This is a supplement to PR #16516, which did not make the value from `getFamily` case insensitive. Current tests of Poisson/binomial GLM with weights fail when specifying 'Poisson' or 'Binomial', because the calculation of `dispersion` and `pValue` checks the value of family retrieved from `getFamily`:
```
model.getFamily == Binomial.name || model.getFamily == Poisson.name
```
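One way to make such a check robust is to normalize the user-supplied name before comparing. The sketch below assumes the canonical family names are lowercase (`binomial`, `poisson`); `FamilySketch` is a hypothetical name and this is not the code in the patch.

```java
import java.util.Locale;

// Hypothetical helper: case-insensitive check for the binomial and Poisson families.
final class FamilySketch {
    static boolean isBinomialOrPoisson(String family) {
        // Locale.ROOT avoids locale-specific lowercasing surprises.
        String f = family.toLowerCase(Locale.ROOT);
        return f.equals("binomial") || f.equals("poisson");
    }
}
```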

## How was this patch tested?
Update existing tests for 'Poisson' and 'Binomial'.

yanboliang felixcheung imatiach-msft

Author: actuaryzhang <actuaryzhang10@gmail.com>

Closes #16675 from actuaryzhang/family.

(cherry picked from commit f067acef)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>

1e07a719

Jan 21, 2017

[SPARK-19155][ML] MLlib GeneralizedLinearRegression family and link should case insensitive · 8daf10e3

Yanbo Liang authored 8 years ago

## What changes were proposed in this pull request?
MLlib `GeneralizedLinearRegression` `family` and `link` should be case insensitive. This is consistent with some other MLlib params such as [`featureSubsetStrategy`](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/tree/treeParams.scala#L415).

## How was this patch tested?
Update corresponding tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16516 from yanboliang/spark-19133.

(cherry picked from commit 3dcad9fa)
Signed-off-by: Yanbo Liang <ybliang8@gmail.com>

8daf10e3

Jan 20, 2017

[SPARK-19267][SS] Fix a race condition when stopping StateStore · 6f0ad575

Shixiong Zhu authored 8 years ago

## What changes were proposed in this pull request?

There is a race condition when stopping StateStore which makes `StateStoreSuite.maintenance` flaky. `StateStore.stop` doesn't wait for the running task to finish, and an out-of-date task may fail `doMaintenance` and cancel the new task. Here is a reproducer: https://github.com/zsxwing/spark/commit/dde1b5b106ba034861cf19e16883cfe181faa6f3



This PR adds MaintenanceTask to eliminate the race condition.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16627 from zsxwing/SPARK-19267.

(cherry picked from commit ea31f92b)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

6f0ad575

[SPARK-18589][SQL] Fix Python UDF accessing attributes from both side of join · 4d286c90

Davies Liu authored 8 years ago

A PythonUDF is unevaluable and cannot be used inside a join condition. Currently the optimizer pushes a PythonUDF that accesses both sides of a join into the join condition, and the query then fails to plan.

This PR fixes the issue by checking whether the expression is evaluable before pushing it into the Join.

Add a regression test.

Author: Davies Liu <davies@databricks.com>

Closes #16581 from davies/pyudf_join.

4d286c90

[SPARK-19314][SS][CATALYST] Do not allow sort before aggregation in Structured Streaming plan · 482d361c

Tathagata Das authored 8 years ago


## What changes were proposed in this pull request?

Sort in a streaming plan should be allowed only after an aggregation in complete mode. Currently it is incorrectly allowed when present anywhere in the plan, which gives unpredictable and potentially incorrect results.

## How was this patch tested?
New test

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16662 from tdas/SPARK-19314.

(cherry picked from commit 552e5f08)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

482d361c

Jan 19, 2017

[SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when... · 7bc3e9ba

Wenchen Fan authored 8 years ago

[SPARK-18899][SPARK-18912][SPARK-18913][SQL] refactor the error checking when append data to an existing table

## What changes were proposed in this pull request?

When we append data to an existing table with `DataFrameWriter.saveAsTable`, we will do various checks to make sure the appended data is consistent with the existing data.

However, we get the information about the existing table by matching the table relation instead of looking at the table metadata. This is error-prone: for example, we only check the number of columns for `HadoopFsRelation`, and we forget to check bucketing.

This PR refactors the error checking by looking at the metadata of the existing table, and fixes several bugs:
* SPARK-18899: We forget to check whether the specified bucketing matches the existing table, which may lead to a problematic table that has different bucketing in different data files.
* SPARK-18912: We forget to check the number of columns for non-file-based data source tables.
* SPARK-18913: We don't support appending data to a table with special column names.
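
To illustrate what checking against table metadata can look like, here is a small sketch that compares only column count and bucketing. `TableMeta`, `BucketSpec`, and `appendMismatches` are simplified, hypothetical stand-ins, not the catalog classes used in the PR.

```scala
object AppendCheckSketch {
  // Hypothetical, simplified stand-ins for catalog table metadata.
  case class BucketSpec(numBuckets: Int, bucketColumns: Seq[String])
  case class TableMeta(columns: Seq[String], bucketSpec: Option[BucketSpec])

  // Returns one message per mismatch; an empty result means the append may proceed.
  def appendMismatches(existing: TableMeta, incoming: TableMeta): Seq[String] = {
    val columnCount =
      if (existing.columns.length != incoming.columns.length)
        Some(s"expected ${existing.columns.length} columns but got ${incoming.columns.length}")
      else None
    val bucketing =
      if (existing.bucketSpec != incoming.bucketSpec)
        Some(s"bucketing ${incoming.bucketSpec} does not match existing ${existing.bucketSpec}")
      else None
    Seq(columnCount, bucketing).flatten
  }
}
```

Because the comparison works on metadata alone, the same checks apply regardless of which relation type backs the table.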

## How was this patch tested?
new regression test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16313 from cloud-fan/bug1.

(cherry picked from commit f923c849)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

7bc3e9ba

Jan 18, 2017

[SPARK-19168][STRUCTURED STREAMING] StateStore should be aborted upon error · 4cff0b50

Liwei Lin authored 8 years ago


## What changes were proposed in this pull request?

We should call `StateStore.abort()` if any error occurs before the store is committed.
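
One way to picture the pattern is the sketch below, written against a hypothetical `Store` trait rather than the real `StateStore` API; `updateAndCommit` is an illustrative helper and not the mechanism used in the PR.

```scala
object AbortOnErrorSketch {
  // Hypothetical minimal store interface, for illustration only.
  trait Store {
    def put(key: String, value: String): Unit
    def commit(): Long
    def abort(): Unit
  }

  // Run the updates and commit; if anything fails before the commit
  // completes, abort the store and rethrow the original error.
  def updateAndCommit(store: Store)(updates: Store => Unit): Long = {
    try {
      updates(store)
      store.commit()
    } catch {
      case e: Throwable =>
        store.abort()
        throw e
    }
  }
}
```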

## How was this patch tested?

Manually.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16547 from lw-lin/append-filter.

(cherry picked from commit 569e5068)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

4cff0b50

[SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from... · 047506ba

Shixiong Zhu authored 8 years ago

[SPARK-19113][SS][TESTS] Ignore StreamingQueryException thrown from awaitInitialization to avoid breaking tests

## What changes were proposed in this pull request?

#16492 missed one race condition: `StreamExecution.awaitInitialization` may throw fatal errors and fail the test. This PR just ignores `StreamingQueryException` thrown from `awaitInitialization` so that we can verify the exception in the `ExpectFailure` action later. It's fine since `StopStream` or `ExpectFailure` will catch `StreamingQueryException` as well.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16567 from zsxwing/SPARK-19113-2.

(cherry picked from commit c050c122)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

047506ba

[SPARK-19231][SPARKR] add error handling for download and untar for Spark release · 77202a6c

Felix Cheung authored 8 years ago


## What changes were proposed in this pull request?

When R starts as a package and needs to download the Spark release distribution, we need to handle errors from the download and untar steps and clean up; otherwise it will get stuck.

## How was this patch tested?

manually

Author: Felix Cheung <felixcheung_m@hotmail.com>

Closes #16589 from felixcheung/rtarreturncode.

(cherry picked from commit 278fa1eb)
Signed-off-by: Felix Cheung <felixcheung@apache.org>

77202a6c

Jan 17, 2017

[SPARK-19066][SPARKR][BACKPORT-2.1] LDA doesn't set optimizer correctly · 29b954bb

wm624@hotmail.com authored 8 years ago

## What changes were proposed in this pull request?
Backport the fix for SPARK-19066 to the 2.1 branch.

## How was this patch tested?
Unit tests

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16623 from wangmiao1981/bugport.

29b954bb

[SPARK-19129][SQL] SessionCatalog: Disallow empty part col values in partition spec · 3ec3e3f2

gatorsmile authored 8 years ago


Empty partition column values are not valid in a partition specification. Before this PR, we allowed users to specify them, and the Hive metastore does not detect or disallow them either. Thus, users hit the following strange error.

```Scala
val df = spark.createDataFrame(Seq((0, "a"), (1, "b"))).toDF("partCol1", "name")
df.write.mode("overwrite").partitionBy("partCol1").saveAsTable("partitionedTable")
spark.sql("alter table partitionedTable drop partition(partCol1='')")
spark.table("partitionedTable").show()
```

In the above example, the WHOLE table is DROPPED when users specify a partition spec containing only one partition column with empty values.

When there is more than one partition column, the Hive metastore APIs simply ignore the columns with empty values and treat the spec as a partial spec. This is also unexpected and does not follow the actual Hive behavior. This PR disallows such invalid partition specs in the `SessionCatalog` APIs.
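
A minimal sketch of such a validation, assuming a partition spec is modeled as a plain `Map[String, String]`. The helper name and the use of `require` (which throws `IllegalArgumentException`) are illustrative choices, not necessarily what `SessionCatalog` does.

```scala
object PartitionSpecCheckSketch {
  // Assumption: a partition spec maps partition column names to string values.
  type PartitionSpec = Map[String, String]

  // Reject any spec that contains an empty partition column value.
  def requireNonEmptyValues(spec: PartitionSpec): Unit = {
    val emptyCols = spec.collect { case (col, value) if value.isEmpty => col }
    require(
      emptyCols.isEmpty,
      s"Partition spec is invalid: empty value for column(s) ${emptyCols.mkString(", ")}")
  }
}
```

With a check like this, a spec such as `Map("partCol1" -> "")` is rejected before it reaches the metastore.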

Added test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16583 from gatorsmile/disallowEmptyPartColValue.

(cherry picked from commit a23debd7)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

3ec3e3f2