- Oct 05, 2016
-
-
Shixiong Zhu authored
## What changes were proposed in this pull request?

This PR adds a new project `external/kafka-0-10-sql` for the Structured Streaming Kafka source. It's based on the design doc: https://docs.google.com/document/d/19t2rWe51x7tq2e5AOfrsM9qb8_m7BRuv9fel9i0PqR8/edit?usp=sharing

tdas did most of the work, and part of it was inspired by koeninger's work.

### Introduction

The Kafka source is a Structured Streaming data source that polls data from Kafka. The schema of the data it reads is as follows:

Column | Type
---- | ----
key | binary
value | binary
topic | string
partition | int
offset | long
timestamp | long
timestampType | int

The source can deal with deleted topics. However, the user should make sure there is no Spark job processing the data when a topic is deleted.

### Configuration

The user can use `DataStreamReader.option` to set the following configurations.

Kafka Source's options | value | default | meaning
------ | ------- | ------- | -------
startingOffset | ["earliest", "latest"] | "latest" | The start point when a query is started, either "earliest", which is from the earliest offset, or "latest", which is just from the latest offset. Note: this only applies when a new streaming query is started; resuming will always pick up from where the query left off.
failOnDataLoss | [true, false] | true | Whether to fail the query when it's possible that data is lost (e.g., topics are deleted, or offsets are out of range). This may be a false alarm. You can disable it when it doesn't work as you expected.
subscribe | A comma-separated list of topics | (none) | The topic list to subscribe to. Only one of "subscribe" and "subscribePattern" can be specified for the Kafka source.
subscribePattern | Java regex string | (none) | The pattern used to subscribe to topics. Only one of "subscribe" and "subscribePattern" can be specified for the Kafka source.
kafka.consumer.poll.timeoutMs | long | 512 | The timeout in milliseconds to poll data from Kafka in executors.
fetchOffset.numRetries | int | 3 | Number of times to retry before giving up fetching the latest Kafka offsets.
fetchOffset.retryIntervalMs | long | 10 | Milliseconds to wait before retrying to fetch Kafka offsets.

Kafka's own configurations can be set via `DataStreamReader.option` with the `kafka.` prefix, e.g., `stream.option("kafka.bootstrap.servers", "host:port")`.

### Usage

* Subscribe to 1 topic

```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1")
  .load()
```

* Subscribe to multiple topics

```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribe", "topic1,topic2")
  .load()
```

* Subscribe to a pattern

```Scala
spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")
  .option("subscribePattern", "topic.*")
  .load()
```

## How was this patch tested?

The new unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Shixiong Zhu <zsxwing@gmail.com>
Author: cody koeninger <cody@koeninger.org>

Closes #15102 from zsxwing/kafka-source.
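For completeness, a minimal end-to-end sketch that consumes this source: it casts the binary key/value columns to strings and prints micro-batches to the console. The broker address `host:port`, the topic name `topic1`, and the checkpoint path are placeholders, not part of the change.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kafka-source-example").getOrCreate()

val query = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:port")      // placeholder broker
  .option("subscribe", "topic1")                        // placeholder topic
  .option("startingOffset", "earliest")                 // option from the table above
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")
  .writeStream
  .format("console")
  .option("checkpointLocation", "/tmp/kafka-example-checkpoint")  // path is an assumption
  .start()

query.awaitTermination()
```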
-
Herman van Hovell authored
## What changes were proposed in this pull request?

The result of the `Last` function can be wrong when the last partition processed is empty. It can return `null` instead of the expected value. For example, this can happen when we process partitions in the following order:

```
- Partition 1 [Row1, Row2]
- Partition 2 [Row3]
- Partition 3 []
```

In this case the `Last` function currently returns null instead of the value of `Row3`. This PR fixes this by adding a `valueSet` flag to the `Last` function.

## How was this patch tested?

We only used end-to-end tests for `DeclarativeAggregateFunction`s. I have added an evaluator for these functions so we can test them in catalyst. I have added a `LastTestSuite` to test the `Last` aggregate function.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #15348 from hvanhovell/SPARK-17758.
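A hedged illustration of the affected aggregate, runnable in spark-shell; the toy data and the repartitioning are assumptions used only to force multiple partitions, some of which may be empty.

```scala
import org.apache.spark.sql.functions.last
import spark.implicits._   // `spark` is the SparkSession provided by spark-shell

val df = Seq((1, "a"), (2, "b"), (3, "c")).toDF("id", "value")

// Before the fix, if the last partition processed happened to be empty, `last` could
// return null instead of the value coming from the last non-empty partition.
df.repartition(4).agg(last("value")).show()
```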
-
Shixiong Zhu authored
## What changes were proposed in this pull request? Mock SparkContext to reduce memory usage of BlockManagerSuite ## How was this patch tested? Jenkins Author: Shixiong Zhu <shixiong@databricks.com> Closes #15350 from zsxwing/SPARK-17778.
-
sethah authored
## What changes were proposed in this pull request? Updates user guide to reflect that LogisticRegression now supports multiclass. Also adds new examples to show multiclass training. ## How was this patch tested? Ran locally using spark-submit, run-example, and copy/paste from user guide into shells. Generated docs and verified correct output. Author: sethah <seth.hendrickson16@gmail.com> Closes #15349 from sethah/SPARK-17239.
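A short multinomial training sketch in the spirit of the new user-guide examples; the LIBSVM path is an assumption based on the sample data shipped with the Spark source tree, and a spark-shell `spark` session is assumed.

```scala
import org.apache.spark.ml.classification.LogisticRegression

// Multiclass data in LIBSVM format (path is an assumption; adjust to your checkout)
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_multiclass_classification_data.txt")

val lr = new LogisticRegression()
  .setFamily("multinomial")   // request multinomial (softmax) logistic regression
  .setMaxIter(10)
  .setRegParam(0.3)

val model = lr.fit(training)
println(s"Coefficients:\n${model.coefficientMatrix}")
println(s"Intercepts: ${model.interceptVector}")
```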
-
Dongjoon Hyun authored
## What changes were proposed in this pull request?

This PR fixes the following NPE scenario in two ways.

**Reported Error Scenario**

```scala
scala> sql("EXPLAIN DESCRIBE TABLE x").show(truncate = false)
INFO SparkSqlParser: Parsing command: EXPLAIN DESCRIBE TABLE x
java.lang.NullPointerException
```

- **DESCRIBE**: Extend the `DESCRIBE` syntax to accept `TABLE`.
- **EXPLAIN**: Prevent the NPE when the target statement fails to parse, e.g., `EXPLAIN DESCRIBE TABLES x`.

## How was this patch tested?

Pass the Jenkins test with a new test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15357 from dongjoon-hyun/SPARK-17328.
-
Herman van Hovell authored
## What changes were proposed in this pull request? Currently Spark SQL parses regular decimal literals (e.g. `10.00`) as decimals and scientific decimal literals (e.g. `10.0e10`) as doubles. The difference between the two confuses most users. This PR unifies the parsing behavior and parses scientific decimal literals as decimals as well. The implications for tests are limited to a single Hive compatibility test. ## How was this patch tested? Updated tests in `ExpressionParserSuite` and `SQLQueryTestSuite`. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #14828 from hvanhovell/SPARK-17258.
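A quick, hedged way to observe the unified behavior in spark-shell is to print the types the parser assigns to both literal forms; the snippet only prints the schema and asserts nothing, so it is safe regardless of the exact resulting types.

```scala
// Print the result types the parser assigns to a plain and a scientific literal.
spark.sql("SELECT 10.00 AS plain, 10.0e10 AS scientific").schema
  .foreach(f => println(s"${f.name}: ${f.dataType}"))
```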
-
hyukjinkwon authored
## What changes were proposed in this pull request?

The `write.df`/`read.df` API requires a path, which is not actually always necessary in Spark. Currently this only affects data sources implementing `CreatableRelationProvider`. Spark does not have internal data sources implementing this, but it would affect external data sources. In addition, we'd be able to use this approach in Spark's JDBC data source after https://github.com/apache/spark/pull/12601 is merged.

**Before**

- `read.df`

```r
> read.df(source = "json")
Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  : argument "x" is missing with no default
```

```r
> read.df(path = c(1, 2))
Error in dispatchFunc("read.df(path = NULL, source = NULL, schema = NULL, ...)",  : argument "x" is missing with no default
```

```r
> read.df(c(1, 2))
Error in invokeJava(isStatic = TRUE, className, methodName, ...) : java.lang.ClassCastException: java.lang.Double cannot be cast to java.lang.String
  at org.apache.spark.sql.execution.datasources.DataSource.hasMetadata(DataSource.scala:300)
  at ...
In if (is.na(object)) { : ...
```

- `write.df`

```r
> write.df(df, source = "json")
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘write.df’ for signature ‘"function", "missing"’
```

```r
> write.df(df, source = c(1, 2))
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
```

```r
> write.df(df, mode = TRUE)
Error in (function (classes, fdef, mtable) : unable to find an inherited method for function ‘write.df’ for signature ‘"SparkDataFrame", "missing"’
```

**After**

- `read.df`

```r
> read.df(source = "json")
Error in loadDF : analysis error - Unable to infer schema for JSON at . It must be specified manually;
```

```r
> read.df(path = c(1, 2))
Error in f(x, ...) : path should be charactor, null or omitted.
```

```r
> read.df(c(1, 2))
Error in f(x, ...) : path should be charactor, null or omitted.
```

- `write.df`

```r
> write.df(df, source = "json")
Error in save : illegal argument - 'path' is not specified
```

```r
> write.df(df, source = c(1, 2))
Error in .local(df, path, ...) : source should be charactor, null or omitted. It is 'parquet' by default.
```

```r
> write.df(df, mode = TRUE)
Error in .local(df, path, ...) : mode should be charactor or omitted. It is 'error' by default.
```

## How was this patch tested?

Unit tests in `test_sparkSQL.R`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15231 from HyukjinKwon/write-default-r.
-
- Oct 04, 2016
-
-
Tejas Patil authored
## What changes were proposed in this pull request? Jira: https://issues.apache.org/jira/browse/SPARK-17495 Spark internally uses Murmur3Hash for partitioning, which is different from the hash used by Hive. For queries that use bucketing, this leads to different results if one runs the same query on both engines. We want users to have backward compatibility so that one can switch parts of an application across the engines without observing regressions. This PR includes `HiveHash`, `HiveHashFunction`, and `HiveHasher`, which mimic Hive's hashing at https://github.com/apache/hive/blob/master/serde/src/java/org/apache/hadoop/hive/serde2/objectinspector/ObjectInspectorUtils.java#L638. I am intentionally not introducing any usages of this hash function in the rest of the code to keep this PR small. My eventual goal is to have Hive bucketing support in Spark. Once this PR gets in, I will make the hash function pluggable in the relevant areas (e.g. `HashPartitioning`'s `partitionIdExpression` has Murmur3 hardcoded: https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/physical/partitioning.scala#L265). ## How was this patch tested? Added `HiveHashSuite`. Author: Tejas Patil <tejasp@fb.com> Closes #15047 from tejasapatil/SPARK-17495_hive_hash.
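An illustrative, hedged sketch of why the hash function matters for bucketing. It uses only Spark's existing Murmur3-based `hash` expression, since the new `HiveHash` is intentionally not wired into anything yet; the bucket count and key column are made up.

```scala
import org.apache.spark.sql.functions.{col, hash, lit, pmod}

val numBuckets = 8
// Bucket ids Spark would derive from its Murmur3-based hash expression.
spark.range(10).toDF("key")
  .select(col("key"), pmod(hash(col("key")), lit(numBuckets)).as("murmur3_bucket"))
  .show()
// Hive derives bucket ids from its own hash function, so the same key can land in a
// different bucket; reproducing Hive's hash is what makes the assignments line up.
```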
-
Marcelo Vanzin authored
This reverts commit 9ac68dbc. Turns out the original fix was correct. Original change description: The existing code caches all stats for all columns for each partition in the driver; for a large relation, this causes extreme memory usage, which leads to gc hell and application failures. It seems that only the size in bytes of the data is actually used in the driver, so instead just collect that. In executors, the full stats are still kept, but that's not a big problem; we expect the data to be distributed and thus not really incur too much memory pressure in each individual executor. There are also potential improvements on the executor side, since the data being stored currently is very wasteful (e.g. storing boxed types vs. primitive types for stats). But that's a separate issue. Author: Marcelo Vanzin <vanzin@cloudera.com> Closes #15304 from vanzin/SPARK-17549.2.
-
Felix Cheung authored
## What changes were proposed in this pull request? Clean up output, format tables, truncate long example output, and hide warnings (new on the left, existing on the right in the screenshots attached to the PR). ## How was this patch tested? Ran create-doc.sh manually. Author: Felix Cheung <felixcheung_m@hotmail.com> Closes #15340 from felixcheung/vignettes.
-
Zheng RuiFeng authored
## What changes were proposed in this pull request? 1. Parity check and add missing test suites for ml's NB. 2. Remove some unused imports. ## How was this patch tested? Manual tests in spark-shell. Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15312 from zhengruifeng/nb_test_parity.
-
sumansomasundar authored
## What changes were proposed in this pull request? Made changes to record length offsets to make them uniform throughout various areas of Spark core and unsafe ## How was this patch tested? This change affects only SPARC architectures and was tested on X86 architectures as well for regression. Author: sumansomasundar <suman.somasundar@oracle.com> Closes #14762 from sumansomasundar/master.
-
Sean Owen authored
[SPARK-17671][WEBUI] Spark 2.0 history server summary page is slow even with spark.history.ui.maxApplications set ## What changes were proposed in this pull request? Return an Iterator of applications internally in the history server, for consistency and performance. See https://github.com/apache/spark/pull/15248 for some back-story. The code called by and calling HistoryServer.getApplicationList wants an Iterator, but this method materializes an Iterable, which potentially causes a performance problem. It's also simpler to make this internal method pass through an Iterator. ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #15321 from srowen/SPARK-17671.
-
ding authored
## What changes were proposed in this pull request? When using PeriodicGraphCheckpointer to persist a graph, sometimes the edges aren't persisted. Currently the graph is persisted only when the vertices' storage level is none. However, there is a chance the vertices' storage level is not none while the edges' is: e.g., for a graph created by an outerJoinVertices operation, the vertices are automatically cached while the edges are not. In that case the edges will not be persisted if we use PeriodicGraphCheckpointer to persist, so we need to check the edges' storage level separately and persist them if it is none. ## How was this patch tested? manual tests Author: ding <ding@localhost.localdomain> Closes #15124 from dding3/spark-persisitEdge.
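A minimal hedged sketch of the fix's idea (the helper name is made up): check the vertex and edge storage levels independently and persist each side only when it is unset.

```scala
import org.apache.spark.graphx.Graph
import org.apache.spark.storage.StorageLevel

// Persist vertices and edges independently, since operations like outerJoinVertices
// can leave vertices cached while edges are not.
def persistGraphIfNeeded[VD, ED](g: Graph[VD, ED]): Unit = {
  if (g.vertices.getStorageLevel == StorageLevel.NONE) g.vertices.persist()
  if (g.edges.getStorageLevel == StorageLevel.NONE) g.edges.persist()
}
```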
-
Ergin Seyfe authored
## What changes were proposed in this pull request? Added VoidObjectInspector to the list of PrimitiveObjectInspectors. ## How was this patch tested? Executing the following query was failing: `select SOME_UDAF*(a.arr) from (select Array(null) as arr from dim_one_row) a`. After the fix, I get the correct output: `res0: Array[org.apache.spark.sql.Row] = Array([null])`. Author: Ergin Seyfe <eseyfe@fb.com> Closes #15337 from seyfe/add_void_object_inspector.
-
- Oct 03, 2016
-
-
Takuya UESHIN authored
## What changes were proposed in this pull request? When code generation involves too many mutable states, the constructor that extracts values from `references` into fields exceeds the JVM method size limit. We should split the generated extractions in the constructor into smaller functions. ## How was this patch tested? I added some tests to check whether the generated code for such expressions exceeds the limit or not. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #15275 from ueshin/issues/SPARK-17702.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request?

Currently, Spark Thrift Server raises `IllegalArgumentException` for queries whose column types are `NullType`, e.g., `SELECT null` or `SELECT if(true,null,null)`. This PR fixes that by returning `void` like Hive 1.2.

**Before**

```sql
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
Connecting to jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Error: java.lang.IllegalArgumentException: Unrecognized type name: null (state=,code=0)
Closing: 0: jdbc:hive2://localhost:10000

$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
Connecting to jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
Error: java.lang.IllegalArgumentException: Unrecognized type name: null (state=,code=0)
Closing: 0: jdbc:hive2://localhost:10000
```

**After**

```sql
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
Connecting to jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-------+--+
| NULL  |
+-------+--+
| NULL  |
+-------+--+
1 row selected (3.242 seconds)
Beeline version 1.2.1.spark2 by Apache Hive
Closing: 0: jdbc:hive2://localhost:10000

$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
Connecting to jdbc:hive2://localhost:10000
Connected to: Spark SQL (version 2.1.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
+-------------------------+--+
| (IF(true, NULL, NULL))  |
+-------------------------+--+
| NULL                    |
+-------------------------+--+
1 row selected (0.201 seconds)
Beeline version 1.2.1.spark2 by Apache Hive
Closing: 0: jdbc:hive2://localhost:10000
```

## How was this patch tested?

* Pass the Jenkins test with a new testsuite.
* Also, manually, after starting Spark Thrift Server, run the following commands:

```sql
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select null"
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "select if(true,null,null)"
```

**Hive 1.2**

```sql
hive> create table null_table as select null;
hive> desc null_table;
OK
_c0                     void
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15325 from dongjoon-hyun/SPARK-17112.
-
Herman van Hovell authored
## What changes were proposed in this pull request? We currently only allow relatively simple expressions as the input for a value based case statement. Expressions like `case (a > 1) or (b = 2) when true then 1 when false then 0 end` currently fail. This PR adds support for such expressions. ## How was this patch tested? Added a test to the ExpressionParserSuite. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15322 from hvanhovell/SPARK-17753.
-
zero323 authored
## What changes were proposed in this pull request? Replaces `ValueError` with `IndexError` when the index passed to `ml` / `mllib` `SparseVector.__getitem__` is out of range. This ensures correct iteration behavior. Replaces `ValueError` with `IndexError` for `DenseMatrix` and `SparseMatrix` in `ml` / `mllib`. ## How was this patch tested? PySpark `ml` / `mllib` unit tests. Additional unit tests to prove that the problem has been resolved. Author: zero323 <zero323@users.noreply.github.com> Closes #15144 from zero323/SPARK-17587.
-
Jason White authored
## What changes were proposed in this pull request? This PR removes a patch on ListConverter from https://github.com/apache/spark/pull/5570, as it is no longer necessary. The underlying issue in Py4J https://github.com/bartdag/py4j/issues/160 was patched in https://github.com/bartdag/py4j/commit/224b94b6665e56a93a064073886e1d803a4969d2 and is present in 0.10.3, the version currently in use in Spark. ## How was this patch tested? The original test added in https://github.com/apache/spark/pull/5570 remains. Author: Jason White <jason.white@shopify.com> Closes #15254 from JasonMWhite/remove_listconverter_patch.
-
Sean Owen authored
## What changes were proposed in this pull request? Move note about labels being +1/-1 in formulation only to be just under the table of formulations. ## How was this patch tested? Doc build Author: Sean Owen <sowen@cloudera.com> Closes #15330 from srowen/SPARK-17718.
-
Zhenhua Wang authored
## What changes were proposed in this pull request?

Generate basic column statistics for all the atomic types:
- numeric types: max, min, num of nulls, ndv (number of distinct values)
- date/timestamp types: they are also represented as numbers internally, so they have the same stats as above
- string: avg length, max length, num of nulls, ndv
- binary: avg length, max length, num of nulls
- boolean: num of nulls, num of trues, num of falses

Also support storing and loading these statistics. One thing to notice: we support analyzing columns independently, e.g.:
sql1: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key;`
sql2: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS value;`

When running sql2 to collect column stats for `value`, we don’t remove the stats of column `key` which were analyzed in sql1 and not in sql2. As a result, **users need to guarantee consistency** between sql1 and sql2. If the table has been changed before sql2, users should re-analyze column `key` when they want to analyze column `value`: `ANALYZE TABLE src COMPUTE STATISTICS FOR COLUMNS key, value;`

## How was this patch tested?

Added unit tests.

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #15090 from wzhfy/colStats.
-
Jagadeesan authored
## What changes were proposed in this pull request? To build R docs (which are built when R tests are run), users need to install pandoc and rmarkdown. This was done for Jenkins in ~~[SPARK-17420](https://issues.apache.org/jira/browse/SPARK-17420)~~ … Author: Jagadeesan <as2@us.ibm.com> Closes #15309 from jagadeesanas2/SPARK-17736.
-
Alex Bozarth authored
## What changes were proposed in this pull request? The name of the Spark Thrift JDBC/ODBC Server in the web UI reflects the name of the class, i.e. org.apache.spark.sql.hive.thrift.HiveThriftServer2. I changed it to "Thrift JDBC/ODBC Server" (like "Spark shell" for spark-shell), as recommended by jaceklaskowski. Note the user can still change the name by adding the `--name "App Name"` parameter to the start script, as before. ## How was this patch tested? By running the script with various parameters and checking the web UI (screenshot attached to the PR). Author: Alex Bozarth <ajbozart@us.ibm.com> Closes #15268 from ajbozarth/spark17598.
-
- Oct 02, 2016
-
-
Tao LI authored
## What changes were proposed in this pull request?

This PR proposes to fix/skip some tests that fail on Windows. This PR takes over https://github.com/apache/spark/pull/12696.

**Before**

- **SparkSubmitSuite**

```
[info] - launch simple application with spark-submit *** FAILED *** (202 milliseconds)
[info]   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
[info] - includes jars passed in through --jars *** FAILED *** (1 second, 625 milliseconds)
[info]   java.io.IOException: Cannot run program "./bin/spark-submit" (in directory "C:\projects\spark"): CreateProcess error=2, The system cannot find the file specified
```

- **DiskStoreSuite**

```
[info] - reads of memory-mapped and non memory-mapped files are equivalent *** FAILED *** (1 second, 78 milliseconds)
[info]   diskStoreMapped.remove(blockId) was false (DiskStoreSuite.scala:41)
```

**After**

- **SparkSubmitSuite**

```
[info] - launch simple application with spark-submit (578 milliseconds)
[info] - includes jars passed in through --jars (1 second, 875 milliseconds)
```

- **DiskStoreSuite**

```
[info] DiskStoreSuite:
[info] - reads of memory-mapped and non memory-mapped files are equivalent !!! CANCELED !!! (766 milliseconds)
```

For `CreateTableAsSelectSuite` and `FsHistoryProviderSuite`, I could not reproduce the failures, as the Java version seems higher than the one that has the bugs around `setReadable(..)` and `setWritable(...)`, but as they are clearly reported bugs it'd be sensible to skip those tests. We should revert the changes for both as soon as we drop support for Java 7.

## How was this patch tested?

Manually tested via AppVeyor.

Closes #12696

Author: Tao LI <tl@microsoft.com>
Author: U-FAREAST\tl <tl@microsoft.com>
Author: hyukjinkwon <gurwls223@gmail.com>

Closes #15320 from HyukjinKwon/SPARK-14914.
-
Sital Kedia authored
## What changes were proposed in this pull request? When wrapping catalyst data types into Hive data types, the wrap function was doing an expensive pattern match that consumed around 11% of CPU time. Avoid the per-row pattern matching by resolving the wrapper only once and reusing it. ## How was this patch tested? Tested by running the job on a cluster and saw around 8% CPU improvement. Author: Sital Kedia <skedia@fb.com> Closes #15064 from sitalkedia/skedia/hive_wrapper.
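An illustrative, simplified sketch of the optimization (not the actual HiveInspectors code; the types and conversions are placeholders): resolve the conversion function once per column and reuse it for every value, instead of pattern matching on the data type per row.

```scala
import org.apache.spark.sql.types.{DataType, IntegerType, StringType}

// Resolve the wrapper once, outside the per-row loop.
def wrapperFor(dt: DataType): Any => Any = dt match {
  case StringType  => (v: Any) => v.toString   // placeholder conversion
  case IntegerType => (v: Any) => v
  case _           => (v: Any) => v
}

val wrapString = wrapperFor(StringType)              // resolved once per column
val wrapped = Seq[Any](1, "a", 2.0).map(wrapString)  // reused for every value
```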
-
- Oct 01, 2016
-
-
Sean Owen authored
## What changes were proposed in this pull request? Partial revert of #15277 to instead sort and store input to model rather than require sorted input ## How was this patch tested? Existing tests. Author: Sean Owen <sowen@cloudera.com> Closes #15299 from srowen/SPARK-17704.2.
-
Herman van Hovell authored
## What changes were proposed in this pull request? We added find and exists methods for Databases, Tables and Functions to the user-facing Catalog in PR https://github.com/apache/spark/pull/15301. However, it was brought up that the semantics of the `find` methods are more in line with a `get` method (get an object or else fail). So we rename these in this PR. ## How was this patch tested? Existing tests. Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15308 from hvanhovell/SPARK-17717-2.
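A hedged usage sketch of the renamed API (the `get*` methods plus the existing `*Exists` checks), assuming a `SparkSession` named `spark` and a table named `my_table`; both are assumptions, not part of the change.

```scala
// Check existence first, then fetch the catalog metadata with the get-or-fail semantics.
if (spark.catalog.tableExists("my_table")) {
  val table = spark.catalog.getTable("my_table")
  println(s"${table.database}.${table.name}: ${table.description}")
}
```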
-
Eric Liang authored
## What changes were proposed in this pull request? As a followup to SPARK-17666, ensure filesystem connections are not leaked, at least in unit tests. This is done here by intercepting filesystem calls as suggested by JoshRosen. At the end of each test, we assert no filesystem streams are left open. This applies to all tests using SharedSQLContext or SharedSparkContext. ## How was this patch tested? I verified that tests in sql and core are indeed using the filesystem backend, and fixed the detected leaks. I also checked that reverting https://github.com/apache/spark/pull/15245 causes many actual test failures due to connection leaks. Author: Eric Liang <ekl@databricks.com> Author: Eric Liang <ekhliang@gmail.com> Closes #15306 from ericl/sc-4672.
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? This PR aims to make the doc up-to-date. The documentation is generally correct, but after https://issues.apache.org/jira/browse/SPARK-13926, Spark starts to choose Kryo as the default serialization library during shuffling of simple types, arrays of simple types, or string type. ## How was this patch tested? This is a documentation update. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15315 from dongjoon-hyun/SPARK-DOC-SERIALIZER.
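For reference, a minimal sketch of the explicit setting the documentation still describes for classes outside the automatic simple-type/array/string case; nothing here is new API, just the standard configuration key.

```scala
import org.apache.spark.SparkConf

// Kryo is picked automatically for shuffles of simple types, arrays of simple types, and
// strings (per the updated doc); enabling it globally remains an explicit setting.
val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
```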
-
- Sep 30, 2016
-
-
Dongjoon Hyun authored
## What changes were proposed in this pull request?

Currently, Spark does not collapse adjacent windows with the same partitioning and sorting. This PR implements a `CollapseWindow` optimizer to do the following:

1. If the partition specs and order specs are the same, collapse into the parent.
2. If the partition specs are the same and one order spec is a prefix of the other, collapse to the more specific one.

For example:

```scala
val df = spark.range(1000).select($"id" % 100 as "grp", $"id", rand() as "col1", rand() as "col2")

// Add summary statistics for all columns
import org.apache.spark.sql.expressions.Window
val cols = Seq("id", "col1", "col2")
val window = Window.partitionBy($"grp").orderBy($"id")
val result = cols.foldLeft(df) { (base, name) =>
  base.withColumn(s"${name}_avg", avg(col(name)).over(window))
    .withColumn(s"${name}_stddev", stddev(col(name)).over(window))
    .withColumn(s"${name}_min", min(col(name)).over(window))
    .withColumn(s"${name}_max", max(col(name)).over(window))
}
```

**Before**

```scala
scala> result.explain
== Physical Plan ==
Window [max(col2#19) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_max#234], [grp#17L], [id#14L ASC NULLS FIRST]
+- Window [min(col2#19) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_min#216], [grp#17L], [id#14L ASC NULLS FIRST]
+- Window [stddev_samp(col2#19) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#191], [grp#17L], [id#14L ASC NULLS FIRST]
+- Window [avg(col2#19) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#167], [grp#17L], [id#14L ASC NULLS FIRST]
+- Window [max(col1#18) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_max#152], [grp#17L], [id#14L ASC NULLS FIRST]
+- Window [min(col1#18) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_min#138], [grp#17L], [id#14L ASC NULLS FIRST]
+- Window [stddev_samp(col1#18) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_stddev#117], [grp#17L], [id#14L ASC NULLS FIRST]
+- Window [avg(col1#18) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_avg#97], [grp#17L], [id#14L ASC NULLS FIRST]
+- Window [max(id#14L) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_max#86L], [grp#17L], [id#14L ASC NULLS FIRST]
+- Window [min(id#14L) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_min#76L], [grp#17L], [id#14L ASC NULLS FIRST]
+- *Project [grp#17L, id#14L, col1#18, col2#19, id_avg#26, id_stddev#42]
+- Window [stddev_samp(_w0#59) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_stddev#42], [grp#17L], [id#14L ASC NULLS FIRST]
+- *Project [grp#17L, id#14L, col1#18, col2#19, id_avg#26, cast(id#14L as double) AS _w0#59]
+- Window [avg(id#14L) windowspecdefinition(grp#17L, id#14L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_avg#26], [grp#17L], [id#14L ASC NULLS FIRST]
+- *Sort [grp#17L ASC NULLS FIRST, id#14L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(grp#17L, 200)
+- *Project [(id#14L % 100) AS grp#17L, id#14L, rand(-6329949029880411066) AS col1#18, rand(-7251358484380073081) AS col2#19]
+- *Range (0, 1000, step=1, splits=Some(8))
```

**After**

```scala
scala> result.explain
== Physical Plan ==
Window [max(col2#5) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_max#220, min(col2#5) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_min#202, stddev_samp(col2#5) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_stddev#177, avg(col2#5) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col2_avg#153, max(col1#4) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_max#138, min(col1#4) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_min#124, stddev_samp(col1#4) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_stddev#103, avg(col1#4) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS col1_avg#83, max(id#0L) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_max#72L, min(id#0L) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_min#62L], [grp#3L], [id#0L ASC NULLS FIRST]
+- *Project [grp#3L, id#0L, col1#4, col2#5, id_avg#12, id_stddev#28]
+- Window [stddev_samp(_w0#45) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_stddev#28], [grp#3L], [id#0L ASC NULLS FIRST]
+- *Project [grp#3L, id#0L, col1#4, col2#5, id_avg#12, cast(id#0L as double) AS _w0#45]
+- Window [avg(id#0L) windowspecdefinition(grp#3L, id#0L ASC NULLS FIRST, RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS id_avg#12], [grp#3L], [id#0L ASC NULLS FIRST]
+- *Sort [grp#3L ASC NULLS FIRST, id#0L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(grp#3L, 200)
+- *Project [(id#0L % 100) AS grp#3L, id#0L, rand(6537478539664068821) AS col1#4, rand(-8961093871295252795) AS col2#5]
+- *Range (0, 1000, step=1, splits=Some(8))
```

## How was this patch tested?

Pass the Jenkins tests with a newly added testsuite.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #15317 from dongjoon-hyun/SPARK-17739.
-
Shubham Chopra authored
## What changes were proposed in this pull request? This PR makes block replication strategies pluggable. It provides two traits that can be implemented: one that maps a host to its topology and is used in the master, and a second that helps prioritize a list of peers for block replication and runs in the executors. This patch contains default implementations of these traits that make sure current Spark behavior is unchanged. ## How was this patch tested? This patch should not change Spark behavior in any way, and was tested with unit tests for storage. Author: Shubham Chopra <schopra31@bloomberg.net> Closes #13152 from shubhamchopra/RackAwareBlockReplication.
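A hedged sketch of what a custom host-to-topology mapper could look like; the class and method names follow this change's default implementations, but treat the exact signature as an assumption, and a real implementation would typically consult the cluster's rack-awareness script or configuration.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.storage.TopologyMapper

// Map hosts named like "rack1-node07" to the topology string "rack1" (naming scheme assumed).
class SimpleRackMapper(conf: SparkConf) extends TopologyMapper(conf) {
  override def getTopologyForHost(hostname: String): Option[String] =
    hostname.split("-").headOption
}
```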
-
Takuya UESHIN authored
## What changes were proposed in this pull request? There are many minor objects in `references` that are extracted into fields of the generated class, e.g. `errMsg` in `GetExternalRowField` or `ValidateExternalType`, but the number of fields in a class is limited, so we should reduce it. This PR adds an unnamed version of `addReferenceObj` for these minor objects, so that the object is not stored in a field but is referred to through the `references` field at the time of use. ## How was this patch tested? Existing tests. Author: Takuya UESHIN <ueshin@happy-camper.st> Closes #15276 from ueshin/issues/SPARK-17703.
-
Davies Liu authored
## What changes were proposed in this pull request? The actualSize() of array and map is different from the actual size: the header is an Int rather than a Long. ## How was this patch tested? The flaky test should be fixed. Author: Davies Liu <davies@databricks.com> Closes #15305 from davies/fix_MAP.
-
Zheng RuiFeng authored
[SPARK-14077][ML][FOLLOW-UP] Revert change for NB Model's Load to maintain compatibility with the model stored before 2.0 ## What changes were proposed in this pull request? Revert change for NB Model's Load to maintain compatibility with the model stored before 2.0 ## How was this patch tested? local build Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #15313 from zhengruifeng/revert_save_load.
-
Zheng RuiFeng authored
## What changes were proposed in this pull request? 1. Support weighted data. 2. Use Dataset/DataFrame instead of RDD. 3. Make mllib a wrapper that calls ml. ## How was this patch tested? Local manual tests in spark-shell; unit tests. Author: Zheng RuiFeng <ruifengz@foxmail.com> Closes #12819 from zhengruifeng/weighted_nb.
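A hedged usage sketch of the weighted training path; the column names and the `training` DataFrame are assumptions, not part of the change.

```scala
import org.apache.spark.ml.classification.NaiveBayes

val nb = new NaiveBayes()
  .setSmoothing(1.0)
  .setWeightCol("weight")   // per-instance weights, the new capability

// `training` is assumed to be a DataFrame with "label", "features" and "weight" columns.
// val model = nb.fit(training)
```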
-
- Sep 29, 2016
-
-
Herman van Hovell authored
## What changes were proposed in this pull request? The current user-facing catalog does not implement methods for checking object existence or finding objects. You could theoretically do this using the `list*` commands, but this is rather cumbersome and can actually be costly when there are many objects. This PR adds `exists*` and `find*` methods for Databases, Tables and Functions. ## How was this patch tested? Added tests to `org.apache.spark.sql.internal.CatalogSuite` Author: Herman van Hovell <hvanhovell@databricks.com> Closes #15301 from hvanhovell/SPARK-17717.
-
Bryan Cutler authored
[SPARK-17697][ML] Fixed bug in summary calculations that pattern match against label without casting ## What changes were proposed in this pull request? When calling LogisticRegression.evaluate and GeneralizedLinearRegression.evaluate with a Dataset whose label column is not of double type, the summary calculations pattern match against a double and throw a MatchError. This fix casts the label column to DoubleType to ensure there is no MatchError. ## How was this patch tested? Added unit tests that call evaluate with a dataset whose label has other numeric types. Author: Bryan Cutler <cutlerb@gmail.com> Closes #15288 from BryanCutler/binaryLOR-numericCheck-SPARK-17697.
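A minimal hedged sketch of the fix's idea (the helper name is made up, not the patch's actual code): cast the label column to DoubleType up front so later pattern matches on Double cannot hit a MatchError when labels arrive as Int, Float, or Decimal.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Normalize the label column's type before any Double-based pattern matching.
def withDoubleLabel(dataset: DataFrame, labelCol: String = "label"): DataFrame =
  dataset.withColumn(labelCol, col(labelCol).cast(DoubleType))
```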
-
Dongjoon Hyun authored
## What changes were proposed in this pull request? `FsHistoryProviderSuite` fails if the `root` user runs it. The test case **SPARK-3697: ignore directories that cannot be read** depends on `setReadable(false, false)` to make test data files unreadable and expects the number of accessible files to be 1. But `root` can access all files, so it returns 2. This PR adds this assumption explicitly to the docs (`building-spark.md`). ## How was this patch tested? This is a documentation change. Author: Dongjoon Hyun <dongjoon@apache.org> Closes #15291 from dongjoon-hyun/SPARK-17412.
-
Imran Rashid authored
## What changes were proposed in this pull request? FsHistoryProvider was writing a hidden file (to check the fs's clock). Even though it deleted the file immediately, sometimes another thread would try to scan the files on the fs in-between, and then there would be an error msg logged which was very misleading for the end-user. (The logged error was harmless, though.) ## How was this patch tested? I added one unit test, but to be clear, that test was passing before. The actual change in behavior in that test is just logging (after the change, there is no more logged error), which I just manually verified. Author: Imran Rashid <irashid@cloudera.com> Closes #15250 from squito/SPARK-17676.
-