Commits · ee735a8a85d7f015188f7cb31975f60cc969e453 · cs525-sp18-g07 / spark

Jan 06, 2017

[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for... · ee735a8a

Tathagata Das authored 8 years ago

[SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options

## What changes were proposed in this pull request?

Updates
- Updated Late Data Handling section by adding a figure for Update Mode. Its more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png

)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.

(cherry picked from commit b59cddab)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

ee735a8a

[SPARK-19083] sbin/start-history-server.sh script use of $@ without quotes · ce9bfe6d

zuotingbing authored 8 years ago

JIRA Issue: https://issues.apache.org/jira/browse/SPARK-19083#



sbin/start-history-server.sh script use of $ without quotes, this will affect the length of args which used in HistoryServerArguments::parse(args: List[String])

Author: zuotingbing <zuo.tingbing9@zte.com.cn>

Closes #16484 from zuotingbing/sh.

(cherry picked from commit a9a13737)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

ce9bfe6d

[SPARK-19033][CORE] Add admin acls for history server · 4ca17888

jerryshao authored 8 years ago


## What changes were proposed in this pull request?

Current HistoryServer's ACLs is derived from application event-log, which means the newly changed ACLs cannot be applied to the old data, this will become a problem where newly added admin cannot access the old application history UI, only the new application can be affected.

So here propose to add admin ACLs for history server, any configured user/group could have the view access to all the applications, while the view ACLs derived from application run-time still take effect.

## How was this patch tested?

Unit test added.

Author: jerryshao <sshao@hortonworks.com>

Closes #16470 from jerryshao/SPARK-19033.

(cherry picked from commit 4a4c3dc9)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>

4ca17888

Jan 04, 2017

[SPARK-18877][SQL][BACKPORT-2.1] CSVInferSchema.inferField` on DecimalType... · 1ecf1a95

Dongjoon Hyun authored 8 years ago

[SPARK-18877][SQL][BACKPORT-2.1] CSVInferSchema.inferField` on DecimalType should find a common type with `typeSoFar`

## What changes were proposed in this pull request?

CSV type inferencing causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a **partition**. Specifically, `inferRowType`, the **seqOp** of **aggregate**, returns the last decimal type. This PR fixes it to use `findTightestCommonType`.

**decimal.csv**
```
9.03E+12
1.19E+11
```

**BEFORE**
```scala
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(3,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3
```

**AFTER**
```scala
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(4,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
+---------+
|      _c0|
+---------+
|9.030E+12|
| 1.19E+11|
+---------+
```

## How was this patch tested?

Pass the newly add test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16463 from dongjoon-hyun/SPARK-18877-BACKPORT-21.

1ecf1a95

Jan 03, 2017

[SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned... · 77625506

gatorsmile authored 8 years ago

[SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned Tables in InMemoryCatalog

### What changes were proposed in this pull request?
The data in the managed table should be deleted after table is dropped. However, if the partition location is not under the location of the partitioned table, it is not deleted as expected. Users can specify any location for the partition when they adding a partition.

This PR is to delete partition location when dropping managed partitioned tables stored in `InMemoryCatalog`.

### How was this patch tested?
Added test cases for both HiveExternalCatalog and InMemoryCatalog

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16448 from gatorsmile/unsetSerdeProp.

(cherry picked from commit b67b35f7)
Signed-off-by: gatorsmile <gatorsmile@gmail.com>

77625506

Jan 02, 2017

[SPARK-19028][SQL] Fixed non-thread-safe functions used in SessionCatalog · 94272a96

gatorsmile authored 8 years ago


### What changes were proposed in this pull request?
Fixed non-thread-safe functions used in SessionCatalog:
- refreshTable
- lookupRelation

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16437 from gatorsmile/addSyncToLookUpTable.

(cherry picked from commit 35e97407)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

94272a96

[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide · d489e1dc

Liwei Lin authored 8 years ago

## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.

d489e1dc

[SPARK-18379][SQL] Make the parallelism of parallelPartitionDiscovery configurable. · 517f3983

genmao.ygm authored 8 years ago


## What changes were proposed in this pull request?

The largest parallelism in PartitioningAwareFileIndex #listLeafFilesInParallel() is 10000 in hard code. We may need to make this number configurable. And in PR, I reduce it to 100.

## How was this patch tested?

Existing ut.

Author: genmao.ygm <genmao.ygm@genmaoygmdeMacBook-Air.local>
Author: dylon <hustyugm@gmail.com>

Closes #15829 from uncleGen/SPARK-18379.

(cherry picked from commit 745ab8bc)
Signed-off-by: Sean Owen <sowen@cloudera.com>

Unverified

517f3983

[MINOR][DOC] Minor doc change for YARN credential providers · 63857c8d

Liang-Chi Hsieh authored 8 years ago

## What changes were proposed in this pull request?

The configuration `spark.yarn.security.tokens.{service}.enabled` is deprecated. Now we should use `spark.yarn.security.credentials.{service}.enabled`. Some places in the doc is not updated yet.

## How was this patch tested?

N/A. Just doc change.

Please review http://spark.apache.org/contributing.html

 before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16444 from viirya/minor-credential-provider-doc.

(cherry picked from commit 0ac2f1e7)
Signed-off-by: Sean Owen <sowen@cloudera.com>

Unverified

63857c8d

Jan 01, 2017

[SPARK-19050][SS][TESTS] Fix EventTimeWatermarkSuite 'delay in months and years handled correctly' · 3483defe

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

`monthsSinceEpoch` in this test is like `math.floor(num)`, so `monthDiff` has two possible values.

## How was this patch tested?

Jenkins.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16449 from zsxwing/watermark-test-hotfix.

(cherry picked from commit 23940473)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

3483defe

Dec 30, 2016

[SPARK-19016][SQL][DOC] Document scalable partition handling · 20ae1172

Cheng Lian authored 8 years ago


This PR documents the scalable partition handling feature in the body of the programming guide.

Before this PR, we only mention it in the migration guide. It's not super clear that external datasource tables require an extra `MSCK REPAIR TABLE` command is to have per-partition information persisted since 2.1.

N/A.

Author: Cheng Lian <lian@databricks.com>

Closes #16424 from liancheng/scalable-partition-handling-doc.

(cherry picked from commit 871f6114)
Signed-off-by: Cheng Lian <lian@databricks.com>

20ae1172

Dec 29, 2016

[SPARK-19003][DOCS] Add Java example in Spark Streaming Guide, section Design... · 47ab4afe

adesharatushar authored 8 years ago

[SPARK-19003][DOCS] Add Java example in Spark Streaming Guide, section Design Patterns for using foreachRDD

## What changes were proposed in this pull request?

Added missing Java example under section "Design Patterns for using foreachRDD". Now this section has examples in all 3 languages, improving consistency of documentation.

## How was this patch tested?

Manual.
Generated docs using command "SKIP_API=1 jekyll build" and verified generated HTML page manually.

The syntax of example has been tested for correctness using sample code on Java1.7 and Spark 2.2.0-SNAPSHOT.

Author: adesharatushar <tushar_adeshara@persistent.com>

Closes #16408 from adesharatushar/streaming-doc-fix.

(cherry picked from commit dba81e1d)
Signed-off-by: Sean Owen <sowen@cloudera.com>

Unverified

47ab4afe

Dec 28, 2016

[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding... · 80d583bd

Tathagata Das authored 8 years ago

[SPARK-18669][SS][DOCS] Update Apache docs for Structured Streaming regarding watermarking and status

## What changes were proposed in this pull request?

- Extended the Window operation section with code snippet and explanation of watermarking
- Extended the Output Mode section with a table showing the compatibility between query type and output mode
- Rewrote the Monitoring section with updated jsons generated by StreamingQuery.progress/status
- Updated API changes in the StreamingQueryListener example

TODO
- [x] Figure showing the watermarking

## How was this patch tested?

N/A

## Screenshots
### Section: Windowed Aggregation with Event Time

<img width="927" alt="screen shot 2016-12-15 at 3 33 10 pm" src="https://cloud.githubusercontent.com/assets/663212/21246197/0e02cb1a-c2dc-11e6-8816-0cd28d8201d7.png">

![image](https://cloud.githubusercontent.com/assets/663212/21246241/45b0f87a-c2dc-11e6-9c29-d0a89e07bf8d.png)

<img width="929" alt="screen shot 2016-12-15 at 3 33 46 pm" src="https://cloud.githubusercontent.com/assets/663212/21246202/1652cefa-c2dc-11e6-8c64-3c05977fb3fc.png">

----------------------------
### Section: Output Modes
![image](https://cloud.githubusercontent.com/assets/663212/21246276/8ee44948-c2dc-11e6-9fa2-30502fcf9a55.png)

----------------------------
### Section: Monitoring
![image](https://cloud.githubusercontent.com/assets/663212/21246535/3c5baeb2-c2de-11e6-88cd-ca71db7c5cf9.png)
![image](https://cloud.githubusercontent.com/assets/663212/21246574/789492c2-c2de-11e6-8471-7bef884e1837.png

)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16294 from tdas/SPARK-18669.

(cherry picked from commit 092c6725)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

80d583bd

[SPARK-18993][BUILD] Unable to build/compile Spark in IntelliJ due to missing... · 7197a7bc

Sean Owen authored 8 years ago

[SPARK-18993][BUILD] Unable to build/compile Spark in IntelliJ due to missing Scala deps in spark-tags

## What changes were proposed in this pull request?

This adds back a direct dependency on Scala library classes from spark-tags because its Scala annotations need them.

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #16418 from srowen/SPARK-18993.

(cherry picked from commit d7bce3bd)
Signed-off-by: Sean Owen <sowen@cloudera.com>

Unverified

7197a7bc

[MINOR][DOC] Fix doc of ForeachWriter to use writeStream · ac7107fe

Carson Wang authored 8 years ago


## What changes were proposed in this pull request?

Fix the document of `ForeachWriter` to use `writeStream` instead of `write` for a streaming dataset.

## How was this patch tested?
Docs only.

Author: Carson Wang <carson.wang@intel.com>

Closes #16419 from carsonwang/FixDoc.

(cherry picked from commit 2a5f52a7)
Signed-off-by: Sean Owen <sowen@cloudera.com>

Unverified

ac7107fe

Dec 24, 2016

[SPARK-18837][WEBUI] Very long stage descriptions do not wrap in the UI · ca25b1e5

Kousuke Saruta authored 8 years ago

## What changes were proposed in this pull request?

This issue was reported by wangyum.

In the AllJobsPage, JobPage and StagePage, the description length was limited before like as follows.

![ui-2 0 0](https://cloud.githubusercontent.com/assets/4736016/21319673/8b225246-c651-11e6-9041-4fcdd04f4dec.gif)

But recently, the limitation seems to have been accidentally removed.

![ui-2 1 0](https://cloud.githubusercontent.com/assets/4736016/21319825/104779f6-c652-11e6-8bfa-dfd800396352.gif)

The cause is that some tables are no longer `sortable` class although they were, and `sortable` class does not only mark tables as sortable but also limited the width of their child `td` elements.
The reason why now some tables are not `sortable` class is because another sortable mechanism was introduced by #13620 and #13708 with pagination feature.

To fix this issue, I've introduced new class `table-cell-width-limited` which limits the description cell width and the description is like what it was.

<img width="1260" alt="2016-12-20 1 00 34" src="https://cloud.githubusercontent.com/assets/4736016/21320478/89141c7a-c654-11e6-8494-f8f91325980b.png

">

## How was this patch tested?

Tested manually with my browser.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #16338 from sarutak/SPARK-18837.

(cherry picked from commit f2ceb2ab)
Signed-off-by: Sean Owen <sowen@cloudera.com>

Unverified

ca25b1e5

Dec 23, 2016

[SPARK-18991][CORE] Change ContextCleaner.referenceBuffer to use... · 5bafdc45

Shixiong Zhu authored 8 years ago

[SPARK-18991][CORE] Change ContextCleaner.referenceBuffer to use ConcurrentHashMap to make it faster

## What changes were proposed in this pull request?

The time complexity of ConcurrentHashMap's `remove` is O(1). Changing ContextCleaner.referenceBuffer's type from `ConcurrentLinkedQueue` to `ConcurrentHashMap's` will make the removal much faster.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16390 from zsxwing/SPARK-18991.

(cherry picked from commit a848f0ba)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

5bafdc45

Dec 22, 2016

[SPARK-18972][CORE] Fix the netty thread names for RPC · 1857acc7

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

Right now the name of threads created by Netty for Spark RPC are `shuffle-client-**` and `shuffle-server-**`. It's pretty confusing.

This PR just uses the module name in TransportConf to set the thread name. In addition, it also includes the following minor fixes:

- TransportChannelHandler.channelActive and channelInactive should call the corresponding super methods.
- Make ShuffleBlockFetcherIterator throw NoSuchElementException if it has no more elements. Otherwise,  if the caller calls `next` without `hasNext`, it will just hang.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16380 from zsxwing/SPARK-18972.

(cherry picked from commit f252cb5d)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

1857acc7

[SPARK-18985][SS] Add missing @InterfaceStability.Evolving for Structured Streaming APIs · 5e801034

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

Add missing InterfaceStability.Evolving for Structured Streaming APIs

## How was this patch tested?

Compiling the codes.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16385 from zsxwing/SPARK-18985.

(cherry picked from commit 2246ce88)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

5e801034

[SPARK-17807][CORE] split test-tags into test-JAR · 132f2297

Ryan Williams authored 8 years ago


Remove spark-tag's compile-scope dependency (and, indirectly, spark-core's compile-scope transitive-dependency) on scalatest by splitting test-oriented tags into spark-tags' test JAR.

Alternative to #16303.

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #16311 from ryan-williams/tt.

(cherry picked from commit afd9bc1d)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

132f2297

[SPARK-18973][SQL] Remove SortPartitions and RedistributeData · f6853b3e

Reynold Xin authored 8 years ago


## What changes were proposed in this pull request?
SortPartitions and RedistributeData logical operators are not actually used and can be removed. Note that we do have a Sort operator (with global flag false) that subsumed SortPartitions.

## How was this patch tested?
Also updated test cases to reflect the removal.

Author: Reynold Xin <rxin@databricks.com>

Closes #16381 from rxin/SPARK-18973.

(cherry picked from commit 26151000)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

f6853b3e

[DOC] bucketing is applicable to all file-based data sources · ec0d6e21

Reynold Xin authored 8 years ago


## What changes were proposed in this pull request?
Starting Spark 2.1.0, bucketing feature is available for all file-based data sources. This patch fixes some function docs that haven't yet been updated to reflect that.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #16349 from rxin/ds-doc.

(cherry picked from commit 2e861df9)
Signed-off-by: Reynold Xin <rxin@databricks.com>

ec0d6e21

[SQL] Minor readability improvement for partition handling code · def3690f

Reynold Xin authored 8 years ago


This patch includes minor changes to improve readability for partition handling code. I'm in the middle of implementing some new feature and found some naming / implicit type inference not as intuitive.

This patch should have no semantic change and the changes should be covered by existing test cases.

Author: Reynold Xin <rxin@databricks.com>

Closes #16378 from rxin/minor-fix.

(cherry picked from commit 7c5b7b3a)
Signed-off-by: Reynold Xin <rxin@databricks.com>

def3690f

[SPARK-18908][SS] Creating StreamingQueryException should check if logicalPlan is created · 07e2a17d

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

This PR audits places using `logicalPlan` in StreamExecution and ensures they all handles the case that `logicalPlan` cannot be created.

In addition, this PR also fixes the following issues in `StreamingQueryException`:
- `StreamingQueryException` and `StreamExecution` are cycle-dependent because in the `StreamingQueryException`'s constructor, it calls `StreamExecution`'s `toDebugString` which uses `StreamingQueryException`. Hence it will output `null` value in the error message.
- Duplicated stack trace when calling Throwable.printStackTrace because StreamingQueryException's toString contains the stack trace.

## How was this patch tested?

The updated `test("max files per trigger - incorrect values")`. I found this issue when I switched from `testStream` to the real codes to verify the failure in this test.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16322 from zsxwing/SPARK-18907.

(cherry picked from commit ff7d82a2)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

07e2a17d

Dec 21, 2016

[FLAKY-TEST] InputStreamsSuite.socket input stream · 9a3c5bd7

Burak Yavuz authored 8 years ago

## What changes were proposed in this pull request?

https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.streaming.InputStreamsSuite&test_name=socket+input+stream



## How was this patch tested?

Tested 2,000 times.

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #16343 from brkyvz/sock.

(cherry picked from commit afe36516)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

9a3c5bd7

[SPARK-18528][SQL] Fix a bug to initialise an iterator of aggregation buffer · 021952d5

Takeshi YAMAMURO authored 8 years ago

## What changes were proposed in this pull request?
This pr is to fix an `NullPointerException` issue caused by a following `limit + aggregate` query;
```
scala> val df = Seq(("a", 1), ("b", 2), ("c", 1), ("d", 5)).toDF("id", "value")
scala> df.limit(2).groupBy("id").count().show
WARN TaskSetManager: Lost task 0.0 in stage 9.0 (TID 8204, lvsp20hdn012.stubprod.com): java.lang.NullPointerException
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
```
The root culprit is that [`$doAgg()`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L596) skips an initialization of [the buffer iterator](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L603);

 `BaseLimitExec` sets `stopEarly=true` and `$doAgg()` exits in the middle without the initialization.

## How was this patch tested?
Added a test to check if no exception happens for limit + aggregates in `DataFrameAggregateSuite.scala`.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #15980 from maropu/SPARK-18528.

(cherry picked from commit b41ec997)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

021952d5

[SPARK-18234][SS] Made update mode public · 60e02a17

Tathagata Das authored 8 years ago


## What changes were proposed in this pull request?

Made update mode public. As part of that here are the changes.
- Update DatastreamWriter to accept "update"
- Changed package of InternalOutputModes from o.a.s.sql to o.a.s.sql.catalyst
- Added update mode state removing with watermark to StateStoreSaveExec

## How was this patch tested?

Added new tests in changed modules

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16360 from tdas/SPARK-18234.

(cherry picked from commit 83a6ace0)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

60e02a17

[SPARK-18588][SS][KAFKA] Create a new KafkaConsumer when error happens to fix the flaky test · 17ef57fe

Shixiong Zhu authored 8 years ago

## What changes were proposed in this pull request?

When KafkaSource fails on Kafka errors, we should create a new consumer to retry rather than using the existing broken one because it's possible that the broken one will fail again.

This PR also assigns a new group id to the new created consumer for a possible race condition:  the broken consumer cannot talk with the Kafka cluster in `close` but the new consumer can talk to Kafka cluster. I'm not sure if this will happen or not. Just for safety to avoid that the Kafka cluster thinks there are two consumers with the same group id in a short time window. (Note: CachedKafkaConsumer doesn't need this fix since `assign` never uses the group id.)

## How was this patch tested?

In https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/70370/console

 , it ran this flaky test 120 times and all passed.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16282 from zsxwing/kafka-fix.

(cherry picked from commit 95efc895)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

17ef57fe

[SPARK-18949][SQL][BACKPORT-2.1] Add recoverPartitions API to Catalog · 0e51bb08

gatorsmile authored 8 years ago

### What changes were proposed in this pull request?

This PR is to backport https://github.com/apache/spark/pull/16356 to Spark 2.1.1 branch.

----

Currently, we only have a SQL interface for recovering all the partitions in the directory of a table and update the catalog. `MSCK REPAIR TABLE` or `ALTER TABLE table RECOVER PARTITIONS`. (Actually, very hard for me to remember `MSCK` and have no clue what it means)

After the new "Scalable Partition Handling", the table repair becomes much more important for making visible the data in the created data source partitioned table.

Thus, this PR is to add it into the Catalog interface. After this PR, users can repair the table by
```Scala
spark.catalog.recoverPartitions("testTable")
```

### How was this patch tested?
Modified the existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16372 from gatorsmile/repairTable2.1.1.

0e51bb08

[SPARK-18954][TESTS] Fix flaky test: o.a.s.streaming.BasicOperationsSuite rdd... · 31848342

Shixiong Zhu authored 8 years ago

[SPARK-18954][TESTS] Fix flaky test: o.a.s.streaming.BasicOperationsSuite rdd cleanup - map and window

## What changes were proposed in this pull request?

The issue in this test is the cleanup of RDDs may not be able to finish before stopping StreamingContext. This PR basically just puts the assertions into `eventually` and runs it before stopping StreamingContext.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16362 from zsxwing/SPARK-18954.

(cherry picked from commit 078c71c2)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

31848342

[SPARK-18031][TESTS] Fix flaky test ExecutorAllocationManagerSuite.basic functionality · 162bdb91

Shixiong Zhu authored 8 years ago


## What changes were proposed in this pull request?

The failure is because in `test("basic functionality")`, it doesn't block until `ExecutorAllocationManager.manageAllocation` is called. This PR just adds StreamManualClock to allow the tests to block on expected wait time to make the test deterministic.

## How was this patch tested?

Jenkins

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16321 from zsxwing/SPARK-18031.

(cherry picked from commit ccfe60a8)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

162bdb91

[SPARK-18894][SS] Fix event time watermark delay threshold specified in months or years · 3c8861d9

Tathagata Das authored 8 years ago


## What changes were proposed in this pull request?

Two changes
- Fix how delays specified in months and years are translated to milliseconds
- Following up on #16258, not show watermark when there is no watermarking in the query

## How was this patch tested?
Updated and new unit tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16304 from tdas/SPARK-18834-1.

(cherry picked from commit 607a1e63)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

3c8861d9

[SPARK-18947][SQL] SQLContext.tableNames should not call Catalog.listTables · bc54a14b

Wenchen Fan authored 8 years ago


## What changes were proposed in this pull request?

It's a huge waste to call `Catalog.listTables` in `SQLContext.tableNames`, which only need the table names, while `Catalog.listTables` will get the table metadata for each table name.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16352 from cloud-fan/minor.

(cherry picked from commit b7650f11)
Signed-off-by: Wenchen Fan <wenchen@databricks.com>

bc54a14b

Dec 20, 2016

[SPARK-18900][FLAKY-TEST] StateStoreSuite.maintenance · 063a98e5

Burak Yavuz authored 8 years ago

## What changes were proposed in this pull request?

It was pretty flaky before 10 days ago.
https://spark-tests.appspot.com/test-details?suite_name=org.apache.spark.sql.execution.streaming.state.StateStoreSuite&test_name=maintenance



Since no code changes went into this code path to not be so flaky, I'm just increasing the timeouts such that load related flakiness shouldn't be a problem. As you may see from the testing, I haven't been able to reproduce it.

## How was this patch tested?

2000 retries 5 times

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #16314 from brkyvz/maint-flaky.

(cherry picked from commit b2dd8ec6)
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>

063a98e5

[SPARK-18927][SS] MemorySink for StructuredStreaming can't recover from... · 3857d5ba

Burak Yavuz authored 8 years ago

[SPARK-18927][SS] MemorySink for StructuredStreaming can't recover from checkpoint if location is provided in SessionConf

## What changes were proposed in this pull request?

Checkpoint Location can be defined for a StructuredStreaming on a per-query basis by the `DataStreamWriter` options, but it can also be provided through SparkSession configurations. It should be able to recover in both cases when the OutputMode is Complete for MemorySinks.

## How was this patch tested?

Unit tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #16342 from brkyvz/chk-rec.

(cherry picked from commit caed8932)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

3857d5ba

[SPARK-18281] [SQL] [PYSPARK] Remove timeout for reading data through socket for local iterator · cd297c39

Liang-Chi Hsieh authored 8 years ago

## What changes were proposed in this pull request?

There is a timeout failure when using `rdd.toLocalIterator()` or `df.toLocalIterator()` for a PySpark RDD and DataFrame:

    df = spark.createDataFrame([[1],[2],[3]])
    it = df.toLocalIterator()
    row = next(it)

    df2 = df.repartition(1000)  # create many empty partitions which increase materialization time so causing timeout
    it2 = df2.toLocalIterator()
    row = next(it2)

The cause of this issue is, we open a socket to serve the data from JVM side. We set timeout for connection and reading through the socket in Python side. In Python we use a generator to read the data, so we only begin to connect the socket once we start to ask data from it. If we don't consume it immediately, there is connection timeout.

In the other side, the materialization time for RDD partitions is unpredictable. So we can't set a timeout for reading data through the socket. Otherwise, it is very possibly to fail.

## How was this patch tested?

Added tests into PySpark.

Please review http://spark.apache.org/contributing.html

 before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16263 from viirya/fix-pyspark-localiterator.

(cherry picked from commit 95c95b71)
Signed-off-by: Davies Liu <davies.liu@gmail.com>

cd297c39

[SPARK-18761][CORE] Introduce "task reaper" to oversee task killing in executors · 2971ae56

Josh Rosen authored 8 years ago

## What changes were proposed in this pull request?

Spark's current task cancellation / task killing mechanism is "best effort" because some tasks may not be interruptible or may not respond to their "killed" flags being set. If a significant fraction of a cluster's task slots are occupied by tasks that have been marked as killed but remain running then this can lead to a situation where new jobs and tasks are starved of resources that are being used by these zombie tasks.

This patch aims to address this problem by adding a "task reaper" mechanism to executors. At a high-level, task killing now launches a new thread which attempts to kill the task and then watches the task and periodically checks whether it has been killed. The TaskReaper will periodically re-attempt to call `TaskRunner.kill()` and will log warnings if the task keeps running. I modified TaskRunner to rename its thread at the start of the task, allowing TaskReaper to take a thread dump and filter it in order to log stacktraces from the exact task thread that we are waiting to finish. If the task has not stopped after a configurable timeout then the TaskReaper will throw an exception to trigger executor JVM death, thereby forcibly freeing any resources consumed by the zombie tasks.

This feature is flagged off by default and is controlled by four new configurations under the `spark.task.reaper.*` namespace. See the updated `configuration.md` doc for details.

## How was this patch tested?

Tested via a new test case in `JobCancellationSuite`, plus manual testing.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #16189 from JoshRosen/cancellation.

2971ae56

Dec 19, 2016

[SPARK-18928] Check TaskContext.isInterrupted() in FileScanRDD, JDBCRDD & UnsafeSorter · f07e989c

Josh Rosen authored 8 years ago


## What changes were proposed in this pull request?

In order to respond to task cancellation, Spark tasks must periodically check `TaskContext.isInterrupted()`, but this check is missing on a few critical read paths used in Spark SQL, including `FileScanRDD`, `JDBCRDD`, and UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to continue running and become zombies (as also described in #16189).

This patch aims to fix this problem by adding `TaskContext.isInterrupted()` checks to these paths. Note that I could have used `InterruptibleIterator` to simply wrap a bunch of iterators but in some cases this would have an adverse performance penalty or might not be effective due to certain special uses of Iterators in Spark SQL. Instead, I inlined `InterruptibleIterator`-style logic into existing iterator subclasses.

## How was this patch tested?

Tested manually in `spark-shell` with two different reproductions of non-cancellable tasks, one involving scans of huge files and another involving sort-merge joins that spill to disk. Both causes of zombie tasks are fixed by the changes added here.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #16340 from JoshRosen/sql-task-interruption.

(cherry picked from commit 5857b9ac)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

f07e989c

[SPARK-18921][SQL] check database existence with Hive.databaseExists instead of getDatabase · c1a26b45

Wenchen Fan authored 8 years ago


## What changes were proposed in this pull request?

It's weird that we use `Hive.getDatabase` to check the existence of a database, while Hive has a `databaseExists` interface.

What's worse, `Hive.getDatabase` will produce an error message if the database doesn't exist, which is annoying when we only want to check the database existence.

This PR fixes this and use `Hive.databaseExists` to check database existence.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16332 from cloud-fan/minor.

(cherry picked from commit 7a75ee1c)
Signed-off-by: Yin Huai <yhuai@databricks.com>

c1a26b45

[SPARK-18700][SQL] Add StripedLock for each table's relation in cache · fc1b2566

xuanyuanking authored 8 years ago

## What changes were proposed in this pull request?

As the scenario describe in [SPARK-18700](https://issues.apache.org/jira/browse/SPARK-18700

), when cachedDataSourceTables invalided, the coming few queries will fetch all FileStatus in listLeafFiles function. In the condition of table has many partitions, these jobs will occupy much memory of driver finally may cause driver OOM.

In this patch, add StripedLock for each table's relation in cache not for the whole cachedDataSourceTables, each table's load cache operation protected by it.

## How was this patch tested?

Add a multi-thread access table test in `PartitionedTablePerfStatsSuite` and check it only loading once using metrics in `HiveCatalogMetrics`

Author: xuanyuanking <xyliyuanjian@gmail.com>

Closes #16135 from xuanyuanking/SPARK-18700.

(cherry picked from commit 24482858)
Signed-off-by: Herman van Hovell <hvanhovell@databricks.com>

fc1b2566