  1. Jun 29, 2017
    • [SPARK-21258][SQL] Fix WindowExec complex object aggregation with spilling · d995dac1
      Herman van Hovell authored
      ## What changes were proposed in this pull request?
      `WindowExec` currently improperly stores complex objects (UnsafeRow, UnsafeArrayData, UnsafeMapData, UTF8String) during aggregation by keeping a reference in the buffer used by `GeneratedMutableProjections` to the actual input data. Things go wrong when the input object (or its backing bytes) is reused elsewhere. This can happen in window functions when they start spilling to disk. When reading back the spill files, the `UnsafeSorterSpillReader` reuses the buffer that the `UnsafeRow` points to, leading to weird corruption scenarios. Note that this only happens for aggregate functions that preserve (parts of) their input, for example `FIRST`, `LAST`, `MIN` & `MAX`.
      
      This was not seen before because the spilling logic rarely performed actual spills and instead used an in-memory page. This page was not cleaned up during window processing and ensured that unsafe objects pointed to their own dedicated memory location. This was changed by https://github.com/apache/spark/pull/16909; after that PR, Spark spills more eagerly.
      
      This PR provides a surgical fix because we are close to releasing Spark 2.2. This change just makes sure that there cannot be any object reuse, at the expense of a little bit of performance. We will follow up with a more subtle solution at a later point.
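      To illustrate the idea behind the fix, a minimal hedged sketch (the helper name is hypothetical, not the actual Spark patch): copying an `UnsafeRow` before it is kept in an aggregation buffer detaches it from the spill reader's reusable byte buffer.

      ```scala
      import org.apache.spark.sql.catalyst.expressions.UnsafeRow

      // Hypothetical helper, for illustration only: defensively copy a row whose
      // backing bytes may be reused by the spill reader after this call returns.
      def bufferSafely(incoming: UnsafeRow): UnsafeRow = {
        // UnsafeRow.copy() materializes the row into its own byte array, so later
        // reuse of the reader's buffer cannot corrupt the buffered value.
        incoming.copy()
      }
      ```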
      
      ## How was this patch tested?
      Added a regression test to `DataFrameWindowFunctionsSuite`.
      
      Author: Herman van Hovell <hvanhovell@databricks.com>
      
      Closes #18470 from hvanhovell/SPARK-21258.
      
      (cherry picked from commit e2f32ee4)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      d995dac1
    • [SPARK-21176][WEB UI] Limit number of selector threads for admin ui proxy servlets to 8 · 083adb07
      IngoSchuster authored
      ## What changes were proposed in this pull request?
      Please see also https://issues.apache.org/jira/browse/SPARK-21176
      
      This change limits the number of selector threads that Jetty creates to at most 8 per proxy servlet (Jetty's default is the number of processors / 2).
      The `newHttpClient` method of Jetty's `ProxyServlet` class is overridden to avoid the Jetty defaults (which are designed for high-performance HTTP servers).
      Once https://github.com/eclipse/jetty.project/issues/1643 is available, the code could be cleaned up to avoid the method override.
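      For illustration, a hedged sketch of such an override (the servlet class name here is made up; this is not necessarily the exact Spark code):

      ```scala
      import org.eclipse.jetty.client.HttpClient
      import org.eclipse.jetty.client.http.HttpClientTransportOverHTTP
      import org.eclipse.jetty.proxy.ProxyServlet

      // Illustrative proxy servlet that caps Jetty's selector threads at 8
      // instead of the default of (number of processors / 2).
      class CappedSelectorProxyServlet extends ProxyServlet {
        override def newHttpClient(): HttpClient = {
          new HttpClient(new HttpClientTransportOverHTTP(8), null)
        }
      }
      ```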
      
      I really need this on v2.1.1 - what is the best way for a backport (automatic merge works fine)? Shall I create another PR?
      
      ## How was this patch tested?
      The patch was tested manually on a Spark cluster with a head node that has 88 processors using JMX to verify that the number of selector threads is now limited to 8 per proxy.
      
      gurvindersingh zsxwing can you please review the change?
      
      Author: IngoSchuster <ingo.schuster@de.ibm.com>
      Author: Ingo Schuster <ingo.schuster@de.ibm.com>
      
      Closes #18437 from IngoSchuster/master.
      
      (cherry picked from commit 88a536ba)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      083adb07
  2. Jun 25, 2017
  3. Jun 24, 2017
    • [SPARK-21203][SQL] Fix wrong results of insertion of Array of Struct · 0d6b701e
      gatorsmile authored
      
      ### What changes were proposed in this pull request?
      ```SQL
      CREATE TABLE `tab1`
      (`custom_fields` ARRAY<STRUCT<`id`: BIGINT, `value`: STRING>>)
      USING parquet
      
      INSERT INTO `tab1`
      SELECT ARRAY(named_struct('id', 1, 'value', 'a'), named_struct('id', 2, 'value', 'b'))
      
      SELECT custom_fields.id, custom_fields.value FROM tab1
      ```
      
      The above query always returns the last struct of the array, because the rule `SimplifyCasts` incorrectly rewrites the query. The underlying cause is that we always use the same `GenericInternalRow` object when doing the cast.
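      As a hedged, plain-Scala analogy for the reuse bug (this is not the Catalyst code itself): writing every array element through one shared mutable buffer leaves all elements aliasing the same object, which ends up holding only the last values written.

      ```scala
      // Broken: all three elements reference the same mutable array, which finally
      // holds (3, "v3") -- analogous to reusing one GenericInternalRow per cast.
      val shared = new Array[Any](2)
      val broken = (1 to 3).map { i => shared(0) = i; shared(1) = s"v$i"; shared }

      // Fixed: allocate a fresh object for each element.
      val fixed = (1 to 3).map { i => Array[Any](i, s"v$i") }
      ```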
      
      ### How was this patch tested?
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18412 from gatorsmile/castStruct.
      
      (cherry picked from commit 2e1586f6)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      0d6b701e
    • [SPARK-21159][CORE] Don't try to connect to launcher in standalone cluster mode. · 6750db3f
      Marcelo Vanzin authored
      
      Monitoring for standalone cluster mode is not implemented (see SPARK-11033), but
      the same scheduler implementation is used, and if it tries to connect to the
      launcher it will fail. So fix the scheduler so it only tries that in client mode;
      cluster mode applications will be correctly launched and will work, but monitoring
      through the launcher handle will not be available.
      
      Tested by running a cluster mode app with "SparkLauncher.startApplication".
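      For reference, a hedged usage sketch of that scenario (paths and class names are hypothetical); after this change the application launches correctly, but the returned handle receives no state updates because monitoring is client-mode only:

      ```scala
      import org.apache.spark.launcher.SparkLauncher

      val handle = new SparkLauncher()
        .setAppResource("/path/to/app.jar")   // hypothetical artifact
        .setMainClass("com.example.Main")     // hypothetical main class
        .setMaster("spark://master:7077")
        .setDeployMode("cluster")
        .startApplication()
      ```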
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18397 from vanzin/SPARK-21159.
      
      (cherry picked from commit bfd73a7c)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      6750db3f
    • [SPARK-20555][SQL] Fix mapping of Oracle DECIMAL types to Spark types in read path · f12883e3
      Gabor Feher authored
      This PR is to revert some code changes in the read path of https://github.com/apache/spark/pull/14377. The original fix is https://github.com/apache/spark/pull/17830
      
      When merging this PR, please give the credit to gaborfeher
      
      Added a test case to OracleIntegrationSuite.scala
      
      Author: Gabor Feher <gabor.feher@lynxanalytics.com>
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18408 from gatorsmile/OracleType.
      f12883e3
  4. Jun 23, 2017
    • [MINOR][DOCS] Docs in DataFrameNaFunctions.scala use wrong method · bcaf06c4
      Ong Ming Yang authored
      
      ## What changes were proposed in this pull request?
      
      * Following the first few examples in this file, the remaining methods should also be methods of `df.na`, not `df` (see the sketch after this list).
      * Filled in some missing parentheses
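      A minimal hedged sketch of the documented point (the data here is made up):

      ```scala
      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().master("local[*]").appName("na-demo").getOrCreate()
      import spark.implicits._

      val df = Seq((Some(1.0), "a"), (None: Option[Double], "b")).toDF("x", "y")
      val cleaned = df.na.fill(0.0)   // correct: fill() is a method of df.na
      // df.fill(0.0)                 // does not compile: DataFrame has no fill()
      ```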
      
      ## How was this patch tested?
      
      N/A
      
      Author: Ong Ming Yang <me@ongmingyang.com>
      
      Closes #18398 from ongmingyang/master.
      
      (cherry picked from commit 4cc62951)
      Signed-off-by: Xiao Li <gatorsmile@gmail.com>
      bcaf06c4
    • [SPARK-21181] Release byteBuffers to suppress netty error messages · f8fd3b48
      Dhruve Ashar authored
      ## What changes were proposed in this pull request?
      We explicitly call release on the ByteBufs used to encode the string to Base64, to suppress the memory leak error message reported by Netty. This makes it less confusing for the user.
      
      ### Changes proposed in this fix
      By explicitly invoking release on the ByteBufs, we decrement the internal reference counts of the wrapped ByteBufs. When the GC kicks in, these are reclaimed as before; Netty just no longer reports memory leak error messages because the internal reference counts are now 0.
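      A hedged sketch of the general pattern (illustrative only, not the exact Spark code):

      ```scala
      import io.netty.buffer.{ByteBuf, Unpooled}

      val buf: ByteBuf = Unpooled.wrappedBuffer("payload".getBytes("UTF-8"))
      try {
        // ... encode / use the bytes ...
      } finally {
        buf.release()   // refCnt drops 1 -> 0, so Netty's leak detector stays quiet
      }
      ```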
      
      ## How was this patch tested?
      Ran a few spark-applications and examined the logs. The error message no longer appears.
      
      Original PR was opened against branch-2.1 => https://github.com/apache/spark/pull/18392
      
      
      
      Author: Dhruve Ashar <dhruveashar@gmail.com>
      
      Closes #18407 from dhruve/master.
      
      (cherry picked from commit 1ebe7ffe)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      f8fd3b48
  5. Jun 22, 2017
  6. Jun 20, 2017
    • [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are... · 8923bac1
      assafmendelson authored
      [SPARK-21123][DOCS][STRUCTURED STREAMING] Options for file stream source are in a wrong table - version to fix 2.1
      
      ## What changes were proposed in this pull request?
      
      The description for several options of File Source for structured streaming appeared in the File Sink description instead.
      
      This commit continues PR #18342 and targets the documentation fixes for Spark version 2.1.
      
      ## How was this patch tested?
      
      Built the documentation with `SKIP_API=1 jekyll build` and visually inspected the Structured Streaming programming guide.
      
      zsxwing This is the PR to fix version 2.1 as discussed in PR #18342
      
      Author: assafmendelson <assaf.mendelson@gmail.com>
      
      Closes #18363 from assafmendelson/spark-21123-for-spark2.1.
      8923bac1
  7. Jun 19, 2017
  8. Jun 15, 2017
  9. Jun 14, 2017
    • [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values... · a890466b
      gatorsmile authored
      [SPARK-20211][SQL][BACKPORT-2.2] Fix the Precision and Scale of Decimal Values when the Input is BigDecimal between -1.0 and 1.0
      
      ### What changes were proposed in this pull request?
      
      This PR is to backport https://github.com/apache/spark/pull/18244 to 2.2
      
      ---
      
      The precision and scale of decimal values are wrong when the input is BigDecimal between -1.0 and 1.0.
      
      The BigDecimal's precision is the digit count starting from the leftmost nonzero digit, based on [Java's BigDecimal definition](https://docs.oracle.com/javase/7/docs/api/java/math/BigDecimal.html). However, our Decimal follows the database decimal standard, in which precision is the total number of digits, including those both to the left and to the right of the decimal point. Thus, this PR fixes the issue by doing the conversion.
      
      Before this PR, the following queries failed:
      ```SQL
      select 1 > 0.0001
      select floor(0.0001)
      select ceil(0.0001)
      ```
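      To see the precision mismatch concretely, a hedged Scala illustration:

      ```scala
      import java.math.BigDecimal

      val bd = new BigDecimal("0.0001")
      bd.precision()  // 1 -- Java counts the digits of the unscaled value
      bd.scale()      // 4 -- four digits to the right of the decimal point
      // A SQL-style decimal requires precision >= scale, so 0.0001 should map to
      // Decimal(4, 4) rather than Decimal(1, 4); the fix adjusts this during conversion.
      ```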
      
      ### How was this patch tested?
      Added test cases.
      
      Author: gatorsmile <gatorsmile@gmail.com>
      
      Closes #18297 from gatorsmile/backport18244.
      
      (cherry picked from commit 62651195)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      a890466b
  10. Jun 13, 2017
  11. Jun 08, 2017
  12. Jun 03, 2017
  13. Jun 01, 2017
    • [SPARK-20922][CORE][HOTFIX] Don't use Java 8 lambdas in older branches. · 0b25a7d9
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18178 from vanzin/SPARK-20922-hotfix.
      0b25a7d9
    • [SPARK-20922][CORE] Add whitelist of classes that can be deserialized by the launcher. · 772a9b96
      Marcelo Vanzin authored
      
      Blindly deserializing classes using Java serialization opens the code up to
      issues in other libraries, since just deserializing data from a stream may
      end up executing code (think `readObject()`).
      
      Since the launcher protocol is pretty self-contained, there's just a handful
      of classes it legitimately needs to deserialize, and they're in just two
      packages, so add a filter that throws errors if classes from any other
      package show up in the stream.
      
      This also maintains backwards compatibility (the updated launcher code can
      still communicate with the backend code in older Spark releases).
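      A hedged sketch of the general technique (class and package names here are illustrative, not the actual launcher code):

      ```scala
      import java.io.{InputStream, InvalidClassException, ObjectInputStream, ObjectStreamClass}

      // Reject any class outside an allowed set of package prefixes while deserializing.
      // (Simplified: array and primitive descriptors are not handled here.)
      class FilteredObjectInputStream(in: InputStream) extends ObjectInputStream(in) {
        private val allowedPrefixes = Seq("java.lang.", "org.apache.spark.launcher.")

        override protected def resolveClass(desc: ObjectStreamClass): Class[_] = {
          if (!allowedPrefixes.exists(p => desc.getName.startsWith(p))) {
            throw new InvalidClassException(desc.getName, "disallowed class in launcher stream")
          }
          super.resolveClass(desc)
        }
      }
      ```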
      
      Tested with new and existing unit tests.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #18166 from vanzin/SPARK-20922.
      
      (cherry picked from commit 8efc6e98)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      772a9b96
  14. May 31, 2017
    • [SPARK-20940][CORE] Replace IllegalAccessError with IllegalStateException · dade85f7
      Shixiong Zhu authored
      ## What changes were proposed in this pull request?
      
      `IllegalAccessError` is a fatal error (a subclass of LinkageError) and its meaning is `Thrown if an application attempts to access or modify a field, or to call a method that it does not have access to`. Throwing a fatal error for AccumulatorV2 is not necessary and is pretty bad because it usually will just kill executors or SparkContext ([SPARK-20666](https://issues.apache.org/jira/browse/SPARK-20666) is an example of killing SparkContext due to `IllegalAccessError`). I think the correct type of exception in AccumulatorV2 should be `IllegalStateException`.
      
      ## How was this patch tested?
      
      Jenkins
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #18168 from zsxwing/SPARK-20940.
      
      (cherry picked from commit 24db3582)
      Signed-off-by: Shixiong Zhu <shixiong@databricks.com>
      dade85f7
  15. May 30, 2017
    • [SPARK-20275][UI] Do not display "Completed" column for in-progress applications · 46400867
      jerryshao authored
      ## What changes were proposed in this pull request?
      
      The current HistoryServer displays the completed date of an in-progress application as `1969-12-31 23:59:59`, which is not meaningful. Instead of showing this incorrect completed date, this change makes the column invisible for in-progress applications.
      
      The column is only made invisible rather than the field deleted because this data is fetched through the REST API, whose format is shown below (in it, `endTime` corresponds to `endTimeEpoch`). So, instead of changing the REST API and breaking backward compatibility, the simple solution chosen here is to make this column invisible.
      
      ```
      [ {
        "id" : "local-1491805439678",
        "name" : "Spark shell",
        "attempts" : [ {
          "startTime" : "2017-04-10T06:23:57.574GMT",
          "endTime" : "1969-12-31T23:59:59.999GMT",
          "lastUpdated" : "2017-04-10T06:23:57.574GMT",
          "duration" : 0,
          "sparkUser" : "",
          "completed" : false,
          "startTimeEpoch" : 1491805437574,
          "endTimeEpoch" : -1,
          "lastUpdatedEpoch" : 1491805437574
        } ]
      } ]
      ```
      
      Here is the UI before the change:
      
      <img width="1317" alt="screen shot 2017-04-10 at 3 45 57 pm" src="https://cloud.githubusercontent.com/assets/850797/24851938/17d46cc0-1e08-11e7-84c7-90120e171b41.png">
      
      And after:
      
      <img width="1281" alt="screen shot 2017-04-10 at 4 02 35 pm" src="https://cloud.githubusercontent.com/assets/850797/24851945/1fe9da58-1e08-11e7-8d0d-9262324f9074.png">
      
      ## How was this patch tested?
      
      Manual verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17588 from jerryshao/SPARK-20275.
      
      (cherry picked from commit 52ed9b28)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      46400867
  16. May 27, 2017
  17. May 26, 2017
  18. May 25, 2017
  19. May 24, 2017
    • [SPARK-20848][SQL][FOLLOW-UP] Shutdown the pool after reading parquet files · 7015f6f0
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      This is a follow-up to #18073, taking a safer approach to shutting down the pool to prevent possible issues. It also uses `ThreadUtils.newForkJoinPool` instead to set a better thread name.
      
      ## How was this patch tested?
      
      Manual test.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18100 from viirya/SPARK-20848-followup.
      
      (cherry picked from commit 6b68d61c)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      7015f6f0
    • [SPARK-18406][CORE][BACKPORT-2.1] Race between end-of-task and completion... · c3302e81
      Xingbo Jiang authored
      [SPARK-18406][CORE][BACKPORT-2.1] Race between end-of-task and completion iterator read lock release
      
      This is a backport PR of  #18076 to 2.1.
      
      ## What changes were proposed in this pull request?
      
      When a TaskContext is not propagated properly to all child threads of the task, as in the cases reported in this issue, we fail to get the TID from the TaskContext, which leaves us unable to release the lock and causes assertion failures. To resolve this, we have to explicitly pass the TID value to the `unlock` method.
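      A hedged sketch of the failure mode and workaround (method and parameter names are simplified, not the exact BlockInfoManager API):

      ```scala
      import org.apache.spark.TaskContext

      // The TaskContext is thread-local; a child thread created without propagation
      // sees null, so the lock holder's TID must be passed in explicitly.
      def releaseLock(blockId: String, taskAttemptId: Option[Long]): Unit = {
        val tid = taskAttemptId
          .orElse(Option(TaskContext.get()).map(_.taskAttemptId()))
          .getOrElse(throw new IllegalStateException(s"no task id available to unlock $blockId"))
        // ... release the read lock held by task `tid` on block `blockId` ...
      }
      ```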
      
      ## How was this patch tested?
      
      Add new failing regression test case in `RDDSuite`.
      
      Author: Xingbo Jiang <xingbo.jiang@databricks.com>
      
      Closes #18099 from jiangxb1987/completion-iterator-2.1.
      c3302e81
    • [SPARK-20848][SQL] Shutdown the pool after reading parquet files · 2f68631f
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
      From JIRA: On each call to spark.read.parquet, a new ForkJoinPool is created. One of the threads in the pool is kept in the WAITING state, and never stopped, which leads to unbounded growth in number of threads.
      
      We should shut down the pool after reading parquet files.
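      A hedged sketch of the pattern (not the exact Spark code, which the follow-up moved to `ThreadUtils.newForkJoinPool`):

      ```scala
      import java.util.concurrent.ForkJoinPool

      val pool = new ForkJoinPool(8)
      try {
        // ... submit the parallel file-footer reads to this pool ...
      } finally {
        pool.shutdown()   // without this, an idle worker thread survives every call
      }
      ```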
      
      ## How was this patch tested?
      
      Added a test to ParquetFileFormatSuite.
      
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #18073 from viirya/SPARK-20848.
      
      (cherry picked from commit f72ad303)
      Signed-off-by: Wenchen Fan <wenchen@databricks.com>
      2f68631f
    • [SPARK-20862][MLLIB][PYTHON] Avoid passing float to ndarray.reshape in LogisticRegressionModel · 13adc0fc
      Bago Amirbekian authored
      
      ## What changes were proposed in this pull request?
      
      Fixed a TypeError with Python 3 and numpy 1.12.1. Numpy's `reshape` no longer takes floats as arguments as of 1.12. Also, Python 3 uses float division for `/`, so we should use `//` to ensure that `_dataWithBiasSize` doesn't get set to a float.
      
      ## How was this patch tested?
      
      Existing tests run using python3 and numpy 1.12.
      
      Author: Bago Amirbekian <bago@databricks.com>
      
      Closes #18081 from MrBago/BF-py3floatbug.
      
      (cherry picked from commit bc66a77b)
      Signed-off-by: Yanbo Liang <ybliang8@gmail.com>
      13adc0fc
  20. May 22, 2017
    • [SPARK-20763][SQL][BACKPORT-2.1] The function of `month` and `day` return the... · f4538c95
      liuxian authored
      [SPARK-20763][SQL][BACKPORT-2.1] The function of `month` and `day` return the value which is not we expected.
      
      ## What changes were proposed in this pull request?
      
      This PR is to backport #17997 to Spark 2.1.
      
      When the date is before 1582-10-04, the `month` and `day` functions return values that are not what we expect.
      
      ## How was this patch tested?
      
      unit tests
      
      Author: liuxian <liu.xian3@zte.com.cn>
      
      Closes #18054 from 10110346/wip-lx-0522.
      f4538c95
    • [SPARK-20756][YARN] yarn-shuffle jar references unshaded guava · f5ef0762
      Mark Grover authored
      
      and contains scala classes
      
      ## What changes were proposed in this pull request?
      This change ensures that all references to guava from within the yarn shuffle jar point to the shaded guava classes already provided in the jar.
      
      Also, it explicitly excludes scala classes from being added to the jar.
      
      ## How was this patch tested?
      Ran unit tests on the module and they passed.
      javap now returns the expected result - a reference to the shaded guava under `org/spark_project` (previously this referred to `com.google...`):
      ```
      javap -cp common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar -c org/apache/spark/network/yarn/YarnShuffleService | grep Lists
            57: invokestatic  #138                // Method org/spark_project/guava/collect/Lists.newArrayList:()Ljava/util/ArrayList;
      ```
      
      Guava is still shaded in the jar:
      ```
      jar -tf common/network-yarn/target/scala-2.11/spark-2.3.0-SNAPSHOT-yarn-shuffle.jar | grep guava | head
      META-INF/maven/com.google.guava/
      META-INF/maven/com.google.guava/guava/
      META-INF/maven/com.google.guava/guava/pom.properties
      META-INF/maven/com.google.guava/guava/pom.xml
      org/spark_project/guava/
      org/spark_project/guava/annotations/
      org/spark_project/guava/annotations/Beta.class
      org/spark_project/guava/annotations/GwtCompatible.class
      org/spark_project/guava/annotations/GwtIncompatible.class
      org/spark_project/guava/annotations/VisibleForTesting.class
      ```
      (not sure if the above META-INF/* is a problem or not)
      
      I took this jar, deployed it on a yarn cluster with shuffle service enabled, and made sure the YARN node managers came up. An application with a shuffle was run and it succeeded.
      
      Author: Mark Grover <mark@apache.org>
      
      Closes #17990 from markgrover/spark-20756.
      
      (cherry picked from commit 36309110)
      Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
      f5ef0762
    • [SPARK-20687][MLLIB] mllib.Matrices.fromBreeze may crash when converting from Breeze sparse matrix · c3a986b1
      Ignacio Bermudez authored
      ## What changes were proposed in this pull request?
      
      When two Breeze SparseMatrices are operated on, the result matrix may contain extra provisional 0 values in the rowIndices and data arrays. This causes an incoherence with the colPtrs data, but Breeze gets away with it by keeping a counter of the valid data.
      
      In Spark, when these matrices are converted to SparseMatrices, Spark relies solely on rowIndices, data, and colPtrs, which might be incorrect because of Breeze's internal hacks. Therefore, we need to slice both rowIndices and data using the counter of active data.
      
      This conversion is called at least by BlockMatrix when performing distributed block operations, causing exceptions on valid operations.
      
      See http://stackoverflow.com/questions/33528555/error-thrown-when-using-blockmatrix-add
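      A hedged sketch of the slicing idea (simplified; the real change is in `Matrices.fromBreeze`):

      ```scala
      import breeze.linalg.CSCMatrix

      // Only the first `activeSize` entries of rowIndices/data are guaranteed valid;
      // trailing provisional zeros left by Breeze operations must be dropped.
      def validArrays(m: CSCMatrix[Double]): (Array[Int], Array[Double]) = {
        val n = m.activeSize
        (m.rowIndices.slice(0, n), m.data.slice(0, n))
      }
      ```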
      
      ## How was this patch tested?
      
      Added a test to MatricesSuite that verifies that the conversions are valid and that code doesn't crash. Originally the same code would crash on Spark.
      
      Bugfix for https://issues.apache.org/jira/browse/SPARK-20687
      
      
      
      Author: Ignacio Bermudez <ignaciobermudez@gmail.com>
      Author: Ignacio Bermudez Corrales <icorrales@splunk.com>
      
      Closes #17940 from ghoto/bug-fix/SPARK-20687.
      
      (cherry picked from commit 06dda1d5)
      Signed-off-by: Sean Owen <sowen@cloudera.com>
      c3a986b1
  21. May 19, 2017