- Jun 21, 2017
-
-
Li Yichao authored
## What changes were proposed in this pull request?

Currently the shuffle service registration timeout and retry count are hardcoded. This works well for small workloads, but under heavy workload, when the shuffle service is busy transferring large amounts of data, we see significant delays in responding to registration requests. As a result, executors often fail to register with the shuffle service, eventually failing the job. We need to make these two parameters configurable.

## How was this patch tested?

* Updated `BlockManagerSuite` to test that the registration timeout and max attempts configuration actually works.

cc sitalkedia

Author: Li Yichao <lyc@zhihu.com>

Closes #18092 from liyichao/SPARK-20640.
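As a rough illustration, the sketch below shows how a user might raise the two values described above. The key names (`spark.shuffle.registration.timeout` and `spark.shuffle.registration.maxAttempts`) are my understanding of what SPARK-20640 introduced and should be verified against your Spark version.

```scala
import org.apache.spark.SparkConf

// Sketch: raise the external shuffle service registration timeout and retry
// count for a heavily loaded shuffle service. Key names follow SPARK-20640
// but are assumptions here, not taken from this commit message.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")
  .set("spark.shuffle.registration.timeout", "30000")  // milliseconds
  .set("spark.shuffle.registration.maxAttempts", "5")
```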
-
- Jun 11, 2017
-
-
Michael Gummelt authored
## What changes were proposed in this pull request?

Add Mesos labels support to the Spark Dispatcher.

## How was this patch tested?

Unit tests.

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #18220 from mgummelt/SPARK-21000-dispatcher-labels.
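A hedged sketch of how driver labels might be supplied to the dispatcher. The property name `spark.mesos.driver.labels` and the `key:value,key:value` format are assumptions inferred from the other Mesos label options in this history, not confirmed by this commit message.

```scala
import org.apache.spark.SparkConf

// Sketch (assumed key name): attach Mesos labels to drivers submitted through
// the dispatcher, using the key:value,key:value convention used elsewhere here.
val conf = new SparkConf()
  .set("spark.mesos.driver.labels", "team:data-infra,env:prod")
```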
-
- May 22, 2017
-
-
Marcelo Vanzin authored
Restore code that was removed as part of SPARK-17979, but instead of using the deprecated env variable name to propagate the class path, use a new one.

Verified by running "./bin/spark-class o.a.s.executor.CoarseGrainedExecutorBackend" manually.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #18037 from vanzin/SPARK-20814.
-
- May 10, 2017
-
-
NICHOLAS T. MARION authored
## What changes were proposed in this pull request?

Add stripXSS and stripXSSMap to Spark Core's UIUtils. These functions are called wherever getParameter is invoked on an HttpServletRequest.

## How was this patch tested?

Unit tests, IBM Security AppScan Standard no longer showing vulnerabilities, manual verification of WebUI pages.

Author: NICHOLAS T. MARION <nmarion@us.ibm.com>

Closes #17686 from n-marion/xss-fix.
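A minimal sketch of what an XSS-stripping helper for request parameters could look like. The escaping rules and helper names below are hypothetical and are not taken from the actual UIUtils change.

```scala
import javax.servlet.http.HttpServletRequest

// Hypothetical sketch: sanitize request parameters before they are echoed
// into UI pages, so injected markup is neutralized rather than rendered.
object XssUtils {
  def stripXSS(value: String): String =
    if (value == null) null
    else value
      .replaceAll("<", "&lt;")
      .replaceAll(">", "&gt;")
      .replaceAll("\"", "&quot;")

  def safeParameter(request: HttpServletRequest, name: String): String =
    stripXSS(request.getParameter(name))
}
```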
-
- May 08, 2017
-
-
jerryshao authored
## What changes were proposed in this pull request?

After SPARK-10997, the client-mode Netty RpcEnv no longer needs to start a server, so the port configurations are not used any more. This proposes to remove the two configurations "spark.executor.port" and "spark.am.port".

## How was this patch tested?

Existing UTs.

Author: jerryshao <sshao@hortonworks.com>

Closes #17866 from jerryshao/SPARK-20605.
-
liuxian authored
Signed-off-by: liuxian <liu.xian3@zte.com.cn>

## What changes were proposed in this pull request?

When the input parameter is null, a runtime exception may occur.

## How was this patch tested?

Existing unit tests.

Author: liuxian <liu.xian3@zte.com.cn>

Closes #17796 from 10110346/wip_lx_0428.
-
- Apr 27, 2017
-
-
Davis Shepherd authored
## What changes were proposed in this pull request?

Add test cases for scenarios where executor.cores is set as a divisor or non-divisor of spark.cores.max. This tests the change in #17786.

## How was this patch tested?

Ran the existing test suite with the new tests.

dbtsai

Author: Davis Shepherd <dshepherd@netflix.com>

Closes #17788 from dgshep/add_mesos_test.
-
Davis Shepherd authored
## What changes were proposed in this pull request?

Set maxCores to be a multiple of the smallest executor that can be launched. This ensures that we correctly detect the condition where no more executors will be launched when spark.cores.max is not a multiple of spark.executor.cores.

## How was this patch tested?

This was manually tested with other sample frameworks, measuring their incoming offers to determine whether starvation would occur.

dbtsai mgummelt

Author: Davis Shepherd <dshepherd@netflix.com>

Closes #17786 from dgshep/fix_mesos_max_cores.
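For illustration only, here is a small sketch of the rounding described above. The helper name `effectiveMaxCores` is made up; the real scheduler code is not reproduced here.

```scala
// Sketch: round spark.cores.max down to a multiple of spark.executor.cores,
// so the scheduler can tell when no further executors will ever fit.
// Example: coresMax = 10, executorCores = 4  =>  effective limit = 8, and once
// two 4-core executors are running we know no more will be launched.
def effectiveMaxCores(coresMax: Int, executorCores: Int): Int =
  (coresMax / executorCores) * executorCores

assert(effectiveMaxCores(10, 4) == 8)
assert(effectiveMaxCores(12, 4) == 12)
```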
-
- Apr 24, 2017
-
-
Josh Rosen authored
This patch bumps the master branch version to `2.3.0-SNAPSHOT`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #17753 from JoshRosen/SPARK-20453.
-
- Apr 23, 2017
-
-
郭小龙 10207633 authored
[SPARK-20385][WEB-UI] 'Submitted Time' field: the date format needs to be formatted in the Running Drivers and Completed Drivers tables in the master web UI.

## What changes were proposed in this pull request?

The 'Submitted Time' field needs a formatted date in the Running Drivers and Completed Drivers tables of the master web UI.

Before the fix, e.g. in Completed Drivers:

| Submission ID | Submitted Time | Worker | State | Cores | Memory | Main Class |
|---|---|---|---|---|---|---|
| driver-20170419145755-0005 | **Wed Apr 19 14:57:55 CST 2017** | worker-20170419145250-zdh120-40412 | FAILED | 1 | 1024.0 MB | cn.zte.HdfsTest |

See the attachment: https://issues.apache.org/jira/secure/attachment/12863977/before_fix.png

After the fix, e.g. in Completed Drivers:

| Submission ID | Submitted Time | Worker | State | Cores | Memory | Main Class |
|---|---|---|---|---|---|---|
| driver-20170419145755-0006 | **2017/04/19 16:01:25** | worker-20170419145250-zdh120-40412 | FAILED | 1 | 1024.0 MB | cn.zte.HdfsTest |

See the attachment: https://issues.apache.org/jira/secure/attachment/12863976/after_fix.png

The 'Submitted Time' field is already formatted correctly in the Running Applications and Completed Applications tables of the master web UI, e.g. in Running Applications:

| Application ID | Name | Cores | Memory per Executor | Submitted Time | User | State | Duration |
|---|---|---|---|---|---|---|---|
| app-20170419160910-0000 (kill) | SparkSQL::10.43.183.120 | 1 | 5.0 GB | **2017/04/19 16:09:10** | root | RUNNING | 53 s |

Formatting the time makes it easier to read and consistent with the applications tables, so it is worth fixing.

## How was this patch tested?

Manual verification of the web UI; see the before/after attachments above.

Author: 郭小龙 10207633 <guo.xiaolong1@zte.com.cn>
Author: guoxiaolong <guo.xiaolong1@zte.com.cn>
Author: guoxiaolongzte <guo.xiaolong1@zte.com.cn>

Closes #17682 from guoxiaolongzte/SPARK-20385.
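As a side note, the target format shown above ("2017/04/19 16:01:25") corresponds to a pattern like the one sketched below. The helper name and pattern are illustrative, not the exact code from the patch.

```scala
import java.text.SimpleDateFormat
import java.util.Date

// Illustrative sketch: format a submission timestamp the same way the
// applications tables do ("yyyy/MM/dd HH:mm:ss"), instead of Date.toString.
def formatSubmittedTime(submitted: Date): String =
  new SimpleDateFormat("yyyy/MM/dd HH:mm:ss").format(submitted)

// e.g. formatSubmittedTime(new Date()) => "2017/04/19 16:01:25"
```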
-
- Apr 16, 2017
-
-
Ji Yan authored
[SPARK-19740][MESOS] Add support in Spark to pass arbitrary parameters into docker when running on Mesos with the docker containerizer

## What changes were proposed in this pull request?

Allow passing arbitrary parameters into docker when launching Spark executors on Mesos with the docker containerizer.

tnachen

## How was this patch tested?

Manually built and tested with passed-in parameters.

Author: Ji Yan <jiyan@Jis-MacBook-Air.local>

Closes #17109 from yanji84/ji/allow_set_docker_user.
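In the released feature this surfaces as a Spark configuration property; the key name `spark.mesos.executor.docker.parameters` and the key=value list format below are assumptions in this sketch, not taken from this commit message.

```scala
import org.apache.spark.SparkConf

// Sketch (assumed key name and format): forward arbitrary `docker run`
// parameters to the Mesos docker containerizer as key=value pairs.
val conf = new SparkConf()
  .set("spark.mesos.executor.docker.image", "my-org/spark-executor:latest")
  .set("spark.mesos.executor.docker.parameters", "user=nobody,memory-swap=2g")
```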
-
- Apr 12, 2017
-
-
hyukjinkwon authored
## What changes were proposed in this pull request?

This PR proposes to run Spark unidoc as part of the tests to verify the Javadoc 8 build, since Javadoc 8 is easily re-broken. There are a couple of caveats:

- It adds a little extra time to the test run. In my case it took about 1.5 minutes more (`Elapsed :[94.8746569157]`). How this was measured is described in "How was this patch tested?".
- > One problem that I noticed was that Unidoc appeared to be processing test sources: if we can find a way to exclude those from being processed in the first place then that might significantly speed things up. (see joshrosen's [comment](https://issues.apache.org/jira/browse/SPARK-18692?focusedCommentId=15947627&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15947627))

To make this automated build pass, this PR also fixes existing Javadoc breaks, including ones introduced by test code as described above. These fixes are similar to instances previously fixed; please refer to https://github.com/apache/spark/pull/15999 and https://github.com/apache/spark/pull/16013. Note that this only fixes **errors**, not **warnings**. Please see my observation in https://github.com/apache/spark/pull/17389#issuecomment-288438704 about spurious errors caused by warnings.

## How was this patch tested?

Manually via `jekyll build` for the documentation build. Also tested by running `./dev/run-tests`. The timing was measured by manually adding `time.time()` as below:

```diff
 profiles_and_goals = build_profiles + sbt_goals
 print("[info] Building Spark unidoc (w/Hive 1.2.1) using SBT with these arguments: ",
       " ".join(profiles_and_goals))
+import time
+st = time.time()
 exec_sbt(profiles_and_goals)
+print("Elapsed :[%s]" % str(time.time() - st))
```

which produces

```
...
========================================================================
Building Unidoc API Documentation
========================================================================
...
[info] Main Java API documentation successful.
...
Elapsed :[94.8746569157]
...
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #17477 from HyukjinKwon/SPARK-18692.
-
- Apr 06, 2017
-
-
Kalvin Chau authored
## What changes were proposed in this pull request?

Add a spark.mesos.task.labels configuration option to attach Mesos key:value labels to the executor. The format is "k1:v1,k2:v2", with colons separating keys from values and commas separating multiple labels.

Discussion of labels with mgummelt at #17404

## How was this patch tested?

Added unit tests to verify labels were added correctly, with incorrect labels being ignored, and added a test for the name of the executor.

Tested with: `./build/sbt -Pmesos mesos/test`

Author: Kalvin Chau <kalvin.chau@viasat.com>

Closes #17413 from kalvinnchau/mesos-labels.
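A small sketch of parsing the `k1:v1,k2:v2` label format described above, with malformed entries ignored. The helper name is made up for illustration and is not the patch's actual code.

```scala
// Illustrative sketch: parse "k1:v1,k2:v2" into label pairs, ignoring
// malformed entries, as the description above outlines.
def parseLabels(spec: String): Seq[(String, String)] =
  spec.split(",").toSeq.flatMap { entry =>
    entry.split(":") match {
      case Array(key, value) if key.nonEmpty => Some(key -> value)
      case _ => None  // incorrect labels are ignored
    }
  }

assert(parseLabels("k1:v1,k2:v2") == Seq("k1" -> "v1", "k2" -> "v2"))
```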
-
- Mar 25, 2017
-
-
Kalvin Chau authored
## What changes were proposed in this pull request?

Add configurable Mesos executor names and labels using `spark.mesos.task.name` and `spark.mesos.task.labels`. Labels are defined as `k1:v1,k2:v2`.

mgummelt

## How was this patch tested?

Added unit tests to verify labels were added correctly, with incorrect labels being ignored, and added a test for the name of the executor.

Tested with: `./build/sbt -Pmesos mesos/test`

Author: Kalvin Chau <kalvin.chau@viasat.com>

Closes #17404 from kalvinnchau/mesos-config.
-
- Mar 24, 2017
-
-
Eric Liang authored
This commit adds a killTaskAttempt method to SparkContext, to allow users to kill tasks so that they can be re-scheduled elsewhere. This also refactors the task kill path to allow specifying a reason for the task kill. The reason is propagated opaquely through events, and will show up in the UI automatically as `(N killed: $reason)` and `TaskKilled: $reason`. Without this change, there is no way to provide the user feedback through the UI.

Currently used reasons are "stage cancelled", "another attempt succeeded", and "killed via SparkContext.killTask". The user can also specify a custom reason through `SparkContext.killTask`.

cc rxin

In the stage overview UI the reasons are summarized:


Within the stage UI you can see individual task kill reasons:


Existing tests; also tried killing some stages in the UI and verified the messages are as expected.

Author: Eric Liang <ekl@databricks.com>
Author: Eric Liang <ekl@google.com>

Closes #17166 from ericl/kill-reason.
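A hedged usage sketch of the API described above. The exact parameter list of `killTaskAttempt` may differ between Spark versions, so treat the shape used here as an assumption; the task ID would come from a SparkListener or the UI.

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Sketch: kill a running task attempt with a custom reason so it gets
// re-scheduled elsewhere; the reason surfaces in the UI as described above.
// The (taskId, interruptThread, reason) parameter shape is assumed here.
val sc = new SparkContext(new SparkConf().setAppName("kill-demo").setMaster("local[2]"))

val someTaskId: Long = 42L  // hypothetical: obtained from a SparkListener or the UI
sc.killTaskAttempt(someTaskId, interruptThread = true,
  reason = "killed for demonstration")
```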
-
- Mar 23, 2017
-
-
Ye Yin authored
## What changes were proposed in this pull request?

Fix a typo in a comment.

## How was this patch tested?

Not needed.

Author: Ye Yin <eyniy@qq.com>

Closes #17396 from hustcat/fix.
-
- Mar 10, 2017
-
-
Yong Tang authored
This fix removes deprecated support for the config `SPARK_YARN_USER_ENV`, as is mentioned in SPARK-17979. This fix also removes deprecated support for the following:

```
SPARK_YARN_USER_ENV
SPARK_JAVA_OPTS
SPARK_CLASSPATH
SPARK_WORKER_INSTANCES
```

Related JIRAs:
[SPARK-14453]: https://issues.apache.org/jira/browse/SPARK-14453
[SPARK-12344]: https://issues.apache.org/jira/browse/SPARK-12344
[SPARK-15781]: https://issues.apache.org/jira/browse/SPARK-15781

Existing tests should pass.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #17212 from yongtang/SPARK-17979.
-
- Mar 07, 2017
-
-
Michael Gummelt authored
## What changes were proposed in this pull request?

Increase the default refuse_seconds timeout, and make it configurable. See the JIRA for details on how this reduces the risk of starvation.

## How was this patch tested?

Unit tests, manual testing, and the Mesos/Spark integration test suite.

cc susanxhuynh skonto jmlvanre

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #17031 from mgummelt/SPARK-19702-suppress-revive.
-
- Feb 28, 2017
-
-
Michael Gummelt authored
[SPARK-19373][MESOS] Base spark.scheduler.minRegisteredResourceRatio on registered cores rather than accepted cores

## What changes were proposed in this pull request?

See JIRA.

## How was this patch tested?

Unit tests, Mesos/Spark integration tests.

cc skonto susanxhuynh

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #17045 from mgummelt/SPARK-19373-registered-resources.
-
- Feb 25, 2017
-
-
Devaraj K authored
[SPARK-15288][MESOS] Mesos dispatcher should handle it gracefully when any thread gets an UncaughtException

## What changes were proposed in this pull request?

Add the default UncaughtExceptionHandler to the MesosClusterDispatcher.

## How was this patch tested?

I verified it manually: when any of the dispatcher threads gets an uncaught exception, the default UncaughtExceptionHandler handles it.

Author: Devaraj K <devaraj@apache.org>

Closes #13072 from devaraj-kavali/SPARK-15288.
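A generic sketch of installing a default uncaught-exception handler of the kind described above. The wiring and handler below are illustrative, not the dispatcher's actual code (which uses Spark's own handler class).

```scala
// Illustrative sketch: install a JVM-wide handler so an uncaught exception in
// any dispatcher thread is logged instead of silently killing the thread.
object DispatcherExceptionHandler extends Thread.UncaughtExceptionHandler {
  override def uncaughtException(t: Thread, e: Throwable): Unit = {
    System.err.println(s"Uncaught exception in thread ${t.getName}: ${e.getMessage}")
    e.printStackTrace()
  }
}

Thread.setDefaultUncaughtExceptionHandler(DispatcherExceptionHandler)
```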
-
- Feb 19, 2017
-
-
jinxing authored
## What changes were proposed in this pull request?

`askSync` has already been added to `RpcEndpointRef` (see SPARK-19347 and https://github.com/apache/spark/pull/16690#issuecomment-276850068), and `askWithRetry` is marked as deprecated. As mentioned in SPARK-18113 (https://github.com/apache/spark/pull/16503#event-927953218):

> askWithRetry is basically an unneeded API, and a leftover from the akka days that doesn't make sense anymore. It's prone to cause deadlocks (exactly because it's blocking), it imposes restrictions on the caller (e.g. idempotency) and other things that people generally don't pay that much attention to when using it.

Since `askWithRetry` is only used inside Spark and not in user logic, it makes sense to replace all of its usages with `askSync`.

## How was this patch tested?

This PR doesn't change code logic; existing unit tests cover it.

Author: jinxing <jinxing@meituan.com>

Closes #16790 from jinxing64/SPARK-19450.
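A small before/after sketch of the substitution. The message type and the call site are made up for illustration, and `RpcEndpointRef` is Spark-internal API, so this shape only makes sense inside Spark's own code base.

```scala
import org.apache.spark.rpc.RpcEndpointRef

// Hypothetical message type used only for illustration.
case class GetExecutorCount()

def queryExecutorCount(ref: RpcEndpointRef): Int = {
  // Before (deprecated): ref.askWithRetry[Int](GetExecutorCount())
  // After: a single blocking ask with no built-in retries.
  ref.askSync[Int](GetExecutorCount())
}
```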
-
- Feb 10, 2017
-
-
Devaraj K authored
[SPARK-10748][MESOS] Log an error instead of crashing the Spark Mesos dispatcher when a job is misconfigured

## What changes were proposed in this pull request?

Handle the Spark exception that gets thrown for an invalid job configuration by marking that job as failed and continuing to launch the other drivers, instead of propagating the exception.

## How was this patch tested?

I verified manually: misconfigured jobs now move to the Finished Drivers section in the UI and the other jobs continue to launch.

Author: Devaraj K <devaraj@apache.org>

Closes #13077 from devaraj-kavali/SPARK-10748.
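The pattern described above, sketched with made-up type and method names; only the shape (catch the exception, mark the driver failed, keep launching the rest) reflects the description.

```scala
// Illustrative sketch: a misconfigured driver is marked FAILED and logged,
// and the loop moves on to launch the remaining drivers.
// `DriverSpec`, `launchDriver` and `markFailed` are hypothetical names.
final case class DriverSpec(id: String)

def launchAll(drivers: Seq[DriverSpec],
              launchDriver: DriverSpec => Unit,
              markFailed: (DriverSpec, Throwable) => Unit): Unit = {
  drivers.foreach { driver =>
    try {
      launchDriver(driver)
    } catch {
      case e: Exception =>
        System.err.println(s"Failed to launch driver ${driver.id}: ${e.getMessage}")
        markFailed(driver, e)  // instead of letting the dispatcher crash
    }
  }
}
```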
-
- Feb 08, 2017
-
-
Dongjoon Hyun authored
[SPARK-19409][BUILD][TEST-MAVEN] Fix ParquetAvroCompatibilitySuite failure due to test dependency on avro

## What changes were proposed in this pull request?

After using Apache Parquet 1.8.2, `ParquetAvroCompatibilitySuite` fails on the **Maven** test. It is because `org.apache.parquet.avro.AvroParquetWriter` in the test code uses the new `avro 1.8.0`-specific class, `LogicalType`. This PR aims to fix the test dependency of the `sql/core` module to use avro 1.8.0.

https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/2530/consoleFull

```
ParquetAvroCompatibilitySuite:
*** RUN ABORTED ***
  java.lang.NoClassDefFoundError: org/apache/avro/LogicalType
  at org.apache.parquet.avro.AvroParquetWriter.writeSupport(AvroParquetWriter.java:144)
```

## How was this patch tested?

Pass the existing test with **Maven**.

```
$ build/mvn -Pyarn -Phadoop-2.7 -Pkinesis-asl -Phive -Phive-thriftserver test
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:07 h
[INFO] Finished at: 2017-02-04T05:41:43+00:00
[INFO] Final Memory: 77M/987M
[INFO] ------------------------------------------------------------------------
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16795 from dongjoon-hyun/SPARK-19409-2.
-
- Jan 24, 2017
-
-
Marcelo Vanzin authored
This change introduces a new auth mechanism to the transport library, to be used when users enable strong encryption. This auth mechanism has better security than the currently used DIGEST-MD5.

The new protocol uses symmetric key encryption to mutually authenticate the endpoints, and is very loosely based on ISO/IEC 9798. The new protocol falls back to SASL when it thinks the remote end is old. Because SASL does not support asking the server for multiple auth protocols, which would have allowed re-use of the existing SASL code by just adding a new SASL provider, the protocol is implemented outside of the SASL API to avoid the boilerplate of adding a new provider. Details of the auth protocol are discussed in the included README.md file.

This change partly undoes the changes added in SPARK-13331; AES encryption is now decoupled from SASL authentication. The encryption code itself, though, has been re-used as part of this change.

## How was this patch tested?

- Unit tests
- Tested Spark 2.2 against Spark 1.6 shuffle service with SASL enabled
- Tested Spark 2.2 against Spark 2.2 shuffle service with SASL fallback disabled

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16521 from vanzin/SPARK-19139.
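For readers wondering how this surfaces to users, the settings below reflect my understanding of the knobs associated with SPARK-19139 (`spark.network.crypto.enabled` plus a SASL fallback switch); take the key names as assumptions and check the security documentation for your release.

```scala
import org.apache.spark.SparkConf

// Sketch (assumed key names): enable the new auth protocol while keeping the
// SASL fallback for talking to older shuffle services.
val conf = new SparkConf()
  .set("spark.authenticate", "true")
  .set("spark.network.crypto.enabled", "true")
  .set("spark.network.crypto.saslFallback", "true")  // disable only when all peers are new
```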
-
- Jan 06, 2017
-
-
Kay Ousterhout authored
In the existing code, there are three layers of serialization involved in sending a task from the scheduler to an executor:

- A Task object is serialized.
- The Task object is copied to a byte buffer that also contains serialized information about any additional JARs, files, and Properties needed for the task to execute. This byte buffer is stored as the member variable serializedTask in the TaskDescription class.
- The TaskDescription is serialized (in addition to the serialized task + JARs, the TaskDescription class contains the task ID and other metadata) and sent in a LaunchTask message.

While it *is* necessary to have two layers of serialization, so that the JAR, file, and Property info can be deserialized prior to deserializing the Task object, the third layer of serialization is unnecessary. This commit eliminates a layer of serialization by moving the JARs, files, and Properties into the TaskDescription class. This commit also serializes the Properties manually (by traversing the map), as is done with the JARs and files, which reduces the final serialized size.

Tested with unit tests.

This is a simpler alternative to the approach proposed in #15505. shivaram and I did some benchmarking of this and #15505 on a cluster of 20 m2.4xlarge EC2 machines (160 cores). We ran ~30 trials of code [1] (a very simple job with 10K tasks per stage) and measured the average time per stage:

- Before this change: 2490 ms
- With this change: 2345 ms (~6% improvement over the baseline)
- With witgo's approach in #15505: 2046 ms (~18% improvement over the baseline)

The reason that #15505 has a more significant improvement is that it also moves the serialization from the TaskSchedulerImpl thread to the CoarseGrainedSchedulerBackend thread. I added that functionality on top of this change and got almost the same improvement [1] as #15505 (average of 2103 ms). I think we should decouple these two changes, both so we have some record of the improvement from each individual change, and because this change is more about simplifying the code base (the improvement is negligible) while the other is about performance improvement. The plan, currently, is to merge this PR and then merge the remaining part of #15505 that moves serialization.

[1] The reason the improvement wasn't quite as good as with #15505 when we ran the benchmarks is almost certainly because, at the point when we ran the benchmarks, I hadn't updated the code to manually serialize the Properties (instead the code was using Java's default serialization for the Properties object, whereas #15505 manually serialized the Properties). This PR has since been updated to manually serialize the Properties, just like the other maps.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #16053 from kayousterhout/SPARK-17931.
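To illustrate the "serialize the Properties manually by traversing the map" point, here is a rough sketch under assumed names; it is not the actual TaskDescription code.

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.util.Properties
import scala.collection.JavaConverters._

// Sketch: write Properties as a count followed by UTF key/value pairs,
// instead of relying on Java's default serialization of the Properties object.
def serializeProperties(props: Properties): Array[Byte] = {
  val bytes = new ByteArrayOutputStream()
  val out = new DataOutputStream(bytes)
  val keys = props.stringPropertyNames().asScala.toSeq
  out.writeInt(keys.size)
  keys.foreach { key =>
    out.writeUTF(key)
    out.writeUTF(props.getProperty(key))
  }
  out.flush()
  bytes.toByteArray
}
```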
-
- Jan 03, 2017
-
-
Devaraj K authored
## What changes were proposed in this pull request?

Do not add killed applications for retry.

## How was this patch tested?

I verified it manually in the Mesos cluster: with these changes, killed applications move to the Finished Drivers section and are not retried.

Author: Devaraj K <devaraj@apache.org>

Closes #13323 from devaraj-kavali/SPARK-15555.
-
- Dec 06, 2016
-
-
Anirudh authored
## What changes were proposed in this pull request?

* Moves yarn and mesos scheduler backends to the resource-managers/ sub-directory (in preparation for https://issues.apache.org/jira/browse/SPARK-18278)
* Corresponding change in top-level pom.xml.

Ref: https://github.com/apache/spark/pull/16061#issuecomment-263649340

## How was this patch tested?

* Manual tests

/cc rxin

Author: Anirudh <ramanathana@google.com>

Closes #16092 from foxish/fix-scheduler-structure-2.
-