Commits · 6da6a27f673f6e45fe619e0411fbaaa14ea34bfb · cs525-sp18-g07 / spark

Feb 21, 2017

[SPARK-19626][YARN] Using the correct config to set credentials update time · 6edf02a8

Kent Yao authored 8 years ago

## What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/14065

, we introduced a configurable credential manager for Spark running on YARN. Also two configs `spark.yarn.credentials.renewalTime` and `spark.yarn.credentials.updateTime` were added, one is for the credential renewer and the other updater. But now we just query `spark.yarn.credentials.renewalTime` by mistake during CREDENTIALS UPDATING, where should be actually `spark.yarn.credentials.updateTime` .

This PR fixes this mistake.

## How was this patch tested?

existing test

cc jerryshao vanzin

Author: Kent Yao <yaooqinn@hotmail.com>

Closes #16955 from yaooqinn/cred_update.

(cherry picked from commit 7363dde6)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

6edf02a8

Feb 14, 2017

[SPARK-19501][YARN] Reduce the number of HDFS RPCs during YARN deployment · f837ced4

Jong Wook Kim authored 8 years ago

## What changes were proposed in this pull request?

As discussed in [JIRA](https://issues.apache.org/jira/browse/SPARK-19501), this patch addresses the problem where too many HDFS RPCs are made when there are many URIs specified in `spark.yarn.jars`, potentially adding hundreds of RTTs to YARN before the application launches. This becomes significant when submitting the application to a non-local YARN cluster (where the RTT may be in order of 100ms, for example). For each URI specified, the current implementation makes at least two HDFS RPCs, for:

- [Calling `getFileStatus()` before uploading each file to the distributed cache in `ClientDistributedCacheManager.addResource()`](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/ClientDistributedCacheManager.scala#L71).
- [Resolving any symbolic links in each of the file URI](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L377-L379), which repeatedly makes HDFS RPCs until the all symlinks are resolved. (see [`FileContext.resolve(Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FileContext.java#L2189-L2195), [`FSLinkResolver.resolve(FileContext, Path)`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/FSLinkResolver.java#L79-L112), and [`AbstractFileSystem.resolvePath()`](https://github.com/apache/hadoop/blob/release-2.7.1/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/AbstractFileSystem.java#L464-L468).)

The first `getFileStatus` RPC can be removed, using `statCache` populated with the file statuses retrieved with [the previous `globStatus` call](https://github.com/apache/spark/blob/v2.1.0/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L531).

The second one can be largely reduced by caching the symlink resolution results in a mutable.HashMap. This patch adds a local variable in `yarn.Client.prepareLocalResources()` and passes it as an additional parameter to `yarn.Client.copyFileToRemote`.  [The symlink resolution code was added in 2013](https://github.com/apache/spark/commit/a35472e1dd2ea1b5a0b1fb6b382f5a98f5aeba5a#diff-b050df3f55b82065803d6e83453b9706R187

) and has not changed since. I am assuming that this is still required, but otherwise we can remove using `symlinkCache` and symlink resolution altogether.

## How was this patch tested?

This patch is based off 8e8afb3a, currently the latest YARN patch on master. All tests except a few in spark-hive passed with `./dev/run-tests` on my machine, using JDK 1.8.0_112 on macOS 10.12.3; also tested myself with this modified version of SPARK 2.2.0-SNAPSHOT which performed a normal deployment and execution on a YARN cluster without errors.

Author: Jong Wook Kim <jongwook@nyu.edu>

Closes #16916 from jongwook/SPARK-19501.

(cherry picked from commit ab9872db)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

f837ced4

Jan 25, 2017

[SPARK-18750][YARN] Follow up: move test to correct directory in 2.1 branch. · 97d3353e
Marcelo Vanzin authored 8 years ago
```
Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16704 from vanzin/SPARK-18750_2.1.
```
97d3353e

[SPARK-18750][YARN] Avoid using "mapValues" when allocating containers. · f391ad2c

Marcelo Vanzin authored 8 years ago


That method is prone to stack overflows when the input map is really
large; instead, use plain "map". Also includes a unit test that was
tested and caused stack overflows without the fix.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #16667 from vanzin/SPARK-18750.

(cherry picked from commit 76db394f)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>

f391ad2c

Jan 02, 2017

[MINOR][DOC] Minor doc change for YARN credential providers · 63857c8d

Liang-Chi Hsieh authored 8 years ago

## What changes were proposed in this pull request?

The configuration `spark.yarn.security.tokens.{service}.enabled` is deprecated. Now we should use `spark.yarn.security.credentials.{service}.enabled`. Some places in the doc is not updated yet.

## How was this patch tested?

N/A. Just doc change.

Please review http://spark.apache.org/contributing.html

 before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16444 from viirya/minor-credential-provider-doc.

(cherry picked from commit 0ac2f1e7)
Signed-off-by: Sean Owen <sowen@cloudera.com>

Unverified

63857c8d

Dec 22, 2016

[SPARK-17807][CORE] split test-tags into test-JAR · 132f2297

Ryan Williams authored 8 years ago


Remove spark-tag's compile-scope dependency (and, indirectly, spark-core's compile-scope transitive-dependency) on scalatest by splitting test-oriented tags into spark-tags' test JAR.

Alternative to #16303.

Author: Ryan Williams <ryan.blake.williams@gmail.com>

Closes #16311 from ryan-williams/tt.

(cherry picked from commit afd9bc1d)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

132f2297

Dec 15, 2016
- Preparing development version 2.1.1-SNAPSHOT · 483624c2
  Patrick Wendell authored 8 years ago
  
  483624c2
- Preparing Spark release v2.1.0-rc5 · cd0a0836
  Patrick Wendell authored 8 years ago
  
  View commits for tag v2.1.0 v2.1.0
  
  cd0a0836
- Preparing development version 2.1.1-SNAPSHOT · 62a6577b
  Patrick Wendell authored 8 years ago
  
  62a6577b
- Preparing Spark release v2.1.0-rc4 · ec317265
  Patrick Wendell authored 8 years ago
  
  ec317265
- Preparing development version 2.1.1-SNAPSHOT · a7364a82
  Patrick Wendell authored 8 years ago
  
  a7364a82
- Preparing Spark release v2.1.0-rc3 · ef2ccf94
  Patrick Wendell authored 8 years ago
  
  ef2ccf94
Dec 13, 2016

[SPARK-18840][YARN] Avoid throw exception when getting token renewal interval... · d5c4a5d0

jerryshao authored 8 years ago

[SPARK-18840][YARN] Avoid throw exception when getting token renewal interval in non HDFS security environment

## What changes were proposed in this pull request?

Fix `java.util.NoSuchElementException` when running Spark in non-hdfs security environment.

In the current code, we assume `HDFS_DELEGATION_KIND` token will be found in Credentials. But in some cloud environments, HDFS is not required, so we should avoid this exception.

## How was this patch tested?

Manually verified in local environment.

Author: jerryshao <sshao@hortonworks.com>

Closes #16265 from jerryshao/SPARK-18840.

(cherry picked from commit 43298d15)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

d5c4a5d0

Dec 08, 2016
- Preparing development version 2.1.1-SNAPSHOT · 48aa6775
  Patrick Wendell authored 8 years ago
  
  48aa6775
- Preparing Spark release v2.1.0-rc2 · 08071749
  Patrick Wendell authored 8 years ago
  
  08071749
Nov 28, 2016

[SPARK-18547][CORE] Propagate I/O encryption key when executors register. · c4cbdc86

Marcelo Vanzin authored 8 years ago


This change modifies the method used to propagate encryption keys used during
shuffle. Instead of relying on YARN's UserGroupInformation credential propagation,
this change explicitly distributes the key using the messages exchanged between
driver and executor during registration. When RPC encryption is enabled, this means
key propagation is also secure.

This allows shuffle encryption to work in non-YARN mode, which means that it's
easier to write unit tests for areas of the code that are affected by the feature.

The key is stored in the SecurityManager; because there are many instances of
that class used in the code, the key is only guaranteed to exist in the instance
managed by the SparkEnv. This path was chosen to avoid storing the key in the
SparkConf, which would risk having the key being written to disk as part of the
configuration (as, for example, is done when starting YARN applications).

Tested by new and existing unit tests (which were moved from the YARN module to
core), and by running apps with shuffle encryption enabled.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #15981 from vanzin/SPARK-18547.

(cherry picked from commit 8b325b17)
Signed-off-by: Shixiong Zhu <shixiong@databricks.com>

c4cbdc86

Preparing development version 2.1.1-SNAPSHOT · 75d73d13
Patrick Wendell authored 8 years ago

75d73d13
Preparing Spark release v2.1.0-rc1 · 80aabc0b
Patrick Wendell authored 8 years ago

80aabc0b

Nov 25, 2016

[SPARK-18575][WEB] Keep same style: adjust the position of driver log links · 57dbc682

uncleGen authored 8 years ago

## What changes were proposed in this pull request?

NOT BUG, just adjust the position of driver log link to keep the same style with other executors log link.

![image](https://cloud.githubusercontent.com/assets/7402327/20590092/f8bddbb8-b25b-11e6-9aaf-3b5b3073df10.png

)

## How was this patch tested?
 no

Author: uncleGen <hustyugm@gmail.com>

Closes #16001 from uncleGen/SPARK-18575.

(cherry picked from commit f58a8aa2)
Signed-off-by: Sean Owen <sowen@cloudera.com>

Unverified

57dbc682

Nov 08, 2016

[SPARK-18357] Fix yarn files/archive broken issue andd unit tests · 876eee2b

Kishor Patil authored 8 years ago


## What changes were proposed in this pull request?

The #15627 broke functionality with yarn --files --archives does not accept any files.
This patch ensures that --files and --archives accept unique files.

## How was this patch tested?

A. I added unit tests.
B. Also, manually tested --files with --archives to throw exception if duplicate files are specified and continue if unique files are specified.

Author: Kishor Patil <kpatil@yahoo-inc.com>

Closes #15810 from kishorvpatil/SPARK18357.

(cherry picked from commit 245e5a2f)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>

876eee2b

Nov 03, 2016

[SPARK-18099][YARN] Fail if same files added to distributed cache for --files and --archives · 569f77a1

Kishor Patil authored 8 years ago

## What changes were proposed in this pull request?

During spark-submit, if yarn dist cache is instructed to add same file under --files and --archives, This code change ensures the spark yarn distributed cache behaviour is retained i.e. to warn and fail if same files is mentioned in both --files and --archives.
## How was this patch tested?

Manually tested:
1. if same jar is mentioned in --jars and --files it will continue to submit the job.
- basically functionality [SPARK-14423] #12203 is unchanged
  1. if same file is mentioned in --files and --archives it will fail to submit the job.

Please review https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 before opening a pull request.

… under archives and files

Author: Kishor Patil <kpatil@yahoo-inc.com>

Closes #15627 from kishorvpatil/spark18099.

(cherry picked from commit 098e4ca9)
Signed-off-by: Tom Graves <tgraves@yahoo-inc.com>

569f77a1

Nov 02, 2016

[SPARK-18160][CORE][YARN] spark.files & spark.jars should not be passed to driver in yarn mode · bd3ea659

Jeff Zhang authored 8 years ago


## What changes were proposed in this pull request?

spark.files is still passed to driver in yarn mode, so SparkContext will still handle it which cause the error in the jira desc.

## How was this patch tested?

Tested manually in a 5 node cluster. As this issue only happens in multiple node cluster, so I didn't write test for it.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #15669 from zjffdu/SPARK-18160.

(cherry picked from commit 3c24299b)
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>

bd3ea659

Oct 26, 2016

[SPARK-18027][YARN] .sparkStaging not clean on RM ApplicationNotFoundException · 29781364

Sean Owen authored 8 years ago

## What changes were proposed in this pull request?

Cleanup YARN staging dir on all `KILLED`/`FAILED` paths in `monitorApplication`

## How was this patch tested?

Existing tests

Author: Sean Owen <sowen@cloudera.com>

Closes #15598 from srowen/SPARK-18027.

Unverified

29781364

Oct 21, 2016

[SPARK-17960][PYSPARK][UPGRADE TO PY4J 0.10.4] · 595893d3

Jagadeesan authored 8 years ago

## What changes were proposed in this pull request?

1) Upgrade the Py4J version on the Java side
2) Update the py4j src zip file we bundle with Spark

## How was this patch tested?

Existing doctests & unit tests pass

Author: Jagadeesan <as2@us.ibm.com>

Closes #15514 from jagadeesanas2/SPARK-17960.

Unverified

595893d3

Sep 27, 2016

[SPARK-16757] Set up Spark caller context to HDFS and YARN · 6a68c5d7

Weiqing Yang authored 8 years ago

## What changes were proposed in this pull request?

1. Pass `jobId` to Task.
2. Invoke Hadoop APIs.
    * A new function `setCallerContext` is added in `Utils`. `setCallerContext` function invokes APIs of   `org.apache.hadoop.ipc.CallerContext` to set up spark caller contexts, which will be written into `hdfs-audit.log` and Yarn RM audit log.
    * For HDFS: Spark sets up its caller context by invoking`org.apache.hadoop.ipc.CallerContext` in `Task` and Yarn `Client` and `ApplicationMaster`.
    * For Yarn: Spark sets up its caller context by invoking `org.apache.hadoop.ipc.CallerContext` in Yarn `Client`.

## How was this patch tested?
Manual Tests against some Spark applications in Yarn client mode and Yarn cluster mode. Need to check if spark caller contexts are written into HDFS hdfs-audit.log and Yarn RM audit log successfully.

For example, run SparkKmeans in Yarn client mode:
```
./bin/spark-submit --verbose --executor-cores 3 --num-executors 1 --master yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar hdfs://localhost:9000/lr_big.txt 2 5
```

**Before**:
There will be no Spark caller context in records of `hdfs-audit.log` and Yarn RM audit log.

**After**:
Spark caller contexts will be written in records of `hdfs-audit.log` and Yarn RM audit log.

These are records in `hdfs-audit.log`:
```
2016-09-20 11:54:24,116 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_CLIENT_AppId_application_1474394339641_0005
2016-09-20 11:54:28,164 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0005_JobId_0_StageId_0_AttemptId_0_TaskId_2_AttemptNum_0
2016-09-20 11:54:28,164 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0005_JobId_0_StageId_0_AttemptId_0_TaskId_1_AttemptNum_0
2016-09-20 11:54:28,164 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0005_JobId_0_StageId_0_AttemptId_0_TaskId_0_AttemptNum_0
```
```
2016-09-20 11:59:33,868 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=mkdirs	src=/private/tmp/hadoop-wyang/nm-local-dir/usercache/wyang/appcache/application_1474394339641_0006/container_1474394339641_0006_01_000001/spark-warehouse	dst=null	perm=wyang:supergroup:rwxr-xr-x	proto=rpc	callerContext=SPARK_APPLICATION_MASTER_AppId_application_1474394339641_0006_AttemptId_1
2016-09-20 11:59:37,214 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0006_AttemptId_1_JobId_0_StageId_0_AttemptId_0_TaskId_1_AttemptNum_0
2016-09-20 11:59:37,215 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0006_AttemptId_1_JobId_0_StageId_0_AttemptId_0_TaskId_2_AttemptNum_0
2016-09-20 11:59:37,215 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0006_AttemptId_1_JobId_0_StageId_0_AttemptId_0_TaskId_0_AttemptNum_0
2016-09-20 11:59:42,391 INFO FSNamesystem.audit: allowed=true	ugi=wyang (auth:SIMPLE)	ip=/127.0.0.1	cmd=open	src=/lr_big.txt	dst=null	perm=null	proto=rpc	callerContext=SPARK_TASK_AppId_application_1474394339641_0006_AttemptId_1_JobId_0_StageId_0_AttemptId_0_TaskId_3_AttemptNum_0
```
This is a record in Yarn RM log:
```
2016-09-20 11:59:24,050 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger: USER=wyang	IP=127.0.0.1	OPERATION=Submit Application Request	TARGET=ClientRMService	RESULT=SUCCESS	APPID=application_1474394339641_0006	CALLERCONTEXT=SPARK_CLIENT_AppId_application_1474394339641_0006
```

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #14659 from Sherry302/callercontextSubmit.

6a68c5d7

Sep 20, 2016

[SPARK-17611][YARN][TEST] Make shuffle service test really test auth. · 7e418e99

Marcelo Vanzin authored 8 years ago

Currently, the code is just swallowing exceptions, and not really checking
whether the auth information was being recorded properly. Fix both problems,
and also avoid tests inadvertently affecting other tests by modifying the
shared config variable (by making it not shared).

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #15161 from vanzin/SPARK-17611.

7e418e99

Sep 14, 2016

[SPARK-17511] Yarn Dynamic Allocation: Avoid marking released container as Failed · ff6e4cbd

Kishor Patil authored 8 years ago

## What changes were proposed in this pull request?

Due to race conditions, the ` assert(numExecutorsRunning <= targetNumExecutors)` can fail causing `AssertionError`. So removed the assertion, instead moved the conditional check before launching new container:
```
java.lang.AssertionError: assertion failed
        at scala.Predef$.assert(Predef.scala:156)
        at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1.org$apache$spark$deploy$yarn$YarnAllocator$$anonfun$$updateInternalState$1(YarnAllocator.scala:489)
        at org.apache.spark.deploy.yarn.YarnAllocator$$anonfun$runAllocatedContainers$1$$anon$1.run(YarnAllocator.scala:519)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
```
## How was this patch tested?
This was manually tested using a large ForkAndJoin job with Dynamic Allocation enabled to validate the failing job succeeds, without any such exception.

Author: Kishor Patil <kpatil@yahoo-inc.com>

Closes #15069 from kishorvpatil/SPARK-17511.

ff6e4cbd

Sep 09, 2016

[SPARK-17433] YarnShuffleService doesn't handle moving credentials levelDb · a3981c28

Thomas Graves authored 8 years ago

The secrets leveldb isn't being moved if you run spark shuffle services without yarn nm recovery on and then turn it on. This fixes that. I unfortunately missed this when I ported the patch from our internal branch 2 to master branch due to the changes for the recovery path. Note this only applies to master since it is the only place the yarn nm recovery dir is used.

Unit tests ran and tested on 8 node cluster. Fresh startup with NM recovery, fresh startup no nm recovery, switching between no nm recovery and recovery. Also tested running applications to make sure wasn't affected by rolling upgrade.

Author: Thomas Graves <tgraves@prevailsail.corp.gq1.yahoo.com>
Author: Tom Graves <tgraves@apache.org>

Closes #14999 from tgravescs/SPARK-17433.

a3981c28

Sep 07, 2016

[SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of... · 3ce3a282

Liwei Lin authored 8 years ago

[SPARK-17359][SQL][MLLIB] Use ArrayBuffer.+=(A) instead of ArrayBuffer.append(A) in performance critical paths

## What changes were proposed in this pull request?

We should generally use `ArrayBuffer.+=(A)` rather than `ArrayBuffer.append(A)`, because `append(A)` would involve extra boxing / unboxing.

## How was this patch tested?

N/A

Author: Liwei Lin <lwlin7@gmail.com>

Closes #14914 from lw-lin/append_to_plus_eq_v2.

3ce3a282

Sep 06, 2016

[SPARK-15891][YARN] Clean up some logging in the YARN AM. · 0bd00ff2

Marcelo Vanzin authored 8 years ago

To make the log file more readable, rework some of the logging done
by the AM:

- log executor command / env just once, since they're all almost the same;
  the information that changes, such as executor ID, is already available
  in other log messages.
- avoid printing logs when nothing happens, especially when updating the
  container requests in the allocator.
- print fewer log messages when requesting many unlocalized executors,
  instead of repeating the same message multiple times.
- removed some logs that seemed unnecessary.

In the process, I slightly fixed up the wording in a few log messages, and
did some minor clean up of method arguments that were redundant.

Tested by running existing unit tests, and analyzing the logs of an
application that exercises dynamic allocation by forcing executors
to be allocated and be killed in waves.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14943 from vanzin/SPARK-15891.

0bd00ff2

Sep 02, 2016

[SPARK-16711] YarnShuffleService doesn't re-init properly on YARN rolling upgrade · e79962f2

Thomas Graves authored 8 years ago

The Spark Yarn Shuffle Service doesn't re-initialize the application credentials early enough which causes any other spark executors trying to fetch from that node during a rolling upgrade to fail with "java.lang.NullPointerException: Password cannot be null if SASL is enabled". Right now the spark shuffle service relies on the Yarn nodemanager to re-register the applications, unfortunately this is after we open the port for other executors to connect. If other executors connected before the re-register they get a null pointer exception which isn't a re-tryable exception and cause them to fail pretty quickly. To solve this I added another leveldb file so that it can save and re-initialize all the applications before opening the port for other executors to connect to it. Adding another leveldb was simpler from the code structure point of view.

Most of the code changes are moving things to common util class.

Patch was tested manually on a Yarn cluster with rolling upgrade was happing while spark job was running. Without the patch I consistently get the NullPointerException, with the patch the job gets a few Connection refused exceptions but the retries kick in and the it succeeds.

Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>

Closes #14718 from tgravescs/SPARK-16711.

e79962f2

Sep 01, 2016

[SPARK-16533][CORE] resolve deadlocking in driver when executors die · a0aac4b7

Angus Gerry authored 8 years ago

## What changes were proposed in this pull request?
This pull request reverts the changes made as a part of #14605, which simply side-steps the deadlock issue. Instead, I propose the following approach:
* Use `scheduleWithFixedDelay` when calling `ExecutorAllocationManager.schedule` for scheduling executor requests. The intent of this is that if invocations are delayed beyond the default schedule interval on account of lock contention, then we avoid a situation where calls to `schedule` are made back-to-back, potentially releasing and then immediately reacquiring these locks - further exacerbating contention.
* Replace a number of calls to `askWithRetry` with `ask` inside of message handling code in `CoarseGrainedSchedulerBackend` and its ilk. This allows us queue messages with the relevant endpoints, release whatever locks we might be holding, and then block whilst awaiting the response. This change is made at the cost of being able to retry should sending the message fail, as retrying outside of the lock could easily cause race conditions if other conflicting messages have been sent whilst awaiting a response. I believe this to be the lesser of two evils, as in many cases these RPC calls are to process local components, and so failures are more likely to be deterministic, and timeouts are more likely to be caused by lock contention.

## How was this patch tested?
Existing tests, and manual tests under yarn-client mode.

Author: Angus Gerry <angolon@gmail.com>

Closes #14710 from angolon/SPARK-16533.

a0aac4b7

Aug 30, 2016

[SPARK-5682][CORE] Add encrypted shuffle in spark · 4b4e329e

Ferdinand Xu authored 8 years ago

This patch is using Apache Commons Crypto library to enable shuffle encryption support.

Author: Ferdinand Xu <cheng.a.xu@intel.com>
Author: kellyzly <kellyzly@126.com>

Closes #8880 from winningsix/SPARK-10771.

4b4e329e

Aug 24, 2016

[SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same... · 0b3a4be9

Sean Owen authored 8 years ago

[SPARK-16781][PYSPARK] java launched by PySpark as gateway may not be the same java used in the spark environment

## What changes were proposed in this pull request?

Update to py4j 0.10.3 to enable JAVA_HOME support

## How was this patch tested?

Pyspark tests

Author: Sean Owen <sowen@cloudera.com>

Closes #14748 from srowen/SPARK-16781.

0b3a4be9

Aug 17, 2016

[SPARK-16736][CORE][SQL] purge superfluous fs calls · cc97ea18

Steve Loughran authored 8 years ago

A review of the code, working back from Hadoop's `FileSystem.exists()` and `FileSystem.isDirectory()` code, then removing uses of the calls when superfluous.

1. delete is harmless if called on a nonexistent path, so don't do any checks before deletes
1. any `FileSystem.exists()` check before `getFileStatus()` or `open()` is superfluous as the operation itself does the check. Instead the `FileNotFoundException` is caught and triggers the downgraded path. When a `FileNotFoundException` was thrown before, the code still creates a new FNFE with the error messages. Though now the inner exceptions are nested, for easier diagnostics.

Initially, relying on Jenkins test runs.

One troublespot here is that some of the codepaths are clearly error situations; it's not clear that they have coverage anyway. Trying to create the failure conditions in tests would be ideal, but it will also be hard.

Author: Steve Loughran <stevel@apache.org>

Closes #14371 from steveloughran/cloud/SPARK-16736-superfluous-fs-calls.

cc97ea18

[SPARK-16930][YARN] Fix a couple of races in cluster app initialization. · e3fec51f

Marcelo Vanzin authored 8 years ago

There are two narrow races that could cause the ApplicationMaster to miss
when the user application instantiates the SparkContext, which could cause
app failures when nothing was wrong with the app. It was also possible for
a failing application to get stuck in the loop that waits for the context
for a long time, instead of failing quickly.

The change uses a promise to track the SparkContext instance, which gets
rid of the races and allows for some simplification of the code.

Tested with existing unit tests, and a new one being added to test the
timeout code.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #14542 from vanzin/SPARK-16930.

e3fec51f

Aug 11, 2016

[SPARK-17022][YARN] Handle potential deadlock in driver handling messages · ea0bf91b

WangTaoTheTonic authored 8 years ago

## What changes were proposed in this pull request?

We directly send RequestExecutors to AM instead of transfer it to yarnShedulerBackend first, to avoid potential deadlock.

## How was this patch tested?

manual tests

Author: WangTaoTheTonic <wangtao111@huawei.com>

Closes #14605 from WangTaoTheTonic/lock.

ea0bf91b

Aug 10, 2016

[SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN · ab648c00

jerryshao authored 8 years ago

## What changes were proposed in this pull request?

Add a configurable token manager for Spark on running on yarn.

### Current Problems ###

1. Supported token provider is hard-coded, currently only hdfs, hbase and hive are supported and it is impossible for user to add new token provider without code changes.
2. Also this problem exits in timely token renewer and updater.

### Changes In This Proposal ###

In this proposal, to address the problems mentioned above and make the current code more cleaner and easier to understand, mainly has 3 changes:

1. Abstract a `ServiceTokenProvider` as well as `ServiceTokenRenewable` interface for token provider. Each service wants to communicate with Spark through token way needs to implement this interface.
2. Provide a `ConfigurableTokenManager` to manage all the register token providers, also token renewer and updater. Also this class offers the API for other modules to obtain tokens, get renewal interval and so on.
3. Implement 3 built-in token providers `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider` to keep the same semantics as supported today. Whether to load in these built-in token providers is controlled by configuration "spark.yarn.security.tokens.${service}.enabled", by default for all the built-in token providers are loaded.

### Behavior Changes ###

For the end user there's no behavior change, we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive).

For user implemented token provider (assume the name of token provider is "test") needs to add into this class should have two configurations:

1. `spark.yarn.security.tokens.test.enabled` to true
2. `spark.yarn.security.tokens.test.class` to the full qualified class name.

So we still keep the same semantics as current code while add one new configuration.

### Current Status ###

- [x] token provider interface and management framework.
- [x] implement built-in token providers (hdfs, hbase, hive).
- [x] Coverage of unit test.
- [x] Integrated test with security cluster.

## How was this patch tested?

Unit test and integrated test.

Please suggest and review, any comment is greatly appreciated.

Author: jerryshao <sshao@hortonworks.com>

Closes #14065 from jerryshao/SPARK-16342.

ab648c00

Aug 08, 2016

[SPARK-16779][TRIVIAL] Avoid using postfix operators where they do not add... · 9216901d

Holden Karau authored 8 years ago

[SPARK-16779][TRIVIAL] Avoid using postfix operators where they do not add much and remove whitelisting

## What changes were proposed in this pull request?

Avoid using postfix operation for command execution in SQLQuerySuite where it wasn't whitelisted and audit existing whitelistings removing postfix operators from most places. Some notable places where postfix operation remains is in the XML parsing & time units (seconds, millis, etc.) where it arguably can improve readability.

## How was this patch tested?

Existing tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #14407 from holdenk/SPARK-16779.

9216901d

Jul 27, 2016

[SPARK-16110][YARN][PYSPARK] Fix allowing python version to be specified per... · b14d7b5c

KevinGrealish authored 8 years ago

[SPARK-16110][YARN][PYSPARK] Fix allowing python version to be specified per submit for cluster mode.

## What changes were proposed in this pull request?

This fix allows submit of pyspark jobs to specify python 2 or 3.

Change ordering in setup for application master environment so env vars PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON can be overridden by spark.yarn.appMasterEnv.* conf settings. This applies to YARN in cluster mode. This allows them to be set per submission without needing the unset the env vars (which is not always possible - e.g. batch submit with LIVY only exposes the arguments to spark-submit)

## How was this patch tested?
Manual and existing unit tests.

Author: KevinGrealish <KevinGre@microsoft.com>

Closes #13824 from KevinGrealish/SPARK-16110.

b14d7b5c