    [SPARK-20989][CORE] Fail to start multiple workers on one host if external shuffle service is enabled in standalone mode
    
    ## What changes were proposed in this pull request?
    
    In standalone mode, if we enable the external shuffle service by setting `spark.shuffle.service.enabled` to true and then try to start multiple workers on one host (by setting `SPARK_WORKER_INSTANCES=3` in spark-env.sh and then running `sbin/start-slaves.sh`), only one worker launches successfully on each host and the rest fail to launch.
    The reason is that the port of the external shuffle service is configured by `spark.shuffle.service.port`, so we can currently start no more than one external shuffle service on each host. In our case, each worker tries to start an external shuffle service, and only one of them succeeds.
    
    We should give an explicit reason for the failure instead of failing silently.
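    
    For illustration, here is a minimal, self-contained sketch of the kind of guard this patch adds to `Worker.main`. The wrapper object `GuardSketch` and the exact wording of the lookups are illustrative, not the verbatim patch; only the `require(...)` failure message is taken from the logs below:
    ```scala
    import org.apache.spark.SparkConf
    
    // Hypothetical standalone wrapper; in the patch the check lives in
    // org.apache.spark.deploy.worker.Worker.main.
    object GuardSketch {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
        // The shuffle service binds a single fixed port (spark.shuffle.service.port),
        // so at most one external shuffle service can run per host. If multiple worker
        // instances are requested alongside it, fail fast with an explicit reason.
        val externalShuffleServiceEnabled =
          conf.getBoolean("spark.shuffle.service.enabled", false)
        val sparkWorkerInstances = sys.env.getOrElse("SPARK_WORKER_INSTANCES", "1").toInt
        require(!externalShuffleServiceEnabled || sparkWorkerInstances <= 1,
          "Start multiple worker on one host failed because we may launch no more than " +
            "one external shuffle service on each host, please set " +
            "spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 " +
            "to resolve the conflict.")
      }
    }
    ```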
    
    ## How was this patch tested?
    Manually tested by the following steps:
    1. Set `SPARK_WORKER_INSTANCES=3` in `conf/spark-env.sh`;
    2. Set `spark.shuffle.service.enabled` to `true` in `conf/spark-defaults.conf` (see the config sketch after this list);
    3. Run `sbin/start-all.sh`.
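    
    For reference, a minimal configuration reproducing the scenario (values taken from the description above):
    ```
    # conf/spark-env.sh -- launch three worker instances per host
    export SPARK_WORKER_INSTANCES=3
    
    # conf/spark-defaults.conf -- enable the external shuffle service
    spark.shuffle.service.enabled    true
    ```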
    
    Before the change, you see no error on the command line:
    ```
    starting org.apache.spark.deploy.master.Master, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
    ```
    And you can see in the web UI that only one worker is running.
    
    After the change, you get explicit error messages on the command line:
    ```
    starting org.apache.spark.deploy.master.Master, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.master.Master-1-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
    localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://xxx.local:7077
    localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing view acls to: xxx
    localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing modify acls to: xxx
    localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing view acls groups to:
    localhost:   17/06/13 23:24:53 INFO SecurityManager: Changing modify acls groups to:
    localhost:   17/06/13 23:24:53 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
    localhost:   17/06/13 23:24:54 INFO Utils: Successfully started service 'sparkWorker' on port 63354.
    localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
    localhost:   	at scala.Predef$.require(Predef.scala:224)
    localhost:   	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
    localhost:   	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
    localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-1-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
    localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8082 spark://xxx.local:7077
    localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing view acls to: xxx
    localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing modify acls to: xxx
    localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing view acls groups to:
    localhost:   17/06/13 23:24:56 INFO SecurityManager: Changing modify acls groups to:
    localhost:   17/06/13 23:24:56 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
    localhost:   17/06/13 23:24:56 INFO Utils: Successfully started service 'sparkWorker' on port 63359.
    localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
    localhost:   	at scala.Predef$.require(Predef.scala:224)
    localhost:   	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
    localhost:   	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
    localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-2-xxx.local.out
    localhost: starting org.apache.spark.deploy.worker.Worker, logging to /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
    localhost: failed to launch: nice -n 0 /Users/xxx/workspace/spark/bin/spark-class org.apache.spark.deploy.worker.Worker --webui-port 8083 spark://xxx.local:7077
    localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing view acls to: xxx
    localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing modify acls to: xxx
    localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing view acls groups to:
    localhost:   17/06/13 23:24:59 INFO SecurityManager: Changing modify acls groups to:
    localhost:   17/06/13 23:24:59 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(xxx); groups with view permissions: Set(); users  with modify permissions: Set(xxx); groups with modify permissions: Set()
    localhost:   17/06/13 23:24:59 INFO Utils: Successfully started service 'sparkWorker' on port 63360.
    localhost:   Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Start multiple worker on one host failed because we may launch no more than one external shuffle service on each host, please set spark.shuffle.service.enabled to false or set SPARK_WORKER_INSTANCES to 1 to resolve the conflict.
    localhost:   	at scala.Predef$.require(Predef.scala:224)
    localhost:   	at org.apache.spark.deploy.worker.Worker$.main(Worker.scala:752)
    localhost:   	at org.apache.spark.deploy.worker.Worker.main(Worker.scala)
    localhost: full log in /Users/xxx/workspace/spark/logs/spark-xxx-org.apache.spark.deploy.worker.Worker-3-xxx.local.out
    ```
    
    Author: Xingbo Jiang <xingbo.jiang@databricks.com>
    
    Closes #18290 from jiangxb1987/start-slave.