[SPARK-18444][SPARKR] SparkR running in yarn-cluster mode should not download Spark package.
    Yanbo Liang authored
    ## What changes were proposed in this pull request?
When running a SparkR job in yarn-cluster mode, SparkR downloads the Spark package from the Apache website, which is not necessary.
    ```
    ./bin/spark-submit --master yarn-cluster ./examples/src/main/r/dataframe.R
    ```
The output is as follows:
    ```
    Attaching package: ‘SparkR’
    
    The following objects are masked from ‘package:stats’:
    
        cov, filter, lag, na.omit, predict, sd, var, window
    
    The following objects are masked from ‘package:base’:
    
        as.data.frame, colnames, colnames<-, drop, endsWith, intersect,
        rank, rbind, sample, startsWith, subset, summary, transform, union
    
    Spark not found in SPARK_HOME:
    Spark not found in the cache directory. Installation will start.
    MirrorUrl not provided.
    Looking for preferred site from apache website...
    ......
    ```
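
The download above is performed by SparkR's `install.spark()`, which looks up an Apache mirror when none is given. As a side note, a user running locally can point it at a specific mirror; the URL below is illustrative:

```r
library(SparkR)

# install.spark() fetches a Spark distribution into a local cache directory;
# mirrorUrl overrides the "Looking for preferred site" mirror lookup above.
install.spark(mirrorUrl = "https://archive.apache.org/dist/spark")
```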
There is no `SPARK_HOME` set in yarn-cluster mode, since the R process runs on a remote host in the YARN cluster rather than on the client host. The JVM comes up first and the R process then connects to it, so in this case we should never need to download Spark: it is already running.
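
For illustration, here is a minimal sketch of the kind of guard this change implies, written against the description above; the helpers `isMasterLocal()` and `isClientMode()` are assumptions for readability, not necessarily the patch's actual code:

```r
library(SparkR)  # for install.spark()

# Illustrative helpers (assumptions, not the patch's actual code).
isMasterLocal <- function(master) grepl("^local(\\[.*\\])?$", master)
isClientMode <- function(master) grepl("client$", master)  # e.g. "yarn-client"

# Sketch: decide whether this R process needs a local Spark distribution.
# Returns the install directory when a download happened, NULL otherwise.
sparkCheckInstall <- function(sparkHome, master, deployMode) {
  if (!is.na(file.info(sparkHome)$isdir)) {
    message("Spark package found in SPARK_HOME: ", sparkHome)
    NULL
  } else if (interactive() || isMasterLocal(master)) {
    # Local or interactive use: download Spark into the local cache.
    install.spark()
  } else if (isClientMode(master) || deployMode == "client") {
    # Client mode with no SPARK_HOME is a user error, not a download trigger.
    stop("Spark not found in SPARK_HOME: ", sparkHome)
  } else {
    # Cluster deploy mode (e.g. yarn-cluster): the JVM launched this R
    # process, so Spark is already running; never download.
    NULL
  }
}
```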
    
    ## How was this patch tested?
    Offline test.
    
    Author: Yanbo Liang <ybliang8@gmail.com>
    
    Closes #15888 from yanboliang/spark-18444.