    [SPARK-17210][SPARKR] sparkr.zip is not distributed to executors when running sparkr in RStudio · f62ddc59
    ## What changes were proposed in this pull request?
    
Spark adds sparkr.zip to the archives only when running in YARN mode (SparkSubmit.scala):
    ```
        if (args.isR && clusterManager == YARN) {
          val sparkRPackagePath = RUtils.localSparkRPackagePath
          if (sparkRPackagePath.isEmpty) {
            printErrorAndExit("SPARK_HOME does not exist for R application in YARN mode.")
          }
          val sparkRPackageFile = new File(sparkRPackagePath.get, SPARKR_PACKAGE_ARCHIVE)
          if (!sparkRPackageFile.exists()) {
            printErrorAndExit(s"$SPARKR_PACKAGE_ARCHIVE does not exist for R application in YARN mode.")
          }
          val sparkRPackageURI = Utils.resolveURI(sparkRPackageFile.getAbsolutePath).toString
    
          // Distribute the SparkR package.
          // Assigns a symbol link name "sparkr" to the shipped package.
          args.archives = mergeFileLists(args.archives, sparkRPackageURI + "#sparkr")
    
          // Distribute the R package archive containing all the built R packages.
          if (!RUtils.rPackages.isEmpty) {
            val rPackageFile =
              RPackageUtils.zipRLibraries(new File(RUtils.rPackages.get), R_PACKAGE_ARCHIVE)
            if (!rPackageFile.exists()) {
              printErrorAndExit("Failed to zip all the built R packages.")
            }
    
            val rPackageURI = Utils.resolveURI(rPackageFile.getAbsolutePath).toString
            // Assigns a symbol link name "rpkg" to the shipped package.
            args.archives = mergeFileLists(args.archives, rPackageURI + "#rpkg")
          }
        }
    ```
So it is necessary to pass spark.master from the R process to the JVM; otherwise sparkr.zip won't be distributed to the executors. Besides that, this patch also passes spark.yarn.keytab/spark.yarn.principal to the Spark side, because the JVM process needs them to access a secured cluster.
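To illustrate the idea, here is a minimal R sketch, not the actual SparkR internals: the helper name `forwardedSparkConfs` is made up for illustration. The point is that the R process folds `spark.master` and the YARN security settings into the properties it hands to the JVM launcher, so that SparkSubmit sees YARN mode and ships sparkr.zip.
```
# Minimal sketch (assumption): collect the settings that must reach the JVM
# launcher instead of staying local to the R process.
forwardedSparkConfs <- function(master, sparkConfig = list()) {
  confs <- list("spark.master" = master)
  # Needed by the JVM side to authenticate against a Kerberos-secured cluster
  for (key in c("spark.yarn.keytab", "spark.yarn.principal")) {
    if (!is.null(sparkConfig[[key]])) {
      confs[[key]] <- sparkConfig[[key]]
    }
  }
  confs
}

# Example: these key/value pairs would be forwarded when launching the backend
forwardedSparkConfs("yarn-client",
                    list("spark.yarn.keytab" = "/path/to/user.keytab",
                         "spark.yarn.principal" = "user@EXAMPLE.COM"))
```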
    
    ## How was this patch tested?
    
Verified it manually in RStudio using the following code.
    ```
    Sys.setenv(SPARK_HOME="/Users/jzhang/github/spark")
.libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    sparkR.session(master="yarn-client", sparkConfig = list(spark.executor.instances="1"))
    df <- as.DataFrame(mtcars)
    head(df)
    
    ```
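For the secured-cluster part of the change, a hedged variant of the same manual check (the keytab path and principal below are placeholders) would pass spark.yarn.keytab and spark.yarn.principal through sparkConfig and rely on this patch to forward them to the JVM:
```
library(SparkR)
# Placeholder keytab/principal; adjust to the secured cluster being tested
sparkR.session(master = "yarn-client",
               sparkConfig = list(spark.executor.instances = "1",
                                  spark.yarn.keytab = "/path/to/user.keytab",
                                  spark.yarn.principal = "user@EXAMPLE.COM"))
df <- as.DataFrame(mtcars)
head(df)
```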
    
    Author: Jeff Zhang <zjffdu@apache.org>
    
    Closes #14784 from zjffdu/SPARK-17210.