  1. Jun 15, 2017
      [SPARK-20434][YARN][CORE] Move Hadoop delegation token code from yarn to core · a18d6371
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
Move Hadoop delegation token code from `spark-yarn` to `spark-core`, so that other schedulers (such as Mesos) may use it.  In order to avoid exposing Hadoop interfaces in spark-core, the new Hadoop delegation token classes are kept private.  In order to provide backward compatibility, and to allow YARN users to continue to load their own delegation token providers via Java service loading, the old YARN interfaces, as well as the client code that uses them, have been retained.
      
      Summary:
- Moved the registered `yarn.security.ServiceCredentialProvider` classes from `spark-yarn` to `spark-core`, into a new, private hierarchy under `HadoopDelegationTokenProvider`.  Client code in `HadoopDelegationTokenManager` now loads credentials from a whitelist of three providers (`HadoopFSDelegationTokenProvider`, `HiveDelegationTokenProvider`, `HBaseDelegationTokenProvider`) instead of service loading them, which means that users are not able to implement their own delegation token providers here, as they are in the `spark-yarn` module.
      
- The `yarn.security.ServiceCredentialProvider` interface has been kept for backwards compatibility, and to continue to allow YARN users to implement their own delegation token providers.  Client code in YARN now fetches tokens via the new `YARNHadoopDelegationTokenManager` class, which fetches tokens from the core providers through `HadoopDelegationTokenManager` and also service-loads them from `yarn.security.ServiceCredentialProvider` (see the sketch after the hierarchies below).
      
      Old Hierarchy:
      
      ```
      yarn.security.ServiceCredentialProvider (service loaded)
        HadoopFSCredentialProvider
        HiveCredentialProvider
        HBaseCredentialProvider
      yarn.security.ConfigurableCredentialManager
      ```
      
      New Hierarchy:
      
      ```
      HadoopDelegationTokenManager
      HadoopDelegationTokenProvider (not service loaded)
        HadoopFSDelegationTokenProvider
        HiveDelegationTokenProvider
        HBaseDelegationTokenProvider
      
      yarn.security.ServiceCredentialProvider (service loaded)
      yarn.security.YARNHadoopDelegationTokenManager
      ```
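
For YARN users who rely on the retained interface, providers are still registered via Java service loading. Below is a minimal, hypothetical sketch; it assumes the retained `yarn.security.ServiceCredentialProvider` trait keeps its pre-existing shape (`serviceName`, `credentialsRequired`, `obtainCredentials`), and `MyServiceCredentialProvider` is an invented name:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.security.{Credentials, UserGroupInformation}
import org.apache.spark.SparkConf
import org.apache.spark.deploy.yarn.security.ServiceCredentialProvider

// Hypothetical custom provider, registered by listing this class name in
// META-INF/services/org.apache.spark.deploy.yarn.security.ServiceCredentialProvider
class MyServiceCredentialProvider extends ServiceCredentialProvider {
  override def serviceName: String = "myservice"

  // Only fetch tokens when running against a secured cluster.
  override def credentialsRequired(hadoopConf: Configuration): Boolean =
    UserGroupInformation.isSecurityEnabled

  // Add this service's delegation token to `creds`; return the next renewal
  // time in milliseconds, or None if the token does not need renewal.
  override def obtainCredentials(
      hadoopConf: Configuration,
      sparkConf: SparkConf,
      creds: Credentials): Option[Long] = {
    // ... fetch a token from the external service and call creds.addToken(...) ...
    None
  }
}
```
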
      ## How was this patch tested?
      
      unit tests
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      Author: Dr. Stefan Schimanski <sttts@mesosphere.io>
      
      Closes #17723 from mgummelt/SPARK-20434-refactor-kerberos.
  2. May 08, 2017
      [SPARK-20605][CORE][YARN][MESOS] Deprecate not used AM and executor port configuration · 829cd7b8
      jerryshao authored
      ## What changes were proposed in this pull request?
      
After SPARK-10997, a client-mode Netty RpcEnv doesn't need to start a server, so these port configurations are no longer used; this proposes to remove the two configurations "spark.executor.port" and "spark.am.port".
      
      ## How was this patch tested?
      
      Existing UTs.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #17866 from jerryshao/SPARK-20605.
  3. Feb 22, 2017
      [SPARK-19554][UI,YARN] Allow SHS URL to be used for tracking in YARN RM. · 4661d30b
      Marcelo Vanzin authored
      Allow an application to use the History Server URL as the tracking
      URL in the YARN RM, so there's still a link to the web UI somewhere
      in YARN even if the driver's UI is disabled. This is useful, for
      example, if an admin wants to disable the driver UI by default for
applications, since it's harder to secure (it involves non-trivial
SSL certificate and auth management that admins may not want
      to expose to user apps).
      
      This needs to be opt-in, because of the way the YARN proxy works, so
      a new configuration was added to enable the option.
      
      The YARN RM will proxy requests to live AMs instead of redirecting
      the client, so pages in the SHS UI will not render correctly since
      they'll reference invalid paths in the RM UI. The proxy base support
      in the SHS cannot be used since that would prevent direct access to
      the SHS.
      
      So, to solve this problem, for the feature to work end-to-end, a new
      YARN-specific filter was added that detects whether the requests come
from the proxy and redirects the client appropriately. The SHS admin has
      to add this filter manually if they want the feature to work.
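
As a sketch of the end-to-end setup (the key and filter names below are taken from this change as best I can tell, so treat them as assumptions):

```scala
import org.apache.spark.SparkConf

// Application side: opt in to advertising the SHS URL as the tracking URL.
val conf = new SparkConf()
  .set("spark.yarn.historyServer.allowTracking", "true")

// History server side: the admin manually adds the new redirect filter,
// e.g. via the SHS configuration file:
//   spark.ui.filters org.apache.spark.deploy.yarn.YarnProxyRedirectFilter
```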
      
      Tested with new unit test, and by running with the documented configuration
      set in a test cluster. Also verified the driver UI is used when it's
      enabled.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #16946 from vanzin/SPARK-19554.
4. Feb 08, 2017
      [SPARK-19464][CORE][YARN][TEST-HADOOP2.6] Remove support for Hadoop 2.5 and earlier · e8d3fca4
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      - Remove support for Hadoop 2.5 and earlier
      - Remove reflection and code constructs only needed to support multiple versions at once
      - Update docs to reflect newer versions
      - Remove older versions' builds and profiles.
      
      ## How was this patch tested?
      
      Existing tests
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #16810 from srowen/SPARK-19464.
5. Jan 17, 2017
      [SPARK-19179][YARN] Change spark.yarn.access.namenodes config and update docs · b79cc7ce
      jerryshao authored
      ## What changes were proposed in this pull request?
      
The `spark.yarn.access.namenodes` configuration name does not actually reflect its usage: in the code it refers to the Hadoop filesystems we fetch tokens for, not NameNodes. So this proposes to rename the configuration, and to change the related code and docs accordingly.
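
A short sketch of the rename, assuming the new key landed as `spark.yarn.access.hadoopFileSystems` (the filesystem URIs are invented):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Old, misleading name (the value never listed NameNodes as such):
  //   spark.yarn.access.namenodes
  // New name, reflecting that these are Hadoop filesystems to fetch tokens for:
  .set("spark.yarn.access.hadoopFileSystems", "hdfs://nn1:8020,webhdfs://nn2:50070")
```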
      
      ## How was this patch tested?
      
      Local verification.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16560 from jerryshao/SPARK-19179.
6. Jan 11, 2017
[SPARK-19021][YARN] Generalize HDFSCredentialProvider to support non-HDFS security filesystems · 4239a108
      jerryshao authored
Currently Spark can only get the token renewal interval from secure HDFS (hdfs://). If Spark runs with other secure filesystems, such as webHDFS (webhdfs://), wasb (wasb://), or ADLS, it will ignore those tokens and not obtain renewal intervals from them, which makes Spark unable to work with those secure clusters. So instead of only checking the HDFS token, we should generalize the code to support different `DelegationTokenIdentifier`s.
      
      ## How was this patch tested?
      
Manually verified in a secure cluster.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #16432 from jerryshao/SPARK-19021.
7. Jan 02, 2017
      [MINOR][DOC] Minor doc change for YARN credential providers · 0ac2f1e7
      Liang-Chi Hsieh authored
      ## What changes were proposed in this pull request?
      
The configuration `spark.yarn.security.tokens.{service}.enabled` is deprecated; now we should use `spark.yarn.security.credentials.{service}.enabled`. Some places in the docs had not been updated yet.
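
A one-line sketch of the rename described above (the `hive` service name is just an example):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Deprecated key: spark.yarn.security.tokens.hive.enabled
  .set("spark.yarn.security.credentials.hive.enabled", "false")
```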
      
      ## How was this patch tested?
      
      N/A. Just doc change.
      
      Author: Liang-Chi Hsieh <viirya@gmail.com>
      
      Closes #16444 from viirya/minor-credential-provider-doc.
8. Dec 05, 2016
      [DOCS][MINOR] Update location of Spark YARN shuffle jar · 5a92dc76
      Nicholas Chammas authored
      Looking at the distributions provided on spark.apache.org, I see that the Spark YARN shuffle jar is under `yarn/` and not `lib/`.
      
      This change is so minor I'm not sure it needs a JIRA. But let me know if so and I'll create one.
      
      Author: Nicholas Chammas <nicholas.chammas@gmail.com>
      
      Closes #16130 from nchammas/yarn-doc-fix.
9. Nov 17, 2016
      [YARN][DOC] Remove non-Yarn specific configurations from running-on-yarn.md · a3cac7bd
      Weiqing Yang authored
      ## What changes were proposed in this pull request?
      
Remove `spark.driver.memory`, `spark.executor.memory`, `spark.driver.cores`, and `spark.executor.cores` from `running-on-yarn.md` as they are not YARN-specific, and they are also defined in `configuration.md`.
      
      ## How was this patch tested?
Build passed & manual check.
      
      Author: Weiqing Yang <yangweiqing001@gmail.com>
      
      Closes #15869 from weiqingy/yarnDoc.
10. Aug 10, 2016
      [SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN · ab648c00
      jerryshao authored
      ## What changes were proposed in this pull request?
      
Add a configurable credential manager for Spark running on YARN.
      
      ### Current Problems ###
      
1. The supported token providers are hard-coded; currently only HDFS, HBase and Hive are supported, and it is impossible for users to add a new token provider without code changes.
2. The same problem exists in the periodic token renewer and updater.
      
      ### Changes In This Proposal ###
      
To address the problems mentioned above and make the current code cleaner and easier to understand, this proposal mainly makes 3 changes:
      
1. Abstract `ServiceTokenProvider` and `ServiceTokenRenewable` interfaces for token providers. Each service that wants to communicate with Spark via tokens needs to implement this interface.
2. Provide a `ConfigurableTokenManager` to manage all the registered token providers, as well as the token renewer and updater. This class also offers the API for other modules to obtain tokens, get renewal intervals and so on.
3. Implement 3 built-in token providers, `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider`, to keep the same semantics as supported today. Whether to load these built-in token providers is controlled by the configuration "spark.yarn.security.tokens.${service}.enabled"; by default all the built-in token providers are loaded.
      
      ### Behavior Changes ###
      
For the end user there is no behavior change: we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive).
      
A user-implemented token provider (assume its name is "test") that needs to be added to this manager should have two configurations (see the sketch after this list):
      
1. Set `spark.yarn.security.tokens.test.enabled` to `true`
2. Set `spark.yarn.security.tokens.test.class` to the fully qualified class name.
      
So we keep the same semantics as the current code while adding one new configuration.
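
A minimal sketch of registering such a provider with the two configurations above (the provider name "test" and the class name are hypothetical):

```scala
import org.apache.spark.SparkConf

// The provider implementation itself is discovered via Java service loading,
// so it must also be listed in the jar's META-INF/services registration file.
val conf = new SparkConf()
  .set("spark.yarn.security.tokens.test.enabled", "true")
  .set("spark.yarn.security.tokens.test.class", "com.example.TestTokenProvider")
```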
      
      ### Current Status ###
      
      - [x] token provider interface and management framework.
      - [x] implement built-in token providers (hdfs, hbase, hive).
      - [x] Coverage of unit test.
      - [x] Integrated test with security cluster.
      
      ## How was this patch tested?
      
      Unit test and integrated test.
      
      Please suggest and review, any comment is greatly appreciated.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #14065 from jerryshao/SPARK-16342.
11. Jun 29, 2016
      [SPARK-15990][YARN] Add rolling log aggregation support for Spark on yarn · 272a2f78
      jerryshao authored
      ## What changes were proposed in this pull request?
      
YARN has supported rolling log aggregation since 2.6. Previously logs would only be aggregated to HDFS after the application finished, which is quite painful for long-running applications like Spark Streaming or the thrift server; an out-of-disk problem can also occur when a log file grows too large. So this proposes to add rolling log aggregation support for Spark on YARN.
      
One limitation is that log4j must be changed to a file appender; Spark itself uses the console appender by default, with which the log file will not be created again once it is removed after aggregation. But I think lots of production users will have changed their log4j configuration from the default, so this is not a big problem.
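
A hedged sketch of enabling this, assuming the pattern-based keys this change introduces (`spark.yarn.rolledLog.includePattern` / `spark.yarn.rolledLog.excludePattern`) and a log4j setup already switched to a rolling file appender:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Only files matching the include pattern take part in rolling aggregation...
  .set("spark.yarn.rolledLog.includePattern", "stdout.*|stderr.*")
  // ...minus anything matching the exclude pattern.
  .set("spark.yarn.rolledLog.excludePattern", ".*\\.tmp")
```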
      
      ## How was this patch tested?
      
      Manually verified with Hadoop 2.7.1.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #13712 from jerryshao/SPARK-15990.
12. Jun 23, 2016
      [SPARK-13723][YARN] Change behavior of --num-executors with dynamic allocation. · 738f134b
      Ryan Blue authored
      ## What changes were proposed in this pull request?
      
      This changes the behavior of --num-executors and spark.executor.instances when using dynamic allocation. Instead of turning dynamic allocation off, it uses the value for the initial number of executors.
      
This change was discussed on [SPARK-13723](https://issues.apache.org/jira/browse/SPARK-13723). I highly recommend using it while we can change the behavior for 2.0.0. In practice, the 1.x behavior causes unexpected behavior for users (it is not clear that it disables dynamic allocation) and wastes cluster resources because users rarely notice the log message.
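
A sketch of the resulting precedence, mirroring what `Utils.getDynamicAllocationInitialExecutors` computes as I read this change: the initial executor target becomes the largest of the relevant settings instead of dynamic allocation being switched off:

```scala
import org.apache.spark.SparkConf

// Initial number of executors under dynamic allocation, per this change.
def initialExecutors(conf: SparkConf): Int =
  Seq(
    conf.getInt("spark.dynamicAllocation.minExecutors", 0),
    conf.getInt("spark.dynamicAllocation.initialExecutors", 0),
    conf.getInt("spark.executor.instances", 0) // i.e. --num-executors
  ).max
```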
      
      ## How was this patch tested?
      
      This patch updates tests and adds a test for Utils.getDynamicAllocationInitialExecutors.
      
      Author: Ryan Blue <blue@apache.org>
      
      Closes #13338 from rdblue/SPARK-13723-num-executors-with-dynamic-allocation.
13. May 27, 2016
      [YARN][DOC][MINOR] Remove several obsolete env variables and update the doc · 1b98fa2e
      jerryshao authored
      ## What changes were proposed in this pull request?
      
Remove several obsolete env variables that are no longer supported by Spark on YARN, and update the docs to cover several changes in 2.0.
      
      ## How was this patch tested?
      
      N/A
      
      CC vanzin tgravescs
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #13296 from jerryshao/yarn-doc.
14. Apr 05, 2016
      [SPARK-13063][YARN] Make the SPARK YARN STAGING DIR as configurable · bc36df12
      Devaraj K authored
## What changes were proposed in this pull request?

Made the Spark YARN staging directory configurable via the configuration 'spark.yarn.staging-dir' (see the sketch below).
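
A minimal sketch; the message spells the key 'spark.yarn.staging-dir', while later release docs list it as `spark.yarn.stagingDir`, which is what the sketch uses:

```scala
import org.apache.spark.SparkConf

// Point the YARN staging directory somewhere other than the user's home dir.
val conf = new SparkConf()
  .set("spark.yarn.stagingDir", "hdfs:///user/spark/staging")
```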
      
      ## How was this patch tested?
      
I have verified it manually by running applications on YARN: if 'spark.yarn.staging-dir' is configured then its value is used as the staging directory; otherwise the default value, i.e. the user's home directory on the file system, is used.
      
      Author: Devaraj K <devaraj@apache.org>
      
      Closes #12082 from devaraj-kavali/SPARK-13063.
15. Apr 01, 2016
      [SPARK-12343][YARN] Simplify Yarn client and client argument · 8ba2b7f2
      jerryshao authored
      ## What changes were proposed in this pull request?
      
Currently in Spark on YARN, configurations can be passed through SparkConf, env variables and command-line arguments, and some parts are duplicated, like the client arguments and SparkConf. So this proposes to simplify the command-line arguments.
      
      ## How was this patch tested?
      
This patch is tested manually and with unit tests.
      
CC vanzin tgravescs, please help to review this proposal. The original purpose of this JIRA was to remove `ClientArguments`; during refactoring, some arguments like `--class` and `--arg` turned out not to be easy to replace, so here I remove most of the command-line arguments and only keep the minimal set.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #11603 from jerryshao/SPARK-12343.
16. Mar 18, 2016
      [MINOR][DOCS] Update build descriptions and commands · c11ea2e4
      Dongjoon Hyun authored
      ## What changes were proposed in this pull request?
      
      This PR updates Scala and Hadoop versions in the build description and commands in `Building Spark` documents.
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dongjoon Hyun <dongjoon@apache.org>
      
      Closes #11838 from dongjoon-hyun/fix_doc_building_spark.
17. Mar 11, 2016
      [SPARK-13577][YARN] Allow Spark jar to be multiple jars, archive. · 07f1c544
      Marcelo Vanzin authored
      In preparation for the demise of assemblies, this change allows the
      YARN backend to use multiple jars and globs as the "Spark jar". The
      config option has been renamed to "spark.yarn.jars" to reflect that.
      
      A second option "spark.yarn.archive" was also added; if set, this
      takes precedence and uploads an archive expected to contain the jar
      files with the Spark code and its dependencies.
      
      Existing deployments should keep working, mostly. This change drops
      support for the "SPARK_JAR" environment variable, and also does not
      fall back to using "jarOfClass" if no configuration is set, falling
      back to finding files under SPARK_HOME instead. This should be fine
      since "jarOfClass" probably wouldn't work unless you were using
      spark-submit anyway.
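
A short sketch of the two options (the paths are invented; `spark.yarn.archive` takes precedence when both are set):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Multiple jars and globs may now serve as the "Spark jar":
  .set("spark.yarn.jars", "hdfs:///spark/jars/*.jar")
  // Or, taking precedence if set: one archive containing the Spark jars.
  .set("spark.yarn.archive", "hdfs:///spark/spark-libs.zip")
```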
      
      Tested with the unit tests, and trying the different config options
      on a YARN cluster.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #11500 from vanzin/SPARK-13577.
18. Oct 20, 2015
      [SPARK-11105] [YARN] Distribute log4j.properties to executors · 2f6dd634
      vundela authored
Currently the log4j.properties file is not uploaded to executors, which leads them to use the default values. This fix makes sure that the file is always uploaded to the distributed cache so that executors will use the latest settings.
      
If the user specifies log configuration through --files, then executors will pick up the configs from --files instead of from $SPARK_CONF_DIR/log4j.properties.
      
      Author: vundela <vsr@cloudera.com>
      Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
      
      Closes #9118 from vundela/master.
19. Oct 12, 2015
      [SPARK-10739] [YARN] Add application attempt window for Spark on Yarn · f97e9323
      jerryshao authored
Add an application attempt window for Spark on YARN to ignore old, out-of-window failures; this is useful for long-running applications to recover from failures.
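
A hedged example, assuming the validity-interval key this change adds is `spark.yarn.am.attemptFailuresValidityInterval`:

```scala
import org.apache.spark.SparkConf

// AM attempt failures older than one hour fall out of the window and are
// no longer counted toward the maximum number of application attempts.
val conf = new SparkConf()
  .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
```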
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #8857 from jerryshao/SPARK-10739 and squashes the following commits:
      
      36eabdc [jerryshao] change the doc
      7f9b77d [jerryshao] Style change
      1c9afd0 [jerryshao] Address the comments
      caca695 [jerryshao] Add application attempt window for Spark on Yarn
20. Sep 17, 2015
      [SPARK-10660] Doc describe error in the "Running Spark on YARN" page · c88bb5df
      yangping.wu authored
In the Configuration section, the default values of **spark.yarn.driver.memoryOverhead** and **spark.yarn.am.memoryOverhead** should be "driverMemory * 0.10, with minimum of 384" and "AM memory * 0.10, with minimum of 384" respectively, because from Spark 1.4.0 the **MEMORY_OVERHEAD_FACTOR** is set to 0.1, not 0.07.
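
The corrected default as a small sketch:

```scala
// Default memory overhead per the 0.1 factor described above.
def defaultMemoryOverheadMb(memoryMb: Long): Long =
  math.max((memoryMb * 0.10).toLong, 384L)

// e.g. defaultMemoryOverheadMb(4096) == 409, defaultMemoryOverheadMb(1024) == 384
```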
      
      Author: yangping.wu <wyphao.2007@163.com>
      
      Closes #8797 from 397090770/SparkOnYarnDocError.
21. Jul 27, 2015
      [SPARK-8405] [DOC] Add how to view logs on Web UI when yarn log aggregation is enabled · 62283816
      Carson Wang authored
Some users may not be aware that the logs are available on the Web UI even if YARN log aggregation is enabled. Update the doc to make this clear and to explain what needs to be configured.
      
      Author: Carson Wang <carson.wang@intel.com>
      
      Closes #7463 from carsonwang/YarnLogDoc and squashes the following commits:
      
      274c054 [Carson Wang] Minor text fix
      74df3a1 [Carson Wang] address comments
      5a95046 [Carson Wang] Update the text in the doc
      e5775c1 [Carson Wang] Update doc about how to view the logs on Web UI when yarn log aggregation is enabled
22. Jun 27, 2015
      [SPARK-3629] [YARN] [DOCS]: Improvement of the "Running Spark on YARN" document · d48e7893
      Neelesh Srinivas Salian authored
As per the description in the JIRA, I moved the contents of the page around and added some additional content.
      
      Author: Neelesh Srinivas Salian <nsalian@cloudera.com>
      
      Closes #6924 from nssalian/SPARK-3629 and squashes the following commits:
      
      944b7a0 [Neelesh Srinivas Salian] Changed the lines about deploy-mode and added backticks to all parameters
      40dbc0b [Neelesh Srinivas Salian] Changed dfs to HDFS, deploy-mode in backticks and updated the master yarn line
      9cbc072 [Neelesh Srinivas Salian] Updated a few lines in the Launching Spark on YARN Section
      8e8db7f [Neelesh Srinivas Salian] Removed the changes in this commit to help clearly distinguish movement from update
      151c298 [Neelesh Srinivas Salian] SPARK-3629: Improvement of the Spark on YARN document
23. Jun 26, 2015
      [SPARK-8302] Support heterogeneous cluster install paths on YARN. · 37bf76a2
      Marcelo Vanzin authored
      Some users have Hadoop installations on different paths across
      their cluster. Currently, that makes it hard to set up some
      configuration in Spark since that requires hardcoding paths to
      jar files or native libraries, which wouldn't work on such a cluster.
      
      This change introduces a couple of YARN-specific configurations
      that instruct the backend to replace certain paths when launching
      remote processes. That way, if the configuration says the Spark
      jar is in "/spark/spark.jar", and also says that "/spark" should be
      replaced with "{{SPARK_INSTALL_DIR}}", YARN will start containers
      in the NMs with "{{SPARK_INSTALL_DIR}}/spark.jar" as the location
      of the jar.
      
      Coupled with YARN's environment whitelist (which allows certain
      env variables to be exposed to containers), this allows users to
      support such heterogeneous environments, as long as a single
      replacement is enough. (Otherwise, this feature would need to be
      extended to support multiple path replacements.)
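
A sketch of the configuration pair, assuming the keys landed as `spark.yarn.config.gatewayPath` and `spark.yarn.config.replacementPath`, matching the "/spark" to "{{SPARK_INSTALL_DIR}}" example above:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Path prefix as seen on the gateway (submission) host...
  .set("spark.yarn.config.gatewayPath", "/spark")
  // ...and what to substitute for it when launching containers on the NMs.
  .set("spark.yarn.config.replacementPath", "{{SPARK_INSTALL_DIR}}")
```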
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #6752 from vanzin/SPARK-8302 and squashes the following commits:
      
      4bff8d4 [Marcelo Vanzin] Add docs, rename configs.
      0aa2a02 [Marcelo Vanzin] Only do replacement for paths that need it.
      2e9cc9d [Marcelo Vanzin] Style.
      a5e1f68 [Marcelo Vanzin] [SPARK-8302] Support heterogeneous cluster install paths on YARN.
24. May 29, 2015
      [SPARK-7524] [SPARK-7846] add configs for keytab and principal, pass these two... · a51b133d
      WangTaoTheTonic authored
[SPARK-7524] [SPARK-7846] add configs for keytab and principal, pass these two configs in different ways in different modes
      
* Spark now supports long-running services by updating tokens for the NameNode, but it only accepts parameters passed in "--k=v" format, which is not very convenient. This patch adds spark.* configs that can be set in the properties file and as system properties.
      
* The --principal and --keytab options are passed to the client, but when we start the thrift server or spark-shell these two are also passed into the Main class (org.apache.spark.sql.hive.thriftserver.HiveThriftServer2 and org.apache.spark.repl.Main).
In these two main classes, the arguments passed in are processed by some 3rd-party libraries, which leads to errors such as "Invalid option: --principal" or "Unrecognised option: --principal".
We should pass these command args in a different form, say system properties (see the sketch below).
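
A minimal sketch of the properties route, assuming the keys landed as `spark.yarn.principal` and `spark.yarn.keytab` (the values are invented):

```scala
import org.apache.spark.SparkConf

// Equivalent to --principal/--keytab, but safe for mains (like the thrift
// server or spark-shell) whose argument parsers reject those flags.
val conf = new SparkConf()
  .set("spark.yarn.principal", "user@EXAMPLE.COM")
  .set("spark.yarn.keytab", "/etc/security/keytabs/user.keytab")
```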
      
      Author: WangTaoTheTonic <wangtao111@huawei.com>
      
      Closes #6051 from WangTaoTheTonic/SPARK-7524 and squashes the following commits:
      
      e65699a [WangTaoTheTonic] change logic to loadEnvironments
      ebd9ea0 [WangTaoTheTonic] merge master
ecfe43a [WangTaoTheTonic] pass keytab and principal separately in different mode
      33a7f40 [WangTaoTheTonic] expand the use of the current configs
      08bb4e8 [WangTaoTheTonic] fix wrong cite
      73afa64 [WangTaoTheTonic] add configs for keytab and principal, move originals to internal