  1. Sep 21, 2016
    • [SPARK-4563][CORE] Allow driver to advertise a different network address. · 2cd1bfa4
      Marcelo Vanzin authored
      The goal of this feature is to allow the Spark driver to run in an
      isolated environment, such as a docker container, and be able to use
      the host's port forwarding mechanism to be able to accept connections
      from the outside world.
      
      The change is restricted to the driver: there is no support for achieving
      the same thing on executors (or the YARN AM for that matter). Those still
      need full access to the outside world so that, for example, connections
      can be made to an executor's block manager.
      
The core of the change is simple: add a new configuration that tells the driver
which address to bind to, which can be different from the address it advertises
to executors (spark.driver.host). Everything else is plumbing the new
configuration to where it's needed.
      
To use the feature, the host starting the container needs to set up the
driver's port range to fall into a range that is being forwarded; this
required a driver-specific configuration for the block manager port, which
falls back to the existing spark.blockManager.port when not set. This way,
users can modify the driver settings without affecting the executors; it
would theoretically be nice to also have different retry counts for the driver
and executors, but given that docker (at least) allows forwarding port ranges,
we can probably live without that for now.
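For illustration only, a minimal Scala sketch of what such a driver configuration could look like. The bind-address key name and the host address below are assumptions (the commit only names spark.driver.host, spark.driver.port and spark.driver.blockManager.port, as in the docker example further down); treat this as a sketch, not the definitive API.

```
// Hedged sketch: "spark.driver.bindAddress" is an assumed name for the new
// bind-to setting; the other keys appear in the docker example below.
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("driver-behind-port-forwarding")
  .set("spark.driver.bindAddress", "0.0.0.0")       // address the driver binds to inside the container (assumed key)
  .set("spark.driver.host", "192.0.2.10")           // address advertised to executors, i.e. the forwarding host (example value)
  .set("spark.driver.port", "38000")
  .set("spark.driver.blockManager.port", "38020")   // driver-only block manager port, falls back to spark.blockManager.port
```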
      
      Because of the nature of the feature it's kinda hard to add unit tests;
      I just added a simple one to make sure the configuration works.
      
      This was tested with a docker image running spark-shell with the following
      command:
      
       docker blah blah blah \
         -p 38000-38100:38000-38100 \
         [image] \
         spark-shell \
           --num-executors 3 \
           --conf spark.shuffle.service.enabled=false \
           --conf spark.dynamicAllocation.enabled=false \
           --conf spark.driver.host=[host's address] \
           --conf spark.driver.port=38000 \
           --conf spark.driver.blockManager.port=38020 \
           --conf spark.ui.port=38040
      
      Running on YARN; verified the driver works, executors start up and listen
      on ephemeral ports (instead of using the driver's config), and that caching
      and shuffling (without the shuffle service) works. Clicked through the UI
      to make sure all pages (including executor thread dumps) worked. Also tested
      apps without docker, and ran unit tests.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #15120 from vanzin/SPARK-4563.
    • [SPARK-17219][ML] Add NaN value handling in Bucketizer · 57dc326b
      VinceShieh authored
      ## What changes were proposed in this pull request?
This PR fixes an issue when Bucketizer is called on a dataset containing NaN values.
Sometimes NaN values are still meaningful to users, so in these cases Bucketizer should
reserve one extra bucket for NaN values instead of throwing an exception.
      Before:
      ```
      Bucketizer.transform on NaN value threw an illegal exception.
      ```
      After:
      ```
      NaN values will be grouped in an extra bucket.
      ```
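As a hedged illustration of the new behaviour (spark-shell style, assuming an active `spark` session; column names are made up), NaN rows are expected to land in one extra bucket after the regular ones:

```
import org.apache.spark.ml.feature.Bucketizer

val splits = Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity)
val df = spark.createDataFrame(Seq(-5.0, 3.0, Double.NaN, 42.0).map(Tuple1.apply))
  .toDF("features")

val bucketizer = new Bucketizer()
  .setInputCol("features")
  .setOutputCol("bucketedFeatures")
  .setSplits(splits)

// Previously this threw on the NaN row; now NaN should map to the extra bucket
// (index 3 here, one past the last regular bucket).
bucketizer.transform(df).show()
```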
      ## How was this patch tested?
      New test cases added in `BucketizerSuite`.
Signed-off-by: VinceShieh <vincent.xie@intel.com>
      
      Author: VinceShieh <vincent.xie@intel.com>
      
      Closes #14858 from VinceShieh/spark-17219.
  2. Sep 17, 2016
  3. Sep 14, 2016
  4. Sep 09, 2016
    • Streaming doc correction. · 7098a129
      Satendra Kumar authored
      ## What changes were proposed in this pull request?
      
      Streaming doc correction.
      
      ## How was this patch tested?
      
      
      Author: Satendra Kumar <satendra@knoldus.com>
      
      Closes #14996 from satendrakumar06/patch-1.
  5. Sep 08, 2016
    • [SPARK-15487][WEB UI] Spark Master UI to reverse proxy Application and Workers UI · 92ce8d48
      Gurvinder Singh authored
      ## What changes were proposed in this pull request?
      
This pull request adds the functionality to access worker and application UIs through the master UI itself, which helps in accessing the Spark UI when running a Spark cluster in closed networks, e.g. Kubernetes. The cluster admin only needs to expose the Spark master UI; the rest of the UIs can stay in the private network, and the master UI will reverse proxy connection requests to the corresponding resource. It adds the paths for worker/application UIs as
      
      WorkerUI: <http/https>://master-publicIP:<port>/target/workerID/
      ApplicationUI: <http/https>://master-publicIP:<port>/target/appID/
      
This makes it easy for users to protect access to the Spark cluster by putting a reverse proxy, e.g. https://github.com/bitly/oauth2_proxy, in front of the master UI.
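A hedged sketch of how such a deployment might be configured; the property names below are assumptions (the PR text does not spell them out), and the point is only that the master is told to proxy and where it is reachable from outside:

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ui.reverseProxy", "true")                          // assumed switch for the reverse-proxy behaviour
  .set("spark.ui.reverseProxyUrl", "https://spark.example.com")  // assumed externally visible master URL
```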
      
      ## How was this patch tested?
      
The functionality has been tested manually, and there is also a unit test for accessing the worker UI through the reverse proxy address.
      
      pwendell bomeng BryanCutler can you please review it, thanks.
      
      Author: Gurvinder Singh <gurvinder.singh@uninett.no>
      
      Closes #13950 from gurvindersingh/rproxy.
  6. Sep 01, 2016
  7. Aug 31, 2016
    • [SPARK-17178][SPARKR][SPARKSUBMIT] Allow to set sparkr shell command through --conf · fa634793
      Jeff Zhang authored
      ## What changes were proposed in this pull request?
      
Allow users to set the sparkr shell command through --conf spark.r.shell.command
      
      ## How was this patch tested?
      
A unit test is added, and it was also verified manually through
      ```
      bin/sparkr --master yarn-client --conf spark.r.shell.command=/usr/local/bin/R
      ```
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #14744 from zjffdu/SPARK-17178.
  8. Aug 30, 2016
    • [SPARK-17243][WEB UI] Spark 2.0 History Server won't load with very large application history · f7beae6d
      Alex Bozarth authored
      ## What changes were proposed in this pull request?
      
With the new History Server the summary page loads the application list via the REST API, which makes it very slow or even impossible to load with a large (10K+) application history. This PR fixes this by adding the `spark.history.ui.maxApplications` conf to limit the number of applications the History Server displays. This is accomplished using a new optional `limit` param for the `applications` API. (Note this only applies to what the summary page displays; all the Application UIs are still accessible if the user knows the App ID and goes to the Application UI directly.)
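For illustration (the history server host and port below are assumed), the conf caps what the summary page shows, while the REST API exposes the same cap through the new `limit` param:

```
// spark-defaults.conf (or SPARK_HISTORY_OPTS) would carry:
//   spark.history.ui.maxApplications  50

// Fetching a bounded application list directly via the REST API:
import scala.io.Source

val apps = Source.fromURL("http://history-server:18080/api/v1/applications?limit=50").mkString
println(apps.take(200))
```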
      
      I've also added a new test for the `limit` param in `HistoryServerSuite.scala`
      
      ## How was this patch tested?
      
      Manual testing and dev/run-tests
      
      Author: Alex Bozarth <ajbozart@us.ibm.com>
      
      Closes #14835 from ajbozarth/spark17243.
    • [SPARK-5682][CORE] Add encrypted shuffle in spark · 4b4e329e
      Ferdinand Xu authored
This patch uses the Apache Commons Crypto library to enable shuffle encryption support.
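A hedged sketch of how this might be switched on; the property names below are assumptions, since the commit text only says the encryption is backed by Apache Commons Crypto:

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.io.encryption.enabled", "true")      // assumed on/off switch for shuffle encryption
  .set("spark.io.encryption.keySizeBits", "256")   // assumed key-size knob
```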
      
      Author: Ferdinand Xu <cheng.a.xu@intel.com>
      Author: kellyzly <kellyzly@126.com>
      
      Closes #8880 from winningsix/SPARK-10771.
    • [MINOR][DOCS] Fix minor typos in python example code · d4eee993
      Dmitriy Sokolov authored
      ## What changes were proposed in this pull request?
      
Fix minor typos in the Python example code in the streaming programming guide
      
      ## How was this patch tested?
      
      N/A
      
      Author: Dmitriy Sokolov <silentsokolov@gmail.com>
      
      Closes #14805 from silentsokolov/fix-typos.
  9. Aug 29, 2016
    • fixed a typo · 08913ce0
      Seigneurin, Alexis (CONT) authored
      idempotant -> idempotent
      
      Author: Seigneurin, Alexis (CONT) <Alexis.Seigneurin@capitalone.com>
      
      Closes #14833 from aseigneurin/fix-typo.
  10. Aug 27, 2016
    • [SPARK-17001][ML] Enable standardScaler to standardize sparse vectors when withMean=True · e07baf14
      Sean Owen authored
      ## What changes were proposed in this pull request?
      
      Allow centering / mean scaling of sparse vectors in StandardScaler, if requested. This is for compatibility with `VectorAssembler` in common usages.
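A small hedged sketch of the behaviour described above (spark-shell style, `spark` session assumed): centering sparse input with withMean = true now works instead of being rejected.

```
import org.apache.spark.ml.feature.StandardScaler
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
  Tuple1(Vectors.sparse(3, Array(0, 2), Array(1.0, 3.0))),
  Tuple1(Vectors.sparse(3, Array(1), Array(2.0)))
)).toDF("features")

val scaler = new StandardScaler()
  .setInputCol("features")
  .setOutputCol("scaled")
  .setWithMean(true)   // previously not allowed for sparse vectors
  .setWithStd(true)

scaler.fit(df).transform(df).show(false)
```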
      
      ## How was this patch tested?
      
Jenkins tests, including new cases to reflect the new behavior.
      
      Author: Sean Owen <sowen@cloudera.com>
      
      Closes #14663 from srowen/SPARK-17001.
  11. Aug 26, 2016
    • [SPARK-16967] move mesos to module · 8e5475be
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
Move the Mesos code into a Maven module
      
      ## How was this patch tested?
      
      unit tests
      manually submitting a client mode and cluster mode job
      spark/mesos integration test suite
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      
      Closes #14637 from mgummelt/mesos-module.
  12. Aug 25, 2016
  13. Aug 24, 2016
  14. Aug 23, 2016
  15. Aug 22, 2016
  16. Aug 21, 2016
    • [SPARK-17002][CORE] Document that spark.ssl.protocol. is required for SSL · e328f577
      wm624@hotmail.com authored
      ## What changes were proposed in this pull request?
      
When `spark.ssl.enabled` is true but `spark.ssl.protocol` is not set, startup fails with a meaningless exception. `spark.ssl.protocol` is required when `spark.ssl.enabled` is true.

Improvement: require `spark.ssl.protocol` when initializing the SSLContext, and otherwise throw an exception that says so.
      
      Remove the OrElse("default").
      
      Document this requirement in configure.md
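A minimal sketch of the rule being documented (the values are only examples): the two settings now have to be set together.

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.ssl.enabled", "true")
  // Required whenever spark.ssl.enabled is true; leaving it out now fails fast
  // with the "spark.ssl.protocol is required" message shown in the log below.
  .set("spark.ssl.protocol", "TLSv1.2")
```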
      
      ## How was this patch tested?
      
      
      Manual tests:
      Build document and check document
      
Configuring `spark.ssl.enabled` only throws the exception below:
      6/08/16 16:04:37 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users  with view permissions: Set(mwang); groups with view permissions: Set(); users  with modify permissions: Set(mwang); groups with modify permissions: Set()
      Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: spark.ssl.protocol is required when enabling SSL connections.
      	at scala.Predef$.require(Predef.scala:224)
      	at org.apache.spark.SecurityManager.<init>(SecurityManager.scala:285)
      	at org.apache.spark.deploy.master.Master$.startRpcEnvAndEndpoint(Master.scala:1026)
      	at org.apache.spark.deploy.master.Master$.main(Master.scala:1011)
      	at org.apache.spark.deploy.master.Master.main(Master.scala)
      
Configuring `spark.ssl.enabled` and `spark.ssl.protocol` together works fine.
      
      Author: wm624@hotmail.com <wm624@hotmail.com>
      
      Closes #14674 from wangmiao1981/ssl.
  17. Aug 18, 2016
  18. Aug 16, 2016
    • [SPARK-17089][DOCS] Remove api doc link for mapReduceTriplets operator · e28a8c58
      sandy authored
      ## What changes were proposed in this pull request?
      
Remove the API doc link for the mapReduceTriplets operator: it has been removed in the latest API, so users following the link would not find mapReduceTriplets there. It is better to remove the link than to confuse users.
      
      ## How was this patch tested?
      Run all the test cases
      
      ![screenshot from 2016-08-16 23-08-25](https://cloud.githubusercontent.com/assets/8075390/17709393/8cfbf75a-6406-11e6-98e6-38f7b319d833.png)
      
      Author: sandy <phalodi@gmail.com>
      
      Closes #14669 from phalodi/SPARK-17089.
    • [MINOR][DOC] Correct code snippet results in quick start documentation · 6f0988b1
      linbojin authored
      ## What changes were proposed in this pull request?
      
As the README.md file is updated over time, some code snippet outputs are no longer correct relative to the new README.md file. For example:
      ```
      scala> textFile.count()
      res0: Long = 126
      ```
      should be
      ```
      scala> textFile.count()
      res0: Long = 99
      ```
This PR adds comments to point out this problem so that new Spark learners have a correct reference.
It also fixes a small bug: in the current documentation, the outputs of linesWithSpark.count() without and with cache are different (one is 15 and the other is 19):
      ```
      scala> val linesWithSpark = textFile.filter(line => line.contains("Spark"))
      linesWithSpark: org.apache.spark.rdd.RDD[String] = MapPartitionsRDD[2] at filter at <console>:27
      
      scala> textFile.filter(line => line.contains("Spark")).count() // How many lines contain "Spark"?
      res3: Long = 15
      
      ...
      
      scala> linesWithSpark.cache()
      res7: linesWithSpark.type = MapPartitionsRDD[2] at filter at <console>:27
      
      scala> linesWithSpark.count()
      res8: Long = 19
      ```
      
      ## How was this patch tested?
      
      manual test:  run `$ SKIP_API=1 jekyll serve --watch`
      
      Author: linbojin <linbojin203@gmail.com>
      
      Closes #14645 from linbojin/quick-start-documentation.
  19. Aug 13, 2016
    • [SPARK-12370][DOCUMENTATION] Documentation should link to examples … · e46cb78b
      Jagadeesan authored
      ## What changes were proposed in this pull request?
      
When the documentation is built, it should reference examples from the same build. There are times when the docs have links that point to files at the GitHub head, which may not be valid for the current release. Changed the URLs to point to the right tag in git using ```SPARK_VERSION_SHORT```.
      
      …from its own release version] [Streaming programming guide]
      
      Author: Jagadeesan <as2@us.ibm.com>
      
      Closes #14596 from jagadeesanas2/SPARK-12370.
  20. Aug 12, 2016
  21. Aug 11, 2016
    • [SPARK-13081][PYSPARK][SPARK_SUBMIT] Allow set pythonExec of driver and executor through conf… · 7a9e25c3
      Jeff Zhang authored
Before this PR, users had to export environment variables to specify the Python executable for the driver & executors, which is not so convenient. This PR allows users to specify the Python executable through the configurations "--pyspark-driver-python" & "--pyspark-executor-python".
      
Manually tested in local & yarn mode for pyspark-shell and pyspark batch mode.
      
      Author: Jeff Zhang <zjffdu@apache.org>
      
      Closes #13146 from zjffdu/SPARK-13081.
    • [SPARK-16886][EXAMPLES][DOC] Fix some examples to be consistent and indentation in documentation · 7186e8c3
      hyukjinkwon authored
      ## What changes were proposed in this pull request?
      
Originally this PR was based on #14491, but I realised that fixing the examples is more sensible than fixing the comments.
      
      This PR fixes three things below:
      
       - Fix two wrong examples in `structured-streaming-programming-guide.md`. Loading via `read.load(..)` without `as` will be `Dataset<Row>` not `Dataset<String>` in Java.
      
      - Fix indentation across `structured-streaming-programming-guide.md`. Python has 4 spaces and Scala and Java have double spaces. These are inconsistent across the examples.
      
- Fix `StructuredNetworkWordCountWindowed` and `StructuredNetworkWordCount` in Java and Scala to initially load a `DataFrame` and `Dataset<Row>`, to be consistent with the comments and some examples in `structured-streaming-programming-guide.md`, and to match the Scala and Java versions to the Python one (the Python one loads it as a `DataFrame` initially).
      
      ## How was this patch tested?
      
      N/A
      
      Closes https://github.com/apache/spark/pull/14491
      
      Author: hyukjinkwon <gurwls223@gmail.com>
      Author: Ganesh Chand <ganeshchand@Ganeshs-MacBook-Pro-2.local>
      
      Closes #14564 from HyukjinKwon/SPARK-16886.
    • Correct example value for spark.ssl.YYY.XXX settings · 8a6b7037
      Andrew Ash authored
Docs adjustment to:
- link to another relevant section of the docs
- correct a statement claiming only one value is supported when other values are actually supported
      
      Author: Andrew Ash <andrew@andrewash.com>
      
      Closes #14581 from ash211/patch-10.
    • [SPARK-17010][MINOR][DOC] Wrong description in memory management document · 7a6a3c3f
      Tao Wang authored
      ## What changes were proposed in this pull request?
      
Change the remaining-memory percentage to the right one.
      
      ## How was this patch tested?
      
      Manual review
      
      Author: Tao Wang <wangtao111@huawei.com>
      
      Closes #14591 from WangTaoTheTonic/patch-1.
  22. Aug 10, 2016
    • [SPARK-14743][YARN] Add a configurable credential manager for Spark running on YARN · ab648c00
      jerryshao authored
      ## What changes were proposed in this pull request?
      
Add a configurable token manager for Spark running on YARN.
      
      ### Current Problems ###
      
1. The supported token providers are hard-coded; currently only HDFS, HBase and Hive are supported, and it is impossible for users to add a new token provider without code changes.
2. The same problem exists in the periodic token renewer and updater.
      
      ### Changes In This Proposal ###
      
To address the problems mentioned above and make the current code cleaner and easier to understand, this proposal mainly has 3 changes:
      
1. Abstract a `ServiceTokenProvider` as well as a `ServiceTokenRenewable` interface for token providers. Each service that wants to communicate with Spark via tokens needs to implement this interface.
2. Provide a `ConfigurableTokenManager` to manage all the registered token providers, as well as the token renewer and updater. This class also offers the API for other modules to obtain tokens, get the renewal interval and so on.
3. Implement 3 built-in token providers `HDFSTokenProvider`, `HiveTokenProvider` and `HBaseTokenProvider` to keep the same semantics as supported today. Whether to load these built-in token providers is controlled by the configuration "spark.yarn.security.tokens.${service}.enabled"; by default all the built-in token providers are loaded.
      
      ### Behavior Changes ###
      
      For the end user there's no behavior change, we still use the same configuration `spark.yarn.security.tokens.${service}.enabled` to decide which token provider is enabled (hbase or hive).
      
A user-implemented token provider (assume its name is "test") that needs to be added should have two configurations:
      
      1. `spark.yarn.security.tokens.test.enabled` to true
      2. `spark.yarn.security.tokens.test.class` to the full qualified class name.
      
So we still keep the same semantics as the current code while adding one new configuration; a short sketch follows.
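A hedged sketch for a user-supplied provider named "test" (the class name below is made up):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.yarn.security.tokens.test.enabled", "true")
  .set("spark.yarn.security.tokens.test.class", "com.example.security.TestTokenProvider") // hypothetical class
```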
      
      ### Current Status ###
      
      - [x] token provider interface and management framework.
      - [x] implement built-in token providers (hdfs, hbase, hive).
      - [x] Coverage of unit test.
      - [x] Integrated test with security cluster.
      
      ## How was this patch tested?
      
      Unit test and integrated test.
      
      Please suggest and review, any comment is greatly appreciated.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #14065 from jerryshao/SPARK-16342.
    • [SPARK-16927][SPARK-16923] Override task properties at dispatcher. · eca58755
      Timothy Chen authored
      ## What changes were proposed in this pull request?
      
      - enable setting default properties for all jobs submitted through the dispatcher [SPARK-16927]
      - remove duplication of conf vars on cluster submitted jobs [SPARK-16923] (this is a small fix, so I'm including in the same PR)
      
      ## How was this patch tested?
      
      mesos/spark integration test suite
      manual testing
      
      Author: Timothy Chen <tnachen@gmail.com>
      
      Closes #14511 from mgummelt/override-props.
  23. Aug 09, 2016
    • [SPARK-16956] Make ApplicationState.MAX_NUM_RETRY configurable · b89b3a5c
      Josh Rosen authored
      ## What changes were proposed in this pull request?
      
      This patch introduces a new configuration, `spark.deploy.maxExecutorRetries`, to let users configure an obscure behavior in the standalone master where the master will kill Spark applications which have experienced too many back-to-back executor failures. The current setting is a hardcoded constant (10); this patch replaces that with a new cluster-wide configuration.
      
      **Background:** This application-killing was added in 6b5980da (from September 2012) and I believe that it was designed to prevent a faulty application whose executors could never launch from DOS'ing the Spark cluster via an infinite series of executor launch attempts. In a subsequent patch (#1360), this feature was refined to prevent applications which have running executors from being killed by this code path.
      
      **Motivation for making this configurable:** Previously, if a Spark Standalone application experienced more than `ApplicationState.MAX_NUM_RETRY` executor failures and was left with no executors running then the Spark master would kill that application, but this behavior is problematic in environments where the Spark executors run on unstable infrastructure and can all simultaneously die. For instance, if your Spark driver runs on an on-demand EC2 instance while all workers run on ephemeral spot instances then it's possible for all executors to die at the same time while the driver stays alive. In this case, it may be desirable to keep the Spark application alive so that it can recover once new workers and executors are available. In order to accommodate this use-case, this patch modifies the Master to never kill faulty applications if `spark.deploy.maxExecutorRetries` is negative.
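A minimal sketch of the new knob as described above; a negative value keeps the application alive regardless of repeated executor failures, while any non-negative value acts as the retry cap.

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.deploy.maxExecutorRetries", "-1")   // negative: the master never kills the app for executor failures
```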
      
      I'd like to merge this patch into master, branch-2.0, and branch-1.6.
      
      ## How was this patch tested?
      
      I tested this manually using `spark-shell` and `local-cluster` mode. This is a tricky feature to unit test and historically this code has not changed very often, so I'd prefer to skip the additional effort of adding a testing framework and would rather rely on manual tests and review for now.
      
      Author: Josh Rosen <joshrosen@databricks.com>
      
      Closes #14544 from JoshRosen/add-setting-for-max-executor-failures.
    • [SPARK-16809] enable history server links in dispatcher UI · 62e62124
      Michael Gummelt authored
      ## What changes were proposed in this pull request?
      
      Links the Spark Mesos Dispatcher UI to the history server UI
      
- adds spark.mesos.dispatcher.historyServer.url (see the sketch below)
      - explicitly generates frameworkIDs for the launched drivers, so the dispatcher knows how to correlate drivers and frameworkIDs
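An illustrative sketch of the new setting (the URL is just an example value):

```
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.mesos.dispatcher.historyServer.url", "http://history-server.example.com:18080")
```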
      
      ## How was this patch tested?
      
      manual testing
      
      Author: Michael Gummelt <mgummelt@mesosphere.io>
      Author: Sergiusz Urbaniak <sur@mesosphere.io>
      
      Closes #14414 from mgummelt/history-server.