  1. Dec 03, 2015
    • [SPARK-12101][CORE] Fix thread pools that cannot cache tasks in Worker and AppClient · 649be4fa
      Shixiong Zhu authored
      `SynchronousQueue` cannot cache any task. This issue is similar to #9978. It's an easy fix. Just use the fixed `ThreadUtils.newDaemonCachedThreadPool`.
      
      Author: Shixiong Zhu <shixiong@databricks.com>
      
      Closes #10108 from zsxwing/fix-threadpool.
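
      As a point of reference, here is a minimal sketch of a daemon cached thread pool that can still queue tasks, in the spirit of `ThreadUtils.newDaemonCachedThreadPool`; the object name and exact parameters below are illustrative assumptions, not the actual Spark code.

      ```scala
      import java.util.concurrent.atomic.AtomicInteger
      import java.util.concurrent.{LinkedBlockingQueue, ThreadFactory, ThreadPoolExecutor, TimeUnit}

      // Illustrative sketch only: a daemon pool backed by an unbounded queue, so tasks
      // are buffered when all threads are busy. A pool built on SynchronousQueue hands
      // tasks directly to a waiting thread and cannot queue ("cache") them.
      object DaemonPools {
        def newDaemonCachedThreadPool(prefix: String, maxThreads: Int): ThreadPoolExecutor = {
          val counter = new AtomicInteger(0)
          val factory = new ThreadFactory {
            override def newThread(r: Runnable): Thread = {
              val t = new Thread(r, s"$prefix-${counter.incrementAndGet()}")
              t.setDaemon(true)
              t
            }
          }
          val pool = new ThreadPoolExecutor(
            maxThreads, maxThreads,            // core == max so the queue is actually used
            60L, TimeUnit.SECONDS,
            new LinkedBlockingQueue[Runnable](),
            factory)
          pool.allowCoreThreadTimeOut(true)    // let idle threads die off, like a cached pool
          pool
        }
      }
      ```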
    • [SPARK-11314][YARN] add service API and test service for Yarn Cluster schedulers · 8fa3e474
      Steve Loughran authored
      This is purely the yarn/src/main and yarn/src/test bits of the YARN ATS integration: the extension model to load and run implementations of `SchedulerExtensionService` in the yarn cluster scheduler process, and to stop them afterwards.
      
      There's duplication between the two schedulers, yarn-client and yarn-cluster, at least in terms of setting everything up, because the common superclass, `YarnSchedulerBackend` is in spark-core, and the extension services need the YARN app/attempt IDs.
      
      If you look at how the extension services are loaded, the case class `SchedulerExtensionServiceBinding` is used to pass in config info, currently just the spark context and the yarn IDs, of which one, the attemptID, will be null when running client-side. I'm passing in a case class so that extra arguments can be added to the binding class in the future while the method signature stays the same, so existing services can still be loaded.
      
      There's no functional extension service here, just one for testing. The real tests come in the bigger pull requests. At the same time, there's no restriction of this extension service purely to the ATS history publisher. Anything else that wants to listen to the spark context and publish events could use this, and I'd also consider writing one for the YARN-913 registry service, so that the URLs of the web UI would be locatable through that (low priority; would make more sense if integrated with a REST client).
      
      There's no minicluster test. Given the test execution overhead of setting up minicluster tests, it'd probably be better to add an extension service into one of the existing tests.
      
      Author: Steve Loughran <stevel@hortonworks.com>
      
      Closes #9182 from steveloughran/stevel/feature/SPARK-1537-service.
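
      For orientation, a rough sketch of the extension point described above; only the `SchedulerExtensionService` and `SchedulerExtensionServiceBinding` names come from the commit message, and the field and method shapes below are assumptions.

      ```scala
      import org.apache.hadoop.yarn.api.records.{ApplicationAttemptId, ApplicationId}
      import org.apache.spark.SparkContext

      // Assumed shape of the binding: the spark context plus the YARN IDs; the
      // attempt ID is absent when running client-side.
      case class SchedulerExtensionServiceBinding(
          sparkContext: SparkContext,
          applicationId: ApplicationId,
          attemptId: Option[ApplicationAttemptId] = None)

      // Assumed lifecycle of an extension service loaded by the YARN scheduler backend.
      trait SchedulerExtensionService {
        /** Start the service with the binding information. */
        def start(binding: SchedulerExtensionServiceBinding): Unit

        /** Stop the service when the scheduler backend shuts down. */
        def stop(): Unit
      }
      ```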
  2. Dec 01, 2015
  3. Nov 23, 2015
  4. Nov 19, 2015
  5. Nov 17, 2015
    • [SPARK-11771][YARN][TRIVIAL] maximum memory in yarn is controlled by two... · 52c734b5
      Holden Karau authored
      [SPARK-11771][YARN][TRIVIAL] maximum memory in yarn is controlled by two params have both in error msg
      
      When we exceed the max memory, tell users to increase both params instead of just one.
      
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #9758 from holdenk/SPARK-11771-maximum-memory-in-yarn-is-controlled-by-two-params-have-both-in-error-msg.
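
      For context, a hedged sketch of the kind of check this error message accompanies; naming the two YARN settings as `yarn.scheduler.maximum-allocation-mb` and `yarn.nodemanager.resource.memory-mb` is an assumption based on the commit title, and the method below is illustrative only.

      ```scala
      object MemoryCheck {
        // Illustrative only: reject an executor size that YARN can never allocate, and
        // name both of the settings that bound the maximum container size.
        def verifyExecutorMemory(executorMemoryMb: Int, overheadMb: Int, maxAllocationMb: Int): Unit = {
          val requested = executorMemoryMb + overheadMb
          if (requested > maxAllocationMb) {
            throw new IllegalArgumentException(
              s"Required executor memory ($executorMemoryMb MB) plus overhead ($overheadMb MB) " +
                s"is above the max threshold ($maxAllocationMb MB) of this cluster! Please check " +
                "the values of 'yarn.scheduler.maximum-allocation-mb' and/or " +
                "'yarn.nodemanager.resource.memory-mb'.")
          }
        }
      }
      ```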
  6. Nov 16, 2015
    • [SPARK-11718][YARN][CORE] Fix explicitly killed executor dies silently issue · 24477d27
      jerryshao authored
      Currently, if dynamic allocation is enabled, explicitly killing an executor gets no response, so the executor metadata on the driver side is wrong, which makes dynamic allocation on YARN fail to work.
      
      The problem is that `disableExecutor` returns false for executors pending kill when `onDisconnect` is detected, so no further handling happens.
      
      One solution is to let these explicitly killed executors fall through to `super.onDisconnect` to remove the executor. This is simple.
      
      Another solution is to still query the loss reason for these explicitly killed executors. Since an executor may be killed and reported in the same AM-RM communication, the current way of adding a pending loss-reason request does not work (the container-complete event has already been processed), so we should store this loss reason for a later query.
      
      Here this PR chooses solution 2.
      
      Please help to review. vanzin, I think this part was changed by you previously; would you please help to review? Thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #9684 from jerryshao/SPARK-11718.
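
      A minimal sketch of the "store the loss reason for a later query" idea behind solution 2; all names here are illustrative, not the actual YarnAllocator fields.

      ```scala
      import scala.collection.mutable

      // Remember the loss reason when the container-complete event arrives, so a
      // later query for that executor can still be answered.
      class LossReasonTracker {
        private val storedLossReasons = mutable.HashMap[String, String]()    // executorId -> reason
        private val pendingQueries = mutable.HashMap[String, String => Unit]() // executorId -> callback

        def onContainerCompleted(executorId: String, reason: String): Unit = synchronized {
          pendingQueries.remove(executorId) match {
            case Some(reply) => reply(reason)                   // a query is already waiting
            case None => storedLossReasons(executorId) = reason // store for a later query
          }
        }

        def queryLossReason(executorId: String)(reply: String => Unit): Unit = synchronized {
          storedLossReasons.remove(executorId) match {
            case Some(reason) => reply(reason)               // already completed; answer immediately
            case None => pendingQueries(executorId) = reply  // wait for the completion event
          }
        }
      }
      ```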
  7. Nov 10, 2015
  8. Nov 06, 2015
    • [SPARK-11555] spark on yarn spark-class --num-workers doesn't work · f6680cdc
      Thomas Graves authored
      I tested the various options with both spark-submit and spark-class of specifying number of executors in both client and cluster mode where it applied.
      
      --num-workers, --num-executors, spark.executor.instances, SPARK_EXECUTOR_INSTANCES, default nothing supplied
      
      Author: Thomas Graves <tgraves@staydecay.corp.gq1.yahoo.com>
      
      Closes #9523 from tgravescs/SPARK-11555.
  9. Nov 04, 2015
    • [SPARK-10622][CORE][YARN] Differentiate dead from "mostly dead" executors. · 8790ee6d
      Marcelo Vanzin authored
      In YARN mode, when preemption is enabled, we may leave executors in a
      zombie state while we wait to retrieve the reason for which the executor
      exited. This is so that we don't account for failed tasks that were
      running on a preempted executor.
      
      The issue is that while we wait for this information, the scheduler
      might decide to schedule tasks on the executor, which will never be
      able to run them. Other side effects include the block manager still
      considering the executor available to cache blocks, for example.
      
      So, when we know that an executor went down but we don't know why,
      stop everything related to the executor, except its running tasks.
      Only when we know the reason for the exit (or give up waiting for
      it) do we update the running tasks.
      
      This is achieved by a new `disableExecutor()` method in the
      `Schedulable` interface. For managers that do not behave like this
      (i.e. every one but YARN), the existing `executorLost()` method
      will behave the same way it did before.
      
      On top of that change, a few minor changes that made debugging easier,
      and fixed some other minor issues:
      - The cluster-mode AM was printing a misleading log message every
        time an executor disconnected from the driver (because the akka
        actor system was shared between driver and AM).
      - Avoid sending unnecessary requests for an executor's exit reason
        when we already know it was explicitly disabled / killed. This
        avoids both multiple requests, and unnecessary requests that would
        just cause warning messages on the AM (in the explicit kill case).
      - Tone down a log message about the executor being lost when it
        exited normally (e.g. preemption)
      - Wake up the AM monitor thread when requests for executor loss
        reasons arrive too, so that we can more quickly remove executors
        from this zombie state.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8887 from vanzin/SPARK-10622.
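
      A simplified sketch of the "mostly dead" bookkeeping described above, assuming a split between disabling an executor and finally processing its loss; the class and method bodies are illustrative, not the real scheduler code.

      ```scala
      import scala.collection.mutable

      // Stop offering the executor resources while its running tasks are kept
      // alive until the exit reason is known (or we give up waiting).
      class ExecutorBook {
        private val activeExecutors = mutable.HashSet[String]()
        private val disabledExecutors = mutable.HashSet[String]() // zombie: no new work, tasks kept

        def registerExecutor(id: String): Unit = activeExecutors += id

        /** Called when we know the executor is gone but not yet why. */
        def disableExecutor(id: String): Boolean = {
          if (activeExecutors.remove(id)) { disabledExecutors += id; true } else false
        }

        /** Called once the exit reason arrives (or the wait times out). */
        def executorLost(id: String, exitCausedByApp: Boolean): Unit = {
          disabledExecutors -= id
          // ...here the real scheduler would fail or resubmit the executor's tasks,
          // counting them against spark.task.maxFailures only if exitCausedByApp.
        }

        def canScheduleOn(id: String): Boolean = activeExecutors.contains(id)
      }
      ```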
  10. Nov 02, 2015
    • [SPARK-10997][CORE] Add "client mode" to netty rpc env. · 71d1c907
      Marcelo Vanzin authored
      "Client mode" means the RPC env will not listen for incoming connections.
      This allows certain processes in the Spark stack (such as Executors or
      the YARN client-mode AM) to act as pure clients when using the netty-based
      RPC backend, reducing the number of sockets needed by the app and also the
      number of open ports.
      
      Client connections are also preferred when endpoints that actually have
      a listening socket are involved; so, for example, if a Worker connects
      to a Master and the Master needs to send a message to a Worker endpoint,
      that client connection will be used, even though the Worker is also
      listening for incoming connections.
      
      With this change, the workaround for SPARK-10987 isn't necessary anymore, and
      is removed. The AM connects to the driver in "client mode", and that connection
      is used for all driver <-> AM communication, and so the AM is properly notified
      when the connection goes down.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9210 from vanzin/SPARK-10997.
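
      A small sketch of the connection-reuse behaviour described in the second paragraph: if a client connection to a peer already exists, it is preferred over dialing the peer's listening port. The registry type below is illustrative, not the netty RPC backend's real classes.

      ```scala
      import java.util.concurrent.ConcurrentHashMap

      // Track connections we initiated and reuse them for replies in the other direction.
      class OutboxRegistry[Conn](dial: String => Conn) {
        private val clientConnections = new ConcurrentHashMap[String, Conn]()

        /** Record a connection we initiated (e.g. Worker -> Master). */
        def registerOutbound(peer: String, conn: Conn): Unit = clientConnections.put(peer, conn)

        /** Prefer an existing client connection; only dial if none exists. */
        def connectionFor(peer: String): Conn = {
          val existing = clientConnections.get(peer)
          if (existing != null) existing else dial(peer)
        }
      }
      ```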
    • [SPARK-9817][YARN] Improve the locality calculation of containers by taking... · a930e624
      jerryshao authored
      [SPARK-9817][YARN] Improve the locality calculation of containers by taking pending container requests into consideration
      
      This is a follow-up PR to further improve the locality calculation by considering pending container requests. Since the locality preferences of tasks may shift from time to time, the current localities of pending container requests may not fully match the new preferences; this PR improves that by removing outdated, unmatched container requests and replacing them with new requests.
      
      sryza please help to review, thanks a lot.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #8100 from jerryshao/SPARK-9817.
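
      A sketch of the refresh idea, assuming a simple model where each pending request carries its preferred node set; the data shapes are purely illustrative.

      ```scala
      // Purely illustrative data shape for a pending container request.
      case class ContainerRequest(nodes: Set[String])

      object LocalityRefresh {
        /** Returns (new pending set, requests to cancel). */
        def refresh(pending: Seq[ContainerRequest], preferredNodes: Set[String])
            : (Seq[ContainerRequest], Seq[ContainerRequest]) = {
          // keep requests that still target preferred nodes; cancel and re-issue the rest
          val (stillUseful, outdated) = pending.partition(_.nodes.exists(preferredNodes.contains))
          val replacements = outdated.map(_ => ContainerRequest(preferredNodes))
          (stillUseful ++ replacements, outdated)
        }
      }
      ```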
  11. Nov 01, 2015
  12. Oct 31, 2015
  13. Oct 27, 2015
    • [SPARK-11178] Improving naming around task failures. · b960a890
      Kay Ousterhout authored
      Commit af3bc59d introduced new
      functionality so that if an executor dies for a reason that's not
      caused by one of the tasks running on the executor (e.g., due to
      pre-emption), Spark doesn't count the failure towards the maximum
      number of failures for the task.  That commit introduced some vague
      naming that this commit attempts to fix; in particular:
      
      (1) The variable "isNormalExit", which was used to refer to cases where
      the executor died for a reason unrelated to the tasks running on the
      machine, has been renamed (and reversed) to "exitCausedByApp". The problem
      with the existing name is that it's not clear (at least to me!) what it
      means for an exit to be "normal"; the new name is intended to make the
      purpose of this variable more clear.
      
      (2) The variable "shouldEventuallyFailJob" has been renamed to
      "countTowardsTaskFailures". This variable is used to determine whether
      a task's failure should be counted towards the maximum number of failures
      allowed for a task before the associated Stage is aborted. The problem
      with the existing name is that it can be confused with implying that
      the task's failure should immediately cause the stage to fail because it
      is somehow fatal (this is the case for a fetch failure, for example: if
      a task fails because of a fetch failure, there's no point in retrying,
      and the whole stage should be failed).
      
      Author: Kay Ousterhout <kayousterhout@gmail.com>
      
      Closes #9164 from kayousterhout/SPARK-11178.
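
      For illustration, a tiny sketch of how the two renamed flags relate, assuming an executor-exit description of roughly this shape; the case class here is a stand-in, not the exact Spark type.

      ```scala
      // Stand-in for the executor loss reason carrying the renamed flag.
      case class ExecutorExited(exitCode: Int, exitCausedByApp: Boolean, reason: String)

      object TaskFailurePolicy {
        /** Count the failure towards spark.task.maxFailures only if the app caused the exit. */
        def countTowardsTaskFailures(exit: ExecutorExited): Boolean = exit.exitCausedByApp
      }
      ```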
  14. Oct 20, 2015
    • [SPARK-11105] [YARN] Distribute log4j.properties to executors · 2f6dd634
      vundela authored
      Currently the log4j.properties file is not uploaded to executors, which leads them to use the default values. This fix makes sure that the file is always uploaded to the distributed cache so that executors use the latest settings.
      
      If the user specifies log configurations through --files, then executors will pick up the configs from --files instead of $SPARK_CONF_DIR/log4j.properties.
      
      Author: vundela <vsr@cloudera.com>
      Author: Srinivasa Reddy Vundela <vsr@cloudera.com>
      
      Closes #9118 from vundela/master.
    • [SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9 · e18b571c
      Holden Karau authored
      Upgrade to Py4j0.9
      
      Author: Holden Karau <holden@pigscanfly.ca>
      Author: Holden Karau <holden@us.ibm.com>
      
      Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.
  15. Oct 19, 2015
  16. Oct 17, 2015
  17. Oct 13, 2015
  18. Oct 12, 2015
    • [SPARK-10739] [YARN] Add application attempt window for Spark on Yarn · f97e9323
      jerryshao authored
      Add an application attempt window for Spark on YARN to ignore old, out-of-window failures; this is useful for long-running applications recovering from failures.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #8857 from jerryshao/SPARK-10739 and squashes the following commits:
      
      36eabdc [jerryshao] change the doc
      7f9b77d [jerryshao] Style change
      1c9afd0 [jerryshao] Address the comments
      caca695 [jerryshao] Add application attempt window for Spark on Yarn
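
      As a usage note, the window is configured through a YARN-specific property; the key used below is assumed to be the one this change adds, so check the docs of your Spark version before relying on it.

      ```scala
      import org.apache.spark.SparkConf

      object AttemptWindowExample {
        def main(args: Array[String]): Unit = {
          // Assumed key: only AM failures within the last hour count towards
          // the application's maximum-attempts limit.
          val conf = new SparkConf()
            .set("spark.yarn.am.attemptFailuresValidityInterval", "1h")
          println(conf.get("spark.yarn.am.attemptFailuresValidityInterval"))
        }
      }
      ```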
    • [SPARK-11023] [YARN] Avoid creating URIs from local paths directly. · 149472a0
      Marcelo Vanzin authored
      The issue is that local paths on Windows, when provided with drive
      letters or backslashes, are not valid URIs.
      
      Instead of trying to figure out whether paths are URIs or not, use
      Utils.resolveURI() which does that for us.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9049 from vanzin/SPARK-11023 and squashes the following commits:
      
      77021f2 [Marcelo Vanzin] [SPARK-11023] [yarn] Avoid creating URIs from local paths directly.
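
      To make the failure mode concrete, here is an illustrative helper in the spirit of Utils.resolveURI (the helper name and exact rules are assumptions): a Windows path such as C:\jars\app.jar is not a valid URI because of the backslashes, and C:/jars/app.jar would parse with a bogus one-letter scheme, so such strings are treated as local files instead.

      ```scala
      import java.io.File
      import java.net.{URI, URISyntaxException}

      object UriHelper {
        // Accept real URIs (hdfs://..., file:/...); interpret everything else,
        // including Windows drive-letter paths, as a local file.
        def resolveLocalURI(path: String): URI = {
          try {
            val uri = new URI(path)
            // a one-character "scheme" is almost certainly a drive letter, not a real scheme
            if (uri.getScheme != null && uri.getScheme.length > 1) return uri
          } catch {
            case _: URISyntaxException => // not parseable as a URI at all (e.g. backslashes)
          }
          new File(path).getAbsoluteFile.toURI // always yields a well-formed file:/ URI
        }
      }
      ```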
  19. Oct 09, 2015
    • [SPARK-8673] [LAUNCHER] API and infrastructure for communicating with child apps. · 015f7ef5
      Marcelo Vanzin authored
      This change adds an API that encapsulates information about an app
      launched using the library. It also creates a socket-based communication
      layer for apps that are launched as child processes; the launching
      application listens for connections from launched apps, and once
      communication is established, the channel can be used to send updates
      to the launching app, or to send commands to the child app.
      
      The change also includes hooks for local, standalone/client and yarn
      masters.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #7052 from vanzin/SPARK-8673.
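
      A short usage sketch of the launcher-side API this change introduces, assuming the SparkAppHandle and SparkAppHandle.Listener names from the 1.6 launcher package; the app resource, class name, and master below are placeholders, and exact method names should be checked against the launcher javadocs.

      ```scala
      import org.apache.spark.launcher.{SparkAppHandle, SparkLauncher}

      object LaunchExample {
        def main(args: Array[String]): Unit = {
          // Start the app as a child process and get a handle for monitoring it.
          val handle = new SparkLauncher()
            .setAppResource("/path/to/app.jar")   // placeholder paths and names
            .setMainClass("com.example.MyApp")
            .setMaster("yarn-cluster")
            .startApplication(new SparkAppHandle.Listener {
              override def stateChanged(h: SparkAppHandle): Unit =
                println(s"state: ${h.getState}")
              override def infoChanged(h: SparkAppHandle): Unit =
                println(s"appId: ${h.getAppId}")
            })

          // The handle can also send commands to the child app, e.g. handle.stop().
          while (!handle.getState.isFinal) Thread.sleep(1000)
        }
      }
      ```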
  20. Oct 08, 2015
    • [SPARK-10987] [YARN] Workaround for missing netty rpc disconnection event. · 56a9692f
      Marcelo Vanzin authored
      In YARN client mode, when the AM connects to the driver, it may be the case
      that the driver never needs to send a message back to the AM (i.e., no
      dynamic allocation or preemption). This triggers an issue in the netty rpc
      backend where no disconnection event is sent to endpoints, and the AM never
      exits after the driver shuts down.
      
      The real fix is too complicated, so this is a quick hack to unblock YARN
      client mode until we can work on the real fix. It forces the driver to
      send a message to the AM when the AM registers, thus establishing that
      connection and enabling the disconnection event when the driver goes
      away.
      
      Also, a minor side issue: when the executor is shutting down, it needs
      to send an "ack" back to the driver when using the netty rpc backend; but
      that "ack" wasn't being sent because the handler was shutting down the rpc
      env before returning. So added a change to delay the shutdown a little bit,
      allowing the ack to be sent back.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9021 from vanzin/SPARK-10987.
  21. Oct 07, 2015
    • [SPARK-10300] [BUILD] [TESTS] Add support for test tags in run-tests.py. · 94fc57af
      Marcelo Vanzin authored
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8775 from vanzin/SPARK-10300.
    • [SPARK-10964] [YARN] Correctly register the AM with the driver. · 6ca27f85
      Marcelo Vanzin authored
      The `self` method returns null when called from the constructor;
      instead, registration should happen in the `onStart` method, at
      which point the `self` reference has already been initialized.
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #9005 from vanzin/SPARK-10964.
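
      A simplified illustration of why the registration has to move: the `self` reference is only wired up after the endpoint is registered, so it is still null while the constructor runs, whereas `onStart` runs afterwards. The types below are stand-ins, not the real RPC classes.

      ```scala
      // Stand-ins for the RPC endpoint machinery.
      trait EndpointRef { def send(msg: Any): Unit }

      abstract class Endpoint {
        /** Filled in by the RPC env after registration; still null in the constructor. */
        var self: EndpointRef = _
        def onStart(): Unit = {}
      }

      class AMEndpoint(driver: EndpointRef) extends Endpoint {
        // WRONG: calling driver.send(("RegisterAM", self)) here would send a null reference.
        override def onStart(): Unit = {
          driver.send(("RegisterAM", self)) // by the time onStart runs, self is valid
        }
      }
      ```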
    • [SPARK-10812] [YARN] Fix shutdown of token renewer. · 4b747551
      Marcelo Vanzin authored
      A recent change to fix the referenced bug caused this exception in
      the `SparkContext.stop()` path:
      
      org.apache.spark.SparkException: YarnSparkHadoopUtil is not available in non-YARN mode!
              at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:167)
              at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.stop(YarnClientSchedulerBackend.scala:182)
              at org.apache.spark.scheduler.TaskSchedulerImpl.stop(TaskSchedulerImpl.scala:440)
              at org.apache.spark.scheduler.DAGScheduler.stop(DAGScheduler.scala:1579)
              at org.apache.spark.SparkContext$$anonfun$stop$7.apply$mcV$sp(SparkContext.scala:1730)
              at org.apache.spark.util.Utils$.tryLogNonFatalError(Utils.scala:1185)
              at org.apache.spark.SparkContext.stop(SparkContext.scala:1729)
      
      Author: Marcelo Vanzin <vanzin@cloudera.com>
      
      Closes #8996 from vanzin/SPARK-10812.
  22. Oct 06, 2015
  23. Oct 03, 2015
  24. Sep 29, 2015
  25. Sep 28, 2015
    • [SPARK-10790] [YARN] Fix initial executor number not set issue and consolidate the codes · 353c30bd
      jerryshao authored
      This bug was introduced in [SPARK-9092](https://issues.apache.org/jira/browse/SPARK-9092): `targetExecutorNumber` should use `minExecutors` if `initialExecutors` is not set. Using 0 instead causes the problem described in [SPARK-10790](https://issues.apache.org/jira/browse/SPARK-10790).
      
      Also consolidate and simplify some similar code snippets to keep the consistent semantics.
      
      Author: jerryshao <sshao@hortonworks.com>
      
      Closes #8910 from jerryshao/SPARK-10790.
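
      A minimal sketch of the corrected precedence: fall back to the configured minimum rather than 0 when no initial executor number is set. The config keys are the usual dynamic-allocation ones, used here purely for illustration.

      ```scala
      object InitialExecutors {
        // initialExecutors if present, otherwise minExecutors, otherwise 0
        def initialTarget(conf: Map[String, String]): Int = {
          val minExecutors =
            conf.get("spark.dynamicAllocation.minExecutors").map(_.toInt).getOrElse(0)
          conf.get("spark.dynamicAllocation.initialExecutors").map(_.toInt).getOrElse(minExecutors)
        }
      }
      ```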
    • [SPARK-10812] [YARN] Spark hadoop util support switching to yarn · d8d50ed3
      Holden Karau authored
      While this is likely not a huge issue for real production systems, it can be an issue for test systems that set up a Spark Context, tear it down, and stand up a Spark Context with a different master (e.g. some local-mode and some yarn-mode tests). Discovered during work on spark-testing-base on Spark 1.4.1, but it seems the logic that triggers it is present in master (see the SparkHadoopUtil object). A valid workaround for users encountering this issue is to fork a different JVM; however, this can be heavyweight.
      
      ```
      [info] SampleMiniClusterTest:
      [info] Exception encountered when attempting to run a suite with class name: com.holdenkarau.spark.testing.SampleMiniClusterTest *** ABORTED ***
      [info] java.lang.ClassCastException: org.apache.spark.deploy.SparkHadoopUtil cannot be cast to org.apache.spark.deploy.yarn.YarnSparkHadoopUtil
      [info] at org.apache.spark.deploy.yarn.YarnSparkHadoopUtil$.get(YarnSparkHadoopUtil.scala:163)
      [info] at org.apache.spark.deploy.yarn.Client.prepareLocalResources(Client.scala:257)
      [info] at org.apache.spark.deploy.yarn.Client.createContainerLaunchContext(Client.scala:561)
      [info] at org.apache.spark.deploy.yarn.Client.submitApplication(Client.scala:115)
      [info] at org.apache.spark.scheduler.cluster.YarnClientSchedulerBackend.start(YarnClientSchedulerBackend.scala:57)
      [info] at org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:141)
      [info] at org.apache.spark.SparkContext.<init>(SparkContext.scala:497)
      [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.setup(SharedMiniCluster.scala:186)
      [info] at com.holdenkarau.spark.testing.SampleMiniClusterTest.setup(SampleMiniClusterTest.scala:26)
      [info] at com.holdenkarau.spark.testing.SharedMiniCluster$class.beforeAll(SharedMiniCluster.scala:103)
      ```
      
      Author: Holden Karau <holden@pigscanfly.ca>
      
      Closes #8911 from holdenk/SPARK-10812-spark-hadoop-util-support-switching-to-yarn.
  26. Sep 24, 2015
  27. Sep 23, 2015
  28. Sep 15, 2015
  29. Sep 14, 2015