Commit 7f7b50ed authored by Aaron Davidson, committed by Andrew Or

[SPARK-3923] Increase Akka heartbeat pause above heartbeat interval

Something about the 2.3.4 upgrade seems to have made the issue manifest where all the services disconnect from each other after exactly 1000 seconds (which is the heartbeat interval). [This post](https://groups.google.com/forum/#!topic/akka-user/X3xzpTCbEFs) suggests that heartbeat pause should be greater than heartbeat interval, and increasing the pause from 600s to 6000s seems to have rectified the issue. My current cluster has now exceeded 1400s of uptime without failure!

I do not know why this fixed it, because the threshold we have set for the failure detector is the exponent of a timeout, and 300 is extremely large. Perhaps the default failure detector changed in 2.3.4 and now ignores threshold.
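
The relationship described above (the acceptable pause must exceed the heartbeat interval) can be expressed as a small sanity check. This is only an illustrative sketch: `HeartbeatCheck` and `pauseExceedsInterval` are hypothetical names that do not exist in Spark, and a plain `Map` stands in for `SparkConf` so the snippet runs without dependencies. The property keys and default values are the ones shown in the diff below.

```scala
// Hypothetical helper, not part of Spark or of this commit.
object HeartbeatCheck {
  // Defaults after this commit, in seconds (values taken from the diff).
  val DefaultPauses = 6000
  val DefaultInterval = 1000

  /** Returns true when the acceptable heartbeat pause exceeds the heartbeat interval. */
  def pauseExceedsInterval(settings: Map[String, String]): Boolean = {
    val pauses = settings.get("spark.akka.heartbeat.pauses").map(_.toInt).getOrElse(DefaultPauses)
    val interval = settings.get("spark.akka.heartbeat.interval").map(_.toInt).getOrElse(DefaultInterval)
    pauses > interval
  }

  def main(args: Array[String]): Unit = {
    // Old default pause (600) against the unchanged interval (1000): check fails.
    println(pauseExceedsInterval(Map("spark.akka.heartbeat.pauses" -> "600")))
    // New defaults (6000 vs. 1000): check passes.
    println(pauseExceedsInterval(Map.empty))
  }
}
```

With the old default the pause was shorter than the interval, which is the condition the linked post warns about; the new default satisfies it without touching the interval.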

Author: Aaron Davidson <aaron@databricks.com>

Closes #2784 from aarondav/fix-timeout and squashes the following commits:

bd1151a [Aaron Davidson] Increase pause, don't decrease interval
9cb0372 [Aaron Davidson] [SPARK-3923] Decrease Akka heartbeat interval below heartbeat pause
parent 2fe0ba95
@@ -77,7 +77,7 @@ private[spark] object AkkaUtils extends Logging {
     val logAkkaConfig = if (conf.getBoolean("spark.akka.logAkkaConfig", false)) "on" else "off"
-    val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 600)
+    val akkaHeartBeatPauses = conf.getInt("spark.akka.heartbeat.pauses", 6000)
     val akkaFailureDetector =
       conf.getDouble("spark.akka.failure-detector.threshold", 300.0)
     val akkaHeartBeatInterval = conf.getInt("spark.akka.heartbeat.interval", 1000)
......
@@ -725,7 +725,7 @@ Apart from these, the following properties are also available, and may be useful
 </tr>
 <tr>
   <td><code>spark.akka.heartbeat.pauses</code></td>
-  <td>600</td>
+  <td>6000</td>
   <td>
     This is set to a larger value to disable failure detector that comes inbuilt akka. It can be
     enabled again, if you plan to use this feature (Not recommended). Acceptable heart beat pause
......