[SERVER-36162] Powercycle - ensure internal crash command has been executed on the remote host Created: 17/Jul/18  Updated: 29/Oct/23  Resolved: 13/Sep/18

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 3.6.9, 4.0.3, 4.1.3

Type: Bug Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Jonathan Abrahams
Resolution: Fixed Votes: 0
Labels: tig-powercycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0, v3.6
Sprint: TIG 2018-09-24
Participants:
Linked BF Score: 0
Story Points: 5

 Description   

An ssh connection error can prevent the remote command that internally crashes a server from ever running. The powertest.py script expects the crash command to fail, because the ssh connection is terminated when the host crashes. However, it should examine the output of the crash command to determine if it was actually run on the remote host.

Here's a case where the remote command failed to execute:

[2018/07/15 16:11:38.976] 2018-07-15 20:10:47,078 INFO Crashing server in 46 seconds
[2018/07/15 16:11:38.976] 2018-07-15 20:11:37,188 INFO Inserting canary document {'x': 1531685447.025} to DB power Collection cycle
[2018/07/15 16:11:38.976] ssh -o ServerAliveCountMax=10 -o ServerAliveInterval=6 -o StrictHostKeyChecking=no -o ConnectTimeout=10 -o ConnectionAttempts=20 -i /cygdrive/c/data/mci/3ab7f95ff9a32d5ea1ad8ffe3e1a09fd/powercycle.pem -o GSSAPIAuthentication=no -o CheckHostIP=no -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 -o ConnectionAttempts=20  10.122.5.210 /bin/bash -c "$'source venv_powercycle/Scripts/activate; python -u powertest.py --remoteOperation  --sshUserHost 10.122.5.210 --sshConnection \'-i /cygdrive/c/data/mci/3ab7f95ff9a32d5ea1ad8ffe3e1a09fd/powercycle.pem -o GSSAPIAuthentication=no -o CheckHostIP=no -o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 -o ConnectionAttempts=20\' --rsync  --rsyncExcludeFiles diagnostic.data/metrics.interim* --backupPathBefore /log/powercycle/beforerecovery --backupPathAfter /log/powercycle/afterrecovery --validate local --canary local --docForCanary None --seedDocNum 10000 --crashOption \'notmyfault/notmyfaultc64.exe -accepteula crash 1\' --instanceId i-093c2bc45b5317756 --crashWaitTime 45 --jitterForCrashWaitTime 5 --numCrudClients 10 --numFsmClients 10 --rootDir /log/powercycle-mongodb_mongo_v3.6_windows_64_2k8_ssl_powercycle_syncdelay_WT_f1bcba35cefd0c5c0402e32575327a77507ac03e_18_07_14_22_41_33 --mongodbBinDir /log/powercycle --dbPath /data/db --logPath /log/powercycle/mongod.log --mongodUsablePorts 20000 20001 --mongodOptions \'--setParameter enableTestCommands=1 --syncdelay 10 --storageEngine wiredTiger\' --remotePython \'source venv_powercycle/Scripts/activate; python -u\'   crash_server'"
[2018/07/15 16:12:29.518] 2018-07-15 20:12:16,477 INFO Connection timed out during banner exchange
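
A minimal sketch of the kind of output check the description asks for, not the change that was actually committed. It assumes the remote side prints a sentinel line just before invoking the crash command. The names CRASH_MARKER, wrap_crash_command and crash_was_attempted are hypothetical and do not come from powertest.py. The sentinel may still be lost if the connection drops before the output is flushed, so a missing marker is a reason to investigate or retry, not proof that the crash never ran.

```python
# Hypothetical sentinel printed on the remote host right before the crash.
CRASH_MARKER = "POWERCYCLE_CRASH_STARTED"


def wrap_crash_command(crash_cmd):
    """Prefix the crash command with an echo of the sentinel (assumed shell syntax)."""
    return "echo {}; {}".format(CRASH_MARKER, crash_cmd)


def crash_was_attempted(ssh_output):
    """Return True if the captured ssh output shows the remote side reached the crash step."""
    return CRASH_MARKER in ssh_output


if __name__ == "__main__":
    # The failure from this ticket: ssh never reached the remote host.
    print(crash_was_attempted("Connection timed out during banner exchange"))
```

With the output from the log above, crash_was_attempted returns False, so the caller would know not to treat the ssh failure as a successful crash.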



 Comments   
Comment by Githook User [ 21/Sep/18 ]

Author:

{'name': 'Jonathan Abrahams', 'email': 'jonathan@Jonathans-MacBook-Pro.local'}

Message: SERVER-36162 Powercycle - ensure internal crash command has been executed on the remote host

(cherry picked from commit f4d62c2ba9a27dc03663779d0817bc399ab2e91f)
Branch: v3.6
https://github.com/mongodb/mongo/commit/ffdec90b38eadb58f3880a72c6db0c2d6d5c3d6b

Comment by Githook User [ 20/Sep/18 ]

Author:

{'name': 'Jonathan Abrahams', 'email': 'jonathan@Jonathans-MacBook-Pro.local'}

Message: SERVER-36162 Powercycle - ensure internal crash command has been executed on the remote host

(cherry picked from commit f4d62c2ba9a27dc03663779d0817bc399ab2e91f)
Branch: v4.0
https://github.com/mongodb/mongo/commit/c5c89977a194e4b5ee7c708c64fdd9d6a5a736c2

Comment by Githook User [ 13/Sep/18 ]

Author:

{'name': 'Jonathan Abrahams', 'email': 'jonathan@Jonathans-MacBook-Pro.local'}

Message: SERVER-36162 Powercycle - ensure internal crash command has been executed on the remote host
Branch: master
https://github.com/mongodb/mongo/commit/f4d62c2ba9a27dc03663779d0817bc399ab2e91f

Comment by Max Hirschhorn [ 31/Jul/18 ]

remote_operations.py and thus powercycle have no way to distinguish whether the errors that come back from a remote command are from SSH itself or from the commands being run through SSH. In order to more tightly handle whether or not we want to retry on SSH errors, we need to build logic to be able to detect what the source of the error is.
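
One possible starting point for detecting the source of an error, sketched under assumptions. The ssh client documents that it exits with the exit status of the remote command, or with 255 if an error occurred in ssh itself. The function name error_source is hypothetical and is not part of remote_operations.py. A remote command can also exit with 255 on its own, so this is a heuristic and would likely need to be combined with inspecting the output text.

```python
# ssh exits with 255 when the error came from ssh itself (per the ssh man page).
SSH_FAILURE_EXIT_CODE = 255


def error_source(returncode):
    """Classify where a failure most likely came from, based only on the exit code."""
    if returncode == 0:
        return "none"
    if returncode == SSH_FAILURE_EXIT_CODE:
        # Ambiguous if the remote command itself can exit with 255.
        return "ssh"
    return "remote"
```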

Comment by Jonathan Abrahams [ 23/Jul/18 ]

The ssh output is returned as "Connection timed out during banner exchange", so we can retry on this error. For the server crash scenarios, before running the next loop in powertest.py we need to ensure that the server has actually been restarted (by examining its uptime).
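
A rough sketch of the uptime check, assuming the uptime of the remote host (in seconds) can be sampled before the crash attempt and again afterwards, along with the wall-clock time elapsed between the two samples. The function host_rebooted and its tolerance_secs parameter are hypothetical. Without a reboot the second uptime is roughly the first plus the elapsed time; after a reboot it is noticeably smaller.

```python
def host_rebooted(uptime_before, uptime_after, elapsed, tolerance_secs=5.0):
    """Return True if the second uptime sample is too small for an uninterrupted run.

    uptime_before, uptime_after: remote uptime in seconds at each sample.
    elapsed: wall-clock seconds between the two samples.
    tolerance_secs: slack for measurement jitter (assumed value).
    """
    expected_without_reboot = uptime_before + elapsed
    return uptime_after + tolerance_secs < expected_without_reboot
```

For example, host_rebooted(1000, 30, 120) indicates a restart, while host_rebooted(1000, 1120, 120) does not.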

Comment by Max Hirschhorn [ 22/Jul/18 ]

However, it should examine the output of the crash command to determine if it was actually run on the remote host.

jonathan.abrahams, isn't it possible that the client won't even observe the output of the crash command because it has been disconnected from the remote host as part of running the crash command? It isn't clear to me the kind of change you are proposing to make to powertest.py.

Separately, should we add "Connection timed out during banner exchange" to this list of ssh errors that remote_operations.py knows to retry on?
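
A sketch of what retrying on a list of known ssh error messages could look like. This is not the actual list or code in remote_operations.py: RETRYABLE_SSH_ERRORS, is_retryable and run_with_retries are hypothetical names, and only the first message is taken from this ticket. The run_once argument is an assumed callable that runs the ssh command once and returns (returncode, output).

```python
import time

RETRYABLE_SSH_ERRORS = (
    "Connection timed out during banner exchange",  # from this ticket
    "Connection reset by peer",  # illustrative entry, not from the real list
)


def is_retryable(output):
    """Return True if the output contains one of the known transient ssh errors."""
    return any(message in output for message in RETRYABLE_SSH_ERRORS)


def run_with_retries(run_once, attempts=3, delay_secs=0.0):
    """Call run_once() up to `attempts` times, retrying only on known ssh errors."""
    returncode, output = 1, ""
    for attempt in range(attempts):
        returncode, output = run_once()
        if returncode == 0 or not is_retryable(output):
            break
        if attempt < attempts - 1:
            time.sleep(delay_secs)
    return returncode, output
```

Errors that are not in the list are returned to the caller immediately, so failures of the remote command itself are not masked by retries.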
