[SERVER-35724] Remote EC2 hosts which are not accessible via ssh should fail with system error Created: 21/Jun/18  Updated: 29/Oct/23  Resolved: 28/Jun/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 3.6.7, 4.0.1, 4.1.1

Type: Task Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Jonathan Abrahams
Resolution: Fixed Votes: 0
Labels: powercycle-infra
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to SERVER-34996 Save console_output & console_screens... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.0, v3.6
Sprint: TIG 2018-07-02
Participants:
Linked BF Score: 31

 Description   

When a remote EC2 instance is "crashed" by the powercycle test it sometimes fails to become available via ssh. The AWS status still indicates it as "running". The work in SERVER-34996 is intended to help analyze why this may occur.

The following should be done such that we can distinguish between a test failure (possible data corruption) and an environment failure:

  • powertest.py should exit in a way that indicates it failed due to ssh
  • If the exit is due to ssh, then a system failure should be triggered (which will show the task as purple)
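
One way to picture the two points above is a reserved exit code that the calling task maps to a system failure. This is only a sketch under assumptions: the value of SSH_FAILURE_EXIT_CODE, the outcome names, and the helper functions are hypothetical and are not taken from powertest.py or the committed fix.

```python
import time

# Hypothetical exit code reserved for "could not reach the host via ssh".
# The ticket does not specify an actual value.
SSH_FAILURE_EXIT_CODE = 2


def wait_for_ssh(probe, retries=3, delay_secs=0.0):
    """Return True as soon as probe() succeeds, False if every attempt fails.

    probe is any zero-argument callable that returns True when an ssh
    connection to the remote host could be established.
    """
    for attempt in range(retries):
        if probe():
            return True
        if attempt < retries - 1:
            time.sleep(delay_secs)
    return False


def classify_exit(return_code):
    """Map an exit code to a task outcome (outcome names are illustrative)."""
    if return_code == 0:
        return "success"
    if return_code == SSH_FAILURE_EXIT_CODE:
        # Environment problem: report as a system failure (purple task).
        return "system_failure"
    # Anything else is treated as a real test failure.
    return "test_failure"
```

With this split, a host that never comes back after the crash yields "system_failure", while any other non-zero exit is still reported as a test failure that may indicate data corruption.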

In order to help find out why a particular EC2 instance is failing to permit ssh we should also do the following:

  • Termination of the EC2 instance should not be attempted if a system failure occurred due to ssh issue from powercycle (for non-Windows variants)
  • Increase the expire_hours to 24 (for non-Windows variants)
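
The cleanup rules in the two bullets above could be expressed as a small decision function. This is a hedged sketch, not the implemented logic: plan_instance_cleanup, its parameters, and the windows_expire_hours default are hypothetical, and only the value 24 and the non-Windows restriction come from the description.

```python
def plan_instance_cleanup(ssh_failure, is_windows, windows_expire_hours=2):
    """Decide whether to terminate the EC2 instance and which expiry to use.

    ssh_failure: True if the run ended in a system failure caused by ssh.
    is_windows: True for Windows variants, which keep the existing behavior.
    windows_expire_hours: placeholder for the unchanged Windows setting.
    """
    # Non-Windows variants get the longer expiry so the host stays around
    # long enough to be investigated.
    expire_hours = windows_expire_hours if is_windows else 24
    # Skip termination only for non-Windows hosts that failed due to ssh.
    terminate = is_windows or not ssh_failure
    return {"terminate": terminate, "expire_hours": expire_hours}
```

For example, a non-Windows host that failed due to ssh is left running with a 24-hour expiry, while a Windows host is always terminated as before.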


 Comments   
Comment by Githook User [ 13/Jul/18 ]

Author:

{'username': 'hptabster', 'name': 'Jonathan Abrahams', 'email': 'jonathan@mongodb.com'}

Message: SERVER-35724 Remote EC2 hosts which are not accessible via ssh should fail with system error
Branch: v3.6
https://github.com/mongodb/mongo/commit/999672772454d09172121559140f683675479e1f

Comment by Githook User [ 09/Jul/18 ]

Author:

{'username': 'hptabster', 'name': 'Jonathan Abrahams', 'email': 'jonathan@mongodb.com'}

Message: SERVER-35724 Remote EC2 hosts which are not accessible via ssh should fail with system error

(cherry picked from commit ae29cbee182e41c10ca7b1a44e034f9e200a5b90)
Branch: v4.0
https://github.com/mongodb/mongo/commit/fcbfc70bc7344d056cec99ce2f92370e799e2767

Comment by Githook User [ 29/Jun/18 ]

Author:

{'username': 'hptabster', 'name': 'Jonathan Abrahams', 'email': 'jonathan@mongodb.com'}

Message: SERVER-35724 Remote EC2 hosts which are not accessible via ssh should fail with system error
Branch: master
https://github.com/mongodb/mongo/commit/ae29cbee182e41c10ca7b1a44e034f9e200a5b90

Comment by Jonathan Abrahams [ 26/Jun/18 ]

We'll disable the Amazon Linux 2 variant due to ssh connection issues after the machine has been internally "crashed".

Comment by Max Hirschhorn [ 22/Jun/18 ]

jonathan.abrahams, there are still failures in the BF tickets linked to SERVER-34996 that aren't from the Amazon Linux 2 builder. Until we can figure out how to make ssh after crashing the machine more reliable, I think we'll still want to turn the task purple and leave the machines around so the failures can be escalated to us and the Build team.

Comment by Jonathan Abrahams [ 22/Jun/18 ]

Given the findings that the host is not ssh accessible because it cannot boot (this typically occurs on an Amazon Linux 2 instance), we no longer need to skip termination of the EC2 instance or increase its expire_hours. It's not clear why the instance cannot boot; perhaps AWS does not support this scenario.
