[SERVER-46842] resmoke.py shouldn't run data consistency checks in stepdown suites if a process has crashed Created: 13/Mar/20  Updated: 29/Oct/23  Resolved: 22/Apr/20

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.4.1, 4.7.0

Type: Improvement
Priority: Major - P3
Reporter: Max Hirschhorn
Assignee: Mikhail Shchatko
Resolution: Fixed
Votes: 0
Labels: bkp, tig-resmoke
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Backport Requested: v4.4
Sprint: STM 2020-03-23, STM 2020-04-20, STM 2020-05-04
Participants:
Story Points: 1

 Description   

resmoke.py ordinarily checks that a test didn't cause the server to crash by calling self.fixture.is_running() after the test finishes. However, due to the stepdown thread and the job thread only being synchronized by calling ContinuousStepdown.after_test(), it isn't safe to check whether the fixture is still running immediately after the test finishes.

# Don't check fixture.is_running() when using the ContinuousStepdown hook, which kills
# and restarts the primary. Even if the fixture is still running as expected, there is a
# race where fixture.is_running() could fail if called after the primary was killed but
# before it was restarted.
self._check_if_fixture_running = not any(
    isinstance(hook, stepdown.ContinuousStepdown) for hook in self.hooks)

Skipping this check causes resmoke.py to continue to run the other data consistency checks, even when a process in the MongoDB cluster has crashed. This is misleading for Server engineers because it leads them to click on the "wrong" link in Evergreen for the task failure, and it also has a severe negative impact on our automated log extraction tool by preventing it from finding relevant information. We should ensure process crashes in test suites using the ContinuousStepdown hook prevent other tests and hooks from running. I suspect having _StepdownThread.pause() check that the fixture is still running as the last thing it does would accomplish this.
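
A minimal sketch of the suggestion, not the actual resmoke.py code: it assumes the stepdown thread signals "no kill/restart in progress" through an event, so that pause() can safely call fixture.is_running() once that event is set. FakeFixture, StepdownThreadSketch, ServerFailure, and the event attribute names are hypothetical stand-ins used only for illustration.

```python
import threading


class ServerFailure(Exception):
    """Hypothetical error raised when a fixture process has exited unexpectedly."""


class FakeFixture:
    """Stand-in for a resmoke fixture; only is_running() is modeled."""

    def __init__(self):
        self._running = True

    def is_running(self):
        return self._running

    def crash(self):
        # Simulate a server process dying (e.g. after an fassert).
        self._running = False


class StepdownThreadSketch:
    """Simplified model of the pause handshake between job and stepdown threads."""

    def __init__(self, fixture):
        self._fixture = fixture
        # Set whenever no kill/restart cycle is in flight (assumed design).
        self._is_idle_evt = threading.Event()
        self._is_idle_evt.set()
        # Set while the stepdown thread is allowed to start new cycles.
        self._is_resumed_evt = threading.Event()

    def resume(self):
        self._is_resumed_evt.set()

    def pause(self):
        # Stop new stepdowns from starting, then wait for any in-flight
        # kill/restart cycle to finish.
        self._is_resumed_evt.clear()
        self._is_idle_evt.wait()
        # Last step: with the stepdown thread idle, a fixture that is not
        # running means a process really crashed, not that it was mid-restart.
        if not self._fixture.is_running():
            raise ServerFailure(
                "Fixture is not running after pausing the stepdown thread")


fixture = FakeFixture()
stepdown_thread = StepdownThreadSketch(fixture)
stepdown_thread.resume()
stepdown_thread.pause()  # Healthy fixture: returns normally.
```

Raising from pause() means the failure surfaces in ContinuousStepdown.after_test(), before any data consistency hook gets a chance to run against a crashed cluster.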



 Comments   
Comment by Githook User [ 04/Jun/20 ]

Author:

{'name': 'Mikhail Shchatko', 'email': 'mikhail.shchatko@mongodb.com'}

Message: SERVER-46842 resmoke.py shouldn't run data consistency checks in stepdown suites if a process has crashed

(cherry picked from commit 40801001754b6bdc15bd2f59eae523c59b6ff055)
Branch: v4.4
https://github.com/mongodb/mongo/commit/b085e6a542c8dfe69d8702f843e04b46296c8f28

Comment by Siyuan Zhou [ 04/Jun/20 ]

Awesome. Thank you!

Comment by Robert Guo (Inactive) [ 04/Jun/20 ]

siyuan.zhou Done! backport is in the commit queue

Comment by Siyuan Zhou [ 29/May/20 ]

mikhail.shchatko and robert.guo, do you have a plan to backport this to 4.4? I found that the test change in my patch for SERVER-47950 depends on this. I could backport my ticket without the test if robert.guo prefers.

Comment by Githook User [ 28/Apr/20 ]

The following changes were intended for SERVER-46841:

Author:

{'name': 'Amirsaman Memaripour', 'email': 'amirsaman.memaripour@mongodb.com', 'username': 'samanca'}

Message: SERVER-46842 Make PeriodicRunner interrupt blocked operations on stop

(cherry picked from commit ef75364ada70eaf4a096ed07adfeb3175abd719b)
Branch: v4.2
https://github.com/mongodb/mongo/commit/19a8df607f40a3d31d145bef9255ed4e019d23c1

Comment by Githook User [ 22/Apr/20 ]

Author:

{'name': 'Mikhail Shchatko', 'email': 'mikhail.shchatko@mongodb.com'}

Message: SERVER-46842 resmoke.py shouldn't run data consistency checks in stepdown suites if a process has crashed
Branch: master
https://github.com/mongodb/mongo/commit/40801001754b6bdc15bd2f59eae523c59b6ff055

Comment by Ian Whalen (Inactive) [ 13/Mar/20 ]

BFG-555889 had already been run through the bot analyzer, which extracted some unhelpful logs from the CheckReplDBHash failure but missed the following from the test logs:

[ShardedClusterFixture:job0:shard1:node2] | 2020-03-12T05:55:48.235+0000 F  -        23093   [OplogApplier-0] "Fatal assertion {msgid} {status} at {file} {line}","attr":{"msgid":34361,"status":"OplogOutOfOrder: Attempted to apply an oplog entry ({ ts: Timestamp(1583992546, 16), t: 41 }) which is not greater than our last applied OpTime ({ ts: Timestamp(1583992548, 9), t: 41 }).","file":"src/mongo/db/repl/oplog_applier_impl.cpp","line":477}
[ShardedClusterFixture:job0:shard1:node2] | 2020-03-12T05:55:48.235+0000 F  -        23094   [OplogApplier-0] "\n\n***aborting after fassert() failure\n\n"

Comment by Max Hirschhorn [ 13/Mar/20 ]

Ian was the one doing the screen share, so I'd want to double check with him / his browser history on the specific ones we were looking through.

Comment by David Bradford (Inactive) [ 13/Mar/20 ]

For the cases where the description was empty, do you know if they had the bot-analyzed label? Due to the volume of BFGs coming in lately, the log analysis worker has been thousands of BFGs behind all this week. It didn't get caught up until today.

Generated at Thu Feb 08 05:12:35 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.