[SERVER-43226] Primary takeover during ReplSetTest.stop can hang due to fsyncLock Created: 09/Sep/19  Updated: 29/Oct/23  Resolved: 27/Sep/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.3.1

Type: Bug Priority: Major - P3
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Problem/Incident
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2019-09-09, Repl 2019-09-23, Repl 2019-10-07
Participants:
Linked BF Score: 47

 Description   

A rare race, observed in catchup_takeover_two_nodes_ahead.js:

  • The test reaches the end and calls ReplSetTest.stop
  • ReplSetTest calls getPrimary() and determines that Node 0 is the primary
  • For unexpected reasons an election and catchup takeover occur (e.g., Node 0 was blocked by LogKeeper and didn't do heartbeats for 10+ seconds), Node 0 begins to step down and reports ismaster: false
  • ReplSetTest proceeds to the point where it fsyncLocks Node 0 before checking dbhashes, because it still thinks Node 0 is primary
  • The new primary cannot complete catchup: queries on Node 0's oplog time out because Node 0 is fsyncLocked
  • Deeper in ReplSetTest's dbhash code it calls getPrimary() again, but no nodes report ismaster: true so getPrimary() times out

Proposals:

1. Currently, ReplSetTest.checkReplicaSet() fsyncLocks the primary before comparing replicas' dbhashes, oplogs, and/or collection counts. There's a comment saying the fsyncLock's purpose is to prevent TTL indexes from reaping documents during these checks. We could call setParameter with ttlMonitorEnabled: false instead of fsyncLock, but I tried that and got test failures due to dbhash mismatches. As I had feared, it's not only TTL indexes that cause dbhash mismatches, so we need the blunt tool of fsyncLock to prevent any changes.

2. All ReplSetTest code that runs with the primary fsyncLocked could be updated so it works while all replicas report ismaster: false, by using the cached self._master value instead of calling getPrimary(). This is a big code change, and it's hard to maintain. Next year we might change the code, accidentally call getPrimary() while the primary is fsyncLocked, and introduce a new rare BF without knowing it right away.

3. Sames as #2, but in getPrimary(), if no replica reports ismaster: true, assert no replicas are fsyncLocked. This will help catch mistakes in the future and help diagnose BFs like the current one, however, it's still a big change to brittle ReplSetTest code.

4. Add a retry loop in ReplSetTest.checkReplicaSet(): fsyncLock the primary, try to complete a check. If it fails, unlock, wait for a primary, and relock.

I'll submit Proposal 4 for review.



 Comments   
Comment by Githook User [ 27/Sep/19 ]

Author:

{'username': 'ajdavis', 'email': 'jesse@mongodb.com', 'name': 'A. Jesse Jiryu Davis'}

Message: SERVER-43226 ReplSetTest.stop can hang due to fsyncLock
Branch: master
https://github.com/mongodb/mongo/commit/9b470eb73873f5db5c9fcee5df5316d477a1fa12

Comment by Githook User [ 26/Sep/19 ]

Author:

{'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}

Message: Revert "SERVER-43226 ReplSetTest.stop can hang due to fsyncLock"

This reverts commit 7a9cefdabf985360ada4ac7569b0682320075edf.
Branch: master
https://github.com/mongodb/mongo/commit/fc0db84002fd553bcf0f7d7a153d79bbba75cf34

Comment by A. Jesse Jiryu Davis [ 26/Sep/19 ]

Re-opening: my previous attempt caused build failures. I've reverted it and I'll investigate.

Comment by Githook User [ 26/Sep/19 ]

Author:

{'username': 'ajdavis', 'email': 'jesse@mongodb.com', 'name': 'A. Jesse Jiryu Davis'}

Message: SERVER-43226 ReplSetTest.stop can hang due to fsyncLock
Branch: master
https://github.com/mongodb/mongo/commit/7a9cefdabf985360ada4ac7569b0682320075edf

Generated at Thu Feb 08 05:02:36 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.