-
Type: Bug
-
Resolution: Fixed
-
Priority: Major - P3
-
Affects Version/s: None
-
Component/s: Replication
-
Labels:None
-
Fully Compatible
-
ALL
-
Repl 2019-09-09, Repl 2019-09-23, Repl 2019-10-07
-
47
A rare race, observed in catchup_takeover_two_nodes_ahead.js:
- The test reaches the end and calls ReplSetTest.stop
- ReplSetTest calls getPrimary() and determines that Node 0 is the primary
- For unexpected reasons an election and catchup takeover occur (e.g., Node 0 was blocked by LogKeeper and didn't do heartbeats for 10+ seconds), Node 0 begins to step down and reports ismaster: false
- ReplSetTest proceeds to the point where it fsyncLocks Node 0 before checking dbhashes, because it still thinks Node 0 is primary
- The new primary cannot complete catchup: queries on Node 0's oplog time out because Node 0 is fsyncLocked
- Deeper in ReplSetTest's dbhash code it calls getPrimary() again, but no nodes report ismaster: true so getPrimary() times out
Proposals:
1. Currently, ReplSetTest.checkReplicaSet() fsyncLocks the primary before comparing replicas' dbhashes, oplogs, and/or collection counts. There's a comment saying the fsyncLock's purpose is to prevent TTL indexes from reaping documents during these checks. We could call setParameter with ttlMonitorEnabled: false instead of fsyncLock, but I tried that and got test failures due to dbhash mismatches. As I had feared, it's not only TTL indexes that cause dbhash mismatches, so we need the blunt tool of fsyncLock to prevent any changes.
2. All ReplSetTest code that runs with the primary fsyncLocked could be updated so it works while all replicas report ismaster: false, by using the cached self._master value instead of calling getPrimary(). This is a big code change, and it's hard to maintain. Next year we might change the code, accidentally call getPrimary() while the primary is fsyncLocked, and introduce a new rare BF without knowing it right away.
3. Sames as #2, but in getPrimary(), if no replica reports ismaster: true, assert no replicas are fsyncLocked. This will help catch mistakes in the future and help diagnose BFs like the current one, however, it's still a big change to brittle ReplSetTest code.
4. Add a retry loop in ReplSetTest.checkReplicaSet(): fsyncLock the primary, try to complete a check. If it fails, unlock, wait for a primary, and relock.
I'll submit Proposal 4 for review.