[SERVER-43226] Primary takeover during ReplSetTest.stop can hang due to fsyncLock Created: 09/Sep/19 Updated: 29/Oct/23 Resolved: 27/Sep/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | A. Jesse Jiryu Davis | Assignee: | A. Jesse Jiryu Davis |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||
| Operating System: | ALL | ||||||||
| Sprint: | Repl 2019-09-09, Repl 2019-09-23, Repl 2019-10-07 | ||||||||
| Participants: | |||||||||
| Linked BF Score: | 47 | ||||||||
| Description |
|
A rare race, observed in catchup_takeover_two_nodes_ahead.js:
Proposals: 1. Currently, ReplSetTest.checkReplicaSet() fsyncLocks the primary before comparing replicas' dbhashes, oplogs, and/or collection counts. There's a comment saying the fsyncLock's purpose is to prevent TTL indexes from reaping documents during these checks. We could call setParameter with ttlMonitorEnabled: false instead of fsyncLock, but I tried that and got test failures due to dbhash mismatches. As I had feared, it's not only TTL indexes that cause dbhash mismatches, so we need the blunt tool of fsyncLock to prevent any changes. 2. All ReplSetTest code that runs with the primary fsyncLocked could be updated so it works while all replicas report ismaster: false, by using the cached self._master value instead of calling getPrimary(). This is a big code change, and it's hard to maintain. Next year we might change the code, accidentally call getPrimary() while the primary is fsyncLocked, and introduce a new rare BF without knowing it right away. 3. Sames as #2, but in getPrimary(), if no replica reports ismaster: true, assert no replicas are fsyncLocked. This will help catch mistakes in the future and help diagnose BFs like the current one, however, it's still a big change to brittle ReplSetTest code. 4. Add a retry loop in ReplSetTest.checkReplicaSet(): fsyncLock the primary, try to complete a check. If it fails, unlock, wait for a primary, and relock. I'll submit Proposal 4 for review. |
| Comments |
| Comment by Githook User [ 27/Sep/19 ] |
|
Author: {'username': 'ajdavis', 'email': 'jesse@mongodb.com', 'name': 'A. Jesse Jiryu Davis'}Message: |
| Comment by Githook User [ 26/Sep/19 ] |
|
Author: {'name': 'A. Jesse Jiryu Davis', 'username': 'ajdavis', 'email': 'jesse@mongodb.com'}Message: Revert " This reverts commit 7a9cefdabf985360ada4ac7569b0682320075edf. |
| Comment by A. Jesse Jiryu Davis [ 26/Sep/19 ] |
|
Re-opening: my previous attempt caused build failures. I've reverted it and I'll investigate. |
| Comment by Githook User [ 26/Sep/19 ] |
|
Author: {'username': 'ajdavis', 'email': 'jesse@mongodb.com', 'name': 'A. Jesse Jiryu Davis'}Message: |