[SERVER-50755] Ensure FCV document is covered in WT checkpoint before killing node in wt_nojournal_fsync.js Created: 03/Sep/20 Updated: 29/Oct/23 Resolved: 05/Oct/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 4.9.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Jason Chan | Assignee: | Jason Chan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Backwards Compatibility: | Fully Compatible | ||||
| Operating System: | ALL | ||||
| Sprint: | Repl 2020-09-21, Repl 2020-10-05, Repl 2020-10-19 | ||||
| Participants: | |||||
| Linked BF Score: | 17 | ||||
| Description |
|
We only create a new FCV document on a clean startup (that is, when the server has no non-local databases). However, after an unclean shutdown with journaling disabled, it is possible that the creation of the admin.system.version collection made it into the WT checkpoint but the insertion of the FCV document did not. On the next startup the server already has a non-local database, so the startup is not treated as clean: we never create the missing FCV document and end up fasserting instead. I think we should modify wt_nojournal_fsync.js to ensure that the FCV document makes it into the checkpoint before sending the kill -9. |
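The race described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not MongoDB or WiredTiger code: `ToyNode`, `startup_check`, and `run` are hypothetical names, a checkpoint is modeled as a deep copy of in-memory state, and the `version` value is a placeholder.

```python
import copy

FCV_ID = "featureCompatibilityVersion"


class ToyNode:
    """Toy node without journaling: only checkpointed state survives a crash."""

    def __init__(self):
        self.live = {}        # in-memory state: namespace -> list of documents
        self.checkpoint = {}  # last durable snapshot

    def create_collection(self, ns):
        self.live.setdefault(ns, [])

    def insert(self, ns, doc):
        self.live[ns].append(doc)

    def take_checkpoint(self):
        self.checkpoint = copy.deepcopy(self.live)

    def kill_and_restart(self):
        # kill -9 with no journal: writes after the last checkpoint are lost.
        self.live = copy.deepcopy(self.checkpoint)


def startup_check(node):
    """Hypothetical startup rule mirroring the ticket's description."""
    non_local = [ns for ns in node.live if not ns.startswith("local.")]
    if not non_local:
        return "clean startup: create FCV document"
    docs = node.live.get("admin.system.version", [])
    if any(d.get("_id") == FCV_ID for d in docs):
        return "ok"
    return "fassert: FCV document missing"


def run(checkpoint_between):
    """Create the collection, insert the FCV document, then crash.

    checkpoint_between=True places the only checkpoint between the
    collection creation and the document insert (the failing case).
    """
    node = ToyNode()
    node.create_collection("admin.system.version")
    if checkpoint_between:
        node.take_checkpoint()
    node.insert("admin.system.version", {"_id": FCV_ID, "version": "X.Y"})
    if not checkpoint_between:
        node.take_checkpoint()  # what the test fix aims to guarantee
    node.kill_and_restart()
    return startup_check(node)
```

In this model, `run(True)` reproduces the fassert path, while `run(False)` corresponds to the proposed test change of waiting until the FCV document is covered by a checkpoint before killing the node.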
| Comments |
| Comment by Githook User [ 03/Oct/20 ] |
|
Author: Jason Chan (jasonjhchan, jason.chan@mongodb.com) Message: |
| Comment by Jason Chan [ 11/Sep/20 ] |
|
Discussed this with siyuan.zhou and we have some concerns about this in the replica set case. In replica sets, the FCV document is created as part of replSetInitiate. However, the insertion of the FCV document goes directly through the storage interface and does not create an oplog entry. This means the write is not yet durable. This could cause issues if replSetInitiate succeeds, the secondaries then complete initial sync of the FCV document before it is made durable on the primary, and the primary then crashes. The primary will fassert if the admin database made it into the checkpoint but the FCV document did not. Restarting the server with --repair will default the FCV to the lastLTS version, which could be out of sync with the rest of the replica set. Our proposed solution is to call waitUntilUnjournaledWritesDurable after setting the FCV as part of initializeReplSetStorage. |
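The proposed durability barrier can be sketched in the same toy style. Assumptions are labeled loudly: `ToyPrimary` and its methods are hypothetical, and the call to waitUntilUnjournaledWritesDurable is modeled simply as forcing a checkpoint, which may not match the server's actual behavior.

```python
import copy

FCV_ID = "featureCompatibilityVersion"
FCV_NS = "admin.system.version"


class ToyPrimary:
    """Toy primary without journaling; not real server code."""

    def __init__(self):
        self.live = {}     # in-memory state
        self.durable = {}  # state covered by the last checkpoint

    def checkpoint(self):
        self.durable = copy.deepcopy(self.live)

    def initiate(self, wait_until_durable, background_checkpoint=True):
        """Stand-in for FCV setup during replica set initiation."""
        self.live[FCV_NS] = []
        if background_checkpoint:
            # Unlucky timing: the collection becomes durable, the document does not.
            self.checkpoint()
        self.live[FCV_NS].append({"_id": FCV_ID})
        if wait_until_durable:
            # Assumption: the durability wait is modeled as a forced checkpoint.
            self.checkpoint()

    def crash_and_recover(self):
        """Drop non-durable state, then apply the startup rule from the ticket."""
        self.live = copy.deepcopy(self.durable)
        coll = self.live.get(FCV_NS)
        if coll is None:
            return "clean startup"
        return "ok" if any(d["_id"] == FCV_ID for d in coll) else "fassert"
```

With `wait_until_durable=True`, the FCV document is always covered before `initiate` returns, so a later crash recovers cleanly in this model regardless of when the background checkpoint fired.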
| Comment by Daniel Gottlieb (Inactive) [ 09/Sep/20 ] |
|
FWIW, I vote we just delete this test. |