[SERVER-50755] Ensure FCV document is covered in WT checkpoint before killing node in wt_nojournal_fsync.js Created: 03/Sep/20  Updated: 29/Oct/23  Resolved: 05/Oct/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 4.9.0

Type: Bug Priority: Major - P3
Reporter: Jason Chan Assignee: Jason Chan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2020-09-21, Repl 2020-10-05, Repl 2020-10-19
Participants:
Linked BF Score: 17

 Description   

We only create a new FCV document on a clean startup, that is, when the server has no non-local databases.

However, after an unclean shutdown with journaling disabled, it is possible that the creation of the admin.system.version collection made it into the WT checkpoint but the insertion of the FCV document did not. In that case startup recovery sees a non-local database, never creates the missing FCV document, and ends up fasserting instead.
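To make the failure window concrete, here is a toy model of it. This is not MongoDB or WiredTiger code: `ToyEngine`, `startup`, and `simulate` are hypothetical names, the "placeholder" version string stands in for a real FCV value, and the only assumption carried over from the description is that a kill with journaling disabled keeps nothing written after the last checkpoint.

```javascript
// Toy model only: this is not MongoDB or WiredTiger code, and every name is hypothetical.
const FCV_NS = "admin.system.version";
const FCV_ID = "featureCompatibilityVersion";

// Deep-copy a Map of namespace -> array of documents.
function copyState(state) {
  return new Map([...state].map(([ns, docs]) => [ns, docs.map((d) => ({...d}))]));
}

class ToyEngine {
  constructor() {
    this.live = new Map();     // in-memory state
    this.durable = new Map();  // contents of the last checkpoint
  }
  createCollection(ns) {
    if (!this.live.has(ns)) this.live.set(ns, []);
  }
  insert(ns, doc) {
    this.createCollection(ns);
    this.live.get(ns).push(doc);
  }
  takeCheckpoint() {
    this.durable = copyState(this.live);
  }
  // Models kill -9 with journaling disabled: only checkpointed state survives.
  crashAndRestart() {
    this.live = copyState(this.durable);
  }
}

// Startup rule from the description: create the FCV document only when there are
// no non-local databases; otherwise a missing FCV document is fatal.
function startup(engine) {
  const nonLocal = [...engine.live.keys()].filter((ns) => !ns.startsWith("local."));
  if (nonLocal.length === 0) {
    engine.insert(FCV_NS, {_id: FCV_ID, version: "placeholder"});
    return "created";
  }
  const hasFcv = engine.live.get(FCV_NS)?.some((d) => d._id === FCV_ID) ?? false;
  if (!hasFcv) throw new Error("fassert: FCV document missing");
  return "found";
}

// Replays the race: the collection creation is checkpointed, the insert may not be.
function simulate(checkpointAfterInsert) {
  const engine = new ToyEngine();
  engine.createCollection(FCV_NS);
  engine.takeCheckpoint();  // checkpoint covers the collection creation only
  engine.insert(FCV_NS, {_id: FCV_ID, version: "placeholder"});
  if (checkpointAfterInsert) engine.takeCheckpoint();
  engine.crashAndRestart();
  try {
    return startup(engine);
  } catch (e) {
    return e.message;
  }
}
```

In this model, `simulate(false)` ends in the fassert path because the restarted engine has a non-local collection but no FCV document, while `simulate(true)` finds the document because a second checkpoint covered the insert.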

I think we should modify wt_nojournal_fsync.js to ensure that the FCV document makes it into the checkpoint before sending the kill -9.
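One way the test change could look is sketched below; it is not the committed fix. The helper name `checkpointFcvDocument` is hypothetical, `conn` is assumed to be a shell connection object exposing `getDB`, and the sketch assumes that the `fsync` command forces a WiredTiger checkpoint covering all earlier writes.

```javascript
// Sketch only: `checkpointFcvDocument` is a hypothetical helper, not part of the test.
// `conn` is assumed to be a mongo shell connection (anything exposing getDB()).
function checkpointFcvDocument(conn) {
  const adminDB = conn.getDB("admin");

  // First confirm the FCV document has been written at all.
  const fcvDoc = adminDB.getCollection("system.version")
                     .findOne({_id: "featureCompatibilityVersion"});
  if (fcvDoc === null || fcvDoc === undefined) {
    throw new Error("FCV document has not been inserted yet");
  }

  // Then flush to disk. Assumption: on WiredTiger this takes a checkpoint that
  // includes every write made before the command ran, including the FCV insert.
  const res = adminDB.runCommand({fsync: 1});
  if (!res.ok) {
    throw new Error("fsync failed: " + JSON.stringify(res));
  }
  return fcvDoc;
}
```

The test would call this helper once the node is up and before it sends the kill -9, so that the data surviving the kill always contains the FCV document alongside the admin.system.version collection.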



 Comments   
Comment by Githook User [ 03/Oct/20 ]

Author: Jason Chan (jasonjhchan) <jason.chan@mongodb.com>

Message: SERVER-50755 Ensure FCV document is covered in WT checkpoint
Branch: master
https://github.com/mongodb/mongo/commit/f519387c4fc53912bc669f6b13e08ee7a5faf69a

Comment by Jason Chan [ 11/Sep/20 ]

I discussed this with siyuan.zhou, and we have some concerns about the replica set case:

In replica sets, the FCV document is created as part of replSetInitiate. However, this insertion goes directly through the storage interface and does not create an oplog entry, which means the write is not yet durable. This could cause issues if replSetInitiate succeeds, the secondaries complete initial sync (including the FCV document) before the document is made durable on the primary, and the primary then crashes. The primary will fassert if the admin database made it into the checkpoint but the FCV document did not. Restarting the server with --repair will default the FCV to the lastLTS version, which could be out of sync with the rest of the replica set.

Our proposed solution is to call waitUntilUnjournaledWritesDurable after setting the FCV as part of initializeReplSetStorage.

Comment by Daniel Gottlieb (Inactive) [ 09/Sep/20 ]

FWIW, I vote we just delete this test.
