[SERVER-52961] Fail MongoDB FCV upgrade 3.6 if local.replset.minvalid document contains not null timestamp in 'oplogDeleteFromPoint' field. Created: 20/Nov/20  Updated: 27/Sep/21  Resolved: 02/Dec/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Evin Roesle
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-50147 Cannot start mongo after upgrading to... Closed
Operating System: ALL
Participants:
Case:

 Description   

This ticket came out of the investigation of help ticket SERVER-50147, filed by a customer.

Currently, when replication is enabled, MongoDB 3.4 sets the 'oplogDeleteFromPoint' field in the minvalid document to a non-null timestamp during steady-state oplog application, before writing a batch of oplog entries, and clears the timestamp after writing them. So, after an unclean shutdown, a 3.4 node can be left with a non-null timestamp in 'oplogDeleteFromPoint'.
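The set/write/clear sequence described above can be modelled with a small Python sketch. This is not server code: `apply_batch`, `SimulatedCrash`, the dict standing in for the minvalid document, and the `(seconds, increment)` tuple representation of timestamps are all hypothetical. The ticket only says the field is set to a non-null timestamp; using the first entry's timestamp here is an assumption made for illustration.

```python
NULL_TS = (0, 0)  # assumed stand-in for a null timestamp: (seconds, increment)


class SimulatedCrash(Exception):
    """Stands in for an unclean shutdown in the middle of a batch."""


def apply_batch(minvalid, oplog, batch, crash_after=None):
    """Toy model of the 3.4 sequence: set marker, write entries, clear marker."""
    # 1. Mark the batch as in progress with a non-null timestamp
    #    (assumption: the timestamp of the first entry in the batch).
    minvalid["oplogDeleteFromPoint"] = batch[0]["ts"]

    # 2. Write the oplog entries; optionally "crash" part-way through.
    for i, entry in enumerate(batch):
        if crash_after is not None and i >= crash_after:
            raise SimulatedCrash()
        oplog.append(entry)

    # 3. Clear the marker only once the whole batch has been written.
    minvalid["oplogDeleteFromPoint"] = NULL_TS


minvalid = {"oplogDeleteFromPoint": NULL_TS}
oplog = []
batch = [{"ts": (100, 1)}, {"ts": (100, 2)}]
try:
    apply_batch(minvalid, oplog, batch, crash_after=1)
except SimulatedCrash:
    pass
# The marker is still non-null, which is the state this ticket is about.
print(minvalid["oplogDeleteFromPoint"] != NULL_TS)
```

In this model, a crash at any point between steps 1 and 3 leaves the marker set, which is why the field can only be interpreted safely by startup recovery.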

In that unclean shutdown case, if the user restarts the node as a standalone before upgrading to the MongoDB 3.6 binary, then we can hit the problem described in SERVER-50147. (see here for related nexus of prior work)

Solution:
The current workaround is to manually unset the 'oplogDeleteFromPoint' field in the minvalid document, which is unsafe. A non-null timestamp in 'oplogDeleteFromPoint' indicates that a shutdown happened in the middle of writing an oplog batch, and this information is necessary for startup recovery up through the MongoDB 3.6 binary running with FCV 3.4.

We discussed some ways of unsetting the field in SERVER-50147, but it is really not safe to unset the field manually in any version. A safer option would be to fail the MongoDB FCV upgrade to 3.6 if the local.replset.minvalid document contains a non-null timestamp in the 'oplogDeleteFromPoint' field (this forces the user to run startup recovery on the 3.4 binary, or on the 3.6 binary with FCV 3.4). Also, we need to unset the 'oplogDeleteFromPoint' field with a null timestamp on FCV 3.6 upgrade, irrespective of whether the node is standalone or repl-enabled.
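As a rough illustration of the proposed guard, the following Python sketch models the check on a plain dict standing in for the local.replset.minvalid document. The names `check_fcv_upgrade_allowed` and `UpgradeBlocked`, and the `(seconds, increment)` tuple used for timestamps, are hypothetical. An actual fix would live in the server's FCV upgrade path, which this ticket does not specify.

```python
NULL_TS = (0, 0)  # assumed stand-in for a null timestamp: (seconds, increment)


class UpgradeBlocked(Exception):
    """Raised when the FCV upgrade should be refused."""


def check_fcv_upgrade_allowed(minvalid_doc):
    """Refuse the upgrade if 'oplogDeleteFromPoint' holds a non-null timestamp.

    minvalid_doc is a dict modelling the minvalid document, or None if the
    document does not exist.
    """
    if minvalid_doc is None:
        return  # no minvalid document, nothing to check

    ts = minvalid_doc.get("oplogDeleteFromPoint")
    if ts is None or tuple(ts) == NULL_TS:
        return  # field absent or already cleared: safe to proceed

    raise UpgradeBlocked(
        "oplogDeleteFromPoint is %r; run startup recovery on the 3.4 binary "
        "(or the 3.6 binary with FCV 3.4) before upgrading FCV" % (ts,)
    )


# A node that shut down cleanly passes the check.
check_fcv_upgrade_allowed({"oplogDeleteFromPoint": NULL_TS})

# A node that crashed mid-batch is refused.
try:
    check_fcv_upgrade_allowed({"oplogDeleteFromPoint": (100, 1)})
except UpgradeBlocked as exc:
    print("upgrade refused:", exc)
```

Failing the upgrade, rather than clearing the field automatically, keeps the recovery information intact until a binary that understands it has had a chance to use it.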



 Comments   
Comment by Evin Roesle [ 02/Dec/20 ]

We are closing this issue because we do not believe the risk of making this change on 3.6 is worth taking: the problem does not appear to be common, and a workaround is available.

To work around this issue, you can delete the data files on the node that is failing the upgrade so that an initial sync is triggered. This will safely fix the issue.

If you encounter this issue and the workaround is not suitable, please reopen this ticket.
