[SERVER-7275] node can't roll back if behind minValid Created: 05/Oct/12  Updated: 06/Dec/22  Resolved: 08/Sep/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dwight Merriman Assignee: Backlog - Replication Team
Resolution: Duplicate Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-7200 use oplog as op buffer on secondaries Closed
Duplicate
is duplicated by SERVER-23841 Mongod always complain "Fatal asserti... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

Consider a replica set with three members whose oplogs are in this state:

S1: 1 2 3 4 5 6 7 8 9 (primary)
S2: 1
S3: 1 2 3 4 5

Now suppose S2 starts applying the batch 2..9. It early-commits ops {2,4,6,8} and then crashes. After the crash, S2's oplog is unchanged, but the writes for opids {2,4,6,8} have already reached its data files.

On restart, S2 would recover fine (assuming the ops are idempotent) if S1 is up. However, suppose S1 goes down first (perhaps permanently). Now S2 and S3 are the remaining set members when S2 restarts, and S3 has the latest data. After recovery we have:

S1: down
S2: 1 2 3 4 5 (plus {6,8} applied but not in the oplog)
S3: 1 2 3 4 5

However, S2 has also written ops {6,8} to its data files, and they are never rolled back.
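
To make the failure concrete, here is a minimal, hypothetical Python sketch of the scenario; Node, apply_batch, and resync_from are illustrative names only, not server code. Data-file writes from the batch are early-committed before the batch reaches the oplog, the node crashes, and after resyncing from S3 the ops {6,8} remain in the data files with no oplog record that could drive a rollback.

    class Node:
        def __init__(self, name, oplog):
            self.name = name
            self.oplog = list(oplog)       # opids recorded in this node's oplog
            self.datafiles = set(oplog)    # opids whose writes reached the data files

        def apply_batch(self, ops, crash_after_early_commit=False):
            # Early commit: some data-file writes become durable before the
            # batch's oplog entries are written -- the ordering at issue here.
            early = [op for op in ops if op % 2 == 0]   # e.g. {2,4,6,8}
            self.datafiles.update(early)
            if crash_after_early_commit:
                raise RuntimeError("crash: oplog unchanged, data files partly written")
            self.oplog.extend(ops)
            self.datafiles.update(ops)

        def resync_from(self, other):
            # Simplified catch-up from the remaining member: adopt its oplog.
            self.oplog = list(other.oplog)
            self.datafiles.update(other.oplog)

    s2 = Node("S2", [1])
    s3 = Node("S3", [1, 2, 3, 4, 5])

    try:
        s2.apply_batch(range(2, 10), crash_after_early_commit=True)
    except RuntimeError:
        pass                                     # S2 restarts; S1 is gone

    s2.resync_from(s3)                           # S3 is the only sync source left
    print(s2.oplog)                              # [1, 2, 3, 4, 5]
    print(sorted(s2.datafiles - set(s2.oplog)))  # [6, 8] -- applied, never rolled back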



 Comments   
Comment by Eric Milkie [ 08/Sep/16 ]

The situation described here is fixed by the work in SERVER-7200.

Comment by Scott Hernandez (Inactive) [ 21/Apr/16 ]

Since the oplog entries may now already have been written locally, it might be possible to finish applying them to reach a consistent state and then perform the rollback afterwards. This would only be possible if the oplog entries were recorded and are available on restart, which is now possible since we write the local oplog entries and apply them concurrently. This would give the same behavior as the oplog-as-a-buffer work, except that change guarantees the oplog entries will be there, rather than their possibly being there depending on execution and journaling order.
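
A minimal sketch of the ordering described here, under the assumption that the whole batch's oplog entries are already durable locally on restart; the function and parameter names are illustrative, not the server's:

    def recover_then_rollback(local_oplog, applied_through, min_valid,
                              apply_op, do_rollback):
        # local_oplog: list of (opid, op) pairs already durable on this node.
        # Finish applying up to minValid first, so the node is consistent...
        for opid, op in local_oplog:
            if applied_through < opid <= min_valid:
                apply_op(op)               # must be an idempotent re-application
                applied_through = opid
        assert applied_through >= min_valid, "batch not fully available locally"
        # ...and only then run the usual rollback against the remaining members.
        do_rollback()
        return applied_through

    # e.g. the scenario from the description, with ops 2..9 durable in the
    # local oplog but application interrupted partway through the batch:
    recover_then_rollback([(i, "op%d" % i) for i in range(2, 10)],
                          applied_through=5, min_valid=9,
                          apply_op=print, do_rollback=lambda: print("rollback"))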

Comment by Eric Milkie [ 23/Feb/15 ]

Actually, in the scenario above, S2 would go down: it would attempt a rollback, but its minValid would not be pointing at the end of its oplog.
The oplog-as-buffer project will solve this issue by logging ops in the oplog prior to applying them.
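
For illustration, a hypothetical sketch of the startup check this refers to, assuming minValid and the oplog end are directly comparable timestamps; this is not the actual server code path:

    def check_min_valid_on_startup(oplog_end_ts, min_valid_ts):
        # If minValid points past the end of our own oplog, the data files may
        # reflect ops the oplog never recorded; rollback cannot reconcile that,
        # so the node refuses to come up (cf. the fatal assertion in SERVER-23841).
        if min_valid_ts > oplog_end_ts:
            raise SystemExit("Fatal assertion: minValid is ahead of the oplog; "
                             "cannot roll back from an inconsistent state")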

Comment by Dwight Merriman [ 05/Oct/12 ]

Three possible solutions:
1) Never early-commit batches. That is, roll them back (remap the private view) and then retry with a smaller batch size.
2a) Write to the oplog at the beginning of the batch instead of the end, marking the ops as not yet completed. Note that other set members should not pull those items until they are really done (barring some proof that that is OK, which I believe is nontrivial). On an early commit, flag in the oplog which ops are already done and which are not. On restart, finish applying the ops that are not yet applied before coming up and listening.
2b) On an early commit, write the batch of ops to local.rs.batchinprogress, noting there which ones have already been executed. On restart/recovery, finish the batchinprogress items before starting up; at the end of that job, empty the batchinprogress collection and write the ops to the oplog as normal.

2b has an advantage over 2a in that it doesn't change the oplog format. It is therefore backward compatible and, in addition, won't break anything for users who query the oplog themselves for custom purposes. It's kind of like a journal for the current batch.
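
A minimal, hypothetical sketch of option 2b: the collection name batchinprogress comes from the comment above, while the function names and the in-memory dict standing in for the collection are assumptions for illustration only.

    def start_batch(batchinprogress, ops):
        # Record the whole batch in the side collection before any early commit.
        batchinprogress.clear()
        batchinprogress.update({opid: {"op": op, "done": False} for opid, op in ops})

    def mark_done(batchinprogress, opid):
        batchinprogress[opid]["done"] = True        # set at early-commit time

    def recover_batch(batchinprogress, apply_op, append_to_oplog):
        # Run on restart, before the node starts listening for traffic.
        for opid in sorted(batchinprogress):
            entry = batchinprogress[opid]
            if not entry["done"]:
                apply_op(entry["op"])               # finish the interrupted batch
            append_to_oplog(opid, entry["op"])      # oplog format stays unchanged
        batchinprogress.clear()                     # the per-batch journal is done

Because the oplog itself is only written once the whole batch is accounted for, anything that tails or queries the oplog never observes a half-applied batch, which is the backward-compatibility point made above.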
