[SERVER-52661] lastDurable is set after it is cleared in initial sync Created: 06/Nov/20  Updated: 06/Dec/22  Resolved: 06/Nov/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.9 Required
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Xuerui Fa Assignee: Backlog - Replication Team
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
is related to SERVER-47898 Advancing lastDurable irrespective of... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:
Linked BF Score: 56

 Description   

When we start up a new initial sync attempt, it seems possible for lastDurable to be set after it has been cleared. For example, this sequence:

  1. After an initial sync attempt, the last applied oplog entry is not yet durable
  2. A new initial sync attempt starts and the oplogApplier in initial sync is shut down
  3. The journaling thread gets the lastApplied from the replication coordinator
  4. lastApplied and lastDurable are reset in the replication coordinator by initial sync
  5. lastDurable is set to lastApplied by the journaling thread
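
The interleaving above can be replayed deterministically in a small sketch. This is a toy model rather than server code: `ToyReplCoordinator`, `reset_optimes`, and `replay_race` are hypothetical names, optimes are modeled as plain integers, and `None` stands for a cleared optime.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ToyReplCoordinator:
    """Hypothetical stand-in for the replication coordinator's optime state."""
    last_applied: Optional[int] = None
    last_durable: Optional[int] = None

    def reset_optimes(self) -> None:
        # Step 4: a new initial sync attempt clears both optimes.
        self.last_applied = None
        self.last_durable = None


def replay_race() -> ToyReplCoordinator:
    # Step 1: the last applied entry (10) is not yet durable (5).
    coord = ToyReplCoordinator(last_applied=10, last_durable=5)
    # Step 3: the journaling thread reads lastApplied before the reset.
    token = coord.last_applied
    # Step 4: initial sync resets lastApplied and lastDurable.
    coord.reset_optimes()
    # Step 5: the journaling thread writes the stale value after the reset.
    coord.last_durable = token
    return coord


state = replay_race()
print(state)
```

In this model lastDurable ends up set while lastApplied is still cleared, which is the state the sequence describes.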

This was discovered after SERVER-47898 went in. With that change, setting lastDurable also advances lastApplied, so lastApplied would be set again after the above sequence occurs. We would then reach this invariant, which would fail.

Just as a small note, it seems like unexpectedly setting lastDurable hasn't been causing noticeable issues until now. SERVER-47898 was reverted after a few days, and the BFs from the invariant failing also went away after it was reverted.

CC lingzhi.deng, matthew.russotto



 Comments   
Comment by Lingzhi Deng [ 06/Nov/20 ]

Good catch Xuerui. Yes, the journal flusher sets lastDurable asynchronously, which could be out of sync with the repl states. This type of issue also appeared in SERVER-50949. Similar to SERVER-50949, we may need to find a way to either stop/pause/restart the journal flusher after each failed initial sync attempt or make getToken() return an empty token during initial sync. I think more investigation needs to be done together with SERVER-47898. But I believe this is probably not a bug now without SERVER-47898, because we would simply ignore setting lastDurable to an OpTime higher than lastApplied.
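
A minimal sketch of two of the ideas in this comment, under stated assumptions: an empty token during initial sync, and ignoring a lastDurable that is ahead of lastApplied. `ToyCoordinator`, `get_token`, `set_last_durable`, and `in_initial_sync` are hypothetical names for illustration, not the server's API, and optimes are modeled as integers with `None` meaning unset.

```python
from typing import Optional


class ToyCoordinator:
    """Hypothetical model; optimes are plain ints and None means 'unset'."""

    def __init__(self) -> None:
        self.last_applied: Optional[int] = None
        self.last_durable: Optional[int] = None
        self.in_initial_sync = False

    def get_token(self) -> Optional[int]:
        # Assumed mitigation: hand out an empty token during initial sync.
        if self.in_initial_sync:
            return None
        return self.last_applied

    def set_last_durable(self, optime: Optional[int]) -> bool:
        # Ignore empty tokens and optimes ahead of (or without) lastApplied.
        if optime is None or self.last_applied is None:
            return False
        if optime > self.last_applied:
            return False
        self.last_durable = optime
        return True


coord = ToyCoordinator()
coord.last_applied = 10
stale_token = coord.get_token()       # flusher reads the token before the reset

coord.last_applied = None             # initial sync clears the optimes
coord.last_durable = None
coord.in_initial_sync = True

accepted = coord.set_last_durable(stale_token)  # stale write is dropped
print(accepted, coord.last_durable, coord.get_token())
```

In this toy model the stale write is rejected because lastApplied is unset, and any token requested during initial sync is empty, so a later flush has nothing to write.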

Generated at Thu Feb 08 05:28:39 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.