[SERVER-26397] Look for new sync source more frequently while in catchup mode Created: 29/Sep/16  Updated: 25/Jul/18  Resolved: 25/Jul/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Vesselina Ratcheva (Inactive)
Resolution: Won't Fix Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Operating System: ALL
Sprint: Repl 2017-12-04, Repl 2017-12-18, Repl 2018-07-16, Repl 2018-07-30
Participants:
Linked BF Score: 0

 Description   

If a recently-elected primary is in catchup mode but has no sync source, it's delaying becoming a fully-usable primary, but not actually doing any work. It's possible that when it first got elected and looked for a sync source there was no good sync source available, but then one becomes available while it is in catchup mode. We should be checking for new sync sources more frequently than we normally do if we're in catchup mode, since the whole node is otherwise just sitting idle.



 Comments   
Comment by Siyuan Zhou [ 25/Jul/18 ]

We are not sure if this would fix the original BF and we haven't seen this elsewhere. We also don't want to expose the state of ReplicationCoordinatorImpl to bgsync, because the concurrency rules don't allow bgsync to call into ReplicationCoordinatorImpl. The problem will only happen if the replset is pretty quiet but a write occurs during the 2-second heartbeat interval, which should be rare in reality. The worst case is to wait for 1 more second.

There is a real case where this is a valid improvement; it's just a really unlikely and uncommon case that we've decided isn't worth the extra complexity to the system to address.

Closing this as "Won't Fix". We can reopen this when it occurs in the future.

Comment by Spencer Brody (Inactive) [ 23/Jul/18 ]

I'm not positive but I think the case I was alluding to when I filed this ticket was:

  1. Node gets elected, timer starts for deciding whether to go into catchup mode
  2. bgsync thread looks for a sync source, finds none, goes to sleep for 1 second
  3. heartbeat comes in with newer optime from some node, causing us to go into catchup mode
  4. Now we're stuck waiting for the bgsync thread to wake up and look for a sync source again before we start doing any work as part of catchup mode.
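
To put a number on step 4, here is a minimal sketch of the timeline above. It is not MongoDB code: the function name and the model are assumptions made for illustration (lookups are treated as instantaneous and as repeating every 1 second after the first failure, per step 2).

```python
def idle_seconds_before_next_lookup(failed_lookup_at, heartbeat_at, retry_sleep=1.0):
    """Time spent idle in catchup mode before bgsync looks for a sync source again.

    Hypothetical model: bgsync retries every `retry_sleep` seconds, starting
    from the failed lookup at `failed_lookup_at`. The heartbeat that triggers
    catchup mode arrives at `heartbeat_at`.
    """
    elapsed = heartbeat_at - failed_lookup_at
    if elapsed < 0:
        raise ValueError("heartbeat must arrive after the failed lookup")
    # How far into the current sleep the heartbeat landed.
    into_sleep = elapsed % retry_sleep
    # A heartbeat that lands exactly on a retry boundary costs no idle time.
    return 0.0 if into_sleep == 0 else retry_sleep - into_sleep


# Heartbeat arrives 0.25 s after bgsync went to sleep.
print(idle_seconds_before_next_lookup(0.0, 0.25))  # 0.75
```

Under this model the idle time is always below one sleep interval, which matches the "1 more second" bound mentioned in the closing comment.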

A more ideal solution would probably be to wake up the bgsync thread whenever we get a heartbeat or replSetUpdatePosition with new information that may affect its ability to find a sync source, but that is likely a much more complex change.
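
As a rough illustration of that wake-up idea, here is a sketch of an interruptible sleep built on Python's `threading.Event`. The class and method names are hypothetical and nothing here reflects the real bgsync or ReplicationCoordinatorImpl code; it only shows the general pattern of a fixed retry sleep that a notifier can cut short.

```python
import threading
import time


class SyncSourceRetryWaiter:
    """Hypothetical stand-in for the bgsync retry sleep."""

    def __init__(self, retry_sleep=1.0):
        self._retry_sleep = retry_sleep
        self._new_info = threading.Event()

    def notify_new_info(self):
        # Would be called when a heartbeat or position update brings news
        # that could change the outcome of sync source selection.
        self._new_info.set()

    def wait_before_retry(self):
        # Sleep for at most retry_sleep seconds, returning early if notified.
        # Returns True when woken by a notification, False on timeout.
        woken = self._new_info.wait(timeout=self._retry_sleep)
        # Reset so the next sleep waits for fresh information.
        self._new_info.clear()
        return woken


waiter = SyncSourceRetryWaiter(retry_sleep=5.0)
# Simulate a heartbeat arriving 50 ms into the sleep.
threading.Timer(0.05, waiter.notify_new_info).start()
started = time.monotonic()
woken_early = waiter.wait_before_retry()
waited = time.monotonic() - started
print(woken_early, waited < 5.0)
```

In this sketch the notifier only sets a flag and never calls into the waiting thread's state, so the dependency points from the notifier to the waiter and not the other way around.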

Comment by Siyuan Zhou [ 12/Jul/18 ]

vesselina.ratcheva, if we cannot find a good sync source, how did we know there is a node with higher optime? I'm curious why there was no good sync source available when it first got elected and looked for a sync source.
