Majority of secondaries lagging for 30s after a failover, breaching read/write SLA

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 4.4.9, 7.0.11, 8.0.0
    • Component/s: None
    • Replication
    • ALL
      We have not had much luck reproducing this besides the one instance we saw. However, it does seem like the sequence of events is common enough that it can recur and we want to guard against this.

      We recently observed a case where the primary degraded, resulting in the following sequence of events:

       

      T0:   [All secondaries] Oplog fetcher thread returned error “NetworkTimeout: Error while getting the next batch in the oplog fetcher :: caused by :: recv failed while exhausting cursor :: caused by :: Socket operation timed out”

       

      T0 + 7s: [New Primary]  Election triggers “Starting an election, since we've seen no PRIMARY in election timeout period”

       

      T0 + 7s: [New Primary] Election succeeds, new primary assuming role

       

      T0 + 7s: [All secondaries] Restarting heartbeat after learning of a new primary

       

      T0 + 9s: [All secondaries] Old Primary is now in state RS_DOWN

       

      T0 + 30s: [All secondaries] Blacklisting candidate due to error Request xxx timed out, deadline was T0+30s, op was RemoteCommand xxx – target:[a.b.c.d:27017] db:local expDate: T0+30s cmd:{ find: "oplog.rs", limit: 1, sort: { $natural: 1 }, projection: { ts: 1, t: 1 }, readConcern: {} }

       

      T0 + 30s: [All secondaries] Choosing primary as sync source

       

      T0 + 30s: [All secondaries] Scheduling fetcher to read remote Oplog 

      It therefore took 30s from the time the sync source resolver issued a find command on the oplog against the old primary, which was apparently not responding at this stage, for the request to time out. This timeout is hardcoded here and used here, so it took that long for the secondaries to re-evaluate the sync source, resulting in an equivalent amount of replication lag. This long period of lag is concerning because none of the secondaries can acknowledge writes with a write concern of w:2 or greater, nor serve the latest reads.
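
      For reference, the shell sketch below shows how one can check, during such a window, which node each secondary is using as a sync source and how far behind it is, and how the lag surfaces to clients. It is only a sketch: the test.probe namespace, the w:2 value, and the 5s wtimeout are arbitrary illustration choices, not part of our application.

      // Sketch (mongo/mongosh shell): inspect each member's sync source and lag.
      // Field names (stateStr, syncSourceHost, optimeDate) come from replSetGetStatus.
      var status = db.adminCommand({ replSetGetStatus: 1 });
      var primary = status.members.find(function (m) { return m.stateStr === "PRIMARY"; });
      status.members
        .filter(function (m) { return m.stateStr === "SECONDARY"; })
        .forEach(function (m) {
          var lagSecs = (primary.optimeDate - m.optimeDate) / 1000;
          print(m.name + " syncing from " + (m.syncSourceHost || "<none>") + ", lag " + lagSecs + "s");
        });

      // A write that needs two acknowledgements, issued during the 30s window, only
      // returns once a secondary catches up or the (illustrative) 5s wtimeout expires.
      db.getSiblingDB("test").probe.insertOne(
        { ping: new Date() },
        { writeConcern: { w: 2, wtimeout: 5000 } }
      );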

      Based on the above, we have a few questions that we are looking to the community to answer, so that we are in a position to make an appropriate fix for this problem in our setup.

      1. Why is the fetcher timeout on a find command that only does a sort and limits the result to 1 hardcoded to 30s?
      2. We ran the find command in question (db.runCommand({explain: {find: "oplog.rs", sort: {$natural: 1}, projection: {"t": 1, "ts": 1}, limit: 1}}); see the sketch after this list) against an oplog as large as 700G, as well as on the shard involved in the sequence of events above, which had a 50G oplog, and the query took less than a second in both cases.
      3. What are the consequences of lowering the find timeout for the query here in order to reduce the secondary unavailability window for our use case in a scenario like the one described above? What would be a good timeout value to drop to?
      4. Would there be a consequence to reducing the timeout value for the replSetGetRBID call here as well, which uses the same timeout, if we intend for the sync source resolver to fail fast and re-evaluate the sync source?
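
      Regarding question 2, this is roughly how we ran that check, spelled out against the local database where oplog.rs lives; the executionStats verbosity and the manual timing are our additions, and the timings will obviously vary with oplog size and load.

      // Explain the same shape of find that the sync source resolver issues
      // (limit 1, natural-order sort, projecting only ts and t).
      var local = db.getSiblingDB("local");
      local.runCommand({
        explain: { find: "oplog.rs", sort: { $natural: 1 }, projection: { ts: 1, t: 1 }, limit: 1 },
        verbosity: "executionStats"
      });

      // Time the actual query; in our tests it returned in under a second even
      // against a ~700G oplog.
      var start = Date.now();
      local.getCollection("oplog.rs").find({}, { ts: 1, t: 1 }).sort({ $natural: 1 }).limit(1).toArray();
      print("find on oplog.rs took " + (Date.now() - start) + " ms");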

       

      Setup notes

      MongoDB version

      4.4.9, although the same issue can happen in the latest versions due to the same hardcoded fetcher timeout

      OS version

      Ubuntu 24.04.2 LTS 

      Kernel version

      6.8.0-1029-aws

      Shard timeout settings

      chainingAllowed: false, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 5, electionTimeoutMillis: 5000, catchUpTimeoutMillis: 30000, catchUpTakeoverDelayMillis: 30000
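
      For completeness, this is roughly how those settings map onto the replica set configuration document (a shell sketch; rs.reconfig() is only needed when changing values, and note that none of these settings control the hardcoded 30s fetcher timeout discussed above).

      // Replica set settings as they appear under rs.conf().settings; values mirror ours.
      var cfg = rs.conf();
      cfg.settings.chainingAllowed = false;
      cfg.settings.heartbeatIntervalMillis = 2000;
      cfg.settings.heartbeatTimeoutSecs = 5;
      cfg.settings.electionTimeoutMillis = 5000;
      cfg.settings.catchUpTimeoutMillis = 30000;
      cfg.settings.catchUpTakeoverDelayMillis = 30000;
      rs.reconfig(cfg);   // run against the primary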

       

      Shard layout

      5 voting members in a single cluster

       

            Assignee:
            Unassigned
            Reporter:
            Preeti Murthy
            Votes:
            0
            Watchers:
            3