[SERVER-80575] Flow control is engaged and the sustainer point is not moving Created: 31/Aug/23  Updated: 01/Dec/23  Resolved: 01/Dec/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Bharath Kumar CM Assignee: Alison Rhea Thorne
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Server Triage
Operating System: ALL
Participants:

 Description   

In a 3-node replica set (no sharding), at present:

p1 - secondary

p2 - primary

dr - secondary

When I connect using mongosh on each of those nodes:

On p1 : mongosh <--all-envs> --port 7253

it takes close to 8-10 seconds to reach the mongosh prompt

rs.status()

  members: [
    {
      _id: 0,
      name: 'p1:7253',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 367399,
      optime: { ts: Timestamp({ t: 1693206907, i: 1 }), t: Long("47") },
      optimeDate: ISODate("2023-08-28T07:15:07.000Z"),
      lastAppliedWallTime: ISODate("2023-08-28T07:15:07.307Z"),
      lastDurableWallTime: ISODate("2023-08-28T07:15:07.307Z"),
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: 'Could not find member to sync from',
      configVersion: 19,
      configTerm: 47,
      self: true,
      lastHeartbeatMessage: ''
    },
    {
      _id: 1,
      name: 'p2:7253',
      health: 0,
      state: 8,
      stateStr: '(not reachable/healthy)',
      uptime: 0,
      optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
      optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
      optimeDate: ISODate("1970-01-01T00:00:00.000Z"),
      optimeDurableDate: ISODate("1970-01-01T00:00:00.000Z"),
      lastAppliedWallTime: ISODate("2023-08-28T07:15:07.307Z"),
      lastDurableWallTime: ISODate("2023-08-28T07:15:07.307Z"),
      lastHeartbeat: ISODate("2023-08-28T07:15:29.974Z"),
      lastHeartbeatRecv: ISODate("2023-08-31T08:19:31.951Z"),
      pingMs: Long("0"),
      lastHeartbeatMessage: "Couldn't get a connection within the time limit",
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      configVersion: 19,
      configTerm: 47
    },
    {
      _id: 2,
      name: 'dr:7253',
      health: 0,
      state: 8,
      stateStr: '(not reachable/healthy)',
      uptime: 0,
      optime: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
      optimeDurable: { ts: Timestamp({ t: 0, i: 0 }), t: Long("-1") },
      optimeDate: ISODate("1970-01-01T00:00:00.000Z"),
      optimeDurableDate: ISODate("1970-01-01T00:00:00.000Z"),
      lastAppliedWallTime: ISODate("2023-08-28T07:15:07.307Z"),
      lastDurableWallTime: ISODate("2023-08-28T07:15:07.307Z"),
      lastHeartbeat: ISODate("2023-08-31T03:50:33.417Z"),
      lastHeartbeatRecv: ISODate("2023-08-31T08:19:34.226Z"),
      pingMs: Long("80"),
      lastHeartbeatMessage: "Couldn't get a connection within the time limit",
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      configVersion: 19,
      configTerm: 47
    },

mongo7253 [direct: secondary] test> rs.printSecondaryReplicationInfo()
source: p1lc1mon029.ger.corp.intel.com:7253

{   syncedTo: 'Mon Aug 28 2023 07:15:07 GMT+0000 (Coordinated Universal Time)',   replLag: '0 secs (0 hrs) behind the freshest member (no primary available at the moment)' }


source: p2:17253

{ 'no replication info, yet.  State': '(not reachable/healthy)' }


source: dr:7253

{ 'no replication info, yet.  State': '(not reachable/healthy)' }

mongo7253 [direct: secondary] test>

mongo7253 [direct: secondary] test> rs.printReplicationInfo()
actual oplog size
'2048 MB'

configured oplog size
'2048 MB'

log length start to end
'45966145 secs (12768.37 hrs)'

oplog first event time
'Mon Mar 14 2022 06:52:42 GMT+0000 (Coordinated Universal Time)'

oplog last event time
'Mon Aug 28 2023 07:15:07 GMT+0000 (Coordinated Universal Time)'

now
'Thu Aug 31 2023 08:22:47 GMT+0000 (Coordinated Universal Time)'
mongo7253 [direct: secondary] test>
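The "log length start to end" figure is simply the span between the oplog's first and last event times. Recomputing it from the values above (a quick arithmetic check, not MongoDB code) also highlights that p1's oplog stopped at Aug 28 while "now" is Aug 31:

```python
from datetime import datetime, timezone

# Values taken from rs.printReplicationInfo() on p1:
first_event = datetime(2022, 3, 14, 6, 52, 42, tzinfo=timezone.utc)  # oplog first event time
last_event = datetime(2023, 8, 28, 7, 15, 7, tzinfo=timezone.utc)    # oplog last event time

window = (last_event - first_event).total_seconds()
print(int(window))              # 45966145 secs, matching the reported value
print(round(window / 3600, 2))  # 12768.37 hrs
```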

On p2 : mongosh <--all-envs> --port 7253

immediately gives the mongosh prompt

mongo7253 [direct: primary] test> show dbs
admin          336.00 KiB
config           1.29 MiB
XYZ             9.05 MiB
local          132.86 MiB
mongo7253 [direct: primary] test>

rs.status()

  members: [
    {
      _id: 0,
      name: 'p1:7253',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 263345,
      optime: { ts: Timestamp({ t: 1693206907, i: 1 }), t: Long("47") },
      optimeDurable: { ts: Timestamp({ t: 1693206907, i: 1 }), t: Long("47") },
      optimeDate: ISODate("2023-08-28T07:15:07.000Z"),
      optimeDurableDate: ISODate("2023-08-28T07:15:07.000Z"),
      lastAppliedWallTime: ISODate("2023-08-28T07:15:07.307Z"),
      lastDurableWallTime: ISODate("2023-08-28T07:15:07.307Z"),
      lastHeartbeat: ISODate("2023-08-31T08:24:36.250Z"),
      lastHeartbeatRecv: ISODate("2023-08-31T08:24:36.228Z"),
      pingMs: Long("1042"),
      lastHeartbeatMessage: '',
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      configVersion: 19,
      configTerm: 47
    },
    {
      _id: 1,
      name: 'p2:7253',
      health: 1,
      state: 1,
      stateStr: 'PRIMARY',
      uptime: 263355,
      optime: { ts: Timestamp({ t: 1693470270, i: 1 }), t: Long("50") },
      optimeDate: ISODate("2023-08-31T08:24:30.000Z"),
      lastAppliedWallTime: ISODate("2023-08-31T08:24:30.694Z"),
      lastDurableWallTime: ISODate("2023-08-31T08:24:30.694Z"),
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      electionTime: Timestamp({ t: 1693206944, i: 1 }),
      electionDate: ISODate("2023-08-28T07:15:44.000Z"),
      configVersion: 19,
      configTerm: 50,
      self: true,
      lastHeartbeatMessage: ''
    },
    {
      _id: 2,
      name: 'dr:7253',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 18194,
      optime: { ts: Timestamp({ t: 1693470270, i: 1 }), t: Long("50") },
      optimeDurable: { ts: Timestamp({ t: 1693470270, i: 1 }), t: Long("50") },
      optimeDate: ISODate("2023-08-31T08:24:30.000Z"),
      optimeDurableDate: ISODate("2023-08-31T08:24:30.000Z"),
      lastAppliedWallTime: ISODate("2023-08-31T08:24:30.694Z"),
      lastDurableWallTime: ISODate("2023-08-31T08:24:30.694Z"),
      lastHeartbeat: ISODate("2023-08-31T08:24:34.685Z"),
      lastHeartbeatRecv: ISODate("2023-08-31T08:24:35.853Z"),
      pingMs: Long("82"),
      lastHeartbeatMessage: '',
      syncSourceHost: 'p2:7253',
      syncSourceId: 1,
      infoMessage: '',
      configVersion: 19,
      configTerm: 50
    },

mongo7253 [direct: primary] test> rs.printSecondaryReplicationInfo()
source: p1:7253

{   syncedTo: 'Mon Aug 28 2023 07:15:07 GMT+0000 (Coordinated Universal Time)',   replLag: '-263413 secs (-73.17 hrs) behind the primary ' }


source: dr:7253

{   syncedTo: 'Thu Aug 31 2023 08:25:20 GMT+0000 (Coordinated Universal Time)',   replLag: '0 secs (0 hrs) behind the primary ' }

mongo7253 [direct: primary] test>

mongo7253 [direct: primary] test> rs.printReplicationInfo()
actual oplog size
'2048 MB'

configured oplog size
'2048 MB'

log length start to end
'46229587 secs (12841.55 hrs)'

oplog first event time
'Mon Mar 14 2022 06:52:53 GMT+0000 (Coordinated Universal Time)'

oplog last event time
'Thu Aug 31 2023 08:26:00 GMT+0000 (Coordinated Universal Time)'

now
'Thu Aug 31 2023 08:26:03 GMT+0000 (Coordinated Universal Time)'
mongo7253 [direct: primary] test>

On dr: mongosh <--all-envs> --port 7253

rs.status()

  members: [
    {
      _id: 0,
      name: 'p1:7253',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 18319,
      optime: { ts: Timestamp({ t: 1693206907, i: 1 }), t: Long("47") },
      optimeDurable: { ts: Timestamp({ t: 1693206907, i: 1 }), t: Long("47") },
      optimeDate: ISODate("2023-08-28T07:15:07.000Z"),
      optimeDurableDate: ISODate("2023-08-28T07:15:07.000Z"),
      lastAppliedWallTime: ISODate("2023-08-28T07:15:07.307Z"),
      lastDurableWallTime: ISODate("2023-08-28T07:15:07.307Z"),
      lastHeartbeat: ISODate("2023-08-31T08:26:33.995Z"),
      lastHeartbeatRecv: ISODate("2023-08-31T08:26:37.495Z"),
      pingMs: Long("959"),
      lastHeartbeatMessage: '',
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      configVersion: 19,
      configTerm: 47
    },
    {
      _id: 1,
      name: 'p2:7253',
      health: 1,
      state: 1,
      stateStr: 'PRIMARY',
      uptime: 18320,
      optime: { ts: Timestamp({ t: 1693470388, i: 2 }), t: Long("50") },
      optimeDurable: { ts: Timestamp({ t: 1693470388, i: 2 }), t: Long("50") },
      optimeDate: ISODate("2023-08-31T08:26:28.000Z"),
      optimeDurableDate: ISODate("2023-08-31T08:26:28.000Z"),
      lastAppliedWallTime: ISODate("2023-08-31T08:26:28.836Z"),
      lastDurableWallTime: ISODate("2023-08-31T08:26:28.836Z"),
      lastHeartbeat: ISODate("2023-08-31T08:26:36.658Z"),
      lastHeartbeatRecv: ISODate("2023-08-31T08:26:37.492Z"),
      pingMs: Long("82"),
      lastHeartbeatMessage: '',
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      electionTime: Timestamp({ t: 1693206944, i: 1 }),
      electionDate: ISODate("2023-08-28T07:15:44.000Z"),
      configVersion: 19,
      configTerm: 50
    },
    {
      _id: 2,
      name: 'dr:7253',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 18334,
      optime: { ts: Timestamp({ t: 1693470388, i: 2 }), t: Long("50") },
      optimeDate: ISODate("2023-08-31T08:26:28.000Z"),
      lastAppliedWallTime: ISODate("2023-08-31T08:26:28.836Z"),
      lastDurableWallTime: ISODate("2023-08-31T08:26:28.836Z"),
      syncSourceHost: 'p2:7253',
      syncSourceId: 1,
      infoMessage: '',
      configVersion: 19,
      configTerm: 50,
      self: true,
      lastHeartbeatMessage: ''
    },

mongo7253 [direct: secondary] test> rs.printSecondaryReplicationInfo()
source: p1:7253

{   syncedTo: 'Mon Aug 28 2023 07:15:07 GMT+0000 (Coordinated Universal Time)',   replLag: '-263523 secs (-73.2 hrs) behind the primary ' }


source: dr:7253

{   syncedTo: 'Thu Aug 31 2023 08:27:10 GMT+0000 (Coordinated Universal Time)',   replLag: '0 secs (0 hrs) behind the primary ' }

mongo7253 [direct: secondary] test>

mongo7253 [direct: secondary] test> rs.printReplicationInfo()
actual oplog size
'2048 MB'

configured oplog size
'2048 MB'

log length start to end
'46229695 secs (12841.58 hrs)'

oplog first event time
'Mon Mar 14 2022 06:53:15 GMT+0000 (Coordinated Universal Time)'

oplog last event time
'Thu Aug 31 2023 08:28:10 GMT+0000 (Coordinated Universal Time)'

now
'Thu Aug 31 2023 08:28:12 GMT+0000 (Coordinated Universal Time)'
mongo7253 [direct: secondary] test>

=================

 

When I tail the logs for REPL messages:

On p1, I keep getting these messages:
tail -f /instances/mongo_db7253/logs/mongodb.log | grep REPL

{"t":{"$date":"2023-08-31T08:28:58.394+00:00"},"s":"I",  "c":"REPL",     "id":21401,   "ctx":"conn26379","msg":"Scheduling heartbeat to fetch a newer config","attr":{"configTerm":50,"configVersion":19,"senderHost":"dr:7253"}}

{"t":{"$date":"2023-08-31T08:29:00.682+00:00"},"s":"I",  "c":"REPL",     "id":21401,   "ctx":"conn2169","msg":"Scheduling heartbeat to fetch a newer config","attr":{"configTerm":50,"configVersion":19,"senderHost":"p2:7253"}}
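The repeated "Scheduling heartbeat to fetch a newer config" messages are consistent with the rs.status() outputs above: p1 still reports configTerm 47 while p2 and dr report configTerm 50, so every heartbeat from them advertises a config p1 considers newer. Replica set config freshness is compared by (term, version); a small illustrative sketch of that ordering (hypothetical function name, not server code):

```python
# Replica set configs are ordered by (configTerm, configVersion);
# a peer advertising a greater pair triggers a config-fetch heartbeat.
# `is_newer_config` is an illustrative name, not a MongoDB internal.
def is_newer_config(mine: tuple, theirs: tuple) -> bool:
    # Lexicographic comparison: term first, then version.
    return theirs > mine

p1 = (47, 19)         # from rs.status() on p1 (configTerm 47, configVersion 19)
p2_and_dr = (50, 19)  # from rs.status() on p2 and dr

print(is_newer_config(p1, p2_and_dr))  # True: p1 keeps scheduling fetch heartbeats
```

This would explain why the message repeats on p1: it keeps scheduling the fetch but apparently never completes it.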

On p2 & dr, I get nothing from:
tail -f /instances/mongo_db7253/logs/mongodb.log | grep REPL

However, I do get the following from:
tail -f /instances/mongo_db7253/logs/mongodb.log | grep FlowControlRefresher

{"t":{"$date":"2023-08-31T08:31:39.550+00:00"},"s":"W",  "c":"STORAGE",  "id":22225,   "ctx":"FlowControlRefresher","msg":"Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries."}
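For context on this warning: flow control (on by default since MongoDB 4.2) throttles writes on the primary when the majority-committed point falls behind, and the "sustainer point" is that majority-committed optime. The "not moving" variant fires when throttling is engaged but the point has not advanced between periodic refresher samples. A hypothetical sketch of that condition (illustrative names and logic, not MongoDB internals):

```python
# Hypothetical sketch of the FlowControlRefresher "sustainer not moving" check.
# prev_ts / curr_ts stand in for the majority-commit (sustainer) optime sampled
# on consecutive refresher ticks; names are illustrative, not server code.
def sustainer_warning(engaged: bool, prev_ts: int, curr_ts: int) -> bool:
    """Return True when the flow-control warning would be logged."""
    # Flow control is throttling writes, yet the majority commit point has not
    # advanced since the last sample -- replication is effectively stalled.
    return engaged and curr_ts <= prev_ts

# In this ticket, p1 is stuck at optime ts 1693206907 while the primary keeps
# advancing, so the majority point cannot move and the warning repeats.
print(sustainer_warning(True, 1693206907, 1693206907))  # True: stalled
print(sustainer_warning(True, 1693206907, 1693470270))  # False: advancing
```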

===================

Questions:

1. Why does my p1 show a negative lag number (-73.17 hrs)?
2. Which node is the primary, p1 or p2?
   - p1 looks like the primary because it is ahead of the other 2 nodes per rs.printSecondaryReplicationInfo() on p2 and dr.
   - p2 looks like the primary because, when connected locally using mongosh, it shows as primary.
3. Why does p1 report "no primary available at the moment" when rs.printSecondaryReplicationInfo() is run from p1?
4. Why do p2 and dr show state '(not reachable/healthy)' when rs.status() is run on p1, yet rs.status() on p2 and dr shows p1 and dr as secondaries and p2 as primary?
5. Why do p2 and dr show lastHeartbeatMessage: "Couldn't get a connection within the time limit" when rs.status() is run on p1, but show as healthy on p2 and dr?
6. Is p1 a primary that is trying to replicate to p2 and dr without success, and p2 a false primary?
7. What is the bug with flow control that makes it log "Flow control is engaged and the sustainer point is not moving. Please check the health of all secondaries."?
8. How do I fix this replication issue?

My MongoDB server version is 5.0.5 on all 3 nodes.
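On the negative-lag question: the magnitude of the reported value matches the gap between the freshest optime on the primary side and p1's syncedTo, so only the sign looks inverted in the shell's display. A quick check using the timestamps from the rs.printSecondaryReplicationInfo() output on p2 (Python, purely for arithmetic):

```python
from datetime import datetime, timezone

# Values taken from the output above:
# p1 syncedTo:      Mon Aug 28 2023 07:15:07 UTC
# freshest (dr/p2): Thu Aug 31 2023 08:25:20 UTC
synced_to = datetime(2023, 8, 28, 7, 15, 7, tzinfo=timezone.utc)
freshest = datetime(2023, 8, 31, 8, 25, 20, tzinfo=timezone.utc)

lag_secs = (freshest - synced_to).total_seconds()
print(int(lag_secs))              # 263413 -- same magnitude as the reported -263413 secs
print(round(lag_secs / 3600, 2))  # 73.17  -- the reported -73.17 hrs
```

This suggests p1 really is about 73 hours behind and the minus sign is a reporting quirk, not evidence that p1 is ahead.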



 Comments   
Comment by Alison Rhea Thorne [ 01/Dec/23 ]

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Comment by Alison Rhea Thorne [ 08/Nov/23 ]

We still need additional information to diagnose the problem. If this is still an issue for you, would you please do the following?

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)
Comment by Alison Rhea Thorne [ 17/Oct/23 ]

I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)
Comment by Bharath Kumar CM [ 17/Oct/23 ]

Any update?

 

Generated at Thu Feb 08 06:43:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.