[SERVER-85915] Cannot select sync source because we are a voter and it is not Created: 30/Jan/24  Updated: 06/Feb/24

Status: Needs Scheduling
Project: Core Server
Component/s: None
Affects Version/s: 7.2.0-rc0, 7.0.3
Fix Version/s: None

Type: Improvement Priority: Minor - P4
Reporter: Андрій К Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: logging-improvement
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Replication
Participants:

 Description   

After upgrading the replica set from version 6 to 7.0.5, we started getting this message on one replica:

{"t":{"$date":"2024-01-30T09:22:47.768+00:00"},"s":"I",  "c":"REPL",     "id":3873108, "ctx":"ReplCoordExtern-0","msg":"Cannot select sync source because we are a voter and it is not","attr":{"syncSourceCandidate":"example5.com:27017"}}

nodes with votes: 1, 2, 3
I see no other problems: replication lag is 0 and the replica set works fine, but the log is constantly spammed with this message (around 30 messages per second).
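
A rough way to confirm the rate from mongosh (getLog only returns the most recent in-memory log entries, so this is approximate; the log id 3873108 is taken from the message above):

const res = db.adminCommand({ getLog: "global" });
const hits = res.log.filter(l => l.includes('"id":3873108'));   // the "Cannot select sync source" message
print(`${hits.length} of ${res.log.length} buffered log lines are message 3873108`);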

Result of the rs.status() command from example3.com (the problem node):

{
  set: 'exampleRS',
  date: ISODate('2024-01-30T09:24:24.943Z'),
  myState: 2,
  term: Long('161'),
  syncSourceHost: 'example2.com:27017',
  syncSourceId: 1,
  heartbeatIntervalMillis: Long('2000'),
  majorityVoteCount: 2,
  writeMajorityCount: 2,
  votingMembersCount: 3,
  writableVotingMembersCount: 3,
  optimes: {
    lastCommittedOpTime: { ts: Timestamp({ t: 1706606664, i: 5 }), t: Long('161') },
    lastCommittedWallTime: ISODate('2024-01-30T09:24:24.676Z'),
    readConcernMajorityOpTime: { ts: Timestamp({ t: 1706606664, i: 5 }), t: Long('161') },
    appliedOpTime: { ts: Timestamp({ t: 1706606664, i: 5 }), t: Long('161') },
    durableOpTime: { ts: Timestamp({ t: 1706606664, i: 5 }), t: Long('161') },
    lastAppliedWallTime: ISODate('2024-01-30T09:24:24.676Z'),
    lastDurableWallTime: ISODate('2024-01-30T09:24:24.676Z')
  },
  lastStableRecoveryTimestamp: Timestamp({ t: 1706606606, i: 7 }),
  members: [
    {
      _id: 0,
      name: 'example1.com:27017',
      health: 1,
      state: 1,
      stateStr: 'PRIMARY',
      uptime: 662,
      optime: { ts: Timestamp({ t: 1706606663, i: 2 }), t: Long('161') },
      optimeDurable: { ts: Timestamp({ t: 1706606663, i: 2 }), t: Long('161') },
      optimeDate: ISODate('2024-01-30T09:24:23.000Z'),
      optimeDurableDate: ISODate('2024-01-30T09:24:23.000Z'),
      lastAppliedWallTime: ISODate('2024-01-30T09:24:23.414Z'),
      lastDurableWallTime: ISODate('2024-01-30T09:24:23.414Z'),
      lastHeartbeat: ISODate('2024-01-30T09:24:23.925Z'),
      lastHeartbeatRecv: ISODate('2024-01-30T09:24:23.062Z'),
      pingMs: Long('178'),
      lastHeartbeatMessage: '',
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      electionTime: Timestamp({ t: 1706519832, i: 1 }),
      electionDate: ISODate('2024-01-29T09:17:12.000Z'),
      configVersion: 20,
      configTerm: 161
    },
    {
      _id: 1,
      name: 'example2.com:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 662,
      optime: { ts: Timestamp({ t: 1706606662, i: 20 }), t: Long('161') },
      optimeDurable: { ts: Timestamp({ t: 1706606662, i: 20 }), t: Long('161') },
      optimeDate: ISODate('2024-01-30T09:24:22.000Z'),
      optimeDurableDate: ISODate('2024-01-30T09:24:22.000Z'),
      lastAppliedWallTime: ISODate('2024-01-30T09:24:22.626Z'),
      lastDurableWallTime: ISODate('2024-01-30T09:24:22.626Z'),
      lastHeartbeat: ISODate('2024-01-30T09:24:22.925Z'),
      lastHeartbeatRecv: ISODate('2024-01-30T09:24:23.752Z'),
      pingMs: Long('176'),
      lastHeartbeatMessage: '',
      syncSourceHost: 'example1.com:27017',
      syncSourceId: 0,
      infoMessage: '',
      configVersion: 20,
      configTerm: 161
    },
    {
      _id: 2,
      name: 'example3.com:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 666,
      optime: { ts: Timestamp({ t: 1706606664, i: 5 }), t: Long('161') },
      optimeDate: ISODate('2024-01-30T09:24:24.000Z'),
      lastAppliedWallTime: ISODate('2024-01-30T09:24:24.676Z'),
      lastDurableWallTime: ISODate('2024-01-30T09:24:24.676Z'),
      syncSourceHost: 'example2.com:27017',
      syncSourceId: 1,
      infoMessage: '',
      configVersion: 20,
      configTerm: 161,
      self: true,
      lastHeartbeatMessage: ''
    },
    {
      _id: 3,
      name: 'example4.com:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 662,
      optime: { ts: Timestamp({ t: 1706606663, i: 2 }), t: Long('161') },
      optimeDurable: { ts: Timestamp({ t: 1706606663, i: 2 }), t: Long('161') },
      optimeDate: ISODate('2024-01-30T09:24:23.000Z'),
      optimeDurableDate: ISODate('2024-01-30T09:24:23.000Z'),
      lastAppliedWallTime: ISODate('2024-01-30T09:24:23.414Z'),
      lastDurableWallTime: ISODate('2024-01-30T09:24:23.414Z'),
      lastHeartbeat: ISODate('2024-01-30T09:24:23.506Z'),
      lastHeartbeatRecv: ISODate('2024-01-30T09:24:23.999Z'),
      pingMs: Long('175'),
      lastHeartbeatMessage: '',
      syncSourceHost: 'example2.com:27017',
      syncSourceId: 1,
      infoMessage: '',
      configVersion: 20,
      configTerm: 161
    },
    {
      _id: 4,
      name: 'example5.com:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 662,
      optime: { ts: Timestamp({ t: 1706606663, i: 1 }), t: Long('161') },
      optimeDurable: { ts: Timestamp({ t: 1706606663, i: 1 }), t: Long('161') },
      optimeDate: ISODate('2024-01-30T09:24:23.000Z'),
      optimeDurableDate: ISODate('2024-01-30T09:24:23.000Z'),
      lastAppliedWallTime: ISODate('2024-01-30T09:24:23.237Z'),
      lastDurableWallTime: ISODate('2024-01-30T09:24:23.237Z'),
      lastHeartbeat: ISODate('2024-01-30T09:24:23.385Z'),
      lastHeartbeatRecv: ISODate('2024-01-30T09:24:23.248Z'),
      pingMs: Long('115'),
      lastHeartbeatMessage: '',
      syncSourceHost: 'example2.com:27017',
      syncSourceId: 1,
      infoMessage: '',
      configVersion: 20,
      configTerm: 161
    },
    {
      _id: 5,
      name: 'example6.com:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 661,
      optime: { ts: Timestamp({ t: 1706606664, i: 2 }), t: Long('161') },
      optimeDurable: { ts: Timestamp({ t: 1706606664, i: 2 }), t: Long('161') },
      optimeDate: ISODate('2024-01-30T09:24:24.000Z'),
      optimeDurableDate: ISODate('2024-01-30T09:24:24.000Z'),
      lastAppliedWallTime: ISODate('2024-01-30T09:24:24.251Z'),
      lastDurableWallTime: ISODate('2024-01-30T09:24:24.251Z'),
      lastHeartbeat: ISODate('2024-01-30T09:24:24.729Z'),
      lastHeartbeatRecv: ISODate('2024-01-30T09:24:24.891Z'),
      pingMs: Long('328'),
      lastHeartbeatMessage: '',
      syncSourceHost: 'example2.com:27017',
      syncSourceId: 1,
      infoMessage: '',
      configVersion: 20,
      configTerm: 161
    }
  ]
}

example5.com is not a voting member, just a secondary node. I tried restarting the problem member, but the message is still present.



 Comments   
Comment by Chris Kelly [ 06/Feb/24 ]

Excellent, thanks for sharing the results.

I am going to assign this issue to the relevant team to consider this change.

Comment by Андрій К [ 06/Feb/24 ]

Hi chris.kelly@mongodb.com 
I attempted to modify the changeSyncSourceThresholdMillis setting using mongosh:

db.adminCommand( { getParameter: 1, "changeSyncSourceThresholdMillis": 1 } )   // check the current value (default: 5 ms)
db.adminCommand( { setParameter: 1, "changeSyncSourceThresholdMillis": 200 } ) // raise the threshold at runtime

After making this adjustment, the messages disappeared from the log. I then set the parameter in the mongod.conf file and restarted the server so the change persists across restarts:

setParameter:
  changeSyncSourceThresholdMillis: 200 

Thank you for the information. I now understand that this behavior is intentional, not a bug.

I suggest altering the logic so that this message, which is currently emitted after fetching each batch of oplog entries, is logged no more than once per second. This should reduce log pressure, especially during network latency spikes with the default changeSyncSourceThresholdMillis value of 5 ms.
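
To illustrate the idea (this is only a sketch of the suggested throttle, not actual server code):

// Hypothetical throttle: emit the "Cannot select sync source" message at most once per second.
let lastLoggedAtMs = 0;
function logCannotSelectSyncSource(candidate) {
  const now = Date.now();
  if (now - lastLoggedAtMs >= 1000) {   // suppress repeats within the same second
    lastLoggedAtMs = now;
    print(`Cannot select sync source because we are a voter and it is not: ${candidate}`);
  }
}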

Comment by Chris Kelly [ 05/Feb/24 ]

Hi andriikrymus@gmail.com!

Thanks for your report. Based on this, I see: 

  • This is a fairly new log line we added in 7.0 in SERVER-62921 (Improve sync source selection logging)
  • The underlying behavior (Candidate must be a voter if we are a voter) has been the same since 4.7 in SERVER-47449 (Re-evaluate sync source after fetching each batch of oplog entries)

Per SERVER-47449:

The node will use ping times and the new changeSyncSourceThresholdMillis server parameter to determine if the node could switch from syncing from a node in a different data center to a node within the same data center. This check will be skipped if the current sync source’s ping time satisfies changeSyncSourceThresholdMillis.

If changeSyncSourceThresholdMillis is set to 0, nodes should not re-evaluate their sync source.

I think it's possible this parameter can be adjusted to not trigger as often in your case. Let me know if that works out for you.
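
For illustration, a minimal mongosh sketch (assuming it is run on the affected node) that compares the current threshold with the ping times reported by replSetGetStatus; in the rs.status() output above the ping times range from roughly 115 ms to 328 ms, far above the 5 ms default, which is why the check fires on every oplog batch:

// Compare the sync-source change threshold with observed ping times to other members.
const thresholdMs = db.adminCommand(
  { getParameter: 1, changeSyncSourceThresholdMillis: 1 }
).changeSyncSourceThresholdMillis;

db.adminCommand({ replSetGetStatus: 1 }).members
  .filter(m => !m.self)                 // skip this node; it has no pingMs field
  .forEach(m => print(`${m.name}: pingMs=${m.pingMs}, threshold=${thresholdMs} ms`));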
