- Type: Bug
- Resolution: Fixed
- Priority: Major - P3
- Affects Version/s: None
- Component/s: Sharding
- Fully Compatible
- v4.4
- Sharding 2020-10-05
- 16
On both mongos and mongod, the ShardRegistry's updateReplSetHosts() is called identically from both onConfirmedSet() and onPossibleSet(). The only difference between these hooks is that onConfirmedSet() is called once the primary has been identified, whereas onPossibleSet() doesn't have this restriction. In practice this means that onPossibleSet() is often triggered after the first response comes back (from a secondary), with a connection string consisting of just that host (or the subset of the set's hosts that have replied).
Unfortunately, this causes the ShardRegistry to forget about the "missing" hosts, which can cause "No shard found for host: hostname:port" errors when getShardForHostNoReload() is called during ingress/egress to/from one of the "missing" hosts.
2020-09-18T11:36:45.154+00:00 I NETWORK 23729 [ReplicaSetMonitor-TaskExecutor] "ServerPingMonitor is now monitoring host","attr":{"host":"ip-10-122-79-156.ec2.internal:20521","replicaSet":"change_stream_update_lookup_read_concern"} 2020-09-18T11:36:45.154+00:00 I NETWORK 4333213 [ReplicaSetMonitor-TaskExecutor] "RSM Topology Change","attr":{"replicaSet":"change_stream_update_lookup_read_concern","newTopologyDescription":"{ id: \"128b473b-449e-46c0-ab92-5c18a9d27482\", topologyType: \"ReplicaSetNoPrimary\", servers: { ip-10-122-79-156.ec2.internal:20520: { address: \"ip-10-122-79-156.ec2.internal:20520\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} }, ip-10-122-79-156.ec2.internal:20521: { address: \"ip-10-122-79-156.ec2.internal:20521\", topologyVersion: { processId: ObjectId('5f649bbd2fd508258a6cd48d'), counter: 6 }, roundTripTime: 1061, lastWriteDate: new Date(1600429005000), opTime: { ts: Timestamp(1600429005, 1), t: 1 }, type: \"RSSecondary\", minWireVersion: 10, maxWireVersion: 10, me: \"ip-10-122-79-156.ec2.internal:20521\", setName: \"change_stream_update_lookup_read_concern\", setVersion: 3, primary: \"ip-10-122-79-156.ec2.internal:20520\", lastUpdateTime: new Date(1600429005154), logicalSessionTimeoutMinutes: 30, hosts: { 0: \"ip-10-122-79-156.ec2.internal:20520\" }, arbiters: {}, passives: { 0: \"ip-10-122-79-156.ec2.internal:20521\", 1: \"ip-10-122-79-156.ec2.internal:20522\" }, tags: { tag: \"closestSecondary\" } }, ip-10-122-79-156.ec2.internal:20522: { address: \"ip-10-122-79-156.ec2.internal:20522\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} } }, logicalSessionTimeoutMinutes: 30, setName: \"change_stream_update_lookup_read_concern\", compatible: true }","previousTopologyDescription":"{ id: \"128b473b-449e-46c0-ab92-5c18a9d27482\", topologyType: 
\"Unknown\", servers: { ip-10-122-79-156.ec2.internal:20520: { address: \"ip-10-122-79-156.ec2.internal:20520\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} }, ip-10-122-79-156.ec2.internal:20521: { address: \"ip-10-122-79-156.ec2.internal:20521\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} }, ip-10-122-79-156.ec2.internal:20522: { address: \"ip-10-122-79-156.ec2.internal:20522\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} } }, compatible: true }"} 2020-09-18T11:36:45.154+00:00 I SHARDING 22732 [ShardRegistry-0] "Updating shard connection string on shard registry","attr":{"shardId":"change_stream_update_lookup_read_concern","newShardConnectionString":"change_stream_update_lookup_read_concern/ip-10-122-79-156.ec2.internal:20521","oldShardConnectionString":"change_stream_update_lookup_read_concern/ip-10-122-79-156.ec2.internal:20520,ip-10-122-79-156.ec2.internal:20521,ip-10-122-79-156.ec2.internal:20522"} 2020-09-18T11:36:45.233+00:00 I NETWORK 23729 [ReplicaSetMonitor-TaskExecutor] "ServerPingMonitor is now monitoring host","attr":{"host":"ip-10-122-79-156.ec2.internal:20520","replicaSet":"change_stream_update_lookup_read_concern"} 2020-09-18T11:36:45.233+00:00 I NETWORK 4333213 [ReplicaSetMonitor-TaskExecutor] "RSM Topology Change","attr":{"replicaSet":"change_stream_update_lookup_read_concern","newTopologyDescription":"{ id: \"128b473b-449e-46c0-ab92-5c18a9d27482\", topologyType: \"ReplicaSetWithPrimary\", servers: { ip-10-122-79-156.ec2.internal:20520: { address: \"ip-10-122-79-156.ec2.internal:20520\", topologyVersion: { processId: ObjectId('5f649bbdecd281fc1f7d176e'), counter: 10 }, roundTripTime: 80061, lastWriteDate: new Date(1600429005000), opTime: { ts: 
Timestamp(1600429005, 1), t: 1 }, type: \"RSPrimary\", minWireVersion: 10, maxWireVersion: 10, me: \"ip-10-122-79-156.ec2.internal:20520\", setName: \"change_stream_update_lookup_read_concern\", setVersion: 3, electionId: ObjectId('7fffffff0000000000000001'), primary: \"ip-10-122-79-156.ec2.internal:20520\", lastUpdateTime: new Date(1600429005233), logicalSessionTimeoutMinutes: 30, hosts: { 0: \"ip-10-122-79-156.ec2.internal:20520\" }, arbiters: {}, passives: { 0: \"ip-10-122-79-156.ec2.internal:20521\", 1: \"ip-10-122-79-156.ec2.internal:20522\" }, tags: { tag: \"primary\" } }, ip-10-122-79-156.ec2.internal:20521: { address: \"ip-10-122-79-156.ec2.internal:20521\", topologyVersion: { processId: ObjectId('5f649bbd2fd508258a6cd48d'), counter: 6 }, roundTripTime: 1061, lastWriteDate: new Date(1600429005000), opTime: { ts: Timestamp(1600429005, 1), t: 1 }, type: \"RSSecondary\", minWireVersion: 10, maxWireVersion: 10, me: \"ip-10-122-79-156.ec2.internal:20521\", setName: \"change_stream_update_lookup_read_concern\", setVersion: 3, primary: \"ip-10-122-79-156.ec2.internal:20520\", lastUpdateTime: new Date(1600429005154), logicalSessionTimeoutMinutes: 30, hosts: { 0: \"ip-10-122-79-156.ec2.internal:20520\" }, arbiters: {}, passives: { 0: \"ip-10-122-79-156.ec2.internal:20521\", 1: \"ip-10-122-79-156.ec2.internal:20522\" }, tags: { tag: \"closestSecondary\" } }, ip-10-122-79-156.ec2.internal:20522: { address: \"ip-10-122-79-156.ec2.internal:20522\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} } }, logicalSessionTimeoutMinutes: 30, setName: \"change_stream_update_lookup_read_concern\", compatible: true, maxSetVersion: 3, maxElectionId: ObjectId('7fffffff0000000000000001') }","previousTopologyDescription":"{ id: \"128b473b-449e-46c0-ab92-5c18a9d27482\", topologyType: \"ReplicaSetNoPrimary\", servers: { ip-10-122-79-156.ec2.internal:20520: { address: 
\"ip-10-122-79-156.ec2.internal:20520\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} }, ip-10-122-79-156.ec2.internal:20521: { address: \"ip-10-122-79-156.ec2.internal:20521\", topologyVersion: { processId: ObjectId('5f649bbd2fd508258a6cd48d'), counter: 6 }, roundTripTime: 1061, lastWriteDate: new Date(1600429005000), opTime: { ts: Timestamp(1600429005, 1), t: 1 }, type: \"RSSecondary\", minWireVersion: 10, maxWireVersion: 10, me: \"ip-10-122-79-156.ec2.internal:20521\", setName: \"change_stream_update_lookup_read_concern\", setVersion: 3, primary: \"ip-10-122-79-156.ec2.internal:20520\", lastUpdateTime: new Date(1600429005154), logicalSessionTimeoutMinutes: 30, hosts: { 0: \"ip-10-122-79-156.ec2.internal:20520\" }, arbiters: {}, passives: { 0: \"ip-10-122-79-156.ec2.internal:20521\", 1: \"ip-10-122-79-156.ec2.internal:20522\" }, tags: { tag: \"closestSecondary\" } }, ip-10-122-79-156.ec2.internal:20522: { address: \"ip-10-122-79-156.ec2.internal:20522\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} } }, logicalSessionTimeoutMinutes: 30, setName: \"change_stream_update_lookup_read_concern\", compatible: true }"} 2020-09-18T11:36:45.233+00:00 I SHARDING 471691 [ReplicaSetMonitor-TaskExecutor] "Updating the shard registry with confirmed replica set","attr":{"connectionString":"change_stream_update_lookup_read_concern/ip-10-122-79-156.ec2.internal:20520,ip-10-122-79-156.ec2.internal:20521,ip-10-122-79-156.ec2.internal:20522"} 2020-09-18T11:36:45.234+00:00 I NETWORK 23729 [ReplicaSetMonitor-TaskExecutor] "ServerPingMonitor is now monitoring host","attr":{"host":"ip-10-122-79-156.ec2.internal:20522","replicaSet":"change_stream_update_lookup_read_concern"} 2020-09-18T11:36:45.234+00:00 I NETWORK 4333213 [ReplicaSetMonitor-TaskExecutor] "RSM Topology 
Change","attr":{"replicaSet":"change_stream_update_lookup_read_concern","newTopologyDescription":"{ id: \"128b473b-449e-46c0-ab92-5c18a9d27482\", topologyType: \"ReplicaSetWithPrimary\", servers: { ip-10-122-79-156.ec2.internal:20520: { address: \"ip-10-122-79-156.ec2.internal:20520\", topologyVersion: { processId: ObjectId('5f649bbdecd281fc1f7d176e'), counter: 10 }, roundTripTime: 80061, lastWriteDate: new Date(1600429005000), opTime: { ts: Timestamp(1600429005, 1), t: 1 }, type: \"RSPrimary\", minWireVersion: 10, maxWireVersion: 10, me: \"ip-10-122-79-156.ec2.internal:20520\", setName: \"change_stream_update_lookup_read_concern\", setVersion: 3, electionId: ObjectId('7fffffff0000000000000001'), primary: \"ip-10-122-79-156.ec2.internal:20520\", lastUpdateTime: new Date(1600429005233), logicalSessionTimeoutMinutes: 30, hosts: { 0: \"ip-10-122-79-156.ec2.internal:20520\" }, arbiters: {}, passives: { 0: \"ip-10-122-79-156.ec2.internal:20521\", 1: \"ip-10-122-79-156.ec2.internal:20522\" }, tags: { tag: \"primary\" } }, ip-10-122-79-156.ec2.internal:20521: { address: \"ip-10-122-79-156.ec2.internal:20521\", topologyVersion: { processId: ObjectId('5f649bbd2fd508258a6cd48d'), counter: 6 }, roundTripTime: 1061, lastWriteDate: new Date(1600429005000), opTime: { ts: Timestamp(1600429005, 1), t: 1 }, type: \"RSSecondary\", minWireVersion: 10, maxWireVersion: 10, me: \"ip-10-122-79-156.ec2.internal:20521\", setName: \"change_stream_update_lookup_read_concern\", setVersion: 3, primary: \"ip-10-122-79-156.ec2.internal:20520\", lastUpdateTime: new Date(1600429005154), logicalSessionTimeoutMinutes: 30, hosts: { 0: \"ip-10-122-79-156.ec2.internal:20520\" }, arbiters: {}, passives: { 0: \"ip-10-122-79-156.ec2.internal:20521\", 1: \"ip-10-122-79-156.ec2.internal:20522\" }, tags: { tag: \"closestSecondary\" } }, ip-10-122-79-156.ec2.internal:20522: { address: \"ip-10-122-79-156.ec2.internal:20522\", topologyVersion: { processId: ObjectId('5f649bbd2b79b3daf0faedeb'), counter: 5 }, 
roundTripTime: 80750, lastWriteDate: new Date(1600429005000), opTime: { ts: Timestamp(1600429005, 1), t: 1 }, type: \"RSSecondary\", minWireVersion: 10, maxWireVersion: 10, me: \"ip-10-122-79-156.ec2.internal:20522\", setName: \"change_stream_update_lookup_read_concern\", setVersion: 3, primary: \"ip-10-122-79-156.ec2.internal:20520\", lastUpdateTime: new Date(1600429005234), logicalSessionTimeoutMinutes: 30, hosts: { 0: \"ip-10-122-79-156.ec2.internal:20520\" }, arbiters: {}, passives: { 0: \"ip-10-122-79-156.ec2.internal:20521\", 1: \"ip-10-122-79-156.ec2.internal:20522\" }, tags: { tag: \"fartherSecondary\" } } }, logicalSessionTimeoutMinutes: 30, setName: \"change_stream_update_lookup_read_concern\", compatible: true, maxSetVersion: 3, maxElectionId: ObjectId('7fffffff0000000000000001') }","previousTopologyDescription":"{ id: \"128b473b-449e-46c0-ab92-5c18a9d27482\", topologyType: \"ReplicaSetWithPrimary\", servers: { ip-10-122-79-156.ec2.internal:20520: { address: \"ip-10-122-79-156.ec2.internal:20520\", topologyVersion: { processId: ObjectId('5f649bbdecd281fc1f7d176e'), counter: 10 }, roundTripTime: 80061, lastWriteDate: new Date(1600429005000), opTime: { ts: Timestamp(1600429005, 1), t: 1 }, type: \"RSPrimary\", minWireVersion: 10, maxWireVersion: 10, me: \"ip-10-122-79-156.ec2.internal:20520\", setName: \"change_stream_update_lookup_read_concern\", setVersion: 3, electionId: ObjectId('7fffffff0000000000000001'), primary: \"ip-10-122-79-156.ec2.internal:20520\", lastUpdateTime: new Date(1600429005233), logicalSessionTimeoutMinutes: 30, hosts: { 0: \"ip-10-122-79-156.ec2.internal:20520\" }, arbiters: {}, passives: { 0: \"ip-10-122-79-156.ec2.internal:20521\", 1: \"ip-10-122-79-156.ec2.internal:20522\" }, tags: { tag: \"primary\" } }, ip-10-122-79-156.ec2.internal:20521: { address: \"ip-10-122-79-156.ec2.internal:20521\", topologyVersion: { processId: ObjectId('5f649bbd2fd508258a6cd48d'), counter: 6 }, roundTripTime: 1061, lastWriteDate: new 
Date(1600429005000), opTime: { ts: Timestamp(1600429005, 1), t: 1 }, type: \"RSSecondary\", minWireVersion: 10, maxWireVersion: 10, me: \"ip-10-122-79-156.ec2.internal:20521\", setName: \"change_stream_update_lookup_read_concern\", setVersion: 3, primary: \"ip-10-122-79-156.ec2.internal:20520\", lastUpdateTime: new Date(1600429005154), logicalSessionTimeoutMinutes: 30, hosts: { 0: \"ip-10-122-79-156.ec2.internal:20520\" }, arbiters: {}, passives: { 0: \"ip-10-122-79-156.ec2.internal:20521\", 1: \"ip-10-122-79-156.ec2.internal:20522\" }, tags: { tag: \"closestSecondary\" } }, ip-10-122-79-156.ec2.internal:20522: { address: \"ip-10-122-79-156.ec2.internal:20522\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} } }, logicalSessionTimeoutMinutes: 30, setName: \"change_stream_update_lookup_read_concern\", compatible: true, maxSetVersion: 3, maxElectionId: ObjectId('7fffffff0000000000000001') }"} 2020-09-18T11:36:45.234+00:00 I SHARDING 471691 [ReplicaSetMonitor-TaskExecutor] "Updating the shard registry with confirmed replica set","attr":{"connectionString":"change_stream_update_lookup_read_concern/ip-10-122-79-156.ec2.internal:20520,ip-10-122-79-156.ec2.internal:20521,ip-10-122-79-156.ec2.internal:20522"} 2020-09-18T11:36:45.241+00:00 I SHARDING 22732 [ShardRegistry-0] "Updating shard connection string on shard registry","attr":{"shardId":"change_stream_update_lookup_read_concern","newShardConnectionString":"change_stream_update_lookup_read_concern/ip-10-122-79-156.ec2.internal:20520,ip-10-122-79-156.ec2.internal:20521,ip-10-122-79-156.ec2.internal:20522","oldShardConnectionString":"change_stream_update_lookup_read_concern/ip-10-122-79-156.ec2.internal:20521"} 2020-09-18T11:36:45.242+00:00 W QUERY 20478 [conn33] "getMore command executor error","attr":{"error":{"code":70,"codeName":"ShardNotFound","errmsg":"No shard found for host: 
ip-10-122-79-156.ec2.internal:20520"},"stats":{}}
Other similar errors are possible where the connection string is unexpectedly missing hosts, for example:
d22026| 2020-09-21T13:36:09.518+00:00 I NETWORK 23729 [ReplicaSetMonitor-TaskExecutor] "ServerPingMonitor is now monitoring host","attr":{"host":"ip-10-122-25-5:22023","replicaSet":"shard_aware_init_secondaries-configRS"} d22026| 2020-09-21T13:36:09.518+00:00 I NETWORK 4333213 [ReplicaSetMonitor-TaskExecutor] "RSM Topology Change","attr":{"replicaSet":"shard_aware_init_secondaries-configRS","newTopologyDescription":"{ id: \"6c5068da-31b1-45ee-a837-5d94bb2745ba\", topologyType: \"ReplicaSetNoPrimary\", servers: { ip-10-122-25-5:22021: { address: \"ip-10-122-25-5:22021\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} }, ip-10-122-25-5:22022: { address: \"ip-10-122-25-5:22022\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} }, ip-10-122-25-5:22023: { address: \"ip-10-122-25-5:22023\", topologyVersion: { processId: ObjectId('5f68ac28884a6c9f4c80f137'), counter: 5 }, roundTripTime: 1327, lastWriteDate: new Date(1600695369000), opTime: { ts: Timestamp(1600695369, 1), t: 1 }, type: \"RSSecondary\", minWireVersion: 10, maxWireVersion: 10, me: \"ip-10-122-25-5:22023\", setName: \"shard_aware_init_secondaries-configRS\", setVersion: 5, primary: \"ip-10-122-25-5:22021\", lastUpdateTime: new Date(1600695369518), logicalSessionTimeoutMinutes: 30, hosts: { 0: \"ip-10-122-25-5:22021\", 1: \"ip-10-122-25-5:22022\", 2: \"ip-10-122-25-5:22023\" }, arbiters: {}, passives: {} } }, logicalSessionTimeoutMinutes: 30, setName: \"shard_aware_init_secondaries-configRS\", compatible: true }","previousTopologyDescription":"{ id: \"6c5068da-31b1-45ee-a837-5d94bb2745ba\", topologyType: \"Unknown\", servers: { ip-10-122-25-5:22021: { address: \"ip-10-122-25-5:22021\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: 
{} }, ip-10-122-25-5:22022: { address: \"ip-10-122-25-5:22022\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} }, ip-10-122-25-5:22023: { address: \"ip-10-122-25-5:22023\", type: \"Unknown\", minWireVersion: 0, maxWireVersion: 0, lastUpdateTime: new Date(-9223372036854775808), hosts: {}, arbiters: {}, passives: {} } }, compatible: true }"} d22026| 2020-09-21T13:36:09.519+00:00 I SHARDING 22732 [ReplicaSetMonitor-TaskExecutor] "Updating shard connection string on shard registry","attr":{"shardId":"config","newShardConnectionString":"shard_aware_init_secondaries-configRS/ip-10-122-25-5:22023","oldShardConnectionString":"shard_aware_init_secondaries-configRS/ip-10-122-25-5:22021,ip-10-122-25-5:22022,ip-10-122-25-5:22023"} uncaught exception: Error: ["shard_aware_init_secondaries-configRS/ip-10-122-25-5:22021,ip-10-122-25-5:22022,ip-10-122-25-5:22023"] != ["shard_aware_init_secondaries-configRS/ip-10-122-25-5:22023"] are not equal : doassert@src/mongo/shell/assert.js:20:14 assert.eq@src/mongo/shell/assert.js:179:9 @jstests/sharding/shard_aware_init_secondaries.js:49:1 @jstests/sharding/shard_aware_init_secondaries.js:7:2 failed to load: jstests/sharding/shard_aware_init_secondaries.js
Instead, it would be better if the RSM (ReplicaSetMonitor) indicated to the ShardRegistry whether the given connection string is "complete" (responses received from all hosts) or "partial" (the connection string may be missing hosts that haven't yet responded), e.g. by passing a boolean to updateReplSetHosts(). This would allow the ShardRegistry to "merge" partial connection strings into the most recently received complete connection string (for that set), while still being able to "forget" hosts that have been permanently removed from the set, because a complete connection string would simply replace the previous one rather than being merged in.
Assuming that onConfirmedSet() corresponds to "complete" and onPossibleSet() corresponds to "partial", this merging could be done in the ShardRegistry. If that isn't the case, then the RSM will need to track whether the State's connection string is partial or complete, and use that in both onConfirmedSet() and onPossibleSet(). In either case, however, it would probably be better if the merging could be done in the RSM, because this would avoid complicating the existing RSM-ShardRegistry API: the ShardRegistry can just continue to assume that it receives only complete and authoritative connection strings, and update itself based on that.
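To make the proposed replace-vs-merge semantics concrete, here is a minimal standalone sketch. It is not the real ShardRegistry or RSM code: `HostTracker`, `knowsHost()`, and the `complete` parameter are hypothetical names used only for illustration, and hosts are modelled as plain strings rather than parsed connection strings. It also assumes that successive partial updates accumulate until the next complete update arrives, which is one possible reading of "merge into the most recently received complete connection string".

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Hypothetical stand-in for the per-replica-set host bookkeeping discussed
// in this ticket. It is NOT the actual ShardRegistry implementation.
class HostTracker {
public:
    // `complete` is the hypothetical boolean proposed above:
    //   true  -> every host has responded; the list is authoritative.
    //   false -> the list may be missing hosts that have not replied yet.
    void updateReplSetHosts(const std::string& setName,
                            const std::set<std::string>& hosts,
                            bool complete) {
        assert(!setName.empty());
        std::set<std::string>& known = _hostsBySet[setName];
        if (complete) {
            // Authoritative update: replace, so hosts that were permanently
            // removed from the set are forgotten.
            known = hosts;
        } else {
            // Partial update: only add hosts, never drop ones we already know.
            known.insert(hosts.begin(), hosts.end());
        }
    }

    // Simplified analogue of a host-to-shard lookup: reports whether the
    // host is currently known for the given set.
    bool knowsHost(const std::string& setName, const std::string& host) const {
        auto it = _hostsBySet.find(setName);
        return it != _hostsBySet.end() && it->second.count(host) > 0;
    }

private:
    std::map<std::string, std::set<std::string>> _hostsBySet;
};
```

With these semantics, a partial update that names only one secondary leaves the other members resolvable, whereas a later complete update that omits a host still removes it. The same logic could live on the RSM side instead, in which case the registry would keep receiving only complete lists.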
- causes
  SERVER-51257 ShardRegistry should properly handle "possible" RSM updates for the config shard (Closed)
- is depended on by
  SERVER-50907 Shard objects should cache their own connection strings (Closed)