[SERVER-45321] MongoDB suddenly remembered a removed shard Created: 30/Dec/19  Updated: 27/Oct/23  Resolved: 12/Jan/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Alexander Pyhalov Assignee: Dmitry Agranat
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

We ran a sharded MongoDB 4.0.13 cluster on Ubuntu 18.04 x64.

A single (non-sharded) mongod (srv211) was initially added to the cluster; later, 3 other mongod servers were added and the collections were moved to the 3 new shards. The srv211 shard was then removed and all mongos services were restarted (shard/config server instances were not).

But suddenly, after several months of uptime, the cluster became blocked and we got the following error messages in one of our shard logs:

 

2019-12-30T19:35:18.079+0000 I SHARDING [conn123235] received splitChunk request: { splitChunk: "project_prod_user.player_maps", from: "project-db-3", keyPattern: { pers_id: "hashed" }, epoch: ObjectId('5dafd8dc63ae2c18230f9dce'), shardVersion: [ Timestamp(10814, 0), ObjectId('5dafd8dc63ae2c18230f9dce') ], min: { pers_id: 4618768445950617740 }, max: { pers_id: 4620991399317296785 }, splitKeys: [ { pers_id: 4619791202844536983 }, { pers_id: 4620989830407096034 } ], $clusterTime: { clusterTime: Timestamp(1577734518, 5), signature: { hash: BinData(0, 4219F00DC9FE995B5CBBBC9B5E39EBFF35C51A10), keyId: 6750833113431015451 } }, $configServerState: { opTime: { ts: Timestamp(1577734517, 30), t: 5 } }, $db: "admin" }
2019-12-30T19:35:18.085+0000 I SHARDING [conn123235] distributed lock 'project_prod_user.player_maps' acquired for 'splitting chunk [{ pers_id: 4618768445950617740 }, { pers_id: 4620991399317296785 }) in project_prod_user.player_maps', ts : 5e0a5176dc25d0219069dd07
2019-12-30T19:35:18.136+0000 I SHARDING [conn123235] distributed lock with ts: 5e0a5176dc25d0219069dd07' unlocked.
2019-12-30T19:35:18.148+0000 I SHARDING [conn125292] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.148+0000 I SHARDING [conn125292] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.149+0000 I SHARDING [conn125258] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.150+0000 I SHARDING [conn125295] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.150+0000 I SHARDING [conn125295] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.151+0000 I SHARDING [conn125295] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.151+0000 I SHARDING [conn125295] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.151+0000 I SHARDING [conn125258] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.151+0000 I SHARDING [conn125258] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.152+0000 I SHARDING [conn125295] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.152+0000 I SHARDING [conn125295] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.153+0000 I SHARDING [conn125258] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.153+0000 I SHARDING [conn125258] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.153+0000 I SHARDING [conn125295] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.153+0000 I SHARDING [conn125295] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.154+0000 I SHARDING [conn125258] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.154+0000 I SHARDING [conn125258] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.154+0000 I SHARDING [conn125295] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.154+0000 I SHARDING [conn125295] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.155+0000 I SHARDING [conn125258] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.155+0000 I SHARDING [conn125258] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.155+0000 I SHARDING [conn125295] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.155+0000 I SHARDING [conn125295] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.156+0000 I SHARDING [conn125258] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.156+0000 I SHARDING [conn125258] Failed to handle stale version exception :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.168+0000 I SHARDING [conn125295] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.171+0000 I SHARDING [conn125295] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found
2019-12-30T19:35:18.174+0000 I SHARDING [conn125295] Failed to refresh metadata for collectionproject_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found 

 

I've examined the config database, but found no mention of srv211 in db.chunks or db.shards.
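
For illustration, checks of this kind look roughly as follows (run through mongos against the config database; config.databases is also worth inspecting, since movePrimary updates the primary shard recorded there):

// Run through mongos; look for any remaining reference to the removed shard
use config
db.shards.find({ _id: "srv211" })             // shard registry entry
db.chunks.find({ shard: "srv211" }).count()   // chunks still assigned to srv211
db.databases.find({ primary: "srv211" })      // databases whose primary shard is still srv211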

Is it a known issue?



 Comments   
Comment by Alexander Pyhalov [ 14/Jan/20 ]

I see the following statement in the documentation:

If you use the movePrimary command to move un-sharded collections, you must either restart all mongos instances, or use the flushRouterConfig command on all mongos instances before reading or writing any data to any unsharded collections that were moved. This action ensures that the mongos is aware of the new shard for these collections.

This is MongoDB 4.0.13, so I thought that restarting all mongos instances would be enough.

Comment by Dmitry Agranat [ 12/Jan/20 ]

Hi alp, thank you for providing the requested steps; it's unfortunate that you cannot reproduce this issue.

I do not see the flushRouterConfig command being executed anywhere in these steps. As per our documentation, users should run flushRouterConfig after movePrimary.
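
For reference, a minimal sketch of that step, run against the admin database on every mongos (the command marks the cached routing table as stale so it is reloaded from the config servers on the next operation):

// On each mongos, after movePrimary has completed:
db.adminCommand( { flushRouterConfig: 1 } )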

Regards,
Dima

Comment by Alexander Pyhalov [ 10/Jan/20 ]

These are the full mongo logs for the 3 servers: https://drive.google.com/open?id=1sQ8qjngVtNAExL0YvcDGIfo1QSD7pSot .

The mongos logs are in var/log/mongodb/mongos.log, the shard server logs in var/log/mongodb/mongod.log, and the config server logs in var/log/mongodb/cfg_mongod.log.

The issues started at about 2019-12-30T19:35; you can see a lot of 'Failed to refresh metadata for collectionsurvival_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found' messages in the logs of the third server, 3/var/log/mongodb/mongod.log.

Comment by Alexander Pyhalov [ 09/Jan/20 ]

I doubt I can reproduce it. I don't have logs for the transition period, but I can share the logs from the time of the failure (via some private channel).

The steps to migrate from the non-sharded installation to the sharded cluster were the following:

1) We created the necessary hashed indexes on the srv211 mongod instance (non-sharded):
db.player_mails.createIndex({pers_id:"hashed"}, {background: true })
db.player_maps.createIndex({pers_id:"hashed"}, {background: true })
db.player_patches.createIndex({pers_id:"hashed"}, {background: true })
db.players.createIndex({pers_id:"hashed"}, {background: true })
db.pvp_maps.createIndex({pers_id:"hashed"}, {background: true })

2) Restarted mongod as a shardsvr on port 27017 and created the cluster auth file
3) Added srv211 as a shard: sh.addShard("srv211/192.168.100.201:27017")
4) Added the new cluster nodes:
sh.addShard("db-1/db-1.internal:27018")
sh.addShard("db-2/db-2.internal:27018")
sh.addShard("db-3/db-3.internal:27018")
5) Enabled sharding for our dbs (sh.enableSharding("dbname") for each db) and sharded each collection on pers_id:
sh.shardCollection("db.collection", {"pers_id": "hashed"})
6) Started removing srv211:
db.adminCommand( { removeShard: "srv211" } )

Later, when removeShard reported that there was no data left on srv211, I ran the following (the full sequence is also sketched below):
db.adminCommand( { movePrimary: "dbname", to: "db-1" })


I then ran removeShard once again:
db.adminCommand( { removeShard: "srv211" } )

Turned off mongod on srv211 and restarted mongos on db-1 through db-3.
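
In consolidated form, a sketch of the removal sequence described above ("dbname" stands for each affected database; removeShard reports its progress via the state field):

// On mongos (admin database): start draining the shard
db.adminCommand( { removeShard: "srv211" } )        // first call reports state "started", later calls "ongoing"

// After the sharded data is drained, move the primary of each remaining db off srv211
db.adminCommand( { movePrimary: "dbname", to: "db-1" } )

// Re-run removeShard until it reports state "completed"
db.adminCommand( { removeShard: "srv211" } )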

 

 

Comment by Dmitry Agranat [ 31/Dec/19 ]

Hi alp,

A ShardNotFound message can be caused by various reasons. For us to be able to investigate this, please provide the exact steps you performed, in the form of a numbered list, along with the mongod and mongos logs from all members of the shard.

If this data is no longer available, please provide a simple reproducer showing the reported issue, with details on how to execute it.

Could you please run the flushRouterConfig command on the mongos instances - does it fix the issue?

Thanks,
Dima
