[SERVER-45321] Mongo suddenly remembered a removed shard Created: 30/Dec/19 Updated: 27/Oct/23 Resolved: 12/Jan/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Alexander Pyhalov | Assignee: | Dmitry Agranat |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: |
| Description |
|
We run a sharded MongoDB 4.0.13 cluster on Ubuntu 18.04 x64. A single (non-sharded) mongod (srv211) was initially added to the cluster; later, three other servers were added and the collections were moved to the three new shards. The srv211 shard was then removed and all mongos services were restarted (the shard and config server instances were not). But suddenly, after several months of uptime, the cluster became blocked and we got the following error messages in one of our shard logs:
{{2019-12-30T19:35:18.079+0000 I SHARDING [conn123235] received splitChunk request: { splitChunk: "project_prod_user.player_maps", from: "project-db-3", keyPattern: { pers_id: "hashed" }, epoch: ObjectId('5dafd8dc63ae2c18230f9dce'), shardVersion: [ Timestamp(10814, 0), ObjectId('5dafd8dc63ae2c18230f9dce') ], min: { pers_id: 4618768445950617740 }, max: { pers_id: 4620991399317296785 }, splitKeys: [ { pers_id: 4619791202844536983 }, { pers_id: 4620989830407096034 } ], $clusterTime: { clusterTime: Timestamp(1577734518, 5), signature: { hash: BinData(0, 4219F00DC9FE995B5CBBBC9B5E39EBFF35C51A10), keyId: 6750833113431015451 }}, $configServerState: { opTime: { ts: Timestamp(1577734517, 30), t: 5 } }, $db: "admin" }}}
I've examined the config database but found no mention of srv211 in db.chunks or db.shards. Is this a known issue? |
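For reference, a minimal sketch of how such a check can be done from the mongo shell connected to a mongos (the config.databases query is an addition beyond what the report mentions, but it is where a stale primary-shard reference would show up):

// connect to a mongos with the mongo shell, then:
use config
db.shards.find({ _id: "srv211" })            // shard registry entry for the removed shard
db.chunks.find({ shard: "srv211" }).count()  // chunks still assigned to the removed shard
db.databases.find({ primary: "srv211" })     // databases whose primary shard still points at srv211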
| Comments |
| Comment by Alexander Pyhalov [ 14/Jan/20 ] |
|
I see the following statement in the documentation: "If you use the movePrimary command to move un-sharded collections, you must either restart all mongos instances, or use the flushRouterConfig command on all mongos instances before reading or writing any data to any unsharded collections that were moved. This action ensures that the mongos is aware of the new shard for these collections." This is MongoDB 4.0.13, so I thought that restarting all mongos instances was enough. |
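For completeness, flushing the router's cached routing table is a single admin command that has to be run against every mongos (a minimal sketch; the restart alternative mentioned above achieves the same effect):

// run on each mongos, against the admin database:
db.adminCommand({ flushRouterConfig: 1 })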
| Comment by Dmitry Agranat [ 12/Jan/20 ] |
|
Hi alp, thank you for providing the requested steps; it's unfortunate you cannot reproduce this issue. I do not see the execution of the flushRouterConfig command in these steps. As per our documentation, users should run flushRouterConfig after movePrimary. Regards,
| Comment by Alexander Pyhalov [ 10/Jan/20 ] |
|
These are the full mongo logs for the 3 servers: https://drive.google.com/open?id=1sQ8qjngVtNAExL0YvcDGIfo1QSD7pSot . The mongos logs are in var/log/mongodb/mongos.log, the shard server logs in var/log/mongodb/mongod.log, and the config server logs in var/log/mongodb/cfg_mongod.log. The issues started at about 2019-12-30T19:35; you can see a lot of 'Failed to refresh metadata for collection survival_prod_user.player_maps :: caused by :: ShardNotFound: Shard srv211 not found' messages in the logs of the third server (3/var/log/mongodb/mongod.log). |
| Comment by Alexander Pyhalov [ 09/Jan/20 ] |
|
I doubt I can reproduce it. I don't have logs for the transition period, but I can share logs from the failure (via some private channel). The steps to migrate from the non-sharded installation to the sharded cluster were the following (a rough sketch of the corresponding shell commands is given below):
1) We created the necessary indexes on the srv211 mongod instance (non-sharded).
2) Restarted mongod as a shardsvr on port 27017 and created the cluster auth file.
Finally, we turned off mongod on srv211 and restarted the mongos instances on db-1 - db-3.
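For context, a rough sketch of the mongos-side commands such a migration typically involves (the shard names, hostnames, and the hashed key are assumptions pieced together from this ticket, not the exact commands that were run):

// on a mongos, as a cluster administrator:
sh.addShard("db-1:27017")                                                  // repeat for db-2 and db-3 (hosts assumed)
sh.enableSharding("project_prod_user")
sh.shardCollection("project_prod_user.player_maps", { pers_id: "hashed" })
db.adminCommand({ movePrimary: "project_prod_user", to: "project-db-1" })  // move unsharded collections off srv211 (target shard assumed)
db.adminCommand({ removeShard: "srv211" })                                 // repeat until "state" is "completed"
// afterwards: run flushRouterConfig on every mongos, or restart them all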
|
| Comment by Dmitry Agranat [ 31/Dec/19 ] |
|
Hi alp, a ShardNotFound message can be caused by various reasons. For us to be able to investigate this, please provide the exact steps performed, in the form of a numbered list, along with mongod and mongos logs from all members of the shard. If this data is no longer available, please provide a simple reproducer showing the reported issue, with details on how to execute it. Could you please run the flushRouterConfig command on the mongos - does it fix the issue? Thanks,