Type: Bug
Resolution: Works as Designed
Priority: Major - P3
None
Affects Version/s: None
Component/s: Sharding
Labels: None
ALL
Hi, we are using a MongoDB sharded cluster running version 4.2.1.
Architecture:
3 mongos
Config server running as a replica set (1 primary + 2 secondaries)
2 shards, each a 3-node replica set (1 primary + 2 secondaries)
Since shard1 and shard2 are underutilized, we decided to remove shard2.
We did the following:
1) We issued removeShard from mongos and also moved the databases to the other shard.
Of the 2 sharded collections, all chunks belonging to one collection were drained to the other shard.
However, chunk migration is failing for the other collection. The balancer is not moving its chunks and logs the following messages:
2020-02-24T15:56:48.481+0000 I SHARDING [Balancer] distributed lock 'keychain.eg_keyring' acquired for 'Migrating chunk(s) in collection keychain.eg_keyring', ts : 5de584eedab8c4c434adabb5
2020-02-24T15:56:48.546+0000 I SHARDING [TransactionCoordinator] distributed lock with ts: '5de584eedab8c4c434adabb5' and _id: 'keychain.eg_keyring' unlocked.
2020-02-24T15:56:48.549+0000 I SHARDING [Balancer] Balancer move keychain.eg_keyring: [{ rId: UUID("80460000-0000-0000-0000-000000000000") }, { rId: UUID("80480000-0000-0000-0000-000000000000") }), from test-mongodb-egdp-keychain-01-shard02, to test-mongodb-egdp-keychain-01-shard01 failed :: caused by :: OperationFailed: Data transfer error: migrate failed: Location51008: operation was interrupted
2020-02-24T15:56:48.550+0000 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-10-0-212-244:27017-2020-02-24T15:56:48.550+0000-5e53f240dab8c4c434ec8b37", server: "ip-10-0-212-244:27017", shard: "config", clientAddr: "", time: new Date(1582559808550), what: "balancer.round", ns: "", details: { executionTimeMillis: 243, errorOccured: false, candidateChunks: 1, chunksMoved: 0 } }
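To check whether every balancer round fails the same way, the failed-move entries can be pulled out of a log excerpt. This is a minimal sketch, assuming the 4.2 plain-text log format quoted here (4.4 and later write structured JSON logs, which this regex does not handle). `parseFailedMigrations` is a hypothetical helper name, not a MongoDB API. It is plain JavaScript, so it runs under Node or can be pasted into the mongo shell.

```javascript
// Hypothetical helper (not part of MongoDB): extract failed balancer
// migrations from 4.2-style plain-text log output. Entries are split on the
// leading ISO-8601 timestamp, so several entries pasted onto one line also work.
const ENTRY_START = /(?=\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}[+-]\d{4} )/;
const FAILED_MOVE =
  /^(\S+) .*?Balancer move (\S+): \[(.*?)\), from (\S+), to (\S+) failed :: caused by :: (.+)$/;

function parseFailedMigrations(logText) {
  const failures = [];
  for (const entry of logText.split(ENTRY_START)) {
    const m = FAILED_MOVE.exec(entry.trim());
    if (!m) continue; // not a failed "Balancer move" entry
    const cause = m[6];
    const location = /Location(\d+)/.exec(cause); // first embedded LocationNNNNN code
    failures.push({
      time: m[1],
      ns: m[2],
      bounds: m[3], // raw "{ min }, { max }" text, left unparsed
      from: m[4],
      to: m[5],
      cause: cause,
      locationCode: location ? Number(location[1]) : null,
    });
  }
  return failures;
}
```

Run against the four entries above, it reports a single failure: a move from test-mongodb-egdp-keychain-01-shard02 to test-mongodb-egdp-keychain-01-shard01 with `locationCode` 51008. The lock and actionlog entries are skipped because they contain no "Balancer move" text.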
We also tried moving some of the chunks manually, and those attempts failed with the same error.
The sh.status() output is attached.
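Instead of copying chunk bounds from sh.status() by hand, the chunks still owned by the draining shard can be read from the config.chunks collection and turned into moveChunk command documents. This is a minimal sketch, assuming the 4.2 config.chunks layout where each chunk document carries `ns`, `min`, `max` and `shard` fields (newer releases may differ). `buildMoveChunkCommands` is a hypothetical helper, not a MongoDB API; it only builds the commands and does not run them.

```javascript
// Hypothetical helper (not a MongoDB API): turn config.chunks documents into
// moveChunk command documents shaped like the manual command in this report.
// Assumes 4.2 chunk documents with `ns`, `min`, `max` and `shard` fields.
function buildMoveChunkCommands(chunks, ns, fromShard, toShard) {
  return chunks
    .filter((c) => c.ns === ns && c.shard === fromShard) // chunks left on the draining shard
    .map((c) => ({ moveChunk: ns, bounds: [c.min, c.max], to: toShard }));
}

// Usage from a mongos shell (not run here):
//   const chunks = db.getSiblingDB("config").chunks
//     .find({ ns: "keychain.eg_keyring", shard: "test-mongodb-egdp-keychain-01-shard02" })
//     .toArray();
//   const cmds = buildMoveChunkCommands(chunks, "keychain.eg_keyring",
//     "test-mongodb-egdp-keychain-01-shard02", "test-mongodb-egdp-keychain-01-shard01");
//   printjson(db.adminCommand(cmds[0])); // try a single chunk first
```

Reading bounds from config.chunks keeps the exact BSON values (here UUIDs) instead of retyping them from printed output.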
We took the bounds of one chunk from the attached sh.status() output and tried to move it manually.
Command:
db.adminCommand( { moveChunk : "keychain.eg_keyring" , bounds : [{ "rId" : UUID("80460000-0000-0000-0000-000000000000") }, { "rId" : UUID("80480000-0000-0000-0000-000000000000") }] , to : "test-mongodb-egdp-keychain-01-shard01" } )
Output:
mongos> db.adminCommand( { moveChunk : "keychain.eg_keyring" ,
... bounds : [{ "rId" : UUID("80460000-0000-0000-0000-000000000000") }, { "rId" : UUID("80480000-0000-0000-0000-000000000000") }] ,
... to : "test-mongodb-egdp-keychain-01-shard01"
... } )
{
    "ok" : 0,
    "errmsg" : "Data transfer error: migrate failed: Location51008: operation was interrupted",
    "code" : 96,
    "codeName" : "OperationFailed",
    "operationTime" : Timestamp(1582566446, 139),
    "$clusterTime" : {
        "clusterTime" : Timestamp(1582566446, 139),
        "signature" : {
            "hash" : BinData(0,"jaz2qGWhuM36vt48xNt+mv+CHfo="),
            "keyId" : NumberLong("6765960194405957649")
        }
    }
}
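The outer error here (code 96, OperationFailed) is generic; the more specific detail is the Location51008 code embedded in `errmsg`, which is the same code the balancer logs. This is a minimal sketch that pulls those embedded codes out of a command reply so that manual attempts and balancer failures can be compared. `summarizeCommandError` is a hypothetical helper, not a MongoDB API, and it ignores the Timestamp and BinData fields of the reply.

```javascript
// Hypothetical helper (not a MongoDB API): condense a command reply such as
// the moveChunk output into its error codes.
function summarizeCommandError(reply) {
  if (reply.ok === 1) return { ok: true };
  const errmsg = reply.errmsg || "";
  // "LocationNNNNN" markers embedded in errmsg, in the order they appear.
  const locations = errmsg.match(/Location\d+/g) || [];
  return {
    ok: false,
    code: reply.code,
    codeName: reply.codeName,
    locationCodes: locations.map((s) => Number(s.slice("Location".length))),
    errmsg: errmsg,
  };
}

// In the mongo shell (not run here): printjson(summarizeCommandError(db.adminCommand(cmd)));
```

For the reply above this yields code 96, codeName OperationFailed and `locationCodes` [51008].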
Apart from this, we also ran flushRouterConfig multiple times and restarted all mongos instances, but the issue persists.
Please let me know if there is a known bug around this, or any configuration we need to tweak on our side.