[SERVER-46363] Chunk move failing after removing shard from cluster Created: 24/Feb/20  Updated: 27/Oct/23  Resolved: 04/Mar/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dilip Kolasani Assignee: Carl Champain (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File sh.status.feb24    
Operating System: ALL
Participants:

 Description   

Hi, we are using a MongoDB sharded cluster running 4.2.1.

Architecture:
3 mongos
config server running as a replica set (1 primary + 2 secondaries)
2 shards, each with 3 nodes running as a replica set (1 primary + 2 secondaries)

Since shard1 and shard2 are underutilized, we decided to remove shard2.

We did the following steps:

1) We issued removeShard from a mongos and also moved the databases to another shard (a sketch of the commands for this step is shown below).
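
A minimal sketch of the commands behind this step, assuming the shard and database names seen in the balancer log below (the exact invocations are not recorded in this ticket):

// Start draining shard2; run from a mongos.
db.adminCommand( { removeShard : "test-mongodb-egdp-keychain-01-shard02" } )

// Move the primary for the unsharded data of the "keychain" database off the
// draining shard (only needed if shard2 was the primary shard for that database).
db.adminCommand( { movePrimary : "keychain", to : "test-mongodb-egdp-keychain-01-shard01" } )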

Of the two sharded collections, all chunks belonging to one collection have drained to the other shard, but chunk migration is failing for the other collection. The balancer is not moving its chunks and is logging the following message:

2020-02-24T15:56:48.481+0000 I  SHARDING [Balancer] distributed lock 'keychain.eg_keyring' acquired for 'Migrating chunk(s) in collection keychain.eg_keyring', ts : 5de584eedab8c4c434adabb5
2020-02-24T15:56:48.546+0000 I  SHARDING [TransactionCoordinator] distributed lock with ts: '5de584eedab8c4c434adabb5' and _id: 'keychain.eg_keyring' unlocked.
2020-02-24T15:56:48.549+0000 I  SHARDING [Balancer] Balancer move keychain.eg_keyring: [{ rId: UUID("80460000-0000-0000-0000-000000000000") }, { rId: UUID("80480000-0000-0000-0000-000000000000") }), from test-mongodb-egdp-keychain-01-shard02, to test-mongodb-egdp-keychain-01-shard01 failed :: caused by :: OperationFailed: Data transfer error: migrate failed: Location51008: operation was interrupted
2020-02-24T15:56:48.550+0000 I  SHARDING [Balancer] about to log metadata event into actionlog: { _id: "ip-10-0-212-244:27017-2020-02-24T15:56:48.550+0000-5e53f240dab8c4c434ec8b37", server: "ip-10-0-212-244:27017", shard: "config", clientAddr: "", time: new Date(1582559808550), what: "balancer.round", ns: "", details: { executionTimeMillis: 243, errorOccured: false, candidateChunks: 1, chunksMoved: 0 } }
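
For context, re-running removeShard from a mongos reports the draining progress; a minimal sketch, using the shard name from the log above:

// While draining, the response contains state: "ongoing" and a
// remaining: { chunks: ..., dbs: ... } section showing what is left to move.
db.adminCommand( { removeShard : "test-mongodb-egdp-keychain-01-shard02" } )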

We even tried moving some of the chunks manually, and those moves failed for the same reason.

The sh.status() output is attached.

To move one chunk, we issued the following command, using the chunk bounds from the attached sh.status() output.

Command:

db.adminCommand( { moveChunk : "keychain.eg_keyring" ,
                 bounds : [{ "rId" : UUID("80460000-0000-0000-0000-000000000000") }, { "rId" : UUID("80480000-0000-0000-0000-000000000000") }] ,
                 to : "test-mongodb-egdp-keychain-01-shard01"
                  } )

Output:

mongos> db.adminCommand( { moveChunk : "keychain.eg_keyring" ,
 
...                  bounds : [{ "rId" : UUID("80460000-0000-0000-0000-000000000000") }, { "rId" : UUID("80480000-0000-0000-0000-000000000000") }] ,
...                  to : "test-mongodb-egdp-keychain-01-shard01"
...                   } )
{
	"ok" : 0,
	"errmsg" : "Data transfer error: migrate failed: Location51008: operation was interrupted",
	"code" : 96,
	"codeName" : "OperationFailed",
	"operationTime" : Timestamp(1582566446, 139),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1582566446, 139),
		"signature" : {
			"hash" : BinData(0,"jaz2qGWhuM36vt48xNt+mv+CHfo="),
			"keyId" : NumberLong("6765960194405957649")
		}
	}
}

Apart from this, we also issued flushRouterConfig multiple times and restarted all of the mongos, but the same issue still exists.
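
For reference, the flush was issued along these lines (a minimal sketch):

// Run on every mongos to force a reload of the routing table from the config servers.
db.adminCommand( { flushRouterConfig: 1 } )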

Please let me know if there is any known bug around this or any configuration that we need to tweak on our side.



 Comments   
Comment by Carl Champain (Inactive) [ 04/Mar/20 ]

haidilip83@gmail.com,

Thanks for getting back to us! I will now close this ticket.

Comment by Dilip Kolasani [ 03/Mar/20 ]

Thanks, Carl, for the detailed summary. Hypothesis 1 is confirmed. We are working on enforcing the uniqueness of the _id index on the application side. We can now close this ticket.

Comment by Carl Champain (Inactive) [ 28/Feb/20 ]

haidilip83@gmail.com,

After investigating your issue, we’ve come up with two hypotheses:

1. The _id index key is not unique across your sharded cluster. Our documentation says the following about the uniqueness of the _id index across a sharded cluster:

If the _id field is not the shard key or the prefix of the shard key, _id index only enforces the uniqueness constraint per shard and not across shards.
For example, consider a sharded collection (with shard key {x:1}) that spans two shards A and B. Because the _id key is not part of the shard key, the collection could have a document with _id value 1 in shard A and another document with _id value 1 in shard B.
If the _id field is not the shard key nor the prefix of the shard key, MongoDB expects applications to enforce the uniqueness of the _id values across the shards.

So, in your case, we noticed that _id is neither the shard key nor the prefix of the shard key, which makes it possible that a document on shard2 has the same _id as a document on shard1.
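
To illustrate with the documentation's example, here is a minimal sketch on a hypothetical test collection (not your data) sharded on { x: 1 } and spread over two shards; both inserts succeed through mongos:

// Hypothetical collection sharded on { x: 1 }, with the chunk containing x: -5
// living on shard A and the chunk containing x: 500 living on shard B.
db.test.insert( { _id: 1, x: -5 } )   // routed to shard A
db.test.insert( { _id: 1, x: 500 } )  // routed to shard B; no duplicate key error,
                                      // because each shard enforces _id uniqueness only locally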

 
2. Shard1 may contain orphan documents. Orphan documents appear after a failed migration or an unclean shutdown; they can be duplicates of documents that were moved onto a different shard. There are a few ways in which orphan documents could cause a duplicate key error:

  • If a chunk migration from shard2 to shard1 failed but some documents were still written to shard1, then, when shard2 tries to migrate the chunks again, the duplicate key error arises because shard1 already contains some of the documents from shard2.
  • If shard1 crashed after a chunk migration from shard1 to shard2 while the RangeDeleter was still running: the RangeDeleter does not persist or replicate the ranges it has yet to clean, so when shard1 comes back online it cannot restart from where it left off. During the later chunk migration from shard2 to shard1, the error comes up because shard1 still has those orphan documents.

 

To determine whether hypothesis 1 or 2 is correct, please connect directly to the primary replica set member of shard1 and of shard2 and run:

db.eg_keyring.find({ _id: UUID("245a5a22-4eb3-35ab-b79d-6c0bc431f169") })

  • If the returned documents have the same _id but not the same shard key, then the _id index key is not unique across your sharded cluster and hypothesis 1 is confirmed. You can enforce the uniqueness of the _id index key in your application logic, or you can update your shard key. If you need further assistance troubleshooting, I encourage you to ask our community by posting on the mongodb-user group or on Stack Overflow with the mongodb tag.
  • If the returned documents are identical, then shard1 contains orphan documents and hypothesis 2 is confirmed. Please run cleanupOrphaned on the primary replica set member of shard1 to remediate this issue (see the sketch after this list).
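
A minimal sketch of the cleanupOrphaned loop for a 4.2 shard primary, using the namespace from this ticket (run it while connected directly to the shard1 primary, not through a mongos):

// cleanupOrphaned deletes one contiguous range of orphaned documents per call,
// so loop until stoppedAtKey comes back null.
var nextKey = { };
var result;
while ( nextKey != null ) {
    result = db.adminCommand( { cleanupOrphaned: "keychain.eg_keyring", startingFromKey: nextKey } );
    if (result.ok != 1) {
        print("Unable to complete at this time: failure or timeout.");
        break;
    }
    printjson(result);
    nextKey = result.stoppedAtKey;
}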

Kind regards,
Carl

Comment by Dilip Kolasani [ 26/Feb/20 ]

Could you also confirm whether this is in any way related to https://jira.mongodb.org/browse/SERVER-45844?

Comment by Dilip Kolasani [ 25/Feb/20 ]

Hi Carl,
I have uploaded all requested information.

regards
Dilip K

Comment by Carl Champain (Inactive) [ 25/Feb/20 ]

Hi haidilip83@gmail.com,

Thank you for the report.
To help us understand what is happening, can you please provide:

  • The logs for:
    • Each of the mongos.
    • The primary of shard1 and shard2.
    • The primary of the config servers.
  • The mongodump of your config server:
    • The command should look like this: mongodump --db=config --host=<hostname:port_of_the_mongos>

We've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Kind regards,
Carl
