[SERVER-44143] moveChunk Issue (Mongo version 4.0.4) Created: 22/Oct/19  Updated: 31/Oct/19  Resolved: 31/Oct/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Gennadiy Assignee: Dmitry Agranat
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Hello MongoDB Team.

We are facing an issue related to the moveChunk process.
This is the error from one of the shards' logs:

2019-10-22T11:18:45.921+0200 I SHARDING [conn132] Starting chunk migration ns: db.col, [{ productId: MinKey }, { productId: -9062096908397525382 }), fromShard: shard3, toShard: shard2 with expected collection version epoch 5bb4b060aec28d86b2174007
2019-10-22T11:18:45.924+0200 I SHARDING [conn132] about to log metadata event into changelog: { _id: "prulmcoembb09-2019-10-22T11:18:45.924+0200-5daec975680ef73c96f4be36", server: "prulmcoembb09", clientAddr: "10.217.1.252:50292", time: new Date(1571735925924), what: "moveChunk.start", ns: "db.col", details: { min: { productId: MinKey }, max: { productId: -9062096908397525382 }, from: "shard3", to: "shard2" } }
2019-10-22T11:18:45.955+0200 I SHARDING [conn132] moveChunk data transfer progress: { waited: true, active: false, ns: "db.col", from: "shard3/IP:PORT", fromShardId: "shard3", min: { productId: MinKey }, max: { productId: -9062096908397525382 }, shardKeyPattern: { productId: "hashed" }, state: "fail", errmsg: "migrate failed: InvalidUUID: Cannot create collection db.col because we already have an identically named collection with UUID 55a...", counts: { cloned: 0, clonedBytes: 0, catchup: 0, steady: 0 }, ok: 1.0, operationTime: Timestamp(1571735921, 1), $gleStats: { lastOpTime: Timestamp(0, 0), electionId: ObjectId('7fffffff000000000000001f') }, lastCommittedOpTime: Timestamp(1571735921, 1), $configServerState: { opTime: { ts: Timestamp(1571735925, 6), t: 18 } }, $clusterTime: { clusterTime: Timestamp(1571735925, 6), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } } mem used: 0 documents remaining to clone: 2731
2019-10-22T11:18:45.955+0200 I SHARDING [conn132] about to log metadata event into changelog: { _id: "prulmcoembb09-2019-10-22T11:18:45.955+0200-5daec975680ef73c96f4be5d", server: "prulmcoembb09", clientAddr: "10.217.1.252:50292", time: new Date(1571735925955), what: "moveChunk.error", ns: "db.col", details: { min: { productId: MinKey }, max: { productId: -9062096908397525382 }, from: "shard3", to: "shard2" } }
2019-10-22T11:18:45.961+0200 W SHARDING [conn132] Chunk move failed :: caused by :: OperationFailed: Data transfer error: migrate failed: InvalidUUID: Cannot create collection db.col because we already have an identically named collection with UUID 55ab81fa-7d21-4742-8d71-f4ef8f741ec2, which differs from the donor's UUID 3db9aaae-c037-4162-b0a8-9eec312df936. Manually drop the collection on this shard if it contains data from a previous incarnation of db.col

 

Sorry, I've changed the IPs and db and collection names due to security policy.
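
For reference, a minimal sketch of how the two UUIDs from the error can be compared, connecting directly to the primary of each shard (host names are placeholders and db.col is the redacted namespace from above):

// Run against the donor (shard3) primary, then against the recipient (shard2) primary.
// listCollections carries the shard-local UUID of the collection.
db.getSiblingDB("db").getCollectionInfos({ name: "col" })[0].info.uuid
// Per the error above, the donor reports 3db9aaae-c037-4162-b0a8-9eec312df936
// while the recipient reports 55ab81fa-7d21-4742-8d71-f4ef8f741ec2.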

Here is the sharded collection status:

 

 

mongos> db.coll.getShardDistribution()
Shard shard3 at shard3/IP:PORT
 data : 1.26GiB docs : 109259 chunks : 42
 estimated data per chunk : 30.87MiB
 estimated docs per chunk : 2601
Shard shard1 at shard1/IP:PORT
 data : 1.18GiB docs : 102091 chunks : 36
 estimated data per chunk : 33.61MiB
 estimated docs per chunk : 2835
Shard shard2 at shard2/IP:PORT
 data : 1.17GiB docs : 101178 chunks : 34
 estimated data per chunk : 35.27MiB
 estimated docs per chunk : 2975
Totals
 data : 3.61GiB docs : 312528 chunks : 112
 Shard shard3 contains 34.99% data, 34.95% docs in cluster, avg obj size on shard : 12KiB
 Shard shard1 contains 32.64% data, 32.66% docs in cluster, avg obj size on shard : 12KiB
 Shard shard2 contains 32.36% data, 32.37% docs in cluster, avg obj size on shard : 12KiB

 The configuration is the following:

3 shards, each with 5 nodes (PRIMARY + 4 Sec + 1 Arb)

Here is the Config log error related to the issue:

2019-10-20T02:00:04.541+0200 I SHARDING [Balancer] distributed lock 'db.coll' acquired for 'Migrating chunk(s) in collection db.coll', ts : 5daab1a4d74e87bf1a4ee992
2019-10-20T02:00:04.911+0200 I SHARDING [ShardRegistry] distributed lock with ts: '5daab1a4d74e87bf1a4ee992' and _id: 'db.coll' unlocked.
2019-10-20T02:00:04.922+0200 I SHARDING [Balancer] Balancer move db.coll: [{ productId: MinKey }, { productId: -9062096908397525382 }), from shard3, to shard2 failed :: caused by :: OperationFailed: Data transfer error: migrate failed: InvalidUUID: Cannot create collection db.coll because we already have an identically named collection with UUID 55ab81fa-7d21-4742-8d71-f4ef8f741ec2, which differs from the donor's UUID 3db9aaae-c037-4162-b0a8-9eec312df936. Manually drop the collection on this shard if it contains data from a previous incarnation of db.coll
2019-10-20T02:00:04.922+0200 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "host-2019-10-20T02:00:04.922+0200-5daba384d74e87bf1a699d86", server: "host", clientAddr: "", time: new Date(1571529604922), what: "balancer.round", ns: "", details: { executionTimeMillis: 465, errorOccured: false, candidateChunks: 1, chunksMoved: 0 } }
2019-10-20T02:00:14.962+0200 I SHARDING [Balancer] distributed lock 'db.coll' acquired for 'Migrating chunk(s) in collection db.coll', ts : 5daab1a4d74e87bf1a4ee992
2019-10-20T02:00:15.009+0200 I SHARDING [ShardRegistry] distributed lock with ts: '5daab1a4d74e87bf1a4ee992' and _id: 'db.coll' unlocked.
2019-10-20T02:00:15.012+0200 I SHARDING [Balancer] Balancer move db.coll: [{ productId: MinKey }, { productId: -9062096908397525382 }), from shard3, to shard2 failed :: caused by :: OperationFailed: Data transfer error: migrate failed: InvalidUUID: Cannot create collection db.coll because we already have an identically named collection with UUID 55ab81fa-7d21-4742-8d71-f4ef8f741ec2, which differs from the donor's UUID 3db9aaae-c037-4162-b0a8-9eec312df936. Manually drop the collection on this shard if it contains data from a previous incarnation of db.coll
2019-10-20T02:00:15.013+0200 I SHARDING [Balancer] about to log metadata event into actionlog: { _id: "host-2019-10-20T02:00:15.013+0200-5daba38fd74e87bf1a699e3d", server: "host", clientAddr: "", time: new Date(1571529615013), what: "balancer.round", ns: "", details: { executionTimeMillis: 82, errorOccured: false, candidateChunks: 1, chunksMoved: 0 } }

Please let me know if any additional info is required from my side.

Thank You



 Comments   
Comment by Dmitry Agranat [ 31/Oct/19 ]

Hi Gennadiy,

I am going to close this ticket but feel free to reopen if this happens again.

Thanks,
Dima

Comment by Dmitry Agranat [ 24/Oct/19 ]

Hi Gennadiy,

There might be other ways for this issue to manifest itself. For example, the collection may have somehow existed on a replica set before it was added as a shard, or it might be a manifestation of SERVER-17397. But given that we do not have the logs or a reproducer to back this up, we have exhausted all avenues of investigation.
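
Should this recur, one quick check (a sketch, using the redacted db.coll namespace) is to compare the UUID the config servers have recorded for the namespace with what each shard reports locally; a shard whose local UUID differs is holding a stale incarnation of the collection:

// Via mongos: the UUID recorded in the sharding metadata for the namespace.
db.getSiblingDB("config").collections.findOne({ _id: "db.coll" }, { uuid: 1 })
// Directly on each shard primary: the UUID of the shard-local copy.
db.getSiblingDB("db").getCollectionInfos({ name: "coll" })[0].info.uuid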

If this happens again, or you can reproduce it, please save the logs and we'll be happy to take a look.

Thanks,
Dima

Comment by Gennadiy [ 23/Oct/19 ]

Also, regarding the migration - we migrated it using backup/restore, shard to shard.
The cluster was migrated from 4.0.0 to 4.0.4 mongod, out of Docker containers.

Thank You

Gennadiy

Comment by Dmitry Agranat [ 23/Oct/19 ]

Thanks, Gennadiy, for the additional information. It is unfortunate that we do not have any evidence of this collection's creation.

Based on the information so far, I have another question: when was the cluster upgraded to 4.0.x, and from which MongoDB version was it upgraded?

Thanks,
Dima

Comment by Gennadiy [ 23/Oct/19 ]

Hello Dmitriy,

Unfortunately, we do not have such logs. The cluster was migrated out of Docker containers, and there were no issues before or after the migration.
But now there are... If the collection was not created through mongoS, how can it be sharded, as you can see from the output:

mongos> db.coll.getShardDistribution()
Shard shard3 at shard3/IP:PORT
data : 1.26GiB docs : 109259 chunks : 42
estimated data per chunk : 30.87MiB
estimated docs per chunk : 2601
Shard shard1 at shard1/IP:PORT
data : 1.18GiB docs : 102091 chunks : 36
estimated data per chunk : 33.61MiB
estimated docs per chunk : 2835
Shard shard2 at shard2/IP:PORT
data : 1.17GiB docs : 101178 chunks : 34
estimated data per chunk : 35.27MiB
estimated docs per chunk : 2975
Totals
data : 3.61GiB docs : 312528 chunks : 112
Shard shard3 contains 34.99% data, 34.95% docs in cluster, avg obj size on shard : 12KiB
Shard shard1 contains 32.64% data, 32.66% docs in cluster, avg obj size on shard : 12KiB
Shard shard2 contains 32.36% data, 32.37% docs in cluster, avg obj size on shard : 12KiB

The cluster has been working for about a year since the migration.
So, are there any known cases or resolutions to fix this situation?
This is a PROD environment.
If there are any other questions - please ask.

Thank You

Gennadiy

Comment by Dmitry Agranat [ 23/Oct/19 ]

Hi Gennadiy,

Could you provide evidence of this collection's creation, specifically showing when and from which source it was created? If logs are still available to back this up, please upload them to this secure portal. What we are looking for is to determine whether this collection was created through mongoS or while directly connected to the shards.
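
For example (a sketch, using the redacted namespace), the config database's changelog records when the collection was sharded and through which mongos, although it is a capped collection so older entries may have rolled off:

// Via mongos: changelog entries written when db.coll was sharded.
db.getSiblingDB("config").changelog.find({ ns: "db.coll", what: /shardCollection/ }).pretty()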

Thanks,
Dima

Comment by Gennadiy [ 22/Oct/19 ]

Hello Dmitriy,

Regarding "usually arise if you connected directly to the shards and restored there instead of going through mongos" - this is a PROD cluster which has never been restored. So, this is not the root cause of the issue.
The collection name is not config.system.sessions. It is a standard collection with site information. If there are any other questions - please ask.

Thank You

Gennadiy

Comment by Dmitry Agranat [ 22/Oct/19 ]

Hi genacvali91,

The situation in which different shards get different UUIDs can usually arise if you connected directly to the shards and restored there instead of going through mongos. Could you confirm if this is the case or otherwise provide more details about the procedure that might have caused this issue?

I understand that you have redacted the original collection name; could you confirm that the original collection name is not config.system.sessions?

Thanks,
Dima
