[SERVER-48033] remove shard failed: move chunk abort Created: 08/May/20  Updated: 12/May/20  Resolved: 12/May/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.0.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: vinllen chen Assignee: Carl Champain (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File _recvChunkStart_failed.png     File mongod.log.2020-04-30T00-15-05.tar.gz     PNG File move_chunk_fail.png     PNG File screenshot-1.png     PNG File sh.status.png    
Operating System: ALL
Participants:

 Description   

The second `removeShard` call failed: the chunk move was aborted.

There are 4 shards in my sharded cluster: shard1, shard2, shard3, and shard4. Several sharded collections are distributed across these shards.

First, I removed shard2 successfully. Then I called `removeShard` to remove shard4, but it never completed: `sh.status()` always shows the shard in "draining" status.
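The draining check described above can be sketched as a small polling helper. This is only an illustration, not the procedure used in this report: `drainShard` and the injected `runAdminCommand` are hypothetical names, and in the mongo shell the latter could be `cmd => db.adminCommand(cmd)`. The sketch relies on the documented `state` field of the `removeShard` reply ("started", "ongoing", "completed").

```javascript
// Minimal sketch: call removeShard repeatedly until draining completes
// or the poll budget is exhausted.
// Assumption: runAdminCommand is a hypothetical injected function, e.g.
// `cmd => db.adminCommand(cmd)` in the mongo shell. A real loop would
// also wait between polls (the mongo shell provides sleep(ms)).
function drainShard(runAdminCommand, shardName, maxPolls) {
  let last = null;
  for (let i = 0; i < maxPolls; i++) {
    last = runAdminCommand({ removeShard: shardName });
    if (last.ok !== 1) {
      throw new Error("removeShard failed: " + JSON.stringify(last));
    }
    if (last.state === "completed") {
      return { done: true, polls: i + 1, remaining: null };
    }
    // Otherwise the state is "started" or "ongoing": the shard is still draining.
  }
  // Still draining after maxPolls attempts; report what is left, if known.
  const remaining = last && last.remaining ? last.remaining : null;
  return { done: false, polls: maxPolls, remaining: remaining };
}
```

If the helper returns `done: false` with a `remaining.chunks` count that never decreases between runs, that matches the stuck "draining" status described here.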

After going through the shard4 log, I found the error: "Chunk move failed :: caused by :: ShardNotFound: Shard2 not found". So I think the cached routing information has not been updated since shard2 was removed.
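One way to narrow this down is to check whether the persisted metadata still references the removed shard, as opposed to an in-memory cache. The sketch below is a hypothetical helper (`findDanglingShardRefs` is not a MongoDB function) that takes plain arrays shaped like documents from `config.shards`, `config.chunks`, and `config.databases`; in the mongo shell these could be obtained with, for example, `db.getSiblingDB("config").chunks.find().toArray()`. It only inspects the `_id`, `ns`, `shard`, and `primary` fields.

```javascript
// Minimal sketch: list chunks and databases that point at a shard which is
// no longer present in config.shards.
// Assumption: the inputs are plain arrays of documents copied from the
// config database; the function itself never talks to a server.
function findDanglingShardRefs(shards, chunks, databases) {
  const known = new Set(shards.map(s => s._id));
  // Chunks whose owning shard is not a registered shard.
  const chunkRefs = chunks
    .filter(c => !known.has(c.shard))
    .map(c => ({ ns: c.ns, shard: c.shard }));
  // Databases whose primary shard is not a registered shard.
  const dbRefs = databases
    .filter(d => !known.has(d.primary))
    .map(d => ({ db: d._id, primary: d.primary }));
  return { chunkRefs: chunkRefs, dbRefs: dbRefs };
}
```

If both lists come back empty while the ShardNotFound error persists, the persisted metadata is consistent, which would fit the suspicion that a cache is stale rather than the config data itself.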

It can be reproduced by running the "_recvChunkStart" command on shard4; I have attached a screenshot.



 Comments   
Comment by Carl Champain (Inactive) [ 12/May/20 ]

cvinllen@gmail.com,

We need the logs for the entire cluster to fully investigate this issue, so I will now close this ticket. However, feel free to open a new ticket if it happens again. Please ensure that you share:

  1. All the logs for the cluster
  2. mongodump of your config server. The command should look like this:

    mongodump --db=config --host=<hostname:port_of_the_mongos>

Kind regards,
Carl

Comment by vinllen chen [ 12/May/20 ]

The shard has already been removed and deleted, but I can provide the config server log, which contains this error because the chunk move is started by the config server balancer.

mongod.log.2020-04-30T00-15-05.tar.gz

Comment by Carl Champain (Inactive) [ 11/May/20 ]

Hi cvinllen@gmail.com,

Thank you for the report. 
Can you please provide the logs (mongod.log and mongos.log) for your entire cluster?

Kind regards,
Carl
 

Comment by vinllen chen [ 08/May/20 ]

It looks like the CatalogCache does not reload the full DBConfig when a chunk move happens; it only refreshes the changed chunk info through the ChunkManager. Also, the "removeShard" command seems to update only the config server's routing cache, not the routing cache on the shards.

Generated at Thu Feb 08 05:15:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.