[SERVER-48033] remove shard failed: move chunk abort Created: 08/May/20 Updated: 12/May/20 Resolved: 12/May/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | 4.0.10 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | vinllen chen | Assignee: | Carl Champain (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
The second call to `removeShard` failed with an aborted chunk move. My sharded cluster has 4 shards: shard1, shard2, shard3, and shard4, with several sharded collections distributed across them. I first removed shard2 successfully. I then called `removeShard` to remove shard4, but it never completed: `sh.status()` always reports shard4 as "draining". In the shard4 log I found the error "Chunk move failed :: caused by :: ShardNotFound: Shard2 not found". My guess is that the cached routing information was not updated after shard2 was removed. The error can be reproduced by running the "_recvChunkStart" command on shard4; a screenshot is attached. |
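One way to tell a slow drain from a stuck one, as described above, is to compare successive `removeShard` replies. The sketch below is illustrative only: it assumes each reply carries the documented `state` field and, while draining, a `remaining.chunks` count. The function name `is_draining_stalled` and the `max_flat_polls` threshold are hypothetical and not part of MongoDB.

```python
from typing import Any, Dict, List


def is_draining_stalled(replies: List[Dict[str, Any]], max_flat_polls: int = 3) -> bool:
    """Return True if the remaining chunk count has not dropped recently.

    `replies` holds removeShard reply documents in the order received
    (assumed shape: {"state": ..., "remaining": {"chunks": N}}).
    """
    if replies and replies[-1].get("state") == "completed":
        return False  # the shard finished draining
    counts = [
        int(r["remaining"]["chunks"])
        for r in replies
        if r.get("state") == "ongoing" and "remaining" in r
    ]
    if len(counts) <= max_flat_polls:
        return False  # not enough samples to judge
    window = counts[-(max_flat_polls + 1):]
    # Stalled when no later sample is below the oldest one in the window.
    return min(window) >= window[0]
```

In practice the replies could be collected by calling `removeShard` repeatedly against a mongos, for example with pymongo's `client.admin.command("removeShard", "shard4")`.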
| Comments |
| Comment by Carl Champain (Inactive) [ 12/May/20 ] | |
|
We need the logs for the entire cluster to fully investigate this issue, so I will now close this ticket. However, feel free to open a new ticket the next time it happens. Please ensure that you share:
Kind regards, | |
| Comment by vinllen chen [ 12/May/20 ] | |
|
The shard has already been removed and deleted, but I can offer the config-server log, which contains this error because the chunk move was started by the config-server balancer. | |
| Comment by Carl Champain (Inactive) [ 11/May/20 ] | |
|
Thank you for the report. Kind regards, | |
| Comment by vinllen chen [ 08/May/20 ] | |
|
It looks like, when a chunk move happens, the CatalogCache does not reload the full DBConfig; the ChunkManager only fetches the chunk info that differs. Also, the "removeShard" command appears to update only the config server's cached routing information, not the cached routing information on the shards. |
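The hypothesis above can be illustrated with a toy model. This is not MongoDB code and makes no claim about how the CatalogCache actually behaves; the class `ToyRouteCache`, its methods, and the sample database and chunk names are all hypothetical. It only shows how a cache that refreshes chunk diffs, but never the database-level entries, could keep pointing at a removed shard.

```python
from typing import Dict, Set


class ToyRouteCache:
    """Toy routing cache: database -> primary shard, chunk -> owning shard."""

    def __init__(self, db_primary: Dict[str, str], chunk_owner: Dict[str, str]):
        self.db_primary = dict(db_primary)
        self.chunk_owner = dict(chunk_owner)

    def refresh_chunk_diff(self, config_chunk_owner: Dict[str, str]) -> None:
        """Update only chunk entries that differ; db_primary is left untouched."""
        for chunk, shard in config_chunk_owner.items():
            if self.chunk_owner.get(chunk) != shard:
                self.chunk_owner[chunk] = shard

    def full_reload(self, config_db_primary: Dict[str, str],
                    config_chunk_owner: Dict[str, str]) -> None:
        """Replace every cached entry with the authoritative state."""
        self.db_primary = dict(config_db_primary)
        self.chunk_owner = dict(config_chunk_owner)

    def referenced_shards(self) -> Set[str]:
        """All shard names the cache could still route to."""
        return set(self.db_primary.values()) | set(self.chunk_owner.values())


# Cached view taken before shard2 was removed (hypothetical data).
cache = ToyRouteCache(
    db_primary={"app": "shard2"},
    chunk_owner={"c1": "shard2", "c2": "shard3", "c3": "shard4"},
)

# Authoritative state after shard2 was drained and removed.
config_db_primary = {"app": "shard1"}
config_chunk_owner = {"c1": "shard1", "c2": "shard3", "c3": "shard4"}

cache.refresh_chunk_diff(config_chunk_owner)
stale_after_diff = "shard2" in cache.referenced_shards()

cache.full_reload(config_db_primary, config_chunk_owner)
stale_after_reload = "shard2" in cache.referenced_shards()
```

In this model the diff-only refresh fixes the chunk entries but leaves the stale database entry in place, while a full reload clears it. Whether the real server behaves this way would need the cluster logs to confirm.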