[SERVER-26886] Config server shuts down while a migration is in the critical section, causing the shard to fassert on critical command failure. Created: 02/Nov/16  Updated: 10/Mar/20  Resolved: 10/Mar/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Dianna Hohensee (Inactive)
Assignee: Alexander Taskov (Inactive)
Resolution: Duplicate
Votes: 1
Labels: PM-256, sharding-4.4-stabilization, sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-45752 opCtx interruption during migration c... Closed
Duplicate
is duplicated by SERVER-8358 "Move chunk commit failed" shutdown l... Closed
is duplicated by SERVER-30919 Shutdown during chunk migration can t... Closed
Operating System: ALL
Sprint: Sharding 2020-03-23
Participants:
Case:
Linked BF Score: 6

 Description   

This can happen in any JS test with the auto-balancer running in the background. It's possible for shutdown to happen while a migration is in the critical section, which can lead to the shard crashing on command failures.

Shard shutdown in the critical section is already handled more gracefully, but that does not help if the shutdown error comes from the config primary.



 Comments   
Comment by Alexander Taskov (Inactive) [ 10/Mar/20 ]

This was fixed in SERVER-45752.

Comment by Dianna Hohensee (Inactive) [ 27/Feb/17 ]

The chunk migration commit procedure can cause an fassert on the donor shard if the CommitChunkMigration command to the config server fails for any reason and the donor then cannot perform a follow-up write to the config server to obtain the latest optime. The donor shard needs the latest optime because, when the commit result is unknown, it clears its chunk metadata to force a full refresh the next time metadata is needed, and that refresh is only guaranteed to return the latest chunk metadata if the donor holds the latest optime. If the donor shard did not see the latest chunk metadata, routing guarantees would break, because the donor would allow reads of data that may already have changed on the migration recipient shard.
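
To illustrate why the optime matters, here is a toy model in Python, not MongoDB's actual code. It assumes optimes are plain integers and that a refresh may be served by any config node that has applied at least the optime the donor asks for. The names ConfigNode and refresh_chunk_owner are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConfigNode:
    """Hypothetical config server node: how far it has applied, and what it sees."""
    applied_optime: int
    chunk_owner: str  # shard that owns the migrated chunk, as seen by this node


def refresh_chunk_owner(nodes, min_optime):
    """Read chunk ownership from the first node that has applied min_optime."""
    for node in nodes:
        if node.applied_optime >= min_optime:
            return node.chunk_owner
    raise RuntimeError("no config node has caught up to min_optime")


# The commit (optime 11) moved the chunk from shardA to shardB on the primary,
# but one secondary has only applied up to optime 10.
lagging_secondary = ConfigNode(applied_optime=10, chunk_owner="shardA")
primary = ConfigNode(applied_optime=11, chunk_owner="shardB")
nodes = [lagging_secondary, primary]

# With its old optime the donor can be served by the lagging node and still
# believes it owns the chunk; with the latest optime it cannot.
stale_view = refresh_chunk_owner(nodes, min_optime=10)
fresh_view = refresh_chunk_owner(nodes, min_optime=11)
```

In this model the stale refresh returns the pre-migration owner, which is the situation in which the donor would keep serving reads for a chunk it no longer owns.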

Rather than fasserting in the commit, the donor should set a flag on the collection chunk metadata that causes the next refresh to first get the latest optime from the config server. Acquiring the latest optime from the config server requires a write operation, so there must be a config primary at that time. The refresh itself, in comparison, is a read on the config server and does not require a primary. If the latest optime cannot be acquired, the flag remains set and the command that needed the collection chunk metadata refresh fails.
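
A minimal sketch of this proposal in Python, as an illustration only and not the server implementation. All names here (FakeConfigServer, CollectionMetadata, needs_latest_optime, noop_write) are hypothetical, and the write that obtains the optime is modeled as a simple counter increment.

```python
class ConfigUnavailable(Exception):
    """Raised when the config server has no primary to accept a write."""


class FakeConfigServer:
    """Hypothetical stand-in for the config server replica set."""

    def __init__(self):
        self.has_primary = True
        self.optime = 0
        self.chunk_owner = "shardA"

    def noop_write(self):
        # A write needs a primary; it returns the latest optime.
        if not self.has_primary:
            raise ConfigUnavailable("no config primary")
        self.optime += 1
        return self.optime

    def read_chunk_owner(self, after_optime):
        # A read does not need a primary in this model.
        return self.chunk_owner


class CollectionMetadata:
    """Hypothetical donor-side chunk metadata carrying the proposed flag."""

    def __init__(self, config):
        self.config = config
        self.last_optime = 0
        self.needs_latest_optime = False
        self.chunk_owner = None

    def on_commit_result_unknown(self):
        # Instead of fasserting: clear the metadata and remember that the
        # next refresh must first obtain the latest optime.
        self.chunk_owner = None
        self.needs_latest_optime = True

    def refresh(self):
        if self.needs_latest_optime:
            # If this raises, the flag stays set and the caller's command fails.
            self.last_optime = self.config.noop_write()
            self.needs_latest_optime = False
        self.chunk_owner = self.config.read_chunk_owner(self.last_optime)
        return self.chunk_owner


config = FakeConfigServer()
config.chunk_owner = "shardB"   # the commit actually succeeded
config.has_primary = False      # but the config primary went away
metadata = CollectionMetadata(config)
metadata.on_commit_result_unknown()
```

In this sketch, calling metadata.refresh() while there is no config primary raises ConfigUnavailable and leaves the flag set. Once a primary is available again, the same call obtains the optime, clears the flag, and then performs the read.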

The flag may need to be persisted in case the server crashes and restarts. I’m presuming the lastOpTime is persisted somewhere to be safe from crashes as well; otherwise we’d be running blind on refreshes right now?

This would be logically cleaner: the latest optime would be acquired immediately before the action that needs it, rather than in the chunk commit procedure, far from the reason it is needed.

Generated at Thu Feb 08 04:13:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.