[SERVER-38409] Shard can crash at step-up due to FailedToSatisfyReadPreference exception during minOpTime recovery Created: 05/Dec/18 Updated: 27/Oct/23 Resolved: 05/Dec/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Kaloian Manassiev | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Works as Designed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Assigned Teams: | Sharding |
| Operating System: | ALL |
| Participants: | |
| Linked BF Score: | 8 |
| Description |
|
The ReplicationCoordinatorExternalStateImpl::_shardingOnTransitionToPrimaryHook call already expects NotMaster and ShutdownInProgress errors, but in certain cases it can also crash with a FailedToSatisfyReadPreference exception if the config server is not available at the time the shard node steps up. |
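The error handling described above can be modelled roughly as follows. This is a minimal Python sketch, not the server's C++ code: the exception classes are stand-ins named after the error codes in this ticket, and `on_transition_to_primary` and `recover` are hypothetical names introduced only for illustration.

```python
class NotMaster(Exception):
    """Stand-in for the NotMaster error code."""


class ShutdownInProgress(Exception):
    """Stand-in for the ShutdownInProgress error code."""


class FailedToSatisfyReadPreference(Exception):
    """Stand-in for the error seen when no suitable config server node is reachable."""


# Errors the hook is described as already expecting.
TOLERATED = (NotMaster, ShutdownInProgress)


def on_transition_to_primary(recover):
    """Run the recovery callable (hypothetical) during step-up.

    Returns True if recovery completed and False if a tolerated error
    interrupted it. Any other exception propagates to the caller, which
    in this model corresponds to the crash described in the ticket.
    """
    try:
        recover()
    except TOLERATED:
        return False
    return True
```

In this model, a `FailedToSatisfyReadPreference` raised by `recover` is deliberately not caught, mirroring the reported behaviour.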
| Comments |
| Comment by Kaloian Manassiev [ 05/Dec/18 ] |
|
The sharding minOpTime recovery procedure ensures that after a shard starts up or becomes a primary, it can see with certainty the chunks that it owned after the last time it donated a chunk. It works as follows: a counter indicating that the node is persisting the donation of a chunk is written before the migration is persisted on the config server, and is cleared once the migration has been successfully persisted. When a node starts up and discovers that the number of active "committers" of the config server metadata is greater than 0, it must consult the config server's primary to discover whether the previous migration actually committed. Because of this requirement, if the config server is not available in this situation, the node cannot safely continue starting up as the primary of a shard, due to the risk of data loss. Therefore, this behaviour "Works as Designed". |
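The counter mechanism described in this comment can be sketched as a small Python model. Everything here is an assumption made for illustration: the class names, the `pending_chunk` field, and the fake config primary do not come from the server code, and the real recovery document and commit protocol are more involved.

```python
class FailedToSatisfyReadPreference(Exception):
    """Stand-in for the error raised when the config primary is unreachable."""


class FakeConfigPrimary:
    """Hypothetical in-memory substitute for the config server's primary."""

    def __init__(self):
        self.available = True
        self.committed = set()

    def _check(self):
        if not self.available:
            raise FailedToSatisfyReadPreference("config primary unavailable")

    def commit_migration(self, chunk):
        self._check()
        self.committed.add(chunk)

    def was_committed(self, chunk):
        self._check()
        return chunk in self.committed


class Shard:
    def __init__(self, config_primary):
        self.config_primary = config_primary
        self.active_committers = 0  # counter persisted locally (assumed shape)
        self.pending_chunk = None   # hypothetical bookkeeping for the model

    def donate_chunk(self, chunk):
        # 1. Record the in-flight commit before touching the config server.
        self.active_committers += 1
        self.pending_chunk = chunk
        # 2. Persist the migration on the config server (may fail).
        self.config_primary.commit_migration(chunk)
        # 3. Clear the counter only after a successful commit.
        self.active_committers -= 1
        self.pending_chunk = None

    def recover_on_step_up(self):
        """Return None if no recovery is needed, else whether the migration committed."""
        if self.active_committers == 0:
            return None
        # The config primary must answer; if it cannot, the error propagates
        # and the node does not continue as primary.
        return self.config_primary.was_committed(self.pending_chunk)
```

If a donation is interrupted after step 1, the counter stays above 0, so the next step-up has to reach the config primary before the shard can trust its own view of chunk ownership.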