[SERVER-78009] shardSvrCommitReshardCollection command should fail recoverably if the node is shutting down Created: 12/Jun/23 Updated: 20/Nov/23 Resolved: 20/Sep/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 6.0.6, 5.0.18, 7.0.0-rc4 |
| Fix Version/s: | 7.2.0-rc0, 7.0.3, 6.0.13, 5.0.24 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Abdul Qadeer | Assignee: | Brett Nawrocki |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam1 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Assigned Teams: |
Sharding NYC
|
||||||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||
| Backport Requested: |
v7.1, v7.0, v6.0, v5.0
|
||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Linked BF Score: | 8 | ||||||||||||||||||||||||
| Story Points: | 4 | ||||||||||||||||||||||||
| Description |
|
It is possible that a resharding donor/recipient service POS executor has shutdown on a primary node but has failed to step down due to no secondaries being caught up to be elected. While the secondaries catchup, the primary which has its services shutdown can receive a retry of shardSvrCommitReshardCollection command from an in-progress resharding operation and uassert on existence of resharding state documents (donor or recipient) here. This leads to a fatal error at the coordinator causing it to crash. |
| Comments |
| Comment by Githook User [ 14/Nov/23 ] |
|
Author: {'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}Message: (cherry picked from commit 7caff9c00c6bc56b55107b1ef1b10a14c2f50516) |
| Comment by Githook User [ 14/Nov/23 ] |
|
Author: {'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}Message: (cherry picked from commit 7caff9c00c6bc56b55107b1ef1b10a14c2f50516) |
| Comment by Githook User [ 18/Oct/23 ] |
|
Author: {'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}Message: (cherry picked from commit 7caff9c00c6bc56b55107b1ef1b10a14c2f50516) |
| Comment by Githook User [ 19/Sep/23 ] |
|
Author: {'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}Message: |
| Comment by Wenbin Zhu [ 27/Jun/23 ] |
|
I see, thanks for the explanation! |
| Comment by Abdul Qadeer [ 27/Jun/23 ] |
You are right that it will be interrupted, however the failure here would be arising from CancellationToken being cancelled in some unit of work within the POS. Note that by this time we already have a handle to the service at the command body which is in-progress of shutting down or has already shutdown and we will wait on the completion future of state machine here. In fact this is what happens in the first invocation of this command - it is the subsequent retry that leads to this issue.
I should have been more clear - I can see how that sentence can lead to confusion. How about this?:
You are right about checking for CancellationToken iff we have a handle on the service. Currently, we can only get a handle if the service has not been shutdown and not paused. Hope that makes sense, let me know if you have any further questions. |
| Comment by Wenbin Zhu [ 27/Jun/23 ] |
Hey abdul.qadeer@mongodb.com, I feel like this approach is still racy, because if the POS was in running state when lookupInstance() is being called (so it didn't throw), and later POS might shutdown and the subsequent commit work would still be interrupted and cause the same problem, so I think we need to do the first approach anyways. Is my understanding correct?
Also regarding your description of
I think the concurrent contract is that each POS instance itself is able to determine whether the service has been shutdown by checking the CancellationToken (token.isCanceled()), and for the command invocation to know about that, the instance needs to communicate the right error code back (via completion futures) to the command invocation. |