[SERVER-78009] shardSvrCommitReshardCollection command should fail recoverably if the node is shutting down Created: 12/Jun/23  Updated: 20/Nov/23  Resolved: 20/Sep/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 6.0.6, 5.0.18, 7.0.0-rc4
Fix Version/s: 7.2.0-rc0, 7.0.3, 6.0.13, 5.0.24

Type: Bug Priority: Major - P3
Reporter: Abdul Qadeer Assignee: Brett Nawrocki
Resolution: Fixed Votes: 0
Labels: sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-78108 POS interface should expose its shutd... Closed
Related
related to SERVER-78839 Investigate changes needed in shardin... Backlog
is related to SERVER-81115 ReplicaSetAwareService Can Be Shutdow... Investigating
Assigned Teams:
Sharding NYC
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v7.1, v7.0, v6.0, v5.0
Participants:
Linked BF Score: 8
Story Points: 4

 Description   

It is possible that a resharding donor/recipient service POS executor has shutdown on a primary node but has failed to step down due to no secondaries being caught up to be elected. While the secondaries catchup, the primary which has its services shutdown can receive a retry of shardSvrCommitReshardCollection command from an in-progress resharding operation and uassert on existence of resharding state documents (donor or recipient) here. This leads to a fatal error at the coordinator causing it to crash.



 Comments   
Comment by Githook User [ 14/Nov/23 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-78009 Commit/Abort Resharding is Retryable on Shutdown

(cherry picked from commit 7caff9c00c6bc56b55107b1ef1b10a14c2f50516)
Branch: v6.0
https://github.com/mongodb/mongo/commit/5af131a0ad39db73a14f0b89361c4356b5ea0191

Comment by Githook User [ 14/Nov/23 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-78009 Commit/Abort Resharding is Retryable on Shutdown

(cherry picked from commit 7caff9c00c6bc56b55107b1ef1b10a14c2f50516)
Branch: v5.0
https://github.com/mongodb/mongo/commit/c406199f024404be55e7ce660f9b79e82140e641

Comment by Githook User [ 18/Oct/23 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-78009 Commit/Abort Resharding is Retryable on Shutdown

(cherry picked from commit 7caff9c00c6bc56b55107b1ef1b10a14c2f50516)
Branch: v7.0
https://github.com/mongodb/mongo/commit/6039df3be85f40e8691f9dd12515df3de4b9be3a

Comment by Githook User [ 19/Sep/23 ]

Author:

{'name': 'Brett Nawrocki', 'email': 'brett.nawrocki@mongodb.com', 'username': 'brettnawrocki'}

Message: SERVER-78009 Commit/Abort Resharding is Retryable on Shutdown
Branch: master
https://github.com/mongodb/mongo/commit/7caff9c00c6bc56b55107b1ef1b10a14c2f50516

Comment by Wenbin Zhu [ 27/Jun/23 ]

I see, thanks for the explanation!

Comment by Abdul Qadeer [ 27/Jun/23 ]

Hey Abdul Qadeer, I feel like this approach is still racy, because if the POS was in running state when lookupInstance() is being called (so it didn't throw), and later POS might shutdown and the subsequent commit work would still be interrupted and cause the same problem, so I think we need to do the first approach anyways. Is my understanding correct?

You are right that it will be interrupted, however the failure here would be arising from CancellationToken being cancelled in some unit of work within the POS. Note that by this time we already have a handle to the service at the command body which is in-progress of shutting down or has already shutdown and we will wait on the completion future of state machine here. In fact this is what happens in the first invocation of this command - it is the subsequent retry that leads to this issue.

I think the concurrent contract is that each POS instance itself is able to determine whether the service has been shutdown by checking the CancellationToken (token.isCanceled()), and for the command invocation to know about that, the instance needs to communicate the right error code back (via completion futures) to the command invocation.

I should have been more clear - I can see how that sentence can lead to confusion. How about this?:

The Primary Only Service currently lacks a static method that allows derived classes to determine whether the service has been shut down.

You are right about checking for CancellationToken iff we have a handle on the service. Currently, we can only get a handle if the service has not been shutdown and not paused.

Hope that makes sense, let me know if you have any further questions.

Comment by Wenbin Zhu [ 27/Jun/23 ]

If SERVER-78018 provides a method to lookupInstance() which throws InterruptedAtShutdown when the service is in State::kShutdown, the command can retry after catching this error via resharding::tryGetReshardingStateMachine().

Hey abdul.qadeer@mongodb.com, I feel like this approach is still racy, because if the POS was in running state when lookupInstance() is being called (so it didn't throw), and later POS might shutdown and the subsequent commit work would still be interrupted and cause the same problem, so I think we need to do the first approach anyways. Is my understanding correct?

 

Also regarding your description of SERVER-78108:

The Primary Only Service currently lacks a method that allows derived classes to determine whether the service has been shut down.

I think the concurrent contract is that each POS instance itself is able to determine whether the service has been shutdown by checking the CancellationToken (token.isCanceled()), and for the command invocation to know about that, the instance needs to communicate the right error code back (via completion futures) to the command invocation.

Generated at Thu Feb 08 06:37:13 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.