[SERVER-43700] AsyncRequestsSender may fail if shard is removed concurrently Created: 27/Sep/19  Updated: 06/Dec/22  Resolved: 28/Oct/19

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Lamont Nelson Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File diff.txt    
Issue Links:
Depends
Assigned Teams:
Sharding
Operating System: ALL
Steps To Reproduce:

We were able to repro this bug consistently using the merge_with_drop_shard.js test and the attached diff to force the shard to reload before we create the MultiStatementTransactionRequestsSender object here: https://github.com/mongodb/mongo/blob/0f16c5fc452d16c5a92e43e9fdd96f3822f05271/src/mongo/s/query/establish_cursors.cpp#L64

Participants:
Linked BF Score: 21

 Description   

In merge_with_drop_shard.js the test may fail due to the ARS calling GetShard when the resposne is returned. This failure occurs for currentOp in this case, but could potentially happen for any aggregation command that comes through this path.



 Comments   
Comment by Lamont Nelson [ 30/Sep/19 ]

The JS exception from the BF is due to the currentOp command not having a 'inprog' key in the top level document of the response. That is because the currentOp command failed. This is easy enough to fix just in the test by checking for the existence of the key in the assert soon statement. The currentOp command in the test is only used to determine if the aggregate command is complete (the thing we are testing), so this could be a valid fix for this BF.

I originally thought that the following was the cause, but upon second look this isn't the case. I missed the exception in establish_cursors.cpp. 1. ARS targets the shards & send the requests 2. one of the shards was removed after targeting (but before the response) 3. the responses come back and ARS tries to look up the shard id here, which throws an exception since that shard was not found.

This ticket was meant to address 1,2,&3 above. This could in theory happen, but it wasn't the reason for the failure in the BF. So maybe we should close this one if this behavior isn't a problem. I'll make a new ticket for the js fix.

Comment by Mira Carey [ 30/Sep/19 ]

lamont.nelson, just to clarify, is this a bug in merge_with_drop_shard.js, or something you think needs fixing in the ARS?

I ask because the only two paths I see in the ARS are either for tageting (which clearly needs to call getShard(), or in the failure path (where currently we want the shard to update the repl set monitor).

I can see how lifting the isRetriable logic out of the shard + ignoring the repl set update, when we've lost the shard, would let us preserve the original error more often, but not sure if that would solve the issues in the linked bf (are some of those coming from targeting, rather than in error handling?).

Generated at Thu Feb 08 05:03:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.