[SERVER-43700] AsyncRequestsSender may fail if shard is removed concurrently Created: 27/Sep/19 Updated: 06/Dec/22 Resolved: 28/Oct/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Networking |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Lamont Nelson | Assignee: | [DO NOT USE] Backlog - Sharding Team |
| Resolution: | Won't Fix | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||
| Issue Links: |
|
||||
| Assigned Teams: |
Sharding
|
||||
| Operating System: | ALL | ||||
| Steps To Reproduce: | We were able to repro this bug consistently using the merge_with_drop_shard.js test and the attached diff to force the shard to reload before we create the MultiStatementTransactionRequestsSender object here: https://github.com/mongodb/mongo/blob/0f16c5fc452d16c5a92e43e9fdd96f3822f05271/src/mongo/s/query/establish_cursors.cpp#L64 |
||||
| Participants: | |||||
| Linked BF Score: | 21 | ||||
| Description |
|
In merge_with_drop_shard.js the test may fail due to the ARS calling GetShard when the resposne is returned. This failure occurs for currentOp in this case, but could potentially happen for any aggregation command that comes through this path. |
| Comments |
| Comment by Lamont Nelson [ 30/Sep/19 ] |
|
The JS exception from the BF is due to the currentOp command not having a 'inprog' key in the top level document of the response. That is because the currentOp command failed. This is easy enough to fix just in the test by checking for the existence of the key in the assert soon statement. The currentOp command in the test is only used to determine if the aggregate command is complete (the thing we are testing), so this could be a valid fix for this BF. I originally thought that the following was the cause, but upon second look this isn't the case. I missed the exception in establish_cursors.cpp. 1. ARS targets the shards & send the requests 2. one of the shards was removed after targeting (but before the response) 3. the responses come back and ARS tries to look up the shard id here, which throws an exception since that shard was not found. This ticket was meant to address 1,2,&3 above. This could in theory happen, but it wasn't the reason for the failure in the BF. So maybe we should close this one if this behavior isn't a problem. I'll make a new ticket for the js fix. |
| Comment by Mira Carey [ 30/Sep/19 ] |
|
lamont.nelson, just to clarify, is this a bug in merge_with_drop_shard.js, or something you think needs fixing in the ARS? I ask because the only two paths I see in the ARS are either for tageting (which clearly needs to call getShard(), or in the failure path (where currently we want the shard to update the repl set monitor). I can see how lifting the isRetriable logic out of the shard + ignoring the repl set update, when we've lost the shard, would let us preserve the original error more often, but not sure if that would solve the issues in the linked bf (are some of those coming from targeting, rather than in error handling?). |