[SERVER-32208] Remove retrying of OperationFailed in auto_retry_on_network_error.js Created: 07/Dec/17  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: Querying, Sharding
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Minor - P4
Reporter: Jack Mulrow Assignee: Backlog - Cluster Scalability
Resolution: Unresolved Votes: 0
Labels: max-triage, neweng, open_todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-33542 Using maxTime() on MongoDB 3.4 and 3.... Closed
is related to SERVER-31730 PlanExecutor::executePlan() should pr... Closed
Assigned Teams:
Cluster Scalability
Participants:

 Description   

There used to be a case where a write command swallowed the original error code and replaced it with OperationFailed, but we believe that has been fixed in SERVER-33542. This ticket has been left open to track the work of removing the special case for it from auto_retry_on_network_error.js.
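
For context, a minimal sketch of the kind of special case that should go away. This assumes the override uses a retry predicate over the command response; the function and helper names are illustrative and are not the actual contents of auto_retry_on_network_error.js.

// Illustrative sketch only; names are hypothetical.
function shouldRetryCommandResponse(res) {
    // Ordinary path: retry on network errors and on codes the suite already
    // treats as retriable (e.g. InterruptedDueToReplStateChange).
    if (isRetriableOrNetworkError(res)) {
        return true;
    }

    // Special case kept while some commands reported the plan executor failure
    // as OperationFailed instead of the original code (see SERVER-33542).
    // With that fixed, this is the branch this ticket proposes to remove.
    if (res.code === ErrorCodes.OperationFailed) {
        return true;
    }

    return false;
}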

Original Description

As with SERVER-31730, there are commands that swallow plan executor errors (like InterruptedDueToReplStateChange) and instead return ErrorCodes::OperationFailed. This interferes with the mongos retry logic and with the retry logic in the jscore and concurrency continuous stepdown suites.

These are the commands I've seen that definitely have this problem; there may be more:
find,
findAndModify,
distinct,
geoNear

Grepping for "executor error" turns up a few more places where this could be an issue, since it seems like this logic has been copied across a few commands.
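
To illustrate the symptom, a rough shell-level sketch; the collection name, response shape, and errmsg text below are illustrative assumptions, not captured output.

// Illustrative only. On an affected command, a find interrupted by a stepdown
// comes back with the original error code swallowed:
const res = db.runCommand({find: "coll", filter: {}});
// Roughly:
//   { ok: 0,
//     code: ErrorCodes.OperationFailed,
//     errmsg: "Executor error during find command: InterruptedDueToReplStateChange: ..." }
// Retry logic that keys off res.code (mongos retries, the jscore and
// concurrency stepdown suites) sees OperationFailed rather than the original
// retriable code, so the operation fails instead of being retried.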



 Comments   
Comment by Jack Mulrow [ 13/Apr/18 ]

charlie.swanson That sounds good to me.

Comment by Charlie Swanson [ 13/Apr/18 ]

jack.mulrow it looks like we actually resolved this under SERVER-33542, though we didn't undo your special retry logic. I think we should adapt this ticket to just be about removing that logic - is it okay if I do that and move it to the sharding backlog?

Comment by Jack Mulrow [ 08/Dec/17 ]

charlie.swanson I don't know if there are any existing BFs related to this; I only ran into it locally while writing the new retryable_writes_jscore_stepdown_passthrough for SERVER-31194, which runs the jscore tests against a replica set while continuously stepping down the primary. I was seeing failures like these pretty frequently, but I was able to add special logic to retry them, so for our current tests this isn't really a pressing issue.

Comment by Charlie Swanson [ 08/Dec/17 ]

jack.mulrow can you link us to some instances of these failures so we can gauge how often this is a problem and help with triaging? I'm aware of SERVER-31730 and the associated failures; are there more?
