[SERVER-58195] InterruptedDueToReplStateChange missing 'RetryableWriteError' error label Created: 01/Jul/21  Updated: 27/Oct/23  Resolved: 19/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.4.6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ross Lawley Assignee: Lingzhi Deng
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to JAVA-4244 Top level error labels aren't added t... Closed
is related to SERVER-41245 Add RetryableWriteError Error Label Closed
Operating System: ALL
Sprint: Repl 2021-07-26
Participants:

 Description   

When running an Atlas failover test on an M10 instance with the default connection string (retryWrites = true and write concern = majority), the driver can receive the following error:

 

{"code": 11602, "codeName": "InterruptedDueToReplStateChange", "errmsg": "operation was interrupted", 
"errInfo": {"writeConcern": {"w": "majority", "wtimeout": 0, "provenance": "clientSupplied"}}}
 

The missing error labels (specifically 'RetryableWriteError') mean that, according to the retryable writes specification, the error is not retryable:

For server versions 4.4 and newer, the server will add a RetryableWriteError label to errors or server responses that it considers retryable before returning them to the driver. As new server versions are released, the errors that are labeled with the RetryableWriteError label may change. Drivers MUST NOT add a RetryableWriteError label to any error derived from a 4.4+ server response (i.e. any error that is not a network error).
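The rule quoted above can be sketched roughly as follows: for a response from a 4.4+ server, a driver only looks for the label and never adds it. This is a minimal sketch, assuming the server reply has already been decoded into a Python dict; the function name `has_retryable_write_label` is hypothetical and not part of any driver API.

```python
RETRYABLE_WRITE_LABEL = "RetryableWriteError"


def has_retryable_write_label(response: dict) -> bool:
    """Return True if a decoded server reply carries the RetryableWriteError label.

    Assumption: labels are read from the top-level "errorLabels" array of the
    reply. The function only reads the label; it never adds one.
    """
    return RETRYABLE_WRITE_LABEL in response.get("errorLabels", [])


# The error document as reported in this ticket: no error labels are present.
reported = {
    "code": 11602,
    "codeName": "InterruptedDueToReplStateChange",
    "errmsg": "operation was interrupted",
}
print(has_retryable_write_label(reported))  # False
```

Under this check, the document reported above would be treated as not retryable, which matches the behaviour described in this ticket.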

Steps to reproduce

Continually run updateOne operations in a loop and run a failover test in Atlas.

It is a race condition, so having minimal network latency (a nearer data center) helps increase the chance of the race.
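The update loop from the steps above might look something like the sketch below. It is illustrative only: the original report used the Java driver, whereas this sketch assumes a PyMongo-style collection object with an `update_one(filter, update, upsert=...)` method, and the helper name `run_updates`, the filter, and the update document are all hypothetical. The failover itself still has to be triggered separately in Atlas while the loop runs.

```python
import itertools


def run_updates(collection, iterations=None):
    """Call update_one in a loop; return the first exception raised, or None.

    `collection` is assumed to be a PyMongo-style collection. With
    iterations=None the loop runs until an error occurs.
    """
    counter = itertools.count() if iterations is None else range(iterations)
    for i in counter:
        try:
            collection.update_one({"_id": 1}, {"$set": {"i": i}}, upsert=True)
        except Exception as exc:
            # PyMongo exceptions expose has_error_label(); fall back to False
            # for exception types that do not.
            has_label = getattr(exc, "has_error_label", lambda _label: False)
            print(type(exc).__name__,
                  "RetryableWriteError label:", has_label("RetryableWriteError"))
            return exc
    return None
```

Returning the first exception rather than swallowing it keeps the labels (or their absence) available for inspection after the failover.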



 Comments   
Comment by Lingzhi Deng [ 19/Jul/21 ]

ross.lawley, thanks for your investigation. I am now closing this as "Works as Designed". To repost what Ross found, the server does correctly return the RetryableWriteError label:

{
    "n": 1,
    "electionId": {"$oid": "7fffffff000000000000000a"},
    "opTime": {"ts": {"$timestamp": {"t": 1626689946, "i": 14}}, "t": 10},
    "nModified": 1,
    "writeConcernError": {
        "code": 11602,
        "codeName": "InterruptedDueToReplStateChange",
        "errmsg": "operation was interrupted",
        "errInfo": {
            "writeConcern": {"w": "majority", "wtimeout": 0, "provenance": "clientSupplied"}
        }
    },
    "ok": 1.0,
    "errorLabels": ["RetryableWriteError"],
    "topologyVersion": {"processId": {"$oid": "60f54f5c9e6c63f5eea70275"}, "counter": 29},
    "$clusterTime": {"clusterTime": {"$timestamp": {"t": 1626689946, "i": 14}}, "signature": {"hash": {"$binary": {"base64": "AAAAAAAAAAAAAAAAAAAAAAAAAAA=", "subType": "00"}}, "keyId": 0}},
    "operationTime": {"$timestamp": {"t": 1626689946, "i": 14}}
}

Comment by Ross Lawley [ 19/Jul/21 ]

My apologies - further testing has shown a bug in the Java driver: it failed to handle this error correctly and did not include the top-level error labels in the generated WriteConcernError.

Suggest closing and marking Works as Designed.
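To illustrate the kind of parsing step involved, here is a minimal sketch in Python (not the Java driver's actual code) that builds an error object from the server response shown in the earlier comment. The class `WriteConcernFailure` and the function `parse_write_concern_error` are hypothetical names. The point is that "errorLabels" is a sibling of "writeConcernError" in the reply, so an error built only from the "writeConcernError" subdocument loses the labels.

```python
from dataclasses import dataclass, field


@dataclass
class WriteConcernFailure:
    """Hypothetical error object built from a server reply."""
    code: int
    code_name: str
    message: str
    error_labels: frozenset = field(default_factory=frozenset)

    def has_error_label(self, label: str) -> bool:
        return label in self.error_labels


def parse_write_concern_error(response: dict):
    """Return a WriteConcernFailure for the reply, or None if it has none."""
    wce = response.get("writeConcernError")
    if wce is None:
        return None
    return WriteConcernFailure(
        code=wce["code"],
        code_name=wce.get("codeName", ""),
        message=wce.get("errmsg", ""),
        # Read labels from the top level of the reply, not from `wce`.
        error_labels=frozenset(response.get("errorLabels", [])),
    )


# Abbreviated form of the server response quoted in this ticket.
response = {
    "n": 1,
    "nModified": 1,
    "writeConcernError": {
        "code": 11602,
        "codeName": "InterruptedDueToReplStateChange",
        "errmsg": "operation was interrupted",
    },
    "ok": 1.0,
    "errorLabels": ["RetryableWriteError"],
}

err = parse_write_concern_error(response)
print(err.code_name, err.has_error_label("RetryableWriteError"))
# InterruptedDueToReplStateChange True
```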

Generated at Thu Feb 08 05:43:51 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.