[SERVER-55648] Mongos doesn't return top-level batch-write error in case of shutdown Created: 30/Mar/21  Updated: 29/Oct/23  Resolved: 29/Jul/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.2.12
Fix Version/s: 4.2.16, 4.0.28

Type: Bug Priority: Major - P3
Reporter: Tommaso Tocci Assignee: Luis Osta (Inactive)
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File reproducible_for_HELP-23256.patch    
Issue Links:
Backports
Depends
Problem/Incident
Related
related to SERVER-59474 Return a shutdown error as top-level ... Closed
related to SERVER-62175 Mongos fails to attach RetryableWrite... Closed
related to SERVER-64642 Fix error where mongos returns Callba... Closed
related to SERVER-58985 Re-enable retryable_mongos_write_erro... Closed
is related to SERVER-53624 4.4 mongos does not attach RetryableW... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0
Steps To Reproduce:

To reproduce the error, apply the provided patch (r4.2.12 - 5593fd8e33b60c7580) and run:

buildscripts/resmoke.py run --suite=sharding jstests/sharding/insert_with_mongos_shutdown.js

Participants:
Case:
Linked BF Score: 85

 Description   

Batch write operations can return either a top-level error:

{ok: 0, code: 91, message: "Server is shutting down"}

or a nested array of writeErrors:

{ok: 1, writeErrors: [ { index: 0, code: 91, message: "Server is shutting down" } ]}

Because our current retryable-writes spec is somewhat vague about how to handle a batch write response that contains writeErrors, drivers only retry on top-level errors in a batch write response and do not inspect the writeErrors array for retryable errors.

The problem is that if a mongos is shut down in the middle of executing a batch write, it may return the nested writeErrors array instead of a response with a top-level error, and drivers will not retry that response.
In this case the batch write fails with a retryable error that is retried by neither the mongos nor the driver.
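
The gap described above can be illustrated with a minimal sketch. It is not taken from any driver: the function names are hypothetical, and the retryable set holds only code 91 (the shutdown code from the two responses in this description) rather than the full list a real driver would use.

```python
# Assumption: an illustrative retryable set containing only code 91,
# the shutdown code shown in the example responses. Not exhaustive.
RETRYABLE_CODES = {91}


def driver_would_retry(response):
    """Hypothetical driver check that inspects only the top-level error."""
    return response.get("ok") != 1 and response.get("code") in RETRYABLE_CODES


def has_buried_retryable_error(response):
    """True when ok is 1 but a writeErrors entry carries a retryable code."""
    if response.get("ok") != 1:
        return False
    return any(
        err.get("code") in RETRYABLE_CODES
        for err in response.get("writeErrors", [])
    )


# The two response shapes from the description.
top_level = {"ok": 0, "code": 91, "message": "Server is shutting down"}
nested = {
    "ok": 1,
    "writeErrors": [
        {"index": 0, "code": 91, "message": "Server is shutting down"}
    ],
}

print(driver_would_retry(top_level))       # top-level error is seen
print(driver_would_retry(nested))          # nested error is missed
print(has_buried_retryable_error(nested))  # the retryable code is buried
```

Under these assumptions, the top-level response is retried while the nested one is not, even though both carry the same shutdown code.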

I suspect this has the same underlying cause as SERVER-53624, but that ticket is specific to MongoDB 4.4 and later, since mongos has attached retryable error labels only since v4.4.



 Comments   
Comment by Githook User [ 22/Sep/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-55648 Return top-level batch-write error in case of shutdown
Branch: v4.0
https://github.com/mongodb/mongo/commit/6862ef35ecbd1399305e924a8165289f2fc0b180

Comment by Githook User [ 20/Sep/21 ]

Author:

{'name': 'Max Hirschhorn', 'email': 'max.hirschhorn@mongodb.com', 'username': 'visemet'}

Message: Revert "SERVER-55648 Return correct response in case of shutdown"

This reverts commit 211007fa4a705c02e7c373dd6fc148aa4de3a038.
Branch: v4.0
https://github.com/mongodb/mongo/commit/9923a9019d2798e06e5bef0a70410eb85cd01e7d

Comment by Githook User [ 20/Sep/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-55648 Return correct response in case of shutdown
Branch: v4.0
https://github.com/mongodb/mongo/commit/211007fa4a705c02e7c373dd6fc148aa4de3a038

Comment by Githook User [ 30/Jul/21 ]

Author:

{'name': 'jannaerin', 'email': 'golden.janna@gmail.com', 'username': 'jannaerin'}

Message: SERVER-55648 Disable retryable_mongos_write_errors.js in sharding_last_stable_mongos_and_mixed_shards
Branch: v4.2
https://github.com/mongodb/mongo/commit/be8229a4fd16ec34be40086dae9e30e5c2a6d726

Comment by Githook User [ 29/Jul/21 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-55648Mongos doesn't return top-level batch-write error in case of shutdown
Branch: v4.2
https://github.com/mongodb/mongo/commit/82fbe6ec0d6e756aaa875f06210298ffdc0991ca

Comment by Oleg Pudeyev (Inactive) [ 22/Jul/21 ]

luis.osta I think my comment above is incorrect. Its incorrectness was also pointed out by Jeremy in the subsequent comment.

When a driver receives an error from the server, several things may happen, including 1) retrying the operation and 2) marking the server unknown.

https://github.com/mongodb/specifications/pull/911 says that, when the server reports an error in writeErrors, the server MUST NOT be marked unknown. This says nothing about whether the operation would be retried by the driver. The operations should be retryable if they match the "determining retryable errors" requirements described in https://github.com/mongodb/specifications/blob/master/source/retryable-writes/retryable-writes.rst#determining-retryable-errors.

I attempted to write a test at https://github.com/p-mongo/tests/blob/master/driver-retry-write-errors/test.rb which would set a fail point on a shard mongod and then write through a mongos, but this test doesn't produce any errors. When I write to the shard directly the fail point is triggered as expected. Are fail points not triggered by mongos->mongod operations or did I get the syntax wrong?

Comment by Jack Mulrow [ 05/Apr/21 ]

As part of this ticket, we should also investigate if there are retryable codes other than shutdown errors that can be buried within writeErrors like this.
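
One way to approach such an investigation is to tally the codes that appear inside writeErrors across a set of captured responses. The following is only a sketch: the helper name, the sample responses, and the short code table are assumptions made for illustration, and the table is not the complete set of retryable codes.

```python
from collections import Counter

# Assumption: a small, non-exhaustive table of retryable codes,
# used here only to illustrate the tally.
ASSUMED_RETRYABLE_CODES = {
    91: "ShutdownInProgress",
    189: "PrimarySteppedDown",
    11600: "InterruptedAtShutdown",
}


def tally_buried_codes(responses):
    """Count retryable codes found in writeErrors of ok: 1 responses."""
    tally = Counter()
    for response in responses:
        if response.get("ok") != 1:
            continue  # top-level errors are already visible to drivers
        for err in response.get("writeErrors", []):
            name = ASSUMED_RETRYABLE_CODES.get(err.get("code"))
            if name is not None:
                tally[name] += 1
    return tally


# Hypothetical captured responses; only the first buries a retryable code.
sample = [
    {"ok": 1, "writeErrors": [
        {"index": 0, "code": 91, "message": "Server is shutting down"}]},
    {"ok": 1, "writeErrors": [
        {"index": 2, "code": 11000, "message": "duplicate key"}]},
    {"ok": 0, "code": 91, "message": "Server is shutting down"},
]

print(dict(tally_buried_codes(sample)))
```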

Comment by Jeremy Mikola [ 02/Apr/21 ]

oleg.pudeyev: My understanding of mongodb/specifications#911 for DRIVERS-1376 is that it only applies to error checking as it pertains to SDAM. Although the original description of DRIVERS-1376 did talk about retryable writes, it looks like that was ultimately removed from the scope.

Comment by Oleg Pudeyev (Inactive) [ 01/Apr/21 ]

The driver behavior was clarified in https://github.com/mongodb/specifications/pull/911 to require drivers to NOT check writeErrors when looking for retryable errors.

Generated at Thu Feb 08 05:37:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.