[CDRIVER-3121] Failpoint errors may not be considered a failure Created: 09/May/19  Updated: 28/Oct/23  Resolved: 13/Mar/20

Status: Closed
Project: C Driver
Component/s: libmongoc
Affects Version/s: None
Fix Version/s: 1.17.0-beta, 1.17.0

Type: Bug Priority: Major - P3
Reporter: Patrick Freed Assignee: Kevin Albertson
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends

 Description   

Given the following code sample:

if (!mongoc_collection_insert_one(collection, insert, NULL, &reply, &error)) {
            printf("err inserting reply: %s err: %s\n", bson_as_relaxed_extended_json(&reply, NULL), error.message);
        } else {
            printf("success: %s\n", bson_as_relaxed_extended_json(&reply, NULL));
        }

it prints

success: { "insertedCount" : 1, "writeConcernErrors" : [ { "code" : 91.0, "errmsg" : "Replication is being shut down" } ] }

when encountering a write concern error. This means the function returned true, even though the docs state that a write concern error should be considered a failure. I believe this applies to the other crud functions as well (including bulk writes).



 Comments   
Comment by Kevin Albertson [ 13/Mar/20 ]

The error codes should now accept double or int64 as well.

It also appears that the write concern error will be present in the reply even if the write is retried and succeeds on the second attempt. The function will report success via the return value (not the reply) regardless of whether the write was retried successfully or not, though.

I couldn't reproduce this, and reading the code, it does appear that write commands would retry before parsing a reply into a write result. I wonder if this was a surfacing of the same issue. Perhaps the failpoint triggered on the retry as well, and though the second reply has a writeConcernError, the return still indicated success. I'm closing for now, but feel free to re-open if that isn't the case.

Comment by Githook User [ 13/Mar/20 ]

Author:

{'username': 'kevinAlbs', 'name': 'Kevin Albertson', 'email': 'kevin.albertson@mongodb.com'}

Message: CDRIVER-3121 loosen error code numeric checks

Permit double and int64, since they may appear when
error codes are returned via failpoints.
Branch: master
https://github.com/mongodb/mongo-c-driver/commit/4ea991488081d4f129947510647b34a640808840

Comment by Patrick Freed [ 13/Aug/19 ]

Regarding your second open question, it feels like the server's responsibility to standardize the types of their error codes, regardless of what the user sends them. I guess since failPoints are largely developer tools, they don't do a ton of sanitation right now, but I still find it very strange that it's even possible for the codes to be returned as multiple different types. I think the ok field might suffer from this same issue.

Comment by Patrick Freed [ 13/Aug/19 ]

Blast from the past here–I glossed over the questions directed at me. Yep, I was using a full non-mocked mongod instance when I encountered this bug. For some reason, the server happens to include the exact error code as specified in the configureFailPoint command in its reply when the failPoint is triggered, preserving its type. The fail point does trigger correctly regardless of the that type (assuming the same numeric value), however.

e.g. sending this command via the shell

{configureFailPoint: "failCommand", mode: {times: 1}, data: { failCommands: ["insert"], writeConcernError: { code: NumberLong(91) } } }

when triggered, will return something like

{ ..., writeConcernError: { code: NumberLong(91) }, ... }

By default, I believe the shell interprets numbers as Doubles, so just a vanilla untyped configureFailPoint shell command will have the code returned as a double. Specifying the code as NumberInt(..) is currently the only way to have libmongoc parse it correctly.

Comment by Kevin Albertson [ 24/Jun/19 ]

Thanks for investigating this. You're correct in that the code in _parse_error_reply is limiting the error code check to an int32 reply. It seems we've been doing this as far back as 1.3:
https://github.com/mongodb/mongo-c-driver/blob/r1.3/src/mongoc/mongoc-rpc.c#L754-L757

We don't catch this in retryable writes tests because it seems that using a failpoint to mock the error server reply always returns the error codes as int32 as well. So how did you encounter this bug? Did you see this on a live mongod returning the error code as a double?

I have a couple other open questions that I should answer before completing this ticket.

1. If we do should check for any numeric error code type, does this warrant a spec change in retryable writes, and other specs that mention error codes, to mention this?
2. Though this may be an isolated case, should configureFailPoint use the type of the error code provided, instead of always returning int32? Then we could have a spec test that validates different error code types.

Comment by Patrick Freed [ 09/May/19 ]

The cause of this is in mongo-rpc.c around line 1112, in _parse_error_reply. When parsing write concern errors, this function expects the error code to be an int32. When encountering non-int32 s, it ignores the code and reports no error.

In the above example (where the error is caused by setting a failpoint in the shell), the code is sent back from the server as a double, so we see the bad behavior. Likewise, specifying the failpoint code using NumberLong(91) also causes the write concern error to not be reported. Configuring with NumberInt(91) produces the desired behavior, however.

Comment by Patrick Freed [ 09/May/19 ]

It also appears that the write concern error will be present in the reply even if the write is retried and succeeds on the second attempt. The function will report success via the return value (not the reply) regardless of whether the write was retried successfully or not, though.

Generated at Wed Feb 07 21:17:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.