[SERVER-45939] Have mongos propagate RetryableWriteError label to client Created: 03/Feb/20  Updated: 08/Jan/24

Status: Backlog
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matt Broadstone Assignee: Backlog - Service Architecture
Resolution: Unresolved Votes: 0
Labels: sa-remove-fv-backlog-22
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by DRIVERS-525 Expand use of error labels for Retrya... Closed
Gantt Dependency
Related
Assigned Teams:
Service Arch
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Service Arch 2020-02-24, Service Arch 2020-03-09, Service Arch 2020-03-23, Service Arch 2020-04-06, Service arch 2020-04-20, Service arch 2020-05-04, Service arch 2020-05-18, Service arch 2020-06-01, Service arch 2020-06-15, Service arch 2020-06-29, Service arch 2020-07-13, Service Arch 2020-07-27, Service Arch 2020-08-10, Service Arch 2020-08-24
Participants:

 Description   

As part of 4.4 updates to the retryable writes specification for drivers the failCommand fail point in mongod was updated to automatically add errorLabels to error responses for known error codes. mongos has different behavior than mongod here, and it causes these tests to fail when tested against a sharded topology.



 Comments   
Comment by Alexander Golin (Inactive) [ 14/Oct/20 ]

Thank you! I'm closing out DRIVERS-525. mira.carey@mongodb.com I don't want to close this ticket out from under you, so leaving as is for you to assess.

Comment by Divjot Arora (Inactive) [ 14/Oct/20 ]

I agree with emily.giurleo's answer.

Comment by Emily Giurleo (Inactive) [ 14/Oct/20 ]

I believe this should be closed as "Works as Designed," and all the work has been done on the drivers side to account for this behavior.

Comment by Alexander Golin (Inactive) [ 14/Oct/20 ]

Hi all, Esha and I are doing our weekly cleanup and noticed all the drivers tickets are completed for DRIVERS-525, which depends on this ticket. Looks like the conversation thread may have trailed off here so wanted to jump start it. emily.giurleo divjot.arora mira.carey@mongodb.com, should this be closed as "Works as Designed"? If so, is there anything that drivers teams will need to go back and reconsider?

Comment by Divjot Arora (Inactive) [ 13/Mar/20 ]

Seems like this is heading towards "Works as Designed". Should we close this?

emily.giurleo mira.carey@mongodb.com

Comment by Emily Giurleo (Inactive) [ 14/Feb/20 ]

Thank you! I didn't realize this was the case.

Comment by Mira Carey [ 14/Feb/20 ]

Currently, is there ever a case where mongos propagates the RetryableWriteError label to the driver? Glancing at the RetryableWriteError code in the server would suggest not, but I think your comment (and conversations I've had with Matt) suggests that there would be cases where it would be preferable for the driver to retry instead.

The obvious ones to me would have been where mongos hadn't actually attempted to perform any operations yet. Perhaps if it was shutting down (in some kind of drain mode), or perhaps even on the existing shutdown path today.

Comment by Emily Giurleo (Inactive) [ 14/Feb/20 ]

mira.carey@mongodb.com I just want to clarify something you said at the end of your comment –

The larger idea here was that the important thing about retryable writes was that someone perform ~1 retry for any given write. My thinking was that it wasn't particularly important who was performing that retry, and so the retryable error label was an implementation detail when we needed to shunt that responsibility to the driver.

Currently, is there ever a case where mongos propagates the RetryableWriteError label to the driver? Glancing at the RetryableWriteError code in the server would suggest not, but I think your comment (and conversations I've had with Matt) suggests that there would be cases where it would be preferable for the driver to retry instead.

I just want to make sure we're all on the same page on the implementation that's going into 4.4 servers.

Comment by Mira Carey [ 12/Feb/20 ]

matt.broadstone the doc I reviewed was Retryable Writes Error Labels Proposal. It appears that lgtm'rs were Judah, Lingzhi, Siyuan, myself and Jeremy. That doc was spawned off SERVER-41245.

In terms of specific objections:

It seems to me we may miss an opportunity to successfully complete a write if a single mongos fails to complete the operation, but some other proxy might have a chance.

I think that's always the case. The competing desire was to perform fewer retries and return to the user more quickly. Maybe too much was read into the driver's decision to perform only 1 retry.

At large (perhaps with universal timeouts) we would always perform infinite retries, but we seem to have a bias for fewer unlikely-to-actually-succeed retries as the current status quo.

This would seem to break the contract we extend our users regarding retryable writes:

  • a user won't see a retry in their command monitoring

Fair, but also feels like something we could clarify in documentation, if we were otherwise satisfied with the overall approach.

  • one failed insertOne might have a RetryableWriteError and be retried, but depending on <something> the next failed insertOne will not be considered retryable

The something we're talking about here is switching from a mongos to a mongod? Or is there something else you mean there?


The larger idea here was that the important thing about retryable writes was that someone perform ~1 retry for any given write. My thinking was that it wasn't particularly important who was performing that retry, and so the retryable error label was an implementation detail when we needed to shunt that responsibility to the driver. I'm certainly willing to reconsider if the downstream implications of avoiding retries in the drivers outweighs the cost of an extra 3 retries in mongos.

Comment by Matt Broadstone [ 11/Feb/20 ]

emily.giurleo Thanks for the update! I have no problem renaming the ticket.

mira.carey@mongodb.com can you point me to the design document for this feature? I can't seem to find it, and I'd like to be able to refer to it when we discuss the design. For instance, when Emily says "It was originally decided that mongos would not return the `RetryableWriteError` label to the client because mongos would internally retry any failed commands", where was that decision made? Who was involved in that decision?

It seems to me we may miss an opportunity to successfully complete a write if a single mongos fails to complete the operation, but some other proxy might have a chance. This would seem to break the contract we extend our users regarding retryable writes:

  • a user won't see a retry in their command monitoring
  • one failed insertOne might have a RetryableWriteError and be retried, but depending on <something> the next failed insertOne will not be considered retryable
Comment by Mira Carey [ 11/Feb/20 ]

To expand on emily.giurleo's comment,

The intent from the design was for the label to be a direction from the server to the driver to initiate an extra retry. On mongos, where we'd already performed retries (the only time we would propagate retryable errors), the consideration was whether it'd be easier to rewrite replies to strip the label, or simply have mongod not return them (and say that an explicit whitelist was an implementation detail). Again, the reason being to avoid excessive numbers of retries, in a world where drivers retry only once.

Which is to say that the existing behavior was intentional:

  • mongod doesn't return retryable error lables to mongos
  • mongos doesn't send retryable error labels to drivers for operations it has already performed retries

And the larger read is that we never actually want drivers to retry "retryable simply because of error code" errors. We want drivers to retry if and only if the label is present (which would only be because mongos had not itself performed a retry)

Comment by Emily Giurleo (Inactive) [ 11/Feb/20 ]

It was originally decided that mongos would not return the `RetryableWriteError` label to the client because mongos would internally retry any failed commands. If the client retried the command as well, that would result in a command being retried up to 6 times, which is a lot.

matt.broadstone made the point that if the error label is propagated to the client, the client might connect to a different mongos on retry, making it more likely that the command will succeed.

So I think there are two things to do here:

  • I propose renaming this ticket to something that more closely aligns with Matt's intentions, such as: "Have mongos propagate RetryableWriteError label to client." Matt, let me know if that sounds reasonable to you.
  • Then, I think we should align on whether that is actually the desired behavior of the server and whether this work will be completed.
Comment by Amirsaman Memaripour [ 11/Feb/20 ]

lingzhi.deng Can you please clarify why internal clients (e.g., `mongos`) should not get `RetryableWriteError` error labels?

https://github.com/mongodb/mongo/blob/b5134bcb006e74e2102de6e171a28931f84ea12f/src/mongo/db/error_labels.cpp#L45-L49

Comment by Divjot Arora (Inactive) [ 10/Feb/20 ]

matt.broadstone amirsaman.memaripour I'm not sure this is just a Service Arch issue. Given that the RetryableWriteError label was not propagated to the driver from a mongos in the original implementation, it's possible that the ErrorLabelBuilder::isRetryableWriteError method is intentionally missing checks for errors that could only happen in sharded clusters. If you're proposing removing the _isInternalClient check at the beginning of the method, we also need to verify that any sharding-specific errors are not being excluded.

Comment by Matt Broadstone [ 10/Feb/20 ]

amirsaman.memaripour I think you want to be looking at this code in particular: https://github.com/mongodb/mongo/blob/b5134bcb006e74e2102de6e171a28931f84ea12f/src/mongo/db/error_labels.cpp#L45-L49.

Here's an example of how to witness this behavior using the shell:

> use admin
switched to db admin
> db.runCommand({
...   configureFailPoint: "failCommand",
...   mode: { times: 1 },
...   data: {
...     failCommands: ['insert'],
...     errorCode: 10107
...   }
... })
{ "count" : 219, "ok" : 1 }
> db.test.insertOne({ a: 42 })
2020-02-10T14:39:18.525-0500 E  QUERY    [js] uncaught exception: WriteCommandError({
	"ok" : 0,
	"errmsg" : "Failing command due to 'failCommand' failpoint",
	"code" : 10107,
	"codeName" : "NotMaster"
}) :
WriteCommandError({
	"ok" : 0,
	"errmsg" : "Failing command due to 'failCommand' failpoint",
	"code" : 10107,
	"codeName" : "NotMaster"
})
WriteCommandError@src/mongo/shell/bulk_api.js:417:48
executeBatch@src/mongo/shell/bulk_api.js:915:23
Bulk/this.execute@src/mongo/shell/bulk_api.js:1163:21
DBCollection.prototype.insertOne@src/mongo/shell/crud_api.js:264:9
@(shell):1:1

Comment by Amirsaman Memaripour [ 10/Feb/20 ]

matt.broadstone my understanding is that the current code for `mongos` applies error labels when `writeConcernError` is set for a command:

https://github.com/mongodb/mongo/blob/3cd3a6da3fe0b6f022b721094bda0b97c3527d23/src/mongo/s/commands/strategy.cpp#L272-L289

Can you please let me know the exact tests that are failing, and a pointer on how I can reproduce the failures on my local machine? Thank you!

Comment by Githook User [ 05/Feb/20 ]

Author:

{'username': 'mbroadst', 'name': 'Matt Broadstone', 'email': 'mbroadst@mongodb.com'}

Message: fix: explicitly specify error labels to support mongos 4.3

Mongos does not currently fill in `errorLabels` for a `failCommand`
set on a known error code. Because of the way our logic for
determining a retryable write error has changed in the 4.4 timeline
we would no longer be able to run these tests against mongos until
the bug is fixed in the proxy. As a stopgap, we'll explicitly
specify the error label in our `failCommand`, and revert this change
once the bug is fixed.

Related to SERVER-45939
Branch: master
https://github.com/mongodb/specifications/commit/30d95cf14248cd78ba54fce0010d09791a026cae

Comment by Matt Broadstone [ 05/Feb/20 ]

ratika.gandhi while it would be ideal to prioritize this work, we've worked around it by explicitly adding errorLabels to our failCommand fail points. I created SPEC-1565 to track the work of reverting that change when this is fixed in mongos. Having said that, RC0 feels a bit late given that all drivers will have to resync their tests multiple times.

Comment by Ratika Gandhi [ 04/Feb/20 ]

matt.broadstone, By what date do you need this? Can it wait until RC0?

Generated at Thu Feb 08 05:10:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.