[SERVER-45939] Have mongos propagate RetryableWriteError label to client Created: 03/Feb/20 Updated: 08/Jan/24 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matt Broadstone | Assignee: | Backlog - Service Architecture |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | sa-remove-fv-backlog-22 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||
| Assigned Teams: |
Service Arch
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Sprint: | Service Arch 2020-02-24, Service Arch 2020-03-09, Service Arch 2020-03-23, Service Arch 2020-04-06, Service arch 2020-04-20, Service arch 2020-05-04, Service arch 2020-05-18, Service arch 2020-06-01, Service arch 2020-06-15, Service arch 2020-06-29, Service arch 2020-07-13, Service Arch 2020-07-27, Service Arch 2020-08-10, Service Arch 2020-08-24 | ||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
As part of 4.4 updates to the retryable writes specification for drivers the failCommand fail point in mongod was updated to automatically add errorLabels to error responses for known error codes. mongos has different behavior than mongod here, and it causes these tests to fail when tested against a sharded topology. |
| Comments |
| Comment by Alexander Golin (Inactive) [ 14/Oct/20 ] | |||||||||||||||||||||||||||||
|
Thank you! I'm closing out | |||||||||||||||||||||||||||||
| Comment by Divjot Arora (Inactive) [ 14/Oct/20 ] | |||||||||||||||||||||||||||||
|
I agree with emily.giurleo's answer. | |||||||||||||||||||||||||||||
| Comment by Emily Giurleo (Inactive) [ 14/Oct/20 ] | |||||||||||||||||||||||||||||
|
I believe this should be closed as "Works as Designed," and all the work has been done on the drivers side to account for this behavior. | |||||||||||||||||||||||||||||
| Comment by Alexander Golin (Inactive) [ 14/Oct/20 ] | |||||||||||||||||||||||||||||
|
Hi all, Esha and I are doing our weekly cleanup and noticed all the drivers tickets are completed for | |||||||||||||||||||||||||||||
| Comment by Divjot Arora (Inactive) [ 13/Mar/20 ] | |||||||||||||||||||||||||||||
|
Seems like this is heading towards "Works as Designed". Should we close this? | |||||||||||||||||||||||||||||
| Comment by Emily Giurleo (Inactive) [ 14/Feb/20 ] | |||||||||||||||||||||||||||||
|
Thank you! I didn't realize this was the case. | |||||||||||||||||||||||||||||
| Comment by Mira Carey [ 14/Feb/20 ] | |||||||||||||||||||||||||||||
The obvious ones to me would have been where mongos hadn't actually attempted to perform any operations yet. Perhaps if it was shutting down (in some kind of drain mode), or perhaps even on the existing shutdown path today. | |||||||||||||||||||||||||||||
| Comment by Emily Giurleo (Inactive) [ 14/Feb/20 ] | |||||||||||||||||||||||||||||
|
mira.carey@mongodb.com I just want to clarify something you said at the end of your comment – The larger idea here was that the important thing about retryable writes was that someone perform ~1 retry for any given write. My thinking was that it wasn't particularly important who was performing that retry, and so the retryable error label was an implementation detail when we needed to shunt that responsibility to the driver. Currently, is there ever a case where mongos propagates the RetryableWriteError label to the driver? Glancing at the RetryableWriteError code in the server would suggest not, but I think your comment (and conversations I've had with Matt) suggests that there would be cases where it would be preferable for the driver to retry instead. I just want to make sure we're all on the same page on the implementation that's going into 4.4 servers. | |||||||||||||||||||||||||||||
| Comment by Mira Carey [ 12/Feb/20 ] | |||||||||||||||||||||||||||||
|
matt.broadstone the doc I reviewed was Retryable Writes Error Labels Proposal. It appears that lgtm'rs were Judah, Lingzhi, Siyuan, myself and Jeremy. That doc was spawned off In terms of specific objections:
I think that's always the case. The competing desire was to perform fewer retries and return to the user more quickly. Maybe too much was read into the driver's decision to perform only 1 retry. At large (perhaps with universal timeouts) we would always perform infinite retries, but we seem to have a bias for fewer unlikely-to-actually-succeed retries as the current status quo.
Fair, but also feels like something we could clarify in documentation, if we were otherwise satisfied with the overall approach.
The something we're talking about here is switching from a mongos to a mongod? Or is there something else you mean there? The larger idea here was that the important thing about retryable writes was that someone perform ~1 retry for any given write. My thinking was that it wasn't particularly important who was performing that retry, and so the retryable error label was an implementation detail when we needed to shunt that responsibility to the driver. I'm certainly willing to reconsider if the downstream implications of avoiding retries in the drivers outweighs the cost of an extra 3 retries in mongos. | |||||||||||||||||||||||||||||
| Comment by Matt Broadstone [ 11/Feb/20 ] | |||||||||||||||||||||||||||||
|
emily.giurleo Thanks for the update! I have no problem renaming the ticket. mira.carey@mongodb.com can you point me to the design document for this feature? I can't seem to find it, and I'd like to be able to refer to it when we discuss the design. For instance, when Emily says "It was originally decided that mongos would not return the `RetryableWriteError` label to the client because mongos would internally retry any failed commands", where was that decision made? Who was involved in that decision? It seems to me we may miss an opportunity to successfully complete a write if a single mongos fails to complete the operation, but some other proxy might have a chance. This would seem to break the contract we extend our users regarding retryable writes:
| |||||||||||||||||||||||||||||
| Comment by Mira Carey [ 11/Feb/20 ] | |||||||||||||||||||||||||||||
|
To expand on emily.giurleo's comment, The intent from the design was for the label to be a direction from the server to the driver to initiate an extra retry. On mongos, where we'd already performed retries (the only time we would propagate retryable errors), the consideration was whether it'd be easier to rewrite replies to strip the label, or simply have mongod not return them (and say that an explicit whitelist was an implementation detail). Again, the reason being to avoid excessive numbers of retries, in a world where drivers retry only once. Which is to say that the existing behavior was intentional:
And the larger read is that we never actually want drivers to retry "retryable simply because of error code" errors. We want drivers to retry if and only if the label is present (which would only be because mongos had not itself performed a retry) | |||||||||||||||||||||||||||||
| Comment by Emily Giurleo (Inactive) [ 11/Feb/20 ] | |||||||||||||||||||||||||||||
|
It was originally decided that mongos would not return the `RetryableWriteError` label to the client because mongos would internally retry any failed commands. If the client retried the command as well, that would result in a command being retried up to 6 times, which is a lot. matt.broadstone made the point that if the error label is propagated to the client, the client might connect to a different mongos on retry, making it more likely that the command will succeed. So I think there are two things to do here:
| |||||||||||||||||||||||||||||
| Comment by Amirsaman Memaripour [ 11/Feb/20 ] | |||||||||||||||||||||||||||||
|
lingzhi.deng Can you please clarify why internal clients (e.g., `mongos`) should not get `RetryableWriteError` error labels? | |||||||||||||||||||||||||||||
| Comment by Divjot Arora (Inactive) [ 10/Feb/20 ] | |||||||||||||||||||||||||||||
|
matt.broadstone amirsaman.memaripour I'm not sure this is just a Service Arch issue. Given that the RetryableWriteError label was not propagated to the driver from a mongos in the original implementation, it's possible that the ErrorLabelBuilder::isRetryableWriteError method is intentionally missing checks for errors that could only happen in sharded clusters. If you're proposing removing the _isInternalClient check at the beginning of the method, we also need to verify that any sharding-specific errors are not being excluded. | |||||||||||||||||||||||||||||
| Comment by Matt Broadstone [ 10/Feb/20 ] | |||||||||||||||||||||||||||||
|
amirsaman.memaripour I think you want to be looking at this code in particular: https://github.com/mongodb/mongo/blob/b5134bcb006e74e2102de6e171a28931f84ea12f/src/mongo/db/error_labels.cpp#L45-L49. Here's an example of how to witness this behavior using the shell:
| |||||||||||||||||||||||||||||
| Comment by Amirsaman Memaripour [ 10/Feb/20 ] | |||||||||||||||||||||||||||||
|
matt.broadstone my understanding is that the current code for `mongos` applies error labels when `writeConcernError` is set for a command: Can you please let me know the exact tests that are failing, and a pointer on how I can reproduce the failures on my local machine? Thank you! | |||||||||||||||||||||||||||||
| Comment by Githook User [ 05/Feb/20 ] | |||||||||||||||||||||||||||||
|
Author: {'username': 'mbroadst', 'name': 'Matt Broadstone', 'email': 'mbroadst@mongodb.com'}Message: fix: explicitly specify error labels to support mongos 4.3 Mongos does not currently fill in `errorLabels` for a `failCommand` Related to SERVER-45939 | |||||||||||||||||||||||||||||
| Comment by Matt Broadstone [ 05/Feb/20 ] | |||||||||||||||||||||||||||||
|
ratika.gandhi while it would be ideal to prioritize this work, we've worked around it by explicitly adding errorLabels to our failCommand fail points. I created SPEC-1565 to track the work of reverting that change when this is fixed in mongos. Having said that, RC0 feels a bit late given that all drivers will have to resync their tests multiple times. | |||||||||||||||||||||||||||||
| Comment by Ratika Gandhi [ 04/Feb/20 ] | |||||||||||||||||||||||||||||
|
matt.broadstone, By what date do you need this? Can it wait until RC0? |