[DRIVERS-2112] Retryable writes can write twice under very specific circumstances Created: 25/Jul/18 Updated: 31/Mar/22 |
|
| Status: | Backlog |
| Project: | Drivers |
| Component/s: | Retryability |
| Fix Version/s: | None |
| Type: | Spec Change | Priority: | Major - P3 |
| Reporter: | Kevin Adistambha | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Driver Changes: | Needed |
| Description |
|
The spec for retryable writes (https://github.com/mongodb/specifications/blob/master/source/retryable-writes/retryable-writes.rst) contains a Q&A section at the end of the document. However, the Q&A does not address one particular scenario in which retryable writes can violate their "once and only once" guarantee, which is mentioned in the docs under Failover Period.
This has created some confusion in the community, since there appears to be at least one possibility, however small, that a retryable write can violate its intended design. Please consider adding this specific failure scenario (and how unlikely it is) to the Q&A section. |
| Comments |
| Comment by A. Jesse Jiryu Davis [ 31/Jan/22 ] |
|
I added some detail in a comment on the timeoutMS PR (https://github.com/mongodb/specifications/pull/881#issuecomment-939072212). I think timeoutMS makes this issue much more concerning, since we'll keep trying indefinitely by default. |
| Comment by Jeremy Mikola [ 14/Dec/18 ] |
|
This has been moved to the SPEC backlog, as specs are not a primary documentation source for our users. |
| Comment by Kevin Adistambha [ 26/Jul/18 ] |
|
jmikola: Yes, that is one possible timeline in which a double write could happen. kevin.pulo and I discussed this just now and arrived at the same conclusion as you: checking the elapsed time between the first write and the retried write might prevent a scenario where the application was distracted for longer than localLogicalSessionTimeoutMinutes. However, that logic could not prevent the scenario where the application simply gets suspended for an extended period of time. Since the possibility of a double write is mentioned clearly in the docs, inside a big red warning box that is quite eye-catching, adding this specific edge case to the Q&A section of the spec might help clear up some of the confusion and concerns people have about this feature. |
| Comment by Jeremy Mikola [ 25/Jul/18 ] |
|
Assuming my understanding above is correct, I wonder if drivers could work around this edge case by adding some logic to ensure that we do not attempt a retry if localLogicalSessionTimeoutMinutes has elapsed since the first attempt. Since the field is reported by isMaster, drivers do have access to its value. If drivers were to track the time that the first write command is issued, we could check against this before retrying. I suppose there may still be an edge case where the application is suspended between that calculation and the retried write command being sent on an outgoing socket. If so, this may not be worth the trouble since we can't fully eliminate the edge case. |
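The elapsed-time check proposed in this comment could be sketched roughly as follows. This is only an illustration in Python, not code from any driver or from the spec: `TransientWriteError`, `write_with_guarded_retry`, and the `send_write` callable are hypothetical names, and the session timeout is passed in as a plain number on the assumption that the driver has already read it from the server.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientWriteError(Exception):
    """Hypothetical stand-in for a retryable network or failover error."""


def write_with_guarded_retry(
    send_write: Callable[[], T],
    session_timeout_minutes: float,
    clock: Callable[[], float] = time.monotonic,
) -> T:
    """Attempt a write, retrying once only if the session timeout has not elapsed.

    send_write is a hypothetical callable that sends the same write command
    (same session and transaction number) each time it is invoked.
    """
    first_attempt_at = clock()  # time the first write command is issued
    try:
        return send_write()
    except TransientWriteError:
        elapsed_minutes = (clock() - first_attempt_at) / 60.0
        if elapsed_minutes >= session_timeout_minutes:
            # Too much time has passed: the server may no longer remember
            # the first attempt, so a retry could be applied a second time.
            raise
        # Residual window: if the process is suspended between the check
        # above and the send below, the retry may still arrive too late.
        return send_write()
```

The comment between the check and the second `send_write()` call marks the window discussed above: the guard narrows the edge case but cannot close it, which is why it may not be worth the added complexity.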
| Comment by Jeremy Mikola [ 25/Jul/18 ] |
|
kevin.adistambha: Can you confirm if my understanding below is correct?
|