[DRIVERS-1934] withTransaction API retries too frequently Created: 15/Mar/19 Updated: 25/Sep/23 |
|
| Status: | Backlog |
| Project: | Drivers |
| Component/s: | Transactions |
| Fix Version/s: | None |
| Type: | Spec Change | Priority: | Major - P3 |
| Reporter: | Shane Harvey | Assignee: | Unassigned |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | FY24Q4 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||
| Issue Links: |
|
||||||||
| Case: | (copied to CRM) | ||||||||
| Description |
|
Note that this is not really a bug, because the spec was deliberately designed to retry immediately. The withTransaction API retries immediately when it encounters a TransientTransactionError. I think this may cause resource utilization problems on both the client and the server in real-world use cases.
As a simple example, say the client is running two concurrent transactions (A and B) that touch the same document. One of them will fail with a TransientTransactionError (caused by a WriteConflict) and immediately retry; assume this is transaction B. If A is still in progress, the retry of B will fail with the same error, and B will only succeed after A has completed. This can lead to hundreds of failed transactions per second.
Another example is many concurrent transactions that all contend on the same document. I've provided an example in withTransaction.py.
One solution to this problem could be to add a delay before each retry attempt. When I change with_transaction to use a 250 millisecond retry delay, the withTransaction script completes much faster and with far fewer retry attempts.
Note that a fixed retry delay is only one possible solution. We can also investigate others, such as exponential backoff. |
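To make the delay options above concrete, here is a minimal sketch of a retry loop that sleeps with capped exponential backoff and jitter before each retry. It is not the spec's algorithm or any driver's implementation. `run_with_backoff`, its parameters, their default values, and the `TransientTransactionError` exception class are all hypothetical stand-ins chosen for illustration; a real driver would inspect the error label on the server error instead of catching a dedicated exception type.

```python
import random
import time


class TransientTransactionError(Exception):
    """Hypothetical stand-in for an error labelled TransientTransactionError."""


def run_with_backoff(txn_func, base_delay=0.005, max_delay=0.25,
                     timeout=120.0, sleep=time.sleep, clock=time.monotonic):
    """Call txn_func until it succeeds, sleeping between transient failures.

    All default values are illustrative assumptions, not values from the spec.
    """
    start = clock()
    attempt = 0
    while True:
        try:
            return txn_func()
        except TransientTransactionError:
            # Give up once the overall time budget is spent.
            if clock() - start >= timeout:
                raise
            # Capped exponential backoff with full jitter: the upper bound
            # doubles per attempt until it reaches max_delay, and the actual
            # delay is drawn uniformly below it so that contending clients
            # do not all retry at the same instant.
            upper = min(max_delay, base_delay * 2 ** min(attempt, 16))
            sleep(random.uniform(0, upper))
            attempt += 1
```

A fixed delay, as in the 250 millisecond experiment, corresponds to replacing the jittered sleep with `sleep(max_delay)`. The `sleep` and `clock` parameters exist only so the loop can be exercised without real waiting.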
| Comments |
| Comment by Scott L'Hommedieu (Inactive) [ 29/Apr/19 ] |
|
None yet. Beta is just getting started. |
| Comment by Ian Whalen (Inactive) [ 29/Apr/19 ] |
|
scott.lhommedieu any feedback so far on this? |
| Comment by Jeremy Mikola [ 01/Apr/19 ] |
|
scott.lhommedieu: Any follow-up on this from beta users? |
| Comment by Jeremy Mikola [ 15/Mar/19 ] |
|
scott.lhommedieu: Is this worth discussing with users during your outreach or the beta program? That might provide some guidance on what type of solution would be best (e.g. fixed delay, back-off). |
| Comment by Shane Harvey [ 15/Mar/19 ] |
|
Also bear in mind that this Python example may not be the best one, because it uses threads that all contend on the Python GIL (global interpreter lock). The same example written in Java, Go, C, etc. may show different results. |