[SERVER-60839] Introduce a TemporarilyUnavailable error type Created: 20/Oct/21  Updated: 29/Oct/23  Resolved: 16/Feb/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.0.0-rc0, 5.0.15

Type: Bug
Priority: Major - P3
Reporter: Dmitry Agranat
Assignee: Josef Ahmad
Resolution: Fixed
Votes: 0
Labels: RDY
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
backported by SERVER-72910 [v5.0] Backport wtRCToStatus changes ... Closed
Depends
depends on WT-8290 Adding a new API to the session to re... Closed
Duplicate
is duplicated by SERVER-61454 Change retry policy when txns are rol... Closed
Problem/Incident
Related
related to SERVER-61909 Hang inserting or deleting document w... Closed
related to DOCS-14887 [SERVER] Add section about WT dirty d... Closed
related to SERVER-64050 set TemporarilyUnavailableException::... Closed
is related to SERVER-62650 RecordStore RecordId initialization c... Closed
is related to SERVER-65254 Disable TemporarilyUnavailableExcepti... Closed
is related to SERVER-65360 TemporarilyUnavailable errors incorre... Closed
is related to SERVER-63620 Evaluate locations that throw WriteCo... Backlog
is related to SERVER-63340 Add a --loadShedding server parameter Closed
is related to SERVER-63720 Architecture guide updates for Tempor... Closed
is related to SERVER-67984 Re-enable TemporarilyUnavailableExcep... Closed
is related to SERVER-63333 Attach retryable error label to Tempo... Closed
is related to SERVER-63338 Add uassertWTOK, invariantWTOK, wtRCT... Closed
Backwards Compatibility: Minor Change
Operating System: ALL
Backport Requested:
v5.0
Sprint: Execution Team 2022-02-07, Execution Team 2022-02-21
Participants:
Case:
Linked BF Score: 10

 Description   

The TemporarilyUnavailable error indicates that the operation has been aborted, likely due to excessive server load (e.g. transaction rolled back for eviction). This error is retried in the server with an increasingly larger backoff. Internal operations are retried indefinitely; user operations are retried up to a fixed number of attempts before TemporarilyUnavailable is returned to the client.
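
A minimal sketch of the retry policy described above, not the server's implementation. The function name, the attempt limit, the backoff constants, and the doubling schedule are all assumptions made for illustration; the ticket only states that the backoff grows and that user operations have a fixed attempt limit.

```python
import time


class TemporarilyUnavailable(Exception):
    """Stand-in for the server's TemporarilyUnavailable error."""


def retry_temporarily_unavailable(op, *, is_internal, max_user_attempts=10,
                                  base_backoff_s=0.001, max_backoff_s=1.0,
                                  sleep=time.sleep):
    """Run op(), retrying on TemporarilyUnavailable with a growing backoff.

    Internal operations retry until they succeed. User operations give up
    after max_user_attempts (hypothetical value) and re-raise the error.
    """
    attempt = 0
    while True:
        attempt += 1
        try:
            return op()
        except TemporarilyUnavailable:
            if not is_internal and attempt >= max_user_attempts:
                raise  # the error escapes to the client
            # Assumed schedule: double the delay per failed attempt, with a cap.
            sleep(min(base_backoff_s * 2 ** (attempt - 1), max_backoff_s))
```

Passing `sleep` as a parameter keeps the sketch easy to exercise without real delays.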
 
------
 
Original title: Instead of WriteConflict, return a more specialized error when oldest transactions are rolled back for eviction
Original description: Currently, when a write operation hits the WiredTiger dirty threshold limit, we take the error from WiredTiger, a WT_ROLLBACK, and convert it to a WriteConflict. This is misleading; the server should report something more specific that indicates the actual reason.



 Comments   
Comment by Githook User [ 18/Jan/23 ]

Author:

{'name': 'Josef Ahmad', 'email': 'josef.ahmad@mongodb.com', 'username': 'josefahmad'}

Message: SERVER-72910: Partial backport SERVER-60839 Make wtRCToStatus require a WT_SESSION pointer

This is groundwork for further differentiating WT return codes.

(cherry picked from commit f4aaa34d623e7385b2ac5b332ee07ece1f22c428)
Branch: v5.0
https://github.com/mongodb/mongo/commit/1b7e12704065cfd1e85189d05f73b9e9f2e7d90f

Comment by Githook User [ 18/Jan/23 ]

Author:

{'name': 'Josef Ahmad', 'email': 'josef.ahmad@mongodb.com', 'username': 'josefahmad'}

Message: SERVER-72910: Partial backport SERVER-60839 Make wtRCToStatus require a WT_SESSION pointer

(cherry picked from commit 7cfa78a4e20eb59c4d592bb12b6493c451b8dd13)
Branch: v5.0
https://github.com/10gen/mongo-enterprise-modules/commit/ba79f5533fbf296d01bfd58b27cf80e93fc528b5

Comment by Yujin Kang Park [ 17/Jan/23 ]

gregory.noma@mongodb.com, thanks for the suggestion. I have created SERVER-72910.

Comment by Yujin Kang Park [ 17/Jan/23 ]

Requesting backport to 5.0, at least for the first commit in the ticket (regarding passing WT_SESSION to wtRCToStatus_slow)

https://github.com/mongodb/mongo/commit/f4aaa34d623e7385b2ac5b332ee07ece1f22c428

louis.williams@mongodb.com I am assuming we don't want to backport the temporarily unavailable error.

Comment by Githook User [ 15/Feb/22 ]

Author:

{'name': 'Josef Ahmad', 'email': 'josef.ahmad@mongodb.com', 'username': 'josefahmad'}

Message: SERVER-60839 Add TemporarilyUnavailable error

Introduce a TemporarilyUnavailable error and exception type for load
shedding. This error indicates that the operation has been aborted,
likely due to excessive server load.

Errors are retried with an increasingly larger backoff. Internal operations
are retried indefinitely, user operations up to a fixed number of attempts.
Branch: master
https://github.com/mongodb/mongo/commit/581c58c475a872e25b2e3bf7cf5ccd52425ef7c7

Comment by Louis Williams [ 07/Feb/22 ]

kevin.jernigan, there are 2 cases to consider:

Tests that use multi-document transactions handle WriteConflictExceptions as a TransientTransactionError and retry indefinitely. This is what we tell users to do, and in fact, newer drivers do this automatically for users.

For non-multi-document transactions, this error is currently being retried indefinitely inside the server. The proposed behavior is to retry a finite number of times before eventually letting it escape.

The problem here is that our multi-document transactions tests were designed to handle this type of error, but the rest of our tests (i.e. most of them) are not.

Comment by Kevin Jernigan (Inactive) [ 04/Feb/22 ]

When this condition happens today, i.e. when a write operation hits the WiredTiger dirty threshold limit, we convert to a WriteConflict. How do we handle this in our test infrastructure? Don't we fail entire tests for commands that aren't retryable? If so, then what changes if we return a more specialized error for this condition? Won't the same tests fail that would fail without the changes in this ticket?

Comment by Githook User [ 02/Feb/22 ]

Author:

{'name': 'Josef Ahmad', 'email': 'josef.ahmad@mongodb.com', 'username': 'josefahmad'}

Message: SERVER-60839 Make wtRCToStatus require a WT_SESSION pointer

This is groundwork for further differentiating WT return codes.
Branch: master
https://github.com/mongodb/mongo/commit/f4aaa34d623e7385b2ac5b332ee07ece1f22c428

Comment by Githook User [ 02/Feb/22 ]

Author:

{'name': 'Josef Ahmad', 'email': 'josef.ahmad@mongodb.com', 'username': 'josefahmad'}

Message: SERVER-60839 Make wtRCToStatus require a WT_SESSION pointer
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/7cfa78a4e20eb59c4d592bb12b6493c451b8dd13

Comment by Eric Milkie [ 14/Jan/22 ]

Thanks for the clarifications; I modified the title of this ticket for better specificity. Should we close SERVER-61454 as a duplicate?

Comment by Louis Williams [ 14/Jan/22 ]

milkie, I discussed this with keith.smith, and he confirmed that there is only one scenario in which a transaction is rolled back due to pinning cache space: the "oldest pinned transaction ID rolled back for eviction" case.

The "synchronous" case you described is just a generalization of the asynchronous case. When a very large transaction pins cache space and is unable to evict pages, WiredTiger will start to roll back transactions, starting from the oldest, until it gets to the large one. So the two cases you described are not distinguishable from WiredTiger's perspective.

Comment by Eric Milkie [ 13/Jan/22 ]

It sounds like this ticket is starting to overlap with SERVER-61454. There are actually two similar cases for transaction rollback; one is asynchronous via other threads performing eviction and is based on transaction id age, and one I believe is synchronous within the transaction thread itself once that transaction pins too many pages with uncommitted writes, regardless of transaction age. I was assuming this ticket SERVER-60839 was dealing with the latter situation. In any event, I think we should treat these two cases differently with respect to retry logic.

Comment by Louis Williams [ 13/Jan/22 ]

We should consider retrying internally once or twice in the existing writeConflictRetry path before ultimately letting this error escape. Additionally, we are considering labeling this error code as retryable so that drivers can retry once on their end.

We won't be able to let this error escape internal threads. We can only let the error escape for user-originating operations.

Comment by Louis Williams [ 05/Jan/22 ]

Using the work from WT-8290, we can now call WT_SESSION::get_rollback_reason after receiving a WT_ROLLBACK. If the reason is "oldest pinned transaction ID rolled back for eviction", we will return an error code indicating that the operation exceeded a memory limit. Perhaps the existing ExceededMemoryLimit would be a good error code to use.
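
The mapping proposed in this comment could look roughly like the sketch below. It is illustrative only: `get_rollback_reason` is a plain callable standing in for WT_SESSION::get_rollback_reason, the `wt_rc_to_error` helper and the `ErrorCode` enum are hypothetical, and the numeric WT_ROLLBACK value is assumed to match wiredtiger.h. The sketch maps to TemporarilyUnavailable, the name this ticket eventually settled on, rather than the ExceededMemoryLimit code floated here.

```python
from enum import Enum

# Assumed to match the constant in wiredtiger.h; treat as a placeholder.
WT_ROLLBACK = -31800

# Reason string quoted in this ticket.
OLDEST_FOR_EVICTION = "oldest pinned transaction ID rolled back for eviction"


class ErrorCode(Enum):
    OK = "OK"
    WRITE_CONFLICT = "WriteConflict"
    TEMPORARILY_UNAVAILABLE = "TemporarilyUnavailable"
    UNKNOWN_ERROR = "UnknownError"


def wt_rc_to_error(rc, get_rollback_reason):
    """Translate a WiredTiger return code into a server-style error code.

    get_rollback_reason is only consulted after a WT_ROLLBACK, mirroring
    the order of calls described in the comment.
    """
    if rc == 0:
        return ErrorCode.OK
    if rc == WT_ROLLBACK:
        if get_rollback_reason() == OLDEST_FOR_EVICTION:
            # Rolled back to relieve cache pressure: a load problem, not a conflict.
            return ErrorCode.TEMPORARILY_UNAVAILABLE
        # Any other rollback keeps the existing WriteConflict behavior.
        return ErrorCode.WRITE_CONFLICT
    return ErrorCode.UNKNOWN_ERROR
```

Keeping every other rollback reason on the WriteConflict path means only the eviction case changes behavior.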

Generated at Thu Feb 08 05:50:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.