[SERVER-65418] Release all resources before sleep in WriteConflictException::logAndBackoff Created: 10/Apr/22  Updated: 26/Oct/23

Status: Open
Project: Core Server
Component/s: None
Affects Version/s: 4.0.23, 5.0.7
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: peng zhenyi Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: former-storex-namer, writeConflict
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-66751 Determine if lock acquisition can hap... Open
is related to SERVER-66750 Determine if lock acquisition can hap... Closed
is related to SERVER-66752 Determine if lock acquisition can hap... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Participants:

 Description   

When a single document is modified concurrently, only one request can commit successfully; the other requests get WriteConflictExceptions and retry internally (https://github.com/mongodb/mongo/blob/v5.0/src/mongo/db/concurrency/write_conflict_exception.h#L70-L106).

 

 
logAndBackoff (https://github.com/mongodb/mongo/blob/v5.0/src/mongo/db/concurrency/write_conflict_exception.cpp#L51-L59) is called before each retry attempt, and it sleeps when numAttempts is greater than 3 (https://github.com/mongodb/mongo/blob/v5.0/src/mongo/util/log_and_backoff.cpp#L39-L51).
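To make the backoff behaviour concrete, here is a minimal sketch of an attempt-count-based delay. The function name backoffDelay is hypothetical, and only the "no sleep until numAttempts exceeds 3" rule comes from this ticket; the later thresholds and durations are assumptions for illustration, not a copy of log_and_backoff.cpp.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical helper: how long a retry sleeps before attempt `numAttempts`.
// Only the first rule (no sleep for the first 3 attempts) is taken from the
// ticket text; every other threshold and duration below is an assumption.
std::chrono::milliseconds backoffDelay(std::size_t numAttempts) {
    using std::chrono::milliseconds;
    if (numAttempts <= 3) {
        return milliseconds(0);  // early retries run immediately
    }
    if (numAttempts < 10) {
        return milliseconds(1);  // assumed: short sleep for the next few retries
    }
    if (numAttempts < 100) {
        return milliseconds(5);  // assumed: longer sleep under sustained conflict
    }
    return milliseconds(10);     // assumed: matches the 10ms used in the example
}
```

Whatever the exact schedule is, the point relevant to this ticket is that any non-zero delay is spent with tickets and locks still held.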

 
However, the resources held by these requests are not released while they sleep. As a result, newly incoming requests get stuck, because the global tickets and the global/database/collection locks are held by many sleeping retry requests.

For example, if 128 write requests are sleeping for 10ms (waiting to retry) at the same time, then no global write tickets are available during those 10ms. All newly incoming write requests are stuck, even though MongoDB is doing no work (it is just sleeping).
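The arithmetic of that example can be sketched with a toy ticket pool. TicketPool and admittedDuringBackoff are hypothetical names, this is not MongoDB's ticket holder, and the pool size of 128 is an assumption implied by the example above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical model of a fixed-size ticket pool (not MongoDB's real class).
class TicketPool {
public:
    explicit TicketPool(std::size_t total) : _available(total) {}

    // Takes one ticket if any is left; never blocks.
    bool tryAcquire() {
        if (_available == 0) {
            return false;
        }
        --_available;
        return true;
    }

    std::size_t available() const { return _available; }

private:
    std::size_t _available;
};

// Counts how many of `incoming` new requests obtain a ticket while
// `sleepers` retrying requests keep holding theirs during the backoff sleep.
std::size_t admittedDuringBackoff(std::size_t totalTickets,
                                  std::size_t sleepers,
                                  std::size_t incoming) {
    TicketPool pool(totalTickets);
    for (std::size_t i = 0; i < sleepers; ++i) {
        pool.tryAcquire();  // each sleeping retry holds a ticket
    }
    std::size_t admitted = 0;
    for (std::size_t i = 0; i < incoming; ++i) {
        if (pool.tryAcquire()) {
            ++admitted;
        }
    }
    return admitted;
}
```

With 128 tickets and 128 sleepers, admittedDuringBackoff(128, 128, 10) admits none of the 10 new requests; with only 100 sleepers, 28 tickets remain for newcomers.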

I think it would be better to release all resources before sleeping and reacquire them after sleeping, provided the retry function is not in a WUOW. That way, MongoDB can handle more requests during this period. My basic idea is this:

// release all resources
Locker::LockSnapshot ls;
invariant(opCtx->lockState()->saveLockStateAndUnlock(&ls));
 
// logAndBackoff sleep
 
// get all resources back and retry
opCtx->lockState()->restoreLockState(ls); 
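The same release/sleep/restore pattern can be written as a scope guard, so that the lock state is restored even if the code between release and restore exits early. The sketch below is standalone and uses a mock: MockLocker, YieldForBackoff and backoffWithYield are hypothetical names, and the mock's rule of refusing to yield inside a WUOW is a modelling assumption that mirrors the condition stated above, not the documented behaviour of saveLockStateAndUnlock.

```cpp
#include <cassert>
#include <chrono>
#include <thread>

// Hypothetical stand-in for the lock state held by one operation.
struct MockLocker {
    int heldLocks = 0;
    bool inWriteUnitOfWork = false;

    // Saves the held count and drops everything.
    // Assumption: yielding is refused while inside a WUOW.
    bool saveAndUnlock(int* snapshot) {
        if (inWriteUnitOfWork) {
            return false;
        }
        *snapshot = heldLocks;
        heldLocks = 0;
        return true;
    }

    void restore(int snapshot) { heldLocks = snapshot; }
};

// RAII guard: releases on construction when allowed, restores on destruction.
class YieldForBackoff {
public:
    explicit YieldForBackoff(MockLocker& locker)
        : _locker(locker), _released(locker.saveAndUnlock(&_snapshot)) {}

    ~YieldForBackoff() {
        if (_released) {
            _locker.restore(_snapshot);
        }
    }

    YieldForBackoff(const YieldForBackoff&) = delete;
    YieldForBackoff& operator=(const YieldForBackoff&) = delete;

private:
    MockLocker& _locker;
    int _snapshot = 0;  // declared before _released so it is initialized first
    bool _released;
};

// Sleeps for `delay` and reports how many locks were held during the sleep.
int backoffWithYield(MockLocker& locker, std::chrono::milliseconds delay) {
    YieldForBackoff guard(locker);
    std::this_thread::sleep_for(delay);
    return locker.heldLocks;  // read before the guard restores the locks
}
```

Outside a WUOW, an operation holding 3 locks holds 0 during the sleep and 3 again afterwards; inside a WUOW, the guard does nothing and all 3 stay held throughout.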

 

