[SERVER-79922] Investigate solutions to the fsync (lock:true) starvation scenario Created: 11/Aug/23  Updated: 18/Dec/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Nandini Bhartiya Assignee: Nandini Bhartiya
Resolution: Unresolved Votes: 0
Labels: cs-subteam1, sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-66340 Improve distributed transaction commi... Closed
Participants:

 Description   

With SERVER-78149, users can now issue fsync(lock: true) through mongos, which in turn acquires a Global S lock on all the shards. However, the mongos fsync(lock: true) request has a default timeout of 90s, which may lead to scenarios where the fsyncLock request is never granted, especially when transactions are heavily used, either by users or internally.



 Comments   
Comment by Josef Ahmad [ 14/Dec/23 ]

Can this starvation still occur now that we've removed the deadlock condition via SERVER-66340?

Comment by Nandini Bhartiya [ 18/Aug/23 ]

A possible solution that gregory.noma@mongodb.com & I discussed was:

  • To create a new global resource for fsync (as was done for the setFCV command in SERVER-65821).
  • All requests that need the global lock in IX/X mode acquire this resource in IX mode (in the global lock ctor).
  • The fsync command acquires the lock on the new resource in S mode (not in IX, since it would then still be compatible with other requests), thereby preventing any new transactions from starting; they instead queue behind the fsync lock request.
  • The transaction coordinator and the journal flusher threads opt out of the new lock resource, thereby allowing in-flight two-phase transactions to commit/abort.
  • Once the fsync command has acquired the lock on the new resource, it can then request the Global lock in MODE_S, and it will be granted since it will be enqueued at the front of the lock queue.
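
The steps above can be sketched as a toy model. This is only an illustration under stated assumptions, not the server's lock manager: the `Resource` class, the `acquire_global` helper, the FIFO grant policy, and the owner names are all hypothetical. The only thing taken as given is the standard intent-lock compatibility matrix, in which IX is compatible with IX but S is not compatible with IX.

```python
from enum import Enum


class Mode(Enum):
    IS = "IS"
    IX = "IX"
    S = "S"
    X = "X"


# Standard multi-granularity compatibility matrix.
_COMPAT = {
    Mode.IS: {Mode.IS, Mode.IX, Mode.S},
    Mode.IX: {Mode.IS, Mode.IX},
    Mode.S: {Mode.IS, Mode.S},
    Mode.X: set(),
}


def compatible(a, b):
    return b in _COMPAT[a]


class Resource:
    """Hypothetical lock resource with a strict FIFO wait queue."""

    def __init__(self, name):
        self.name = name
        self.granted = []  # list of (owner, mode)
        self.waiting = []  # FIFO list of (owner, mode)

    def _grantable(self, mode):
        return all(compatible(mode, held) for _, held in self.granted)

    def request(self, owner, mode):
        # Grant only if nobody is queued ahead and the mode is compatible
        # with every current holder; otherwise enqueue at the back.
        if not self.waiting and self._grantable(mode):
            self.granted.append((owner, mode))
            return True
        self.waiting.append((owner, mode))
        return False

    def release(self, owner):
        self.granted = [(o, m) for o, m in self.granted if o != owner]
        # Grant waiters from the front of the queue while they are compatible.
        while self.waiting and self._grantable(self.waiting[0][1]):
            self.granted.append(self.waiting.pop(0))

    def holders(self):
        return [o for o, _ in self.granted]


def acquire_global(fsync_res, global_res, owner, mode, opts_out=False):
    # Model of the global lock ctor: IX/X requesters first take the new
    # fsync resource in IX unless they opt out (assumption of this sketch).
    if mode in (Mode.IX, Mode.X) and not opts_out:
        if not fsync_res.request(owner, Mode.IX):
            return False  # queued behind the fsync request
    return global_res.request(owner, mode)


def demo():
    fsync_res, global_res = Resource("fsync"), Resource("global")
    result = {}
    acquire_global(fsync_res, global_res, "txn1", Mode.IX)  # in-flight transaction
    fsync_res.request("fsync", Mode.S)  # queued behind txn1's IX
    # A new transaction queues behind fsync instead of overtaking it.
    result["txn2_granted"] = acquire_global(fsync_res, global_res, "txn2", Mode.IX)
    # The coordinator opts out, so it can still finish txn1's commit.
    result["coordinator_granted"] = acquire_global(
        fsync_res, global_res, "coordinator", Mode.IX, opts_out=True
    )
    fsync_res.release("txn1")
    global_res.release("txn1")
    result["fsync_resource_holders"] = fsync_res.holders()
    global_res.request("fsync", Mode.S)  # waits only for the coordinator
    global_res.release("coordinator")
    result["global_holders"] = global_res.holders()
    return result


if __name__ == "__main__":
    print(demo())
```

In this model the fsync request never competes with newly started transactions for the global lock, because they are held back on the new resource; it only waits for holders that were already in flight or that opted out.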
Comment by Nandini Bhartiya [ 16/Aug/23 ]

Adding this ticket as dependent on SERVER-66340 and SERVER-66341, as there is no feasible solution for the fsync(lock:true) case like there was for the setFCV command in SERVER-65821.

Generated at Thu Feb 08 06:42:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.