[SERVER-61792] Extending lock in batcher hurts performance
Created: 30/Nov/21  Updated: 29/Oct/23  Resolved: 01/Dec/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.2.0

Type: Bug
Priority: Major - P3
Reporter: Matthew Russotto
Assignee: Matthew Russotto
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Replication 2021-12-13
Participants:
Linked BF Score: 135

 Description   

The fix in SERVER-61334 for avoiding an uninterruptible lock causes a performance regression in the insert_vector test. Experimentation shows that the cause is not the double-locking but the longer lock hold time: taking the global lock also takes the PBWM (ParallelBatchWriterMode) lock, which synchronizes the OplogBatcher with the OplogApplier.

We should take neither the PBWM nor any WT tickets for this particular lock acquisition. We should also audit the other GlobalLock uses added as part of FileCopyBasedInitialSync to see whether they, too, should skip the PBWM or WT tickets.
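
For readers unfamiliar with intent locks, the following is a minimal Python sketch of the standard multi-granularity lock compatibility rules, which is why an IS hold on the PBWM can still block another thread. It is an illustrative model, not MongoDB source code; the names `COMPATIBLE` and `compatible` are hypothetical, and the premise that the other party requests the PBWM in X mode is an assumption of the sketch rather than something stated in this ticket.

```python
# Illustrative model only: standard multi-granularity lock compatibility,
# not code from the MongoDB server.
# COMPATIBLE[held] is the set of modes another thread may acquire
# while `held` is already granted on the same resource.
COMPATIBLE = {
    "IS": {"IS", "IX", "S"},  # intent-shared conflicts only with X
    "IX": {"IS", "IX"},       # intent-exclusive conflicts with S and X
    "S": {"IS", "S"},         # shared conflicts with IX and X
    "X": set(),               # exclusive conflicts with everything
}


def compatible(held: str, requested: str) -> bool:
    """Return True if `requested` can be granted while `held` is granted."""
    return requested in COMPATIBLE[held]


# Assumption for illustration: one thread holds the PBWM in IS mode while
# another wants it in X mode. Each blocks the other, in either order.
pbwm_is_blocks_x = not compatible("IS", "X")
pbwm_x_blocks_is = not compatible("X", "IS")
```

Under this model, two IS holders never block each other, so the cost described above only appears when an exclusive holder is involved.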



 Comments   
Comment by Githook User [ 01/Dec/21 ]

Author:

{'name': 'Matthew Russotto', 'email': 'matthew.russotto@mongodb.com', 'username': 'mtrussotto'}

Message: SERVER-61792 Extending lock in batcher hurts performance

Before SERVER-61334, we took RSTL IX, PBWM IS, Global IS, once, and released those almost
immediately. SERVER-61334 took RSTL IX, PBWM IS, and Global IS, then took the same locks
recursively, and held them a little longer (not much! It's just getting BSON objects off an
in-memory queue). Global IS is nearly always uncontended (except shutdown and storage change), as is
RSTL IX, and getting a lock we already have is very cheap. But holding PBWM IS a little longer was
enough to essentially serialize oplog application and batching a lot of the time.

This fix takes RSTL IX and Global IS, then takes those locks recursively, and holds them a bit
longer than the original, but there's no contention. The purpose of SERVER-61334 was to ensure we
did not enqueue an uninterruptible global lock while a storage change held Global X, as that results
in deadlock; this purpose is preserved by this change.
Branch: master
https://github.com/mongodb/mongo/commit/526f24e10905eb80f58dcd1ddd5db37853c91a60
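
The three lock sets named in the commit message above can be compared with a small Python sketch. This is a toy model, not the server's lock manager: `LOCK_SETS`, `APPLIER_LOCKS`, `conflicts`, and `contended` are hypothetical names, and the applier holding the PBWM in X mode is an assumption made for illustration. The model also ignores hold duration, which is what distinguished the original code from SERVER-61334.

```python
# Toy model of the lock sets listed in the commit message.
# Not MongoDB code; all names here are hypothetical.
LOCK_SETS = {
    "before SERVER-61334": {"RSTL": "IX", "PBWM": "IS", "Global": "IS"},
    "after SERVER-61334": {"RSTL": "IX", "PBWM": "IS", "Global": "IS"},
    "after SERVER-61792": {"RSTL": "IX", "Global": "IS"},
}

# Assumption: the oplog applier holds the PBWM exclusively while applying.
APPLIER_LOCKS = {"PBWM": "X"}


def conflicts(mode_a: str, mode_b: str) -> bool:
    """Conflict rule restricted to the modes IS, IX, and X:
    two intent modes coexist, anything paired with X conflicts."""
    return "X" in (mode_a, mode_b)


def contended(batcher: dict, other: dict) -> list:
    """Resources on which the batcher's locks conflict with `other`'s."""
    return sorted(
        resource
        for resource, mode in batcher.items()
        if resource in other and conflicts(mode, other[resource])
    )


# Which resources can serialize the batcher against the applier?
report = {name: contended(locks, APPLIER_LOCKS) for name, locks in LOCK_SETS.items()}
```

In this sketch only the lock set after SERVER-61792 shares no conflicting resource with the applier, which matches the commit message's claim that the remaining locks are held longer without contention.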

Comment by Matthew Russotto [ 30/Nov/21 ]

Looks like all the other cases added for FCBIS are primary, startup, or already have ShouldNotBlockSecondaryApplication, so it's just this one.
