[SERVER-42175] WiredTigerRecordStore::reclaimOplog can block for extended periods of time Created: 11/Jul/19  Updated: 27/Oct/23  Resolved: 03/Feb/20

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: 3.4.3, 3.6.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthew Russotto Assignee: Backlog - Storage Execution Team
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
is related to SERVER-30727 Secondary stuck in loop of current el... Closed
is related to SERVER-43322 Add tracking tools for measuring Oplo... Closed
is related to WT-4011 Checkpoint should not read truncated ... Closed
is related to SERVER-34129 Don't hold database or collection loc... Closed
Assigned Teams:
Storage Execution
Operating System: ALL
Sprint: Execution Team 2019-09-09
Participants:

 Description   

When truncating a very large number of oplog entries, WiredTigerRecordStore::reclaimOplog may hold its locks (Global IX since 3.6, DB/Collection IX for the oplog in 3.4) for an extended period of time, blocking operations such as stepdown, which may require a global X lock.

One potential solution to this problem is for reclaimOplog to yield all locks periodically.
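A minimal sketch of the periodic-yield idea, under loose assumptions: this is not MongoDB's actual lock manager or record store API. A std::shared_mutex stands in for the global lock (shared acquisition playing the role of IX, exclusive acquisition the role of X), a std::deque stands in for the oplog, and the batch size is an arbitrary illustrative value.

// Hypothetical illustration only, not the server's implementation.
#include <deque>
#include <iostream>
#include <shared_mutex>

struct OplogEntry {
    long long id;
};

// Truncate at most `batchSize` entries per lock acquisition, then drop the
// lock so an exclusive waiter (e.g. a stepdown-like operation) can interleave
// instead of blocking for the entire reclaim.
void reclaimOplogWithYields(std::deque<OplogEntry>& oplog,
                            long long truncateBefore,
                            std::shared_mutex& globalLock,
                            std::size_t batchSize = 1000) {
    while (true) {
        std::shared_lock<std::shared_mutex> lk(globalLock);  // stand-in for Global IX
        std::size_t removedThisBatch = 0;
        while (removedThisBatch < batchSize && !oplog.empty() &&
               oplog.front().id < truncateBefore) {
            oplog.pop_front();
            ++removedThisBatch;
        }
        if (removedThisBatch < batchSize) {
            return;  // nothing left to truncate below the cutoff
        }
        // The lock is released at end of scope before the next batch,
        // giving exclusive waiters a chance to run.
    }
}

int main() {
    std::deque<OplogEntry> oplog;
    for (long long i = 0; i < 10000; ++i) {
        oplog.push_back({i});
    }
    std::shared_mutex globalLock;
    reclaimOplogWithYields(oplog, /*truncateBefore=*/7500, globalLock);
    std::cout << "entries remaining: " << oplog.size() << "\n";  // expect 2500
    return 0;
}

The sketch only shows the yielding pattern; the real difficulty, as the comments below note, is deciding which operations (such as restartCatalog) must still conflict with truncation across a yield.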



 Comments   
Comment by Geert Bosch [ 31/Jan/20 ]

Given that we believe this to be fixed, shall we close this ticket?

Comment by Maria van Keulen [ 20/Sep/19 ]

Given the performance improvements to oplog truncation in WT-4011, we have no reason to believe that the issue described in this ticket is still a problem after 3.6.4. The tracking tools added in SERVER-43322 will help us determine whether there are still issues to investigate.

I am putting this ticket back in Needs Scheduling for it to be triaged as a pre-3.6.4 issue. We will file additional tickets as necessary if there are issues post-3.6.4.

Comment by Maria van Keulen [ 16/Sep/19 ]

This issue may be the root cause of SERVER-30727, so I am marking them as related. This ticket is specific to 3.4 and releases prior to SERVER-34129, since it pertains to reclaimOplog holding an IX lock on "local".

Comment by Maria van Keulen [ 13/Sep/19 ]

Putting this ticket on hold in favor of doing SERVER-43322, since SERVER-43322 will give us data on how to best proceed.

Comment by Maria van Keulen [ 06/Sep/19 ]

Per the discussions in SERVER-34129, we want to make sure that restartCatalog still conflicts with oplog truncation, so yielding all locks during oplog truncation is tricky.

geert.bosch and I brainstormed alternate solutions to this issue, one of which is to use a larger total number of oplog stones when the oplog is very large. That way, individual stones would be smaller, so truncating the contents of one stone would not require holding the global IX lock for as long.
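For illustration only, a hedged sketch of the "more, smaller stones for larger oplogs" idea. The names and values here (stonesForOplogSize, kTargetStoneBytes, kMinStones, kMaxStones) are made up for the sketch and are not the server's actual oplog-stones parameters or sizing logic.

// Hypothetical sizing sketch, not WiredTigerRecordStore's real logic.
#include <algorithm>
#include <cstdint>
#include <iostream>

// Assumed constants for the sketch, not the server's actual values.
constexpr int64_t kMinStones = 10;
constexpr int64_t kMaxStones = 100000;
constexpr int64_t kTargetStoneBytes = 16 * 1024 * 1024;  // aim for ~16 MB per stone

int64_t stonesForOplogSize(int64_t maxOplogBytes) {
    // More stones for larger oplogs, clamped to a sane range, so truncating
    // one stone's worth of data holds the global IX lock only briefly.
    int64_t stones = maxOplogBytes / kTargetStoneBytes;
    return std::clamp(stones, kMinStones, kMaxStones);
}

int main() {
    for (int64_t gb : {1, 50, 990}) {
        int64_t bytes = gb * 1024LL * 1024 * 1024;
        int64_t stones = stonesForOplogSize(bytes);
        std::cout << gb << " GB oplog -> " << stones << " stones, ~"
                  << (bytes / stones) / (1024 * 1024) << " MB each\n";
    }
    return 0;
}

The design trade-off is that capping the per-stone size bounds the work done under the lock for each truncation, at the cost of tracking more stones in memory for very large oplogs.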
