[SERVER-77972] Investigate that PreImagesTruncateManager on Secondaries takes PBWM lock Created: 09/Jun/23  Updated: 29/Oct/23  Resolved: 24/Jul/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Haley Connelly Assignee: Haley Connelly
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
is duplicated by SERVER-78042 Truncates on change collections and p... Closed
Related
related to SERVER-78042 Truncates on change collections and p... Closed
is related to SERVER-78599 Prevent change collections from trunc... Closed
is related to SERVER-79234 Revisit allDurable and lastApplied ch... Backlog
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution EMEA Team 2023-07-10, Execution EMEA Team 2023-07-24, Execution EMEA Team 2023-08-07
Participants:
Linked BF Score: 114

 Description   

During truncate marker initialisation on a secondary, the PBWM lock is acquired in MODE_IS by default.

A deadlock can happen as follows:

  • There is a large prepared transaction; the primary waits for its commit to be replicated.
  • The OplogApplier on the secondary tries to apply a new batch containing the commit, but needs to acquire the PBWM lock in MODE_X.
  • The ChangeStreamExpiredPreImagesRemover has acquired the PBWM lock in MODE_IS during initialisation and gets stuck waiting for the prepared transaction to commit or abort, which can never happen because applying the commit needs the PBWM lock (see the sketch below).
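
The cycle can be modelled with ordinary standard-library primitives. This is a minimal sketch only: the mutex, promise, and thread names below are illustrative stand-ins for the PBWM lock, the prepare-conflict wait, and the server threads, not actual server code.

{code:cpp}
#include <chrono>
#include <future>
#include <iostream>
#include <shared_mutex>
#include <thread>

int main() {
    std::shared_timed_mutex pbwm;            // stand-in for the PBWM lock
    std::promise<void> preparedTxnResolved;  // fulfilled once commitTransaction is applied
    auto resolved = preparedTxnResolved.get_future();

    // ChangeStreamExpiredPreImagesRemover: holds PBWM in MODE_IS (shared) and then
    // hits a prepare conflict, i.e. waits for the prepared transaction to resolve.
    std::thread remover([&] {
        std::shared_lock<std::shared_timed_mutex> is(pbwm);
        resolved.wait();  // never signalled while the applier is blocked below
    });

    // OplogApplier: the batch containing commitTransaction needs PBWM in MODE_X
    // (exclusive), which cannot be granted while the remover holds MODE_IS.
    std::thread applier([&] {
        std::this_thread::sleep_for(std::chrono::milliseconds(100));  // let the remover take MODE_IS first
        std::unique_lock<std::shared_timed_mutex> x(pbwm, std::defer_lock);
        if (!x.try_lock_for(std::chrono::seconds(2))) {
            std::cout << "deadlock: applier cannot take PBWM MODE_X, so the prepared"
                         " transaction is never resolved\n";
            preparedTxnResolved.set_value();  // break the cycle so the demo can exit
            return;
        }
        preparedTxnResolved.set_value();
    });

    applier.join();
    remover.join();
}
{code}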


 Comments   
Comment by Githook User [ 21/Jul/23 ]

Author:

{'name': 'Haley Connelly', 'email': 'haley.connelly@mongodb.com', 'username': 'haleyConnelly'}

Message: SERVER-77972 Prevent pre-image truncation from taking PBWM
Branch: master
https://github.com/mongodb/mongo/commit/4ff092d683be418230ef28fa3f3c81833b82c570

Comment by Haley Connelly [ 03/Jul/23 ]

This is also likely an issue with change collections. Initially I was planning to address it in this ticket, but decided it deserves its own ticket and filed SERVER-78599.

Comment by Haley Connelly [ 19/Jun/23 ]

To add some context on why all_durable alone isn't sufficient for secondaries (please correct me if this seems wrong):

  • WiredTiger publishes all_durable at the end of every WT commit.
  • During secondary oplog application, entries within a batch are applied in parallel.
  • If WiredTiger doesn't yet know about the TS(50) write in an oplog batch that has already applied/committed TS(51), all_durable can advance to TS(51) despite TS(50) not being committed (a toy model follows below).
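
A toy model of that last point, under the assumption that all_durable is derived only from commits the storage engine already knows about. This is an illustration, not WiredTiger's actual implementation:

{code:cpp}
#include <cstdint>
#include <iostream>
#include <set>

struct ToyStorageEngine {
    std::set<std::uint64_t> committed;  // timestamps whose commits have completed
    std::set<std::uint64_t> inFlight;   // pending commits the engine knows about

    // all_durable: largest committed timestamp with no known pending commit below it.
    std::uint64_t allDurable() const {
        std::uint64_t result = 0;
        for (std::uint64_t ts : committed) {
            if (!inFlight.empty() && *inFlight.begin() < ts)
                break;  // a known hole below ts caps all_durable
            result = ts;
        }
        return result;
    }
};

int main() {
    ToyStorageEngine wt;

    // Parallel appliers race: the writer for TS(51) finishes first, while the writer
    // for TS(50) has not even started its storage transaction, so the engine has no
    // record of TS(50) at all -- neither committed nor in flight.
    wt.committed.insert(51);

    std::cout << "all_durable = " << wt.allDurable()
              << " even though TS(50) is not committed\n";  // prints 51
}
{code}
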
Comment by Haley Connelly [ 15/Jun/23 ]

Proposed fix (collaboration with suganthi.mani@mongodb.com):
(1) Skip acquiring the PBWM lock for marker initialisation

    • Setting the ReadSource to lastApplied isn't necessary provided we do part (2).
    • Truncate marker initialisation is an estimation; if initialisation opens a cursor mid oplog batch and we miss reading some entries, that is inconsequential.

(2) Only truncate a marker's range if marker.lastRecord <= all_durable && marker.lastRecord <= lastApplied (see the sketch below)

    • all_durable: For the primary, ensures that all writes with opTimes less than or equal to marker.lastRecord are committed and there are no holes.
    • lastApplied: For secondaries, without the PBWM lock we can read in the middle of an oplog batch. Since lastApplied is only advanced after each completed oplog batch, this guarantees the truncate range is in a consistent state.
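
A hedged sketch of the gating check in (2), using placeholder types and values rather than the actual server code:

{code:cpp}
#include <cstdint>
#include <iostream>

using Timestamp = std::uint64_t;  // placeholder for the server's Timestamp type

struct TruncateMarker {
    Timestamp lastRecord;  // highest record covered by this marker's range
};

// A marker's range may only be truncated once every write it covers is both
// durable with no holes behind it (all_durable) and applied as part of a
// completed oplog batch on this node (lastApplied).
bool canTruncateRange(const TruncateMarker& marker, Timestamp allDurable, Timestamp lastApplied) {
    return marker.lastRecord <= allDurable && marker.lastRecord <= lastApplied;
}

int main() {
    TruncateMarker marker{200};

    // Mid-batch on a secondary: TS(200) is already applied but the batch has not
    // finished, so lastApplied still points at the previous batch boundary.
    std::cout << std::boolalpha
              << canTruncateRange(marker, /*allDurable=*/200, /*lastApplied=*/49) << '\n';  // false

    // Once the batch completes and is durable with no holes, truncation is allowed.
    std::cout << canTruncateRange(marker, /*allDurable=*/200, /*lastApplied=*/200) << '\n';  // true
}
{code}
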
Comment by Haley Connelly [ 13/Jun/23 ]

The goal should be to ensure that a key within a range cannot be written after the range is truncated, to prevent inconsistent views of the data.

Comment by Haley Connelly [ 13/Jun/23 ]

suganthi.mani@mongodb.com Thanks so much for laying out this scenario! We discussed this in a bit more detail offline and I think we've landed on a solution.

The scenario where TS(200) is truncated before TS(100) is committed would be very unlikely, though it becomes more likely the smaller the 'expireAfterSeconds' is.

During truncate initialisation, we could use 'lastApplied' as a read source and wait for it to become durable before performing our scan (roughly sketched below). This could ensure that out-of-order inserts during secondary oplog application do not impact initialisation.

On top of this, I've filed SERVER-78042 to address the potential issue of trying to truncate a record whose timestamp is greater than the last durable timestamp. 
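
A rough sketch of that initialisation approach. Every helper below (getLastAppliedTimestamp, waitUntilDurable, scanPreImagesAt) is a hypothetical placeholder, not a real server API, and the stub bodies exist only to make the sketch self-contained:

{code:cpp}
#include <cstdint>
#include <iostream>

using Timestamp = std::uint64_t;  // placeholder for the server's Timestamp type

// Hypothetical stubs; real code would consult the replication coordinator,
// the journal flusher, and the pre-images collection respectively.
Timestamp getLastAppliedTimestamp() { return 100; }
void waitUntilDurable(Timestamp ts) { std::cout << "waiting for TS(" << ts << ") to be durable\n"; }
void scanPreImagesAt(Timestamp ts) { std::cout << "building markers from snapshot at TS(" << ts << ")\n"; }

// Initialise truncate markers from a snapshot at lastApplied, after that timestamp
// is durable, so out-of-order writes from a partially applied oplog batch are never
// visible to the scan.
void initialiseTruncateMarkers() {
    const Timestamp lastApplied = getLastAppliedTimestamp();
    waitUntilDurable(lastApplied);
    scanPreImagesAt(lastApplied);
}

int main() {
    initialiseTruncateMarkers();
}
{code}

(As the 15/Jun comment above notes, the eventual fix drops the lastApplied read source in favour of gating truncation on all_durable and lastApplied.)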

Comment by Suganthi Mani [ 12/Jun/23 ]

haley.connelly@mongodb.com After speaking to you last Friday, I realized that not taking the PBWM lock during truncate marker initialization can cause change stream (CS) cursors to skip some entries (assuming the truncate ranges are used for both the serverless change collection and the pre-image collection). Consider this scenario.

  1. A secondary starts to apply a batch containing operations at TS(50), TS(100), and TS(200).
  2. The oplog applier applies TS(200) but has not yet applied TS(50) and TS(100).
  3. Truncate marker initialization, not taking the PBWM lock, reads these out-of-order writes, which can cause WT to truncate all records up to TS(200).
  4. Now the oplog applier applies TS(50) and TS(100).
  5. Before the client tries to re-establish the CS cursor on this secondary, the last entry read by the cursor is TS(50). Since the TS(50) entry is present, the CS cursor can resume on this secondary but now misses reporting the TS(200) op to the client.

Note that we can have the same problem on the primary, even with the PBWM lock held, if the read snapshot is lastApplied (not a no-holes point). Consider this scenario.

  1. Node1 is primary; the CS cursor's last read entry is TS(50) (which is majority committed).
  2. Node1 commits TS(200), then truncates the change entries up to TS(200).
  3. Node1 then commits TS(100).
  4. TS(200) manages to get majority committed.
  5. Say node1 crashes; the CS cursor resumes on node2 and reads up to TS(100).
  6. Now node2 crashes and the CS cursor is able to resume on node1, as node1 has TS(100), resulting in TS(200) being missed by the client.

max.hirschhorn@mongodb.com and I think the underlying problem in both cases is that we can have new writes with a recordId older than marker.lastRecord.
