[SERVER-77972] Investigate that PreImagesTruncateManager on Secondaries takes PBWM lock Created: 09/Jun/23 Updated: 29/Oct/23 Resolved: 24/Jul/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 7.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Haley Connelly | Assignee: | Haley Connelly |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Execution EMEA Team 2023-07-10, Execution EMEA Team 2023-07-24, Execution EMEA Team 2023-08-07 |
| Participants: | |
| Linked BF Score: | 114 |
| Description |
|
During truncate marker initialisation on a secondary, the PBWM lock is acquired in MODE_IS by default. A deadlock can happen as follows:
|
| Comments |
| Comment by Githook User [ 21/Jul/23 ] |
|
Author: {'name': 'Haley Connelly', 'email': 'haley.connelly@mongodb.com', 'username': 'haleyConnelly'} Message: |
| Comment by Haley Connelly [ 03/Jul/23 ] |
|
This is also likely an issue with change collections. Initially I was planning to do this in this ticket, but decided it deserves its own ticket and filed |
| Comment by Haley Connelly [ 19/Jun/23 ] |
|
To add some context on why all_durable isn't sufficient for secondaries (please correct me if this seems wrong):
|
| Comment by Haley Connelly [ 15/Jun/23 ] |
|
Proposed fix (collaboration with suganthi.mani@mongodb.com):
(2) Only truncate range if marker.lastRecord <= all_durable && marker.lastRecord <= lastApplied
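
As a rough illustration of condition (2), the check can be sketched as a small predicate. This is a simplified model, not the server implementation: the `Marker` class and `can_truncate` function are hypothetical names, and timestamps are reduced to plain integers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Marker:
    """Hypothetical stand-in for a truncate marker (not the server's type)."""

    last_record: int  # timestamp of the newest record covered by the marker


def can_truncate(marker: Marker, all_durable: int, last_applied: int) -> bool:
    """Return True only if the marker's range satisfies condition (2).

    Assumption: timestamps are modelled as integers, so '<=' stands in for
    the timestamp comparison described in the proposed fix.
    """
    return marker.last_record <= all_durable and marker.last_record <= last_applied


if __name__ == "__main__":
    # The range ends at TS(200) but lastApplied is only TS(150), so the
    # range must not be truncated yet.
    print(can_truncate(Marker(last_record=200), all_durable=250, last_applied=150))
    # Both all_durable and lastApplied have passed TS(100), so truncation is allowed.
    print(can_truncate(Marker(last_record=100), all_durable=250, last_applied=150))
```

Requiring both bounds means a range is held back whenever either all_durable or lastApplied has not yet reached the marker's last record.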
|
| Comment by Haley Connelly [ 13/Jun/23 ] |
|
To prevent inconsistent views of the data, the goal should be to ensure that a key within a range cannot be written after the range has been truncated. |
| Comment by Haley Connelly [ 13/Jun/23 ] |
|
suganthi.mani@mongodb.com Thanks so much for laying out this scenario! We discussed this in a bit more detail offline and I think we've landed on a solution. The scenario where TS(200) would be truncated before TS(100) is committed is very unlikely, but it becomes more likely as 'expireAfterSeconds' gets smaller. During truncate initialisation, we could use 'lastApplied' as a read source, and wait for it to become durable before performing our scan. This could ensure that out-of-order inserts during secondary oplog application do not impact initialisation. On top of this, I've filed |
| Comment by Suganthi Mani [ 12/Jun/23 ] |
|
haley.connelly@mongodb.com After speaking to you last Friday, I realized that not taking the PBWM lock during truncate marker initialization can cause change stream (CS) cursors to skip some entries (assuming that range truncation is used for both the Serverless change collection and the pre-image collection). Consider this scenario.
Note that we can have the same problem on a primary, even with the PBWM lock held, if the read snapshot is lastApplied (which is not a no-hole point). Consider this scenario.
max.hirschhorn@mongodb.com and I think the underlying problem in both cases is that we can have new writes with a recordID older than Marker.lastRecord. |