[SERVER-31924] OplogStones can capture record ids to truncate out of order Created: 10/Nov/17  Updated: 06/Dec/22  Resolved: 23/Jan/18

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Backlog - Storage Execution Team
Resolution: Duplicate Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Duplicate
duplicates SERVER-32533 In oplog truncate with WT, don't use ... Closed
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:
Linked BF Score: 0

 Description   

Capped collections in WiredTiger normally trigger deletes as inserts are performed. For performance reasons, the oplog truncates old documents in batches. This is done via a data structure known as OplogStones.

A background thread periodically checks the oplog size and may then choose to call reclaimOplog.

reclaimOplog issues one truncate call for each OplogStone popped from the front of _oplogStones. Each truncate deletes the records ranging from the previous stone's lastRecord (saved in _oplogStones->firstRecord) to the current stone's lastRecord.
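
The pop-and-truncate loop described above can be sketched as follows. This is a simplified model, not the server's code: `StoneQueue`, `Stone`, and `reclaimRanges` are hypothetical names, and the function only reports the start/stop bounds each truncate would receive rather than deleting anything.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>
#include <utility>
#include <vector>

// Hypothetical stand-ins for illustration; not the server's actual types.
using RecordId = std::int64_t;

struct Stone {
    RecordId lastRecord;  // highest record id covered by this stone
};

struct StoneQueue {
    RecordId firstRecord = 0;  // lastRecord of the most recently reclaimed stone
    std::deque<Stone> stones;  // oldest stone at the front
};

// Pops up to `count` stones and returns the (start, stop) bounds that each
// truncate call would be given, in the order the calls would be made.
std::vector<std::pair<RecordId, RecordId>> reclaimRanges(StoneQueue& q, std::size_t count) {
    std::vector<std::pair<RecordId, RecordId>> ranges;
    while (count > 0 && !q.stones.empty()) {
        const Stone stone = q.stones.front();
        q.stones.pop_front();
        ranges.emplace_back(q.firstRecord, stone.lastRecord);
        q.firstRecord = stone.lastRecord;  // the next truncate starts where this one stopped
        --count;
    }
    return ranges;
}
```

With stones whose lastRecord values are 10, 20, and 30 and firstRecord at 0, reclaiming two stones yields the bounds (0, 10) and (10, 20) and leaves firstRecord at 20.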

The invariant for this to work is that the lastRecord in consecutive stones must be increasing. As inserts to the oplog commit, their recordId will increase the lastRecord "if applicable".

Why "if applicable"? With document level locking storage engines, transactions can commit out of timestamp order. Since these RecordIds are the timestamp values in disguise, the OplogStone data structure has to deal with record ids arriving out of order.

The only flaw in that logic is that when !_stones.empty() evaluates to false (i.e., all the existing stones have been purged), a new stone is created unconditionally. That stone takes the recordId that just committed. Because there are no stones to compare against, the code never validates that the new stone will be consumed in a valid call to truncate.
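
A minimal model of that failure mode, assuming the stone-creation check has roughly the shape described in this ticket. `StoneModel`, `createStoneIfNeeded`, and `nextTruncateIsInverted` are hypothetical names used only for illustration, and each stone is reduced to its lastRecord.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

using RecordId = std::int64_t;  // hypothetical stand-in for the server's RecordId

struct StoneModel {
    RecordId firstRecord = 0;     // where the last truncate stopped
    std::deque<RecordId> stones;  // each entry is one stone's lastRecord

    // Out-of-order ids are filtered only when an existing stone is available
    // to compare against. With no stones, the id is accepted unconditionally.
    void createStoneIfNeeded(RecordId lastRecord) {
        if (!stones.empty() && lastRecord < stones.back()) {
            return;  // committed out of order; an existing stone already covers it
        }
        stones.push_back(lastRecord);  // never compared against firstRecord
    }

    // True when the next truncate would be asked to run from firstRecord
    // down to a smaller record id, i.e. "start > stop".
    bool nextTruncateIsInverted() const {
        return !stones.empty() && stones.front() < firstRecord;
    }
};
```

For example, if every stone up to record id 100 has been purged (firstRecord is 100) and a transaction with record id 95 commits late, the model creates a stone at 95 and the next truncate would run from 100 to 95.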

The corollary check (for demonstration, not the required solution) would be to compare against firstRecord:

if (_stones.empty() && lastRecord < firstRecord) {
  return;
}

This would protect the code from attempting a truncation where "start > stop".
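
To show the effect of that guard in context, here is the same kind of simplified model with the extra comparison added. As with the snippet above, this is for demonstration only; `GuardedStoneModel` and its members are hypothetical names, not the server's code.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

using RecordId = std::int64_t;  // hypothetical stand-in for the server's RecordId

struct GuardedStoneModel {
    RecordId firstRecord = 0;     // where the last truncate stopped
    std::deque<RecordId> stones;  // each entry is one stone's lastRecord

    void createStoneIfNeeded(RecordId lastRecord) {
        // Existing check: skip ids that arrive behind the newest stone.
        if (!stones.empty() && lastRecord < stones.back()) {
            return;
        }
        // Demonstration guard: with no stones left, skip ids that fall
        // behind the point the oplog has already been truncated to.
        if (stones.empty() && lastRecord < firstRecord) {
            return;
        }
        stones.push_back(lastRecord);
    }
};
```

With firstRecord at 100 and no stones, a late commit at 95 is ignored, while a commit at 101 still creates a stone.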

One idea is to never pass in the start cursor to WiredTiger. Always let WiredTiger handle positioning the cursor to the beginning of the oplog (oldest record) to start truncating from.
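
The intent of that idea can be illustrated with an ordered map standing in for the oplog. This sketch does not call WiredTiger; `Oplog` and `truncateFromOldest` are hypothetical names. For reference, WiredTiger's WT_SESSION::truncate is documented as starting from the beginning of the object when the start cursor is NULL, which is the behaviour the model imitates.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using RecordId = std::int64_t;                // hypothetical stand-in
using Oplog = std::map<RecordId, std::string>;  // ordered by record id, oldest first

// Removes every record with id <= stop, starting from whatever the oldest
// record currently is. No start bound is supplied by the caller, so there is
// no way to request a range where "start > stop".
std::size_t truncateFromOldest(Oplog& oplog, RecordId stop) {
    const auto end = oplog.upper_bound(stop);  // first record strictly after stop
    const auto removed = static_cast<std::size_t>(std::distance(oplog.begin(), end));
    oplog.erase(oplog.begin(), end);
    return removed;
}
```

In this model a stale stop value that lies behind the oldest remaining record simply removes nothing, so an out-of-order stone becomes a no-op instead of an invalid range.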



 Comments   
Comment by Michael Cahill (Inactive) [ 23/Jan/18 ]

This issue is addressed by the fix in SERVER-32533.

In particular, while we're still maintaining a firstRecord in oplogStones, we do not rely on it for correctness.

Comment by Daniel Gottlieb (Inactive) [ 10/Nov/17 ]

Note that this bug is difficult to hit. It was caught using the minimum oplog size configurable on the command line (1MB), under a highly concurrent test that I believe could roll over the oplog multiple times per second.
