[SERVER-31924] OplogStones can capture record ids to truncate out of order Created: 10/Nov/17 Updated: 06/Dec/22 Resolved: 23/Jan/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Storage |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Daniel Gottlieb (Inactive) | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | neweng |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: |
| Assigned Teams: | Storage Execution |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: |
| Linked BF Score: | 0 |
| Description |
|
Capped collections in WiredTiger normally trigger deletes as inserts are performed. For performance reasons, the oplog instead truncates old documents in batches, using a data structure known as OplogStones. A background thread periodically checks the oplog size and may then call reclaimOplog. reclaimOplog issues one truncate for each OplogStone popped from the front of _oplogStones. The truncate deletes records ranging from the previous stone's lastRecord (saved in _oplogStones->firstRecord) to the current stone's lastRecord. For this to work, the lastRecord values of consecutive stones must be increasing.

As inserts to the oplog commit, their recordId will increase the lastRecord "if applicable". Why "if applicable"? With document-level locking storage engines, transactions can commit out of timestamp order. Since these RecordIds are the timestamp values in disguise, the OplogStones data structure has to deal with record ids arriving out of order.

The one flaw in that logic is that if !_stones.empty() is false (i.e., all the existing stones have been purged), we unconditionally create a new stone whose lastRecord is the recordId that just committed. Because there are no stones to compare against, the code does not validate that the new stone will be consumed in a valid call to truncate. The corollary logic (for demonstration, not the required solution) would be to then also check the committed recordId against firstRecord.
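A minimal sketch of that failure mode and of a firstRecord check, written as a toy model rather than the server's actual code. ToyOplogStones, maybeCreateStone, reclaimOne and the checkFirstRecord flag are hypothetical names introduced only for illustration, and the toy creates one stone per accepted commit instead of filling stones by size. With stones at 10 and 20 reclaimed and a late commit of 15 arriving afterwards, the unguarded model hands truncate the range start=20, stop=15, while the guarded model creates no stone at all.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <deque>

// Stand-in for RecordId; oplog record ids are timestamps in disguise.
using RecordId = std::int64_t;

// Range handed to truncate: previous stone's lastRecord to this stone's lastRecord.
struct TruncateRange {
    RecordId start;
    RecordId stop;
};

inline bool isValid(const TruncateRange& r) { return r.start <= r.stop; }

// Hypothetical toy model; NOT the real OplogStones class.
class ToyOplogStones {
public:
    explicit ToyOplogStones(bool checkFirstRecord) : _checkFirstRecord(checkFirstRecord) {}

    // Called as an oplog insert commits; ids may arrive out of order.
    void maybeCreateStone(RecordId committed) {
        if (!_stones.empty() && committed < _stones.back()) {
            return;  // out-of-order id behind the newest stone: skip it
        }
        if (_checkFirstRecord && _stones.empty() && committed <= _firstRecord) {
            return;  // illustrative guard: id is at or behind what was already truncated
        }
        _stones.push_back(committed);
    }

    // Pops the oldest stone and reports the range a truncate would receive.
    // Returns false when there is no stone to reclaim.
    bool reclaimOne(TruncateRange& out) {
        if (_stones.empty()) {
            return false;
        }
        out = TruncateRange{_firstRecord, _stones.front()};
        _firstRecord = _stones.front();
        _stones.pop_front();
        return true;
    }

    std::size_t numStones() const { return _stones.size(); }

private:
    bool _checkFirstRecord;
    RecordId _firstRecord = 0;     // lastRecord of the most recently reclaimed stone
    std::deque<RecordId> _stones;  // each entry is one stone's lastRecord
};
```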
This would protect the code from attempting a truncation where "start > stop". Another idea is to never pass the start cursor to WiredTiger, and instead always let WiredTiger position the cursor at the beginning of the oplog (the oldest record) and truncate from there. |
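The "no start cursor" idea can be sketched with an ordered map standing in for the oplog. This is an assumption-laden illustration, not WiredTiger's API: ToyOplog and truncateThrough are hypothetical names, and std::map only imitates a store ordered by record id. Because the caller supplies only a stop position, a stale stop that is already behind the oldest record removes nothing instead of producing an inverted range.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <map>
#include <string>

using RecordId = std::int64_t;

// Hypothetical stand-in for the oplog: records ordered by record id.
using ToyOplog = std::map<RecordId, std::string>;

// Removes every record with id <= stop, always starting from the oldest record.
// No start position is accepted, so "start > stop" cannot be expressed.
// Returns the number of records removed.
std::size_t truncateThrough(ToyOplog& oplog, RecordId stop) {
    auto end = oplog.upper_bound(stop);  // first record with id > stop
    auto removed = static_cast<std::size_t>(std::distance(oplog.begin(), end));
    oplog.erase(oplog.begin(), end);
    return removed;
}
```

For example, on records 10, 15, 20 and 30, truncating through 20 removes three records, and a later call with the stale stop 15 is a no-op.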
| Comments |
| Comment by Michael Cahill (Inactive) [ 23/Jan/18 ] |
|
This issue is addressed by the fix in the issue this one duplicates. In particular, while we're still maintaining a firstRecord in oplogStones, we do not rely on it for correctness. |
| Comment by Daniel Gottlieb (Inactive) [ 10/Nov/17 ] |
|
Note that this bug is difficult to hit. It was caught using the minimum oplog size configurable on the command line (1MB), under a highly concurrent test that I believe could roll over the oplog multiple times per second. |