[SERVER-35435] Renaming during replication recovery incorrectly allows size adjustments Created: 06/Jun/18 Updated: 29/Oct/23 Resolved: 12/Jun/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Storage |
| Affects Version/s: | None |
| Fix Version/s: | 4.0.0-rc6, 4.1.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Daniel Gottlieb (Inactive) | Assignee: | Judah Schvimer |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: | |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.0 |
| Sprint: | Repl 2018-06-18 |
| Participants: | |
| Description |
|
Keeping correct counts with the WiredTigerSizeStorer is complex, error-prone, and seemingly impossible, particularly now that the move to checkpointing only stable data collides with crashes, rollback, and deletions due to capped collections. Thanks to geert.bosch's recent dive into other size-storer-related problems, I am proud to announce that collection renames (within the same database) are added to the list of operations requiring careful handling of minutiae to maintain correct-er counts.

One common scenario the WiredTiger integration layer attempts to keep correct is coming back online after a clean shutdown at an arbitrary stable timestamp. The invariant for (non-empty) collections is that the size storer table contains the correct size after replication recovery replays from the stable timestamp to the top of the oplog (where the node left off when shutting down). To maintain this, the code refrains from updating counts while in replication recovery (among some other conditions).

One exception to this rule is when a collection is created during replication recovery. This exception is unfortunately necessary because the WTSizeStorer maps "idents" to counts: when a collection is recreated during replication recovery, a new ident is chosen (the previous one is lost to the void). Because the previous mapping, albeit correct, is lost, the code must count the incoming inserts to be correct.

The intersection of these behaviors with renameCollection's behavior of creating a new record store object (referencing the same underlying table) tricks the WTRecordStore constructor into allowing size adjustments during replication recovery on the same underlying ident. Thus a sequence involving a rename from A -> B can manifest as an incorrect count.
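A minimal sketch of the flaw, assuming invented names (the real gating lives in the WiredTigerRecordStore constructor and the WiredTigerSizeStorer): if the "size adjustments allowed" decision is made per record store *object* at construction time, the fresh object that renameCollection builds over a pre-existing ident looks indistinguishable from a collection created during recovery.

```cpp
#include <map>
#include <string>
#include <utility>

// Per-ident entry in the size storer table (hypothetical shape).
struct SizeStorerEntry {
    long long numRecords = 0;
    long long dataSize = 0;
};

class RecordStoreSketch {
public:
    RecordStoreSketch(std::string ident,
                      bool inReplicationRecovery,
                      bool identCreatedDuringThisRecovery)
        : _ident(std::move(ident)),
          // The flaw: renameCollection constructs a brand-new record store
          // over the *same* underlying ident. If this flag effectively asks
          // "was this object just created during recovery?" rather than
          // "was this ident created during recovery?", the post-rename store
          // wrongly allows size adjustments while recovery is still
          // replaying oplog entries whose effects are already counted.
          _sizeAdjustmentsAllowed(!inReplicationRecovery ||
                                  identCreatedDuringThisRecovery) {}

    void onInsert(long long bytes,
                  std::map<std::string, SizeStorerEntry>& sizeStorer) {
        if (!_sizeAdjustmentsAllowed)
            return;  // replayed write; the stored count is already correct
        SizeStorerEntry& entry = sizeStorer[_ident];
        entry.numRecords += 1;
        entry.dataSize += bytes;
    }

private:
    std::string _ident;
    bool _sizeAdjustmentsAllowed;
};
```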
The attached data files, when brought up as a replica set (on localhost:27017), will demonstrate count() != itcount(). Note that replication recovery replaying a sequence that creates A and then renames A -> B must still allow size adjustments on B, as if the permission is being "inherited" from A. |
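A hedged sketch of the "inheritance" this implies, again with hypothetical names: because a rename within the same database keeps the underlying ident, keying the permission on the ident rather than on the record store object makes B pick up exactly the permission (or prohibition) that A's ident already had.

```cpp
#include <set>
#include <string>

// Hypothetical registry keyed by ident rather than by record store object.
class SizeAdjustmentRegistry {
public:
    // Called when replication recovery creates a collection and assigns it
    // a fresh ident; inserts replayed into it must be counted, because the
    // old ident's (correct) size storer mapping is gone.
    void markCreatedDuringRecovery(const std::string& ident) {
        _allowedIdents.insert(ident);
    }

    // A rename within the same database reuses the ident, so the permission
    // (or lack of it) follows the data automatically; there is nothing to
    // transfer when renameCollection constructs a new record store object.
    bool adjustmentsAllowed(const std::string& ident,
                            bool inReplicationRecovery) const {
        return !inReplicationRecovery || _allowedIdents.count(ident) > 0;
    }

private:
    std::set<std::string> _allowedIdents;
};
```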
| Comments |
| Comment by Githook User [ 12/Jun/18 ] |

Author: {'username': 'judahschvimer', 'name': 'Judah Schvimer', 'email': 'judah@mongodb.com'}
Message: (cherry picked from commit 8b698cac2d19f0fec502db10501e7059a10d2897) |
| Comment by Githook User [ 12/Jun/18 ] |

Author: {'username': 'judahschvimer', 'name': 'Judah Schvimer', 'email': 'judah@mongodb.com'}
Message: |
| Comment by Judah Schvimer [ 06/Jun/18 ] |
| Comment by Daniel Gottlieb (Inactive) [ 06/Jun/18 ] |
|
Assigning to judah.schvimer (and I put it in the current sprint; apologies if that's inappropriate), as he's helping manage some other tickets describing different ways counts can go wrong (mostly with capped collections). The solutions are not obviously independent, so it makes sense to avoid concurrent development in this area. |