Keeping correct counts with the WiredTigerSizeStorer is complex, error-prone and seemingly impossible, particularly now that the move to checkpointing only stable data collides with crashes, rollback and deletions due to capped collections. Thanks to geert.bosch's recent dive into other size storer related problems, I am proud to announce that collection renames (within the same database) are now added to the list of operations that require careful handling of minutiae to maintain correct-er counts.
One common scenario the WiredTiger integration layer attempts to keep correct is coming back online after a clean shutdown at an arbitrary stable timestamp. For (non-empty) collections, the size storer table contains the size that is correct once replication recovery has replayed from the stable timestamp to the top of the oplog (where the node left off when shutting down).
One exception to this rule is a collection that is created during replication recovery. This exception is unfortunately necessary because the WTSizeStorer maps "idents" to counts. When a collection is recreated during replication recovery, a new ident is chosen (the previous one is lost to the void). Because the previous mapping, albeit correct, is lost, the code must count incoming inserts for the count to be correct.
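To make the ident problem concrete, here is a minimal toy model in Python. It is not the server's code: `SizeStorer`, `new_ident` and the ident format are all hypothetical stand-ins, and the only thing it illustrates is that a recreated collection gets a fresh ident with no size storer entry.

```python
import itertools


class SizeStorer:
    """Toy stand-in for the size storer: maps an ident to a document count."""

    def __init__(self):
        self.counts = {}

    def load(self, ident):
        # An ident with no entry starts from zero.
        return self.counts.get(ident, 0)

    def store(self, ident, count):
        self.counts[ident] = count


_ident_seq = itertools.count(1)


def new_ident(name):
    # Hypothetical ident scheme: every create gets a fresh, never-reused ident.
    return f"collection-{next(_ident_seq)}-{name}"


storer = SizeStorer()

# Before shutdown: collection A is created and ends up with 2 documents.
old_ident = new_ident("A")
storer.store(old_ident, 2)

# Recovery starts from a stable timestamp before the create and replays it.
# The replayed create picks a new ident, so the old (correct) entry is unreachable.
recreated_ident = new_ident("A")
count_for_recreated = storer.load(recreated_ident)
```

In this model the recreated collection starts at 0 even though the old entry still says 2, which is why inserts replayed into it have to be counted.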
The intersection of these behaviors with renameCollection's behavior of creating a new record store object (referencing the same underlying table) will juke the WTRecordStore constructor into allowing size adjustments during replication recovery on the same underlying ident.
Thus, a sequence involving a rename from A -> B can manifest as an incorrect count:
- At shutdown collection B has 2 documents and a correct count of 2.
- At the stable timestamp, collection A exists with 1 document and a count of 2 (the size storer already holds the count as of shutdown).
- Replication recovery plays a rename from A -> B. This marks the collection for size adjustment.
- Replication recovery inserts a second document into B. This increases the count from 2 -> 3.
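The sequence above can be replayed against a toy model. This is only a sketch: the `RecordStore` class, its constructor arguments and the `created_in_recovery` flag are hypothetical names chosen for illustration, and the real constructor's decision logic may differ. The model assumes size adjustments during recovery are allowed only for record stores that look newly created, and that a rename constructs a new record store object over the same ident and table.

```python
class RecordStore:
    """Toy record store: a handle over a shared table plus a size storer entry."""

    def __init__(self, ident, table, size_storer, in_recovery, created_in_recovery):
        self.ident = ident
        self.table = table              # underlying documents, shared across handles
        self.size_storer = size_storer  # dict: ident -> count
        # Assumed rule: during recovery, only freshly created stores adjust sizes.
        self.adjust_sizes = (not in_recovery) or created_in_recovery

    def insert(self, doc):
        self.table.append(doc)
        if self.adjust_sizes:
            self.size_storer[self.ident] += 1

    def count(self):
        # Fast count, read from the size storer.
        return self.size_storer[self.ident]

    def itcount(self):
        # True count, obtained by iterating the table.
        return len(self.table)


# State at the stable timestamp: 1 document, but the size storer already says 2.
size_storer = {"ident-1": 2}
table = ["doc1"]
a = RecordStore("ident-1", table, size_storer,
                in_recovery=True, created_in_recovery=False)

# Recovery replays the rename A -> B: a new object over the same ident and
# table, which in this model is indistinguishable from a fresh create.
b = RecordStore(a.ident, a.table, size_storer,
                in_recovery=True, created_in_recovery=True)

# Recovery replays the insert of the second document into B.
b.insert("doc2")
count, itcount = b.count(), b.itcount()
```

Here `count` ends up at 3 while `itcount` is 2, matching the 2 -> 3 step in the list.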
The attached data files, when brought up as a replica set (on localhost:27017), will demonstrate that count() != itcount().
Note that replication recovery replaying a sequence of:
- create collection A
- rename A -> B
must allow size adjustments on B, as if the setting were being "inherited" from A.
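One way to express that requirement, sketched on a toy model: the rename copies the size-adjustment decision from the source record store instead of recomputing it. The `RecordStore` class and the `rename` helper are hypothetical, and this is not necessarily how a fix would be implemented in the server.

```python
class RecordStore:
    """Toy record store: a handle over a shared table plus a size storer entry."""

    def __init__(self, ident, table, size_storer, adjust_sizes):
        self.ident = ident
        self.table = table
        self.size_storer = size_storer  # dict: ident -> count
        self.adjust_sizes = adjust_sizes
        size_storer.setdefault(ident, 0)

    def insert(self, doc):
        self.table.append(doc)
        if self.adjust_sizes:
            self.size_storer[self.ident] += 1

    def count(self):
        return self.size_storer[self.ident]

    def itcount(self):
        return len(self.table)


def rename(source):
    # Hypothetical rename: new handle, same ident and table, and the
    # size-adjustment decision is inherited from the source.
    return RecordStore(source.ident, source.table, source.size_storer,
                       source.adjust_sizes)


# Case 1: A existed at the stable timestamp (1 document, stored count of 2).
storer = {"ident-1": 2}
a = RecordStore("ident-1", ["doc1"], storer, adjust_sizes=False)
b = rename(a)
b.insert("doc2")
existing_result = (b.count(), b.itcount())

# Case 2: A is created during recovery under a new ident, then renamed.
storer2 = {}
a2 = RecordStore("ident-2", [], storer2, adjust_sizes=True)
a2.insert("doc1")
b2 = rename(a2)
b2.insert("doc2")
created_result = (b2.count(), b2.itcount())
```

In both cases the stored count and the iterated count agree at 2: the pre-existing collection keeps its adjustments disabled across the rename, and the collection created during recovery keeps them enabled.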