Counts in the WTSizeStorer table are not adjusted in the same transaction that performs an insert/update, because that would become a serialization point for concurrent inserts/deletes and would result in expensive WriteConflictExceptions (WCEs).
Instead, an atomic counter is maintained and flushed periodically to the SizeStorer. Among other times, this data is flushed on clean shutdown. These writes are not timestamped (doing so would be difficult), so what is put on disk is the counts as of "now", which are unlikely to match the counts as of the stable timestamp. At startup, when replication plays the oplog forward during recovery, an insert that was already accounted for in the SizeStorer's view of the data will be counted again.
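The double count described above can be reproduced with a small standalone sketch. Everything in it is hypothetical and only loosely modeled on the description: `SizeTableSketch`, `CountedStoreSketch` and `countAfterNaiveReplay` are illustrative names, not the real WTSizeStorer or record store API.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <utility>

// Hypothetical stand-in for the on-disk size table (not the real WTSizeStorer).
struct SizeTableSketch {
    std::map<std::string, std::int64_t> persistedCounts;
};

// Hypothetical record store: counts live in an atomic and are flushed lazily.
class CountedStoreSketch {
public:
    CountedStoreSketch(std::string ident, std::int64_t initialCount)
        : _ident(std::move(ident)), _numRecords(initialCount) {
        assert(initialCount >= 0);  // a persisted count is never negative
    }

    // Called for every insert/delete. Only the atomic is touched, so
    // concurrent writers never conflict on a shared size-table row.
    void changeNumRecords(std::int64_t diff) {
        _numRecords.fetch_add(diff, std::memory_order_relaxed);
    }

    // Called periodically and on clean shutdown. The write is untimestamped,
    // so it records the count as of "now", not as of the stable timestamp.
    void flush(SizeTableSketch& table) const {
        table.persistedCounts[_ident] = numRecords();
    }

    std::int64_t numRecords() const {
        return _numRecords.load(std::memory_order_relaxed);
    }

private:
    std::string _ident;
    std::atomic<std::int64_t> _numRecords;
};

// Simulates inserts, a clean shutdown, a restart, and a replay of the oplog
// entries that lie after the stable timestamp, with no recovery-mode guard.
inline std::int64_t countAfterNaiveReplay(std::int64_t insertsBeforeStable,
                                          std::int64_t insertsAfterStable) {
    SizeTableSketch table;
    CountedStoreSketch beforeShutdown("ident-a", 0);
    beforeShutdown.changeNumRecords(insertsBeforeStable + insertsAfterStable);
    beforeShutdown.flush(table);  // persists the count for "now"

    // On restart the store is seeded from the size table, then recovery
    // replays the post-stable inserts and counts them a second time.
    CountedStoreSketch afterRestart("ident-a", table.persistedCounts["ident-a"]);
    afterRestart.changeNumRecords(insertsAfterStable);
    return afterRestart.numRecords();
}
```

With 3 inserts before the stable timestamp and 2 after it, the collection holds 5 documents, but `countAfterNaiveReplay(3, 2)` returns 7.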
The proposed fix is to trust the WTSizeStorer to have the proper counts for collections after recovery has been played. Specifically:
- Introduce state representing that the server (or operation context) is in "recovery mode", so that updates to `_changeNumRecords`/`_increaseDataSize` can be skipped while it is set.
- The first step, however, breaks a special case: when the creation of a collection wasn't included in a stable timestamp, the collection gets recreated during recovery with an `ident` that's different from the one used at shutdown.
- Introduce more state: the set of collections created during recovery.
- Allow updates to `_changeNumRecords`/`_increaseDataSize` if the collection being updated is in this set.
- Another special case comes up when the collection exists in the stable checkpoint, but none of the writes made it into the stable checkpoint. When a collection is empty, as deemed by a cursor "findOne", the record store setup assumes its count should be zero. This is correct in a non-RTT (recover to a timestamp) world, but would violate the expectation that the WTSizeStorer is the authority on counts. This code would also need to be adjusted.
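The steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the actual patch: `RecoveryStateSketch`, `GuardedStoreSketch` and `initialNumRecords` are hypothetical names, and the rule in `initialNumRecords` is only one possible way to adjust the empty-collection check, which the description leaves open.

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <string>
#include <utility>

// Hypothetical recovery state: a mode flag plus the idents of collections
// that were created while recovery was running.
struct RecoveryStateSketch {
    bool inRecovery = false;
    std::set<std::string> createdDuringRecovery;
};

// Hypothetical record store whose counters start from the persisted values.
class GuardedStoreSketch {
public:
    GuardedStoreSketch(std::string ident, const RecoveryStateSketch& state,
                       std::int64_t persistedCount, std::int64_t persistedSize)
        : _ident(std::move(ident)), _state(state),
          _numRecords(persistedCount), _dataSize(persistedSize) {
        assert(persistedCount >= 0 && persistedSize >= 0);
    }

    void changeNumRecords(std::int64_t diff) {
        if (_shouldApply()) _numRecords += diff;
    }

    void increaseDataSize(std::int64_t amount) {
        if (_shouldApply()) _dataSize += amount;
    }

    std::int64_t numRecords() const { return _numRecords; }
    std::int64_t dataSize() const { return _dataSize; }

private:
    // Outside recovery every update applies. During recovery the persisted
    // values are trusted, except for collections recreated under a new ident,
    // which have no persisted values to trust.
    bool _shouldApply() const {
        return !_state.inRecovery ||
               _state.createdDuringRecovery.count(_ident) > 0;
    }

    std::string _ident;
    const RecoveryStateSketch& _state;
    std::int64_t _numRecords;
    std::int64_t _dataSize;
};

// Assumed adjustment for the empty-collection case: only reset the count to
// zero for a table that looks empty when no oplog replay will repopulate it.
inline std::int64_t initialNumRecords(bool tableLooksEmpty,
                                      std::int64_t persistedCount,
                                      bool recoveringToTimestamp) {
    if (tableLooksEmpty && !recoveringToTimestamp) return 0;
    return persistedCount;
}
```

In this sketch, a collection with a persisted count of 5 keeps that count while recovery replays inserts into it, whereas a collection whose ident is in `createdDuringRecovery` starts from zero and counts every replayed insert. The counters are plain integers rather than atomics to keep the example short.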
These changes would keep WT counts accurate on clean shutdown, but not on rollback.
SERVER-33493 is tracking changes for that purpose.