[SERVER-63233] Incorrect fastcount after FCBIS Created: 02/Feb/22  Updated: 29/Oct/23  Resolved: 02/Mar/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 6.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Lingzhi Deng Assignee: Matthew Russotto
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Gantt Dependency
has to be done after SERVER-63908 Fix and test partial index build hand... Closed
Related
related to DOCS-15142 Ensure we document that fastcount may... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.3
Sprint: Repl 2022-02-21, Repl 2022-03-07
Participants:

 Description   

This was discovered in the new test (fcbis_clear_appliedThrough_after_recovery.js) I added with "Extend Backup Cursor" for FCBIS as part of SERVER-62745. From my discussion with Matthew: We get the size storer from the initial snapshot, then we disable size adjustment during oplog replay. But for oplog entries added after the initial snapshot's recovery optime, we probably DO want size adjustment. I'm surprised we haven't seen this before; we might be missing test coverage for extend.
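For illustration only, a minimal sketch of that distinction (all types and names here are hypothetical stand-ins, not the actual server code):

{code:cpp}
#include <cstdint>

// Simplified stand-ins for the real types.
struct Timestamp {
    uint64_t t;
    bool operator>(const Timestamp& o) const { return t > o.t; }
};
struct OplogEntry {
    Timestamp ts;
};

// Stub for the real oplog application path.
void applyOperation(const OplogEntry&, bool adjustFastCountAndDataSize) {}

// The size storer copied from the initial snapshot already reflects writes up
// to the snapshot's recovery optime, so replaying those entries must not
// adjust sizes again. Entries captured after that optime (via "Extend Backup
// Cursor") are NOT reflected in the copied size storer and DO need adjustment.
void applyDuringFcbisReplay(const OplogEntry& entry,
                            const Timestamp& snapshotRecoveryOpTime) {
    const bool adjustSizes = entry.ts > snapshotRecoveryOpTime;
    applyOperation(entry, adjustSizes);
}
{code}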



 Comments   
Comment by Githook User [ 01/Mar/22 ]

Author:

{'name': 'Matthew Russotto', 'email': 'matthew.russotto@mongodb.com', 'username': 'mtrussotto'}

Message: SERVER-63233 Do oplog recovery of data after the checkpointTimestamp (or extended timestamp) on the syncing node. This will result in fast count being correct nearly all the time.
Branch: master
https://github.com/10gen/mongo-enterprise-modules/commit/ec2275bfdf120b5e457e8be078a6bd173c5432c5
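A hedged sketch of what that commit message describes (names hypothetical, not the real implementation): the syncing node starts oplog recovery just after the timestamp its copied files are consistent at, so the later entries go through normal, size-adjusting application.

{code:cpp}
#include <algorithm>
#include <cstdint>

// The copied data files are consistent at the checkpoint timestamp, or at the
// extended timestamp if the backup cursor was extended; recovery replays
// everything after that point with fast count adjustment enabled.
uint64_t recoveryStartTimestamp(uint64_t checkpointTimestamp,
                                uint64_t extendedTimestamp) {
    return std::max(checkpointTimestamp, extendedTimestamp);
}
{code}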

Comment by Matthew Russotto [ 09/Feb/22 ]

The cause of this issue (fortunately) isn't getting fastcount wrong for extend. Basically, it is because FCBIS is like an unclean shutdown, and we know unclean shutdown can corrupt fastcounts. Specifically, any truncated oplog record may (but does not necessarily) have had its fastcount update already applied, and we then re-do that update during recovery. This test happens to force a condition where this occurs every time: we hang waiting for a particular optime (the one we truncate to) to become stable, then force a later stable checkpoint (which flushes fastcount) to happen before doing the backup.
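A toy numeric illustration of that double count (purely illustrative, not server code):

{code:cpp}
#include <cassert>

int main() {
    // The stable checkpoint copied by FCBIS already flushed fastCount = 6:
    // 5 older documents plus one insert whose oplog entry lies past the
    // point the oplog is truncated back to.
    int fastCount = 6;
    const int actualDocs = 6;

    // FCBIS behaves like an unclean shutdown: the oplog is truncated, then
    // recovery re-applies the insert. The document write itself is
    // idempotent, but the fast count adjustment is re-done:
    fastCount += 1;

    assert(fastCount == actualDocs + 1);  // off by one
    return 0;
}
{code}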
