[SERVER-84121] Backup cursor service reports incorrect 'ns' field in the backup cursor response. Created: 12/Dec/23  Updated: 25/Jan/24  Resolved: 25/Jan/24

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 8.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Wei Hu
Resolution: Fixed Votes: 0
Labels: storex-shortlist
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Assigned Teams:
Storage Execution
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Execution Team 2024-01-08, Execution Team 2024-01-22, Execution Team 2024-02-05
Participants:
Linked BF Score: 6

 Description   

This is a bug in the MongoDB code, particularly this part of the code. We don't use this mdb_catalog checkpoint cursor that was opened prior to the backup cursor open; instead, we use a different mdb_catalog checkpoint cursor (getParsedCatalogEntry() opens a new checkpoint cursor), opened after the backup cursor open, to fill in the 'ns' field in the backup cursor response. This means that if a checkpoint occurs after this BackupCursorOpenConflictWithCheckpoint check, the second checkpoint cursor may be reading a different snapshot than the backup cursor, which can result in incorrect 'ns' information in the backup cursor response.

As a result, selective_backup_restore_e2e.js inadvertently skips copying files that are actually part of the backup snapshot, and the restore node crashes due to the missing files.
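The failure mode described above can be illustrated with a small toy model. This is only a sketch: ToyCatalog, backup_response, selective_copy, the ident "collection-7" and the namespace "test.coll" are all hypothetical names invented for illustration, not server code. It assumes one possible trigger, a collection dropped and checkpointed between the two cursor opens.

```python
from dataclasses import dataclass, field


@dataclass
class ToyCatalog:
    """Hypothetical stand-in for a checkpointed catalog (ident -> namespace)."""
    live: dict = field(default_factory=dict)
    checkpoints: list = field(default_factory=list)

    def checkpoint(self) -> None:
        # Freeze a copy of the live catalog as the newest checkpoint.
        self.checkpoints.append(dict(self.live))

    def open_checkpoint_cursor(self) -> dict:
        # A cursor sees whichever checkpoint is newest at the time it is opened.
        return self.checkpoints[-1]


def backup_response(files, catalog_view):
    """Pair each backed-up file with its 'ns'; unknown idents get an empty ns."""
    return [(f, catalog_view.get(f, "")) for f in files]


def selective_copy(response, wanted_ns):
    """Copy only files whose reported 'ns' was selected for backup."""
    return [f for f, ns in response if ns in wanted_ns]


catalog = ToyCatalog()
catalog.live["collection-7"] = "test.coll"
catalog.checkpoint()

early_view = catalog.open_checkpoint_cursor()  # opened before the backup cursor
backup_files = sorted(early_view)              # backup cursor pins the same checkpoint

# Assumed trigger: the collection is dropped and a new checkpoint is taken.
del catalog.live["collection-7"]
catalog.checkpoint()
late_view = catalog.open_checkpoint_cursor()   # opened after the backup cursor

buggy = backup_response(backup_files, late_view)   # 'ns' from the wrong snapshot
fixed = backup_response(backup_files, early_view)  # 'ns' from the backup's snapshot

skipped = selective_copy(buggy, {"test.coll"})
copied = selective_copy(fixed, {"test.coll"})
```

In this model the late cursor yields an empty 'ns' for a file that is still part of the backup, so a selective copy driven by 'ns' leaves it out, while reusing the early cursor keeps the file.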

=================
Some code improvements were identified during the BF investigation.

1) "BackupCursorOpenConflictWithCheckpoint" check currently performs two checks, namely "LastStableRecoveryTimestamp" and "checkpoint id." It would be beneficial to either enhance the error message or add a debug log message to specify which check failed and provide details about the mismatched checkpoint and recovery timestamp.

2) Use ReadSourceScope RAII instead of explicitly setting the TimestampReadSource here and here for better readability. Also, the RAII type makes sure we abandon the snapshot before explicitly setting the TimestampReadSource in the recovery unit.
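Both suggestions can be sketched in a few lines. The following is a toy illustration in Python, not the server's C++ code: checkpoint_conflict_reason, ToyRecoveryUnit and read_source_scope are hypothetical names, the read source strings are placeholder labels, and the context manager only approximates RAII, assuming the scope abandons the snapshot on entry and restores the previous read source on exit.

```python
from contextlib import contextmanager
from typing import Optional


def checkpoint_conflict_reason(stable_before: int, stable_after: int,
                               ckpt_id_before: int, ckpt_id_after: int) -> Optional[str]:
    """Return None if nothing changed, else a message naming each failed check."""
    reasons = []
    if stable_before != stable_after:
        reasons.append(
            f"last stable recovery timestamp changed from {stable_before} to {stable_after}")
    if ckpt_id_before != ckpt_id_after:
        reasons.append(f"checkpoint id changed from {ckpt_id_before} to {ckpt_id_after}")
    if not reasons:
        return None
    return "BackupCursorOpenConflictWithCheckpoint: " + "; ".join(reasons)


class ToyRecoveryUnit:
    """Hypothetical stand-in for a recovery unit; not the server's class."""

    def __init__(self) -> None:
        self.read_source = "kNoTimestamp"  # placeholder label
        self.snapshot_open = False
        self.log = []

    def abandon_snapshot(self) -> None:
        self.snapshot_open = False
        self.log.append("abandon")

    def set_read_source(self, source: str) -> None:
        # Toy invariant: the read source may not change while a snapshot is open.
        if self.snapshot_open:
            raise RuntimeError("read source changed with an open snapshot")
        self.read_source = source
        self.log.append(f"set:{source}")


@contextmanager
def read_source_scope(ru: ToyRecoveryUnit, source: str):
    """Abandon the snapshot, switch read source, and undo both on exit."""
    original = ru.read_source
    ru.abandon_snapshot()
    ru.set_read_source(source)
    try:
        yield ru
    finally:
        ru.abandon_snapshot()
        ru.set_read_source(original)


message = checkpoint_conflict_reason(10, 10, 3, 4)

ru = ToyRecoveryUnit()
ru.snapshot_open = True  # a snapshot is already open when the scope is entered
with read_source_scope(ru, "kCheckpoint"):
    source_inside = ru.read_source
    ru.snapshot_open = True  # reading inside the scope opens a new snapshot
```

Here the message names only the check that actually failed, and the scope guarantees the abandon-then-set ordering on both entry and exit, which is the ordering a hand-written pair of setter calls can easily get wrong.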



 Comments   
Comment by Githook User [ 25/Jan/24 ]

Author:

{'name': 'Wei Hu', 'email': 'wei.hu@mongodb.com', 'username': 'wh5a'}

Message: SERVER-84121 BackupCursorService should parse catalog entry from the already opened cursor

GitOrigin-RevId: 11295651dbc664d562a03ab98295af48a14535f0
Branch: master
https://github.com/mongodb/mongo/commit/a21933c71366d3822967cbe82fcd7eddd2bb29dc

Comment by Suganthi Mani [ 16/Jan/24 ]

Copy-paste of my Slack message:

In production, encountering the incorrect empty value is easy and common. Obtaining an incorrect non-empty value is technically possible but highly unlikely, because it requires a random ID collision in this function, and the chance of that is extremely slim.

The problematic 'ns' field in the backup cursor response is currently only used by the backup team. Atlas dedicated/Serverless backup (CPS manager) does not use this field, but Cloud Manager and Ops Manager (on-prem customers - mongod/agent deployed in customer's datacenter) use this for selective backup. It's important to note that Cloud Manager and Ops Manager use yearly release cycles, meaning that even if the issue is addressed in version 7.3, they would have to wait until version 8.0 for the release.
