[SERVER-62257] The 'valid' flag response for the validate cmd w/ repair can be inaccurate in some cases when a duplicate record is deleted Created: 23/Dec/21  Updated: 02/May/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Shin Yee Tan Assignee: Backlog - Storage Execution Team
Resolution: Unresolved Votes: 0
Labels: techdebt
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Storage Execution
Sprint: Execution Team 2023-05-01
Participants:

 Description   

The validate command returns incorrect 'valid'==true/false responses when a duplicate record is deleted. Deleting a duplicate record implicitly deletes index entries and causes validate state's index entry counts to become incorrect. The 'valid' response can be set to false when checks are made that the number of index entries and the number of record entries make sense for a given index type.

There are a few additional count tracking inaccuracies in the code that were fixed in the PR in the comments. But identifying which indexes are implicitly affected by a record deletion is a tricky problem and needs significant refactoring in code with significant techdebt. The PR fixes alone would just shift the cases around when 'valid' is incorrectly true/false and do not fully address the bug.

 

When missing index entries are identified as duplicate documents in validation repair mode, the duplicate document is deleted from collection and moved to a local lost and found. deleteDocument will call _unindexKeys in index_catalog_impl to remove the record being deleted from the indexes it is in. When a duplicate document is missing from an index, we want to ensure that the matching index key of the original document is not unindexed. The index the duplicate document is missing from should be unchanged when the duplicate is deleted from collection. This part will be done in SERVER-50081.

For validation repair mode, we want to know what indexes the record was removed from and update the respective index key num counts accordingly.



 Comments   
Comment by Dianna Hohensee (Inactive) [ 28/Apr/23 ]

Flagging for rescheduling after spending a few days on it because addressing the problem would require significant refactoring in techdebt heavy code.

Comment by Dianna Hohensee (Inactive) [ 28/Apr/23 ]

I've put up a PR with some counter fixes and comments thereof. However, they aren't sufficient to fully correct the validate command index entry counters that ultimately effect whether validate returns 'valid'==false/true.

At a high level, when validate repairs a missing index entry, inserting the index key can cause a duplicate key error if the record store has two documents that violate an unique index. Validate then deletes the record, implicitly deleting ALL the index entries associated in OTHER indexes. However, validate only tracks a change in number of index entries for that one index, validate does not update the counts for other affected indexes. Sometimes those other affected indexes also run logic for a missing index entry, and with the current code in master that implicitly fixes the index entry that got previously implicitly deleted and bumps the numKeys count up for that index (when it shouldn't, the numKeys should have been decremented then incremented, net zero). Sometimes the other affected indexes don't have a missing index entry associated with the record that was deleted, and the index numKeys remains +1 too high. This last one is the one I have not figured out how to fix easily. We either would need to track/group index entries present and missing for each RecordId that has a problem earlier in the process; or communicate back up the stack what index entries were deleted implicitly when the record was deleted – probably not this solution because we call unindex() on every index in the collection, leaving it to low level code to find out whether the index entry exists for a given recordId.

This is the repair code to add missingIndexEntries, across all indexes
First that function tries to insert the index keys
Then if that encounters a duplicate key error, the collection record is deleted / moved to lost and found
Lastly the index keys for that specific index is re-added after deleting the record implicitly deletes the all the index entries for the record.

 

Comment by Dianna Hohensee (Inactive) [ 27/Apr/23 ]

Note: I think this is also broken. We call that function in a loop through all the indexes in sequence, to check whether the collection is valid or not after repair.

By virtue of index iteration ordering probably being the same across missingIndexEntry corrections and validateIndexKeyCount calls, the numRecords (collection entries) gets adjusted globally across calls. If I fix that to be case by case – note IndexValidateResults is per index – then the error goes away that this ticket is trying to fix – valid is true. But at the same time the test still thinks that there are 2 index entries per numKeys, which is not accurate. They're just both equally inaccurate now.

Comment by Dianna Hohensee (Inactive) [ 26/Apr/23 ]

I think the best approach might be to do something unusual.

I'm think of decorating the opCtx with a small data structure that the validate command can fill with a list of index names, and then way down through the IndexCatalog::unindexRecord to each index::_unindex method, there can be a check whether the decoration pointer is set and update whether an entry was deleted.

We can track numKeysDeleted in the IndexCatalog::unindexRecord interface, but that doesn't tell us which indexes had an entry deleted. Actually, it looks like there's a numKeysDeleted all the way down into each index::unindex method, so maybe it could be collected higher up.

Comment by Dianna Hohensee (Inactive) [ 24/Apr/23 ]

Shin Yee did some code spelunking to help me sort out what bug is remaining in this ticket.

The problem is that validate in repair mode calls a function to repair an index entry and then correctly increments validate's numKeys count for that specific index. However, the validate code does not increment the numKeys count for any other index. Following the code that repairs an index entry, the repair function may decide to move the record to lost and found, which can ultimately delete the document, implicitly deleting other index entries, causing validate's numKeys counts for any other index on the collection to become incorrect.

Comment by Yuhong Zhang [ 28/Dec/21 ]

Currently, when unindexing a key from the id index, we will blindly remove a matching entry without looking at the recordID. As a result, when we have two identical documents with the same _id field but only one index entry in the id index, we will remove it during the validation repair.

Generated at Thu Feb 08 05:54:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.