[SERVER-60597] EFT index cursors may skip entries after deletions Created: 11/Oct/21  Updated: 27/Oct/23  Resolved: 27/Apr/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Ian Boros Assignee: Backlog - Storage Execution Team
Resolution: Gone away Votes: 0
Labels: EFT
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Storage Execution
Participants:

 Description   

It is possible for an EFT index cursor to "skip" over an entry when the prior entry was deleted in the same recovery unit. This is the sequence of events:

  1. An index cursor is positioned on a key.
  2. Under the same recovery unit (and the same snapshot), that key is deleted.
  3. save() is called on the EFT index cursor.
    As part of save(), we dereference the radix_iterator, which has some magic logic to reposition itself if the key it is sitting on was deleted. As a result, the cursor advances to the next key, and the index iterator treats that new key as its "save point".
  4. restore() is called on the EFT index cursor.
    restore() creates a new radix_iterator pointing to the same key as the old one, or to the next key, but in the new snapshot. EFT index cursors have some special logic to track whether the "last move" was a restore().
  5. next() is called on the EFT index cursor.
    This advances the cursor, even though the iterator is currently pointing to a key that has never been returned. We therefore "skip" the record after the one that was deleted.
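
The sequence above can be modelled with a small standalone sketch. This is not the EFT code: ToyIndex, ToyCursor and nextAfterDeleteSaveRestore are hypothetical names, a std::set stands in for the radix store, and snapshots are not modelled at all. It also assumes that the "last move was a restore()" tracking only suppresses the following advance when restore() had to land on a different key than the save point; the ticket does not spell that out. The repositionOnSave flag switches between the save() behaviour described in step 3 and, for contrast, a save() that keeps the deleted key as its save point.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>

// Toy stand-ins, not MongoDB code: the "index" is just an ordered set of keys.
using ToyIndex = std::set<std::string>;
using Key = std::optional<std::string>;

class ToyCursor {
public:
    ToyCursor(const ToyIndex& index, bool repositionOnSave)
        : _index(index), _repositionOnSave(repositionOnSave) {}

    // Step 1: position on the first key >= target and return it.
    Key seek(const std::string& target) {
        _skipAdvance = false;
        return _current = keyAt(_index.lower_bound(target));
    }

    // Step 3: with repositionOnSave, the position is re-resolved against the
    // index, so a deleted key silently becomes the next surviving key, and
    // that key is recorded as the save point.
    void save() {
        if (_repositionOnSave && _current)
            _current = keyAt(_index.lower_bound(*_current));
        _savePoint = _current;
    }

    // Step 4: land on the save point, or on the next key if it is gone.
    // ASSUMPTION: the next advance is suppressed only if restore() moved.
    void restore() {
        _current = _savePoint ? keyAt(_index.lower_bound(*_savePoint)) : Key{};
        _skipAdvance = (_current != _savePoint);
    }

    // Step 5: advance, unless restore() already moved to an unreturned key.
    Key next() {
        if (_skipAdvance) {
            _skipAdvance = false;
            return _current;
        }
        if (_current)
            _current = keyAt(_index.upper_bound(*_current));
        return _current;
    }

private:
    Key keyAt(ToyIndex::const_iterator it) const {
        return it == _index.end() ? Key{} : Key{*it};
    }

    const ToyIndex& _index;
    bool _repositionOnSave;
    Key _current;
    Key _savePoint;
    bool _skipAdvance = false;
};

// Runs steps 1-5 on the keys {a, b, c} and returns what next() yields.
Key nextAfterDeleteSaveRestore(bool repositionOnSave) {
    ToyIndex index{"a", "b", "c"};
    ToyCursor cursor(index, repositionOnSave);
    Key first = cursor.seek("a");  // step 1: the cursor returns "a"
    assert(first && *first == "a");
    index.erase("a");              // step 2: "a" is deleted
    cursor.save();                 // step 3
    cursor.restore();              // step 4
    return cursor.next();          // step 5
}
```

In this model, nextAfterDeleteSaveRestore(true) yields "c", so "b" is never returned, while nextAfterDeleteSaveRestore(false) yields "b". The only difference between the two runs is whether save() is allowed to move the save point off the deleted key.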

I have not found a case where this problem is actually observable in MongoDB. I do not believe it can be reproduced on the remove() path, since delete operations save their state before removing the document. It is, however, easy to reproduce by inserting calls to save() and restore(). For example, applying the patch below to the validation/repair logic and then running the relevant dbtests will show this issue in action.

diff --git a/src/mongo/db/catalog/validate_adaptor.cpp b/src/mongo/db/catalog/validate_adaptor.cpp
index c9a37451d7..bfbf9bbcf9 100644
--- a/src/mongo/db/catalog/validate_adaptor.cpp
+++ b/src/mongo/db/catalog/validate_adaptor.cpp
@@ -346,6 +346,10 @@ void ValidateAdaptor::traverseIndex(OperationContext* opCtx,
         }
 
         try {
+            // Saving and restoring a cursor should effectively be a no op. But it isn't.
+            indexCursor->save();
+            indexCursor->restore();
+
             indexEntry = indexCursor->nextKeyString(opCtx);
         } catch (const DBException& ex) {
             if (TestingProctor::instance().isEnabled() && ex.code() != ErrorCodes::WriteConflict) {
 

./build/optdebug/install/bin/dbtest validate_tests --storageEngine=ephemeralForTest 

(Base rev 9c72ff1047)

The test will "miss" index keys that are actually present and report that it was unable to repair the index.

 

 



 Comments   
Comment by Ian Boros [ 11/Oct/21 ]

CC henrik.edin

My patch for SERVER-60024 should have fixed this issue by ensuring that save() does not reposition the cursor off of a deleted key. During code review, Henrik mentioned that this may be related to some existing BFs involving EFT, and that we should file a ticket anyway so that there's a record of this problem somewhere.

Generated at Thu Feb 08 05:50:14 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.