- Type: Bug
- Resolution: Gone away
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- Storage Execution
An EFT (ephemeralForTest) index cursor can "skip" over an entry when the prior entry was deleted in the same recovery unit. This is the sequence of events:
- An index cursor gets positioned on a key.
- Under the same recovery unit (and same snapshot), that key is deleted.
- save() is called on the EFT index cursor.
As part of save(), we dereference the radix_iterator, which automatically repositions itself if the key it is sitting on was deleted. As a result, the iterator advances to the next key, and the index cursor records that new key as its "save point".
- restore() is called on the EFT index cursor.
restore() creates a new radix_iterator in the new snapshot, pointing to the same key as the old one or to the next key. EFT index cursors have special logic to track whether the "last move" was a restore().
- next() is called on the EFT cursor.
This advances the cursor even though the iterator is currently pointing to a key that has never been returned. As a result, we "skip" the record after the one that was deleted.
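The sequence above can be sketched with a toy model. This is only an illustration under stated assumptions: `ToyCursor`, `runScenario`, and the `saveFollowsIterator` flag are hypothetical names, a `std::set` stands in for the radix store, and snapshots are not modeled at all. It is not the EFT implementation, and the `false` branch is only a contrast case, not a proposed fix.

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <string>
#include <vector>

using Index = std::set<std::string>;

// Toy stand-in for an index cursor. All names here are hypothetical.
class ToyCursor {
public:
    ToyCursor(const Index& index, bool saveFollowsIterator)
        : _index(index), _saveFollowsIterator(saveFollowsIterator) {}

    // Position on the first key >= `key` and return it.
    std::optional<std::string> seek(const std::string& key) {
        _skipAdvance = false;
        _pos = keyAt(_index.lower_bound(key));
        return _pos;
    }

    void save() {
        assert(_pos.has_value());
        if (_saveFollowsIterator) {
            // Mimics the behavior described in the ticket: dereferencing the
            // iterator silently moves to the next key if the current key was
            // deleted, and that key becomes the save point.
            _saved = keyAt(_index.lower_bound(*_pos));
        } else {
            // Contrast case: remember the key that was actually returned.
            _saved = _pos;
        }
    }

    void restore() {
        _pos = _saved ? keyAt(_index.lower_bound(*_saved)) : std::nullopt;
        // If we landed on a different key than the save point, that key has
        // not been returned yet, so the following next() must not advance.
        _skipAdvance = (_pos != _saved);
    }

    std::optional<std::string> next() {
        if (_skipAdvance) {
            _skipAdvance = false;
            return _pos;
        }
        if (!_pos) {
            return std::nullopt;
        }
        _pos = keyAt(_index.upper_bound(*_pos));
        return _pos;
    }

private:
    std::optional<std::string> keyAt(Index::const_iterator it) const {
        if (it == _index.end()) {
            return std::nullopt;
        }
        return *it;
    }

    const Index& _index;
    bool _saveFollowsIterator;
    bool _skipAdvance = false;
    std::optional<std::string> _pos;
    std::optional<std::string> _saved;
};

// Replays the steps from the ticket and returns every key the cursor returned.
std::vector<std::string> runScenario(bool saveFollowsIterator) {
    Index index{"a", "b", "c"};
    ToyCursor cursor(index, saveFollowsIterator);
    std::vector<std::string> returned;

    returned.push_back(*cursor.seek("a"));  // cursor positioned on "a"
    index.erase("a");                       // "a" is deleted
    cursor.save();
    cursor.restore();
    while (auto key = cursor.next()) {
        returned.push_back(*key);
    }
    return returned;
}
```

In this model, `runScenario(true)` returns "a" and "c", so "b" is never seen, while `runScenario(false)` returns all three keys. The only difference between the two runs is which key save() records as the save point.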
I have not found a case where this problem is actually observable in MongoDB. I do not believe it can be reproduced on the remove() path, since delete operations save their state before removing the document. It is, however, easy to reproduce by inserting calls to save() and restore(). For example, applying the patch below to the validation/repair logic and then running the relevant dbtests shows the issue in action.
diff --git a/src/mongo/db/catalog/validate_adaptor.cpp b/src/mongo/db/catalog/validate_adaptor.cpp
index c9a37451d7..bfbf9bbcf9 100644
--- a/src/mongo/db/catalog/validate_adaptor.cpp
+++ b/src/mongo/db/catalog/validate_adaptor.cpp
@@ -346,6 +346,10 @@ void ValidateAdaptor::traverseIndex(OperationContext* opCtx,
         }
 
         try {
+            // Saving and restoring a cursor should effectively be a no op. But it isn't.
+            indexCursor->save();
+            indexCursor->restore();
+
             indexEntry = indexCursor->nextKeyString(opCtx);
         } catch (const DBException& ex) {
             if (TestingProctor::instance().isEnabled() && ex.code() != ErrorCodes::WriteConflict) {
./build/optdebug/install/bin/dbtest validate_tests --storageEngine=ephemeralForTest
(Base rev 9c72ff1047)
The test will "miss" index keys that are actually present and report that it was unable to repair the index.