[SERVER-56839] Index seeks concurrent with recently-committed prepared transactions can return wrong results Created: 11/May/21  Updated: 29/Oct/23  Resolved: 25/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.2.0, 4.4.0, 5.0.0
Fix Version/s: 4.4.7, 5.0.0-rc2, 4.2.16, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Louis Williams Assignee: Louis Williams
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to SERVER-57270 Disable prepare_read_cursor_out_of_bo... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0, v4.4, v4.2
Steps To Reproduce:

Create this failpoint:

diff --git a/src/mongo/db/storage/wiredtiger/wiredtiger_index.cpp b/src/mongo/db/storage/wiredtiger/wiredtiger_index.cpp
index f91a65eb05..b91ce66b6f 100644
--- a/src/mongo/db/storage/wiredtiger/wiredtiger_index.cpp
+++ b/src/mongo/db/storage/wiredtiger/wiredtiger_index.cpp
@@ -82,6 +82,7 @@ namespace {
 
 MONGO_FAIL_POINT_DEFINE(WTCompactIndexEBUSY);
 MONGO_FAIL_POINT_DEFINE(WTEmulateOutOfOrderNextIndexKey);
+MONGO_FAIL_POINT_DEFINE(WTIndexPauseAfterSearchNear);
 
 using std::string;
 using std::vector;
@@ -1050,6 +1051,16 @@ protected:
 
         LOGV2_TRACE_CURSOR(20089, "cmp: {cmp}", "cmp"_attr = cmp);
 
+        WTIndexPauseAfterSearchNear.executeIf(
+            [](const BSONObj&) {
+                LOGV2(0, "hanging in search_near");
+                WTIndexPauseAfterSearchNear.pauseWhileSet();
+            },
+            [&](const BSONObj& data) {
+                LOGV2(0, "indexName", "name"_attr = data["indexName"].str());
+                return data["indexName"].str() == _idx.indexName();
+            });
+

And run this test:

/**
 * Reproduces a bug where searching for a key returns an adjacent key in a recently-committed
 * prepared transaction.
 *
 * Create an index with a single key, "a". Insert a new key for "b" in a prepared transaction. This
 * creates a prepared, but uncommitted index entry before the key we want to search for, "c", which
 * doesn't exist. We will query (search_near internally) for "c" and the cursor will initially land
 * on "a". This is less than they key were searching for, so the cursor is advanced to the next key,
 * expecting to land on something greater than or equal to "c". Before this happens, the prepared
 * transaction commits, making "b" visible. As a result, the cursor lands on and returns "b" even
 * though we queried for "c".
 */
(function() {
"use strict";
load("jstests/core/txns/libs/prepare_helpers.js");
load("jstests/libs/fail_point_util.js");
load('jstests/libs/parallel_shell_helpers.js');
 
const replTest = new ReplSetTest({nodes: 1});
replTest.startSet();
replTest.initiate();
 
const primary = replTest.getPrimary();
const dbName = 'test';
const collName = 'coll';
 
const db = primary.getDB(dbName);
assert.commandWorked(db[collName].createIndex({x: 1}));
assert.commandWorked(db[collName].insert({x: 'a'}));
 
const session = primary.startSession({causalConsistency: false});
const sessionDB = session.getDatabase(dbName);
const sessionColl = sessionDB.getCollection(collName);
session.startTransaction();
sessionColl.insert({x: 'b'});
let prepareTimestamp = PrepareHelpers.prepareTransaction(session);
 
let failpoint = configureFailPoint(primary, "WTIndexPauseAfterSearchNear", {indexName: 'x_1'});
 
// After the query on 'c' starts, we commit the transaction and advance the cursor, landing on 'b'
// instead of returning nothing.
const awaitShell = startParallelShell(function() {
    assert.eq(0, db.coll.findOne({x: 'c'}));  // fails (or crashes in debug builds)
}, primary.port);
 
failpoint.wait();
assert.commandWorked(PrepareHelpers.commitTransaction(session, prepareTimestamp));
failpoint.off();
awaitShell();
 
replTest.stopSet();
})();

Sprint: Execution Team 2021-05-31
Participants:
Linked BF Score: 50

 Description   

Read-only queries that perform index scans have the potential to return the wrong keys if they scan near recently-committed prepared transactions.

The bug is as follows, in this order:

  • An index has a key for "a" and it is visible to all operations.
  • A key for "b" is not visible because it is prepared and uncommitted by a transaction.
  • Another operation queries for a key, "c". This operation uses search_near() for a key starting with "c". Because there are no keys afterward and because read-only queries ignore prepared updates, the cursor lands on an adjacent key, "a".
  • The prepared transaction involving "b" commits.
  • Because "a" compares less than "c", we call next() on the cursor to uphold our contract to return a key that will compare greater than or equal to "c" (note this is the case for the actual key we would store for "c"). Due to the concurrently committed transaction, the call to next() actually lands on "b", which is newly visible, as it compares after "a". We expect that a call to next() guarantees we land on a key after "c", however, that is not the case, and leads us to return keys other than the ones we were searching for.

This does not affect write operations since they enforce and block on prepare conflicts. This only affects read-only queries.



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 13/Jul/21 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-56839 Index seeks should advance their cursor until a
matching key is found

This fixes a bug that can result in a seek on an index returning the
wrong key if it is concurrent with a recently-committed prepared
transaction on an adjacent record.

(cherry picked from commit b18749767fc53a9b5822312a063afb26883a774a)
(cherry picked from commit cc7ffaa7a7e7462037a40fffa833987e7aa644c5)
(cherry picked from commit 0fc497c9d6a4a30fe0624ce0d7d8f8b7428c684d)
Branch: v4.2
https://github.com/mongodb/mongo/commit/da8835fbda87fa3afe53e37adb6b1ea38e61a5f7

Comment by Githook User [ 16/Jun/21 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-56839 Index seeks should advance their cursor until a
matching key is found

This fixes a bug that can result in a seek on an index returning the
wrong key if it is concurrent with a recently-committed prepared
transaction on an adjacent record.

(cherry picked from commit b18749767fc53a9b5822312a063afb26883a774a)
(cherry picked from commit cc7ffaa7a7e7462037a40fffa833987e7aa644c5)
Branch: v4.4
https://github.com/mongodb/mongo/commit/0fc497c9d6a4a30fe0624ce0d7d8f8b7428c684d

Comment by Githook User [ 15/Jun/21 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-56839 Index seeks should advance their cursor until a
matching key is found

This fixes a bug that can result in a seek on an index returning the
wrong key if it is concurrent with a recently-committed prepared
transaction on an adjacent record.

(cherry picked from commit b18749767fc53a9b5822312a063afb26883a774a)
Branch: v5.0
https://github.com/mongodb/mongo/commit/cc7ffaa7a7e7462037a40fffa833987e7aa644c5

Comment by Githook User [ 25/May/21 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-56839 Index seeks should advance their cursor until a
matching key is found

This fixes a bug that can result in a seek on an index returning the
wrong key if it is concurrent with a recently-committed prepared
transaction on an adjacent record.
Branch: master
https://github.com/mongodb/mongo/commit/b18749767fc53a9b5822312a063afb26883a774a

Comment by Daniel Gottlieb (Inactive) [ 17/May/21 ]

Thanks for the clarifications louis.williams! I've responded with updated claims for posterity.

The MongoDB code assumes that if search_near lands on a key that compares lower than the search key, calling next() is guaranteed to return a key that compares higher than the search key.

Ah, that's an invariant that makes intuitive sense to me. I can see why violating that promise can lead to wrong results.

To answer your first question: if an operation reads without a timestamp, as is the case in my example, it gets a snapshot where an update is prepared. WiredTiger normally returns WT_PREPARE_CONFLICT for reads until this operation commits/aborts, but "ignore_prepare" allows the operator to return the pre-image. Once the prepared operation commits, it becomes visible.

I think I see. WT makes a txnId "visible" when a transaction is prepared (meaning new readers won't add said txnId to its copy of "concurrent transactions"). But the individual updates themselves are obviously marked as prepared, resulting in a WT_PREPARE_CONFLICT until that's no longer true.

Comment by Louis Williams [ 17/May/21 ]

daniel.gottlieb, I updated some text to hopefully answer your question:

Due to the concurrently committed transaction, the call to next() actually lands on "b", which is newly visible, as it compares after "a". We expect that a call to next() guarantees we land on a key after "c", however, that is not the case, and leads us to return keys other than the ones we were searching for.

The MongoDB code assumes that if search_near lands on a key that compares lower than the search key, calling next() is guaranteed to return a key that compares higher than the search key. The same logic also applies in the other direction for reverse cursors. "ignore_prepare" doesn't guarantee snapshot isolation and therefore this may not always be true.

To answer your first question: if an operation reads without a timestamp, as is the case in my example, it gets a snapshot where an update is prepared. WiredTiger normally returns WT_PREPARE_CONFLICT for reads until this operation commits/aborts, but "ignore_prepare" allows the operator to return the pre-image. Once the prepared operation commits, it becomes visible.

Comment by Daniel Gottlieb (Inactive) [ 13/May/21 ]

Clarification: Does the query for "c" have to be something that reads at a timestamp? If there's no read timestamp, my understanding is that if the query for "c" gets a snapshot prior to the prepared transaction committing, the "c" query's storage engine snapshot will never see "b"

Question: I didn't understand the last bullet point regarding the bug:

Due to the concurrently committed transaction, we land on "b", which is now after "a". This allows us to return a key other than the one we were searching for

Is the claim that higher level code isn't properly vetting any $match filters on returned documents? Or is the MDB Index API returning that we found an exact match when that wasn't true (perhaps allowing query to optimize away any double-checking of what gets returned)?

Generated at Thu Feb 08 05:40:19 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.