[SERVER-44577] Ensure WiredTiger cursors have started a transaction before reading data Created: 12/Nov/19  Updated: 29/Oct/23  Resolved: 07/Apr/20

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 4.2.0, 4.4.0
Fix Version/s: 4.2.7, 4.4.0-rc1, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Suganthi Mani Assignee: Louis Williams
Resolution: Fixed Votes: 0
Labels: groomed
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-44582 Assert that a storage transaction is ... Closed
Duplicate
is duplicated by SERVER-44581 Index builds must ensure a storage tr... Closed
Problem/Incident
Related
related to SERVER-44507 Hybrid index build is able to commit ... Closed
related to SERVER-41033 set ignore_prepare=true throughout an... Closed
related to SERVER-41034 Invariant if we get a prepare conflic... Closed
related to SERVER-44582 Assert that a storage transaction is ... Closed
is related to SERVER-44590 Logged snapshot IDs should match when... Closed
is related to SERVER-47407 Avoid WriteUnitOfWork in index build ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4, v4.2
Steps To Reproduce:

/*
 * Test shows that an hybrid index build collection scan phase to get into prepare conflict despite 
 * its prepare conflict behavior set to kIgnoreConflictsAllowWrites. 
 *
 * Its illegal for an internal thread (+ignores interrupt) to get into prepare conflict as it can lead 3 way dead lock
 * involving prepared transaction, hybrid index build and step down/step up. So, we crash the system
 * when we hit that situation.
 */
load("jstests/libs/check_log.js");
load("jstests/replsets/rslib.js");
load("jstests/core/txns/libs/prepare_helpers.js");
 
(function() {
 
"use strict";
 
const rst = new ReplSetTest({nodes: 2});
rst.startSet();
rst.initiate();
 
const dbName = jsTestName();
const collName = "coll";
 
const primary = rst.getPrimary();
const primaryDB = primary.getDB(dbName);
const primaryColl = primaryDB[collName];
const collNss = primaryColl.getFullName();
 
jsTestLog("Do document writes");
for (var i = 0; i < 3; i++) {
    assert.commandWorked(primaryColl.insert({_id: i, y:i}, {"writeConcern": {"w": 1}}));
}
rst.awaitReplication();
 
const secondary = rst.getSecondary();
const secondaryAdmin = secondary.getDB("admin");
 
// Make the hybrid build collection scan to yield often.
assert.commandWorked(
    secondary.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 2}));
 
jsTestLog("Enable setYieldAllLocksHang fail point");
assert.commandWorked(secondaryAdmin.runCommand(
    {configureFailPoint: "setYieldAllLocksHang", data: {namespace: collNss}, mode: "alwaysOn"}));
 
TestData.dbName = dbName;
TestData.collName = collName;
 
const indexThread = startParallelShell(() => {
    jsTestLog("Create index");
    const primaryDB = db.getSiblingDB(TestData.dbName);
    assert.commandWorked(primaryDB[TestData.collName].createIndex({"y": 1}));
}, primary.port);
 
// Wait until index build (collection scan phase) on secondary yields.
jsTestLog("Wait for the hybrid index build on secondary to hang");
assert.soon(
    function() {
        const result = secondaryAdmin.currentOp({"command.createIndexes": collName});
        assert.commandWorked(result);
        if (result.inprog.length === 1 && result.inprog[0].numYields > 0) {
            return true;
        }
 
        return false;
    },
    function() {
        return "Failed to find operation in currentOp() output: " +
            tojson(secondaryAdmin.currentOp());
    },
    30 * 1000);
 
jsTestLog("Step up the current secondary");
assert.commandWorked(secondary.adminCommand({"replSetStepUp": 1}));
waitForState(secondary, ReplSetTest.State.PRIMARY);
waitForState(primary, ReplSetTest.State.SECONDARY);
const newPrimary = rst.getPrimary();
 
// Make sure the secondary was able to step up successfully.
assert.eq(newPrimary, secondary)
 
jsTestLog("Start a txn");
const session = newPrimary.startSession();
const sessionDB = session.getDatabase(dbName);
const sessionColl = sessionDB.getCollection(collName);
session.startTransaction();
assert.commandWorked(sessionColl.insert({_id: 20}));
 
jsTestLog("Prepare txn");
const prepareTimestamp = PrepareHelpers.prepareTransaction(session);
 
// Make the hybrid index build to hang after first drain
assert.commandWorked(newPrimary.adminCommand(
    {configureFailPoint: "hangAfterIndexBuildFirstDrain", mode: "alwaysOn"}));
 
// This will make the hybrid build previously started on secondary (now primary) to resume.
jsTestLog("Disable setYieldAllLocksHang fail point");
assert.commandWorked(
    newPrimary.adminCommand({configureFailPoint: "setYieldAllLocksHang", mode: "off"}));
 
// SERVER CRASH happens here as collection scan phase of hybrid build gets into prepare conflict.
 
// Wait until the first drain completed.
checkLog.contains(newPrimary, "Hanging after index build first drain");
 
assert.commandWorked(
    newPrimary.adminCommand({configureFailPoint: "hangAfterIndexBuildFirstDrain", mode: "off"}));
 
jsTestLog("Step down the new primary");
assert.commandWorked(newPrimary.adminCommand({"replSetStepDown": 60 * 60, "force": true}));
 
rst.stopSet();
})();

Sprint: Execution Team 2020-01-13, Execution Team 2020-01-27, Execution Team 2020-04-06, Execution Team 2020-04-20
Participants:
Linked BF Score: 21

 Description   

Collection scan phase of hybrid index build can encounter a prepare conflict. In order to avoid 3 way dead lock mentioned in SERVER-41033, we set the prepare conflict behavior for hybrid index build to be PrepareConflictBehavior::kIgnoreConflictsAllowWrites. But, then it seems, collection scan phase still encounters a prepare conflict and as a result the server crashes due to the fix added by SERVER-41034. To be noted, index build started on secondary ignore interrupts.

Modifying the repro script of SERVER-44507 slightly on the number of documents getting inserted into the collection, can result in hybrid index builder encountering a prepare conflict during collection scan phase. Doing a deeper investigation showed that, we can lead to WiredTiger losing the config settings (ignore_prepare=force/true) for a WT session and can cause the WT to enforce ignore_prepare=false (default setting) for the hybrid index build collection scan and lead to prepare conflict.

Consider, I have inserted 3 documents into collection A before index build start. Then,
1) Assume, Index build starts and collection scan phase plan executor yields after reading 2nd document in the collection A.
2) Now, a transaction comes in and inserts 4th document into the collection A and prepares it.
3) Index Build collection scan phase plan executor resumes. As a result, we reacquire locks and open a new WT storage transaction with ignore_prepare=force.
4) Now the 3rd document gets read with the ignore_prepare=force config setting. And, then we write either it to external sorter (in-memory buffer) or update the index table (storage layer) for multi-index keys.
5) Index Build collection scan now tries to read 4th document which is in prepared state.

Taking a closer look at step#3 -5, we see the below sequence.

  1. Collection scan's plan executor explicitly begins WT Transaction on state restore. The transaction gets opened with ignore_prepare=force . Now, the recovery unit state becomes kActiveNotInUnitOfWork(see the recovery unit state transition diagram)
  2. Read the 3rd document using cursor.next().
  3. Start write unit of work. Since this is the outer WUOW, the recovery unit state becomes kActive.
  4. Write the 3rd document either to external sorter memory(bulk) or to the index table(real).
  5. Commit Write unit of work. This will cause the transaction to be committed in WT and reset the config settings flag (like ignore_prepare) associated with that session as part of __wt_txn_release(). And now the recovery unit state becomes kInactive
  6. Now, we read the 4th document using cursor.next()  w/o explicit begin_transaction() call from MongoDB layer. So, WT is going to make a read in an implicit transaction  with the default setting (i.e.) ignore_prepare=false. As a result, any subsequent read/write calls after sequence #5 is no longer guarded from prepare conflict error.

Problems because of the behavior
1) Concurrency issue - 3 way dead lock like in SERVER-41033
2) Possible data corruption - Not sure?

      Thinking about the safety/correctness  of the reads and writes performed inside an implicit WT transaction after sequence #5, we see that the implicit transaction will be opened with the default isolation level = read-committed as opposed to WT transaction opened by MongoDB layer in snapshot isolation level. So, is it possible to end up  writing  the index key for a document (say doc X) twice? One from the code path - read the docX during collection scan and insert index keys; and other from the side write table drain code path. If so, I think, we can hit problem like in SERVER-41529.

3) Should we need to audit our codebase to see if we have the same behavior (i.e.) read/write outside a transaction (read-committed isolation) and is it safe to do it?



 Comments   
Comment by Githook User [ 16/Apr/20 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-44577 Ensure WiredTiger RecordStore and SortedDataInterface cursors have started a transaction before reading data

(cherry picked from commit 34c8bef2ebbe355aef0439a5b85313db1c2f242d)
(cherry picked from commit d71b8cbccdd8bb2cf547989c630229d4c73b2856)

SERVER-44577 Add uses_prepare_transaction tag to test

(cherry picked from commit 4e23322e27b6372e7c747a68a3f657f16211b540)
(cherry picked from commit 15281acee990afa8bf4a67afaf7ce9880e444391)

SERVER-47407 Avoid using WriteUnitOfWork in index build collection scan loop

(cherry picked from commit e9379fcee79456d3795b598ccad91a0694007d0e)
(cherry picked from commit 61c58c415c5f0ec2b8948f54b95b0ec48659b6a2)
Branch: v4.2
https://github.com/mongodb/mongo/commit/1085f4b003369973a545c3fc527689fdcb940c55

Comment by Githook User [ 16/Apr/20 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-44577 Add uses_prepare_transaction tag to test

(cherry picked from commit 4e23322e27b6372e7c747a68a3f657f16211b540)
Branch: v4.4
https://github.com/mongodb/mongo/commit/15281acee990afa8bf4a67afaf7ce9880e444391

Comment by Githook User [ 16/Apr/20 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-44577 Add uses_prepare_transaction tag to test
Branch: master
https://github.com/mongodb/mongo/commit/4e23322e27b6372e7c747a68a3f657f16211b540

Comment by Githook User [ 10/Apr/20 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-44577 Ensure WiredTiger RecordStore and SortedDataInterface cursors have started a transaction before reading data

(cherry picked from commit 34c8bef2ebbe355aef0439a5b85313db1c2f242d)
Branch: v4.4
https://github.com/mongodb/mongo/commit/d71b8cbccdd8bb2cf547989c630229d4c73b2856

Comment by Githook User [ 07/Apr/20 ]

Author:

{'name': 'Louis Williams', 'email': 'louis.williams@mongodb.com', 'username': 'louiswilliams'}

Message: SERVER-44577 Ensure WiredTiger RecordStore and SortedDataInterface cursors have started a transaction before reading data
Branch: master
https://github.com/mongodb/mongo/commit/34c8bef2ebbe355aef0439a5b85313db1c2f242d

Comment by Suganthi Mani [ 14/Nov/19 ]

Just a quick correction, the isolation level of reads and writes performed inside an implicit WT transaction after sequence #5, will still be have the snapshot isolation level and NOT as read-committed isolation. Because when we release the resources of a WT transaction on commit, we set the transaction's isolation level to be session's isolation level (= snapshot isolation level). So, I feel, there won't be any data corruption/correctness issue.

Comment by Louis Williams [ 12/Nov/19 ]

Funilly enough, I filed this same bug in SERVER-44581. This particular problem can't cause data corruption, just deadlocks. If the reverse case were true (i.e. prepare conflicts are ignored when they weren't supposed to), that would be problematic.

There's no reason we couldn't encounter this problem in other parts of the code, so I also filed SERVER-44582 to investigate adding assertions that cursor read operations must occur inside transactions. 

Comment by Eric Milkie [ 12/Nov/19 ]

I don't see how you could get index data corruption (#2) here.

Generated at Thu Feb 08 05:06:22 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.