Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-44577

Ensure WiredTiger cursors have started a transaction before reading data

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 4.2.0, 4.4.0
    • Fix Version/s: 4.4.0-rc1, 4.5.1, 4.2.7
    • Component/s: Storage
    • Labels:
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4, v4.2
    • Steps To Reproduce:
      Hide

      /*
       * Test shows that an hybrid index build collection scan phase to get into prepare conflict despite 
       * its prepare conflict behavior set to kIgnoreConflictsAllowWrites. 
       *
       * Its illegal for an internal thread (+ignores interrupt) to get into prepare conflict as it can lead 3 way dead lock
       * involving prepared transaction, hybrid index build and step down/step up. So, we crash the system
       * when we hit that situation.
       */
      load("jstests/libs/check_log.js");
      load("jstests/replsets/rslib.js");
      load("jstests/core/txns/libs/prepare_helpers.js");
       
      (function() {
       
      "use strict";
       
      const rst = new ReplSetTest({nodes: 2});
      rst.startSet();
      rst.initiate();
       
      const dbName = jsTestName();
      const collName = "coll";
       
      const primary = rst.getPrimary();
      const primaryDB = primary.getDB(dbName);
      const primaryColl = primaryDB[collName];
      const collNss = primaryColl.getFullName();
       
      jsTestLog("Do document writes");
      for (var i = 0; i < 3; i++) {
          assert.commandWorked(primaryColl.insert({_id: i, y:i}, {"writeConcern": {"w": 1}}));
      }
      rst.awaitReplication();
       
      const secondary = rst.getSecondary();
      const secondaryAdmin = secondary.getDB("admin");
       
      // Make the hybrid build collection scan to yield often.
      assert.commandWorked(
          secondary.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 2}));
       
      jsTestLog("Enable setYieldAllLocksHang fail point");
      assert.commandWorked(secondaryAdmin.runCommand(
          {configureFailPoint: "setYieldAllLocksHang", data: {namespace: collNss}, mode: "alwaysOn"}));
       
      TestData.dbName = dbName;
      TestData.collName = collName;
       
      const indexThread = startParallelShell(() => {
          jsTestLog("Create index");
          const primaryDB = db.getSiblingDB(TestData.dbName);
          assert.commandWorked(primaryDB[TestData.collName].createIndex({"y": 1}));
      }, primary.port);
       
      // Wait until index build (collection scan phase) on secondary yields.
      jsTestLog("Wait for the hybrid index build on secondary to hang");
      assert.soon(
          function() {
              const result = secondaryAdmin.currentOp({"command.createIndexes": collName});
              assert.commandWorked(result);
              if (result.inprog.length === 1 && result.inprog[0].numYields > 0) {
                  return true;
              }
       
              return false;
          },
          function() {
              return "Failed to find operation in currentOp() output: " +
                  tojson(secondaryAdmin.currentOp());
          },
          30 * 1000);
       
      jsTestLog("Step up the current secondary");
      assert.commandWorked(secondary.adminCommand({"replSetStepUp": 1}));
      waitForState(secondary, ReplSetTest.State.PRIMARY);
      waitForState(primary, ReplSetTest.State.SECONDARY);
      const newPrimary = rst.getPrimary();
       
      // Make sure the secondary was able to step up successfully.
      assert.eq(newPrimary, secondary)
       
      jsTestLog("Start a txn");
      const session = newPrimary.startSession();
      const sessionDB = session.getDatabase(dbName);
      const sessionColl = sessionDB.getCollection(collName);
      session.startTransaction();
      assert.commandWorked(sessionColl.insert({_id: 20}));
       
      jsTestLog("Prepare txn");
      const prepareTimestamp = PrepareHelpers.prepareTransaction(session);
       
      // Make the hybrid index build to hang after first drain
      assert.commandWorked(newPrimary.adminCommand(
          {configureFailPoint: "hangAfterIndexBuildFirstDrain", mode: "alwaysOn"}));
       
      // This will make the hybrid build previously started on secondary (now primary) to resume.
      jsTestLog("Disable setYieldAllLocksHang fail point");
      assert.commandWorked(
          newPrimary.adminCommand({configureFailPoint: "setYieldAllLocksHang", mode: "off"}));
       
      // SERVER CRASH happens here as collection scan phase of hybrid build gets into prepare conflict.
       
      // Wait until the first drain completed.
      checkLog.contains(newPrimary, "Hanging after index build first drain");
       
      assert.commandWorked(
          newPrimary.adminCommand({configureFailPoint: "hangAfterIndexBuildFirstDrain", mode: "off"}));
       
      jsTestLog("Step down the new primary");
      assert.commandWorked(newPrimary.adminCommand({"replSetStepDown": 60 * 60, "force": true}));
       
      rst.stopSet();
      })();
      

      Show
      /* * Test shows that an hybrid index build collection scan phase to get into prepare conflict despite * its prepare conflict behavior set to kIgnoreConflictsAllowWrites. * * Its illegal for an internal thread (+ignores interrupt) to get into prepare conflict as it can lead 3 way dead lock * involving prepared transaction, hybrid index build and step down/step up. So, we crash the system * when we hit that situation. */ load("jstests/libs/check_log.js"); load("jstests/replsets/rslib.js"); load("jstests/core/txns/libs/prepare_helpers.js"); (function() { "use strict"; const rst = new ReplSetTest({nodes: 2}); rst.startSet(); rst.initiate(); const dbName = jsTestName(); const collName = "coll"; const primary = rst.getPrimary(); const primaryDB = primary.getDB(dbName); const primaryColl = primaryDB[collName]; const collNss = primaryColl.getFullName(); jsTestLog("Do document writes"); for (var i = 0; i < 3; i++) { assert.commandWorked(primaryColl.insert({_id: i, y:i}, {"writeConcern": {"w": 1}})); } rst.awaitReplication(); const secondary = rst.getSecondary(); const secondaryAdmin = secondary.getDB("admin"); // Make the hybrid build collection scan to yield often. assert.commandWorked( secondary.adminCommand({setParameter: 1, internalQueryExecYieldIterations: 2})); jsTestLog("Enable setYieldAllLocksHang fail point"); assert.commandWorked(secondaryAdmin.runCommand( {configureFailPoint: "setYieldAllLocksHang", data: {namespace: collNss}, mode: "alwaysOn"})); TestData.dbName = dbName; TestData.collName = collName; const indexThread = startParallelShell(() => { jsTestLog("Create index"); const primaryDB = db.getSiblingDB(TestData.dbName); assert.commandWorked(primaryDB[TestData.collName].createIndex({"y": 1})); }, primary.port); // Wait until index build (collection scan phase) on secondary yields. jsTestLog("Wait for the hybrid index build on secondary to hang"); assert.soon( function() { const result = secondaryAdmin.currentOp({"command.createIndexes": collName}); assert.commandWorked(result); if (result.inprog.length === 1 && result.inprog[0].numYields > 0) { return true; } return false; }, function() { return "Failed to find operation in currentOp() output: " + tojson(secondaryAdmin.currentOp()); }, 30 * 1000); jsTestLog("Step up the current secondary"); assert.commandWorked(secondary.adminCommand({"replSetStepUp": 1})); waitForState(secondary, ReplSetTest.State.PRIMARY); waitForState(primary, ReplSetTest.State.SECONDARY); const newPrimary = rst.getPrimary(); // Make sure the secondary was able to step up successfully. assert.eq(newPrimary, secondary) jsTestLog("Start a txn"); const session = newPrimary.startSession(); const sessionDB = session.getDatabase(dbName); const sessionColl = sessionDB.getCollection(collName); session.startTransaction(); assert.commandWorked(sessionColl.insert({_id: 20})); jsTestLog("Prepare txn"); const prepareTimestamp = PrepareHelpers.prepareTransaction(session); // Make the hybrid index build to hang after first drain assert.commandWorked(newPrimary.adminCommand( {configureFailPoint: "hangAfterIndexBuildFirstDrain", mode: "alwaysOn"})); // This will make the hybrid build previously started on secondary (now primary) to resume. jsTestLog("Disable setYieldAllLocksHang fail point"); assert.commandWorked( newPrimary.adminCommand({configureFailPoint: "setYieldAllLocksHang", mode: "off"})); // SERVER CRASH happens here as collection scan phase of hybrid build gets into prepare conflict.   // Wait until the first drain completed. checkLog.contains(newPrimary, "Hanging after index build first drain"); assert.commandWorked( newPrimary.adminCommand({configureFailPoint: "hangAfterIndexBuildFirstDrain", mode: "off"})); jsTestLog("Step down the new primary"); assert.commandWorked(newPrimary.adminCommand({"replSetStepDown": 60 * 60, "force": true})); rst.stopSet(); })();
    • Sprint:
      Execution Team 2020-01-13, Execution Team 2020-01-27, Execution Team 2020-04-06, Execution Team 2020-04-20
    • Linked BF Score:
      21

      Description

      Collection scan phase of hybrid index build can encounter a prepare conflict. In order to avoid 3 way dead lock mentioned in SERVER-41033, we set the prepare conflict behavior for hybrid index build to be PrepareConflictBehavior::kIgnoreConflictsAllowWrites. But, then it seems, collection scan phase still encounters a prepare conflict and as a result the server crashes due to the fix added by SERVER-41034. To be noted, index build started on secondary ignore interrupts.

      Modifying the repro script of SERVER-44507 slightly on the number of documents getting inserted into the collection, can result in hybrid index builder encountering a prepare conflict during collection scan phase. Doing a deeper investigation showed that, we can lead to WiredTiger losing the config settings (ignore_prepare=force/true) for a WT session and can cause the WT to enforce ignore_prepare=false (default setting) for the hybrid index build collection scan and lead to prepare conflict.

      Consider, I have inserted 3 documents into collection A before index build start. Then,
      1) Assume, Index build starts and collection scan phase plan executor yields after reading 2nd document in the collection A.
      2) Now, a transaction comes in and inserts 4th document into the collection A and prepares it.
      3) Index Build collection scan phase plan executor resumes. As a result, we reacquire locks and open a new WT storage transaction with ignore_prepare=force.
      4) Now the 3rd document gets read with the ignore_prepare=force config setting. And, then we write either it to external sorter (in-memory buffer) or update the index table (storage layer) for multi-index keys.
      5) Index Build collection scan now tries to read 4th document which is in prepared state.

      Taking a closer look at step#3 -5, we see the below sequence.

      1. Collection scan's plan executor explicitly begins WT Transaction on state restore. The transaction gets opened with ignore_prepare=force . Now, the recovery unit state becomes kActiveNotInUnitOfWork(see the recovery unit state transition diagram)
      2. Read the 3rd document using cursor.next().
      3. Start write unit of work. Since this is the outer WUOW, the recovery unit state becomes kActive.
      4. Write the 3rd document either to external sorter memory(bulk) or to the index table(real).
      5. Commit Write unit of work. This will cause the transaction to be committed in WT and reset the config settings flag (like ignore_prepare) associated with that session as part of __wt_txn_release(). And now the recovery unit state becomes kInactive
      6. Now, we read the 4th document using cursor.next()  w/o explicit begin_transaction() call from MongoDB layer. So, WT is going to make a read in an implicit transaction  with the default setting (i.e.) ignore_prepare=false. As a result, any subsequent read/write calls after sequence #5 is no longer guarded from prepare conflict error.

      Problems because of the behavior
      1) Concurrency issue - 3 way dead lock like in SERVER-41033
      2) Possible data corruption - Not sure?

            Thinking about the safety/correctness  of the reads and writes performed inside an implicit WT transaction after sequence #5, we see that the implicit transaction will be opened with the default isolation level = read-committed as opposed to WT transaction opened by MongoDB layer in snapshot isolation level. So, is it possible to end up  writing  the index key for a document (say doc X) twice? One from the code path - read the docX during collection scan and insert index keys; and other from the side write table drain code path. If so, I think, we can hit problem like in SERVER-41529.

      3) Should we need to audit our codebase to see if we have the same behavior (i.e.) read/write outside a transaction (read-committed isolation) and is it safe to do it?

        Attachments

          Issue Links

            Activity

              People

              Assignee:
              louis.williams Louis Williams
              Reporter:
              suganthi.mani Suganthi Mani
              Participants:
              Votes:
              0 Vote for this issue
              Watchers:
              9 Start watching this issue

                Dates

                Created:
                Updated:
                Resolved: