Core Server / SERVER-97743

rsSyncApplyStop failpoint may not be respected

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Replication
    • ALL
    • Repl 2024-12-09, Repl 2024-12-23, Repl 2025-01-06

      The rsSyncApplyStop failpoint is often used in tests to stop replication on one or more secondary nodes. However, the current implementation has a race condition.

      The oplog applier checks the failpoint right before getting a batch to apply, so the following interleaving can happen:

      1. Oplog applier checks the failpoint (still off, check passes)
      2. Test sets failpoint
      3. Oplog applier fetches a batch
      4. Oplog applier applies the batch, even though the failpoint is now on
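
      The interleaving above can be sketched with a toy model. This is not MongoDB code: the node object, the step list, and the function name runInterleaving are all hypothetical and only illustrate why a single check before fetching a batch leaves a window for the test to lose the race.

```javascript
// Toy model of the check-then-fetch race. Not MongoDB code; all names are hypothetical.
function runInterleaving(setFailpointBeforeStep) {
    const node = {failpointOn: false, oplogBuffer: ["op1", "op2"], applied: []};
    let checkPassed = false;
    let batch = [];

    // The applier's steps, in the order described in the report.
    const applierSteps = [
        // Check the failpoint once, before fetching.
        () => { checkPassed = !node.failpointOn; },
        // Fetch a batch only if the earlier check passed.
        () => { if (checkPassed) batch = node.oplogBuffer.splice(0); },
        // Apply whatever was fetched; the failpoint is not consulted again.
        () => { node.applied.push(...batch); },
    ];

    applierSteps.forEach((step, i) => {
        // The "test" enables the failpoint just before applier step i.
        if (i === setFailpointBeforeStep) {
            node.failpointOn = true;
        }
        step();
    });
    return node;
}

// Failpoint set before the check: nothing is applied.
console.log(runInterleaving(0).applied.length);
// Failpoint set after the check but before the fetch: the batch is still applied.
console.log(runInterleaving(1).applied.length);
```

      In the second call the failpoint is on by the time the batch is applied, which matches the failure mode described in steps 1 to 4.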

      All tests using that failpoint must be reviewed and adapted. One possible solution is to replace all current failpoint enable/disable command invocations with calls to the fsync lock/unlock commands, which seem to be properly synchronized with the oplog applier.
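
      A minimal sketch of what such a replacement might look like in a test, assuming the fsync lock does block the oplog applier as suggested above. The helper names are hypothetical and not part of any existing test library; conn is assumed to be a connection object exposing adminCommand(), like secondary in the reproducer below.

```javascript
// Hypothetical helpers, not an existing jstests API.
// Assumption: `conn` exposes adminCommand(cmd) and returns a reply with an `ok` field.

// Take the fsync lock on the node instead of enabling rsSyncApplyStop.
function pauseReplicationWithFsyncLock(conn) {
    const res = conn.adminCommand({fsync: 1, lock: true});
    if (!res.ok) {
        throw new Error("fsync lock failed: " + JSON.stringify(res));
    }
    return res;
}

// Release the fsync lock instead of turning the failpoint off.
function resumeReplicationWithFsyncUnlock(conn) {
    const res = conn.adminCommand({fsyncUnlock: 1});
    if (!res.ok) {
        throw new Error("fsync unlock failed: " + JSON.stringify(res));
    }
    return res;
}
```

      In a test, the two configureFailPoint calls on the secondary would be swapped for pauseReplicationWithFsyncLock(secondary) and resumeReplicationWithFsyncUnlock(secondary). Whether this fully closes the race would still need to be verified for each test.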

      REPRODUCIBLE

      // To execute:
      // ./buildscripts/resmoke.py run --suite=replica_sets --storageEngine=wiredTiger --jobs=1 --storageEngineCacheSizeGB=0.5 --dbpath=/tmp/testpath jstests/replsets/rsSyncApplyStop_bug.js
      
      import {ReplSetTest} from "jstests/libs/replsettest.js";
      
      const replTest = new ReplSetTest({nodes: 2});
      replTest.startSet();
      replTest.initiate();
      
      const primary = replTest.getPrimary();
      const secondary = replTest.getSecondary();
      
      for (let i = 0; i < 100; i++) {
          // Stop oplog application on the secondary.
          secondary.adminCommand({configureFailPoint: 'rsSyncApplyStop', mode: 'alwaysOn'});
      
          let commandSucceeded = false;
      
          try {
              // With the only secondary stopped, w: 2 should time out with a write concern error.
              assert.commandWorked(
                  primary.getDB('test').coll.insert({x: i}, {writeConcern: {w: 2, wtimeout: 100}}));
              commandSucceeded = true;
          } catch (e) {
              assert(e.writeConcernError && ErrorCodes.isWriteConcernError(e.writeConcernError.code),
                     'Got error different than write concern error on inserting ' + i +
                         '. Command returned: ' + e);
          }
      
          // Resume replication before the next iteration.
          secondary.adminCommand({configureFailPoint: 'rsSyncApplyStop', mode: 'off'});
      
          // If the insert succeeded, the secondary applied it despite the failpoint.
          assert(!commandSucceeded,
                 'No error was thrown despite write concern should not have been satisfied on inserting ' + i);
      }
      
      replTest.stopSet();
      

      OUTPUT
      The test failed after a couple of runs (adjust the wtimeout to reproduce locally):

      [js_test:rsSyncApplyStop_bug] d20041| 2024-12-02T12:08:43.833+00:00 W  CONTROL  [conn1] Set failpoint{"failPointName":"rsSyncApplyStop","failPoint":{"mode":0,"data":{},"timesEntered":6}}
      [js_test:rsSyncApplyStop_bug] uncaught exception: Error: assert failed : No error was thrown despite write concern should not have been satisfied on inserting 39 :
      [js_test:rsSyncApplyStop_bug] doassert@src/mongo/shell/assert.js:20:14
      [js_test:rsSyncApplyStop_bug] assert@src/mongo/shell/assert.js:152:17
      [js_test:rsSyncApplyStop_bug] @jstests/replsets/rsSyncApplyStop_bug.js:28:11
      [js_test:rsSyncApplyStop_bug] Error: assert failed : No error was thrown despite write concern not satisfied on inserting 39
      [js_test:rsSyncApplyStop_bug] failed to load: jstests/replsets/rsSyncApplyStop_bug.js
      

            Assignee:
            austin.miller@mongodb.com Austin Miller
            Reporter:
            pierlauro.sciarelli@mongodb.com Pierlauro Sciarelli
            Votes:
            0
            Watchers:
            5