Core Server / SERVER-21037

Initial sync can miss documents if concurrent update results in error (mmapv1 only)

    • Backwards Compatibility: Fully Compatible
    • Operating System: ALL
    • Sprint: QuInt B (11/02/15)

      Under certain circumstances, an initial sync collection clone can skip over documents that are concurrently updated. When this happens, the initial sync reports success, but the newly-synced member is silently missing those documents.

      The following conditions are required to trigger this scenario:

      • The sync source must be running with the mmapv1 storage engine.
      • When the collection scan query issued by the initial sync is yielding locks, an update must be issued against the document pointed to by the query's record cursor. This update must meet both of the following criteria:
        • The update must increase the size of the document, such that a document move is required.
        • The update must fail to generate an oplog entry (e.g. if the update fails with a duplicate key error; see the example below).
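
      For instance, with the unique index on {a: 1} and the existing document {_id: 2, a: 2} from the repro script below, the following update grows the target document enough to force a document move, but fails with a duplicate key error and therefore writes nothing to the oplog:

      db.foo.update({_id: 1}, {$set: {x: new Array(1024).join("x"), a: 2}});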

      With mmapv1, an update of a document generates an invalidation for all active cursors pointing to that document (as a result, those cursors are advanced). Documents that are updated in this manner during an initial sync are copied to the sync target during the "oplog replay" initial sync phase. However, the copy is not performed if the update does not generate an oplog entry, which causes the synced collection to be missing the document.
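
      One way to observe this (a quick check, reusing the primary connection and the test.foo collection from the repro script below) is to confirm that the failing updates never produce an oplog entry:

      // Expect 0: the duplicate-key updates against test.foo write nothing to the oplog.
      primary.getDB("local").oplog.rs.find({ns: "test.foo", op: "u"}).itcount();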

      This is a regression introduced in the 3.0.x series of the server. In the 2.6.x series and prior, invalidations are not issued if the update would generate an error; this logic was removed with the introduction of the storage API in the 3.0.x series.

      This issue can be reproduced with the following script:

      var rst = new ReplSetTest({nodes: 2,
                                 nodeOptions: {storageEngine: "mmapv1",
                                               setParameter: "internalQueryExecYieldIterations=2"}});
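      // internalQueryExecYieldIterations=2 (set above) makes the collection scan on the
      // sync source yield its locks very frequently, widening the race window.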
      rst.startSet();
      rst.initiate();
      var primary = rst.getPrimary();
      var secondary = rst.getSecondary();
      assert.writeOK(primary.getDB("test").foo.insert([{_id: 0, a: 0}, {_id: 1, a: 1}, {_id: 2, a: 2}]));
      assert.commandWorked(primary.getDB("test").foo.ensureIndex({a: 1}, {unique: true}));
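      // The unique index on {a: 1} is what makes the parallel update below fail with a
      // duplicate key error, so that it never generates an oplog entry.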
      rst.awaitReplication();
      rst.stop(secondary);
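      // Repeatedly issue an update that grows {_id: 1} (forcing a document move under
      // mmapv1) but conflicts with {_id: 2, a: 2} on the unique index, so it always fails.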
      startParallelShell(
          'while (true) { \
               db.foo.update({_id: 1}, {$set: {x: new Array(1024).join("x"), a: 2}}); \
               sleep(1000); \
           }', primary.port);
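      // Bring the secondary back so that it performs an initial sync while the failing
      // updates race with the collection clone.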
      rst.start(secondary);
      rst.waitForState(secondary, rst.SECONDARY, 60 * 1000);
      load("jstests/replsets/rslib.js");  // provides reconnect()
      reconnect(secondary.getDB("test"));
      assert.eq(3, secondary.getDB("test").foo.count());
      

      The assertion on the last line trips with the message "3 != 2", as the newly-synced member is missing the document {_id: 1, a: 1}.
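
      Listing the secondary's copy of the collection (reusing the connections from the script above) shows which document was dropped:

      secondary.getDB("test").foo.find().sort({_id: 1}).toArray();
      // [ { "_id" : 0, "a" : 0 }, { "_id" : 2, "a" : 2 } ]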

      The following patch to the server greatly increases reproducibility, by lengthening the window during which the collection scan has yielded its locks so that the concurrent update reliably lands mid-yield:

      diff --git a/src/mongo/db/query/query_yield.cpp b/src/mongo/db/query/query_yield.cpp
      index 4e0d463..7edde6e 100644
      --- a/src/mongo/db/query/query_yield.cpp
      +++ b/src/mongo/db/query/query_yield.cpp
      @@ -62,6 +62,10 @@ void QueryYield::yieldAllLocks(OperationContext* txn, RecordFetcher* fetcher) {
           // locks). If we are yielding, we are at a safe place to do so.
           txn->recoveryUnit()->abandonSnapshot();
      
      +    if (txn->getNS() == "test.foo") {
      +        sleepmillis(2000);
      +    }
      +
           // Track the number of yields in CurOp.
           CurOp::get(txn)->yielded();
      

      Reproduced with master (07168e08) and 3.0.7.

            Assignee: Geert Bosch (geert.bosch@mongodb.com)
            Reporter: J Rassi (rassi)