Core Server / SERVER-17923

Creating/dropping multiple background indexes on the same collection can cause fatal error on secondaries

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: 3.0.1
    • Fix Version/s: 3.0.4, 3.1.4
    • Component/s: Replication
    • Backwards Compatibility: Minor Change
    • Operating System: ALL
    • Linked BF Score: 0

      Description

      Issue Status as of Jun 09, 2015

      ISSUE SUMMARY
      On a MongoDB replica set, when a secondary node is running multiple background index builds on a given collection, metadata changes to that same collection may lead to a fatal error on the secondary node.

      Metadata changes that may trigger this behavior include renaming and dropping the collection, and dropping the database that contains the collection.

      USER IMPACT
      If enough secondary nodes hit the error and shut down that a majority of voting members is no longer operational, the replica set cannot keep or elect a primary, leading to loss of write availability.
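The availability condition above can be sketched as simple arithmetic. This is a minimal illustration, assuming every member is a voting member with one vote; the function names `majority` and `can_accept_writes` are hypothetical and not part of any MongoDB API.

```python
def majority(voting_members: int) -> int:
    """Smallest number of voting members that forms a strict majority."""
    return voting_members // 2 + 1


def can_accept_writes(voting_members: int, members_down: int) -> bool:
    """True if the members still up can form a majority and keep a primary.

    Assumption: all members vote and each has exactly one vote.
    """
    members_up = voting_members - members_down
    return members_up >= majority(voting_members)


if __name__ == "__main__":
    # Hypothetical three-member set: one primary and two secondaries.
    for down in range(3):
        print(f"{down} member(s) down -> writes possible: {can_accept_writes(3, down)}")
```

Under these assumptions, a three-member set survives the loss of one secondary but loses write availability if both secondaries shut down.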

      WORKAROUNDS
      Avoid collection creation, drop, and rename operations while building indexes in the background on that same collection.

      AFFECTED VERSIONS
      MongoDB 3.0.0 through 3.0.3.

      FIX VERSION
      The fix is included in the 3.0.4 production release.
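As a rough aid, the affected range stated above (3.0.0 through 3.0.3, fixed in 3.0.4) can be checked with a small helper. This is a sketch only: `parse_version` and `is_affected` are hypothetical names, the helper handles plain `major.minor.patch` strings without suffixes such as release-candidate tags, and it covers only the 3.0 range given in this summary, not the 3.1 development series.

```python
FIRST_AFFECTED = (3, 0, 0)  # first release in the affected range stated above
FIRST_FIXED = (3, 0, 4)     # production release that contains the fix


def parse_version(version: str) -> tuple:
    """Turn a plain 'major.minor.patch' string into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def is_affected(version: str) -> bool:
    """True if the version falls in the 3.0.0 through 3.0.3 range."""
    return FIRST_AFFECTED <= parse_version(version) < FIRST_FIXED


if __name__ == "__main__":
    for candidate in ("2.6.9", "3.0.1", "3.0.3", "3.0.4"):
        print(candidate, is_affected(candidate))
```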

      Original description

      When multiple clients create and drop indexes on the same collection, using different options and variations, there is a chance that secondaries will fassert while applying the oplog. So far, no problem has been observed on the primary.

      Tested using 3.0.1 Enterprise. Known to occur on Ubuntu 12.01 and Windows 8.

      Attached is the script run in each shell session. The "test.ts" collection held 250K small documents structured as {_id: ObjectId, server: int, cpu: int}; however, neither the structure nor the number of documents seems to matter, as other variations also trigger the fault. Background indexing appears to be a crucial requirement. The fault was originally observed on a sharded cluster with operations performed via a mongos, but a basic replica set is all that is needed.

      Sometimes the secondaries can be restarted, recover, and rejoin normally. Sometimes they fassert again on every restart until they are resynced. Both outcomes were observed in consecutive runs, with no known difference other than timing to explain them.

      Also attached is log output from an example restart (on Windows) where the secondary could not recover.
