Core Server / SERVER-39910

Race in rollback_drop_database.js

    • Type: Bug
    • Resolution: Fixed
    • Priority: Minor - P4
    • Fix Version/s: 4.1.9
    • Affects Version/s: 4.1.8
    • Component/s: Replication
    • Labels:
      None
    • Fully Compatible
    • ALL
    • Repl 2019-03-11, Repl 2019-03-25
    • 11

      The test executes a dropDatabase command, which is written to the oplog as a dropCollection entry followed by a dropDatabase entry. The test wants the dropCollection entry to be majority committed, so it can test what happens when only the dropDatabase entry is rolled back.

      Expected test sequence:

      1. Start a 3-node set, nodes are named "rollback", "tieBreaker", "syncSource"
      2. Create a database with one collection, then drop the database
      3. The dropDatabase command first drops the collection, writes the dropCollection oplog entry, and waits (by default) for the dropCollection entry to be majority-committed.
      4. Both secondaries are lucky enough to receive the entry, and at least one of them acknowledges it.
      5. Now the dropCollection entry is majority-committed, so dropDatabase proceeds on the primary to drop the database.
      6. The primary hits the dropDatabaseHangBeforeLog failpoint - it doesn't write the dropDatabase oplog entry. It hangs holding the global write lock.
      7. The RollbackTest fixture in transitionToRollbackOperations sees that both secondaries are caught up, so it proceeds to the next state.

      Failure sequence:

      1. Start a 3-node set, nodes are named "rollback", "tieBreaker", "syncSource"
      2. Create a database with one collection, then drop the database
      3. The dropDatabase command first drops the collection, writes the dropCollection oplog entry, and waits (by default) for the dropCollection entry to be majority-committed.
      4. Only one of the secondaries is lucky enough to receive the entry.
      5. Now the dropCollection entry is majority-committed, so dropDatabase proceeds on the primary to drop the database.
      6. The primary hits the dropDatabaseHangBeforeLog failpoint - it doesn't write the dropDatabase oplog entry. It hangs holding the global write lock, which blocks the other secondary from receiving the dropCollection entry.
      7. The RollbackTest fixture in transitionToRollbackOperations sees that one of the secondaries is not caught up, so it waits until the test times out.
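
The two sequences above can be condensed into a toy model. This is only a sketch and not MongoDB code: the function name `simulateDropDatabase` and its return strings are hypothetical, and it assumes (as the failure sequence describes) that no secondary can receive the dropCollection entry once the primary hangs on the failpoint.

```javascript
// Toy model of the race described above. Not MongoDB code; all names are illustrative.
// `fetchedInTime` lists the secondaries that received the dropCollection entry
// before the primary hit the dropDatabaseHangBeforeLog failpoint.
function simulateDropDatabase(fetchedInTime) {
  const secondaries = ["syncSource", "tieBreaker"];
  const replicated = new Set(fetchedInTime);

  // Steps 3-5: in a 3-node set, the primary plus one secondary is a majority.
  const majorityCommitted = 1 + replicated.size >= 2;
  if (!majorityCommitted) {
    return "dropDatabase still waiting for majority";
  }

  // Step 6: the primary hangs holding the global write lock. Assumption from the
  // ticket: a secondary that has not yet received the entry cannot receive it now,
  // so `replicated` is frozen from here on.

  // Step 7: the fixture waits for every secondary to catch up.
  const lagging = secondaries.filter((node) => !replicated.has(node));
  return lagging.length === 0
    ? "fixture proceeds"
    : `fixture hangs waiting for ${lagging.join(", ")}`;
}

console.log(simulateDropDatabase(["syncSource", "tieBreaker"]));
console.log(simulateDropDatabase(["syncSource"]));
```

The first call corresponds to the expected sequence and the second to the failure sequence: the entry is majority-committed in both, but only in the first has every secondary replicated it.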

      This race was introduced in SERVER-38865, when the tieBreaker node was changed from an arbiter to a secondary. Before that change, a majority-committed dropCollection entry meant that every secondary (which was just "syncSource") had replicated it.

      Tess hypothesizes that other rollback tests also rely on the old guarantee that "majority committed" means "all secondaries have replicated". I'll restore that guarantee by stopping replication on the new secondary, "tieBreaker", at the beginning of the RollbackTest, and in transitionToRollbackOperations() I'll only await replication on "syncSource".
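
A rough illustration of why the proposed change removes the hang, again as a toy model and not the real RollbackTest code. The helper `fixtureOutcome` and its argument names are hypothetical. The sketch assumes that with replication stopped on "tieBreaker", the only secondary that can acknowledge the dropCollection entry is "syncSource", so a majority-committed entry has necessarily reached it.

```javascript
// Toy model of the proposed fix. Not the real RollbackTest fixture; names are illustrative.
function fixtureOutcome({ replicated, awaited }) {
  // The fixture blocks until every awaited node has the dropCollection entry.
  const lagging = awaited.filter((node) => !replicated.has(node));
  return lagging.length === 0 ? "proceeds" : "hangs";
}

// Replication is stopped on tieBreaker, so only syncSource holds the entry
// when the primary hangs on the failpoint.
const replicated = new Set(["syncSource"]);

// Old behavior: await replication on both secondaries.
const before = fixtureOutcome({ replicated, awaited: ["syncSource", "tieBreaker"] });

// Proposed behavior: await replication on syncSource only.
const after = fixtureOutcome({ replicated, awaited: ["syncSource"] });

console.log(before, after);
```

Awaiting only "syncSource" makes the fixture's wait condition coincide with the majority-commit condition, which is the guarantee the tests had when "tieBreaker" was an arbiter.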

            Assignee:
            jesse@mongodb.com A. Jesse Jiryu Davis
            Reporter:
            jesse@mongodb.com A. Jesse Jiryu Davis
            Votes:
            0
            Watchers:
            2
