[SERVER-39910] Race in rollback_drop_database.js Created: 01/Mar/19  Updated: 29/Oct/23  Resolved: 08/Mar/19

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 4.1.8
Fix Version/s: 4.1.9

Type: Bug Priority: Minor - P4
Reporter: A. Jesse Jiryu Davis Assignee: A. Jesse Jiryu Davis
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Problem/Incident
is caused by SERVER-38865 Create rollback test fixture that is ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2019-03-11, Repl 2019-03-25
Participants:
Linked BF Score: 11

 Description   

The test executes a dropDatabase command, which is written to the oplog as a dropCollection entry followed by a dropDatabase entry. The test wants the dropCollection entry to be majority committed, so it can test what happens when only the dropDatabase entry is rolled back.

Expected test sequence:

  1. Start a 3-node replica set whose nodes are named "rollback", "tieBreaker", and "syncSource"
  2. Create a database with one collection, then drop the database
  3. The dropDatabase command first drops the collection, writes the dropCollection oplog entry, and waits (by default) for the dropCollection entry to be majority-committed.
  4. Both secondaries are lucky enough to receive the entry, and at least one of them acknowledges it.
  5. Now the dropCollection entry is majority-committed, so dropDatabase proceeds on the primary to drop the database.
  6. The primary hits the dropDatabaseHangBeforeLog failpoint - it doesn't write the dropDatabase oplog entry. It hangs holding the global write lock.
  7. The RollbackTest fixture in transitionToRollbackOperations sees that both secondaries are caught up, so it proceeds to the next state.

Failure sequence:

  1. Start a 3-node replica set whose nodes are named "rollback", "tieBreaker", and "syncSource"
  2. Create a database with one collection, then drop the database
  3. The dropDatabase command first drops the collection, writes the dropCollection oplog entry, and waits (by default) for the dropCollection entry to be majority-committed.
  4. Only one of the secondaries is lucky enough to receive the entry.
  5. The primary and that one secondary form a majority, so the dropCollection entry is majority-committed and dropDatabase proceeds on the primary to drop the database.
  6. The primary hits the dropDatabaseHangBeforeLog failpoint - it doesn't write the dropDatabase oplog entry. It hangs holding the global write lock, which blocks the other secondary from receiving the dropCollection entry.
  7. The RollbackTest fixture in transitionToRollbackOperations sees that one of the secondaries is not caught up, so it waits until the test times out.

This race was introduced in SERVER-38865, when the tieBreaker node was changed from an arbiter to a secondary. Before that change, "syncSource" was the only secondary, so a majority-committed dropCollection entry had necessarily been replicated by every secondary.
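The difference between the two configurations can be sketched with a small model. This is an illustration only, not server code: the object shapes and the function names `isMajorityCommitted` and `allSecondariesHaveEntry` are hypothetical, and the model simply counts which data-bearing nodes hold the dropCollection entry.

```javascript
// Hypothetical model of the two replica set layouts (not the server's implementation).
// An entry counts as majority-committed once a majority of voting members,
// counted among the data-bearing nodes, hold it.
function majority(votingCount) {
  return Math.floor(votingCount / 2) + 1;
}

function isMajorityCommitted(set, haveEntry) {
  const acks = set.dataBearing.filter((n) => haveEntry.has(n)).length;
  return acks >= majority(set.votingCount);
}

function allSecondariesHaveEntry(set, haveEntry) {
  return set.secondaries.every((n) => haveEntry.has(n));
}

// Before SERVER-38865: tieBreaker is an arbiter (votes, holds no data).
const withArbiter = {
  votingCount: 3,
  dataBearing: ["rollback", "syncSource"],
  secondaries: ["syncSource"],
};

// After SERVER-38865: tieBreaker is a data-bearing secondary.
const withSecondary = {
  votingCount: 3,
  dataBearing: ["rollback", "syncSource", "tieBreaker"],
  secondaries: ["syncSource", "tieBreaker"],
};

// Failure sequence: only the primary ("rollback") and "syncSource" have the entry.
const haveEntry = new Set(["rollback", "syncSource"]);

console.log(isMajorityCommitted(withArbiter, haveEntry));       // true
console.log(allSecondariesHaveEntry(withArbiter, haveEntry));   // true
console.log(isMajorityCommitted(withSecondary, haveEntry));     // true
console.log(allSecondariesHaveEntry(withSecondary, haveEntry)); // false
```

In the arbiter layout the only way to reach two data-bearing acknowledgements is for "syncSource" to have the entry, so the two conditions coincide. In the new layout they diverge, which is the window the failure sequence falls into.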

Tess hypothesizes that other rollback tests also rely on the old assumption that "majority committed" means "all secondaries have replicated". I'll restore that guarantee by stopping replication on the new secondary, "tieBreaker", at the beginning of the RollbackTest, and in transitionToRollbackOperations() I'll await replication only on "syncSource".
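The effect of the proposed change on the fixture's wait condition can be modelled as below. This is a sketch under stated assumptions, not the RollbackTest code: `canTransition`, the integer optimes, and the `applied` map are all hypothetical stand-ins for the fixture's real replication checks.

```javascript
// Hypothetical model of the wait in transitionToRollbackOperations().
// Optimes are simplified to integers; the real fixture compares oplog positions.
function canTransition(nodesToAwait, primaryOptime, applied) {
  return nodesToAwait.every((node) => applied[node] >= primaryOptime);
}

const primaryOptime = 5; // position of the dropCollection entry on the primary

// Replication is stopped on tieBreaker, so it lags behind the primary.
const applied = { syncSource: 5, tieBreaker: 4 };

// Old behaviour: await every secondary, which never succeeds here.
const oldCondition = canTransition(["syncSource", "tieBreaker"], primaryOptime, applied);

// Proposed behaviour: await only syncSource.
const newCondition = canTransition(["syncSource"], primaryOptime, applied);

console.log(oldCondition); // false
console.log(newCondition); // true
```

With "tieBreaker" deliberately held back, majority commit once again depends on "syncSource" alone, and the fixture no longer waits on a node that cannot catch up.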



 Comments   
Comment by Githook User [ 08/Mar/19 ]

Author: A. Jesse Jiryu Davis (ajdavis) <jesse@mongodb.com>

Message: SERVER-39910 Fix race in rollback_drop_database.js
Branch: master
https://github.com/mongodb/mongo/commit/9a7cfb73da3a86d1c20f674140f1f908e2bae0c8
