[SERVER-32989] `repairDatabase` can race with `dropDatabase`. Created: 30/Jan/18  Updated: 29/Oct/23  Resolved: 13/Feb/18

Status: Closed
Project: Core Server
Component/s: Replication, Storage
Affects Version/s: 3.6.0, 3.7.1
Fix Version/s: 3.6.5, 3.7.3

Type: Bug Priority: Major - P3
Reporter: Daniel Gottlieb (Inactive) Assignee: Judah Schvimer
Resolution: Fixed Votes: 0
Labels: rollback-non-functional
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to SERVER-33566 `restartCatalog` can race with `dropD... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.6
Sprint: Repl 2018-02-12, Repl 2018-02-26
Participants:
Linked BF Score: 0

 Description   

MongoDB 3.6 introduced two-phase drops. For dropDatabase, this means that all collection drops must be majority confirmed before the database drop itself is processed. When dropDatabase starts processing, the in-memory Database catalog object is marked as "drop pending". This flag guards against things like collections being created, which would normally be prevented by the database lock excluding concurrent access. The problem is that dropDatabase releases all locks while it waits for the collection drops to become majority confirmed, because locks should not be held across a blocking operation.

The interleaving of `repairDatabase` and `dropDatabase`, specifically, is amusing. I believe `repairDatabase` deletes and recreates the in-memory `Database` object, which resets the `_dropPending` member field to false. A `createCollection` can then succeed before the `dropDatabase` has finished being processed.
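
To make the interleaving concrete, here is a toy, single-threaded Python model of it. It is only a sketch: the `Catalog` and `Database` classes and their methods are hypothetical stand-ins, not the server's C++ classes, and the assumption that repair rebuilds the in-memory object is the same one stated above.

```python
class Database:
    """Toy stand-in for the in-memory Database catalog object."""

    def __init__(self, name):
        self.name = name
        self.drop_pending = False  # models the _dropPending member
        self.collections = set()


class Catalog:
    """Hypothetical holder of in-memory Database objects."""

    def __init__(self):
        self.dbs = {}

    def get_or_create(self, name):
        if name not in self.dbs:
            self.dbs[name] = Database(name)
        return self.dbs[name]

    def create_collection(self, db_name, coll):
        db = self.get_or_create(db_name)
        if db.drop_pending:
            raise RuntimeError("database is drop pending")
        db.collections.add(coll)

    def repair(self, db_name):
        # Assumed behaviour: discard the object and build a fresh one,
        # so the drop-pending flag silently goes back to False.
        old = self.dbs.pop(db_name)
        fresh = Database(db_name)
        fresh.collections = set(old.collections)
        self.dbs[db_name] = fresh


catalog = Catalog()

# Step 1: dropDatabase marks the database, then releases its locks to wait.
catalog.get_or_create("test").drop_pending = True
try:
    catalog.create_collection("test", "c")
    blocked_before_repair = False
except RuntimeError:
    blocked_before_repair = True

# Step 2: repairDatabase runs during the wait and recreates the object.
catalog.repair("test")

# Step 3: createCollection now succeeds although the drop is still in flight.
catalog.create_collection("test", "c")
race_observed = "c" in catalog.dbs["test"].collections
```

In this model the create is refused before the repair and accepted after it, which is the race described above.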

Some considerations for addressing this bug:

I believe this means that anything acquiring a Database object should check Database::isDropPending. Additionally, it may make sense for AutoGetDb to check the drop pending flag as a way to centralize that logic. Note that repairDatabase grabs a global lock, circumventing any protection the AutoGetDb would offer.



 Comments   
Comment by Githook User [ 19/Apr/18 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-32989 prevents `repairDatabase` race with `dropDatabase`

(cherry picked from commit 1e2160c8f0480a1d8be1d671b5b7e22e52986a2c)
Branch: v3.6
https://github.com/mongodb/mongo/commit/0982a72621b81b063a73e382ad24b707de9a3755

Comment by Githook User [ 13/Feb/18 ]

Author: Judah Schvimer (judahschvimer) <judah@mongodb.com>

Message: SERVER-32989 prevents `repairDatabase` race with `dropDatabase`
Branch: master
https://github.com/mongodb/mongo/commit/1e2160c8f0480a1d8be1d671b5b7e22e52986a2c

Comment by Judah Schvimer [ 01/Feb/18 ]

As discussed with daniel.gottlieb, while other operations probably should not succeed on a database that is drop pending, a more general solution for preventing operations on databases that are drop pending is out of scope of this ticket.

Comment by Judah Schvimer [ 01/Feb/18 ]

To start, I will fix this race by checking whether the database is drop pending after repairDatabase takes the GlobalWrite lock, which it holds for the entire repair. (I will also enforce that repair takes at least a database X lock.) We only ever set a database to drop pending while holding an X lock on that database, so once repairDatabase holds the GlobalWrite lock, the database cannot be set to drop pending until the repair completes. We also call repair at the end of initial sync, so the check must be made in both places or in the common code.
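
A minimal Python sketch of that check, for illustration only. A single `threading.Lock` stands in for both the GlobalWrite lock and the database X lock, and every name here (`repair_database`, `mark_drop_pending`, `DatabaseDropPendingError`) is hypothetical rather than taken from the server code.

```python
import threading


class DatabaseDropPendingError(Exception):
    """Hypothetical error type for this sketch."""


class Database:
    """Toy stand-in for the in-memory Database catalog object."""

    def __init__(self, name):
        self.name = name
        self.drop_pending = False


# One lock models both the GlobalWrite lock and the database X lock.
global_lock = threading.Lock()


def mark_drop_pending(dbs, name):
    # The flag is only ever set while holding the (modelled) X lock.
    with global_lock:
        dbs[name].drop_pending = True


def repair_database(dbs, name):
    # Take the lock first, then check the flag. Because the flag can only
    # change under the same lock, it cannot flip while the repair runs.
    with global_lock:
        if dbs[name].drop_pending:
            raise DatabaseDropPendingError(name)
        dbs[name] = Database(name)  # recreate the in-memory object


dbs = {"test": Database("test")}
repair_database(dbs, "test")    # no drop in progress: repair proceeds
mark_drop_pending(dbs, "test")  # dropDatabase begins
try:
    repair_database(dbs, "test")
    repair_refused = False
except DatabaseDropPendingError:
    repair_refused = True
```

Putting the check inside the function that performs the repair, rather than in each caller, is one way to cover both the command path and the initial sync path from common code.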
