[SERVER-33632] Make UUID catalog reload atomic Created: 02/Mar/18  Updated: 29/Oct/23  Resolved: 17/Apr/18

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: 3.7.6

Type: Task Priority: Major - P3
Reporter: Gregory McKeon (Inactive) Assignee: Geert Bosch
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-32367 AutoGetCollectionOrView and its relat... Closed
related to SERVER-69389 Command checkAuthorization may throw ... Closed
is related to SERVER-29839 Protect against UUID->NSS mapping cha... Closed
is related to SERVER-34615 find by UUID can return NamespaceNotF... Closed
is related to SERVER-33474 Blacklist restart_catalog.js from sui... Closed
Backwards Compatibility: Fully Compatible
Sprint: Storage NYC 2018-03-26, Storage NYC 2018-04-09, Storage NYC 2018-04-23
Participants:
Linked BF Score: 63

 Description   

The fuzzer uncovered an issue where the database was locked in global exclusive mode while the catalog was being modified. A concurrent thread performed UUID to namespace resolution, trying to query a collection, and got a NamespaceNotFound error because the UUID catalog was transiently empty. This situation should not have been observable because of the global exclusive lock.



 Comments   
Comment by Githook User [ 17/Apr/18 ]

Author:

{'name': 'Geert Bosch', 'email': 'geert@mongodb.com', 'username': 'GeertBosch'}

Message: SERVER-33632 UUIDCatalog queries fall back to pre-close state

While the UUIDCatalog is closed for reloading during rollback and catalog reload,
queries by UUID may need to resolve the NSS for authorization purposes before taking
locks. Use a shadow catalog to allow such name resolution.
Branch: master
https://github.com/mongodb/mongo/commit/c45a62c24761110505161ce4ecc263f4bfce9c59

Comment by Spencer Brody (Inactive) [ 06/Mar/18 ]

Per offline discussion, UUID->NS mapping resolution must be possible outside of DB locks for access control checks.

Solution is to make all mutation to UUID catalog atomic by synchronizing within the UUIDCatalog object.

Comment by Kaloian Manassiev [ 06/Mar/18 ]

From the description of this ticket and from this comment in BF-8050, I interpreted it as if the Global lock was not acquired by the applier thread during the "Closing database catalog", but I see now that the times very close so this might be just a logging issue.

louis.williams, thanks for the explanation. I think I got a bit overzealous looking for a lock manager problem due to the above message, sorry about that.

spencer, the reason for doing the resolution outside of lock and then re-resolving was two-fold - one, because UUIDs can theoretically cross databases (if we start using them for cross-DB renames) and two, because the replication applier already takes advantage of this fact here. In this case we don't have the owning database.

Dave already started combining UUIDs together with the database name as part of this commit. So if we require the catalog to be stable we could start locking the DB before performing checks, but that would require figuring out what DB to lock in initial sync.

Comment by Spencer Brody (Inactive) [ 06/Mar/18 ]

Yes, we meant that the GlobalLock is held in mode X.

I don't think the problem is around interruption, I think the problem is that AutoGetCollection resolves the UUID->ns mapping before taking any locks. According to this comment this appears to be intentional, but I don't think it is safe/correct.

I think we believed this was correct because we re-resolve the ns->UUID after taking the locks. But if the UUID wasn't found in the cache at all, then resolveNamespaceStringOrUUID throws NamespaceNotFound before we even get to do the double-check

Comment by Louis Williams [ 05/Mar/18 ]

I don't think this scenario is possible.
The GlobalLock constructor can only be interrupted in lockGlobalComplete, so if GlobalLock::waitForLockUntil() does not throw, a lock has to have either been acquired or timed out.
If the lock acquisition is interrupted during lockComplete, it will remove the LockRequest, which means that isLocked() cannot be true.
Because lockBegin() is not interruptible (yet), when _result == LOCK_OK, when set by lockBegin, there are no more points where the constructor could be interrupted again.

Comment by Kaloian Manassiev [ 03/Mar/18 ]

The description of the ticket says "database is locked in Global exclusive mode" - did you mean to say the database was locked in X mode or that the Global X lock was being held? From reading BF-8050 it sounds like the latter, but wanted to confirm.

Regardless, the Global X lock should conflict with whatever DB mode lock is being held.

I looked at the DBLock code and I see that if the global lock acquisition is interrupted for some reason, the GlobalLock object will get wrongly constructed in isLocked() == true state, because waitForLockUntil does not throw, but we already set GlobalLock::_result to LOCK_OK. Later on, the DBLock will just proceed with acquiring the DB lock, but in reality would not have acquired the Global part of the hierarchy.

geert.bosch, am I understanding this correctly? There didn't use to be a way for the GlobalLock object to be constructed unlocked until we introduced interruptability, but now it can happen.

Comment by Judah Schvimer [ 02/Mar/18 ]

After this is fixed, we should un-blacklist "restartCatalog" from the fuzzer to see what other bugs it will catch. See SERVER-33474.

Comment by Spencer Brody (Inactive) [ 02/Mar/18 ]

We believe the problem is that all UUID->Namespace resolution needs to happen under a global lock of some form.

Generated at Thu Feb 08 04:34:04 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.