[SERVER-70832] Don't take DB lock in MODE_X when installing new sharding database metadata Created: 25/Oct/22  Updated: 08/Nov/23  Resolved: 30/Jan/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 6.1.1, 6.2.0-rc6
Fix Version/s: 6.3.0-rc0

Type: Task Priority: Major - P3
Reporter: Jordi Serra Torrens Assignee: Antonio Fuschetto
Resolution: Fixed Votes: 0
Labels: shardingemea-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Problem/Incident
is caused by SERVER-64209 Push the DatabaseShardingState state ... Closed
Related
related to SERVER-73345 Cancel ongoing refresh on DB metadata... Closed
related to SERVER-40258 Relax locking requirements for shardi... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v6.0, v5.0, v4.4
Sprint: Sharding EMEA 2022-11-28, Sharding EMEA 2022-12-12, Sharding EMEA 2022-12-26, Sharding EMEA 2023-01-09, Sharding EMEA 2023-01-23, Sharding EMEA 2023-02-06
Participants:
Case:
Story Points: 3.33

 Description   

Database metadata refresh should not take the database lock in MODE_X. Instead, we should be able to do it under MODE_IX. This prevents database metadata refreshes from blocking behind ongoing transactions. This is similar to what SERVER-40258 did for collection metadata refreshes.
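To illustrate why the lock mode matters here, the following is a minimal sketch of the standard multi-granularity lock compatibility matrix (IS, IX, S, X). The names `LockMode`, `kCompatible`, and `isCompatible` are hypothetical and are not the server's actual lock manager API. It assumes that an ongoing transaction holds the database lock in an intent mode such as IX.

```cpp
#include <array>
#include <cstddef>

// Intent and full lock modes of a hierarchical (multi-granularity) lock manager.
// Hypothetical names for illustration only.
enum class LockMode : std::size_t { IS = 0, IX = 1, S = 2, X = 3 };

// kCompatible[held][requested] is true when `requested` can be granted while
// another operation holds `held` on the same resource.
constexpr std::array<std::array<bool, 4>, 4> kCompatible = {{
    // requested: IS    IX     S      X
    {{true, true, true, false}},     // held: IS
    {{true, true, false, false}},    // held: IX
    {{true, false, true, false}},    // held: S
    {{false, false, false, false}},  // held: X
}};

constexpr bool isCompatible(LockMode held, LockMode requested) {
    return kCompatible[static_cast<std::size_t>(held)]
                      [static_cast<std::size_t>(requested)];
}
```

Under this matrix, a refresh requesting X must wait for any holder of IS or IX, while a refresh requesting IX can be granted alongside them.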



 Comments   
Comment by Githook User [ 30/Jan/23 ]

Author:

{'name': 'Antonio Fuschetto', 'email': 'antonio.fuschetto@mongodb.com', 'username': 'afuschetto'}

Message: SERVER-70832 Don't take DB lock in MODE_X when installing new sharding database metadata
Branch: master
https://github.com/mongodb/mongo/commit/2772d1b849fc297e82aa166f45d03d93b77906ee

Comment by Antonio Fuschetto [ 02/Dec/22 ]

This change is not as trivial as initially assumed and would introduce at least one race condition. When the database lock taken while setting the metadata is relaxed from X mode to IX mode, the AutoGetDb constructor is exposed to a race with the thread that refreshes the database metadata. The sequence of events that triggers the problem is:

  1. Thread A needs to access a database, so the AutoGetDb constructor is invoked (let's say it locks the database in IS mode).
  2. The name is saved and the lock is acquired, but the database pointer is null because the DatabaseHolder doesn't own any entry for the given name.
  3. Thread B, which refreshes the database metadata, retrieves the metadata from the config server and acquires the database lock in IX mode (previously, in X mode, it would have waited).
  4. Thread B adds an entry to the DatabaseHolder and sets the database metadata.
  5. On thread A, the AutoGetDb constructor checks the database version and succeeds.
  6. Thread A now has an AutoGetDb object that has been validated (version checked) but whose internal database pointer is null.
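The interleaving above can be replayed deterministically on a single thread. This is only a sketch: `Database`, `DatabaseHolderSketch`, `AutoGetDbSketch`, and `replayRace` are hypothetical, heavily simplified stand-ins and not the real server classes, and the lock acquisitions are represented only by comments.

```cpp
#include <map>
#include <string>

// Hypothetical stand-ins for illustration; not the real MongoDB classes.
struct Database {
    std::string name;
};

struct DatabaseHolderSketch {
    std::map<std::string, Database> dbs;
    std::map<std::string, int> versions;  // installed database versions

    Database* find(const std::string& name) {
        auto it = dbs.find(name);
        return it == dbs.end() ? nullptr : &it->second;
    }
};

struct AutoGetDbSketch {
    std::string name;
    Database* db = nullptr;
    bool versionChecked = false;
};

// Replays steps 1-6 in a fixed order to show the resulting state.
AutoGetDbSketch replayRace(DatabaseHolderSketch& holder,
                           const std::string& name,
                           int expectedVersion) {
    AutoGetDbSketch autoDb;

    // Steps 1-2 (thread A): save the name, take the IS lock, look up the
    // database. No entry exists yet, so the pointer is null.
    autoDb.name = name;
    autoDb.db = holder.find(name);

    // Steps 3-4 (thread B): IX is compatible with IS, so the refresh does not
    // wait. It adds the entry and installs the metadata.
    holder.dbs.emplace(name, Database{name});
    holder.versions[name] = expectedVersion;

    // Step 5 (thread A): the version check now succeeds.
    auto it = holder.versions.find(name);
    autoDb.versionChecked =
        it != holder.versions.end() && it->second == expectedVersion;

    // Step 6: validated object, but the cached pointer was never refreshed.
    return autoDb;
}
```

After the replay, the holder owns an entry for the database, yet the returned object still carries the null pointer captured in step 2.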

Before SERVER-64209, the database metadata was managed by the DatabaseShardingState class. This machinery adds an additional locking resource to the hierarchy, the DSS lock, which was used to synchronize access to its information, such as the database metadata. Presumably the DSS lock was created to provide finer-grained synchronization and avoid locking the database resource directly.

Now the database metadata is managed by the DatabaseHolder class, since this information is associated with the database directly (and this makes sense from the point of view that all collections are sharded). However, we no longer have the fine-grained mechanism mentioned above, unless we use the one currently exposed by the DSS or implement something similar at the database level (to be used only for metadata access).
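As a rough illustration of what a lock "used only for metadata access" could look like, here is a minimal sketch that guards the metadata with its own reader-writer mutex, independent of the database lock. `DbMetadataSketch` and `DatabaseVersionSketch` are hypothetical names, `std::shared_mutex` is an assumption chosen for brevity, and this is not the DSS lock nor the solution eventually adopted. On its own it also does not address the null-pointer race described above.

```cpp
#include <mutex>
#include <optional>
#include <shared_mutex>

// Hypothetical placeholder for the database version carried by the metadata.
struct DatabaseVersionSketch {
    int timestamp = 0;
};

// Hypothetical holder that synchronizes metadata access with a dedicated
// mutex instead of relying on the database lock mode.
class DbMetadataSketch {
public:
    // Writers (for example a metadata refresh) take the mutex exclusively.
    void install(DatabaseVersionSketch version) {
        std::unique_lock<std::shared_mutex> lk(_mutex);
        _version = version;
    }

    // Readers (for example a version check) share the mutex.
    std::optional<DatabaseVersionSketch> get() const {
        std::shared_lock<std::shared_mutex> lk(_mutex);
        return _version;
    }

private:
    mutable std::shared_mutex _mutex;
    std::optional<DatabaseVersionSketch> _version;
};
```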

We will discuss this problem internally in order to find a solution fully compatible with our future goals.

Generated at Thu Feb 08 06:17:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.