[SERVER-66972] Database critical section does not serialize with ongoing refreshes Created: 02/Jun/22  Updated: 29/Oct/23  Resolved: 29/Aug/22

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 5.0.9, 6.0.0-rc8
Fix Version/s: 6.1.1, 5.0.14, 6.0.3, 6.2.0-rc0

Type: Bug Priority: Major - P3
Reporter: Tommaso Tocci Assignee: Antonio Fuschetto
Resolution: Fixed Votes: 0
Labels: PM-2144-Milestone-0
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
depends on SERVER-69108 SCCL can immediately return config an... Closed
is depended on by SERVER-69444 Make the joining of concurrent critic... Closed
Problem/Incident
causes SERVER-69930 Unexpected error message in the logs ... Closed
Related
related to SERVER-68661 Deadlock with transactions after step... Closed
is related to SERVER-70793 Make database metadata refresh first ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v6.1, v6.0, v5.0
Sprint: Sharding EMEA 2022-07-11, Sharding EMEA 2022-07-25, Sharding EMEA 2022-08-08, Sharding EMEA 2022-08-22, Sharding EMEA 2022-09-05
Participants:
Linked BF Score: 20

 Description   

Consider the following scenario:

  1. A StaleDatabaseVersion error is thrown due to a db version mismatch (no critical section in place)
  2. The exception bubbles up and spawns a database refresh
  3. The database critical section is acquired
  4. The database refresh completes and installs the new db version in the DSS
  5. The critical section is released

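To guarantee the correctness of the critical section, we must ensure that every refresh started before the critical section is acquired (3.) is invalidated, and that no new refresh can start before the critical section is released (5.). In the scenario above neither holds: the refresh spawned in (2.) installs a new db version (4.) while the critical section is still held.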
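The scenario and the required invariant can be illustrated with a small, single-threaded model. This is only a sketch under stated assumptions, not the server's implementation: the class DatabaseShardingState, its generation counter, and the check_token switch are hypothetical names invented for this illustration. A refresh records the generation it started under, acquiring the critical section bumps the generation, and a refresh whose token is stale (or that completes while the critical section is held) is discarded.

```python
class DatabaseShardingState:
    """Toy stand-in for the per-database sharding state (DSS). Hypothetical."""

    def __init__(self):
        self.db_version = None            # last installed database version
        self.in_critical_section = False
        self.generation = 0               # bumped on every critical section acquisition

    def start_refresh(self):
        # A refresh remembers the generation it started under.
        return self.generation

    def enter_critical_section(self):
        self.in_critical_section = True
        self.generation += 1              # invalidates every refresh already in flight

    def exit_critical_section(self):
        self.in_critical_section = False

    def complete_refresh(self, token, new_version, check_token):
        """Try to install new_version; return True if it was installed."""
        if check_token and (token != self.generation or self.in_critical_section):
            return False                  # stale refresh: discard its result
        self.db_version = new_version
        return True


def run_scenario(check_token):
    dss = DatabaseShardingState()
    token = dss.start_refresh()                               # steps 1-2
    dss.enter_critical_section()                              # step 3
    installed = dss.complete_refresh(token, 7, check_token)   # step 4
    dss.exit_critical_section()                               # step 5
    return installed, dss.db_version


buggy = run_scenario(check_token=False)   # (True, 7): installed inside the critical section
fixed = run_scenario(check_token=True)    # (False, None): stale refresh was invalidated
```

With check_token=False the model reproduces the reported interleaving; with check_token=True the refresh started before (3.) is dropped and the version stays unset until a later refresh runs.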



 Comments   
Comment by Githook User [ 05/Oct/22 ]

Author:

{'name': 'Antonio Fuschetto', 'email': 'antonio.fuschetto@mongodb.com', 'username': 'afuschetto'}

Message: SERVER-66972 Database critical section does not serialize with ongoing refreshes
Branch: v6.1
https://github.com/mongodb/mongo/commit/59053967edeea2ace11b0eb9fbe4542dc56a0cab

Comment by Githook User [ 03/Oct/22 ]

Author:

{'name': 'Antonio Fuschetto', 'email': 'antonio.fuschetto@mongodb.com', 'username': 'afuschetto'}

Message: SERVER-66972 Database critical section does not serialize with ongoing refreshes
Branch: v5.0
https://github.com/mongodb/mongo/commit/6e1b7b9990646d89113e613667e6ce5303a2c706

Comment by Githook User [ 03/Oct/22 ]

Author:

{'name': 'Antonio Fuschetto', 'email': 'antonio.fuschetto@mongodb.com', 'username': 'afuschetto'}

Message: SERVER-66972 Database critical section does not serialize with ongoing refreshes
Branch: v6.0
https://github.com/mongodb/mongo/commit/546af4aa74cd24d59272b41878f8af14519ad433

Comment by Githook User [ 29/Aug/22 ]

Author:

{'name': 'Antonio Fuschetto', 'email': 'antonio.fuschetto@mongodb.com', 'username': 'afuschetto'}

Message: SERVER-66972 Database critical section does not serialize with ongoing refreshes
Branch: master
https://github.com/mongodb/mongo/commit/0ff2527bab030462910a03fdb9a9435d29db1fe8

Comment by Antonio Fuschetto [ 25/Jul/22 ]

The solution that I am implementing is designed to be backported, since this race condition affects old branches as well. With the Sharding-first Catalog project, the way the database version is managed at shard level will change, which will probably allow us to get rid of the current logic that refreshes the DB version when it does not match the one provided by the router. This will presumably be included in version 7.0.

The fix will provide the following properties to the current onDbVersionMismatch implementation:

  • No threads are in the critical section, can enter it, or can X-lock the database when the current version is read
  • No other threads are refreshing the database version (only one update at a time)
  • When the version is cleared (dropped database or moved primary shard), any pending refreshes on that database can be aborted
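
The second and third properties can be pictured as a "single-flight" coordinator: concurrent callers join one in-flight refresh, and clearing the version aborts it. The sketch below is an assumption-laden illustration and not the actual onDbVersionMismatch code; RefreshCoordinator and its methods are hypothetical names, and Python's concurrent.futures.Future is used only as a convenient shared, cancellable result holder.

```python
import threading
from concurrent.futures import Future


class RefreshCoordinator:
    """Hypothetical single-flight coordinator for database version refreshes."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = None  # Future shared by every caller of the in-flight refresh

    def request_refresh(self):
        """Return (future, is_leader); only the leader performs the refresh."""
        with self._lock:
            if self._pending is None:
                self._pending = Future()
                return self._pending, True
            return self._pending, False

    def finish_refresh(self, future, version):
        """Publish the refreshed version unless the refresh was aborted."""
        with self._lock:
            if future.cancelled():
                return False
            future.set_result(version)
            if self._pending is future:
                self._pending = None
            return True

    def clear_version(self):
        """Model a dropped database or moved primary: abort the pending refresh."""
        with self._lock:
            if self._pending is not None:
                self._pending.cancel()
                self._pending = None


coord = RefreshCoordinator()
first, first_is_leader = coord.request_refresh()
second, second_is_leader = coord.request_refresh()   # joins the in-flight refresh
coord.finish_refresh(first, 5)                        # both callers observe version 5

third, _ = coord.request_refresh()
coord.clear_version()                                 # aborts the third refresh
late_install = coord.finish_refresh(third, 6)         # rejected: refresh was aborted
```

A caller waiting on an aborted future sees concurrent.futures.CancelledError from result(), which it could treat as a signal to re-check the database version from scratch.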

Comment by Antonio Fuschetto [ 14/Jun/22 ]

Whatever strategy is chosen to resolve this problem, the procedure that checks and possibly recovers the database version must satisfy the following conditions:

  • No critical section is taken (wait for it to be released)
  • No refresh is in progress (wait for it to complete)
  • If the database information is cleared, abort any pending recovery requests
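
The three conditions above map naturally onto a condition-variable wait loop. The following is a minimal sketch, assuming a hypothetical DbVersionRecovery class with a clear epoch counter; none of these names come from the server code. It also assumes that entering the critical section waits for an in-flight refresh, which is a simplification chosen for the sketch rather than something stated in this ticket.

```python
import threading


class RecoveryAborted(Exception):
    """Raised when the database information is cleared during a recovery."""


class DbVersionRecovery:
    """Hypothetical model of the check-and-recover procedure."""

    def __init__(self):
        self._cv = threading.Condition()
        self._critical_section = False
        self._refreshing = False
        self._clear_epoch = 0          # bumped every time the database info is cleared
        self.db_version = None

    def enter_critical_section(self):
        with self._cv:
            # Assumption of this sketch: entry waits for an in-flight refresh.
            self._cv.wait_for(lambda: not (self._critical_section or self._refreshing))
            self._critical_section = True

    def exit_critical_section(self):
        with self._cv:
            self._critical_section = False
            self._cv.notify_all()

    def clear(self):
        with self._cv:
            self.db_version = None
            self._clear_epoch += 1     # aborts every pending recovery
            self._cv.notify_all()

    def recover(self, fetch_version):
        with self._cv:
            epoch = self._clear_epoch
            # Wait until no critical section is taken and no refresh is running,
            # or until the database information is cleared.
            self._cv.wait_for(lambda: epoch != self._clear_epoch
                              or not (self._critical_section or self._refreshing))
            if epoch != self._clear_epoch:
                raise RecoveryAborted()
            self._refreshing = True
        error = None
        version = None
        try:
            version = fetch_version()  # slow lookup, done without holding the lock
        except Exception as exc:
            error = exc
        with self._cv:
            self._refreshing = False
            self._cv.notify_all()
            if error is not None:
                raise error
            if epoch != self._clear_epoch:
                raise RecoveryAborted()
            self.db_version = version
            return version


state = DbVersionRecovery()
state.enter_critical_section()
results = []
worker = threading.Thread(
    target=lambda: results.append(state.recover(lambda: 3)), daemon=True)
worker.start()
worker.join(timeout=0.2)
blocked_while_held = worker.is_alive()  # recovery waits for the critical section
state.exit_critical_section()
worker.join(timeout=5)
```

The version is installed in the same lock acquisition that clears the refreshing flag, so in this model a critical section cannot slip in between the end of the fetch and the install.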