[SERVER-74841] collMod should not call catalogClient::getCollection during secondary replication Created: 14/Mar/23  Updated: 29/Oct/23  Resolved: 10/Apr/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.0.0-rc0

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: Allison Easton
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Problem/Incident
is caused by SERVER-68769 If a shard key index cannot be droppe... Closed
is caused by SERVER-69429 Missing checks in collMod for shard k... Closed
Related
is related to SERVER-68769 If a shard key index cannot be droppe... Closed
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Sharding EMEA 2023-04-03, Sharding EMEA 2023-04-17
Participants:
Linked BF Score: 135

 Description   

This is problematic in a config catalog environment because CatalogClient::getCollection performs a read against the config server with read concern majority and afterClusterTime. Because the node is itself the config server, the read can get stuck waiting for that clusterTime: the node's opTime cannot advance while it is still processing the collMod op, which creates a cyclic dependency.

It is also not ideal that this call is made while the collection lock is held in MODE_X.
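
The cyclic dependency described above can be illustrated with a small standalone model. This is only a sketch and not MongoDB code: NodeModel, canServeAfterClusterTime, applyCollMod, and the integer timestamps are hypothetical names invented for illustration, and the model assumes the afterClusterTime of the read is at least the opTime of the collMod entry being applied.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical model of a node that is also the config server.
// None of these names exist in the MongoDB code base.
struct NodeModel {
    std::uint64_t appliedOpTime;  // opTime of the last oplog entry fully applied
};

// A read with afterClusterTime can only be served once the node has
// applied everything up to that time.
bool canServeAfterClusterTime(const NodeModel& node, std::uint64_t afterClusterTime) {
    return node.appliedOpTime >= afterClusterTime;
}

enum class ApplyResult { kApplied, kWouldBlockForever };

// Models applying a collMod oplog entry at opTime `entryTime`.
// Assumption: if the apply path reads the config catalog, the read carries
// an afterClusterTime of at least `entryTime` and targets this same node.
ApplyResult applyCollMod(NodeModel& node, std::uint64_t entryTime, bool readsConfigDuringApply) {
    if (readsConfigDuringApply && !canServeAfterClusterTime(node, entryTime)) {
        // The read waits for appliedOpTime to advance, but appliedOpTime
        // only advances after this entry finishes: a cycle.
        return ApplyResult::kWouldBlockForever;
    }
    node.appliedOpTime = entryTime;  // entry applied, opTime advances
    return ApplyResult::kApplied;
}
```

In this model, applying the entry with readsConfigDuringApply set to true can never complete, while skipping the catalog read during application lets the opTime advance. This mirrors the direction of the ticket title, not the details of the actual fix.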



 Comments   
Comment by Githook User [ 10/Apr/23 ]

Author: Allison Easton <allison.easton@mongodb.com> (username: allisoneaston)

Message: SERVER-74841 collMod should not call catalogClient::getCollection during secondary replication
Branch: master
https://github.com/mongodb/mongo/commit/ba99b48f5bcf38d7e46f44bff9d69bfb694ba6a7

Comment by Jack Mulrow [ 22/Mar/23 ]

Sounds good to me. I just wanted to note the option, but yeah if this doesn't only run on config servers then changing to local isn't even possible and wouldn't address those other issues.

Comment by Randolph Tan [ 22/Mar/23 ]

Hm... I'm not sure changing to local is a good idea since this code can run on any shard. I also think that the checks should be performed only on the primary to minimize the chances of nodes observing different things and arriving at different conclusions.

Comment by Jack Mulrow [ 22/Mar/23 ]

Just noting that by default, getting the catalogClient() from the Grid now gets a client with a ShardRemote for the config server, even on the config server (when the catalog shard feature flag is enabled). If this code always runs on the config server, we did add a way to still get a ShardLocal catalog client via ShardingCatalogManager::localCatalogClient(), so if the only problem here is that we're doing a network request, we can switch to using the ShardLocal catalog client instead.

Comment by Randolph Tan [ 14/Mar/23 ]

The problematic calls:
https://github.com/mongodb/mongo/blob/b83e40d9508c662cbc75d363f83974b1efeb3f36/src/mongo/db/catalog/coll_mod.cpp#L362
https://github.com/mongodb/mongo/blob/b83e40d9508c662cbc75d363f83974b1efeb3f36/src/mongo/db/catalog/coll_mod.cpp#L396

Generated at Thu Feb 08 06:28:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.