[SERVER-66588] In catalog shard POC the config secondary should be prevented writing Created: 19/May/22  Updated: 27/Oct/23  Resolved: 02/Mar/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: [DO NOT USE] Backlog - Sharding NYC
Resolution: Gone away Votes: 0
Labels: sharding-nyc-subteam2, sharding-nyc-subteam2-catalog-poc
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Sharding NYC
Participants:
Story Points: 4

 Description   

The quick fix I did for SERVER-66224 fixed the tests, but in the general case it is wrong. Max said:
 
"Brett and I have been debugging an issue and learned that secondaries in the config server replica set attempt to extend the lease of the distributed lock. The writes the secondaries do are through ShardLocal so they end up failing with NotWritablePrimary - https://github.com/mongodb/mongo/blob/3805148358ae9b82e5f3b9307bd25fbf7a4dd4b5/src/mongo/db/s/dist_lock_catalog_replset.cpp#L206-L215
 
I haven't been following the ShardLocal / ShardRemote / ShardConfig but would like to make certain we forbid secondaries from contacting the config server primary and extending the lease. Only the primary of the replica set should ever be doing the distributed lock pinging so fixing that may be the ultimate preferred solution.
 
dist_lock_catalog_replset.cpp
Status DistLockCatalogImpl::ping(OperationContext* opCtx, StringData processID, Date_t ping) {
   auto request = write_ops::FindAndModifyCommandRequest(_lockPingNS);
   request.setQuery(BSON(LockpingsType::process() << processID));
   request.setUpdate(write_ops::UpdateModification::parseFromClassicUpdate(
       BSON("$set" << BSON(LockpingsType::ping(ping)))));

I'm saying we should ideally prevent secondaries from pinging the distributed lock. Secondaries aren't authoritative.
 
At minimum the writes the secondaries attempt to do today must still happen locally (and thus fail with NotWritablePrimary) if they are going to happen at all
 
Yes normally https://github.com/mongodb/mongo/blob/5dff90ff1e8a672a8716f0c9c936f8f50e56fd0b/src/mongo/db/repl/oplog.cpp#L367 would abort the local storage transaction on the secondary because config.lockpings is a replicated collection
 
oplog.cpp
       uasserted(ErrorCodes::NotWritablePrimary, ss);
The specific case I'm worried about is that a secondary node in the catalog shard wants to ping the distributed lock, so it contacts the current primary of the catalog shard. Instead it should be the exclusive responsibility of the primary of the shards to do that pinging.
 
Today on the CSRS, a secondary node wants to ping the distributed lock, so it tries to write to config.lockpings locally and gets a NotWritablePrimary error."
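 
The two behaviours discussed above can be illustrated with a minimal, self-contained sketch. It is not the server implementation: MemberState, NotWritablePrimaryError, LocalLockPings, PingResult and pingIfPrimary are hypothetical names invented for this example, and the real DistLockCatalogImpl::ping issues a findAndModify rather than touching an in-memory object. The sketch only shows (a) a local write to a replicated collection failing on a secondary, and (b) a ping path that checks the member state first so a secondary never attempts the write and never forwards it to the primary.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for illustration only; none of these types exist
// in the MongoDB code base.
enum class MemberState { Primary, Secondary };

struct NotWritablePrimaryError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Models local writes to a replicated collection such as config.lockpings.
class LocalLockPings {
public:
    explicit LocalLockPings(MemberState state) : _state(state) {}

    MemberState state() const { return _state; }
    int writes() const { return _writes; }

    // A local write is rejected on a secondary, mirroring the
    // NotWritablePrimary failure described in the ticket.
    void writePing(const std::string& processId, long long pingMillis) {
        if (_state != MemberState::Primary) {
            throw NotWritablePrimaryError(
                "not primary while writing to config.lockpings");
        }
        _lastProcessId = processId;
        _lastPingMillis = pingMillis;
        ++_writes;
    }

private:
    MemberState _state;
    std::string _lastProcessId;
    long long _lastPingMillis = 0;
    int _writes = 0;
};

enum class PingResult { Written, SkippedNotPrimary };

// Preferred behaviour: only the primary pings. A secondary skips the ping
// entirely instead of attempting a write or contacting the primary.
PingResult pingIfPrimary(LocalLockPings& store,
                         const std::string& processId,
                         long long pingMillis) {
    if (store.state() != MemberState::Primary) {
        return PingResult::SkippedNotPrimary;
    }
    store.writePing(processId, pingMillis);
    return PingResult::Written;
}
```

In a real node the state can change between the check and the write (for example on stepdown), so the local write rejecting non-primaries would remain the backstop. That matches the "at minimum" requirement above that any write a secondary attempts must happen locally and fail.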
 



 Comments   
Comment by Jack Mulrow [ 02/Mar/23 ]

Gone away with the changes from SERVER-65891.

Comment by Jack Mulrow [ 04/Nov/22 ]

andrew.shuvalov@mongodb.com, has this gone away since SERVER-65891 was finished?

Comment by Kaloian Manassiev [ 20/May/22 ]

Just FYI that we should be able at this point to throw out the DistLock if this is what is causing you problems. If rather than investigating how to prevent secondaries from doing writes you just threw out the DistLock for this POC, that might save you some time.

Generated at Thu Feb 08 06:05:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.