[SERVER-83564] Make sure the process field is indexed in config.locks Created: 24/Nov/23  Updated: 07/Feb/24  Resolved: 11/Jan/24

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 6.0.11
Fix Version/s: 5.0.25, 4.4.29, 6.0.14

Type: Bug Priority: Major - P3
Reporter: Dmitry Ryabtsev Assignee: Sulabh Mahajan
Resolution: Fixed Votes: 1
Labels: car-investigation, sharding-emea-pm-iteration-planning
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Related
Assigned Teams:
Catalog and Routing
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v5.0, v4.4
Sprint: CAR Team 2023-12-11, CAR Team 2023-12-25, CAR Team 2024-01-08, CAR Team 2024-01-22
Participants:
Case:

 Description   

It has been observed that when a large number of collections are created and dropped, certain update operations on the config.locks collection may never complete if the CSRS (config server replica set) does not have enough capacity to execute a COLLSCAN update in under 30 seconds:

// From v6.0.11
{"t":{"$date":"2023-11-23T08:49:07.822Z"},"s":"I","c":"WRITE","id":51803,"ctx":"conn43450","msg":"Slow query","attr":{"type":"update","ns":"config.locks","command":{"q":{"process":"atlas-xxxxxx-shard-0"},"u":{"$set":{"state":0}},"multi":true,"upsert":false},"planSummary":"COLLSCAN","numYields":8360,"queryHash":"F919F136","planCacheKey":"4256882D","ok":0,"errMsg":"operation exceeded time limit","errName":"MaxTimeMSExpired","errCode":50,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":8361}},"FeatureCompatibilityVersion":{"acquireCount":{"w":8361}},"ReplicationStateTransition":{"acquireCount":{"w":8361}},"Global":{"acquireCount":{"w":8361}},"Database":{"acquireCount":{"w":8361}},"Collection":{"acquireCount":{"w":8361}},"Mutex":{"acquireCount":{"r":1}}},"flowControl":{"acquireCount":8361,"timeAcquiringMicros":64969},"remote":"192.168.24.33:52636","durationMillis":30066}}
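
To check whether a deployment is hitting the same pattern, a log filter along these lines might help. This is only a sketch: SAMPLE_LINE is an abridged copy of the entry above (only the fields the check reads are kept), and is_timed_out_locks_scan is a hypothetical helper name, not part of any MongoDB tooling.

```python
import json

# Abridged copy of the slow-query entry above; fields not read by the check are omitted.
SAMPLE_LINE = json.dumps({
    "s": "I",
    "c": "WRITE",
    "id": 51803,
    "msg": "Slow query",
    "attr": {
        "type": "update",
        "ns": "config.locks",
        "command": {"q": {"process": "atlas-xxxxxx-shard-0"}, "multi": True},
        "planSummary": "COLLSCAN",
        "errName": "MaxTimeMSExpired",
        "durationMillis": 30066,
    },
})


def is_timed_out_locks_scan(line):
    """Return True if a structured log line is a config.locks update that
    ran as a COLLSCAN and failed with MaxTimeMSExpired."""
    try:
        entry = json.loads(line)
    except json.JSONDecodeError:
        return False  # not a structured (JSON) log line
    if not isinstance(entry, dict):
        return False
    attr = entry.get("attr")
    if not isinstance(attr, dict):
        return False
    return (
        entry.get("msg") == "Slow query"
        and attr.get("type") == "update"
        and attr.get("ns") == "config.locks"
        and attr.get("planSummary") == "COLLSCAN"
        and attr.get("errName") == "MaxTimeMSExpired"
    )
```

Running the helper over each line of a config server log would flag entries like the one above while ignoring unstructured lines and updates that used an index.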

Unless work is planned to ensure that the locks collection does not become inflated, the solution is to make sure there is an index on { "process": 1 }.
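
The difference the index makes can be illustrated with a toy in-memory model. This is not MongoDB code: the function names are hypothetical, the dict-of-lists stands in for an index on "process", and the initial state value 2 is just a nonzero placeholder. The sketch only counts how many documents each approach has to examine to set state to 0 for one process.

```python
from collections import defaultdict


def make_locks(n_namespaces, processes):
    """Build toy lock documents, one per namespace, spread across processes."""
    return [
        {"_id": f"db.coll{i}", "process": processes[i % len(processes)], "state": 2}
        for i in range(n_namespaces)
    ]


def unlock_by_scan(locks, process):
    """COLLSCAN-like update: every document is examined.
    Returns (examined, modified)."""
    examined = modified = 0
    for doc in locks:
        examined += 1
        if doc["process"] == process:
            doc["state"] = 0
            modified += 1
    return examined, modified


def build_process_index(locks):
    """Stand-in for an index on { process: 1 }: process -> matching documents."""
    index = defaultdict(list)
    for doc in locks:
        index[doc["process"]].append(doc)
    return index


def unlock_by_index(index, process):
    """Index-assisted update: only matching documents are examined.
    Returns (examined, modified)."""
    examined = modified = 0
    for doc in index.get(process, []):
        examined += 1
        doc["state"] = 0
        modified += 1
    return examined, modified
```

With 10,000 toy entries spread evenly over 100 processes, the scan examines all 10,000 documents to modify 100, while the indexed path examines only those 100. The scan cost grows with the size of the collection, which is why the update can exceed its time limit once config.locks is large.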



 Comments   
Comment by Githook User [ 07/Feb/24 ]

Author:

{'name': 'Sulabh Mahajan', 'email': 'sulabh.mahajan@mongodb.com', 'username': 'sulabhM'}

Message: SERVER-83564 Add an index on the process field for config.locks (#17757) (#18682)

GitOrigin-RevId: 9201e694024b4c461b990c777a7da25d1468a683
Branch: v4.4
https://github.com/mongodb/mongo/commit/033bb1289103c92ae73d4a0f49bb47beb6d79256

Comment by Githook User [ 30/Jan/24 ]

Author:

{'name': 'Sulabh Mahajan', 'email': 'sulabh.mahajan@mongodb.com', 'username': 'sulabhM'}

Message: SERVER-83564 Add an index on the process field for config.locks (#17757)

GitOrigin-RevId: 9da07653331f02bec726dd9343ef39c2c52acb19
Branch: v5.0
https://github.com/mongodb/mongo/commit/67ee3f620d75ab115c2a85879f60a75bb99f8c80

Comment by Githook User [ 11/Jan/24 ]

Author:

{'name': 'Sulabh Mahajan', 'email': 'sulabh.mahajan@mongodb.com', 'username': 'sulabhM'}

Message: SERVER-83564 Add an index on the process field for config.locks (#17757)

GitOrigin-RevId: 716488c24624812685c255533516d12f252725d2
Branch: v6.0
https://github.com/mongodb/mongo/commit/a0fdc804659c89af25bec31c9025d2c6f8cdd703

Comment by Sulabh Mahajan [ 21/Dec/23 ]

Thinking out loud here. Reading the code suggests that whenever an election happens, and a new primary gets selected, we call ReplicationCoordinatorExternalState::onTransitionToPrimary(). Its implementation seems to call ReplicationCoordinatorExternalStateImpl::_shardingOnTransitionToPrimaryHook(), which checks if this node is a part of the config server, and initializes the config database if needed. The relevant collections and indexes are created if required. The function for the index creation is ShardingCatalogManager::_initConfigIndexes().

I can change that function to ensure an index gets created on the process field. createIndexOnConfigCollection() logs a message if the index is being created on a non-empty collection, as usually a config database's collection and indexes would be created before writes to it. So, when upgrading, as a node in the config server replica set becomes the primary, there will be a one-time log message about the index creation. It seems okay to me.
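
The intended behavior can be sketched as a toy model. This is not the server code (which is C++): ToyCollection and on_transition_to_primary are hypothetical stand-ins that only mimic the "create if missing, log once if the collection already has data" behavior described above.

```python
import logging

logger = logging.getLogger("config_init_sketch")


class ToyCollection:
    """Minimal stand-in for a config collection with a set of indexes."""

    def __init__(self, name):
        self.name = name
        self.docs = []
        self.indexes = {("_id",)}  # every collection starts with the _id index

    def ensure_index(self, *fields):
        """Create the index if it is missing. Returns True only when created."""
        if fields in self.indexes:
            return False  # already present, nothing to do and nothing logged
        if self.docs:
            # Mirrors the one-time message expected when upgrading a cluster
            # whose collection already contains entries.
            logger.info(
                "Creating index %s on non-empty collection %s", fields, self.name
            )
        self.indexes.add(fields)
        return True


def on_transition_to_primary(is_config_server, locks):
    """Hypothetical step-up hook: only config servers initialize config indexes."""
    if not is_config_server:
        return False
    return locks.ensure_index("process")
```

In this model the first step-up after an upgrade creates the index (and logs once if entries already exist), and every later step-up is a no-op.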

Comment by Haley Connelly [ 05/Dec/23 ]

We agree that the patch should be to add an index on the "process" field.

As pierlauro.sciarelli@mongodb.com mentioned, the issue is only pre-6.1: we should pursue the simplest patch to mitigate issues with updating "config.locks". Garbage collection would involve significant changes, so we should instead add an index to prevent the need for full table scans.

Context
Entries in "config.locks" are uniquely identified by "_id: <namespace>". So, if sharded collection "test.c" is dropped and then recreated, the "dropCollection" entry in "config.locks" will be overwritten by "createCollection". However, if many unique namespaces are dropped/created over time, the collection can accumulate many "config.locks" entries.
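
A small model of this behavior, assuming nothing beyond "one entry per namespace, keyed by _id". The dict stands in for the collection, and record_ddl, count_entries, and the "op" field are hypothetical names used only for illustration.

```python
def record_ddl(locks, namespace, op):
    """Upsert the entry for a namespace; the namespace itself is the _id."""
    locks[namespace] = {"_id": namespace, "op": op}


def count_entries(operations):
    """Apply (namespace, op) pairs in order and return the number of entries left."""
    locks = {}
    for namespace, op in operations:
        record_ddl(locks, namespace, op)
    return len(locks)


# Dropping and recreating the same namespace reuses a single entry.
same_namespace = [("test.c", "dropCollection"), ("test.c", "createCollection")]

# Touching many distinct namespaces leaves one entry behind for each of them.
unique_namespaces = [(f"test.c{i}", "dropCollection") for i in range(1000)]
```

Here count_entries(same_namespace) is 1, while count_entries(unique_namespaces) is 1000: without garbage collection, the number of entries tracks the number of distinct namespaces ever touched rather than the number of live collections.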

If, in the future, customers start seeing excessive growth in config.locks, where the collection data grows to an unreasonable size, we should revisit the idea of creating a garbage collection script.

Comment by Pierlauro Sciarelli [ 24/Nov/23 ]

[note] This problem only affects pre-v6.1.0 versions since the distributed lock has gone away under SERVER-65891.

Generated at Thu Feb 08 06:52:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.