[SERVER-57407] Avoid taking ReplicationCoordinator mutex when deciding to gossip cluster times with external clients Created: 03/Jun/21 Updated: 29/Oct/23 Resolved: 17/Sep/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.1.0-rc0 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Jack Mulrow | Assignee: | Randolph Tan |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-wfbf-day | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v5.0, v4.4, v4.2, v4.0 |
| Sprint: | Sharding 2021-07-12, Sharding 2021-07-26, Sharding 2021-08-09, Sharding 2021-08-23 |
| Participants: | |
| Case: | (copied to CRM) |
| Linked BF Score: | 135 |
| Description |
|
To decide whether to gossip cluster times to and from external clients, mongod checks whether it is in a readable state. Currently, this check happens twice for each received external request and uses the ReplicationCoordinator method getMemberState(). That method takes the ReplicationCoordinator mutex, which is often contended and can have a performance impact on certain workloads. Instead, a method that does not take this mutex should be used, such as ReplicationCoordinator::isInPrimaryOrSecondaryState(). |
| Comments |
| Comment by Randolph Tan [ 17/May/22 ] |
|
qwan@stripe.com We didn't see a significant difference in the perf numbers in one of our workloads so we ended up not backporting it. |
| Comment by Qian Wan [ 12/May/22 ] |
|
This patch was said to be slated for backport to v4.x; did that ever happen? |
| Comment by Githook User [ 17/Sep/21 ] |
|
Author: {'name': 'Randolph Tan', 'email': 'randolph@10gen.com', 'username': 'renctan'}Message: |
| Comment by Bruce Lucas (Inactive) [ 03/Sep/21 ] |
|
I opened |
| Comment by Andrew Shuvalov (Inactive) [ 04/Jun/21 ] |
|
I would like to stress that if a mutex is already known to experience substantial lock contention, then merely avoiding it in simple "get()" methods on the critical path is kicking the can down the road. Besides doing this, an effort should be made to actually reduce the contention. Remember the maxim: "lock contention is always a bug". After only a few minutes of looking at this mutex, I can immediately suggest exploring the possibility that the `_heartbeatHandles` collection, which tracks outstanding heartbeats and is used in only 4 methods, is fairly autonomous and could have its own mutex. That would in turn allow calling `_replExecutor->scheduleRemoteCommand()` without holding any lock, which is safe because the executor is already internally synchronized. So the suggestion is: |