[SERVER-40856] Log replication progress in hang analyzer Created: 26/Apr/19  Updated: 06/Dec/22

Status: Open
Project: Core Server
Component/s: Replication, Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.1 Desired

Type: Improvement Priority: Major - P3
Reporter: Judah Schvimer Assignee: Backlog - Server Tooling and Methods (STM) (Inactive)
Resolution: Unresolved Votes: 0
Labels: move-sdp-candidate, tig-hanganalyzer, tig-qwin-eligible
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-40857 Remove write concern wtimeouts that a... Backlog
Assigned Teams:
Server Tooling & Methods
Participants:

 Description   

Some replication hangs are caused by nodes that have stopped replicating rather than by an actual deadlock. It would be helpful if the hang analyzer called replSetGetStatus on every node in the cluster while the processes are still alive.
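As a rough illustration of what "replication progress" could look like in the hang analyzer's log, the sketch below reduces one replSetGetStatus reply to one line per member. It assumes the reply has already been fetched and decoded into a Python dict; the `members`, `name`, `stateStr`, and `optimeDate` fields come from the replSetGetStatus output, while the function name `summarize_progress` and the line format are hypothetical and not part of the existing hang analyzer.

```python
from datetime import datetime


def summarize_progress(status):
    """Return one human-readable line per replica set member.

    `status` is a decoded replSetGetStatus reply. Lag is measured against
    the member reporting itself as PRIMARY, if there is one.
    """
    members = status.get("members", [])

    # Use the primary's last applied time as the reference point.
    primary_times = [
        m["optimeDate"]
        for m in members
        if m.get("stateStr") == "PRIMARY" and "optimeDate" in m
    ]
    newest = primary_times[0] if primary_times else None

    lines = []
    for m in members:
        applied = m.get("optimeDate")
        if newest is not None and applied is not None:
            lag = f"{(newest - applied).total_seconds():.0f}s behind primary"
        else:
            lag = "lag unknown"
        lines.append(f"{m.get('name', '?')} {m.get('stateStr', '?')} {lag}")
    return lines


if __name__ == "__main__":
    # Hand-written sample reply, for illustration only.
    sample = {
        "set": "rs0",
        "members": [
            {"name": "n1:27017", "stateStr": "PRIMARY",
             "optimeDate": datetime(2019, 4, 26, 12, 0, 10)},
            {"name": "n2:27017", "stateStr": "SECONDARY",
             "optimeDate": datetime(2019, 4, 26, 12, 0, 0)},
        ],
    }
    for line in summarize_progress(sample):
        print(line)
```

A secondary whose lag keeps growing across successive snapshots would point to a node that is not replicating, which is the case this ticket wants to distinguish from a deadlock.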

If the replSetGetStatus call itself hangs because of a deadlock on the ReplicationCoordinator mutex, the replication progress is probably not important anyway, so it is fine to simply kill that command.
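One way to keep a hung replSetGetStatus call from blocking the hang analyzer is to put a deadline on each call and give up on nodes that do not answer. The sketch below is only an illustration under stated assumptions: `collect_repl_status` and `pymongo_get_status` are hypothetical names, the 5 second default is arbitrary, and the sketch abandons a slow call on the client side rather than killing the operation on the server. The status fetcher is passed in as a parameter so the timeout logic can be exercised without a running cluster.

```python
import threading
import time


def collect_repl_status(nodes, get_status, timeout_secs=5.0):
    """Call get_status(node) for every node under one shared deadline.

    Returns {node: (outcome, payload)} where outcome is "ok", "error",
    or "timeout". Calls still running at the deadline are abandoned.
    """
    results = {}

    def worker(node):
        try:
            results[node] = ("ok", get_status(node))
        except Exception as exc:  # report the failure, keep going
            results[node] = ("error", repr(exc))

    # Daemon threads so a call stuck forever cannot block interpreter exit.
    threads = {}
    for node in nodes:
        thread = threading.Thread(target=worker, args=(node,), daemon=True)
        thread.start()
        threads[node] = thread

    deadline = time.monotonic() + timeout_secs
    report = {}
    for node, thread in threads.items():
        thread.join(max(0.0, deadline - time.monotonic()))
        report[node] = results.get(node, ("timeout", None))
    return report


def pymongo_get_status(node, timeout_ms=5000):
    """Assumed fetcher: run replSetGetStatus against one "host:port" node.

    Requires the third-party pymongo driver, imported lazily so the
    rest of this sketch runs without it.
    """
    from pymongo import MongoClient

    client = MongoClient(
        node,
        directConnection=True,  # talk to this node only, no discovery
        serverSelectionTimeoutMS=timeout_ms,
        socketTimeoutMS=timeout_ms,
    )
    try:
        return client.admin.command("replSetGetStatus")
    finally:
        client.close()
```

A caller might run `collect_repl_status(["host1:27017", "host2:27017"], pymongo_get_status)` before attaching debuggers to the processes, and log a "timeout" outcome as a hint that the node may be stuck on the ReplicationCoordinator mutex.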



 Comments   
Comment by Lingzhi Deng [ 28/Jan/22 ]

I think this would still be nice to have when diagnosing BFs, but it is not pressing.
