[SERVER-32906] Improve logging around elections Created: 25/Jan/18  Updated: 30/Oct/23  Resolved: 27/Jul/18

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 4.1.2

Type: Task Priority: Major - P3
Reporter: Spencer Brody (Inactive) Assignee: Tess Avitabile (Inactive)
Resolution: Fixed Votes: 2
Labels: elections, neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Documented
is documented by DOCS-12197 Docs for SERVER-32906: Improve loggin... Closed
Duplicate
is duplicated by SERVER-30851 Log when sending requests in scatter-... Closed
is duplicated by SERVER-31106 Log election losses Closed
Problem/Incident
causes SERVER-36345 Coverity analysis defect 104941: Poin... Closed
Related
is related to SERVER-32907 Turn heartbeat log verbosity back dow... Closed
is related to SERVER-11086 Election handoff to new primary, duri... Closed
is related to SERVER-27908 log every heartbeat during testing Closed
is related to SERVER-32770 ReplSetRequestVotesResponse should de... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.0
Sprint: Repl 2018-07-02, Repl 2018-07-16, Repl 2018-07-30
Participants:

 Description   

To aide diagnosing test failures, we should add logging any time we take an election-related action (election timeout expiring, calling for an election, voting in an election, etc), with all possible inputs to that decision (dump of heartbeat and spanning tree liveness tables, etc.).

This could be done with a new log sub-component, so that we can have higher verbosity in our tests than in prod.



 Comments   
Comment by Githook User [ 27/Jul/18 ]

Author:

{'name': 'Tess Avitabile', 'email': 'tess.avitabile@mongodb.com', 'username': 'tessavitabile'}

Message: SERVER-32906 Improve logging around elections
Branch: master
https://github.com/mongodb/mongo/commit/b4549a63371f9b108e8d7042731c9f1eec622004

Comment by Tess Avitabile (Inactive) [ 10/Jul/18 ]

Thank you!

Comment by William Schultz (Inactive) [ 09/Jul/18 ]

tess.avitabile You should now be able to change these settings to configure Evergreen specific logging levels.

Comment by William Schultz (Inactive) [ 05/Jul/18 ]

The ability to configure log verbosity defaults differently for local testing and tests that run in Evergreen should be enabled by SERVER-32907.

Comment by Tess Avitabile (Inactive) [ 21/Jun/18 ]

Sure, I can add that at the default log level. It's related to elections, and it seems easier than scheduling separate work.

Comment by Spencer Brody (Inactive) [ 21/Jun/18 ]

Hmm... I'd actually like something at default log level for the decision on whether or not to go into catchup mode, but maybe the full replSetGetStatus output is too verbose to log at default level.  I think just a line with the target optime and the current optimes of each other node would suffice.  Maybe that's out of scope for this ticket though.

Comment by Tess Avitabile (Inactive) [ 21/Jun/18 ]

For SERVER-30851: I will add logging to the ScatterGatherRunner to log requests. We already have logging for receiving vote responses.

For SERVER-31106: I will add logging to the LoseElectionGuardV1 destructor.

Yes, we do have logging at level 0 when we fail to receive a response from a node. I believe the ThreadPoolTaskExecutor calls the callback with a non-ok RemoteCommandResponse when there is a timeout. 

Sure, I can add logging of replSetGetStatus output when we decide whether to go into catchup mode. I'll put this in the election sub-component at level 4.

Comment by Spencer Brody (Inactive) [ 21/Jun/18 ]

Everything laid out sounds good. The two linked duplicate tickets (SERVER-30851 and SERVER-31106) also include additional suggestions. Do we log currently when a vote request implicitly becomes a 'no' vote, but without receiving a definitive response from the node (for example due to timeout)?
Finally I was looking at a HELP ticket yesterday and thinking that we should have more logging around when we decide whether or not to enter catchup mode. If you're already adding logging of replSetGetStatus output at that time, I believe that also contains the same information we use for making catchup decisions, so it might be good to add that as well while we're at it.

Comment by Tess Avitabile (Inactive) [ 20/Jun/18 ]

We have the following logging for election events, with the following log levels:

I think we should do the following for this ticket:

  • Add a new logging sub-component for elections, and move all of the above logging into that sub-component (except for sending vote requests, since that is done generically by the task executor). In evergreen, we will log elections at verbosity level 4.
  • Add logging at level 0 for voting in an election.
  • Add logging at level 4 for postponing election timeout.
  • Log replSetGetStatus output at level 4 each time a node takes one of the following actions. The output of replSetGetStatus contains all relevant information to making an election-related decision.
    • Decides to do a catchup takeover.
    • Decides to do a priority takeover.
    • Votes.

spencer, can you please review this plan?

Generated at Thu Feb 08 04:31:40 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.