[SERVER-41781] Configure shard to have deterministic failover in sharding_multiple_ns_rs.js Created: 14/Jun/19  Updated: 29/Oct/23  Resolved: 24/Oct/19

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: None
Fix Version/s: 4.3.1

Type: Bug Priority: Minor - P4
Reporter: Lamont Nelson Assignee: Lamont Nelson
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

1. run sharding_multiple_ns_rs.js
2. trigger an additional election after the second primary is elected, but before connPoolStats returns up-to-date information
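
A hedged sketch of how step 2 could be forced in shell-test terms (replSetStepUp is a real test-only command; st and the variable names here are illustrative):

    // After the original primary is killed and a second primary has been
    // elected, immediately ask another electable node to run for primary,
    // racing the mongos connection-pool refresh that connPoolStats reads.
    var secondPrimary = st.rs0.getPrimary();
    var challenger = st.rs0.getSecondaries()[0];
    assert.commandWorked(challenger.adminCommand({replSetStepUp: 1}));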

Sprint: Repl 2019-07-01, Sharding 2019-08-26, Sharding 2019-09-23, Sharding 2019-10-07, Sharding 2019-10-21, Sharding 2019-11-04
Participants:
Linked BF Score: 9

Description

The test has one sharded collection and one unsharded collection. It inserts data into both, waits for replication to complete, and then kills the primary of the shard. The failure occurs while waiting for the new primary to be recognized, which the test verifies by running the "connPoolStats" command.
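
For context, the test's flow is roughly the following (a paraphrase of the description, not the verbatim jstests source; awaitRSClientHosts comes from jstests/replsets/rslib.js):

    load("jstests/replsets/rslib.js");  // provides awaitRSClientHosts

    // One shard backed by a three-node replica set.
    var st = new ShardingTest({shards: {rs0: {nodes: 3}}});
    assert.commandWorked(st.s.adminCommand({enableSharding: "test"}));
    assert.commandWorked(
        st.s.adminCommand({shardCollection: "test.sharded", key: {_id: 1}}));

    // Write to a sharded and an unsharded collection through mongos, then
    // wait for the writes to replicate within the shard.
    assert.commandWorked(st.s.getDB("test").sharded.insert({_id: 1}));
    assert.commandWorked(st.s.getDB("test").unsharded.insert({_id: 1}));
    st.rs0.awaitReplication();

    // Kill the shard's primary, then wait for mongos to recognize whichever
    // node getPrimary() reports next; this is the wait that times out.
    st.rs0.stop(st.rs0.getPrimary());
    awaitRSClientHosts(st.s, st.rs0.getPrimary(), {ok: true, ismaster: true}, st.rs0);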

After the original primary is killed, the test gets the new primary by calling st.rs0.getPrimary() (d21521 at this point). It then waits up to five minutes for "connPoolStats" to report this node as the new primary; the wait started at 2019-06-06T16:07:22 (log line 2368).

According to the logs, a new primary, d21521, was promoted at 16:07:20 (log line 2183). However, this was never reflected in the output of the connPoolStats command. At 16:04:44 (log line 3471), node d21522 decided to start a third election. This seems strange, since there was a heartbeat between the two nodes immediately before (log line 3417). Regardless, d21522 won the election and was promoted to primary at 2019-06-06T16:07:44 (log line 3605).

The "connPoolStats" command reported d21520 (the original primary), as the primary up until 2019-06-06T16:07:50 (line 4210). After this, d21522 was returned as primary by this command. Since the test is still waiting for d21521 to become primary the test will ultimately fail after the timeout period.

Since multiple elections can occur, this failure could be avoided by extending awaitRSClientHosts to re-check who the shard's current primary is on every attempt, not just when the function is first called. For example, awaitRSClientHosts could accept either a string or a function that produces the currently expected primary.
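
One way to phrase that idea (a hypothetical sketch, not the fix that was ultimately taken; the name awaitRSClientPrimary is invented here):

    // Hypothetical variant: hostOrFn may be a "host:port" string or a
    // function returning the currently expected primary's "host:port".
    function awaitRSClientPrimary(mongos, hostOrFn, rsName, timeout) {
        assert.soon(function() {
            // Re-resolve the expected primary on every attempt, so a later
            // election changes the target instead of stranding the wait.
            var expected = (typeof hostOrFn === "function") ? hostOrFn() : hostOrFn;
            var stats = assert.commandWorked(mongos.adminCommand({connPoolStats: 1}));
            var rsStats = stats.replicaSets[rsName];
            if (!rsStats)
                return false;
            return rsStats.hosts.some(function(h) {
                return h.addr === expected && h.ismaster;
            });
        }, "mongos never reported the expected primary in connPoolStats", timeout);
    }

    // Usage: re-ask the fixture for the primary on each polling attempt.
    // awaitRSClientPrimary(st.s, function() { return st.rs0.getPrimary().name; }, st.rs0.name);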



Comments
Comment by Githook User [ 24/Oct/19 ]

Author: Lamont Nelson (lamontnelson) <lamont.nelson@mongodb.com>

Message: SERVER-41781 Make one node in replica set non-voting to prevent non-deterministic election in the test.
Branch: master
https://github.com/mongodb/mongo/commit/1137e076c4315f531ef77d2e70ad11370cb3872c
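
The shape of the fix, per the commit message above (a sketch, assuming the per-node rsConfig override supported by the shell fixtures; the committed test may differ in detail):

    // A priority: 0, votes: 0 member can never be elected, so killing the
    // primary leaves exactly one electable successor and the failover
    // becomes deterministic.
    var st = new ShardingTest({
        shards: {
            rs0: {
                nodes: [{}, {}, {rsConfig: {priority: 0, votes: 0}}],
            },
        },
    });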

Comment by Lamont Nelson [ 21/Oct/19 ]

Talked with Esha and decided to just do priority 0 here. The code review link is here: https://mongodbcr.appspot.com/477540005/

Comment by Vesselina Ratcheva (Inactive) [ 10/Jul/19 ]

To summarize the above, the shard needs to be configured to either only have two nodes or to give a priority of 0 to the third node.

Comment by Vesselina Ratcheva (Inactive) [ 10/Jul/19 ]

I've had a bit more time to look at this now and I believe awaitRSClientHosts does what it is intended to do from an API/function standpoint. We use it in our tests to verify that the specified host(s) are in the specified state (with agreement from mongos). The requirement is that the caller knows what hosts and states they want. Making the function more lenient risks masking bugs (we have over 30 usages as of today).

The real problem here is in the usage, or rather in the expectations of the test itself. The shard has three nodes and no assignment of priorities between them. This by itself already makes failovers non-deterministic, as either of the other two nodes can get elected if one node goes down. It also opens up the possibility of multiple elections, as we have seen in this test - a flaky network and/or machine slowness can easily lead to this. Line 3417 does not actually show a successful heartbeat - that is only the request. The response came in much later, on line 3506, and thus the node started another election in the interim. The test needs to configure the shard such that there is only one way for the failover to occur.

We have historically run into similar situations many times with tests we write on the replication team. Thankfully, we have over time gotten much better at configuring our tests to avoid this. We typically give a priority of 0 to the third node (which makes it completely unelectable), or take that node out of the picture altogether and have only two replica set members. Either of these solutions should work for this test.
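
For an already-running set, the priority-0 option can also be applied via reconfig (a sketch; rst stands in for the test's ReplSetTest fixture):

    // Make the third member unelectable in place.
    var conf = rst.getReplSetConfigFromNode();
    conf.members[2].priority = 0;  // priority 0: this node can never become primary
    conf.version++;
    assert.commandWorked(rst.getPrimary().adminCommand({replSetReconfig: conf}));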
