[SERVER-39770] FSM connection cache setup can fail with step down Created: 22/Feb/19  Updated: 29/Oct/23  Resolved: 24/May/19

Status: Closed
Project: Core Server
Component/s: Testing Infrastructure
Affects Version/s: None
Fix Version/s: 4.1.12

Type: Bug Priority: Major - P3
Reporter: Randolph Tan Assignee: Vesselina Ratcheva (Inactive)
Resolution: Fixed Votes: 0
Labels: prepare_testing, tig-concurrency
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-39993 Add kill and terminate versions of co... Closed
Related
related to SERVER-41096 ContinuousStepdown thread and resmoke... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Repl 2019-06-03
Participants:
Linked BF Score: 18

 Description   

As part of setting up an fsm test, we sometimes try to create connections to all nodes here, which will eventually call whatsmyuri. However, if this is running on a continuous stepdown suite, you can hit a network error while trying to establish the connection. It is also unclear if we ever re-establish the connections if the connections were closed after the setup due to stepdown.



 Comments   
Comment by Githook User [ 24/May/19 ]

Author: Vesselina Ratcheva (vessy-mongodb) <vesselina.ratcheva@10gen.com>

Message: SERVER-39770 Add retry logic to FSM connection cache setup
Branch: master
https://github.com/mongodb/mongo/commit/efe7bc8007aa932a1533af95d980909cd4a39670

Comment by Max Hirschhorn [ 22/May/19 ]

"Would it be possible to re-prioritize this ticket? I expect we would see a nontrivial amount of failures if I were to push SERVER-39993 without this fix."

I think adding retries around new Mongo(...) and new SpecificSecondaryReaderMongo(...) with option (a) is going to be the easier route. SERVER-41096 describes a bug in the management of the stepdown file used to synchronize the background thread in resmoke.py with the concurrency framework in the mongo shell. I worry that attempting option (b) would mean we need to fix both issues simultaneously.
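The retry idea in option (a) can be sketched roughly as follows. This is a minimal illustration, not the committed fix: connectWithRetry, looksLikeNetworkError, and the maxAttempts option are hypothetical names, and the connection factory is injected so that the sketch runs without a mongo shell. In the shell the factory would presumably wrap new Mongo(...) or new SpecificSecondaryReaderMongo(...).

```javascript
// Assumption: a network error can be recognized from its message text.
// The real framework may classify errors differently.
function looksLikeNetworkError(e) {
    return /network error/i.test(String(e && e.message));
}

// Hypothetical helper: call makeConnection() until it succeeds, retrying
// only on errors that look like network errors, up to maxAttempts times.
function connectWithRetry(makeConnection,
                          {maxAttempts = 5, isRetryable = looksLikeNetworkError} = {}) {
    let lastError;
    for (let attempt = 1; attempt <= maxAttempts; attempt++) {
        try {
            return makeConnection();
        } catch (e) {
            if (!isRetryable(e)) {
                throw e;  // Not a stepdown-style failure: surface it immediately.
            }
            lastError = e;
        }
    }
    throw lastError;  // Every attempt failed with a retryable error.
}

// Stand-in for a connection attempt that fails twice during a stepdown.
let calls = 0;
const fakeFactory = () => {
    calls++;
    if (calls < 3) {
        throw new Error("network error while attempting to run command 'whatsmyuri'");
    }
    return {host: "localhost:20000"};
};

const conn = connectWithRetry(fakeFactory);
```

Errors that do not look like network errors are rethrown on the first attempt, so a genuine misconfiguration would still fail fast rather than being masked by the retries.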

vesselina.ratcheva, do you want to chat through how to go about doing this? The network_error_and_txn_override.js override file is already load()'d by the FSM worker thread (though it happens after the new Mongo(...) and new SpecificSecondaryReaderMongo(...) calls right now) so we could use connect() to do the retries instead of writing a separate assert.soon().

Note that overriding the global Mongo object isn't viable because of how the DBClientConnection is stored in its private data field and can only be accessed through the C++-backed JavaScript object.

Comment by Vesselina Ratcheva (Inactive) [ 22/May/19 ]

max.hirschhorn I've been seeing this pretty frequently in the new suites for SERVER-39993. Would it be possible to re-prioritize this ticket? I expect we would see a nontrivial amount of failures if I were to push SERVER-39993 without this fix.

Comment by Max Hirschhorn [ 25/Feb/19 ]

"It is also unclear if we ever re-establish the connections if the connections were closed after the setup due to stepdown."

Just to clarify - the connections owned by the main thread are reconnected by calling Cluster#reestablishConnectionsAfterFailover() which just calls ReplSetTest#getPrimary() on the CSRS and replica set shards in the case of a sharded cluster.

"As part of setting up an fsm test, we sometimes try to create connections to all nodes here, which will eventually call whatsmyuri."

We could consider (a) retrying connection establishment when args.passConnectionCache=true, when TestData.pinningSecondary=true, and when establishing connections for all other cases, or (b) populating the connection cache non-lazily and starting the stepdown thread slightly later. We should actually only start the stepdown thread after latch.getCount() === 0, which indicates that every worker thread has either finished its initialization or failed to initialize. That is to say, to implement (b) we should start the stepdown thread after calling threadMgr.checkFailed(0.2).
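The ordering that option (b) relies on might look like the sketch below. It is a single-threaded illustration under stated assumptions: FakeLatch stands in for the shell's latch, runSetup and its callbacks are hypothetical names, and the failure check performed by threadMgr.checkFailed(0.2) is not modeled. The only point shown is that the stepdown thread starts after every worker has counted down, whether its initialization succeeded or failed.

```javascript
// Stand-in for a countdown latch; only the two methods used here.
class FakeLatch {
    constructor(count) {
        this.count = count;
    }
    countDown() {
        if (this.count > 0) {
            this.count--;
        }
    }
    getCount() {
        return this.count;
    }
}

// Hypothetical setup routine: run each worker's initialization, count down
// whether it succeeded or failed, and start the stepdown thread only once
// the latch has reached zero.
function runSetup(workerInits, startStepdownThread) {
    const latch = new FakeLatch(workerInits.length);
    const failures = [];
    for (const init of workerInits) {
        try {
            init();  // e.g. populate this worker's connection cache
        } catch (e) {
            failures.push(e);
        } finally {
            latch.countDown();
        }
    }
    if (latch.getCount() !== 0) {
        throw new Error("workers are still initializing");
    }
    startStepdownThread();
    return failures;
}

const events = [];
const failures = runSetup(
    [
        () => events.push("worker0 ready"),
        () => {
            throw new Error("init failed");
        },
    ],
    () => events.push("stepdown started"));
```

In the real framework the workers run on separate threads and the main thread would wait on the latch, so this sketch only conveys the gate, not the concurrency.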

Generated at Thu Feb 08 04:53:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.