[SERVER-39770] FSM connection cache setup can fail with step down Created: 22/Feb/19 Updated: 29/Oct/23 Resolved: 24/May/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.12 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Randolph Tan | Assignee: | Vesselina Ratcheva (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | prepare_testing, tig-concurrency |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Sprint: | Repl 2019-06-03 |
| Linked BF Score: | 18 |
| Description |
|
As part of setting up an FSM test, we sometimes try to create connections to all nodes here, which will eventually call whatsmyuri. However, if this runs in a continuous stepdown suite, establishing the connection can hit a network error. It is also unclear whether we ever re-establish connections that were closed by a stepdown after setup. |
| Comments |
| Comment by Githook User [ 24/May/19 ] |
|
Author: {'email': 'vesselina.ratcheva@10gen.com', 'name': 'Vesselina Ratcheva', 'username': 'vessy-mongodb'} Message: |
| Comment by Max Hirschhorn [ 22/May/19 ] |
I think adding retries around new Mongo(...) and new SpecificSecondaryReaderMongo(...) with option (a) is going to be the easier route. vesselina.ratcheva, do you want to chat through how to go about doing this? The network_error_and_txn_override.js override file is already load()'d by the FSM worker thread (though it happens after the new Mongo(...) and new SpecificSecondaryReaderMongo(...) calls right now) so we could use connect() to do the retries instead of writing a separate assert.soon(). Note that overriding the global Mongo object isn't viable because of how the DBClientConnection is stored in its private data field and can only be accessed through the C++-backed JavaScript object. |
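For illustration only, retrying connection establishment as in option (a) could look roughly like the sketch below. It is plain JavaScript, not the actual change: `connectWithRetry`, `makeFlakyFactory`, and the `isRetryable` predicate are hypothetical names, and the factory is a stand-in for a call such as new Mongo(...) that fails a fixed number of times with a simulated network error.

```javascript
// Hypothetical helper (not part of the FSM library): call a synchronous
// connection factory, retrying when it throws a retryable error.
function connectWithRetry(makeConnection, {maxAttempts = 5, isRetryable = () => true} = {}) {
    let lastError;
    for (let attempt = 1; attempt <= maxAttempts; ++attempt) {
        try {
            return makeConnection();
        } catch (e) {
            if (!isRetryable(e)) {
                throw e;  // Non-network failures are surfaced immediately.
            }
            lastError = e;
        }
    }
    throw lastError;  // Every attempt failed with a retryable error.
}

// Stand-in for `new Mongo(host)`: throws on the first `failures` calls,
// as a connection attempt might during a stepdown.
function makeFlakyFactory(failures) {
    let calls = 0;
    return function() {
        calls++;
        if (calls <= failures) {
            throw new Error("simulated network error");
        }
        return {host: "localhost:20000", attempts: calls};
    };
}

const conn = connectWithRetry(makeFlakyFactory(2), {
    maxAttempts: 5,
    isRetryable: (e) => /network error/.test(e.message),
});
```

Bounding the attempts keeps a permanently unreachable node from hanging test setup, and the predicate keeps unrelated errors from being swallowed.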
| Comment by Vesselina Ratcheva (Inactive) [ 22/May/19 ] |
|
max.hirschhorn I've been seeing this pretty frequently in the new suites for |
| Comment by Max Hirschhorn [ 25/Feb/19 ] |
Just to clarify - the connections owned by the main thread are reconnected by calling Cluster#reestablishConnectionsAfterFailover() which just calls ReplSetTest#getPrimary() on the CSRS and replica set shards in the case of a sharded cluster.
We could consider (a) retrying connection establishment when args.passConnectionCache=true, when TestData.pinningSecondary=true, and when establishing connections for all other cases, or (b) populating the connection cache non-lazily and starting the stepdown thread slightly later. We should actually only start the stepdown thread after latch.getCount() === 0, which indicates that the worker threads have either finished their initialization or failed to initialize. That is to say, to implement (b) we should start the stepdown thread after calling threadMgr.checkFailed(0.2). |
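The ordering behind option (b) can be sketched in plain JavaScript as follows. This is a single-threaded illustration under stated assumptions, not the real harness: `FakeLatch` and `runSetup` are hypothetical stand-ins, and worker initialization is modeled as a simple loop rather than real threads.

```javascript
// Hypothetical stand-in for the latch shared with the worker threads.
class FakeLatch {
    constructor(count) {
        this.count = count;
    }
    countDown() {
        if (this.count > 0) {
            this.count--;
        }
    }
    getCount() {
        return this.count;
    }
}

// Records the order of events: every worker finishes (or fails) its
// initialization and counts down before the stepdown thread may start.
function runSetup(numWorkers) {
    const events = [];
    const latch = new FakeLatch(numWorkers);
    for (let i = 0; i < numWorkers; ++i) {
        events.push("worker " + i + " initialized");
        latch.countDown();
    }
    if (latch.getCount() !== 0) {
        throw new Error("workers still initializing; not starting stepdowns");
    }
    events.push("stepdown thread started");
    return events;
}

const events = runSetup(3);
```

The point of the sketch is only the gate: stepdowns begin once the latch count reaches zero, so connection setup in the workers is never interrupted by a failover.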