Type: Bug
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: sa-backlog
Assigned Teams: Server Programmability
Operating System: ALL
Linked BF Score: 0
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

The test set_step_params.js (https://github.com/mongodb/mongo/blob/e3f9dca6ad9888fd696c99f1b1ae2a4c7fdd932b/jstests/noPassthrough/set_step_params.js#L1) attempts to test the ability to cap the number of connections that a connection pool can have in the currently-establishing state, via the 'maxConnecting' sharding-task-executor parameter. It works as follows:

1. Set the maxConnecting parameter to two on mongos.
2. Set the 'waitInHello' failpoint on the shard. This blocks the shard from responding to hellos.
3. Drop all connections from mongos to the shard.
4. Send two find()s to the mongos that target the shard. To service these finds, mongos attempts to grab two connections to the shard. Because the connections were just dropped, the connection pool attempts to establish two new connections to the shard. But because the waitInHello failpoint is set, those connections get stuck in the 'establishing' state.
5. The test then asserts that two connections are in the 'refreshing'/'establishing' state.
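The intended behaviour of the cap can be illustrated with a toy model. This is only a sketch in plain JavaScript, not MongoDB's ConnectionPool: the class ToyPool and its fields are hypothetical names invented for illustration, and the third request is added only to show the cap taking effect (the test itself sends two finds).

```javascript
// Toy model only: ToyPool is a hypothetical stand-in, not MongoDB's ConnectionPool.
class ToyPool {
  constructor(maxConnecting) {
    this.maxConnecting = maxConnecting; // cap on concurrently establishing connections
    this.establishing = 0;              // connections whose handshake has not finished
    this.ready = 0;                     // connections that finished establishment
    this.queued = 0;                    // requests waiting for an establishment slot
  }

  // A request arrives and no ready connection exists (all were just dropped).
  requestConnection() {
    if (this.establishing < this.maxConnecting) {
      this.establishing += 1;
    } else {
      this.queued += 1;
    }
  }

  // One handshake completes. While the hello is blocked this never happens,
  // so connections stay in the establishing state.
  finishOneHandshake() {
    if (this.establishing === 0) return;
    this.establishing -= 1;
    this.ready += 1;
    if (this.queued > 0) {
      // A freed slot is handed to the next queued request.
      this.queued -= 1;
      this.establishing += 1;
    }
  }
}

const pool = new ToyPool(2);
pool.requestConnection(); // first find
pool.requestConnection(); // second find
pool.requestConnection(); // hypothetical extra request, must wait for a slot
console.log(pool.establishing, pool.queued); // prints "2 1"
```

With the hello blocked, the model stays at two establishing connections, which is the state the test asserts on.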

However, the following bad interleaving is possible, causing a deadlock for the operations:

1. After the waitInHello failpoint is set on the shard, and concurrently with the dropConnections call on mongos, the hello_monitor of mongos' ReplicaSetMonitor (RSM) is scheduling a hello on the shard to monitor it. The scheduling fails with error PooledConnectionsDropped, because the connection pool fails the pending request for a connection while it processes the dropConnections.
2. The RSM then marks the host as failed and attempts to schedule another 'hello' to refresh its monitoring of the host.
3. The RSM's hello is scheduled and sent, but blocks on the shard side due to the waitInHello failpoint.
4. set_step_params.js sends the finds intended to open connections. However, before these finds attempt to acquire connections from the connection pool, they first need to perform 'targeting': resolving the shard to a host and port.
5. Targeting relies on the RSM having information about the shard, but the RSM's monitoring is blocked by the failpoint.

In short, the test is stuck waiting for the RSM to update monitoring of the shard, but the RSM is blocked waiting for the test to release the waitInHello failpoint.
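The circular wait can be written down as a small wait-for graph. The sketch below is illustrative only: the node labels are informal names for the steps in this ticket, and it assumes the test releases the failpoint only after its assertion on establishing connections has passed. The findCycle helper is a generic depth-first search, not anything taken from the server code.

```javascript
// Hypothetical wait-for graph: each key cannot finish until the listed nodes finish.
const waitsFor = {
  "test: assert two establishing connections": ["mongos: finds acquire connections"],
  "mongos: finds acquire connections": ["mongos: shard targeting"],
  "mongos: shard targeting": ["RSM: hello reply from shard"],
  "RSM: hello reply from shard": ["test: release waitInHello"],
  // Assumption: the test turns the failpoint off only after its assertion passes.
  "test: release waitInHello": ["test: assert two establishing connections"],
};

// Depth-first search that returns the first cycle found as a list of nodes, or null.
function findCycle(graph) {
  const state = new Map(); // node -> "visiting" | "done"
  const path = [];

  function visit(node) {
    if (state.get(node) === "done") return null;
    if (state.get(node) === "visiting") {
      // Reached a node already on the current path: that closes a cycle.
      return path.slice(path.indexOf(node)).concat(node);
    }
    state.set(node, "visiting");
    path.push(node);
    for (const next of graph[node] || []) {
      const found = visit(next);
      if (found) return found;
    }
    path.pop();
    state.set(node, "done");
    return null;
  }

  for (const node of Object.keys(graph)) {
    const found = visit(node);
    if (found) return found;
  }
  return null;
}

const cycle = findCycle(waitsFor);
console.log(cycle ? cycle.join("\n  -> ") : "no cycle");
```

Removing the edge from the RSM's hello to the failpoint release leaves no cycle in this model.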

To fix this, we should probably have the test hang connection establishment on the mongos side, in the mongos connection pool under test, rather than using the waitInHello failpoint to hang connection establishment on the shard side. This will allow other necessary connections to the shard, such as the RSM's monitoring connection, to go through.
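The difference between the two hang placements can be summarised in a toy model of the bad interleaving described above. This is a sketch of the reasoning only: simulate and the hangSite values "shard" and "mongosPool" are hypothetical names, and no such mongos-side failpoint is named in this ticket.

```javascript
// Toy model of the bad interleaving, where the RSM has already marked the host as failed.
// hangSite "shard": waitInHello on the shard withholds every hello reply.
// hangSite "mongosPool": a hypothetical hang inside the mongos connection pool under test.
function simulate(hangSite) {
  // The RSM's hello is blocked only when the shard itself withholds hello replies.
  const rsmHelloCompletes = hangSite !== "shard";
  // Targeting needs refreshed RSM data before it can resolve the shard to a host.
  const targetingCompletes = rsmHelloCompletes;
  // The two finds reach the pool only after targeting; the hang then keeps
  // both of their connections in the establishing state.
  const establishing = targetingCompletes ? 2 : 0;
  return { rsmHelloCompletes, targetingCompletes, establishing };
}

console.log(simulate("shard"));      // the test's assertion on two connections can never pass
console.log(simulate("mongosPool")); // monitoring proceeds and two connections are establishing
```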

Assignee: Unassigned
Reporter: George Wangensteen (Inactive)
Participants: George Wangensteen
Votes: 1
Watchers: 3

Created: Apr 29 2022 03:43:20 PM UTC
Updated: Oct 23 2024 03:42:31 PM UTC
