Core Server: SERVER-66084

set_step_params.js can induce deadlock by preventing targeter from discovering shard

    • Type: Bug
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels: None
    • Service Arch
    • ALL
    • 0

      The test set_step_params.js (https://github.com/mongodb/mongo/blob/e3f9dca6ad9888fd696c99f1b1ae2a4c7fdd932b/jstests/noPassthrough/set_step_params.js#L1) attempts to test the ability to cap the number of connections that the connection pools can have in the currently-establishing state, via the 'maxConnecting' sharding-task-executor parameter. It works as follows:

      • Set the maxConnecting parameter to two on mongos.
      • Set the 'waitInHello' failpoint on the shard. This blocks the shard from responding to hellos.
      • Drop all connections from mongos to the shard.
      • Send two finds to mongos that target the shard. To service these finds, mongos attempts to acquire two connections to the shard. Because the connections were just dropped, the connection pool attempts to establish two new connections to the shard. Because the waitInHello failpoint is set, those connections get stuck in the 'establishing' state.
      • Assert that two connections are in the 'refreshing'/'establishing' state.
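
      As a rough illustration of the cap that the steps above exercise, here is a toy model in plain JavaScript. It is not MongoDB's ConnectionPool: ToyPool, requestConnection and completeHandshake are hypothetical names, and a third request is added beyond the test's two only to make the cap visible.

```javascript
// Toy model only: NOT MongoDB's ConnectionPool. All names here are hypothetical.
class ToyPool {
    constructor(maxConnecting) {
        this.maxConnecting = maxConnecting;
        this.connecting = 0;  // handshakes in flight ("refreshing"/"establishing")
        this.pending = 0;     // requests waiting for a slot to start a handshake
    }

    // A request arrives while the pool has no ready connections.
    requestConnection() {
        if (this.connecting < this.maxConnecting) {
            this.connecting++;  // start a new handshake
        } else {
            this.pending++;     // cap reached: wait instead of connecting
        }
    }

    // The remote side finally answers one hello.
    completeHandshake() {
        if (this.connecting === 0) return;
        this.connecting--;
        if (this.pending > 0) {
            // A waiting request takes the freed slot.
            this.pending--;
            this.connecting++;
        }
    }
}

const pool = new ToyPool(2);
// Requests arrive while the remote side is not answering hellos.
for (let i = 0; i < 3; i++) pool.requestConnection();
console.log(pool.connecting, pool.pending);
```

      In this model the number of in-flight handshakes never exceeds maxConnecting, which is the property the test's final assertion is meant to observe.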

      However, the following bad interleaving is possible, causing a deadlock for the operations: 

      • After the waitInHello failpoint is set on the shard, and concurrently with the dropConnections call on mongos, the hello_monitor of mongos' ReplicaSetMonitor (RSM) is scheduling a hello to monitor the shard. The scheduling fails with error PooledConnectionsDropped, because the connection pool fails the request for a connection while dropConnections is being processed.
      • The RSM then marks the host as failed and attempts to schedule another 'hello' to refresh its monitoring of that host.
      • The RSM's hello is scheduled and sent, but blocks on the shard side due to the waitInHello failpoint.
      • set_step_params.js sends the finds intended to open connections. However, before these finds attempt to acquire connections from the connection pool, they first need to perform 'targeting': resolving the shard to hosts and ports.
      • Targeting relies on the RSM having information about the shard, but the RSM's monitoring is blocked by the failpoint.

      In short, the test is stuck waiting for the RSM to update monitoring of the shard, but the RSM is blocked waiting for the test to release the waitInHello failpoint. 

       

      To fix this, we should probably have the test hang connection establishment on the mongos side, in the mongos connection pool under test, rather than using the waitInHello failpoint to hang connection establishment on the shard side. This would allow other necessary connections to the shard, such as the RSM's monitoring, to go through.
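
      The circular wait described above, and the effect of moving the hang to the mongos side, can be sketched as a wait-for graph. This is only a model: the node names are hypothetical labels for the steps in this ticket, and it assumes the test releases its hang only after its assertion passes.

```javascript
// Hypothetical wait-for graph: an edge A -> B means "A cannot proceed until B does".
// Returns the nodes of one cycle (first node repeated at the end), or null if none.
function findCycle(graph) {
    const state = new Map();  // node -> "visiting" | "done"
    const path = [];

    function visit(node) {
        if (state.get(node) === "done") return null;
        if (state.get(node) === "visiting") {
            // Back edge: everything from the first visit of `node` forms a cycle.
            return path.slice(path.indexOf(node)).concat(node);
        }
        state.set(node, "visiting");
        path.push(node);
        for (const next of graph[node] || []) {
            const cycle = visit(next);
            if (cycle) return cycle;
        }
        path.pop();
        state.set(node, "done");
        return null;
    }

    for (const node of Object.keys(graph)) {
        const cycle = visit(node);
        if (cycle) return cycle;
    }
    return null;
}

// Current test: the hang is on the shard, so it also blocks the RSM's hello.
const shardSideHang = {
    "test: release waitInHello": ["test: assert connections establishing"],
    "test: assert connections establishing": ["mongos: finds request connections"],
    "mongos: finds request connections": ["mongos: targeting"],
    "mongos: targeting": ["RSM: hello reply"],
    "RSM: hello reply": ["test: release waitInHello"],
};

// Proposed change: the hang lives in the mongos pool under test,
// so the shard still answers the RSM's hello.
const mongosSideHang = {
    "test: release mongos-side hang": ["test: assert connections establishing"],
    "test: assert connections establishing": ["mongos: finds request connections"],
    "mongos: finds request connections": ["mongos: targeting"],
    "mongos: targeting": ["RSM: hello reply"],
    "RSM: hello reply": [],
};

const shardCycle = findCycle(shardSideHang);
const mongosCycle = findCycle(mongosSideHang);
console.log(shardCycle ? shardCycle.join(" -> ") : "no cycle");
console.log(mongosCycle ? mongosCycle.join(" -> ") : "no cycle");
```

      In the first graph every step waits on another step in the loop, so nothing can make progress. In the second, the RSM's hello no longer depends on the test, so targeting can complete and the finds can reach the connection pool.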

            Assignee:
            backlog-server-servicearch [DO NOT USE] Backlog - Service Architecture
            Reporter:
            george.wangensteen@mongodb.com George Wangensteen
            Votes:
            1
            Watchers:
            2
