Uploaded image for project: 'Go Driver'
  1. Go Driver
  2. GODRIVER-2390

repeated listCollections & collStats causing connection disconnect

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • None
    • None
    • None
    • None
    • None
    • None
    • None
    • None

      Summary

      The Cloud InTel team has a feature called Data Explorer which uses the automation agent to send commands down to the mongod. When this feature runs on an evergreen box with a standalone mongod 4.2 (mostly fails on a 4.2 standalone but once in a while also fails on newer versions of mongodb and even on a replica set) and an automation agent (which uses a go driver version v1.7.2-0.20210804202952-82d3467025bc) it ends up causing the agent to lose its connection to the mongod when it issues particular commands.

      Specifically, we end up populating the mongod with many databases and collections via a js script, then the agent iterates through all of the dbs and for each db it calls listCollections on the mongod. Sometimes one of those listCollections ends up failing due to a closed connection. In the situations when the listCollections end up finishing however, we then take the first db in the list and we call collStats on each of its collections and one of those collStats calls sometimes ends up failing, also due to a closed connection. The failures occur 10 - 15% of the time that the test runs and most of those are failures on listCollections. The relevant automation agent and mongod logs are attached.

      How to Reproduce

      I have not been able to reproduce this locally. It seems to only fail on the evergreen boxes where we run e2e tests (which might imply this is OS specific or maybe it only occurs when resource usage on a box is high?). Here is an example of one of the failed evergreen e2e tests: https://spruce.mongodb.com/task/mms_e2e_local_E2E_Local_Monitoring_DataExplorer_Read_Write_patch_c8ef0b4e6e5e627d655dd527362adb8d05a7d02f_624b0c86e3c33127d0d804db_22_04_04_15_19_37/tests?execution=0&sortBy=STATUS&sortDir=ASC

      The test doesn't spin up any Atlas boxes, it simply runs an agent and a standalone on the current evergreen box. 

      Additional Background

      Here is the relevant snippet from the automation agent log of that failing test where it shows the listCollections command failing due to a closed connection:

      [2022-04-21T00:27:49.230+0000] [.debug] [src/mongoctl/processctl.go:RunCommandWithTimeout:1132] <myStandalone> [00:27:49.230] Starting RunCommand(dbName=admin, cmd=[\{isMaster 1}]) to ip-10-122-63-219.ec2.internal:20100 (local=false) ...
      [2022-04-21T00:27:49.230+0000] [.debug] [src/mongoclientservice/mongoclientservice.go:createClient:1129] [00:27:49.230] Attempting to connect to connParams = ip-10-122-63-219.ec2.internal:20100 (local=false), connectMode = SingleConnect, deadline = 2022-04-21 00:28:29.230260107 +0000 UTC m=+142.651182497, omitAtmUsers = false identitiesToTry: ([<nil>]), tryTls order: ([false])
      [2022-04-21T00:27:49.230+0000] [.debug] [src/mongoclientservice/mongoclientservice.go:createClient:1186] [00:27:49.230] Trying without TLS for ip-10-122-63-219.ec2.internal:20100 (local=false). identity <nil>
      [2022-04-21T00:27:49.230+0000] [.error] [src/mongoctl/processctl.go:runListCollections:2881] [00:27:49.230] Error running ListCollections with filter map[] : connection(ip-10-122-63-219.ec2.internal:20100[-79]) incomplete read of message header: read tcp 10.122.63.219:54110->10.122.63.219:20100: use of closed network connection
      [2022-04-21T00:27:49.230+0000] [.error] [src/mongoctl/processctl.go:func1:2833] [00:27:49.230] Error getting collections for db config : [00:27:49.230] Error running ListCollections with filter map[] : connection(ip-10-122-63-219.ec2.internal:20100[-79]) incomplete read of message header: read tcp 10.122.63.219:54110->10.122.63.219:20100: use of closed network connection
      [2022-04-21T00:27:49.230+0000] [.debug] [src/cachingresolver/cachingresolver.go:virtualDnsServerMainLoop:72] <Virtual DNS Server 204> [00:27:49.230] Started Virtual DNS Server 204
      [2022-04-21T00:27:49.230+0000] [.error] [src/mongoctl/processctl.go:getCollectionsHelper:2861] [00:27:49.230] Error getting collections for db config : [00:27:49.230] Error executing WithClientFor() for cp=ip-10-122-63-219.ec2.internal:20100 (local=false) connectMode=SingleConnect identityUsed=<nil> : [00:27:49.230] Error getting collections for db config : [00:27:49.230] Error running ListCollections with filter map[] : connection(ip-10-122-63-219.ec2.internal:20100[-79]) incomplete read of message header: read tcp 10.122.63.219:54110->10.122.63.219:20100: use of closed network connection

      And here is the corresponding mongod log:

      2022-04-21T00:27:49.230+0000 D2 COMMAND [conn81] run command config.$cmd { listCollections: 1, filter: {}, nameOnly: true, cursor: {}, lsid: { id: UUID("61e7ec1e-9a4f-4132-bfc1-908c938e3c31") }, $db: "config" }
      2022-04-21T00:27:49.230+0000 D2 NETWORK [conn80] Session from 10.122.63.219:54108 encountered a network error during SourceMessage: HostUnreachable: Connection closed by peer
      2022-04-21T00:27:49.230+0000 D2 NETWORK [conn79] Session from 10.122.63.219:54106 encountered a network error during SourceMessage: HostUnreachable: Connection closed by peer
      2022-04-21T00:27:49.230+0000 I COMMAND [conn81] command config.$cmd appName: "MongoDB Automation Agent v12.0.0.7500 (git: 4b476d0c841cee05d8e870a4264ab1106c9ed903)" command: listCollections { listCollections: 1, filter: {}, nameOnly: true, cursor: {}, lsid: { id: UUID("61e7ec1e-9a4f-4132-bfc1-908c938e3c31") }, $db: "config" } numYields:0 reslen:171 locks:{ ParallelBatchWriterMode: { acquireCount:
      { r: 1 }
      }, ReplicationStateTransition: { acquireCount:
      { w: 1 }
      }, Global: { acquireCount:
      { r: 1 }
      }, Database: { acquireCount:
      { r: 1 }
      }, Collection: { acquireCount:
      { r: 2 }
      }, Mutex: { acquireCount:
      { r: 1 }
      } } protocol:op_msg 0ms
      2022-04-21T00:27:49.230+0000 I NETWORK [conn79] end connection 10.122.63.219:54106 (16 connections now open)
      2022-04-21T00:27:49.230+0000 I NETWORK [conn80] end connection 10.122.63.219:54108 (17 connections now open)
      2022-04-21T00:27:49.230+0000 D3 NETWORK [conn79] Cancelling outstanding I/O operations on connection to 10.122.63.219:54106
      2022-04-21T00:27:49.230+0000 D2 NETWORK [conn82] Session from 10.122.63.219:54120 encountered a network error during SourceMessage: HostUnreachable: Connection closed by peer
      2022-04-21T00:27:49.230+0000 D3 NETWORK [conn80] Cancelling outstanding I/O operations on connection to 10.122.63.219:54108
      2022-04-21T00:27:49.230+0000 I NETWORK [conn82] end connection 10.122.63.219:54120 (15 connections now open)
      2022-04-21T00:27:49.230+0000 D3 NETWORK [conn82] Cancelling outstanding I/O operations on connection to 10.122.63.219:54120
      2022-04-21T00:27:49.230+0000 D2 NETWORK [conn81] Session from 10.122.63.219:54110 encountered a network error during SourceMessage: HostUnreachable: Connection closed by peer
      2022-04-21T00:27:49.230+0000 I NETWORK [conn81] end connection 10.122.63.219:54110 (14 connections now open)
      2022-04-21T00:27:49.230+0000 D3 NETWORK [conn81] Cancelling outstanding I/O operations on connection to 10.122.63.219:54110
      

       

      Here is a link to the code that is being called which ends up failing due to a disconnect:
      https://github.com/10gen/mms-automation/blob/745524a911e35371fd2258183ec36adc2f1aaf41/go_planner/src/com.tengen/cm/mongoctl/processctl.go#L2878

       
      I noticed this in the listCollections docs page:

      Client Disconnection
      Starting in MongoDB 4.2, if the client that issued listCollections disconnects before the operation completes, MongoDB marks listCollections for termination using killOp.

      While it doesn't sound like this would cause a disconnect, it is odd that we are seeing these errors mostly on version 4.2 (and once in a while on a higher version, but never on earlier versions). Maybe before 4.2 the disconnect was always an issue but the driver would reconnect and still be able to get the results of the listCollections command whereas after 4.2 it can't?
       
      This docs page also mentions a few changes that were made to collStats starting in version 4.2: https://www.mongodb.com/docs/manual/reference/command/collStats/

            Assignee:
            preston.vasquez@mongodb.com Preston Vasquez
            Reporter:
            vladimir.isakov@mongodb.com Vladimir Isakov (Inactive)
            Votes:
            0 Vote for this issue
            Watchers:
            3 Start watching this issue

              Created:
              Updated:
              Resolved: