Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-34487

mongod server does not accept new connections after random time period

    • Type: Icon: Bug Bug
    • Resolution: Done
    • Priority: Icon: Major - P3 Major - P3
    • None
    • Affects Version/s: 3.4.14
    • Component/s: Networking, Sharding, Stability
    • None
    • ALL

      Our mongo cluster consists of 6 different servers, each one contains 1 config server, 1 mongos, and 1 mongod servers.
      After a random time period, the primary shard, which exists on the 1st server, stops accepting connections.
      The whole architecture is running on the top of docker containers, on Google cloud.
      We're using the official image.
      We're running using the following configurations:

      echo never > /sys/kernel/mm/transparent_hugepage/defrag;
      echo never > /sys/kernel/mm/transparent_hugepage/enabled;
      echo 65535 > /proc/sys/net/core/somaxconn;
      mongod \
          --configsvr \
          --replSet config_rs \
          --storageEngine wiredTiger \
          --profile=1 \
          --slowms=300 \
          --port=27019 \
          --dbpath /data/db \
          --journalCommitInterval 500 \
          --setParameter failIndexKeyTooLong=false \
          --noIndexBuildRetry
      

      Each of our servers run at it's startup the following commands:

      sysctl -w vm.swappiness=10
      sysctl -w vm.vfs_cache_pressure=50
      
      sysctl -w fs.file-max=100000
      sysctl -w net.core.somaxconn=65535
      sysctl -w net.core.rmem_max=16777216
      sysctl -w net.core.wmem_max=16777216
      
      sysctl -w net.ipv4.tcp_max_syn_backlog=4096
      sysctl -w net.ipv4.tcp_syncookies=1
      sysctl -w net.ipv4.ip_forward=1
      
      sysctl -w net.core.rmem_max=16777216
      sysctl -w net.core.wmem_max=16777216
      
      sysctl -w vm.overcommit_memory=1
      
      echo never > /sys/kernel/mm/transparent_hugepage/enabled
      

      The mongos log looks like this:

      2018-04-15T08:35:46.773+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244876 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
      2018-04-15T08:35:46.774+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-1-0] Connecting to database-shard-1:27018                                                                           
      2018-04-15T08:35:46.775+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-1-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244878 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
      2018-04-15T08:35:46.775+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244880 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
      2018-04-15T08:35:46.776+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Connecting to database-shard-1:27018                                                                           
      2018-04-15T08:35:46.777+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Connecting to database-shard-1:27018                                                                           
      2018-04-15T08:35:46.777+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to database-shard-1:27018                                                                           
      2018-04-15T08:35:46.777+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-1-0] Connecting to database-shard-1:27018                                                                           
      2018-04-15T08:35:46.778+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Connecting to database-shard-1:27018                                                                           
      2018-04-15T08:35:46.781+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244882 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
      2018-04-15T08:35:46.782+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Connecting to database-shard-1:27018                                                                           
      2018-04-15T08:35:46.782+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244884 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
      2018-04-15T08:35:46.784+0000 I -        [conn60513] end connection 10.240.8.112:49550 (2205 connections now open)                                                                                                    
      2018-04-15T08:35:46.796+0000 I -        [conn60212] end connection 10.240.1.110:47314 (2204 connections now open)                                                                                                    
      2018-04-15T08:35:46.801+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to database-shard-1:27018                                                                           
      2018-04-15T08:35:46.803+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-1-0] Connecting to database-shard-1:27018                                                                           
      2018-04-15T08:35:46.815+0000 I -        [conn60119] end connection 10.240.8.152:43718 (2203 connections now open)                                                                                                    
      2018-04-15T08:35:46.816+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Connecting to database-shard-1:27018                                                                           
      2018-04-15T08:35:46.819+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244888 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
      2018-04-15T08:35:46.822+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Connecting to database-shard-1:27018                                                                           
      2018-04-15T08:35:46.825+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244890 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
      2018-04-15T08:35:46.829+0000 I -        [conn60484] end connection 10.240.8.10:59090 (2202 connections now open)                                                                                                     
      2018-04-15T08:35:46.832+0000 I -        [conn60223] end connection 10.240.0.233:55630 (2201 connections now open)                                                                                                    
      2018-04-15T08:35:46.892+0000 I -        [conn60133] end connection 10.240.1.12:42636 (2200 connections now open)                                                                                                     
      2018-04-15T08:35:46.922+0000 I -        [conn60524] end connection 10.240.0.184:53948 (2199 connections now open)                                                                                                    
      2018-04-15T08:35:46.946+0000 I -        [conn60451] end connection 10.240.1.0:51832 (2198 connections now open)                                                                                                      
      2018-04-15T08:35:46.966+0000 I -        [conn60516] end connection 10.240.8.43:43668 (2197 connections now open)                                                                                                     
      2018-04-15T08:35:46.968+0000 I -        [conn60341] end connection 10.240.1.17:43064 (2196 connections now open)                                                                                                     
      2018-04-15T08:35:47.014+0000 I NETWORK  [thread2] connection accepted from 10.240.1.62:36442 #60833 (2196 connections now open)                                                                                      
      2018-04-15T08:35:47.016+0000 I NETWORK  [conn60833] received client metadata from 10.240.1.62:36442 conn60833: { driver: { name: "nodejs", version: "2.2.34" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "4.10.0-38-generic" }, platform: "Node.js v8.11.1, LE, mongodb-core: 2.1.18" }               
      2018-04-15T08:35:47.064+0000 I -        [conn60291] end connection 10.240.0.150:58800 (2196 connections now open)                                                                                                    
      2018-04-15T08:35:47.077+0000 I -        [conn60501] end connection 10.240.8.104:35454 (2195 connections now open)                                                                                                    
      2018-04-15T08:35:47.080+0000 I -        [conn59862] end connection 10.240.5.229:60850 (2194 connections now open)                                                                                                    
      2018-04-15T08:35:47.129+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244892 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
      2018-04-15T08:35:47.206+0000 I -        [conn60114] end connection 10.240.5.202:48330 (2193 connections now open)
      

      I tried to connect to the mongod directly, and I've got a connection timed out as well.
      This indicates that the mongod process is the problem. It happens only in the 1st mongod server and all the other 5 are ok at the moment.
      I've checked number of open files, ulimit, somaxcon... all looked fine.

      I'm attaching strace of the mongod process from inside the docker container:
      st st1

        1. st
          2.11 MB
        2. st1
          446 kB

            Assignee:
            kelsey.schubert@mongodb.com Kelsey Schubert
            Reporter:
            wavenator Gal Ben David
            Votes:
            1 Vote for this issue
            Watchers:
            8 Start watching this issue

              Created:
              Updated:
              Resolved: