[SERVER-34487] mongod server does not accept new connections after random time period Created: 15/Apr/18  Updated: 15/Sep/18  Resolved: 23/Aug/18

Status: Closed
Project: Core Server
Component/s: Networking, Sharding, Stability
Affects Version/s: 3.4.14
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Gal Ben David Assignee: Kelsey Schubert
Resolution: Done Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: HTML File st, HTML File st1
Operating System: ALL
Participants:

 Description   

Our MongoDB cluster consists of 6 servers; each one runs 1 config server, 1 mongos, and 1 mongod process.
After a random period of time, the primary shard, which runs on the 1st server, stops accepting connections.
The whole architecture runs in Docker containers on Google Cloud.
We're using the official image.
We're running with the following configuration:

echo never > /sys/kernel/mm/transparent_hugepage/defrag;
echo never > /sys/kernel/mm/transparent_hugepage/enabled;
echo 65535 > /proc/sys/net/core/somaxconn;
mongod \
    --configsvr \
    --replSet config_rs \
    --storageEngine wiredTiger \
    --profile=1 \
    --slowms=300 \
    --port=27019 \
    --dbpath /data/db \
    --journalCommitInterval 500 \
    --setParameter failIndexKeyTooLong=false \
    --noIndexBuildRetry

Each of our servers runs the following commands at startup:

sysctl -w vm.swappiness=10
sysctl -w vm.vfs_cache_pressure=50
 
sysctl -w fs.file-max=100000
sysctl -w net.core.somaxconn=65535
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
 
sysctl -w net.ipv4.tcp_max_syn_backlog=4096
sysctl -w net.ipv4.tcp_syncookies=1
sysctl -w net.ipv4.ip_forward=1
 
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216
 
sysctl -w vm.overcommit_memory=1
 
echo never > /sys/kernel/mm/transparent_hugepage/enabled
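
For reference, a quick way to confirm these settings actually took effect on a host is something like the following (expected values noted in the comments):

# Transparent hugepages should report [never]
cat /sys/kernel/mm/transparent_hugepage/enabled
cat /sys/kernel/mm/transparent_hugepage/defrag

# Kernel networking and memory settings applied above
sysctl net.core.somaxconn net.core.rmem_max net.core.wmem_max
sysctl vm.swappiness vm.overcommit_memory fs.file-max

# Open-file limit for the current shell
ulimit -n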

The mongos log looks like this:

2018-04-15T08:35:46.773+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244876 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
2018-04-15T08:35:46.774+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-1-0] Connecting to database-shard-1:27018                                                                           
2018-04-15T08:35:46.775+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-1-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244878 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
2018-04-15T08:35:46.775+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244880 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
2018-04-15T08:35:46.776+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Connecting to database-shard-1:27018                                                                           
2018-04-15T08:35:46.777+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Connecting to database-shard-1:27018                                                                           
2018-04-15T08:35:46.777+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to database-shard-1:27018                                                                           
2018-04-15T08:35:46.777+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-1-0] Connecting to database-shard-1:27018                                                                           
2018-04-15T08:35:46.778+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Connecting to database-shard-1:27018                                                                           
2018-04-15T08:35:46.781+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244882 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
2018-04-15T08:35:46.782+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Connecting to database-shard-1:27018                                                                           
2018-04-15T08:35:46.782+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244884 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
2018-04-15T08:35:46.784+0000 I -        [conn60513] end connection 10.240.8.112:49550 (2205 connections now open)                                                                                                    
2018-04-15T08:35:46.796+0000 I -        [conn60212] end connection 10.240.1.110:47314 (2204 connections now open)                                                                                                    
2018-04-15T08:35:46.801+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to database-shard-1:27018                                                                           
2018-04-15T08:35:46.803+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-1-0] Connecting to database-shard-1:27018                                                                           
2018-04-15T08:35:46.815+0000 I -        [conn60119] end connection 10.240.8.152:43718 (2203 connections now open)                                                                                                    
2018-04-15T08:35:46.816+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Connecting to database-shard-1:27018                                                                           
2018-04-15T08:35:46.819+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-2-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244888 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
2018-04-15T08:35:46.822+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Connecting to database-shard-1:27018                                                                           
2018-04-15T08:35:46.825+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-3-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244890 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
2018-04-15T08:35:46.829+0000 I -        [conn60484] end connection 10.240.8.10:59090 (2202 connections now open)                                                                                                     
2018-04-15T08:35:46.832+0000 I -        [conn60223] end connection 10.240.0.233:55630 (2201 connections now open)                                                                                                    
2018-04-15T08:35:46.892+0000 I -        [conn60133] end connection 10.240.1.12:42636 (2200 connections now open)                                                                                                     
2018-04-15T08:35:46.922+0000 I -        [conn60524] end connection 10.240.0.184:53948 (2199 connections now open)                                                                                                    
2018-04-15T08:35:46.946+0000 I -        [conn60451] end connection 10.240.1.0:51832 (2198 connections now open)                                                                                                      
2018-04-15T08:35:46.966+0000 I -        [conn60516] end connection 10.240.8.43:43668 (2197 connections now open)                                                                                                     
2018-04-15T08:35:46.968+0000 I -        [conn60341] end connection 10.240.1.17:43064 (2196 connections now open)                                                                                                     
2018-04-15T08:35:47.014+0000 I NETWORK  [thread2] connection accepted from 10.240.1.62:36442 #60833 (2196 connections now open)                                                                                      
2018-04-15T08:35:47.016+0000 I NETWORK  [conn60833] received client metadata from 10.240.1.62:36442 conn60833: { driver: { name: "nodejs", version: "2.2.34" }, os: { type: "Linux", name: "linux", architecture: "x64", version: "4.10.0-38-generic" }, platform: "Node.js v8.11.1, LE, mongodb-core: 2.1.18" }               
2018-04-15T08:35:47.064+0000 I -        [conn60291] end connection 10.240.0.150:58800 (2196 connections now open)                                                                                                    
2018-04-15T08:35:47.077+0000 I -        [conn60501] end connection 10.240.8.104:35454 (2195 connections now open)                                                                                                    
2018-04-15T08:35:47.080+0000 I -        [conn59862] end connection 10.240.5.229:60850 (2194 connections now open)                                                                                                    
2018-04-15T08:35:47.129+0000 I ASIO     [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to database-shard-1:27018 - NetworkInterfaceExceededTimeLimit: Operation timed out, request was RemoteCommand 2244892 -- target:database-shard-1:27018 db:admin cmd:{ isMaster: 1 }  
2018-04-15T08:35:47.206+0000 I -        [conn60114] end connection 10.240.5.202:48330 (2193 connections now open)

I tried to connect to the mongod directly and got a connection timeout as well.
This indicates that the mongod process itself is the problem. It happens only on the 1st mongod server; the other 5 are fine at the moment.
I've checked the number of open files, ulimit, somaxconn... all looked fine.
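
For reference, the direct check can be repeated with something like this (host and port taken from the mongos log above; nc is optional and only tests the TCP handshake):

# Try the MongoDB handshake directly against the shard member
mongo --host database-shard-1 --port 27018 --eval 'db.adminCommand({ ping: 1 })'

# Check whether the TCP port completes a handshake at all
nc -vz -w 5 database-shard-1 27018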

I'm attaching an strace of the mongod process, taken from inside the Docker container:
st, st1



 Comments   
Comment by Kelsey Schubert [ 22/Aug/18 ]

Hi benshu,

Thanks for providing the additional detail. After reviewing the gdb output, we believe this is a system issue beneath mongod. We can see that a write() call is stuck at the system level:

1 times Thread 133 (Thread 0x7f374c651700 (LWP 9769)):
#0  0x00007f37879b5c1d in write () at ../sysdeps/unix/syscall-template.S:81
#1  0x00007f378794e423 in _IO_new_file_write (f=0x7f3787c802a0 <_IO_2_1_stdout_>, data=0x7f3789257000, n=318) at fileops.c:1253
#2  0x00007f378794dae3 in new_do_write (fp=0x7f3787c802a0 <_IO_2_1_stdout_>, data=0x7f3789257000 "6\" } }, $configServerState: { opTime: { ts: Timestamp(1534226645, 318), t: 13 } }, $db: \"backend_database\" } numYields:0 reslen:370 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquir"..., to_do=318) at fileops.c:530
#3  0x00007f378794f255 in _IO_new_do_write (fp=<optimized out>, data=<optimized out>, to_do=318) at fileops.c:503
#4  0x00007f378794e951 in _IO_new_file_xsputn (f=0x7f3787c802a0 <_IO_2_1_stdout_>, data=<optimized out>, n=1) at fileops.c:1324
#5  0x00007f3787944aad in __GI__IO_fwrite (buf=<optimized out>, size=1, count=1, fp=0x7f3787c802a0 <_IO_2_1_stdout_>) at iofwrite.c:43
#6  0x00005646d4043613 in std::ostream::write(char const*, long) ()
#7  0x00005646d3ecab31 in mongo::logger::MessageEventDetailsEncoder::encode(mongo::logger::MessageEventEphemeral const&, std::ostream&) ()
#8  0x00005646d2bc4a45 in mongo::logger::ConsoleAppender<mongo::logger::MessageEventEphemeral, mongo::Console>::append(mongo::logger::MessageEventEphemeral const&) ()
#9  0x00005646d3ecace4 in mongo::logger::LogDomain<mongo::logger::MessageEventEphemeral>::append(mongo::logger::MessageEventEphemeral const&) ()
#10 0x00005646d3ec9f69 in mongo::logger::LogstreamBuilder::~LogstreamBuilder() ()
#11 0x00005646d28bc44b in mongo::ServiceEntryPointMongod::handleRequest(mongo::OperationContext*, mongo::Message const&) ()
#12 0x00005646d28c887a in mongo::ServiceStateMachine::_processMessage(mongo::ServiceStateMachine::ThreadGuard) ()
#13 0x00005646d28c4237 in mongo::ServiceStateMachine::_runNextInGuard(mongo::ServiceStateMachine::ThreadGuard) ()
#14 0x00005646d28c7671 in std::_Function_handler<void (), mongo::ServiceStateMachine::_scheduleNextWithGuard(mongo::ServiceStateMachine::ThreadGuard, mongo::transport::ServiceExecutor::ScheduleFlags, mongo::transport::ServiceExecutorTaskName, mongo::ServiceStateMachine::Ownership)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#15 0x00005646d37d3a42 in mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void ()>, mongo::transport::ServiceExecutor::ScheduleFlags, mongo::transport::ServiceExecutorTaskName) ()
#16 0x00005646d28c309f in mongo::ServiceStateMachine::_scheduleNextWithGuard(mongo::ServiceStateMachine::ThreadGuard, mongo::transport::ServiceExecutor::ScheduleFlags, mongo::transport::ServiceExecutorTaskName, mongo::ServiceStateMachine::Ownership) ()
#17 0x00005646d28c55e5 in mongo::ServiceStateMachine::_sourceCallback(mongo::Status) ()
#18 0x00005646d28c5edb in mongo::ServiceStateMachine::_sourceMessage(mongo::ServiceStateMachine::ThreadGuard) ()
#19 0x00005646d28c42bd in mongo::ServiceStateMachine::_runNextInGuard(mongo::ServiceStateMachine::ThreadGuard) ()
#20 0x00005646d28c7671 in std::_Function_handler<void (), mongo::ServiceStateMachine::_scheduleNextWithGuard(mongo::ServiceStateMachine::ThreadGuard, mongo::transport::ServiceExecutor::ScheduleFlags, mongo::transport::ServiceExecutorTaskName, mongo::ServiceStateMachine::Ownership)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#21 0x00005646d37d3fa5 in std::_Function_handler<void (), mongo::transport::ServiceExecutorSynchronous::schedule(std::function<void ()>, mongo::transport::ServiceExecutor::ScheduleFlags, mongo::transport::ServiceExecutorTaskName)::{lambda()#1}>::_M_invoke(std::_Any_data const&) ()
#22 0x00005646d3d960b4 in mongo::(anonymous namespace)::runFunc(void*) ()
#23 0x00007f3787c8d064 in start_thread (arg=0x7f374c651700) at pthread_create.c:309
#24 0x00007f37879c262d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111

From the gdb analysis, there are 441 threads piled up behind this operation, waiting to write to the log. Consequently, as soon as a new connection is established, it hangs waiting for the logger. I'd recommend reviewing syslog or dmesg for clues about the root cause of this behavior.
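
A rough way to check this on the affected host (the container name is a placeholder; the stack trace above shows mongod writing to stdout, so in a Docker setup the reader of that pipe is the logging driver):

# Where does mongod's stdout point? In a typical Docker setup this is a pipe
# read by the container's logging driver.
ls -l /proc/$(pidof mongod)/fd/1

# Kernel messages around the incident (I/O errors, memory or cgroup pressure)
dmesg -T | tail -n 100

# With the json-file logging driver, check the size and disk of the log file
docker inspect --format '{{.LogPath}}' <container-name>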

Kind regards,
Kelsey

Comment by Hagay Ben Shushan [ 22/Aug/18 ]

Hi Kelsey,

I've uploaded the proper shard-1 logs.

The issue began at 2018-08-14T06:04:07

It was resolved at ~07:20 the same day, but only by restarting the mongod on shard-1, which was the affected node.

 

Thanks,

Comment by Kelsey Schubert [ 21/Aug/18 ]

Hi benshu,

To help speed our investigation as we look through the data you've provided, would you please clarify exactly when the issue began, when it was resolved, and which node was affected?

Would you please also double-check the logs on shard-1 and re-upload the files for this node, as the logs we've received for it are corrupted.

Thank you,
Kelsey

Comment by Hagay Ben Shushan [ 20/Aug/18 ]

Hi,

I'm an associate of Gal Ben David, and have uploaded all the required information via the upload portal.

I would appreciate it if you could reopen the issue, as it is still relevant.

 

Thanks,

Comment by Dmitry Agranat [ 21/Jun/18 ]

Hi wavenator,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide the requested information and we will reopen the ticket.

Thanks,
Dima

Comment by Gal Ben David [ 27/May/18 ]

Hi Dmitry,

The logs I provided were complete. This is part of the problem: the mongod instance stops logging when the problem occurs.

I'll try to provide data from all the instances, although it could take some time.

Comment by Dmitry Agranat [ 27/May/18 ]

Hi wavenator,

We still need some additional information to progress this investigation. If this is still an issue for you, can you please provide the following covering the time of the incident:

  • An archive of the diagnostic.data directory from all Primary members
  • Archived mongos logs from all mongos servers, covering at least a few hours
  • Archived mongod logs from all mongod servers, covering at least a few hours

Thanks,
Dima

Comment by Dmitry Agranat [ 09/May/18 ]

Hi wavenator,

Unfortunately, the provided mongos and mongod logs do not correlate with the diagnostic.data.

  • mongos logs:

          start: 2018 Apr 22 11:11:24.612
            end: 2018 Apr 22 11:12:01.558
    

  • mongod logs:

          start: 2018 Apr 22 11:06:18.613
            end: 2018 Apr 22 11:06:23.620
    

  • The diagnostic.data ends at 11:06:00

Nevertheless, I was able to spot something suspicious in the provided diagnostic.data. In order to progress this investigation, please provide the following covering the time of the incident:

  • An archive of the diagnostic.data directory from all Primary members (a minimal sketch for this is shown after the list)
  • Archived mongos logs from all mongos servers, covering at least a few hours
  • Archived mongod logs from all mongod servers, covering at least a few hours
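
As a minimal sketch for the first item, assuming the dbpath shown in the description (/data/db), something like this from each Primary should work:

    tar -czf diagnostic-data-$(hostname)-$(date +%Y%m%d).tar.gz -C /data/db diagnostic.data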

Even though I have a good suspect at the moment, I need to correlate all the requested data. In addition, I'd like to understand why the reported issue is impacting only a specific shard's Primary.

Thanks,
Dima

Comment by Kelsey Schubert [ 23/Apr/18 ]

Thanks for uploading the files, wavenator – we're investigating.

Comment by Gal Ben David [ 19/Apr/18 ]

Ok, I'll work on getting all this information ASAP.
We'll need to stop our log truncation so you receive the full logs, and then wait for the next halt. I believe it will happen soon.
I'll add another comment with the results.

Edit:
Uploaded the files through the curl command.
Can you confirm?

Comment by Kelsey Schubert [ 16/Apr/18 ]

Hi wavenator,

Thank you for reporting this issue. So that we can continue to investigate, would you please provide the following information:

  1. An archive of the diagnostic.data directory from the affected node
  2. Complete logs from the affected node
  3. Complete logs from a mongos showing the connection issue
  4. The file produced by the following command executed against the affected node when it is refusing connections:

    gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' > gdb_`date +"%Y-%m-%d_%H-%M-%S"`.txt
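
Since mongod here runs inside a Docker container, an equivalent capture from the host might look like the sketch below (the container name is illustrative, gdb may need to be installed in the container first, and the container may need ptrace capability, e.g. --cap-add=SYS_PTRACE):

    docker exec <mongod-container> sh -c \
        'gdb -p "$(pidof mongod)" -batch -ex "thread apply all bt"' \
        > gdb_`date +"%Y-%m-%d_%H-%M-%S"`.txt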
    

Since these files may exceed the maximum upload limit to JIRA, I've created a secure upload portal for you to use. Files uploaded to this portal are only visible to MongoDB employees investigating this issue and are routinely deleted after some time.

Thank you for your help,
Kelsey
