[SERVER-30261] too many connections to a mongod instance will botch performance Created: 21/Jul/17  Updated: 07/Nov/17  Resolved: 29/Sep/17

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Tudor Aursulesei Assignee: Mark Agarunov
Resolution: Incomplete Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

In some workloads, there are a lot of connections to one or more mongod instances running as a shard. We then start getting these errors in dmesg:
TCP: request_sock_TCP: Possible SYN flooding on port 10105. Sending cookies. Check SNMP counters.
TCP: request_sock_TCP: Possible SYN flooding on port 10104. Sending cookies. Check SNMP counters.

After the workload subsides, the mongod instance responds very slowly even to very simple queries. If we restart the mongod instances, the problem goes away. We have increased net.core.somaxconn, net.ipv4.tcp_max_syn_backlog, and TCP memory (rmem, wmem), but the issue is not fixed. Is this old ticket still related?
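For reference, the kernel tunables mentioned above are typically raised along these lines (a sketch only; the values are illustrative, not a recommendation, and the right numbers depend on the host):

```
# /etc/sysctl.d/99-mongod-net.conf  (illustrative values)
net.core.somaxconn = 4096
net.ipv4.tcp_max_syn_backlog = 8192
# min / default / max buffer sizes, in bytes
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
```

Note that net.core.somaxconn only raises the ceiling; the listener still gets whatever backlog the application passes to listen(2), clamped to that ceiling.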

https://jira.mongodb.org/browse/SERVER-2554

Thank you



 Comments   
Comment by Kelsey Schubert [ 29/Sep/17 ]

Hi thestick613,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Kelsey

Comment by Ramon Fernandez Marina [ 14/Sep/17 ]

thestick613, we have not been able to reproduce this issue with the binaries we distribute. Have you by any chance compiled your own binaries? If not, I'm afraid without the information Mark requested above we'll have to close this ticket.

Thanks,
Ramón.

Comment by Tudor Aursulesei [ 22/Aug/17 ]

Moving from 3.2 to 3.4 improved the performance significantly. We still get the syncookie error, but the server is more stable. I suspect this is because of the new replication engine in 3.4.

Comment by Tudor Aursulesei [ 22/Aug/17 ]

Hello,

You can reproduce this with a brand-new mongo setup; there is no need for any log files or diagnostic.data. See this comment. I am using Ubuntu 16.04.2 LTS, mongod db version v3.4.6 (git version: c55eb86ef46ee7aede3b1e2a5d184a7df4bfb5b5), and a generic Ubuntu kernel: Linux mongo-rs1 4.4.0-83-generic #106-Ubuntu SMP Mon Jun 26 17:54:43 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux.

Comment by Mark Agarunov [ 22/Aug/17 ]

Hello thestick613,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please provide the following:

  • The complete log files from mongod and mongos
  • Please archive and upload the $dbpath/diagnostic.data directory
  • Please provide the version of MongoDB being used
  • Please provide the operating system, version and kernel version being used.

Thanks,
Mark

Comment by Mark Agarunov [ 08/Aug/17 ]

Hello thestick613,

Thank you for the report. Unfortunately I have not yet been able to reproduce this issue so I would like to request some additional information to better diagnose the behavior. For all affected nodes, please provide the following:

  • The complete log files from mongod and mongos
  • Please archive and upload the $dbpath/diagnostic.data directory
  • Please provide the version of MongoDB being used
  • Please provide the operating system, version and kernel version being used.

Thanks,
Mark

Comment by Tudor Aursulesei [ 22/Jul/17 ]

This script manages to generate the kernel message:

TCP: request_sock_TCP: Possible SYN flooding on port 10104. Sending cookies.  Check SNMP counters.

from gevent import monkey
monkey.patch_all()

import pymongo
import gevent
import random

# Process lives in multiprocessing, not multiprocessing.pool
from multiprocessing import Process

def one_connection():
#    gevent.sleep(random.random() * 4)
    pm = pymongo.MongoClient('10.10.10.100:10104')
#    gevent.sleep(random.random() * 4)
    for j in range(100):
        pm.admin.command({'ping': 1})
        gevent.sleep(0.1)

def on_core(per_cpu_threads):
    # spawn a batch of greenlets, each holding its own connection
    tasks = []
#    gevent.sleep(random.random() * 4)
    for i in range(per_cpu_threads):
        tasks.append(gevent.spawn(one_connection))
    gevent.joinall(tasks)

if __name__ == "__main__":
    tcpus = 72
    per_cpu_threads = 10
    print(tcpus)

    procs = []
    for coreid in range(3 * tcpus):
        procs.append(Process(target=on_core, args=(per_cpu_threads,)))

    for proc in procs:
        proc.start()

    for proc in procs:
        proc.join()

If you uncomment the sleeps, the peak number of connections is still 4320, but there is no more SYNCOOKIE warning.

Comment by Tudor Aursulesei [ 21/Jul/17 ]

strace shows that the listen backlog of 128 is still there, which is too low.

[pid 18057] listen(7, 128)              = 0
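The 128 in that trace is the backlog argument to listen(2): the size of the accept queue, which the kernel additionally clamps to net.core.somaxconn. A minimal, self-contained sketch of the call in question (loopback only, port chosen by the OS, numbers illustrative):

```python
import socket

# The second argument to listen() is the accept-queue backlog -- the same
# value strace showed mongod passing (128). The kernel silently clamps it
# to net.core.somaxconn, so raising somaxconn alone does not help when the
# application itself only asks for 128; a burst of connects that overflows
# this queue is what triggers the SYN-cookie messages in dmesg.
srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
srv.listen(128)
port = srv.getsockname()[1]

cli = socket.create_connection(("127.0.0.1", port))
conn, _ = srv.accept()
conn.sendall(b"ok")
reply = cli.recv(2)
print(reply.decode())        # handshake and round trip completed
for s in (conn, cli, srv):
    s.close()
```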

Comment by Tudor Aursulesei [ 21/Jul/17 ]

We have been having this problem on 3.2; I upgraded to 3.4 today. The problem occurs on the shard servers, not on the mongos instances, but maybe if we tune down the mongos instances they will be easier on the mongod instances. Our application makes a lot of connections via the mongos instances, but it also connects individually to each shard for some read-only queries. When a mongod instance is restarted, the former secondary that gets promoted to primary also generates this error. The sudden burst of connections seems to be the problem, not the total number.

Comment by Ramon Fernandez Marina [ 21/Jul/17 ]

What version of MongoDB are you using? Depending on which version you are on, you may be able to use the knobs in SERVER-25027 to tweak connection pooling in mongos to better suit your needs / node capacity.
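The knobs referenced in SERVER-25027 are server parameters set on the mongos. As an illustrative sketch (the parameter names below are assumed from that ticket, and the values are arbitrary, not recommendations):

```
# mongos config file fragment (YAML)
setParameter:
  taskExecutorPoolSize: 4
  ShardingTaskExecutorPoolMinSize: 1
  ShardingTaskExecutorPoolMaxSize: 4
```

Capping the pool sizes limits how many connections each mongos opens toward the mongod shard members, which is relevant here since the reporter suspects connection bursts rather than steady-state counts.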

Generated at Thu Feb 08 04:23:11 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.