[SERVER-53898] High-frequency connection attempts to a mongod that has shut down Created: 20/Jan/21  Updated: 12/May/21  Resolved: 12/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6 Required, 3.4 Required
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Li xin Assignee: Edwin Zhou
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongodb_dir.tar.gz    
Operating System: ALL
Steps To Reproduce:

In production, I call "rs.remove()", and mongos repeatedly prints huge amounts of log output like this (1 GB+ of logs every minute):

2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
2020-05-25T18:30:42.175+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176106 finished with response: HostUnreachable: Connection refused
2020-05-25T18:30:42.175+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176106, db:admin, request.cmdobj: { isMaster: 1 },request.metadata:{}
2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection
2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
2020-05-25T18:30:42.176+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176107 finished with response: HostUnreachable: Connection refused
2020-05-25T18:30:42.176+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176107, db:admin, request.cmdobj: { isMaster: 1 },request.metadata:{}
2020-05-25T18:30:42.176+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
2020-05-25T18:30:42.176+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
2020-05-25T18:30:42.177+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176108 finished with response: HostUnreachable: Connection refused
2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176108, db:admin, request.cmdobj: { isMaster: 1 },request.metadata:{}
2020-05-25T18:30:42.177+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection
2020-05-25T18:30:42.177+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176109 finished with response: HostUnreachable: Connection refused
2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176109, db:admin, request.cmdobj: { isMaster: 1 },request.metadata:{}
2020-05-25T18:30:42.178+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
2020-05-25T18:30:42.178+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176110 finished with response: HostUnreachable: Connection refused
2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176110, db:admin, request.cmdobj: { isMaster: 1 },request.metadata:{}
2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection
2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176111 finished with response: HostUnreachable: Connection refused
2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176111, db:admin, request.cmdobj: { isMaster: 1 },request.metadata:{}
2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176112 finished with response: HostUnreachable: Connection refused
2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176112, db:admin, request.cmdobj: { isMaster: 1 },request.metadata:{}
2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection
2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176113 finished with response: HostUnreachable: Connection refused
2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176113, db:admin, request.cmdobj: { isMaster: 1 },request.metadata:{}
2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452

 

I reproduced the problem locally:

1. On mongos, set db.getMongo().setReadPref('secondary').

2. Set a breakpoint in NetworkInterfaceASIO::_asyncRunCommand at
asyncSendMessage(cmd.conn().stream(), &cmd.toSend(), std::move(sendMessageCallback));

3. Execute a find operation, e.g. db.test.find({userId: 3000}).

4. When the breakpoint from step 2 is hit, confirm that the op is "isMaster", then shut down the secondary that the find operation targets.

5. Huge amounts of log output are then printed, and they never stop.

6. If instead the target's physical machine is shut down, the failure in step 5's constant connection attempts changes from "connection refused" to an unreachable host, and the NetworkInterfaceASIO-TaskExecutorPool becomes very slow to process other ops.

Participants:

 Comments   
Comment by Edwin Zhou [ 12/May/21 ]

Hi bingfeng198878@163.com

We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please try reproducing this problem on the WiredTiger storage engine and we will reopen the ticket.

Best,
Edwin

Comment by Edwin Zhou [ 26/Apr/21 ]

Hi bingfeng198878@163.com,

Thank you for your response. We recommend upgrading your storage engine to WiredTiger which has been the default storage engine since MongoDB v3.2. From here, you will be able to upgrade your MongoDB version to 4.0 or later.

Once you're able to upgrade to MongoDB 4.0, we'd be interested in knowing if this problem persists!

Best,
Edwin

Comment by Li xin [ 26/Apr/21 ]

Hi Edwin Zhou,

I can't upgrade to MongoDB 4.0 because I use MongoRocks, and it can't be upgraded (MongoRocks is not supported past version 3.6).

Comment by Edwin Zhou [ 21/Apr/21 ]

Hi bingfeng198878@163.com,

We still need additional information to diagnose the problem. If this is still an issue for you, would you please try upgrading to MongoDB 4.0 or later and see if this problem persists?

Thanks,
Edwin

Comment by Edwin Zhou [ 23/Mar/21 ]

Hi bingfeng198878@163.com,

I appreciate your patience and I apologize for the delay in my response. It appears that you're successful in reproducing the problem by setting breakpoints in the server code. However, we typically diagnose problems in production, and without a successful reproduction without modifying the code, we're unable to identify the steps that lead up to this behavior.

Since MongoDB 3.6 is set to EOL next month per our support policy, can you try upgrading to MongoDB 4.0 or later and see if this issue persists?

Best,
Edwin

Comment by Li xin [ 05/Feb/21 ]

Hi Edwin Zhou,

I'm sorry, I don't know, because I can't build 4.4 on my macOS machine.

Comment by Edwin Zhou [ 04/Feb/21 ]

Hi bingfeng198878@163.com,

Thanks for your clarification. Are you able to reliably reproduce this issue in 4.4 as well?

Best,
Edwin

Comment by Li xin [ 04/Feb/21 ]

Hi Edwin Zhou,

First question:

When mongos creates a connection to a mongod, it calls "std::unique_ptr<NetworkInterfaceASIO::AsyncOp> ASIOConnection::makeAsyncOp(ASIOConnection* conn)", which creates a "RemoteCommandRequest" with the "isMaster" command.

Second question:

auto target = [&] {
    return std::max(_parent->_options.minConnections,
                    std::min(_requests.size() + _checkedOutPool.size(),
                             _parent->_options.maxConnections));
};

When mongos is started with the parameter "ShardingTaskExecutorPoolMinSize" set to 2, "target()" is more than 1. I logged the value to confirm.
 

Comment by Edwin Zhou [ 03/Feb/21 ]

Hi bingfeng198878@163.com,

Thanks for your added details. I have a few things I'd like to clarify:

I'm not sure what you mean by 

when mongos want to connect a mongod for the first time with "IsMaster" command ,shut down the mongod

The mongos will return its own role, but will not connect to another mongod instance. Is there a different behavior you're trying to mimic by doing this?

And in "ConnectionPool::SpecificPool::spawnConnections" function, "target()" is more than 1.

How are you making sure that "target()" is more than 1?

Thanks,
Edwin

Comment by Li xin [ 03/Feb/21 ]

Hi Edwin Zhou,

I think the core steps are: when mongos connects to a mongod for the first time with the "isMaster" command, shut down that mongod, and ensure that in the "ConnectionPool::SpecificPool::spawnConnections" function, "target()" is more than 1.

If both conditions are met, the problem occurs regardless of whether the mongod is a primary or a secondary.

Comment by Edwin Zhou [ 02/Feb/21 ]

Hi bingfeng198878@163.com,

I'm currently unable to reproduce your issue, but I'll provide my reproduction steps below so you can evaluate what might be missing.

Using MongoDB 3.6.3, I launched a PSS sharded replica set with 1 mongos.
I launched the mongos with the following parameters:

--maxConns 1000 --setParameter ShardingTaskExecutorPoolMaxSize=1 --setParameter taskExecutorPoolSize=1 --setParameter ShardingTaskExecutorPoolMinSize=2 --localThreshold 3 --fork

I then inserted 10 million documents:

db = db.getSiblingDB("test")
coll = db.getCollection("test_col")
coll.drop();
bulk = coll.initializeUnorderedBulkOp();
for (i = 0; i < 10000000; i++) {
  bulk.insert({ _id: i, localField: i });
}
bulk.execute();

and performed a find query with readPref set to "secondary":

db = db.getSiblingDB("test")
coll = db.getCollection("test_col")
db.getMongo().setReadPref("secondary");
coll.find({ localField: 1 })

and shut down the secondary that's performing the op:

mongo "localhost:27019/admin" --eval "db.shutdownServer()"

Looking into mongos.log, I don't see many log lines printed out. On the other hand, the primary mongod.log is printing out these 11 lines every 2 seconds:

2021-02-02T13:14:19.088-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to localhost:27020
2021-02-02T13:14:19.090-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:27020 - HostUnreachable: Connection refused
2021-02-02T13:14:19.090-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:27020 due to failed operation on a connection
2021-02-02T13:14:19.090-0500 I REPL_HB  [replexec-1] Error in heartbeat (requestId: 982) to localhost:27020, response status: HostUnreachable: Connection refused
2021-02-02T13:14:19.090-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to localhost:27020
2021-02-02T13:14:19.091-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:27020 - HostUnreachable: Connection refused
2021-02-02T13:14:19.091-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:27020 due to failed operation on a connection
2021-02-02T13:14:19.091-0500 I REPL_HB  [replexec-3] Error in heartbeat (requestId: 984) to localhost:27020, response status: HostUnreachable: Connection refused
2021-02-02T13:14:19.091-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to localhost:27020
2021-02-02T13:14:19.092-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:27020 - HostUnreachable: Connection refused
2021-02-02T13:14:19.092-0500 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:27020 due to failed operation on a connection
2021-02-02T13:14:19.092-0500 I REPL_HB  [replexec-3] Error in heartbeat (requestId: 986) to localhost:27020, response status: HostUnreachable: Connection refused

However, that doesn't match the issue you're seeing in the mongos.log file. Is there anything missing from my reproduction steps that could help narrow down what's causing the issue you're observing?

Thanks,
Edwin

Comment by Li xin [ 01/Feb/21 ]

Hi Edwin Zhou,

In production:

1. I send some query requests to a secondary.

2. I shut down the secondary.

Sometimes a mongos will hit this problem, but not always.

So I used a debugger and set a breakpoint to reproduce the problem.

 

Comment by Edwin Zhou [ 29/Jan/21 ]

Hi bingfeng198878@163.com,

Unfortunately, without a $dbpath covering the timeline of the incident, it is difficult to diagnose why excessive log lines are being printed in production. Are you able to provide a series of steps or events leading up to the incident in production?

Best,
Edwin

Comment by Li xin [ 29/Jan/21 ]

Hi Edwin Zhou,

It's a problem in production; I reproduced it under a debugger. But I can't provide the "dbpath" from production. In production we have a flood of requests to the secondaries; when a secondary shuts down, some mongos instances hit the problem and others don't.

Comment by Edwin Zhou [ 28/Jan/21 ]

Hi bingfeng198878@163.com,

We typically diagnose issues that occur in production. Are you able to reproduce this issue without setting any breakpoints? Can you reproduce it in production?

Thanks,
Edwin

Comment by Li xin [ 28/Jan/21 ]

Hi Edwin Zhou,

I don't understand what you mean.

I hit the issue on macOS with Xcode. In step 2, I set a breakpoint in the "NetworkInterfaceASIO::_asyncRunCommand" function at the line "asyncSendMessage(cmd.conn().stream(), &cmd.toSend(), std::move(sendMessageCallback));".

For example:

I query data like db.uuid.find({userId: 1}). I know "userId: 1" should route to shard1 on port 16451 (a secondary). When the breakpoint is hit and "op->request().target.port()" equals 16451, I kill the process listening on port 16451.

Comment by Edwin Zhou [ 27/Jan/21 ]

Hi bingfeng198878@163.com,

I requested it as it helps to pinpoint where and when the incident occurs, and gives us context surrounding the incident you're observing. However, I'm unable to find the incident that you're referring to in your data.

Could you clarify step 2 of your reproduction steps?

2.make breakpoint at   NetworkInterfaceASIO::_asyncRunCommand  ‘s
asyncSendMessage(cmd.conn().stream(), &cmd.toSend(), std::move(sendMessageCallback)); 

I'm not sure here where you are setting your breakpoint. Could you go into more detail?

Thanks,
Edwin

Comment by Li xin [ 27/Jan/21 ]

Hi Edwin Zhou,

I have uploaded the dbpath, but it's not important; the issue can appear with any database.

Comment by Edwin Zhou [ 26/Jan/21 ]

Hi bingfeng198878@163.com,

Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket?

Kind regards,
Edwin

Comment by Li xin [ 22/Jan/21 ]

Supplement: the mongos was started with: ./mongos --configdb rs_config/127.0.0.1:15350,127.0.0.1:15351,127.0.0.1:15352 --bind_ip 0.0.0.0 --port 2910 --logpath /Users/xinli/mongodb_dir/mongos/logs --maxConns 1000 --setParameter ShardingTaskExecutorPoolMaxSize=1 --setParameter taskExecutorPoolSize=1 --setParameter ShardingTaskExecutorPoolMinSize=2 --localThreshold 3

 

if "ShardingTaskExecutorPoolMinSize" set 1,the problem won't appear.

Generated at Thu Feb 08 05:32:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.