|
Attachments:
|
mongodb_dir.tar.gz
|
| Operating System: |
ALL
|
| Steps To Reproduce: |
Online,I call "rs.remove()" repeatedly,mongos will print huge amounts of log like this(1GB+ logs everyminute
):
2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
|
2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
|
2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
|
2020-05-25T18:30:42.175+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176106 finished with response: HostUnreachable: Connection refused
|
2020-05-25T18:30:42.175+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176106, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{}
|
2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection
|
2020-05-25T18:30:42.175+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
|
2020-05-25T18:30:42.176+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176107 finished with response: HostUnreachable: Connection refused
|
2020-05-25T18:30:42.176+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176107, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{}
|
2020-05-25T18:30:42.176+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
|
2020-05-25T18:30:42.176+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
|
2020-05-25T18:30:42.177+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
|
2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176108 finished with response: HostUnreachable: Connection refused
|
2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176108, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{}
|
2020-05-25T18:30:42.177+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection
|
2020-05-25T18:30:42.177+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
|
2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176109 finished with response: HostUnreachable: Connection refused
|
2020-05-25T18:30:42.177+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176109, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{}
|
2020-05-25T18:30:42.178+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
|
2020-05-25T18:30:42.178+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
|
2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
|
2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176110 finished with response: HostUnreachable: Connection refused
|
2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176110, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{}
|
2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection
|
2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
|
2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176111 finished with response: HostUnreachable: Connection refused
|
2020-05-25T18:30:42.179+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176111, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{}
|
2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
|
2020-05-25T18:30:42.179+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
|
2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
|
2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176112 finished with response: HostUnreachable: Connection refused
|
2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176112, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{}
|
2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Dropping all pooled connections to 127.0.0.1:15452 due to failed operation on a connection
|
2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Failed to connect to 127.0.0.1:15452 - HostUnreachable: Connection refused
|
2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176113 finished with response: HostUnreachable: Connection refused
|
2020-05-25T18:30:42.180+0800 D ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Request 176113, db:admin, request.cmdobj: \{ isMaster: 1 },request.metadata:{}
|
2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
|
2020-05-25T18:30:42.180+0800 I ASIO [NetworkInterfaceASIO-TaskExecutorPool-0-0] Connecting to 127.0.0.1:15452
|
I reproduce the problem local.
1.mongos set db.getMongo().setReadPref('secondary')
2.make breakpoint at NetworkInterfaceASIO::_asyncRunCommand ‘s
asyncSendMessage(cmd.conn().stream(), &cmd.toSend(), std::move(sendMessageCallback));
3.exec find operate,like:db.test.find({userId:3000})
4.when hit the breakpoint(step 2),affirm op is "IsMaster" and shutdown the secondary which the find operate target。
5.Then huge amounts of log will print and will not stop
6. When shut down the target physical machine ,because 5(huge amounts try to connect)from "connect refused" to "physical machine offline", NetworkInterfaceASIO-TaskExecutorPool will very slow to process other op.
|
| Participants: |
|
|
Hi bingfeng198878@163.com
We haven’t heard back from you for some time, so I’m going to close this ticket. If this is still an issue for you, please try reproducing this problem on the WiredTiger storage engine and we will reopen the ticket.
Best,
Edwin
|
|
Hi bingfeng198878@163.com,
Thank you for your response. We recommend upgrading your storage engine to WiredTiger which has been the default storage engine since MongoDB v3.2. From here, you will be able to upgrade your MongoDB version to 4.0 or later.
Once you're able to upgrade to MongoDB 4.0, we'd be interested in knowing if this problem persists!
Best,
Edwin
|
|
Hi Edwin Zhou,
I can't update to MongoDB 4.0 .Because I use mongorocks ,but it can't update(mongorocks version 3.6 is not supported).
|
|
Hi bingfeng198878@163.com,
We still need additional information to diagnose the problem. If this is still an issue for you, would you please try upgrading to MongoDB 4.0 or later and see if this problem persists?
Thanks,
Edwin
|
|
Hi bingfeng198878@163.com,
I appreciate your patience and I apologize for the delay in my response. It appears that you're successful in reproducing the problem by setting breakpoints in the server code. However, we typically diagnose problems in production, and without a successful reproduction without modifying the code, we're unable to identify the steps that lead up to this behavior.
Since MongoDB 3.6 is set to EOL next month per our support policy, can you try upgrading to MongoDB 4.0 or later and see if this issue persists?
Best,
Edwin
|
|
Hi Edwin Zhou,
I'm sorry.I don't know . Because I can't build 4.4 in my Macos.
|
|
Hi bingfeng198878@163.com,
Thanks for your clarification. Are you able to reliably reproduce this issue in 4.4 as well?
Best,
Edwin
|
|
Hi Edwin Zhou,
first question:
when create connection(mongos to mongod) will call function "
std::unique_ptr<NetworkInterfaceASIO::AsyncOp> ASIOConnection::makeAsyncOp(ASIOConnection* conn)",here will create a "RemoteCommandRequest" with "isMaster" command.
second question:
~~
auto target = [&]
{
return std::max(
_parent->_options.minConnections,
std::min(_requests.size() + _checkedOutPool.size(), _parent->_options.maxConnections));
}
;
when set mongos‘s startup parameters "ShardingTaskExecutorPoolMinSize" 2,"target()" will be more than 1. I log the value.
|
|
Hi bingfeng198878@163.com,
Thanks for your added details. I have a few things I'd like to clarify:
I'm not sure what you mean by
when mongos want to connect a mongod for the first time with "IsMaster" command ,shut down the mongod
The mongos will return its own role, but will not connect to another mongod instance. Is there a different behavior you're trying to mimic by doing this?
And in "ConnectionPool::SpecificPool::spawnConnections" function, "target()" is more than 1.
How are you making sure that "target()" is more than 1?
Thanks,
Edwin
|
|
Hi Edwin Zhou,
I think the core step is : when mongos want to connect a mongod for the first time with "IsMaster" command ,shut down the mongod. And in "ConnectionPool::SpecificPool::spawnConnections" function, "target()" is more than 1.
If the above conditions are met, this problem will occur regardless of whether the mongod is the master node or the slave node.
|
|
Hi bingfeng198878@163.com,
I'm currently unable to reproduce your issue but I'll provide my reproduction steps below so you evaluate what can be missing.
Using MongoDB 3.6.3, I launched a PSS sharded replica set with 1 mongos.
I launched the mongos with the following parameters:
--maxConns 1000 --setParameter ShardingTaskExecutorPoolMaxSize=1 --setParameter taskExecutorPoolSize=1 --setParameter ShardingTaskExecutorPoolMinSize=2 --localThreshold 3 --fork
|
I then insert 10mil documents
db = db.getSiblingDB("test")
|
coll = db.getCollection("test_col")
|
coll.drop();
|
bulk = coll.initializeUnorderedBulkOp();
|
for (i = 0; i < 10000000; i++) {
|
bulk.insert({ _id: i, localField: i });
|
}
|
bulk.execute();
|
and perform a find query with readPref set to "secondary"
db = db.getSiblingDB("test")
|
coll = db.getCollection("test_col")
|
db.getMongo().setReadPref("secondary");
|
coll.find({ localField: 1 })
|
and shutdown the secondary thats performing the op
mongo "localhost:27019/admin" --eval "db.shutdownServer()"
|
Looking into mongos.log, I don't see many log lines printed out. On the other hand, the primary mongod.log is printing out these 11 lines every 2 seconds:
2021-02-02T13:14:19.088-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to localhost:27020
|
2021-02-02T13:14:19.090-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:27020 - HostUnreachable: Connection refused
|
2021-02-02T13:14:19.090-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:27020 due to failed operation on a connection
|
2021-02-02T13:14:19.090-0500 I REPL_HB [replexec-1] Error in heartbeat (requestId: 982) to localhost:27020, response status: HostUnreachable: Connection refused
|
2021-02-02T13:14:19.090-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to localhost:27020
|
2021-02-02T13:14:19.091-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:27020 - HostUnreachable: Connection refused
|
2021-02-02T13:14:19.091-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:27020 due to failed operation on a connection
|
2021-02-02T13:14:19.091-0500 I REPL_HB [replexec-3] Error in heartbeat (requestId: 984) to localhost:27020, response status: HostUnreachable: Connection refused
|
2021-02-02T13:14:19.091-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Connecting to localhost:27020
|
2021-02-02T13:14:19.092-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:27020 - HostUnreachable: Connection refused
|
2021-02-02T13:14:19.092-0500 I ASIO [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:27020 due to failed operation on a connection
|
2021-02-02T13:14:19.092-0500 I REPL_HB [replexec-3] Error in heartbeat (requestId: 986) to localhost:27020, response status: HostUnreachable: Connection refused
|
However, that doesn't cover the issue you're seeing in the mongos.log file. Is there anything missing in my reproduction steps that could help narrow down whats causing the issue you're observing?
Thanks,
Edwin
|
|
Hi Edwin Zhou,
in production:
1.I use some query request to secondary
2.shutdown secondary
sometimes,mongos will has this problem,but it‘s not inevitable.
So, I use debug and add breakpoint to reproduction the problem.
|
|
Hi bingfeng198878@163.com,
Unfortunately, without a $dbpath covering the timeline of the incident, it is difficult to diagnose why excessive log lines are being printed in production. Are you able to provide a series of steps or events leading up to the incident in production?
Best,
Edwin
|
|
Hi Edwin Zhou,
It's a problem in production, I reproduce it in debug. But I can't provide “dbpath” in production.In production , we have a flood of requests to "secondary", when "secondary" shut down, some mongos will have the problem and other mongos won't.
|
|
Hi bingfeng198878@163.com,
We typically diagnose issues that are in production. Is this something you're able to reproduce this issue without setting any breakpoints? Can you reproduce this issue in production?
Thanks,
Edwin
|
|
Hi Edwin Zhou,
I don't understand. What do you mean.
I appear the issue on MacOS with Xcode. Step 2 ,I make a breakpoint at "NetworkInterfaceASIO::_asyncRunCommand" funciton ,"asyncSendMessage(cmd.conn().stream(), &cmd.toSend(), std::move(sendMessageCallback));".
For example:
I query data like: db.uuid.find({userId:1}) , I know "userId:1" should to shard1 and port 16451(a secondary), when the breakpoint was hit,and "op->request().target.port()" is equal 16451,kill the process with listen port 16451.
|
|
Hi bingfeng198878@163.com,
I requested it as it helps to pinpoint where and when the incident occurs, and gives us context surrounding the incident you're observing. However, I'm unable to find the incident that you're referring to in your data.
Could you clarify step 2 of your reproduction step?
2.make breakpoint at NetworkInterfaceASIO::_asyncRunCommand ‘s
|
asyncSendMessage(cmd.conn().stream(), &cmd.toSend(), std::move(sendMessageCallback));
|
I'm not sure here where you are setting your breakpoint. Could you go into more detail?
Thanks,
Edwin
|
|
Hi Edwin Zhou,
I have upload dbpath, but it's not important,any db can appear.
|
|
Hi bingfeng198878@163.com,
Would you please archive (tar or zip) the $dbpath/diagnostic.data directory (the contents are described here) and attach it to this ticket?
Kind regards,
Edwin
|
|
supplement: mongos run command ./mongos --configdb rs_config/127.0.0.1:15350,127.0.0.1:15351,127.0.0.1:15352 --bind_ip 0.0.0.0 --port 2910 --logpath /Users/xinli/mongodb_dir/mongos/logs --maxConns 1000 --setParameter ShardingTaskExecutorPoolMaxSize=1 --setParameter taskExecutorPoolSize=1 --setParameter ShardingTaskExecutorPoolMinSize=2 --localThreshold 3
if "ShardingTaskExecutorPoolMinSize" set 1,the problem won't appear.
|
Generated at Thu Feb 08 05:32:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.