[SERVER-7850] Config server floods log with: CMD fsync: sync:1 lock:0 Created: 05/Dec/12  Updated: 15/Feb/13  Resolved: 09/Jan/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: James Blackburn Assignee: Andre de Frere
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-5442 use journal commit instead of fsync c... Closed
Related
Operating System: Linux
Participants:

 Description   

We have a sharded cluster of 5 mongods. They are started like this:

dbpath = /mnt/mongo/rs0
shardsvr = true
maxConns = 20000
keyFile = keyfile
 
# Logging
logpath = logs/mongod-cn60.log
 
# Replica set
replSet = rs0
oplogSize = 1024

These are connected to by 96 nodes, each running a mongos.

The three config servers are run like this:

dbpath = /mnt/mongo/cs0
configsvr = true
keyFile = keyfile
 
# Logging
logpath = configserver-cn60.log

We see the config server log file flooded with:

Wed Dec  5 18:51:44 [conn583] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn584] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn813] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn756] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn289] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn828] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn561] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn244] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn823] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn168] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn651] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn1286] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn630] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn773] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn230] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn583] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn756] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn813] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn584] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn1320] CMD fsync: sync:1 lock:0
Wed Dec  5 18:51:44 [conn126] CMD fsync: sync:1 lock:0

The config server sits at a continuous ~30% CPU usage with no load on the cluster.

It could be that each mongos is trying to balance, which keeps the config server permanently loaded. Disabling balancing reduces the load considerably, to 5-10% CPU (with the cluster still unloaded). However, we still see this fsync command every 30s or so from every connection.

Is this expected? Can we enable balancing on only one mongos?
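
For reference, we disabled balancing from a mongo shell connected to one of the mongos nodes. A minimal sketch (sh.setBalancerState is the 2.2 shell helper; the config.settings update is the equivalent manual form):

// From a mongo shell connected to any mongos:
sh.setBalancerState(false)   // disable balancing cluster-wide
sh.getBalancerState()        // verify: should now return false

// Equivalent manual form, updating the config database directly:
use config
db.settings.update({ _id: "balancer" }, { $set: { stopped: true } }, true)  // upsert

// Re-enable later with:
sh.setBalancerState(true)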



 Comments   
Comment by James Blackburn [ 07/Jan/13 ]

Ok, I've voted on SERVER-5442. I guess I'll have to patch mongo locally here to suppress this particular log line.

Comment by Andre de Frere [ 10/Dec/12 ]

Hi James,

There is currently a ticket to reduce the amount of logging that the config servers do around the fsync command. You can vote and comment on that issue at SERVER-5442.

You can reduce the impact of this by rotating the logs periodically, or by using the --syslog command line option to handle the logs in the same manner as the system's syslog.
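
As a concrete sketch of the rotation approach (logRotate is the standard server command; how and when you schedule it is up to you):

// From a mongo shell connected to the config server:
db.adminCommand({ logRotate: 1 })
// This renames the current log file with a timestamp suffix and starts a new
// one; the renamed files can then be compressed or deleted on a schedule.

Alternatively, sending SIGUSR1 to the mongod process triggers the same rotation.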

High CPU on the config server should not translate into reduced performance for accessing data via the mongos. Each mongos maintains a cached copy of the config database, which it uses in lieu of directly accessing the config servers. The CPU load you are seeing is most likely caused by connection handling for the number of mongos nodes, rather than by load on the cluster. Balancing operations do not take place on the config servers themselves, so increased load on the config server should not impact balancing or migration performance. The config data is generally small, so rapid fsync operations should not have undue effects on other users of that node. That said, isolating the config servers on their own hosts would eliminate both the impact on other users and the risk of a config server being brought down by external circumstances.

Comment by James Blackburn [ 07/Dec/12 ]

Hi Andre,

Thanks for the reply.

From my POV the problems this causes are:

1) Excessive log files: 300MB per config server for 4 days of uptime.
We're seeing the same '[conn302] CMD fsync: sync:1 lock:0' logged to all 3 config servers,
giving ~1GB of log files per 4 days => ~7GB per month. This is excessive, isn't it?
2) Background load on the config servers when the cluster is otherwise idle.
We haven't put much data in this cluster yet. We do want balancing enabled, but the background load seems very high when running a mongos on each node.

Other worries:

  • Will it scale?
    • Does high cpu usage on the config servers translate to reduced performance for accessing data via the mongos?
    • Will the cpu load on the config servers increase as the cluster is used, and actually needs balancing, such that data access is impacted?
  • Does the constant fsync() happening on the config server impact filesystem performance for other users on the same box?
Comment by Andre de Frere [ 07/Dec/12 ]

Hi James,

Seeing the config server logs contain mostly fsync commands is expected behaviour. This is due to the way that updates to the config database are persisted.

The config servers will incur some load, even with apparently little else going on in the cluster. This is because every mongos needs to connect to them during the normal course of operations in order to update its cached config. This load increases with the number of mongos nodes in your environment.

If balancing is switched off in the environment, the chance of a staleConfig being detected (and therefore of new config being required) drops to approximately zero. With no new config required, the load on the config servers is greatly reduced.

When balancing starts, the mongos responsible takes out a distributed lock, preventing any of the other mongos nodes from starting balancing operations. You do not need to enable balancing on only one node; this is already accounted for.
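
If you want to confirm which mongos currently holds it, you can inspect the distributed lock in the config database. A sketch (field names as in the 2.2 config schema):

// From a mongo shell connected to a mongos or a config server:
use config
db.locks.find({ _id: "balancer" }).pretty()
// state: 2 means the lock is currently held; the "who" and "process"
// fields identify the mongos running the balancing round.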

The additional load that you see on one config server (over the other two) is due to the way the mongos processes connect to the config servers: they use the first config server in the --configdb string they are given when they are started. Since this string must be exactly the same across all the mongos processes, they will all connect to the first node in that string first.
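
For illustration, the relevant part of a mongos configuration file (hostnames here are hypothetical; the point is that the list must be identical, and in the same order, on every mongos):

# mongos configuration (hypothetical hostnames and port)
configdb = cfg0.example.com:27019,cfg1.example.com:27019,cfg2.example.com:27019
# Every mongos uses the same string, so cfg0.example.com receives the most
# connection traffic.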

Could you let us know what information you were looking for in the config server logs?

Comment by James Blackburn [ 06/Dec/12 ]

Can I provide any more information to aid in debugging this?
