[SERVER-28712] serverStatus collect extraInfo may be slow? Created: 10/Apr/17  Updated: 27/Oct/23  Resolved: 19/May/17

Status: Closed
Project: Core Server
Component/s: Performance
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: deyukong Assignee: Mark Agarunov
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File diagnostic.data.7z     Zip Archive mongod.log.2017-04-10T02-19-56.zip    
Issue Links:
Related
related to SERVER-28693 Use getrusage to collect major page f... Closed
Operating System: ALL
Participants:

 Description   

In mongod's slow query log I see many lines like the following:
Mon Apr 10 11:05:01.426 I COMMAND [ftdc] serverStatus was very slow:

{ after basic: 0, after asserts: 0, after connections: 0, after extra_info: 1410, after globalLock: 1410, after locks: 1410, after network: 1410, after opcounters: 1410, after opcountersRepl: 1410, after repl: 1410, after storageEngine: 1410, after tcmalloc: 1410, after wiredTiger: 1410, at end: 1420 }

There are many such lines.
From the [ftdc] tag I know they are generated by the diagnostic data capture thread, but that is not the important part.

What is important is that the time jumps after collecting extra_info, which, after digging into the code, I found calls mallinfo() on Linux.

I suspect that mallinfo() may be slow in some situations, because every slow serverStatus entry I got shows the time being spent after extra_info.
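
For reference, a minimal standalone sketch (my own test program, not MongoDB source; it only assumes, as described above, that extra_info ends up calling mallinfo() on Linux) to time that single call:

// time_mallinfo.cpp -- build with: g++ -O2 time_mallinfo.cpp -o time_mallinfo
#include <malloc.h>   // mallinfo()
#include <chrono>
#include <cstdio>

int main() {
    auto t0 = std::chrono::steady_clock::now();
    struct mallinfo mi = mallinfo();   // the call the description traces extra_info to
    auto t1 = std::chrono::steady_clock::now();
    long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
    std::printf("mallinfo() took %lld us (uordblks=%d, hblkhd=%d)\n", us, mi.uordblks, mi.hblkhd);
    return 0;
}

Running this against the same allocator build as mongod (e.g. linked with tcmalloc) shows whether a single mallinfo() call alone can account for the ~1.4 s gap shown in the log line above.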



 Comments   
Comment by deyukong [ 07/Dec/17 ]

We have found the problem.
The trigger is a heavily loaded mongod with thousands of connections (or threads, the two are equivalent here) and a large amount of memory (64 GB is enough).
If you repeatedly run 'cat /proc/<mongod_pid>/smaps', you will see mongod start emitting slow query log lines.
On our machines there is a monitoring agent that reads every process's memory statistics from /proc/<pid>/smaps; every time the agent does this, mongod stalls.
The more fundamental reason lies inside the Linux kernel.

If you run into this situation, strace the monitoring agent to check whether it is reading mongod's smaps file.
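
For anyone who wants to reproduce this without the monitoring agent, here is a minimal sketch (my own program; the one-second poll interval is an arbitrary choice) that repeatedly reads a target process's smaps file, similar to what our agent does. The commonly cited explanation is that walking /proc/<pid>/smaps holds the target's mmap semaphore, which can stall page faults and mmap/munmap calls in a process with thousands of threads and tens of GB of memory.

// read_smaps_loop.cpp -- build with: g++ -O2 read_smaps_loop.cpp -o read_smaps_loop
// Usage: ./read_smaps_loop <pid>
#include <chrono>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <thread>

int main(int argc, char** argv) {
    if (argc != 2) {
        std::cerr << "usage: " << argv[0] << " <pid>\n";
        return 1;
    }
    const std::string path = std::string("/proc/") + argv[1] + "/smaps";
    while (true) {
        std::ifstream smaps(path);
        std::ostringstream sink;
        sink << smaps.rdbuf();   // read the whole file, discard the contents
        std::this_thread::sleep_for(std::chrono::seconds(1));   // agent-like polling
    }
}

While this loop runs against mongod's pid, watch mongod's log for the same 'serverStatus was very slow' lines.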

Comment by deyukong [ 20/May/17 ]

Perhaps this deserves more consideration.
Two things are still unclear:
1) Why do all the slow serverStatus log entries look the same, with the time growing from extra_info, which collects mallinfo from tcmalloc?
2) Manually running db.runCommand({'serverStatus': 1}) can trigger a slow serverStatus, followed by a huge number of other slow queries.
I think you should test mongod's behavior with tcmalloc under high load (e.g. 5000 active threads and 200 GB or more of memory). In that situation you will see tcmalloc consuming most of the CPU time in perf top. I strongly suspect that tcmalloc's mallinfo may block memory allocation and release; that may be why each serverStatus is followed by a large number of other slow queries, some of which are simple key-value operations that should never be slow. A test sketch follows below.
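
To test that hypothesis, here is a minimal harness sketch (my own code; the thread count, allocation size, and iteration counts are arbitrary) that times mallinfo() while worker threads allocate and free concurrently. Link it against tcmalloc (e.g. with -ltcmalloc) so that mallinfo() resolves to tcmalloc's implementation:

// mallinfo_contention.cpp
// Build: g++ -O2 -pthread mallinfo_contention.cpp -ltcmalloc -o mallinfo_contention
#include <malloc.h>
#include <atomic>
#include <chrono>
#include <cstdio>
#include <cstdlib>
#include <thread>
#include <vector>

std::atomic<bool> stop{false};

// Worker: allocate and free small blocks, recording the worst single alloc/free latency.
void worker(std::atomic<long long>* worst_us) {
    while (!stop.load(std::memory_order_relaxed)) {
        auto t0 = std::chrono::steady_clock::now();
        void* p = std::malloc(256);
        std::free(p);
        auto t1 = std::chrono::steady_clock::now();
        long long us = std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count();
        long long cur = worst_us->load(std::memory_order_relaxed);
        while (us > cur && !worst_us->compare_exchange_weak(cur, us)) {}
    }
}

int main() {
    std::atomic<long long> worst_us{0};
    std::vector<std::thread> threads;
    for (int i = 0; i < 64; ++i)   // scale up toward the real thread count as needed
        threads.emplace_back(worker, &worst_us);

    for (int i = 0; i < 10; ++i) {
        auto t0 = std::chrono::steady_clock::now();
        (void)mallinfo();          // the call extra_info makes
        auto t1 = std::chrono::steady_clock::now();
        std::printf("mallinfo: %lld us, worst alloc/free so far: %lld us\n",
                    (long long)std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count(),
                    worst_us.load());
        std::this_thread::sleep_for(std::chrono::seconds(1));
    }
    stop = true;
    for (auto& t : threads) t.join();
    return 0;
}

If the worst alloc/free latency jumps whenever mallinfo() is called, that would support the idea that collecting extra_info stalls other allocating threads.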

Comment by Mark Agarunov [ 12/Apr/17 ]

Hello wolf_kdy,

Thank you for providing the files. Looking over the diagnostic data and logs, it appears that as more load is placed on the server, serverStatus takes longer to return and eventually times out. I don't see anything that would indicate a bug; after the load on the server is increased, the performance seems relatively consistent. If you are not seeing a performance degradation, this appears to be working as expected.

Thanks,
Mark

Comment by deyukong [ 12/Apr/17 ]

@Mark Agarunov
I've sent you an email containing what you need.

Comment by Mark Agarunov [ 11/Apr/17 ]

Hello wolf_kdy,

Thank you for the report. To better investigate the behavior you've described, I'd like to request some additional information. Could you please provide the following:

  • The complete logs from mongod when this issue is present,
  • An archive (tar or zip) of the $dbpath/diagnostic.data directory

Thanks,
Mark
