[SERVER-48072] serverStatus very slow only at end Created: 09/May/20  Updated: 27/Oct/23  Resolved: 14/May/20

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.3
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: phoenix Liu Assignee: Dmitry Agranat
Resolution: Community Answered Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
Backwards Compatibility: Fully Compatible
Participants:

 Description   

Hi, I recently encountered a problem with serverStatus.

In mongod's log I can find an entry like the one below, alongside many other slow-operation log lines (normal queries, updates, etc.):

Fri May 1 14:01:23.248 I COMMAND [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after logicalSessionRecordCache: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after shardingStatistics: 0, after storageEngine: 0, after tcmalloc: 0, after transactions: 0, after wiredTiger: 0, at end: 8256 }

 

Looking at the source code: after all the named sections have been handled, and before this 'very slow' line is printed, the command still processes the metrics and warnings output. I don't understand why that final step took so long.
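
For reference, every occurrence and its 'at end' duration can be pulled out of the log with a small awk sketch like the following (assuming the ctime-style log timestamps shown above; the per-section values appear to be milliseconds):

$ awk '/serverStatus was very slow/ {
    match($0, /at end: [0-9]+/);                 # grab the trailing "at end" timing
    print $1, $2, $3, $4, substr($0, RSTART, RLENGTH) " ms";
  }' mongod.log
Fri May 1 14:01:23.248 at end: 8256 ms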

 

more details:

  1. It occurs 0-2 times a day, with no regularity in timing.
  2. System load has not changed significantly, and the load is light relative to the available resources.
  3. No resource (CPU, memory, disk, network) appears to be the bottleneck.
  4. It happens on both primary and secondary nodes.

 

My questions:

1. What does it mean when serverStatus is slow only "at end"? Is that expected?

2. What scenarios can cause this problem?

3. How can I track down the cause? Is there any recommendation?

 

Thank you!



 Comments   
Comment by phoenix Liu [ 14/May/20 ]

Hi Dima,

Thanks for the reply. I will try your suggestion and report back.

But I'm still curious how you concluded that this is about the machine's NUMA configuration. Could you please explain in more detail?

Here is our machine's NUMA config (before any changes):

$numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59
node 0 size: 391842 MB
node 0 free: 232789 MB
node 1 cpus: 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79
node 1 size: 393216 MB
node 1 free: 236437 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10

// there are no numa_miss events
$numastat
                node0        node1 
numa_hit      108945205393 107846766988 
numa_miss                0            0 
numa_foreign             0            0 
interleave_hit 79007171479  79055797917 
local_node     61169948407  76556848237 
other_node     47775256986  31289918751

// zone_reclaim_mode is already set to the recommended value
$cat /proc/sys/vm/zone_reclaim_mode 
0

Also, I checked all the mongod/mongos hosts; there is no numad daemon running on any of them.

We didn't start mongod with 'numactl --interleave=all'. But memory does not seem to be the bottleneck in this case, so how can the NUMA configuration affect it?

 

Thanks,

phoenix

 

Comment by Dmitry Agranat [ 14/May/20 ]

Hi phoenixxliu@tencent.com, thank you for providing the requested information.

After analyzing the data, this appears to be an environment problem we've seen in the past. To resolve this issue, please do the following:

  • Configure NUMA according to our Production Notes (a minimal sketch of the relevant steps follows below)
  • If the above does not solve the issue, ensure no numad daemon is running on any server where MongoDB is running
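
For convenience, the Production Notes' NUMA guidance boils down to something like the following (a sketch only; paths and service names will vary by distribution, and the commands need root):

$ sysctl -w vm.zone_reclaim_mode=0        # disable zone reclaim
$ pgrep numad && systemctl stop numad     # make sure numad is not migrating mongod's memory
$ numactl --interleave=all /usr/bin/mongod --config /etc/mongod.conf   # interleave allocations across NUMA nodes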

As this issue is not related to MongoDB, I will go ahead and close this issue. Again, thank you for your report.

Regards,
Dima

Comment by phoenix Liu [ 11/May/20 ]

Hi Dima,

Sorry, I missed some info in my reply.

It only causes very short system freezes (a few seconds), but the application is affected and gets timeout errors.

 

From the mongostat output, we can see the following:

  1. For the primary, data is missing for about 5 seconds (the samples jump from 00:43:38.962 to 00:43:43.036):

      insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn  repl time
        21   396    370      2        988        2178|0  2.7% 80.0%       0 21.9G 15.0G 0|0  1|0  2.09m   6.10m  244   PRI May 11 00:43:38.962
        10    98     95      1        253         578|0  0.6% 80.0%       0 21.9G 15.1G 0|0  2|0   544k   1.58m  244   PRI May 11 00:43:43.036
        74  1899   1204      8       1069        8874|0  0.7% 80.1%       0 22.2G 15.1G 0|0  3|0  4.33m   15.6m  518   PRI May 11 00:43:43.383
        84  1074    983     *0       1517        6438|0  0.8% 80.0%       0 22.3G 15.1G 0|0  3|0  4.33m   14.4m  613   PRI May 11 00:43:43.962
        61   443    468      2       1056        2738|0  0.9% 80.1%       0 22.3G 15.1G 0|0  2|0  2.45m   7.17m  613   PRI May 11 00:43:44.962

  2. For the secondary, the counters drop to zero, which suggests the replica set sync ran into a problem:

      insert query update delete getmore command dirty used flushes vsize res qrw arw net_in net_out conn  repl time
       *33    37    *331     *0       0    44|0  1.2% 80.0%       0  141G 52.1G  0|0 1|0  4.10k   99.7k  106  SEC May 11 00:43:38.039
       *22    31    *297     *3       0    38|0  1.3% 80.0%       0  141G 52.1G  0|0 1|0  3.66k   93.8k  106  SEC May 11 00:43:39.053
       *38    32    *287     *1       0    68|0  1.3% 80.0%       0  141G 52.1G  0|0 1|0  6.87k    118k  106  SEC May 11 00:43:40.036
        *0    37      *0     *0       0    46|0  1.3% 80.0%       0  141G 52.1G  0|0 1|0  4.74k    102k  106  SEC May 11 00:43:41.036
        *0    31      *0     *0       0    39|0  1.3% 80.0%       0  141G 52.1G  0|0 1|0  4.00k   96.0k  106  SEC May 11 00:43:42.037
       *15    32    *150     *1       0    40|0  1.4% 80.0%       0  141G 52.1G  0|0 1|0  4.00k   96.0k  106  SEC May 11 00:43:43.037
       *46    36    *639     *0       0    43|0  1.4% 80.0%       0  141G 52.1G  0|0 1|0  4.07k   99.0k  106  SEC May 11 00:43:44.042

BTW, the time zone is GMT+8. The most recent occurrence happened at Mon May 11 00:43:42 (GMT+8); you may want to check that one first.
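
In case it helps, gaps like this can be flagged automatically with something like the following (a rough sketch that assumes the timestamp is the last column of mongostat's default output and ignores midnight rollover):

$ mongostat 1 | awk '{
    n = split($NF, t, ":");                # HH:MM:SS.mmm -> seconds since midnight
    if (n == 3) {
      now = t[1]*3600 + t[2]*60 + t[3];
      if (prev && now - prev > 2)          # more than 2s between samples
        print "gap of " (now - prev) "s before " $NF;
      prev = now;
    }
  }'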

Thanks,

Phoenix

Comment by Dmitry Agranat [ 11/May/20 ]

Thanks phoenixxliu@tencent.com, we will take a look at the uploaded data.

Could you please clarify if this is causing any observable performance problems?
Also, what is the time zone of the times you provided in your last comment?

Thanks,
Dima

Comment by phoenix Liu [ 11/May/20 ]

Hi Dima,

Thanks for the reply.

I have uploaded the requested files. Since the diagnostic.data from May 1 is no longer available, you may want to focus on the following points in time: Sun May 10 10:25:34 and Sun May 10 06:50:25.

Thanks,

Phoenix

Comment by Dmitry Agranat [ 10/May/20 ]

Hi phoenixxliu@tencent.com,

Thank you for the report.

Could you please clarify if this is causing any observable performance problems, and provide the exact time & timestamp when this issue occurs?

Would you also archive (tar or zip) the mongod.log files and the $dbpath/diagnostic.data directory (the contents are described here) and upload them to this support uploader location?

Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

Thanks,
Dima
