[SERVER-42062] tcmalloc is locking mongo Created: 03/Jul/19  Updated: 16/Nov/21  Resolved: 29/Mar/20

Status: Closed
Project: Core Server
Component/s: Performance, Stability
Affects Version/s: 3.4.16
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Sébastien Puyet Assignee: Dmitry Agranat
Resolution: Incomplete Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File 10min.png     PNG File 1510.png     PNG File Screen Shot 2019-07-10 at 1.34.32 PM.png     PNG File Screenshot from 2019-07-03 16-12-36.png     PNG File Screenshot from 2019-07-16 17-48-53.png     PNG File Screenshot from 2019-07-16 17-49-20.png    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Participants:

 Description   

Hello,

We're having some trouble on our current production servers, with a lot of log lines like these:

2019-07-03T06:40:05.648+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 10, after locks: 10, after network: 10, after opLatencies: 10, after opcounters: 10, after opcountersRepl: 10, after repl: 10, after security: 10, after storageEngine: 10, after tcmalloc: 1645, after wiredTiger: 1655, at end: 1655 }
2019-07-03T09:10:05.637+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 1633, after wiredTiger: 1633, at end: 1643 }
2019-07-03T10:50:05.676+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 0, after wiredTiger: 0, at end: 1674 }
2019-07-03T13:00:05.708+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 1704, after wiredTiger: 1704, at end: 1714 }
2019-07-03T14:40:10.106+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 1098, after wiredTiger: 1098, at end: 1108 }
 

From what I can see, a lot of the time is being spent in the "tcmalloc" section.
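
The timings in these messages appear to be cumulative milliseconds recorded after each serverStatus section, so the cost of the tcmalloc section itself is the difference from the preceding field. A quick mongo shell sketch of that arithmetic, using the last few fields of the first log line above (the cumulative interpretation is our reading of the output, not something stated in the log):

// Cumulative section timings (ms) from the 06:40:05 "serverStatus was very slow" line.
var timings = {
    "after storageEngine": 10,
    "after tcmalloc": 1645,
    "after wiredTiger": 1655,
    "at end": 1655
};

// Assuming each value is the elapsed time once that section finished,
// the per-section cost is the delta from the previous field.
var keys = Object.keys(timings);
for (var i = 1; i < keys.length; i++) {
    print(keys[i] + ": +" + (timings[keys[i]] - timings[keys[i - 1]]) + " ms");
}
// Prints "after tcmalloc: +1635 ms" -- nearly all of the time is in the tcmalloc section.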

We found several tickets describing a similar issue:
SERVER-34027
SERVER-31417
SERVER-28712
and this detailed Circle CI postmortem: https://discuss.circleci.com/t/postmortem-march-26-april-10-workflow-delay-incidents/30060

After reading these, we tried to gather more details about the locks by running the serverStatus command before, during and after a lock:

dimelo_shard:PRIMARY> print(new Date); db.serverStatus().tcmalloc // before
Tue Jul 02 2019 14:40:02 GMT+0200 (CEST)
{
	"generic" : {
		"current_allocated_bytes" : NumberLong("117552083592"),
		"heap_size" : NumberLong("150148108288")
	},
	"tcmalloc" : {
		"pageheap_free_bytes" : NumberLong("18349326336"),
		"pageheap_unmapped_bytes" : NumberLong("11275706368"),
		"max_total_thread_cache_bytes" : NumberLong(1073741824),
		"current_total_thread_cache_bytes" : 130347296,
		"total_free_bytes" : NumberLong("2970869232"),
		"central_cache_free_bytes" : NumberLong("2840585832"),
		"transfer_cache_free_bytes" : 58496,
		"thread_cache_free_bytes" : 130259936,
		"aggressive_memory_decommit" : 0,
		"pageheap_committed_bytes" : NumberLong("138872401920"),
		"pageheap_scavenge_count" : 189129607,
		"pageheap_commit_count" : 295850012,
		"pageheap_total_commit_bytes" : NumberLong("69973171851264"),
		"pageheap_decommit_count" : 189129607,
		"pageheap_total_decommit_bytes" : NumberLong("69834299449344"),
		"pageheap_reserve_count" : 25899,
		"pageheap_total_reserve_bytes" : NumberLong("150148108288"),
		"spinlock_total_delay_ns" : NumberLong("185564381857153"),
		"formattedString" : "------------------------------------------------\nMALLOC:   117585679288 (112138.4 MiB) Bytes in use by application\nMALLOC: +  18315825152 (17467.3 MiB) Bytes in page heap freelist\nMALLOC: +   2840582824 ( 2709.0 MiB) Bytes in central cache freelist\nMALLOC: +        58496 (    0.1 MiB) Bytes in transfer cache freelist\nMALLOC: +    130256160 (  124.2 MiB) Bytes in thread cache freelists\nMALLOC: +    638378240 (  608.8 MiB) Bytes in malloc metadata\nMALLOC:   ------------\nMALLOC: = 139510780160 (133047.8 MiB) Actual memory used (physical + swap)\nMALLOC: +  11275706368 (10753.4 MiB) Bytes released to OS (aka unmapped)\nMALLOC:   ------------\nMALLOC: = 150786486528 (143801.2 MiB) Virtual address space used\nMALLOC:\nMALLOC:        6546367              Spans in use\nMALLOC:           1903              Thread heaps in use\nMALLOC:           4096              Tcmalloc page size\n------------------------------------------------\nCall ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).\nBytes released to the OS take up virtual address space but no physical memory.\n"
	}
}
dimelo_shard:PRIMARY> print(new Date); db.serverStatus().tcmalloc //during
Tue Jul 02 2019 14:40:03 GMT+0200 (CEST)
{
	"generic" : {
		"current_allocated_bytes" : NumberLong("117626126808"),
		"heap_size" : NumberLong("150148108288")
	},
	"tcmalloc" : {
		"pageheap_free_bytes" : NumberLong("18269425664"),
		"pageheap_unmapped_bytes" : NumberLong("11280478208"),
		"max_total_thread_cache_bytes" : NumberLong(1073741824),
		"current_total_thread_cache_bytes" : 131302496,
		"total_free_bytes" : NumberLong("2967829528"),
		"central_cache_free_bytes" : NumberLong("2836261848"),
		"transfer_cache_free_bytes" : 193376,
		"thread_cache_free_bytes" : 131495112,
		"aggressive_memory_decommit" : 0,
		"pageheap_committed_bytes" : NumberLong("138869190656"),
		"pageheap_scavenge_count" : 189129663,
		"pageheap_commit_count" : 295850032,
		"pageheap_total_commit_bytes" : NumberLong("69973182099456"),
		"pageheap_decommit_count" : 189129663,
		"pageheap_total_decommit_bytes" : NumberLong("69834312380416"),
		"pageheap_reserve_count" : 25899,
		"pageheap_total_reserve_bytes" : NumberLong("150148108288"),
		"spinlock_total_delay_ns" : NumberLong("185588157938035"),
		"formattedString" : "------------------------------------------------\nMALLOC:   117631224992 (112181.9 MiB) Bytes in use by application\nMALLOC: +  18270486528 (17424.1 MiB) Bytes in page heap freelist\nMALLOC: +   2836202536 ( 2704.8 MiB) Bytes in central cache freelist\nMALLOC: +       187232 (    0.2 MiB) Bytes in transfer cache freelist\nMALLOC: +    131617752 (  125.5 MiB) Bytes in thread cache freelists\nMALLOC: +    638378240 (  608.8 MiB) Bytes in malloc metadata\nMALLOC:   ------------\nMALLOC: = 139508097280 (133045.3 MiB) Actual memory used (physical + swap)\nMALLOC: +  11278389248 (10755.9 MiB) Bytes released to OS (aka unmapped)\nMALLOC:   ------------\nMALLOC: = 150786486528 (143801.2 MiB) Virtual address space used\nMALLOC:\nMALLOC:        6548780              Spans in use\nMALLOC:           1909              Thread heaps in use\nMALLOC:           4096              Tcmalloc page size\n------------------------------------------------\nCall ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).\nBytes released to the OS take up virtual address space but no physical memory.\n"
	}
}
dimelo_shard:PRIMARY> print(new Date); db.serverStatus().tcmalloc // after
Tue Jul 02 2019 14:40:08 GMT+0200 (CEST)
{
	"generic" : {
		"current_allocated_bytes" : NumberLong("117641336656"),
		"heap_size" : NumberLong("150148108288")
	},
	"tcmalloc" : {
		"pageheap_free_bytes" : NumberLong("18237587456"),
		"pageheap_unmapped_bytes" : NumberLong("11311992832"),
		"max_total_thread_cache_bytes" : NumberLong(1073741824),
		"current_total_thread_cache_bytes" : 132614088,
		"total_free_bytes" : NumberLong("2957264528"),
		"central_cache_free_bytes" : NumberLong("2824540336"),
		"transfer_cache_free_bytes" : 109152,
		"thread_cache_free_bytes" : 132612808,
		"aggressive_memory_decommit" : 0,
		"pageheap_committed_bytes" : NumberLong("138836115456"),
		"pageheap_scavenge_count" : 189129809,
		"pageheap_commit_count" : 295850054,
		"pageheap_total_commit_bytes" : NumberLong("69973193326592"),
		"pageheap_decommit_count" : 189129809,
		"pageheap_total_decommit_bytes" : NumberLong("69834357211136"),
		"pageheap_reserve_count" : 25899,
		"pageheap_total_reserve_bytes" : NumberLong("150148108288"),
		"spinlock_total_delay_ns" : NumberLong("185588613663259"),
		"formattedString" : "------------------------------------------------\nMALLOC:   117641320648 (112191.5 MiB) Bytes in use by application\nMALLOC: +  18237534208 (17392.7 MiB) Bytes in page heap freelist\nMALLOC: +   2824532144 ( 2693.7 MiB) Bytes in central cache freelist\nMALLOC: +       109152 (    0.1 MiB) Bytes in transfer cache freelist\nMALLOC: +    132619304 (  126.5 MiB) Bytes in thread cache freelists\nMALLOC: +    638378240 (  608.8 MiB) Bytes in malloc metadata\nMALLOC:   ------------\nMALLOC: = 139474493696 (133013.2 MiB) Actual memory used (physical + swap)\nMALLOC: +  11311992832 (10788.0 MiB) Bytes released to OS (aka unmapped)\nMALLOC:   ------------\nMALLOC: = 150786486528 (143801.2 MiB) Virtual address space used\nMALLOC:\nMALLOC:        6545889              Spans in use\nMALLOC:           1951              Thread heaps in use\nMALLOC:           4096              Tcmalloc page size\n------------------------------------------------\nCall ReleaseFreeMemory() to release freelist memory to the OS (via madvise()).\nBytes released to the OS take up virtual address space but no physical memory.\n"
	}
}
dimelo_shard:PRIMARY> (after.tcmalloc.spinlock_total_delay_ns - during.tcmalloc.spinlock_total_delay_ns) / (1000*1000*1000)
0.455725224
dimelo_shard:PRIMARY> (during.tcmalloc.spinlock_total_delay_ns - before.tcmalloc.spinlock_total_delay_ns) / (1000*1000*1000)
23.776080882

The last command shows how much spinlock delay (~24 seconds) clients accumulated between the before and during snapshots.
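
For reference, the before, during and after variables used above are snapshots of db.serverStatus().tcmalloc taken around a stall; the exact capture isn't shown, so here is a minimal mongo shell sketch of how they could be taken and compared (the sleep intervals roughly match the timestamps printed above and are otherwise an assumption):

// Capture tcmalloc stats before, during and after a stall and compare the
// cumulative spinlock delay (reported in nanoseconds).
var before = db.serverStatus().tcmalloc;
sleep(1000);                                    // ~1s later, while the stall is in progress
var during = db.serverStatus().tcmalloc;
sleep(5000);                                    // a few seconds later, once the stall has cleared
var after = db.serverStatus().tcmalloc;

print("before -> during: " + (during.tcmalloc.spinlock_total_delay_ns - before.tcmalloc.spinlock_total_delay_ns) / 1e9 + " s");
print("during -> after:  " + (after.tcmalloc.spinlock_total_delay_ns - during.tcmalloc.spinlock_total_delay_ns) / 1e9 + " s");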

MongoDB version: 3.4.16
I've a diagnostic.data archive to help you debug this if you want.



 Comments   
Comment by Dmitry Agranat [ 29/Mar/20 ]

Hi spuyet,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Dima

Comment by Kelly Lewis [ 19/Nov/19 ]

Hi spuyet, please let us know if you're still experiencing this issue. Thank you!

Comment by Dmitry Agranat [ 26/Sep/19 ]

Hi Sébastien,

Did you have a chance to review your application again for any clues what might be running repeatedly every 10 minutes?

Thank you,
Dima

Comment by Sébastien Puyet [ 16/Aug/19 ]

OK, as I said, we didn't find anything in particular running every 10 minutes on our side. We're going to check again to see if we find something interesting.

Comment by Bruce Lucas (Inactive) [ 01/Aug/19 ]

Hi Sébastien,

In the data from 07-16 we can see the same 10-minute periodicity that Joseph identified previously (times in following chart are UTC):

The striking thing is that these incidents occur exactly 10 minutes apart, and within a few seconds of the 10 minute mark. Looking over the log lines quoted above we see that most of the slow serverStatus messages also follow this pattern. There is nothing in mongod that follows that pattern, so this probably originates from something external to mongod.

We haven't been able to identify any clear change in workload with this period. Have you been able to identify anything in your application that follows this pattern of doing something every 10 minutes on the 10 minute mark?

Can we also take a look at syslog to see if there's anything at the system level to explain this? The spikes in TCP retransmissions could indicate a system issue, although they might also just be a result of temporary network congestion due to a large number of delayed operations all completing at the same time (see following).

Within each incident we see a striking pattern, for example the following from the 15:10 incident:

> egrep -o '....-..-..T..:..:....'  mongo-16-07-2019.log | sort | uniq -c
   1 2019-07-16T15:10:02.7
   1 2019-07-16T15:10:02.8
   3 2019-07-16T15:10:02.9
   1 2019-07-16T15:10:03.1
   1 2019-07-16T15:10:03.2
   1 2019-07-16T15:10:03.3
 276 2019-07-16T15:10:05.1
   1 2019-07-16T15:10:05.2
   2 2019-07-16T15:10:05.3
 367 2019-07-16T15:10:07.1
   2 2019-07-16T15:10:07.2
   2 2019-07-16T15:10:07.3
   4 2019-07-16T15:10:07.4
   1 2019-07-16T15:10:07.5
 331 2019-07-16T15:10:09.2
  13 2019-07-16T15:10:09.3
   2 2019-07-16T15:10:09.4
   5 2019-07-16T15:10:09.5
   3 2019-07-16T15:10:09.6
 377 2019-07-16T15:10:11.4
   6 2019-07-16T15:10:11.5
   7 2019-07-16T15:10:11.6
   1 2019-07-16T15:10:11.7
 358 2019-07-16T15:10:13.5
   5 2019-07-16T15:10:13.6
   1 2019-07-16T15:10:13.7
   5 2019-07-16T15:10:13.8
   4 2019-07-16T15:10:13.9
 364 2019-07-16T15:10:15.7
  18 2019-07-16T15:10:15.8
   5 2019-07-16T15:10:15.9
   4 2019-07-16T15:10:16.0
   3 2019-07-16T15:10:16.1
   1 2019-07-16T15:10:16.2
 388 2019-07-16T15:10:18.0
   4 2019-07-16T15:10:18.1
   2 2019-07-16T15:10:18.2
   4 2019-07-16T15:10:18.3
   5 2019-07-16T15:10:18.4
   5 2019-07-16T15:10:18.5
   8 2019-07-16T15:10:18.6

We see a couple of seconds of nothing being logged, followed by a large number of operations all completing within a tenth of a second, and this is repeated several times.

We can get a clearer picture by processing the start time and duration of each logged query to determine the number of queries in progress at each point in time. The following chart covers about 20 seconds during the incident at 15:10.

We see operations suddenly begin queuing up but continuing to arrive, and then suddenly all complete after a couple of seconds. Each of the groups shows a slow serverStatus that was, as you noted, blocked while collecting tcmalloc metrics. Together with the tcmalloc spinlock metrics, this suggests that something is holding a lock internal to tcmalloc, likely the pageheap lock, for a couple of seconds at a time, repeated several times. This isn't a pattern we've seen before, and we don't see anything in the tcmalloc metrics that clearly indicates resource contention.
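
The kind of processing described above (deriving, from each logged operation's end timestamp and duration, how many operations were in flight at each moment) can be sketched roughly as follows. This is not the tooling actually used here, just an illustrative JavaScript sketch for the mongo shell; it assumes slow-op lines ending in an "NNNms" duration, as in the command lines quoted elsewhere in this ticket, and it only sees operations slow enough to be logged.

// Given mongod log lines that end with a duration like "... 396ms", estimate how many
// of those operations were in flight during each second. The timestamp at the start of
// a line marks when the operation finished; subtracting the duration gives its start.
function inFlightPerSecond(lines) {
    var counts = {};                                // epoch second -> operations in progress
    lines.forEach(function (line) {
        var m = line.match(/ (\d+)ms$/);
        if (!m) return;                             // not a slow-op line
        // Normalize "+0200" to "+02:00" so Date.parse accepts the timezone offset.
        var end = Date.parse(line.substring(0, 28).replace(/([+-]\d{2})(\d{2})$/, "$1:$2"));
        var start = end - parseInt(m[1], 10);
        for (var s = Math.floor(start / 1000); s <= Math.floor(end / 1000); s++) {
            counts[s] = (counts[s] || 0) + 1;
        }
    });
    return counts;
}

// Example with two (abbreviated) slow-op lines quoted in this ticket:
printjson(inFlightPerSecond([
    "2019-07-16T00:08:01.665+0200 I COMMAND  [conn1909368] command admin.$cmd ... 396ms",
    "2019-07-16T02:07:00.918+0200 I COMMAND  [conn1910496] command admin.$cmd ... 156ms"
]));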

To help us with the diagnosis, can you:

  • look at the application for workloads that are repeated at 10-minute intervals
  • upload the syslog covering the 07-16 incident
Comment by Sébastien Puyet [ 31/Jul/19 ]

Dmitry, I've just uploaded the 16th July logs.

Comment by Dmitry Agranat [ 31/Jul/19 ]

Hi spuyet,

> We noticed that one line is reporting high globalLock time at 2019-07-16T15:10:18.022+0200. I've just uploaded the slow log file and the diagnostic data file of today through the secure portal.

I had a look at the attached log "15-07-2019.mongodb.log.gz" and unfortunately, it does not cover the reported time:

$ mloginfo 15-07-2019.mongodb.log
     source: 15-07-2019.mongodb.log
       host: unknown
      start: 2019 Jul 15 00:05:07.987
        end: 2019 Jul 16 00:05:22.286

An extract from the log covering 10 minutes before and 10 minutes after the reported event would be enough.

Thank you,
Dima

Comment by Sébastien Puyet [ 29/Jul/19 ]

Hello Dmitry, Hello Joseph,

Any news on your side about this issue? We are still seeing occurrences on our production servers.

Comment by Sébastien Puyet [ 16/Jul/19 ]

We just had a new issue with one of our mongo servers (I will create a separate ticket for this), so we chose to promote our secondary server to primary (at 15:06). Four minutes later (15:10), the newly promoted server had a big slowdown:


The [ftdc] thread logged "serverStatus was very slow" multiple times during the slowdown:

$ cat /home/log/mongo/mongodb.log | grep serverStatus
2019-07-16T00:08:01.665+0200 I COMMAND  [conn1909368] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24424 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 394894 } } } protocol:op_command 396ms
2019-07-16T02:07:00.918+0200 I COMMAND  [conn1910496] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24420 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 153597 } } } protocol:op_command 156ms
2019-07-16T03:51:00.975+0200 I COMMAND  [conn1911157] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24420 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 491734 } } } protocol:op_command 494ms
2019-07-16T09:05:30.415+0200 I COMMAND  [conn1829437] command admin.$cmd command: serverStatus { serverStatus: 1, recordStats: 0 } numYields:0 reslen:24443 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 411639 } } } protocol:op_query 414ms
2019-07-16T09:09:01.745+0200 I COMMAND  [conn1913170] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24428 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 150483 } } } protocol:op_command 152ms
2019-07-16T09:23:01.643+0200 I COMMAND  [conn1913255] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24428 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 155218 } } } protocol:op_command 156ms
2019-07-16T09:55:30.891+0200 I COMMAND  [conn1916422] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24428 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 554031 } } } protocol:op_command 556ms
2019-07-16T10:01:00.133+0200 I COMMAND  [conn1829437] command admin.$cmd command: serverStatus { serverStatus: 1, recordStats: 0 } numYields:0 reslen:24443 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 128862 } } } protocol:op_query 132ms
2019-07-16T10:14:01.689+0200 I COMMAND  [conn1916578] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24428 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 574633 } } } protocol:op_command 577ms
2019-07-16T10:19:30.129+0200 I COMMAND  [conn1829437] command admin.$cmd command: serverStatus { serverStatus: 1, recordStats: 0 } numYields:0 reslen:24443 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 126086 } } } protocol:op_query 128ms
2019-07-16T11:06:00.982+0200 I COMMAND  [conn1920034] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24428 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 813395 } } } protocol:op_command 816ms
2019-07-16T11:11:30.911+0200 I COMMAND  [conn1920069] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24428 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 386187 } } } protocol:op_command 389ms
2019-07-16T11:30:30.925+0200 I COMMAND  [conn1920208] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24428 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 392233 } } } protocol:op_command 394ms
2019-07-16T12:54:00.901+0200 I COMMAND  [conn1920783] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24428 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 321369 } } } protocol:op_command 323ms
2019-07-16T13:54:00.919+0200 I COMMAND  [conn1921199] command admin.$cmd appName: "MongoDB Shell" command: serverStatus { serverStatus: 1.0 } numYields:0 reslen:24428 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 539777 } } } protocol:op_command 542ms
2019-07-16T15:10:05.111+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 1106, after wiredTiger: 1106, at end: 1106 }
2019-07-16T15:10:07.167+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 10, after locks: 10, after network: 10, after opLatencies: 10, after opcounters: 10, after opcountersRepl: 10, after repl: 10, after security: 10, after storageEngine: 10, after tcmalloc: 1172, after wiredTiger: 1172, at end: 1172 }
2019-07-16T15:10:09.296+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 10, after globalLock: 10, after locks: 10, after network: 10, after opLatencies: 10, after opcounters: 10, after opcountersRepl: 10, after repl: 10, after security: 10, after storageEngine: 10, after tcmalloc: 1301, after wiredTiger: 1301, at end: 1301 }
2019-07-16T15:10:11.447+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 1442, after wiredTiger: 1442, at end: 1442 }
2019-07-16T15:10:13.583+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 1585, after wiredTiger: 1585, at end: 1585 }
2019-07-16T15:10:15.776+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 10, after locks: 10, after network: 10, after opLatencies: 10, after opcounters: 10, after opcountersRepl: 10, after repl: 10, after security: 10, after storageEngine: 10, after tcmalloc: 1775, after wiredTiger: 1775, at end: 1775 }
2019-07-16T15:10:18.022+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 1016, after locks: 1016, after network: 1016, after opLatencies: 1016, after opcounters: 1016, after opcountersRepl: 1016, after repl: 1016, after security: 1016, after storageEngine: 1016, after tcmalloc: 1016, after wiredTiger: 1016, at end: 1016 }
2019-07-16T15:20:05.099+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 1090, after wiredTiger: 1090, at end: 1100 }
2019-07-16T15:30:05.036+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 1029, after wiredTiger: 1029, at end: 1029 }
2019-07-16T16:10:05.050+0200 I COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after backgroundFlushing: 0, after connections: 0, after dur: 0, after extra_info: 0, after globalLock: 0, after locks: 0, after network: 0, after opLatencies: 0, after opcounters: 0, after opcountersRepl: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 1049, after wiredTiger: 1049, at end: 1049 } 

We noticed that one line is reporting high globalLock time at 2019-07-16T15:10:18.022+0200. I've just uploaded the slow log file and the diagnostic data file of today through the secure portal.

Comment by Sébastien Puyet [ 15/Jul/19 ]

Hello Joseph,

I just uploaded 3 new files to the secure portal containing the output of the currentOp() command before, during and after a lock. Hope that helps!

 

Comment by Eric Sedor [ 11/Jul/19 ]

Thanks spuyet, that helps. I'll let you know what we find.

Comment by Sébastien Puyet [ 11/Jul/19 ]

> Thanks for the diagnostic data. Can you please double-check the log archive files and/or clarify how we should expect to unzip them? I am getting "tar: Unrecognized archive format" for the log files with tar -xvzf. But the diagnostic data extracts fine.

My bad, that was the wrong naming: those files are .gz files.

> This doesn't appear associated with large swings in allocated memory, or with other activity that I can yet identify. There are connection spikes but they are very likely symptoms of the latency.

When mongo starts to lock, all our web processes get stuck, then the autoscaler spawns more web processes, generating more connections to mongo. That's most probably a consequence and not a cause.

> Is there anything in your system happening with this 10 minute periodicity?

At first look we found nothing that could trigger this every 10 minutes; we'll take a second look if you don't find anything in the logs.

Comment by Eric Sedor [ 10/Jul/19 ]

spuyet,

Thanks for the diagnostic data. Can you please double-check the log archive files and/or clarify how we should expect to unzip them? I am getting "tar: Unrecognized archive format" for the log files with tar -xvzf. But the diagnostic data extracts fine.

So far:

I can definitely see the tcmalloc spinlock delay increasing the latency of reads by slowing the rate at which data can enter the cache, and there seems to be a spike every 10 minutes.

This doesn't appear associated with large swings in allocated memory, or with other activity that I can yet identify. There are connection spikes but they are very likely symptoms of the latency.

Is there anything in your system happening with this 10 minute periodicity?
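
One way to check for that periodicity on the database side, independently of FTDC, would be to poll the cumulative tcmalloc spinlock delay and note when it jumps; a minimal mongo shell sketch (the 1-second interval and 100 ms threshold are arbitrary assumptions, not a recommendation):

// Poll the cumulative tcmalloc spinlock delay once a second for ~20 minutes and
// print whenever it grows noticeably, to see whether the jumps line up with the
// 10-minute marks. Note that serverStatus itself can stall on the same lock during
// an incident, which shows up here as a gap between consecutive printed timestamps.
var prev = db.serverStatus().tcmalloc.tcmalloc.spinlock_total_delay_ns;
for (var i = 0; i < 1200; i++) {
    sleep(1000);
    var cur = db.serverStatus().tcmalloc.tcmalloc.spinlock_total_delay_ns;
    var deltaMs = (cur - prev) / 1e6;
    if (deltaMs > 100) {                            // arbitrary threshold
        print(new Date().toISOString() + "  spinlock delay grew by " + deltaMs.toFixed(0) + " ms");
    }
    prev = cur;
}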

Comment by Sébastien Puyet [ 03/Jul/19 ]

Thanks Daniel for your quick answer. I've uploaded the diagnostic data for both servers (last 5 days) and the slow query log files for yesterday. If you find something interesting in the diagnostic data for another day, let me know and I'll upload the corresponding slow log file.

Comment by Danny Hatcher (Inactive) [ 03/Jul/19 ]

Hi Sébastien,

Thanks for the effort you've put into diagnosing this already; it is much appreciated. I've generated a Secure Upload Portal that you can use to upload files. Only MongoDB engineers will be able to see the files stored within. As you mentioned, please upload the "diagnostic.data" archives from every server involved in these issues as well as the mongod logs from the same servers covering the problem timeframe.

Comment by Sébastien Puyet [ 03/Jul/19 ]

Here is the impact on our current production:

An interesting thing to notice is that we have two different replica sets on different servers, and both of them are impacted by this issue. I will upload the two diagnostic.data archives to the portal so you can compare them once I have a link.
