[SERVER-42062] tcmalloc is locking mongo Created: 03/Jul/19 Updated: 16/Nov/21 Resolved: 29/Mar/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Performance, Stability |
| Affects Version/s: | 3.4.16 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Sébastien Puyet | Assignee: | Dmitry Agranat |
| Resolution: | Incomplete | Votes: | 2 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Participants: |
| Description |
|
Hello, We're having some trouble on our current production servers, with a lot of log lines like this one:
A lot of time seems to be spent in "tcmalloc", from what I can see. We found a lot of tickets with a similar issue: After reading these, we tried to get more details about the locks with the serverStatus command, run before, during and after the lock:
The last command shows how long (~24 seconds) clients were blocked between the before and after samples. MongoDB version: 3.4.16 |
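For reference, a minimal sketch (not part of the original report) of the kind of before/during/after sampling described above, using pymongo. The connection URI, sleep intervals and field paths are assumptions and may differ between MongoDB versions:

```python
# Hedged sketch (not from the ticket): sample serverStatus with verbose
# tcmalloc detail before, during and after a stall. URI, sleep intervals and
# field paths are assumptions and may vary by MongoDB version.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI

def sample():
    # {serverStatus: 1, tcmalloc: 1} requests the detailed allocator section
    status = client.admin.command("serverStatus", tcmalloc=1)
    tc = status.get("tcmalloc", {})
    return {
        "ts": time.time(),
        "spinlock_total_delay_ns": tc.get("tcmalloc", {}).get("spinlock_total_delay_ns"),
        "current_allocated_bytes": tc.get("generic", {}).get("current_allocated_bytes"),
        "globalLock_currentQueue": status.get("globalLock", {}).get("currentQueue"),
    }

if __name__ == "__main__":
    snapshots = {"before": sample()}
    time.sleep(30)                 # adjust to cover the observed stall
    snapshots["during"] = sample()
    time.sleep(30)
    snapshots["after"] = sample()
    for label, snap in snapshots.items():
        print(label, snap)
```

The delta in spinlock_total_delay_ns between the "before" and "after" samples gives a rough measure of how long tcmalloc's internal lock was contended during the stall.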
| Comments |
| Comment by Dmitry Agranat [ 29/Mar/20 ] |
|
Hi spuyet, We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket. Regards, |
| Comment by Kelly Lewis [ 19/Nov/19 ] |
|
Hi spuyet, please let us know if you're still experiencing this issue. Thank you! |
| Comment by Dmitry Agranat [ 26/Sep/19 ] |
|
Hi Sébastien, Did you have a chance to review your application again for any clues about what might be running repeatedly every 10 minutes? Thank you, |
| Comment by Sébastien Puyet [ 16/Aug/19 ] |
|
OK, as I said, we didn't find anything in particular running every 10 minutes on our side. We're going to check again to see if we find something interesting. |
| Comment by Bruce Lucas (Inactive) [ 01/Aug/19 ] |
|
Hi Sébastien, In the data from 07-16 we can see the same 10-minute periodicity that Joseph identified previously (times in the following chart are UTC):
The striking thing is that these incidents occur exactly 10 minutes apart, and within a few seconds of the 10-minute mark. Looking over the log lines quoted above, we see that most of the slow serverStatus messages also follow this pattern. There is nothing in mongod that follows that pattern, so this probably originates from something external to mongod. We haven't been able to identify any clear change in workload with this period. Have you been able to identify anything in your application that follows this pattern of doing something every 10 minutes on the 10-minute mark? Can we also take a look at syslog to see if there's anything at the system level to explain this? The spikes in TCP retransmissions could indicate a system issue, although they might also just be a result of temporary network congestion due to a large number of delayed operations all completing at the same time (see following). Within each incident we see a striking pattern, for example the following from the 15:10 incident:
We see a couple of seconds of nothing being logged, followed by a large number of operations all completing within a tenth of a second, and this is repeated several times. We can get a clearer picture by processing the start times and duration of each logged query to determine the number of queries in progress at each time. The following chart covers about 20 seconds during the incident at 15:10.
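As a rough illustration of that kind of post-processing (a sketch, not the actual tooling used for the chart), the following parses mongod log lines that end in an operation duration, derives each operation's start time from its end timestamp, and counts how many operations were in flight at each second. The log-format assumptions and file name are placeholders:

```python
# Hedged sketch: estimate how many logged operations were in progress at each
# second, from mongod log lines that end in "... 1234ms". The format
# assumptions (ISO-8601 timestamp first, duration last) match typical 3.4
# logs but may need adjusting; the file name is a placeholder.
import re
from collections import Counter
from datetime import datetime, timedelta

LINE = re.compile(r"^(\S+)\s.*\s(\d+)ms$")  # end timestamp ... duration in ms

def parse_ts(ts):
    # e.g. 2019-07-16T15:10:18.022+0200
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%S.%f%z")

in_progress = Counter()
with open("mongod.log") as f:           # placeholder path
    for line in f:
        m = LINE.match(line.strip())
        if not m:
            continue
        end = parse_ts(m.group(1))
        start = end - timedelta(milliseconds=int(m.group(2)))
        # bump a per-second counter for every second the op was running
        t = start.replace(microsecond=0)
        while t <= end:
            in_progress[t] += 1
            t += timedelta(seconds=1)

for t in sorted(in_progress):
    print(t.isoformat(), in_progress[t])
```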
We see operations suddenly begin queuing up but continuing to arrive, and then suddenly all complete after a couple of seconds. Each of the groups shows a slow serverStatus that was, as you noted, blocked while collecting tcmalloc metrics. This, together with the tcmalloc spinlock metrics, suggests that something is holding a lock internal to tcmalloc, likely the pageheap lock, for a couple of seconds, and that this is repeated several times. This isn't a pattern we've seen before and we don't see anything in the tcmalloc metrics to clearly indicate resource contention. To help us with the diagnosis, can you:
|
| Comment by Sébastien Puyet [ 31/Jul/19 ] |
|
Dmitry, I've just uploaded the 16th July logs. |
| Comment by Dmitry Agranat [ 31/Jul/19 ] |
|
Hi spuyet,
I had a look at the attached log "15-07-2019.mongodb.log.gz" and, unfortunately, it does not cover the reported time:
An extract from the log covering 10 minutes before and 10 minutes after the reported event would be enough. Thank you, |
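For what it's worth, a minimal sketch of producing such an extract by filtering on the leading ISO-8601 timestamp of each log line; the event time and file names below are placeholders taken from the discussion:

```python
# Hedged sketch: extract mongod log lines within +/- 10 minutes of an event.
# Event time and file names are placeholders.
from datetime import datetime, timedelta

EVENT = datetime.strptime("2019-07-16T15:10:18.022+0200",
                          "%Y-%m-%dT%H:%M:%S.%f%z")
WINDOW = timedelta(minutes=10)

with open("mongod.log") as src, open("mongod.extract.log", "w") as dst:
    for line in src:
        try:
            ts = datetime.strptime(line.split(" ", 1)[0],
                                   "%Y-%m-%dT%H:%M:%S.%f%z")
        except ValueError:
            continue  # skip lines without a leading timestamp
        if abs(ts - EVENT) <= WINDOW:
            dst.write(line)
```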
| Comment by Sébastien Puyet [ 29/Jul/19 ] |
|
Hello Dmitry, Hello Joseph, Any news on your side about this issue? We still have occurrences on our production servers. |
| Comment by Sébastien Puyet [ 16/Jul/19 ] |
|
We just had a new issue with one of our mongo servers (I will create a separate ticket for this), so we chose to promote our secondary server to master (at 15:06). Four minutes later (15:10), the newly promoted server had a big slowdown:
ftdc logged multiple times during the slowdown:
We noticed that one line reports a high globalLock time at 2019-07-16T15:10:18.022+0200. I've just uploaded the slow log file and the diagnostic data file of today through the secure portal. |
| Comment by Sébastien Puyet [ 15/Jul/19 ] |
|
Hello Joseph, I just uploaded 3 new files on the secure portal containing the output of the currentOp() command before, during and after a lock. Hope that helps!
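For reference, a minimal sketch of capturing such snapshots programmatically; it assumes a pymongo connection and a server recent enough to expose the currentOp command (3.6+), whereas on 3.4 the shell helper db.currentOp() is the usual route. File names, URI and the spacing between snapshots are placeholders:

```python
# Hedged sketch: dump currentOp output to timestamped JSON files before,
# during and after a stall. Assumes the currentOp command is available
# (MongoDB 3.6+); on 3.4 the shell helper db.currentOp() is the equivalent.
import json
import time
from bson import json_util
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI

def dump_current_op(label):
    ops = client.admin.command("currentOp")
    path = "currentOp-%s-%d.json" % (label, int(time.time()))
    with open(path, "w") as f:
        # json_util handles BSON types such as ObjectId and datetime
        json.dump(ops, f, default=json_util.default, indent=2)
    return path

for label in ("before", "during", "after"):
    print(dump_current_op(label))
    time.sleep(10)  # placeholder spacing between snapshots
```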
|
| Comment by Eric Sedor [ 11/Jul/19 ] |
|
Thanks spuyet, that helps. I'll let you know what we find. |
| Comment by Sébastien Puyet [ 11/Jul/19 ] | ||||||||||||||||||||||||||||||||||||||||||
My bad, that's a naming mistake: those files are .gz files.
When mongo starts to lock, all our web processes get stuck, then the autoscaler spawns more web processes, generating more connections to mongo; that's most probably a consequence and not a cause.
At first look we found nothing that could trigger this every 10 minutes; we'll have a second look if you don't find anything in the logs.
| Comment by Eric Sedor [ 10/Jul/19 ] |
|
Thanks for the diagnostic data. Can you please double-check the log archive files and/or clarify how we should expect to unzip them? I am getting "tar: Unrecognized archive format" for the log files with tar -xvzf, but the diagnostic data extracts fine. So far: I can definitely see the tcmalloc spinlock delay, which is increasing the latency of reads by slowing the rate at which data can enter cache, and there seems to be a spike every 10 minutes. This doesn't appear to be associated with large swings in allocated memory, or with other activity that I can yet identify. There are connection spikes, but they are very likely symptoms of the latency. Is there anything in your system happening with this 10-minute periodicity? |
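In case it helps, a hedged sketch of telling a plain gzip-compressed log apart from a gzipped tar archive before extracting; the file name is reused from the discussion purely as a placeholder:

```python
# Hedged sketch: check whether an uploaded .gz file is a gzipped tar archive
# (tar -xvzf works) or a plain gzip-compressed log (gunzip / gzip.open works).
# The file name is a placeholder taken from the discussion.
import gzip
import shutil
import tarfile

path = "15-07-2019.mongodb.log.gz"

if tarfile.is_tarfile(path):
    with tarfile.open(path, "r:gz") as tar:
        tar.extractall("logs/")            # it really was a tar.gz
else:
    with gzip.open(path, "rb") as src, open(path[:-3], "wb") as dst:
        shutil.copyfileobj(src, dst)       # plain gzip: just decompress
```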
| Comment by Sébastien Puyet [ 03/Jul/19 ] |
|
Thanks Daniel for your quick answer. I've uploaded the diagnostic data for both servers (last 5 days) and the slow queries log files for yesterday. If you find something interesting in the diagnostic data for another day, let me know and I'll upload the corresponding slow log file. |
| Comment by Danny Hatcher (Inactive) [ 03/Jul/19 ] |
|
Hi Sébastien, Thanks for the effort you've put into diagnosing this already; it is much appreciated. I've generated a Secure Upload Portal that you can use to upload files. Only MongoDB engineers will be able to see the files stored within. As you mentioned, please upload the "diagnostic.data" archives from every server involved in these issues as well as the mongod logs from the same servers covering the problem timeframe. |
| Comment by Sébastien Puyet [ 03/Jul/19 ] |
|
Here is the impact on our current production: An interesting thing to notice is that we have two different replica sets on different servers, and both of them are impacted by this issue. I will upload the two diagnostic.data archives to the portal to let you compare once I have a link. |