[SERVER-36516] Secondary mongod becomes unresponsive occasionally Created: 07/Aug/18 Updated: 27/Oct/23 Resolved: 25/Oct/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.4.16 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthias Eichstaedt | Assignee: | Backlog - Triage Team |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: | Server Triage |
| Operating System: | ALL |
| Participants: |
| Description |
|
MongoDB server version: 3.4.16. The replica set has a primary and 2 secondaries. A few times a week, read requests become unresponsive on one of the secondaries. In addition, the secondary becomes unresponsive to operations such as "/etc/init.d/mongod restart"; mongod never comes to a complete stop:
The output of "db.currentOp(true)", captured while mongod was stuck, is posted below.
|
| Comments |
| Comment by Kelsey Schubert [ 25/Oct/18 ] | |
|
We haven’t heard back from you for some time since the upgrade, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket. Regards, | |
| Comment by Matthias Eichstaedt [ 03/Oct/18 ] | |
|
Hi Eric, Matthias | |
| Comment by Eric Milkie [ 19/Sep/18 ] | |
|
The new gdb trace is similar to the other ones from version 3.6. One thread is still inserting documents and everything else is blocked behind it. | |
| Comment by Matthias Eichstaedt [ 18/Sep/18 ] | |
|
Hi Eric, thanks for your updates.
1) I have just observed this issue again and ran the gdb command again. The
2) When the replication stalls, is there any way to identify which insert
3) We are going to upgrade to 4.0.2 in the next few days.
Matthias | |
| Comment by Eric Milkie [ 18/Sep/18 ] | |
|
I don't think | |
| Comment by Matthias Eichstaedt [ 17/Sep/18 ] | |
|
Hi Eric, Could this issue be caused by this jira: Otherwise, is there any further debug information we can provide? Are there mongo::repl::SyncTail::multiApply(mongo::OperationContext*, Matthias | |
| Comment by Matthias Eichstaedt [ 06/Sep/18 ] | |
|
Thanks for the update, Nick. I just saw the same issue again and uploaded the gdb dump file (gdb_20180906.txt). Matthias | |
| Comment by Nick Brewer [ 05/Sep/18 ] | |
|
matthias.eichstaedt Thanks for the thorough observations. We're looking into this now, and we should have an update for you soon. -Nick | |
| Comment by Matthias Eichstaedt [ 05/Sep/18 ] | |
|
Hi Nick, gdb command:
Uploaded files to the secure area: | |
| Comment by Matthias Eichstaedt [ 03/Sep/18 ] | |
|
We upgraded MongoDB in an attempt to get around this issue. However, we just saw replication to one secondary become slow or stuck, even though the system is lightly loaded (on the order of 20 qps). The host in question is almost 1 hour behind the primary. | |
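For readers who want to quantify "almost 1 hour behind", here is a minimal sketch of how lag can be derived from a replSetGetStatus-style document. It assumes the document has already been fetched and parsed into a Python dict whose `members` entries carry `name`, `stateStr`, and `optimeDate`. The hostnames and timestamps in the sample are hypothetical and are not taken from this ticket.

```python
from datetime import datetime, timedelta


def replication_lag(status):
    """Return {member name: lag behind the primary} for every secondary.

    `status` is assumed to look like the output of replSetGetStatus:
    a dict with a "members" list whose entries have "name", "stateStr",
    and "optimeDate" (a datetime) fields.
    """
    members = status["members"]
    # Exactly one PRIMARY is assumed; StopIteration is raised if none exists.
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hypothetical sample data, for illustration only.
sample_status = {
    "members": [
        {"name": "db1.example.net:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2018, 9, 3, 12, 0, 0)},
        {"name": "db2.example.net:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2018, 9, 3, 11, 59, 58)},
        {"name": "db3.example.net:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2018, 9, 3, 11, 5, 0)},
    ]
}

for name, lag in replication_lag(sample_status).items():
    print(name, lag)
```

A secondary whose lag keeps growing between successive samples, rather than staying constant, would be consistent with replication being stuck rather than merely slow.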
| Comment by Matthias Eichstaedt [ 21/Aug/18 ] | |
|
Thanks, Nick. Installed gdb and waiting for this issue to occur again. | |
| Comment by Nick Brewer [ 21/Aug/18 ] | |
|
matthias.eichstaedt We've looked over the data you provided but unfortunately it's still not enough to give us a clear picture of where the issue is occurring. To get a better idea, we'd like to have you run gdb on the mongod until it becomes unresponsive again. If you could issue the following command in a screen/tmux session:
You can upload the resulting gdb.txt to the secure portal. Thanks, | |
| Comment by Matthias Eichstaedt [ 17/Aug/18 ] | |
|
Thanks for the update. | |
| Comment by Nick Brewer [ 17/Aug/18 ] | |
|
matthias.eichstaedt Just wanted to let you know we're still looking into this - should have some more details soon. -Nick | |
| Comment by Matthias Eichstaedt [ 10/Aug/18 ] | |
|
I uploaded another set of logs for a different incident at around 20180810 12:21:37 | |
| Comment by Matthias Eichstaedt [ 08/Aug/18 ] | |
|
I uploaded diagnostic.data.tgz and mongod.log.2. | |
| Comment by Nick Brewer [ 08/Aug/18 ] | |
|
matthias.eichstaedt Thanks for your report. Could you archive (tar or zip) and upload the $dbpath/diagnostic.data directory, as well as the log, from the affected secondary? If you'd prefer, you can upload this information to our secure portal. Information shared there is only available to MongoDB employees, and is automatically removed after a period of time. Thanks, | |
| Comment by Matthias Eichstaedt [ 08/Aug/18 ] | |
|
Another random observation: |