[SERVER-36516] Secondary mongod becomes unresponsive occasionally Created: 07/Aug/18  Updated: 27/Oct/23  Resolved: 25/Oct/18

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.4.16
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Matthias Eichstaedt Assignee: Backlog - Triage Team
Resolution: Gone away Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Assigned Teams:
Server Triage
Operating System: ALL
Participants:

 Description   

MongoDB server version: 3.4.16

One primary and 2 secondaries; a few times a week, read requests become unresponsive on one of the secondaries. In addition, the secondary becomes unresponsive to operations such as "/etc/init.d/mongod restart".

mongod never comes to a complete stop:

2018-08-06T07:07:08.621-0700 I COMMAND  [conn1040] terminating, shutdown command received
2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] shutdown: going to close listening sockets...
2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] closing listening socket: 7
2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] closing listening socket: 8
2018-08-06T07:07:08.621-0700 I NETWORK  [conn1040] removing socket file: /tmp/mongodb-27018.sock
2018-08-06T07:07:08.622-0700 I NETWORK  [conn1040] shutdown: going to flush diaglog...
2018-08-06T07:07:08.622-0700 I REPL     [conn1040] shutting down replication subsystems
2018-08-06T07:07:08.622-0700 I REPL     [conn1040] Stopping replication reporter thread
2018-08-06T07:07:08.622-0700 I REPL     [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-b-shard3-1:27018: CallbackCanceled: Reporter no longer valid
2018-08-06T07:07:08.622-0700 D REPL     [SyncSourceFeedback] The replication progress command (replSetUpdatePosition) failed and will be retried: CallbackCanceled: Reporter no longer valid
2018-08-06T07:07:08.622-0700 I REPL     [conn1040] Stopping replication fetcher thread
2018-08-06T07:07:08.624-0700 I REPL     [conn1040] Stopping replication applier thread
2018-08-06T07:07:08.624-0700 D REPL     [replication-3] oplog query cancelled
2018-08-06T07:07:08.624-0700 I ASIO     [NetworkInterfaceASIO-RS-0] Ending connection to host mongo-b-shard3-1.melodis.com:27018 due to bad connection status; 1 connections to that host remain open
2018-08-06T07:07:08.624-0700 D REPL     [rsBackgroundSync] fetcher stopped reading remote oplog on mongo-b-shard3-1.melodis.com:27018
2018-08-06T07:07:08.624-0700 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re
-evaluating our sync source.
2018-08-06T07:07:08.820-0700 I COMMAND  [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0,
 term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:08.822-0700 I COMMAND  [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1,
 term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:08.861-0700 I COMMAND  [conn980] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:09.126-0700 D REPL     [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg: 
2018-08-06T07:07:09.126-0700 D REPL     [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:11.126Z
2018-08-06T07:07:09.362-0700 I COMMAND  [conn976] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:09.863-0700 I COMMAND  [conn972] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:10.364-0700 I COMMAND  [conn966] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:10.517-0700 D REPL     [ReplicationExecutor] Canceling election timeout callback at 2018-08-06T07:07:19.824-0700
2018-08-06T07:07:10.518-0700 D REPL     [ReplicationExecutor] setUpValues: heartbeat response good for member _id:0, msg: 
2018-08-06T07:07:10.518-0700 D REPL     [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-1.melodis.com:27018 at 2018-08-06T14:07:12.517Z
2018-08-06T07:07:10.827-0700 I COMMAND  [conn3] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-1.melodis.com:27018", fromId: 0,
 term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:10.829-0700 I COMMAND  [conn4] command admin.$cmd command: replSetHeartbeat { replSetHeartbeat: "mongo-b-shard3", configVersion: 7, from: "mongo-b-shard3-2.melodis.com:27018", fromId: 1,
 term: 9 } numYields:0 reslen:444 locks:{} protocol:op_command 0ms
2018-08-06T07:07:10.900-0700 I COMMAND  [conn961] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
2018-08-06T07:07:11.132-0700 D REPL     [ReplicationExecutor] setUpValues: heartbeat response good for member _id:1, msg: 
2018-08-06T07:07:11.132-0700 D REPL     [ReplicationExecutor] Scheduling heartbeat to mongo-b-shard3-2.melodis.com:27018 at 2018-08-06T14:07:13.132Z
2018-08-06T07:07:11.366-0700 I COMMAND  [conn956] command admin.$cmd command: isMaster { ismaster: 1 } numYields:0 reslen:620 locks:{} protocol:op_query 0ms
<more log content>
 

The output of the command "db.currentOp(true)", captured while mongod was stuck, is posted below.
The suspect operations here are "rsSync" (threadId 140303695083264) and "repl writer worker 11" (threadId 140303548225280); see the two snippets below.
rsSync holds the global lock in exclusive (W) mode and repl writer worker 11 holds it in intent exclusive (w) mode. The documentation
states "an exclusive (X) lock cannot coexist with any other modes, and a shared (S) lock can only coexist with intent shared (IS) locks".
https://docs.mongodb.com/manual/faq/concurrency/#faq-concurrency-locking

    {
                        "desc" : "rsSync",
                        "threadId" : "140303695083264",
                        "active" : true,
                        "opid" : 204493,
                        "secs_running" : 554,
                        "microsecs_running" : NumberLong(554584803),
                        "op" : "none",
                        "ns" : "local.replset.minvalid",
                        "query" : {
                                
                        },
                        "numYields" : 0,
                        "locks" : {
                                "Global" : "W"
                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1),
                                                "w" : NumberLong(1),
                                                "W" : NumberLong(1)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "r" : NumberLong(1),
                                                "W" : NumberLong(1)
                                        }
                                }
                        }
                },

           {
                        "desc" : "repl writer worker 11",
                        "threadId" : "140303548225280",
                        "active" : true,
                        "opid" : 204510,
                        "op" : "none",
                        "ns" : "",
                        "query" : {
                                
                        },
                        "numYields" : 0,
                        "locks" : {
                                "Global" : "w",
                                "local" : "w",
                                "Collection" : "w"
                        },
                        "waitingForLock" : false,
                        "lockStats" : {
                                "Global" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                },
                                "Database" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                },
                                "oplog" : {
                                        "acquireCount" : {
                                                "w" : NumberLong(1)
                                        }
                                }
                        }
                },
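
For reference, output like the above can be narrowed down to just the lock-related entries with a small mongo shell filter. This is a sketch against the currentOp fields shown in the snippets (opid, desc, locks, waitingForLock), not something from the original report:

db.currentOp(true).inprog.forEach(function(op) {
    // Print only operations that hold the Global lock or are queued behind a lock.
    if ((op.locks && op.locks.Global) || op.waitingForLock) {
        print(op.opid + "  " + op.desc +
              "  Global=" + (op.locks ? op.locks.Global : "-") +
              "  waitingForLock=" + op.waitingForLock);
    }
});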



 Comments   
Comment by Kelsey Schubert [ 25/Oct/18 ]

Hi matthias.eichstaedt,

We haven’t heard back from you for some time since the upgrade, so I’m going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Kelsey

Comment by Matthias Eichstaedt [ 03/Oct/18 ]

Hi Eric,
On 9/27 we upgraded our MongoDB hosts from CentOS 6.6 to 6.10. This
included upgrading glibc (specifically the mutex implementation).
I have not seen the hanging issue since then but it's only been 5 days so
far. Will let you know how it goes.

Matthias

Comment by Eric Milkie [ 19/Sep/18 ]

The new gdb trace is similar to the other ones from version 3.6. One thread is still inserting documents and everything else is blocked behind it.
It could be helpful to see what operations the secondary was applying at the moment of the hang. During a hang, if you could issue a query on the hanging node to fetch the last several documents of the oplog, that query would eventually complete when the hang stops and we would get a good indication of which operations are causing this issue. One thing that might trigger this is inserting documents that contain very large arrays, and such arrays also participate in a large number of indexes; this would have a multiplicative effect on the number of writes per operation.

Comment by Matthias Eichstaedt [ 18/Sep/18 ]

Hi Eric,

thanks for your updates.

1) I have just observed this issue again and ran the gdb command again. The
output is in the secure area, in a file called gdb_20180918.txt.
Please note that we are running 4.0.1 at this point, so the gdb_20180918.txt
dump is from 4.0.1.

2) When the replication stalls, is there any way to identify which insert
the writer thread is working on? Maybe we can get some idea from this
specific insert.

3) We are going to upgrade to 4.0.2 in the next few days.

Matthias

Comment by Eric Milkie [ 18/Sep/18 ]

I don't think SERVER-36225 is involved here.
If you upgrade to 4.0.2, you will benefit from the ability to do reads while replication is writing, so even if the replication stalls still happen, your secondary reads will not be blocked.
After discussing with another engineer, I was unable to explain why replication is stalling for you. The multiApply function is waiting for all the writer threads to finish writing, and in the traces you provided, there is always one thread that is still doing inserts.

Comment by Matthias Eichstaedt [ 17/Sep/18 ]

Hi Eric,
thanks for looking into this issue.

Could this issue be caused by this jira:
https://jira.mongodb.org/browse/SERVER-36225
If so, we could resolve our issue just by upgrading to 4.0.2.

Otherwise, is there any further debug information we can provide? Are there
any techniques to gain further insights into multiApply() and why it would
not proceed?

mongo::repl::SyncTail::multiApply(mongo::OperationContext*, std::vector<mongo::repl::OplogEntry, std::allocator<mongo::repl::OplogEntry> >) ()

Matthias

Comment by Matthias Eichstaedt [ 06/Sep/18 ]

Thanks for the update, Nick.

I just saw the same issue again and uploaded the gdb dump file (gdb_20180906.txt).

Matthias

Comment by Nick Brewer [ 05/Sep/18 ]

matthias.eichstaedt Thanks for the thorough observations. We're looking into this now, and we should have an update for you soon.

-Nick

Comment by Matthias Eichstaedt [ 05/Sep/18 ]

Hi Nick,
We observed another instance of our issue on Sept 5th. What we saw today is the same as what we saw on Monday (Sept 3):
1. Replication from the primary (which is in a different datacenter) to the secondary became "stuck", and the host fell further and further behind. It seems that one of the internal replication threads (rsSync) obtained a global lock and then went off to do something for an extended period of time while holding it. Certain operations (e.g. db.serverStatus()) are blocked waiting for that lock to be released, so when we typed "db.serverStatus()" into the mongo shell, the prompt never came back. However, reading documents from the collections is not affected. (We tried db["<our_collection_name>"].find() and were able to see documents being returned.) This is probably because, in MongoDB 4.0.x, reading documents from a secondary no longer requires obtaining a lock that is also obtained by the replication process (which is how things work in MongoDB 3.x.x). Instead, reads are now served from (slightly older) snapshots of the database, allowing read operations (by our application) and write operations (by the replication process) to proceed in parallel without blocking each other.
2. However, even though the application instances were able to continue to read chunks from the stuck secondary, the significant replication lag meant that recently-inserted chunks (inserted into the primary) could not be found on the secondary. Our application is designed so that, if it cannot find a chunk on a secondary, it attempts to find it on the primary. In this incident (and also in the previous incident on Monday), that mechanism kicked in and the application instances were able to find the missing chunks on the primary. (A sketch for estimating this replication lag from rs.status() is included at the end of this comment.)
3. After we ran the gdb command, the secondary and the replication process recovered. Last time (on Monday), the same thing happened: after we ran the gdb command, the secondary recovered. Apparently, running the gdb command has the side effect of fixing the problem (which is bizarre).

gdb command:

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' >gdb.txt

Uploaded files to the secure area:
db-currentOp-mongo-b-shard3-3-Sept-5-at-7-10am-pacific.txt
gdb_20180905.txt
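
The replication lag mentioned in point 2 can also be estimated from a member that is still responsive (typically the primary) by comparing optimes from rs.status(). A minimal mongo shell sketch, using the member fields rs.status() exposes (name, stateStr, optimeDate); it is illustrative and not part of the original report:

// Estimate each member's lag behind the primary, in seconds.
var s = rs.status();
var primary = s.members.filter(function(m) { return m.stateStr === "PRIMARY"; })[0];
if (!primary) {
    print("no primary visible from this member");
} else {
    s.members.forEach(function(m) {
        var lagSecs = (primary.optimeDate - m.optimeDate) / 1000;
        print(m.name + "  " + m.stateStr + "  lag: " + lagSecs + " s");
    });
}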

Comment by Matthias Eichstaedt [ 03/Sep/18 ]

We have upgraded our MongoDB in an attempt to get around this issue.
MongoDB server version: 4.0.1

However, we just saw an issue where replication to a secondary is slow/stuck. The system is lightly loaded (on the order of 20 qps).
I am uploading the output of the gdb command above. We have also run the db.currentOp() command and captured the output, which is also uploaded.

We observed that replication seems to be slow or stuck; the host in question is almost 1 hour behind the primary.
Since it is falling so far behind the primary, our application is now compensating for the significant replication delay by reading from another datacenter.
Our client hosts are printing warning messages saying that they are reading the chunks from the primary.

Comment by Matthias Eichstaedt [ 21/Aug/18 ]

Thanks, Nick. Installed gdb and waiting for this issue to occur again.

Comment by Nick Brewer [ 21/Aug/18 ]

matthias.eichstaedt We've looked over the data you provided but unfortunately it's still not enough to give us a clear picture of where the issue is occurring. To get a better idea, we'd like you to run gdb against the mongod when it becomes unresponsive again, by issuing the following command in a screen/tmux session:

gdb -p $(pidof mongod) -batch -ex 'thread apply all bt' >gdb.txt

You can upload the resulting gdb.txt to the secure portal.

Thanks,
Nick

Comment by Matthias Eichstaedt [ 17/Aug/18 ]

Thanks for the update.

Comment by Nick Brewer [ 17/Aug/18 ]

matthias.eichstaedt Just wanted to let you know we're still looking into this - should have some more details soon.

-Nick

Comment by Matthias Eichstaedt [ 10/Aug/18 ]

I uploaded another set of logs for a different incident at around 2018-08-10 12:21:37

Comment by Matthias Eichstaedt [ 08/Aug/18 ]

I uploaded diagnostic.data.tgz and mongod.log.2.
I believe the issue started around 2018-08-06T06:51:00

Comment by Nick Brewer [ 08/Aug/18 ]

matthias.eichstaedt Thanks for your report. Could you archive (tar or zip) and upload the $dbpath/diagnostic.data directory, as well as the log, from the affected secondary? If you'd prefer, you can upload this information to our secure portal. Information shared there is only available to MongoDB employees, and is automatically removed after a period of time.

Thanks,
Nick

Comment by Matthias Eichstaedt [ 08/Aug/18 ]

Another random observation:
The number of connections on the secondary stays constant at around 21 for the hours leading up to the hang.
At 2018-08-06T06:51:01.833-0700 the connection count (stable at 21 until then) starts to increase; seemingly every query opens an additional connection. The count reaches 566 by the time the process is killed manually (because "mongod restart" did not complete successfully).
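
The connection growth itself can be tracked over time with a small mongo shell loop. This is a sketch; it relies on db.serverStatus(), which is exactly the command that blocks once the hang sets in, so the loop will stall at roughly the same moment the problem starts:

// Log the connection count every 10 seconds.
while (true) {
    var c = db.serverStatus().connections;
    print(new Date().toISOString() + "  current=" + c.current + "  available=" + c.available);
    sleep(10 * 1000);
}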
