[SERVER-29854] Secondary members stop syncing until restarted (oplog blocked) Created: 26/Jun/17  Updated: 07/Dec/17  Resolved: 07/Nov/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.3, 3.4.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Slawomir Lukiewski Assignee: Mark Agarunov
Resolution: Done Votes: 2
Labels: MAREF, SWNA
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File ftdc2.png     PNG File stuck.png    
Issue Links:
Related
related to SERVER-31368 Log time spent waiting for other shar... Closed
Operating System: ALL
Participants:

 Description   

Hello,

Since we upgraded our cluster from 3.2.8 to 3.4.3, secondary members of our most loaded shard regularly stop syncing from the primary. When that happens, the "oplog last event time" field returned by the rs.printReplicationInfo() command stays blocked indefinitely. The secondary member just falls behind and we need to restart it in order to free the oplog and let it catch up with the primary.
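For illustration, this is how the symptom shows up in the mongo shell (the command is the one mentioned above; the output values below are placeholders, not data from the affected node):

// Run on the stuck secondary. The "oplog last event time" line is the field
// that stops advancing when the issue occurs.
rs.printReplicationInfo()
//   configured oplog size:   <size> MB
//   log length start to end: <seconds> secs
//   oplog first event time:  <timestamp>
//   oplog last event time:   <timestamp>   <-- frozen while the node is stuck
//   now:                     <timestamp>

// Replication lag as reported from the primary (3.4 shell helper):
rs.printSlaveReplicationInfo()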

When it happens, just before the service shutdown, we can see this error in the logs:

2017-06-11T05:13:46.410+0200 I REPL     [replication-7923] Error returned from oplog query while canceling query: ExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 39149258ms, timeout was set to 65000ms
2017-06-11T05:13:46.411+0200 I REPL     [rsBackgroundSync] Replication producer stopped after oplog fetcher finished returning a batch from our sync source.  Abandoning this batch of oplog entries and re-evaluating our sync source.

The "took 39149258ms, timeout was set to 65000ms" is strange, seems that the oplog query was longer that the timeout. The shard on witch it happens receives huge amount of read and writes so it's not shocking for us to see a oplog query to fail. But the secondary should be able to retry it.

And sometimes, when the oplog is blocked and we restart the affected member, it produces these messages in the logs:

2017-06-25T17:35:32.914+0200 I CONTROL  [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2017-06-25T17:35:32.914+0200 I NETWORK  [signalProcessingThread] shutdown: going to close listening sockets...
2017-06-25T17:35:32.914+0200 I NETWORK  [signalProcessingThread] closing listening socket: 7
2017-06-25T17:35:32.915+0200 I NETWORK  [signalProcessingThread] shutdown: going to flush diaglog...
2017-06-25T17:35:32.915+0200 I REPL     [signalProcessingThread] shutting down replication subsystems
2017-06-25T17:35:32.916+0200 I REPL     [signalProcessingThread] Stopping replication reporter thread
2017-06-25T17:35:57.051+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.30:27025 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.052+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.3:27025 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.052+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.28:27025 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.052+0200 W NETWORK  [ReplicaSetMonitor-TaskExecutor-0] No primary detected for set csReplSet
2017-06-25T17:35:57.052+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] All nodes for set csReplSet are down. This has happened for 1 checks in a row.
2017-06-25T17:35:57.329+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.18:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.329+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.16:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.329+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.17:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.329+0200 W NETWORK  [ReplicaSetMonitor-TaskExecutor-0] No primary detected for set sfx6
2017-06-25T17:35:57.329+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] All nodes for set sfx6 are down. This has happened for 1 checks in a row.
2017-06-25T17:35:57.839+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.30:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.839+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.29:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.839+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] Marking host 172.16.18.28:27017 as failed :: caused by :: Location17382: Can't use connection pool during shutdown
2017-06-25T17:35:57.839+0200 W NETWORK  [ReplicaSetMonitor-TaskExecutor-0] No primary detected for set sfx10
2017-06-25T17:35:57.839+0200 I NETWORK  [ReplicaSetMonitor-TaskExecutor-0] All nodes for set sfx10 are down. This has happened for 1 checks in a row.
[...]

In that case, only a "kill -9" lets the member shut down.
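A hedged sketch of that sequence, assuming a standard clean-shutdown attempt from the mongo shell before falling back to the OS-level kill (the 60-second timeout is an arbitrary example):

// Attempted clean shutdown, run against the admin database. In the stuck
// state described above this also hangs, so SIGKILL remains the only option.
db.getSiblingDB("admin").shutdownServer({ timeoutSecs: 60 })
// If this never returns: kill -9 <mongod pid> at the OS level, as noted above.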

Our MongoDB cluster details:

  • Many shards + a config server replica set, each formed by 3 members (1 primary + 2 secondaries)
  • 2 mongos
  • Balancer is disabled
  • Package version 3.4.4, OS: Debian 8 Jessie
  • Servers: 6-core Xeon CPUs, 64 GB RAM, ~3 TB SSD, ext4 file system
  • ~ 40 collections in 1 DB
  • Many writes and reads

Thank you in advance for your help.

Best regards,
Slawomir



 Comments   
Comment by Mark Agarunov [ 07/Nov/17 ]

Hello slluk-sa,

Thank you for the additional information. After investigating the provided data, this appears to be due to a network issue, where slow communication between mongod nodes will keep locks from yielding and cause the behavior you're seeing. We've opened SERVER-31368 for additional logging to be added to better diagnose this issue in the future.

Thanks,
Mark
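
A hedged diagnostic sketch, assuming one wants to observe long-running operations holding locks and the node's replication state on the stalled secondary while the issue is in progress (the 60-second threshold is an arbitrary example):

// Long-running operations currently in progress (threshold is illustrative).
db.currentOp({ secs_running: { $gte: 60 } })
// Replica set status; each member's "syncingTo" field shows its sync source in 3.4.
rs.status()
// Aggregate lock acquire/wait counters since startup.
db.serverStatus().locks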

Comment by Slawomir Lukiewski [ 27/Jun/17 ]

Hi Thomas,

Thank you for your answer.
For each of the 3 members of the affected shard / replica set, I uploaded 2 files:

  • An archive of the diagnostic.data folder (it seems to contain data from June 20).
  • The mongod log file from June 18 to June 26.

You should look at the events during the night of June 23, on the member whose name ends with 24. That day we hit the problem twice (the second time with a failed mongod shutdown).
Chronology:

  • 02:29 : "oplog last event time" blocked (based on our monitoring)
  • 05:08 : mongod restarted (2017-06-23T05:08:15.647+0200 in the log), then it started to catch up with the primary
  • 06:00 : "oplog last event time" blocked again
  • 07:27 : mongod restart attempted (2017-06-23T07:27:08.761+0200 in the log); mongod did not stop, and the log shows many "No primary detected for..." errors
  • 07:33 : kill -9 of mongod, then start; the oplog caught up with the primary...

Best regards,
Slawomir

Comment by Kelsey Schubert [ 26/Jun/17 ]

Hi slluk-sa,

Thanks for reporting this issue. We'll need some additional information to better understand what is going on here. To help us investigate, would you please provide the complete mongod log files covering this issue as well as an archive of the diagnostic.data directory for the three nodes in the affected replica set?

I've created a secure upload portal for you to provide these files. Files uploaded to the portal are visible only to MongoDB employees investigating the issue and are routinely deleted after some time.

Kind regards,
Thomas
