[SERVER-23027] Unrecoverable replication delay and server crash Created: 09/Mar/16  Updated: 06/May/16  Resolved: 06/May/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.3
Fix Version/s: None

Type: Question Priority: Major - P3
Reporter: Varun Vijayaraghavan Assignee: Kelsey Schubert
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File replication-delay-drop.png     Text File server-status-slow.txt    
Participants:

 Description   

We've been having a set of intermittent issues while performing some upgrades to our cluster. I have not been able to reproduce it reliably, so I'm not filing it as a bug yet.

We're a) converting a standalone mongo instance into a replica set in phases, b) upgrading to bigger AWS instances with higher disk IOPS, and c) using mongo 3.2.3 for the new instances (the initial standalone instance is at 3.0.8).

There are 5 instances in total: the old primary, 3 new secondaries, and 1 arbiter. They are all running on WiredTiger.

There are some properties of the cluster that are worth noting.

  • There is no compression being used.
  • There are about 1100 collections in one of the databases.
  • The old primary has a higher priority than the others, to try and ensure it remains primary until all the clients have been phased over.
  • The oplog on the PRIMARY instance is configured to be 40GB - increased ~10x from the initial value (a sketch for checking the oplog window follows this list).
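For reference, the configured oplog size and the time window it currently covers can be checked from the shell on each member; a minimal sketch, assuming default ports and shell access on the node itself:

  # Prints the oplog size plus the timestamps of the first and last oplog entries,
  # i.e. the replication window a secondary must stay within
  mongo --port 27017 --eval "rs.printReplicationInfo()"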

We're noticing that SECONDARIES are getting into a state of increasing replication delay [see the server-status-slow attachment for logs from around this time]. After several hours of growing replication delay, one of the secondaries simply crashed. Around this time we were performing fairly heavy writes on the PRIMARY: disk reads on the primary, as reported by AWS, were around 1000 IOPS (with a max of 1500 IOPS), and writes were around 500 IOPS.
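For reference, the lag itself can be measured from the mongo shell; a minimal sketch, assuming the 3.2 shell (where the helper is still named printSlaveReplicationInfo):

  # Run against any member; prints how far each secondary is behind the primary's latest oplog entry
  mongo --port 27017 --eval "rs.printSlaveReplicationInfo()"

  # Or compare optimes directly from the replica set status
  mongo --port 27017 --eval "printjson(rs.status().members.map(function(m) { return { name: m.name, state: m.stateStr, optime: m.optimeDate }; }))"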

In one case (ip-10-0-0-233), we "fixed" the replication delay by restarting the server; the delay immediately dropped to 0 [see the replication-delay-drop attachment].

On another secondary, restarting did not fix the replication delay; it was unable to find a member from which it could safely replicate. The log messages included:
"2016-03-09T05:29:31.103+0000 W REPL [rsBackgroundSync] we are too stale to use ip-10-0-0-233:27017 as a sync source
2016-03-09T05:29:31.103+0000 E REPL [rsBackgroundSync] too stale to catch up -- entering maintenance mode"
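"Too stale" here means the secondary's last applied oplog entry is older than the oldest entry available on any candidate sync source. A rough way to confirm that, as a sketch assuming shell access to both nodes (the stale-secondary host name below is a placeholder):

  # Oldest and newest oplog entries on the candidate sync source
  mongo --host ip-10-0-0-233 --port 27017 --eval "rs.printReplicationInfo()"

  # Last applied optime on the stale member (rs.status() still works while it is in RECOVERING);
  # if this is older than the sync source's oldest oplog entry, the member cannot catch up
  mongo --host STALE_SECONDARY_HOST --port 27017 --eval "printjson(rs.status().members.filter(function(m) { return m.self; })[0].optimeDate)"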

We were never able to recover the crashed secondary. Every restart of the server resulted in it crashing again with a message that looked like this:
"Assertion: 10334:BSONObj size: 17646640 (0x10D4430) is invalid. Size must be between 0 and 16793600(16MB) First element: id: 301015268469"

This is impeding important operational tasks we need to do, so we'd really like some insight as to what could have caused this.

Let me know if there is any other information I can provide that would be useful.

Unfortunately, I don't have the logs for the crashed mongo instance, but I can attach logs for the other instance. That said, the same issue happened a few days ago on another instance, and if necessary I might be able to dig those logs up.



 Comments   
Comment by Kelsey Schubert [ 06/May/16 ]

Hi varun@x.ai,

We haven't heard back from you for some time, so I'm going to mark this ticket as resolved. If this is still an issue for you, please provide additional information and we will reopen the ticket.

Regards,
Thomas

Comment by Kelsey Schubert [ 11/Apr/16 ]

Hi varun@x.ai,

We still need additional information to diagnose the problem. If this is still an issue for you, can you please upload the diagnostic.data and logs for the affected nodes?

Thank you,
Thomas

Comment by Kelsey Schubert [ 18/Mar/16 ]

Hi varun@x.ai,

Sorry for the delay getting back to you. The behavior that you are describing is actually the result of two different issues.

First, let us discuss the secondary which has the error message:

"Assertion: 10334:BSONObj size: 17646640 (0x10D4430) is invalid. Size must be between 0 and 16793600(16MB) First element: id: 301015268469"

This error message indicates that a document on the secondary has suffered disk corruption. Determining the exact cause of this corruption is generally not worthwhile. However, if data corruption issues persist, I would recommend a thorough integrity check of the affected node's disk drives.

To address this issue, please execute a clean resync on the affected node.
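A minimal sketch of that resync (the service name and dbPath below are assumptions; substitute the values from this node's configuration, and keep a copy of the data elsewhere if you want the option of inspecting the corruption later):

  # Stop the affected secondary
  sudo systemctl stop mongod

  # Remove the existing data files so the node performs a fresh initial sync on startup
  # (path is an assumption; use the dbPath from this node's mongod.conf)
  sudo rm -rf /var/lib/mongodb/*

  # Restart; the member rejoins the set and copies all data from a sync source
  sudo systemctl start mongod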

Second, I would like to discuss the replication delay that grows until the server is restarted. To continue investigating this issue, please answer the following questions:

  1. Can you please upload the contents of diagnostic.data directory as well as the logs for the affected node when it is experiencing this issue?
  2. Which replication protocol are you using?
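For reference, the configured protocol version can be read from the replica set configuration; a one-liner, assuming mongo shell access to the primary:

  # protocolVersion is 1 for the newer election protocol (pv1); the field is absent (or 0) for the legacy pv0 protocol
  mongo --port 27017 --eval "printjson(rs.conf().protocolVersion)"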

Please also consider upgrading to 3.2.4, which contains fixes including SERVER-22276. These fixes may improve the behavior you are observing.

Kind regards,
Thomas
