  Core Server / SERVER-23027

Unrecoverable replication delay and server crash

    • Type: Question
    • Resolution: Incomplete
    • Priority: Major - P3
    • Affects Version/s: 3.2.3
    • Component/s: Replication

      We've been having a set of intermittent issues while performing some upgrades to our cluster. I have not reproduced it, so I am not filing this as a bug yet.

      We're a) converting a standalone mongo instance into a replica set in phases, b) upgrading to bigger AWS instances with higher disk IOPS, and c) using mongo 3.2.3 for the new instances (the initial standalone instance is at 3.0.8).
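      For reference, the conversion follows the standard replica-set setup sequence in the mongo shell; roughly this (hostnames below are placeholders, not our real ones):

          // On the existing 3.0.8 standalone, after restarting mongod with --replSet rs0:
          rs.initiate()                      // the old standalone becomes the initial PRIMARY

          // Then add the new 3.2.3 members and the arbiter, one at a time:
          rs.add("secondary-1.example.net:27017")
          rs.add("secondary-2.example.net:27017")
          rs.add("secondary-3.example.net:27017")
          rs.addArb("arbiter-1.example.net:27017")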

      There are 5 instances in total which include the old primary, 3 new secondaries and 1 arbiter. They are all running on WiredTiger.

      There are some properties of the cluster that are worth noting.

      • There is no compression being used.
      • There are about 1100 collections in one of the databases.
      • The old primary has a higher priority than the others, in order to try to ensure it remains the primary until all the clients are phased over.
      • The oplog on the PRIMARY instance is configured to be 40GB - increased ~10x from the initial value. (See the configuration sketch after this list.)
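      A minimal sketch of the relevant configuration, assuming the old primary is members[0] in the replica set config (the priority value and member index are illustrative):

          // Raise the old primary's priority so it keeps winning elections:
          cfg = rs.conf()
          cfg.members[0].priority = 10       // illustrative value
          rs.reconfig(cfg)

          // On the PRIMARY: show the configured oplog size and the time window it covers.
          db.printReplicationInfo()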

      We're noticing that the SECONDARIES are getting into a state of increasing replication delay (see the server-status-slow file for logs around this time). After several hours of replication delay, one of the secondaries simply crashed. Around this time, we were performing fairly heavy writes on the PRIMARY. The disk read IOPS on the primary, as reported by AWS, was 1000 IOPS, with the max being 1500 IOPS, and writes were at ~500 IOPS.
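      In case it helps, this is roughly how we're watching the lag from the mongo shell (the rs.status()-based calculation is approximate and assumes member clocks are sane):

          // Per-member lag as reported by the replica set:
          rs.printSlaveReplicationInfo()

          // Or compute it from rs.status():
          var s = rs.status();
          var primary = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
          s.members.forEach(function (m) {
              if (m.stateStr === "SECONDARY") {
                  print(m.name + " lag (s): " + (primary.optimeDate - m.optimeDate) / 1000);
              }
          });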

      In one case (ip-10-0-0-233), we "fixed" the replication delay by restarting the server. The replication delay immediately dropped to 0 (see the replication-delay-drop image).

      On another secondary, restarting did not fix the replication delay; it was not able to find a server from which it could replicate safely. The log messages contained:
      "2016-03-09T05:29:31.103+0000 W REPL [rsBackgroundSync] we are too stale to use ip-10-0-0-233:27017 as a sync source
      2016-03-09T05:29:31.103+0000 E REPL [rsBackgroundSync] too stale to catch up -- entering maintenance mode"
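      Our understanding (please correct this if it's wrong) is that once every reachable sync source has already rolled its oplog past the secondary's last applied op, the member can no longer catch up and needs a full resync. A rough way to confirm that from the shell:

          // On the stale secondary: its newest applied oplog entry.
          db.getSiblingDB("local").oplog.rs.find().sort({ $natural: -1 }).limit(1)

          // On each candidate sync source: the oldest entry still in its oplog.
          db.getSiblingDB("local").oplog.rs.find().sort({ $natural: 1 }).limit(1)

          // If the secondary's newest entry is older than every source's oldest entry,
          // the standard recovery is to stop mongod, empty the dbpath, and restart it
          // so the member performs a fresh initial sync.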

      We were never able to recover the crashed secondary. Every restart of the server resulted in it crashing again with a message that looked like this:
      "Assertion: 10334:BSONObj size: 17646640 (0x10D4430) is invalid. Size must be between 0 and 16793600(16MB) First element: id: 301015268469"

      This is impeding important operational tasks we need to do, so we'd really like some insight as to what could have caused this.

      Let me know if there is any other information I can provide that would be useful.

      Unfortunately, I don't have the logs for the crashed mongo instance. I can attach logs for the other instance. That said, the same issue happened a few days ago on another instance, and if necessary I might be able to dig those logs up.

        1. server-status-slow.txt (6 kB)
        2. replication-delay-drop.png (15 kB)

            Assignee: Kelsey Schubert (kelsey.schubert@mongodb.com)
            Reporter: Varun Vijayaraghavan (varun@x.ai)
            Votes: 0
            Watchers: 6