Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-25773

Upgrade from 3.0.9 to 3.2.9 discards oplog and can't catch up

    XMLWordPrintable

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Blocker - P1
    • Resolution: Duplicate
    • Affects Version/s: 3.0.9, 3.2.9
    • Fix Version/s: None
    • Component/s: Admin, Replication, WiredTiger
    • Labels:
      None
    • Operating System:
      ALL

      Description

      Hi,

      We had indicate previously in SERVER-25255 that we re having trouble upgrading.

      Today we have tried upgrade again from 3.0.9 to 3.2.9 and left it so it can start up. It took about 4 hours to start :

      2016-08-24T09:55:07.425+0000 I NETWORK  [websvr] admin web console waiting for connections on port 28017
      2016-08-24T09:55:07.431+0000 I REPL     [initandlisten] Did not find local voted for document at startup;  NoMatchingDocument: Did not find replica set lastVote document in local.replset.election
      2016-08-24T13:58:39.180+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker
      

      Unfortunately for us looks like mongo decided to disregard all the date in the oplog and cannot sync as it's to stale.

      2016-08-24T14:48:43.085+0000 I REPL     [ReplicationExecutor] syncing from: db-node3.dlg.ie.aws.thefloow.net:27017
      2016-08-24T14:48:43.098+0000 W REPL     [rsBackgroundSync] we are too stale to use db-node3.dlg.ie.aws.thefloow.net:27017 as a sync source
      2016-08-24T14:48:43.098+0000 I REPL     [ReplicationExecutor] syncing from: db-node2.dlg.ie.aws.thefloow.net:27017
      2016-08-24T14:48:43.100+0000 I REPL     [ReplicationExecutor] could not find member to sync from
      2016-08-24T14:48:43.100+0000 E REPL     [rsBackgroundSync] too stale to catch up -- entering maintenance mode
      2016-08-24T14:48:43.100+0000 I REPL     [rsBackgroundSync] our last optime : (term: -1, timestamp: Aug 14 16:54:26:63)
      2016-08-24T14:48:43.100+0000 I REPL     [rsBackgroundSync] oldest available is (term: -1, timestamp: Aug 14 20:54:47:1bf)
      2016-08-24T14:48:43.100+0000 I REPL     [rsBackgroundSync] See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
      2016-08-24T14:48:43.100+0000 I REPL     [ReplicationExecutor] going into maintenance mode with 0 other maintenance mode tasks in progress
      

      To be perfectly clear this node was warmed up and in production without issue before we attempted this upgrade. Oplog db size was very big as well :

      db.getReplicationInfo()
      {
        "logSizeMB": 614400,
        "usedMB": 117.9,
        "timeDiff": 0,
        "timeDiffHours": 0,
        "tFirst": "Sun Aug 14 2016 16:54:26 GMT+0000 (UTC)",
        "tLast": "Sun Aug 14 2016 16:54:26 GMT+0000 (UTC)",
        "now": "Wed Aug 24 2016 14:49:15 GMT+0000 (UTC)"
      }
      

      As you can see Mongo decided somehow that oplog have to be cleared. In comparison see below same info from other replica member :

      floow:PRIMARY> db.getReplicationInfo()
      {
      	"logSizeMB" : 614400,
      	"usedMB" : 614401.13,
      	"timeDiff" : 838673,
      	"timeDiffHours" : 232.96,
      	"tFirst" : "Sun Aug 14 2016 21:54:51 GMT+0000 (UTC)",
      	"tLast" : "Wed Aug 24 2016 14:52:44 GMT+0000 (UTC)",
      	"now" : "Wed Aug 24 2016 14:52:44 GMT+0000 (UTC)"
      }
      

      I can provide log from that period but there is nothing indicating any unusual behaviour, no errors.

      2016-08-24T13:58:48.827+0000 I REPL     [ReplicationExecutor] syncing from: db-node2.dlg.ie.aws.thefloow.net:27017
      2016-08-24T13:58:48.829+0000 W REPL     [rsBackgroundSync] we are too stale to use db-node2.dlg.ie.aws.thefloow.net:27017 as a sync source
      2016-08-24T13:58:48.829+0000 I REPL     [ReplicationExecutor] syncing from: db-node3.dlg.ie.aws.thefloow.net:27017
      2016-08-24T13:58:48.852+0000 W REPL     [rsBackgroundSync] we are too stale to use db-node3.dlg.ie.aws.thefloow.net:27017 as a sync source
      2016-08-24T13:58:48.852+0000 I REPL     [ReplicationExecutor] could not find member to sync from
      2016-08-24T13:58:48.852+0000 E REPL     [rsBackgroundSync] too stale to catch up -- entering maintenance mode
      2016-08-24T13:58:48.852+0000 I REPL     [rsBackgroundSync] our last optime : (term: -1, timestamp: Aug 14 16:54:26:63)
      2016-08-24T13:58:48.852+0000 I REPL     [rsBackgroundSync] oldest available is (term: -1, timestamp: Aug 14 20:10:58:f)
      

      In current state we cannot reliably upgrade our database to 3.2.x as this results in de-sync.

        Attachments

          Issue Links

            Activity

              People

              • Votes:
                1 Vote for this issue
                Watchers:
                9 Start watching this issue

                Dates

                • Created:
                  Updated:
                  Resolved: