[SERVER-25773] Upgrade from 3.0.9 to 3.2.9 discards oplog and can't catch up Created: 24/Aug/16  Updated: 27/Aug/16  Resolved: 26/Aug/16

Status: Closed
Project: Core Server
Component/s: Admin, Replication, WiredTiger
Affects Version/s: 3.0.9, 3.2.9
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Bartosz Debski Assignee: Scott Hernandez (Inactive)
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-25353 Clean shutdown should leave secondari... Closed
Related
is related to SERVER-25255 After upgrading to 3.2.8 (from 3.0.9)... Closed
Operating System: ALL
Participants:

 Description   

Hi,

We indicated previously in SERVER-25255 that we are having trouble upgrading.

Today we tried the upgrade from 3.0.9 to 3.2.9 again and left the node alone so it could start up. It took about 4 hours to start:

2016-08-24T09:55:07.425+0000 I NETWORK  [websvr] admin web console waiting for connections on port 28017
2016-08-24T09:55:07.431+0000 I REPL     [initandlisten] Did not find local voted for document at startup;  NoMatchingDocument: Did not find replica set lastVote document in local.replset.election
2016-08-24T13:58:39.180+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker

Unfortunately, it looks like mongod decided to disregard all the data in the oplog and now cannot sync because it is too stale:

2016-08-24T14:48:43.085+0000 I REPL     [ReplicationExecutor] syncing from: db-node3.dlg.ie.aws.thefloow.net:27017
2016-08-24T14:48:43.098+0000 W REPL     [rsBackgroundSync] we are too stale to use db-node3.dlg.ie.aws.thefloow.net:27017 as a sync source
2016-08-24T14:48:43.098+0000 I REPL     [ReplicationExecutor] syncing from: db-node2.dlg.ie.aws.thefloow.net:27017
2016-08-24T14:48:43.100+0000 I REPL     [ReplicationExecutor] could not find member to sync from
2016-08-24T14:48:43.100+0000 E REPL     [rsBackgroundSync] too stale to catch up -- entering maintenance mode
2016-08-24T14:48:43.100+0000 I REPL     [rsBackgroundSync] our last optime : (term: -1, timestamp: Aug 14 16:54:26:63)
2016-08-24T14:48:43.100+0000 I REPL     [rsBackgroundSync] oldest available is (term: -1, timestamp: Aug 14 20:54:47:1bf)
2016-08-24T14:48:43.100+0000 I REPL     [rsBackgroundSync] See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2016-08-24T14:48:43.100+0000 I REPL     [ReplicationExecutor] going into maintenance mode with 0 other maintenance mode tasks in progress
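
For reference, the documented path out of this state is a full resync (the dochub link above). Before doing that, a minimal mongo shell sketch, assuming it is run while connected to the affected member, to confirm it is sitting in maintenance mode (reported as RECOVERING) and to see the oplog window each member reports:

// Sketch only: member names/states as seen from the affected node.
// A member in maintenance mode reports the RECOVERING state.
rs.status().members.forEach(function (m) {
    print(m.name + " : " + m.stateStr);
});
// On each member, the retained oplog window:
printjson(db.getReplicationInfo());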

To be perfectly clear, this node was warmed up and in production without issue before we attempted this upgrade. The configured oplog size is also very large:

db.getReplicationInfo()
{
  "logSizeMB": 614400,
  "usedMB": 117.9,
  "timeDiff": 0,
  "timeDiffHours": 0,
  "tFirst": "Sun Aug 14 2016 16:54:26 GMT+0000 (UTC)",
  "tLast": "Sun Aug 14 2016 16:54:26 GMT+0000 (UTC)",
  "now": "Wed Aug 24 2016 14:49:15 GMT+0000 (UTC)"
}

As you can see, MongoDB somehow decided that the oplog had to be cleared; tFirst and tLast are identical, so the retained window has collapsed to a single second. For comparison, here is the same information from another replica set member:

floow:PRIMARY> db.getReplicationInfo()
{
	"logSizeMB" : 614400,
	"usedMB" : 614401.13,
	"timeDiff" : 838673,
	"timeDiffHours" : 232.96,
	"tFirst" : "Sun Aug 14 2016 21:54:51 GMT+0000 (UTC)",
	"tLast" : "Wed Aug 24 2016 14:52:44 GMT+0000 (UTC)",
	"now" : "Wed Aug 24 2016 14:52:44 GMT+0000 (UTC)"
}
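
The timeDiff / timeDiffHours figures above are simply the span between the first and last entries in local.oplog.rs. A minimal mongo shell sketch, assuming it is run directly on each member, that derives the same window from the oplog itself; on the affected node both timestamps fall in the same second, matching tFirst == tLast above:

// Sketch only: read the oldest and newest oplog entries and print the span.
// Timestamp objects expose their seconds component as .t in the shell.
var oplog = db.getSiblingDB("local").oplog.rs;
var first = oplog.find().sort({$natural: 1}).limit(1).next().ts;
var last = oplog.find().sort({$natural: -1}).limit(1).next().ts;
print("oplog window (hours): " + ((last.t - first.t) / 3600));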

I can provide the log from that period, but there is nothing in it indicating any unusual behaviour, and no errors.

2016-08-24T13:58:48.827+0000 I REPL     [ReplicationExecutor] syncing from: db-node2.dlg.ie.aws.thefloow.net:27017
2016-08-24T13:58:48.829+0000 W REPL     [rsBackgroundSync] we are too stale to use db-node2.dlg.ie.aws.thefloow.net:27017 as a sync source
2016-08-24T13:58:48.829+0000 I REPL     [ReplicationExecutor] syncing from: db-node3.dlg.ie.aws.thefloow.net:27017
2016-08-24T13:58:48.852+0000 W REPL     [rsBackgroundSync] we are too stale to use db-node3.dlg.ie.aws.thefloow.net:27017 as a sync source
2016-08-24T13:58:48.852+0000 I REPL     [ReplicationExecutor] could not find member to sync from
2016-08-24T13:58:48.852+0000 E REPL     [rsBackgroundSync] too stale to catch up -- entering maintenance mode
2016-08-24T13:58:48.852+0000 I REPL     [rsBackgroundSync] our last optime : (term: -1, timestamp: Aug 14 16:54:26:63)
2016-08-24T13:58:48.852+0000 I REPL     [rsBackgroundSync] oldest available is (term: -1, timestamp: Aug 14 20:10:58:f)

In the current state we cannot reliably upgrade our database to 3.2.x, as the upgrade results in the node falling out of sync.



 Comments   
Comment by Paul Ridgway [ 27/Aug/16 ]

Thanks, what result would we expect to see in a good or bad situation?

Comment by Scott Hernandez (Inactive) [ 26/Aug/16 ]

Yeah, you can run this query and we can take a look at the output.

printjson(db.getSiblingDB("local").replset.minvalid.find().toArray());

I'm going to close this now that we have a culprit. Feel free to add a comment or file a new issue if something else comes up.
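
If it is easier to capture the result for upload, the same query can be run non-interactively from the command line (the host below is a placeholder, not taken from this ticket):

mongo --quiet --host <affected-node> --port 27017 --eval 'printjson(db.getSiblingDB("local").replset.minvalid.find().toArray())' > minvalid.txt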

Comment by Paul Ridgway [ 26/Aug/16 ]

Looks like we should (or could) try a node not affected by a prior upgrade attempt - is there any way to identify whether a node was hit by this bug (i.e. had a failed/reverted upgrade)?

Comment by Scott Hernandez (Inactive) [ 26/Aug/16 ]

It looks like you were affected by a bug during your first downgrade (from a 3.2.x version earlier than 3.2.9), which resulted in a marker being left behind that caused the oplog to be truncated to the first entry after the second upgrade. If we had a snapshot of the data between those events we could confirm that, but everything points to it: SERVER-25353

The logs show this:

2016-08-24T09:54:12.560+0000 I STORAGE  [initandlisten] Sampling from the oplog between Aug 14 16:54:26:63 and Aug 24 09:49:24:be to determine where to place markers for truncation
...
2016-08-24T14:48:43.100+0000 I REPL     [rsBackgroundSync] our last optime : (term: -1, timestamp: Aug 14 16:54:26:63)
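
A hedged way to double-check this on the affected member (a sketch, assuming direct access to its local database): the oldest surviving oplog document should carry the same Aug 14 16:54:26 timestamp reported as "our last optime" above.

// Sketch only: count the surviving oplog entries and show the oldest one;
// its ts field should match the "our last optime" line in the log.
var oplog = db.getSiblingDB("local").oplog.rs;
print("oplog entries: " + oplog.count());
printjson(oplog.find().sort({$natural: 1}).limit(1).toArray());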

Comment by Bartosz Debski [ 25/Aug/16 ]

The last snapshot I have for this particular node is from the 6th of July.

I have uploaded mongodb.log-20160825.gz via secure upload. The file also includes a post-upgrade startup logged at verbosity 5, which might help a bit.

Comment by Scott Hernandez (Inactive) [ 25/Aug/16 ]

The logs from before the shutdown and upgrade may be useful (starting before 2016-08-24T09:55:07); can you upload them please?

Do you take file-system backups by any chance? If so, would you have a copy of the files before the upgrade?

Comment by Bartosz Debski [ 25/Aug/16 ]

1 & 2
There was no maintenance on the node immediately before or after the 14th. The last maintenance was on 25 July, when we tried the upgrade for the first time and then raised SERVER-25255. On the 25th we killed the db while it was trying to start (the same procedure that took 4 hours this time; we did not wait that long then), then downgraded back to 3.0.9, and the node came back without problems and caught up with replication. Since then there have been no problems and no maintenance jobs until 24th Aug, when we started the upgrade again.

Prior to this, the last time this node was down was on 17th May for a storage upgrade/resize; again there were no complications after the node came back online.
The node in question also has priority 0, to make sure it will not get elected as primary in case we run into any post-upgrade performance issues.

3. As for logs, unfortunately the oldest log we can provide is from the 18th. If that is any good, we can provide logs from the 18th through the 24th.

Comment by Scott Hernandez (Inactive) [ 24/Aug/16 ]

Can you provide a few things to help us understand what could have happened:

  1. Did anything happen to that host on Aug 14th?
  2. A timeline of upgrades/downgrades for this node, or any unclean shutdowns, over the last few months.
  3. If possible, logs since Aug 14th (or, ideally, the day before).

Comment by Bartosz Debski [ 24/Aug/16 ]

Hmm, I thought editing the description would be enabled so I could make this issue easier to read, but it is not, so apologies for a bit of chaos.
