Loading...

XML

Word

Printable

JSON

Type: Bug
Resolution: Duplicate
Priority: Blocker - P1
Fix Version/s: None
Affects Version/s: 3.0.9, 3.2.9
Component/s: Admin, Replication, WiredTiger
Labels:
None

Operating System:
ALL
Confidence Status:
None
Work Order:
0
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Hi,

We had indicate previously in ~~SERVER-25255~~ that we re having trouble upgrading.

Today we have tried upgrade again from 3.0.9 to 3.2.9 and left it so it can start up. It took about 4 hours to start :

2016-08-24T09:55:07.425+0000 I NETWORK  [websvr] admin web console waiting for connections on port 28017
2016-08-24T09:55:07.431+0000 I REPL     [initandlisten] Did not find local voted for document at startup;  NoMatchingDocument: Did not find replica set lastVote document in local.replset.election
2016-08-24T13:58:39.180+0000 I NETWORK  [HostnameCanonicalizationWorker] Starting hostname canonicalization worker

Unfortunately for us looks like mongo decided to disregard all the date in the oplog and cannot sync as it's to stale.

2016-08-24T14:48:43.085+0000 I REPL     [ReplicationExecutor] syncing from: db-node3.dlg.ie.aws.thefloow.net:27017
2016-08-24T14:48:43.098+0000 W REPL     [rsBackgroundSync] we are too stale to use db-node3.dlg.ie.aws.thefloow.net:27017 as a sync source
2016-08-24T14:48:43.098+0000 I REPL     [ReplicationExecutor] syncing from: db-node2.dlg.ie.aws.thefloow.net:27017
2016-08-24T14:48:43.100+0000 I REPL     [ReplicationExecutor] could not find member to sync from
2016-08-24T14:48:43.100+0000 E REPL     [rsBackgroundSync] too stale to catch up -- entering maintenance mode
2016-08-24T14:48:43.100+0000 I REPL     [rsBackgroundSync] our last optime : (term: -1, timestamp: Aug 14 16:54:26:63)
2016-08-24T14:48:43.100+0000 I REPL     [rsBackgroundSync] oldest available is (term: -1, timestamp: Aug 14 20:54:47:1bf)
2016-08-24T14:48:43.100+0000 I REPL     [rsBackgroundSync] See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember
2016-08-24T14:48:43.100+0000 I REPL     [ReplicationExecutor] going into maintenance mode with 0 other maintenance mode tasks in progress

To be perfectly clear this node was warmed up and in production without issue before we attempted this upgrade. Oplog db size was very big as well :

db.getReplicationInfo()
{
  "logSizeMB": 614400,
  "usedMB": 117.9,
  "timeDiff": 0,
  "timeDiffHours": 0,
  "tFirst": "Sun Aug 14 2016 16:54:26 GMT+0000 (UTC)",
  "tLast": "Sun Aug 14 2016 16:54:26 GMT+0000 (UTC)",
  "now": "Wed Aug 24 2016 14:49:15 GMT+0000 (UTC)"
}

As you can see Mongo decided somehow that oplog have to be cleared. In comparison see below same info from other replica member :

floow:PRIMARY> db.getReplicationInfo()
{
	"logSizeMB" : 614400,
	"usedMB" : 614401.13,
	"timeDiff" : 838673,
	"timeDiffHours" : 232.96,
	"tFirst" : "Sun Aug 14 2016 21:54:51 GMT+0000 (UTC)",
	"tLast" : "Wed Aug 24 2016 14:52:44 GMT+0000 (UTC)",
	"now" : "Wed Aug 24 2016 14:52:44 GMT+0000 (UTC)"
}

I can provide log from that period but there is nothing indicating any unusual behaviour, no errors.

2016-08-24T13:58:48.827+0000 I REPL     [ReplicationExecutor] syncing from: db-node2.dlg.ie.aws.thefloow.net:27017
2016-08-24T13:58:48.829+0000 W REPL     [rsBackgroundSync] we are too stale to use db-node2.dlg.ie.aws.thefloow.net:27017 as a sync source
2016-08-24T13:58:48.829+0000 I REPL     [ReplicationExecutor] syncing from: db-node3.dlg.ie.aws.thefloow.net:27017
2016-08-24T13:58:48.852+0000 W REPL     [rsBackgroundSync] we are too stale to use db-node3.dlg.ie.aws.thefloow.net:27017 as a sync source
2016-08-24T13:58:48.852+0000 I REPL     [ReplicationExecutor] could not find member to sync from
2016-08-24T13:58:48.852+0000 E REPL     [rsBackgroundSync] too stale to catch up -- entering maintenance mode
2016-08-24T13:58:48.852+0000 I REPL     [rsBackgroundSync] our last optime : (term: -1, timestamp: Aug 14 16:54:26:63)
2016-08-24T13:58:48.852+0000 I REPL     [rsBackgroundSync] oldest available is (term: -1, timestamp: Aug 14 20:10:58:f)

In current state we cannot reliably upgrade our database to 3.2.x as this results in de-sync.

duplicates

SERVER-25353 Clean shutdown should leave secondaries without the batch-in-progress flag

Closed

is related to

SERVER-25255 After upgrading to 3.2.8 (from 3.0.9) startup appears to hang with lots of disk reads to the local collection.

Closed

Assignee:: Scott Hernandez (Inactive)
Reporter:: Bartosz Debski
Participants:: Bartosz Debski, Paul Ridgway, Scott Hernandez
Votes:: 1 Vote for this issue
Watchers:: 9 Start watching this issue

Created:: Aug 24 2016 03:04:15 PM UTC
Updated:: Aug 27 2016 09:14:17 AM UTC
Resolved:: Aug 26 2016 10:06:01 PM UTC

Details

Description

Attachments

Issue Links

Forms

Activity

People

Dates