[SERVER-25773] Upgrade from 3.0.9 to 3.2.9 discards oplog and can't catch up Created: 24/Aug/16 Updated: 27/Aug/16 Resolved: 26/Aug/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Admin, Replication, WiredTiger |
| Affects Version/s: | 3.0.9, 3.2.9 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Bartosz Debski | Assignee: | Scott Hernandez (Inactive) |
| Resolution: | Duplicate | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
| Operating System: | ALL |
| Participants: |
| Description |
|
Hi, we reported this problem previously. Today we tried the upgrade again from 3.0.9 to 3.2.9 and left the node alone so it could start up. It took about 4 hours to start:
Unfortunately for us, it looks like mongo decided to disregard all the data in the oplog, and the node cannot sync because it is too stale.
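For readers less familiar with what "too stale" means here: a secondary can only catch up if the last operation it applied is still present in the oplog of the member it syncs from. The following is a minimal sketch of that check in plain Python, not MongoDB code. The function names and all timestamps are hypothetical and chosen only for illustration; they are not taken from this node's logs.

```python
from datetime import datetime, timedelta


def oplog_window(first_entry: datetime, last_entry: datetime) -> timedelta:
    """Time span covered by an oplog, from its oldest to its newest entry."""
    return last_entry - first_entry


def is_too_stale(member_last_applied: datetime, source_oplog_first: datetime) -> bool:
    """A member is too stale when its last applied operation is older than
    the oldest entry still held in the sync source's oplog."""
    return member_last_applied < source_oplog_first


if __name__ == "__main__":
    # Hypothetical values: the sync source keeps 2 hours of oplog,
    # while the restarted member last applied an operation 4 hours ago.
    now = datetime(2016, 1, 1, 12, 0, 0)
    source_first = now - timedelta(hours=2)
    member_last = now - timedelta(hours=4)

    print("source window:", oplog_window(source_first, now))
    print("too stale:", is_too_stale(member_last, source_first))
```

If a member's own oplog is cut back to a single old entry, its last applied operation becomes that old entry, so this comparison fails against any sync source whose oplog no longer reaches back that far.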
To be perfectly clear, this node was warmed up and in production without issue before we attempted this upgrade. The oplog size was very big as well:
As you can see, mongo somehow decided that the oplog had to be cleared. For comparison, see below the same info from another replica set member:
I can provide logs from that period, but there is nothing indicating any unusual behaviour and no errors.
In the current state we cannot reliably upgrade our database to 3.2.x, as this results in a de-sync. |
| Comments |
| Comment by Paul Ridgway [ 27/Aug/16 ] | |||
|
Thanks, what result would we expect to see in a good or bad situation? | |||
| Comment by Scott Hernandez (Inactive) [ 26/Aug/16 ] | |||
|
Yeah, you can run this query and we can take a look at the output.
I'm going to close this now that we have a culprit. Feel free to add a comment or file a new issue if something else comes up. | |||
| Comment by Paul Ridgway [ 26/Aug/16 ] | |||
|
Looks like we should (could) try a node not affected by a prior upgrade attempt. Is there any way to identify such a bug (failed/reverted upgrade)? | |||
| Comment by Scott Hernandez (Inactive) [ 26/Aug/16 ] | |||
|
It looks like you were affected by a bug during your first downgrade (from 3.2.x < 3.2.9), which resulted in a marker being left behind; that marker caused the oplog to be truncated to the first entry after the second upgrade. If we had a snapshot of the data between those events we could confirm that, but everything points to it. The logs show this:
| |||
| Comment by Bartosz Debski [ 25/Aug/16 ] | |||
|
The last snapshot I have for this particular node is from 6th of July. I have uploaded mongodb.log-20160825.gz via secure upload. This file also contains a post-upgrade startup logged with verbosity 5, which might help a bit. | |||
| Comment by Scott Hernandez (Inactive) [ 25/Aug/16 ] | |||
|
The logs from before the shutdown and upgrade may be useful (starting before 2016-08-24T09:55:07); can you upload them please? Do you take file-system backups by any chance? If so, would you have a copy of the files before the upgrade? | |||
| Comment by Bartosz Debski [ 25/Aug/16 ] | |||
|
1 & 2. Prior to this, the last time this node was down was on 17th May, for a storage upgrade/resize; again, there were no complications after the node came back online. 3. As for logs, unfortunately the oldest log we can provide is from the 18th. If that's any good, we can provide logs from the 18th until the 24th. | |||
| Comment by Scott Hernandez (Inactive) [ 24/Aug/16 ] | |||
|
Can you provide a few things to help us understand what could have happened:
| |||
| Comment by Bartosz Debski [ 24/Aug/16 ] | |||
|
Hmm, I thought editing the description would be enabled so I could make this issue easier to read, but no, so apologies for a bit of chaos. |