[SERVER-23045] Auto Sync of a failed node Failed Created: 10/Mar/16 Updated: 11/Mar/16 Resolved: 11/Mar/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.2.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | James Mangold | Assignee: | Scott Hernandez (Inactive) |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Steps To Reproduce: | For what I see, |
| Participants: |
| Description |
|
Hi, We had a node of a 3 node replica set fail with a bad disk and we had to stop the node. After 2 days or so, we got the node back online. We no longer had the dataDir, so we had to resync automatically (which we have done before). It took 18 hours or so to resync. Then it appears to have started reading from the oplog, and the member became too stale to recover. So basically, we resynced for 18 hours (~500 GB) and then it died when reading the oplog, because the oplog contained less than 18 hours of data, it seems. I have no idea why this would be the case. Maybe someone can shed some light on this. We could never size the oplog to hold everything we would need to resync (all the new data loaded since the node went down). What if it was down for 2 weeks? This wasn't even a lot of data... What if it took a week to resync 5 TB? Not sure how this works. I am going to try to attach the logs. |
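For context on what "too stale" means here: the sync source's oplog has to reach back at least as far as the moment the initial sync started. A minimal mongo-shell sketch of checking the source's oplog window before starting a resync (the command is standard; the sample output shape and numbers are illustrative, not taken from this ticket):

```
// Run against the intended sync source before kicking off the resync.
// If "log length start to end" is shorter than the expected initial sync time,
// the resyncing member is likely to fall off the end of the oplog.
db.printReplicationInfo()
// Illustrative output shape (values are hypothetical):
//   configured oplog size:   204800MB
//   log length start to end: 43200secs (12hrs)
//   oplog first event time:  ...
//   oplog last event time:   ...
```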
| Comments |
| Comment by Scott Hernandez (Inactive) [ 11/Mar/16 ] |
|
Yes, there is work planned to improve this. Today there is too much variability based on throughput and possible fluctuations. I'm going to close this now since it is neither a bug nor a feature request. |
| Comment by James Mangold [ 11/Mar/16 ] |
|
Yes, but that is also a moving target, no? Let's say I have 5 TB to sync. If it took 18 hours to sync 500 GB, that tells me I need an oplog holding 180 hours of data. What if it was more? What if there was a lot of network traffic, or any other variable that would cause the replication to take longer than 180 hours (which is crazy to begin with)? |
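The extrapolation James is doing here, written out as a small shell calculation (the throughput figures come from this ticket; the assumption that sync speed stays constant is, of course, the weak point he is pointing at):

```
// Extrapolating the observed initial sync throughput to a larger data set.
var observedGB    = 500;    // data copied in the failed resync
var observedHours = 18;     // how long that copy took
var targetGB      = 5000;   // hypothetical 5 TB data set (decimal TB)
var hoursNeeded   = (targetGB / observedGB) * observedHours;
print("oplog window needed: ~" + hoursNeeded + " hours");   // ~180 hours
```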
| Comment by Scott Hernandez (Inactive) [ 11/Mar/16 ] |
|
James, Maybe it wasn't clear, but one only needs an oplog large enough to span the time of the initial sync process, not the time the node was "down/unavailable". The general rule of thumb is to have an oplog large enough to cover any required/unexpected maintenance or the initial sync process. For example, if initial sync takes 24 hours, then an oplog spanning 36-48 hours would give enough headroom for spikes and such. If you process 1 TB of inserts in a day and the initial sync process takes a day, then you need an oplog greater than 1 TB. There are definitely cases where you might be running at/near capacity in terms of throughput, where you could not complete an initial sync because of having to replay the oplog. --Scott |
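A back-of-the-envelope sizing sketch following Scott's rule of thumb (the write-volume and duration numbers below are illustrative assumptions, not figures from this ticket):

```
// Rough oplog sizing: the oplog must cover the initial sync duration, plus headroom.
var syncHours       = 24;   // expected initial sync duration
var headroomFactor  = 1.5;  // margin for spikes / unexpected maintenance
var oplogGBPerHour  = 40;   // hypothetical write volume reaching the oplog
var requiredGB      = syncHours * headroomFactor * oplogGBPerHour;
print("oplog should be at least ~" + requiredGB + " GB");   // 1440 GB in this example
```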
| Comment by James Mangold [ 11/Mar/16 ] |
|
Thanks Scott, -James |
| Comment by Scott Hernandez (Inactive) [ 11/Mar/16 ] |
|
Thanks for the info. I believe it answers my last remaining questions, and clarifies what happened and why. In short, the oplog on the nodes is 200 GB, which is just a bit too small for the amount of time/data needed to do a full initial sync. This is because of the requirements of the initial sync process, which needs the oplog on the source of the initial sync to be large enough not to roll over (or, put another way, to hold all the data from the start till the end of the initial sync) during the initial sync process. Here are some of the relevant docs related to this: These feature requests and improvements may be able to help address this kind of failure in the future (possibly for the next release): |
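One way to see how close a 200 GB oplog comes to covering an 18-hour sync is to measure the current window directly on the source. A minimal mongo-shell sketch, assuming access to the local database (the collection and field names are standard; the numbers it prints depend entirely on the deployment):

```
// Compute the oplog window (hours between oldest and newest entries) and its configured size.
var oplog = db.getSiblingDB("local").oplog.rs;
var first = oplog.find().sort({$natural: 1}).limit(1).next().ts;   // oldest entry
var last  = oplog.find().sort({$natural: -1}).limit(1).next().ts;  // newest entry
var windowHours = (last.t - first.t) / 3600;                       // Timestamp.t is seconds since epoch
print("configured size (GB): " + (oplog.stats().maxSize / Math.pow(1024, 3)).toFixed(1));
print("current window (hrs): " + windowHours.toFixed(1));
```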
| Comment by James Mangold [ 11/Mar/16 ] |
|
Hi Scott,
— db.printReplicationInfo()
NODE9 (the failed node):
NODE10:
The system was definitely active during the recovery process, which is what I would expect if I opted to perform an auto resync, as opposed to stopping the entire shard. Stopping the entire shard would not be an acceptable recovery for us, as our capacity would be diminished. Thanks, |
| Comment by Scott Hernandez (Inactive) [ 11/Mar/16 ] |
|
James, this sounds like the cloning of the data and building indexes took quite a while before the oplog could be used to finish the process, which is where the problem occurred. I've got a few questions about what happened and the files you attached.
|
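For reference, the failure mode Scott describes can be stated as a simple check: the initial sync starts from some optime on the source, and by the time cloning and index builds finish, that optime must still be present in the source's oplog. A hedged mongo-shell sketch of that check (the start optime below is a hypothetical placeholder, not a value from this ticket):

```
// Illustrative check, run on the sync source: has the oplog already rolled past
// the optime at which the initial sync started?
var syncStart = new Timestamp(1457222400, 1);   // hypothetical initial-sync start optime
var oldest = db.getSiblingDB("local").oplog.rs
               .find().sort({$natural: 1}).limit(1).next().ts;
if (oldest.t > syncStart.t) {
    print("oplog has rolled past the sync start point; the syncing member is too stale to catch up");
} else {
    print("sync start point is still covered by the source oplog");
}
```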