[SERVER-23045] Auto Sync of a failed node Failed Created: 10/Mar/16  Updated: 11/Mar/16  Resolved: 11/Mar/16

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.2.1
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: James Mangold Assignee: Scott Hernandez (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File diagnostic.data.tar.gz     File mongodb.log.2016-03-08T01-34-34.gz     File mongodb.log.2016-03-08T01-38-02.gz     File mongodb.log.2016-03-09T01-38-04.gz     File mongodb.log.2016-03-10T01-38-04.gz     File mongodb.log.gz    
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

From what I can see:
-load up a replica set with a lot of data
-take one node offline and wipe its disk
-THIS part I am not clear on:
EITHER - wait until the beginning of the oplog is after the time that the node went down
OR - wait until the oplog is certain to wrap while events are being replayed from it.

Participants:

 Description   

Hi, we had a node of a 3-node replica set fail with a bad disk, and we had to stop the node. After 2 days or so we got the node back online. We no longer had the dataDir, so we had to resync automatically (which we have done before). The resync took 18 hours or so. Then the node appears to have started reading from the oplog, and the member became too stale to recover. So basically, we resynced for 18 hours (~500GB) and then it died while reading the oplog, because the oplog contained less than 18 hours of data, it seems.

I have no idea why this would be the case. Maybe someone can shed some light on this. We could never size the oplog to hold everything we would ever need to resync (all the new data that was loaded since a node went down). What if it was down for 2 weeks? This wasn't even a lot of data... What if it took a week to resync 5TB? I'm not sure how this works. I am going to try to attach the logs.



 Comments   
Comment by Scott Hernandez (Inactive) [ 11/Mar/16 ]

Yes, there is work planned to improve this. Today there is too much variability based on throughput and possible fluctuations.

I'm going to close this now since it is neither a bug nor a feature request.

Comment by James Mangold [ 11/Mar/16 ]

Yes, but that is also a moving target, no? Let's say I have 5TB to sync. It took 18 hours to sync 500GB, which tells me I would need an oplog holding 180 hours of data. What if it was more? What if there was a lot of network traffic, or any other variable that caused replication to take longer than 180 hours (which is crazy to begin with)? Couldn't the oplog roll over in that case as well? It seems like there has to be a better, more deterministic way to do this, one that doesn't leave you waiting a week for a node to sync, only to have it fail because it took a week.

Comment by Scott Hernandez (Inactive) [ 11/Mar/16 ]

James,

Maybe it wasn't clear, but the oplog only needs to be large enough to span the duration of the initial sync process, not the time the node was "down/unavailable".

The general rule of thumb is to have an oplog large enough to cover any required/unexpected maintenance or the initial sync process. For example, if initial sync takes 24 hours, then an oplog covering 36-48 hours would leave enough time for spikes and such. Put another way: if you process 1TB of inserts in a day and the initial sync process takes a day, then you need an oplog larger than 1TB.

There are definitely cases where you are running at or near capacity in terms of throughput, in which case you could not complete an initial sync because of the oplog replay it requires.
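
To make that concrete, here is a rough sizing sketch in the shell. The numbers are placeholders taken from the example above, not measurements from your cluster:

// Rough oplog sizing sketch (placeholder numbers, not from this cluster).
// The oplog must cover the whole initial sync, plus headroom for spikes.
var expectedSyncHours = 24;     // estimated initial sync duration
var headroomFactor    = 1.5;    // aim for ~36-48 hours of window for a 24-hour sync
var writeGBPerDay     = 1000;   // observed oplog churn per day (~1TB of inserts)

var requiredWindowHours = expectedSyncHours * headroomFactor;        // 36 hours
var requiredOplogGB     = writeGBPerDay * requiredWindowHours / 24;  // ~1500 GB

print("target oplog window: " + requiredWindowHours + " hours, ~" + requiredOplogGB + " GB");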

--Scott

Comment by James Mangold [ 11/Mar/16 ]

Thanks Scott,
So from what you are saying, the oplog would need to be sized based on an estimate of the initial sync time? How would that ever be possible? It seems like a moving target. What if one member was down for 2 weeks waiting for parts? We could never keep 2 weeks of data in the oplog of an active cluster. Furthermore, if we made the oplog, let's say, 1TB, don't we still stand the chance of the oplog rolling over because the replay took too long? This seems like it could never work reliably. What if we loaded a TB per day and had a replica set member down for 2 days? We could never get that member back unless we stopped the entire shard, which would lower our real-time throughput.

-James

Comment by Scott Hernandez (Inactive) [ 11/Mar/16 ]

Thanks for the info. I believe it answers my last remaining questions and clarifies what happened, and why.

In short, the oplog on each node is 200GB, which is just a bit too small for the amount of time/data needed to do a full initial sync. This is because the initial sync process requires the oplog on the sync source to be large enough not to roll over during the initial sync (put another way, to hold all writes from the start to the end of the initial sync).

Here are some of the relevant docs related to this:
https://docs.mongodb.org/manual/core/replica-set-sync/#initial-sync
https://docs.mongodb.org/manual/core/capped-collections/#automatic-removal-of-oldest-documents

These feature requests and improvements may help address this kind of failure in the future (possibly in the next release): SERVER-7200, SERVER-10751, SERVER-18041
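
As a quick pre-flight check before starting an initial sync, you can compare the current oplog window on the intended sync source against your expected sync duration. This is a rough check only, since the window shrinks as write load grows; the expected duration here is your own estimate:

// Pre-flight check on the intended sync source (rough estimate only).
var info = db.getReplicationInfo();   // oplog stats for this member
var expectedSyncHours = 18;           // your estimate for this data set

print("oplog size (MB):      " + info.logSizeMB);
print("oplog window (hours): " + info.timeDiffHours);

if (info.timeDiffHours < expectedSyncHours * 1.5) {
    print("WARNING: oplog window leaves little or no margin for this initial sync");
}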

Comment by James Mangold [ 11/Mar/16 ]

Hi Scott,
Replies in order:
---Yes, all of the files I sent were from the node that went down and had to resync
---The disk was wiped because a disk in the array failed, so it was a blank machine. We installed MongoDB, immediately stopped it to soft-link the oplog and journal to a different volume, then restarted MongoDB and allowed it to sync fresh from its chosen member.
---rs.conf() from the failed node (We have not changed the config):

MongoDB Enterprise SEC-DEV-SHRD_2:RECOVERING> rs.conf()
{
        "_id" : "SEC-DEV-SHRD_2",
        "version" : 1,
        "protocolVersion" : NumberLong(1),
        "members" : [
                {
                        "_id" : 0,
                        "host" : "[NODE8]:27020",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {
 
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 1,
                        "host" : "[NODE9]:27020",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {
 
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                },
                {
                        "_id" : 2,
                        "host" : "[NODE10]:27020",
                        "arbiterOnly" : false,
                        "buildIndexes" : true,
                        "hidden" : false,
                        "priority" : 1,
                        "tags" : {
 
                        },
                        "slaveDelay" : NumberLong(0),
                        "votes" : 1
                }
        ],
        "settings" : {
                "chainingAllowed" : true,
                "heartbeatIntervalMillis" : 2000,
                "heartbeatTimeoutSecs" : 10,
                "electionTimeoutMillis" : 10000,
                "getLastErrorModes" : {
 
                },
                "getLastErrorDefaults" : {
                        "w" : 1,
                        "wtimeout" : 0
                }
        }
}

---db.printReplicationInfo() from each member (a rough calculation from these numbers follows below):
NODE8:

configured oplog size:   200000MB
log length start to end: 75709secs (21.03hrs)
oplog first event time:  Thu Mar 10 2016 11:21:29 GMT-0500 (EST)
oplog last event time:   Fri Mar 11 2016 08:23:18 GMT-0500 (EST)
now:                     Fri Mar 11 2016 08:23:18 GMT-0500 (EST)

NODE9 (the failed node):

configured oplog size:   200000MB
log length start to end: 0secs (0hrs)
oplog first event time:  Mon Mar 07 2016 20:38:09 GMT-0500 (EST)
oplog last event time:   Mon Mar 07 2016 20:38:09 GMT-0500 (EST)
now:                     Fri Mar 11 2016 08:22:29 GMT-0500 (EST)

NODE10:

configured oplog size:   200000MB
log length start to end: 75719secs (21.03hrs)
oplog first event time:  Thu Mar 10 2016 11:21:43 GMT-0500 (EST)
oplog last event time:   Fri Mar 11 2016 08:23:42 GMT-0500 (EST)
now:                     Fri Mar 11 2016 08:23:42 GMT-0500 (EST)

---The system was definitely active during the recovery process, which is what I would expect when opting for an automatic resync rather than stopping the entire shard. Stopping the entire shard would not be an acceptable recovery procedure for us, as our capacity would be diminished.
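
For reference, a rough back-of-the-envelope from the NODE8 numbers above (this assumes the 200GB oplog is actually full and that the write rate was steady during the sync):

// Back-of-the-envelope from NODE8's figures (assumes a full oplog and steady writes)
var oplogMB    = 200000;                       // configured oplog size
var windowSecs = 75709;                        // log length start to end (~21 hours)
var churnMBps  = oplogMB / windowSecs;         // ~2.6 MB/s of oplog churn

var syncHours  = 18;                           // roughly how long the initial sync took
var neededMB   = churnMBps * syncHours * 3600; // ~171,000 MB written during the sync

print(churnMBps.toFixed(2) + " MB/s churn, ~" + Math.round(neededMB) + " MB needed to cover the sync");

At that rate an 18-hour sync consumes roughly 171,000 MB of oplog, so the 200GB oplog leaves only about 3 hours of margin.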

Thanks,
James

Comment by Scott Hernandez (Inactive) [ 11/Mar/16 ]

James, it sounds like cloning the data and building indexes took quite a while before the oplog could be applied to finish the process, which is where the problem occurred.

I've got a few questions about what happened and the files you attached.

  • Are all the files from the node which you had to resync?
  • How did you resync automatically? What steps were taken?
  • What is your replica set config ("rs.conf()" in the shell)? Has it changed since the incident or did it change before?
  • What does "db.printReplicationInfo()" show for all your members?
  • How active was the system during the time when the initial sync ran? Were there lots of inserts and writes?