[SERVER-22389] Replica set member with data from backup fails to sync Created: 01/Feb/16 Updated: 09/Jun/16 Resolved: 04/Feb/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 3.2.1 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Yoni Douek | Assignee: | Scott Hernandez (Inactive) |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Our replica set contains a primary, a secondary and an arbiter. Data is backed up periodically from the secondary using the recommended method (fsync lock & snapshotting). When we create a new member based on the backup and add it to the set, it will not sync, showing the following endlessly without making any progress:
This happens for very recent backups (a few minutes old). Note that the data dir contains data identical to that of our secondary node, which does sync correctly, even if stopped. It happens with protocol version 0 and 1. A log with verbose level 5 is attached. We can reproduce it easily. Please advise; the only way for us to add a member right now is a full initial sync, which is unrealistic (our DB size is 8TB). |
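For context, the backup procedure referenced above is roughly the following (a minimal sketch in the mongo shell; the snapshot step itself happens outside of MongoDB):

    // Run against the secondary being backed up.
    db.fsyncLock()     // flush pending writes and block new ones
    // ... take the filesystem / EBS snapshot of the dbpath volume here ...
    db.fsyncUnlock()   // release the lock once the snapshot is underway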
| Comments |
| Comment by Ramon Fernandez Marina [ 13/Feb/16 ] | |
|
yonido, yesterday we released a 3.2.2-rc2 release candidate containing a fix for this issue.

Thanks, | |
| Comment by Yoni Douek [ 10/Feb/16 ] | |
|
Wanted to stress how severe this bug is: in pv0, high load = replica out of sync, forever. | |
| Comment by Scott Hernandez (Inactive) [ 04/Feb/16 ] | |
|
We have now committed the timeout change for 3.2.2 and 3.3.2. The old timeout in 3.0 was 30 seconds. | |
| Comment by Yoni Douek [ 03/Feb/16 ] | |
|
What was the previous value? You can consider making it dynamic per query, based on the time distance between the optimes. | |
| Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ] | |
|
That is good to hear. We are thinking of going with 1 minute. | |
| Comment by Yoni Douek [ 03/Feb/16 ] | |
|
Thanks. We tried an election timeout of 60 seconds (= a 30-second query timeout) with no luck, but 120 seconds (= a 60-second query timeout) worked :) It always worked on 3.0 - wanted to double-verify that the previous setting was indeed 30 seconds. | |
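For anyone trying the same workaround, raising the pv1 election timeout can be done with a reconfig along these lines (a sketch; 120000 ms corresponds to the 120-second value that worked above):

    // Run on the primary. Under protocol version 1 the oplog-query timeout
    // discussed in this thread is derived from electionTimeoutMillis / 2.
    cfg = rs.conf()
    cfg.settings = cfg.settings || {}
    cfg.settings.electionTimeoutMillis = 120000   // 120s -> ~60s query timeout
    rs.reconfig(cfg)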
| Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ] | |
|
I've filed a new issue to track the code changes we are going to make to address this:

The timeout for a pv1 find query is electionTimeout/2, so 5 seconds, whereas the timeout for pv0 is 2s. When replication issues the query we use a few special options to speed it up, so your manual query doesn't reflect the performance replication will see. Two special options are used: AwaitData and OplogReplay. | |
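A closer approximation of what replication sends (rather than a plain find) might look like the following in the shell; the timestamp is a hypothetical placeholder, and tailable is added because awaitData requires it:

    // Approximation only - replication builds this query internally.
    var lastFetched = Timestamp(1454457600, 1)   // hypothetical last-applied optime
    db.getSiblingDB("local").getCollection("oplog.rs")
      .find({ ts: { $gte: lastFetched } })
      .addOption(DBQuery.Option.tailable)
      .addOption(DBQuery.Option.awaitData)
      .addOption(DBQuery.Option.oplogReplay)   // lets the server seek within the oplog
      .maxTimeMS(5000)                         // the pv1 timeout discussed above
      .next()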
| Comment by Yoni Douek [ 03/Feb/16 ] | |
|
I've tried to run the query manually:
It has been running for more than 5 minutes now, so I'm not sure that the switch from 30 to 5 seconds is the only thing here... There's also no index on "ts" in the oplog. | |
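For what it's worth, the absence of a ts index can be confirmed on the sync source with something like:

    // The oplog is a capped collection in the local database; listing its
    // indexes shows there is no index on "ts".
    db.getSiblingDB("local").getCollection("oplog.rs").getIndexes()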
| Comment by Yoni Douek [ 03/Feb/16 ] | |
|
How is election timeout related? | |
| Comment by Yoni Douek [ 03/Feb/16 ] | |
|
OK, fair enough, we'll try to create a very tight backup for now. We'd appreciate a fix in the next release. Thanks for the super-fast responsiveness! | |
| Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ] | |
|
We will be looking at solutions. There are two workarounds:
If I discover more as I look at the code I'll post it here. | |
| Comment by Yoni Douek [ 03/Feb/16 ] | |
|
Great. Is there anything we can do with the existing build? Change the maxTimeMS for this query somehow? Not sure we'll be able to get the backup recent enough so that the query returns that fast. | |
| Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ] | |
|
Yep, looks that way. In 3.2 we decreased the timeout from 30 seconds to 5 seconds for oplog queries, which is needed for liveness information and additional features, so that is the underlying issue. Thanks for the help and logs. I will see what we can do about this, but I think at the moment this is limited to MMAP sync sources (where you need to seek into the middle of the oplog); WiredTiger has an internal index, so this won't be a problem there. | |
| Comment by Yoni Douek [ 03/Feb/16 ] | |
|
I think this is the source of the problem:
Looks like it takes 5 seconds to read the oplog on the source node. Maybe this happens when the member is too far behind and has a lot to read (it has a few hours to sync in our case), and therefore happens only with backups that are far behind. This didn't happen in 3.0 though... Diagnostic data from the source node is attached; let me know if you need anything else. Thanks! | |
| Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ] | |
|
Okay, that was somewhat useful and points to the upstream server (the sync source of in.db2m1.roi.test.mydomain.com:27017) as not being able to return any data within the timeout.
Can you post a section of the logs from that server? If the problem is still happening, any section should do, as this is probably happening constantly (every few seconds). Also, can you provide the diagnostic.data directory? Thanks, | |
| Comment by Yoni Douek [ 03/Feb/16 ] | |
|
Attached: a log from the version you provided (thanks!) with log level 2. Thanks, | |
| Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ] | |
|
Yes, some of this is off-topic, and I don't want to distract, but here are some answers: A member will not be chosen to sync from if there are no new writes out there, so it is expected for a node starting up to do that until it sees new writes. Not sure if you have constant writes or not during/after the backup/restore/new-member. Changing the protocolVersion (via a reconfig) will cause a write and, as a consequence, possibly cause new sync sources to be evaluated and chosen. The term will only exist once the primary switches to protocolVersion 1, and each member will only record documents in "replset.election" once it has switched to election protocol 1 and been involved in an election. So it sounds very much like this is both expected and by design, not a bug. Feel free to file more issues if you think you have found a bug, but for now let's try to diagnose your sync source problems. So far everything you have supplied seems to point to some issue in your environment related to that. | |
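If it helps anyone following along, the pv1 election metadata mentioned here can be inspected directly on a member (a sketch):

    // Shows the current protocol version and the last-vote document, which is
    // only written once the member has taken part in a pv1 election.
    rs.conf().protocolVersion
    db.getSiblingDB("local").getCollection("replset.election").findOne()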
| Comment by Yoni Douek [ 03/Feb/16 ] | |
|
Another unrelated question (sorry for being off-topic): we also had a small glitch here that caused a wrong member election. I may open a different bug. From reading the logs it looks like it's related to a wrong replication 'term' number on some of the members. Are you aware that when upgrading an RS that was created on 3.0 and updating its protocolVersion, the collection replset.election is not created and does not exist (and therefore term data doesn't exist), unlike an RS that is initiated on 3.2+, in which it does exist? Is this by design? | |
| Comment by Yoni Douek [ 03/Feb/16 ] | |
|
I'm on it right now, will update soon. I did find another bug though; maybe it's related? 1. Replica set with no protocolVersion (not updated yet), running 3.2.1. This practically means that you can't add nodes running 3.2 without updating the protocolVersion. | |
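For reference, switching an upgraded set to protocol version 1 before adding 3.2 members would be a reconfig along these lines (a sketch, run on the primary):

    // Move the replica set from pv0 to pv1 so that new 3.2 members can be added.
    cfg = rs.conf()
    cfg.protocolVersion = 1
    rs.reconfig(cfg)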
| Comment by Scott Hernandez (Inactive) [ 03/Feb/16 ] | |
|
yonido, were you able to run the version I supplied? If so, what happened and can you provide the logs? | |
| Comment by Scott Hernandez (Inactive) [ 02/Feb/16 ] | |
|
I was not able to reproduce what you are seeing, either part. Can you post the logs from the initial sync member? I have put together a new mongod (64-bit Linux) with extra logging around some areas at logLevel 2 which should provide more information; can you run that once the node gets "stuck"? This build should not be kept in use once we get the extra logging or if the issue goes away, so once we have that info please shut the process down and go back to the real mongod file. The attached filename is mongod-bsync-logging.gz. Also, can you try to add the member to the replica set before you take the snapshot and start the node? This will ensure that the replica set config already has that member in it when the node first comes online, just an idea which might show more info in the logs. | |
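(If the special build is not in use, a comparable verbosity level can also be turned on at runtime; a sketch, run against the admin database:)

    // Raise global log verbosity to level 2 while reproducing, then set it back to 0.
    db.adminCommand({ setParameter: 1, logLevel: 2 })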
| Comment by Yoni Douek [ 02/Feb/16 ] | |
|
Hey, very bad news: this also happens when we add a node that is completely empty and finishes the initial sync. When it starts syncing from that point, it behaves in a similar way. So we're basically stuck. Please help. | |
| Comment by Yoni Douek [ 01/Feb/16 ] | |
|
1. Secondary. 2. No. 3+4. Attached (it was added / removed several times in that timeframe, let me know if you need anything cleaner). 5. Disk-level snapshot, using AWS's EBS snapshots, so everything is copied (I know it's weird...). Our hunch is that you added some identifier which is saved externally? Another hint: in the log entry that is attached, the "optimes" array does not include the primary node, whereas on secondaries that do sync correctly it does. | |
| Comment by Scott Hernandez (Inactive) [ 01/Feb/16 ] | |
|
Understood, I will try to reproduce this on my own. We already have tests which check this functionality (fsync+lock and copying files to add a new node from), so I'll try to change things which might differ on your system. Unless I can easily reproduce this I expect I will have a few more questions soon, but let me start with a few basic ones, if you have a minute:
Thanks in advance, | |
| Comment by Yoni Douek [ 01/Feb/16 ] | |
|
Hey Scott, we don't have a test to share, but we do have a test environment with 3 shards, and this happens on each one every time, so we can reproduce it easily if you need anything from us (logs, running a special version). We add members quite often (every ~2 weeks) to replace existing ones or for maintenance, and have done it more than 50 times on 3.0 with no issues, so this is new. | |
| Comment by Scott Hernandez (Inactive) [ 01/Feb/16 ] | |
|
Hello Yoni, Do you have a test that reproduces this, or do you mean that when you take a backup and add a new node with that data it happens frequently? And if the latter, how often do you add a new member this way? --Scott | |
| Comment by Yoni Douek [ 01/Feb/16 ] | |
|
Also adding our rs.conf(). The new member is the last one. |
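For completeness, the restored member is added with a plain rs.add() once its data directory from the snapshot is in place (the hostname below is a placeholder):

    // Run on the primary after the new member has been started on the restored data.
    rs.add("new-member.example.com:27017")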